I really fear for the internet and what it will become in even just another year, with the rise of AI writing and AI art being used in place of real people. And now OpenAI openly state they need to use copyrighted works for training material.
As reported by The Guardian, the New York Times sued OpenAI and Microsoft over copyright infringement, and OpenAI recently sent a submission to the UK House of Lords Communications and Digital Select Committee in which it said pretty clearly:
Because copyright today covers virtually every sort of human expression – including blog posts, photographs, forum posts, scraps of software code, and government documents – it would be impossible to train today's leading AI models without using copyrighted materials. Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today's citizens.
Worth noting OpenAI put up their own news post "OpenAI and journalism" on January 8th.
Why am I writing about this here? Well, the reasoning is pretty simple. AI writing is (on top of other things) increasing the race to the bottom of content for clicks. Search engines have quickly become a mess to find what you actually want, and it's only going to continue getting far worse thanks to all these SEO (Search Engine Optimisation) bait content farms, with more popping up all the time, and we've already seen some bigger websites trial AI writing. The internet is a mess.
As time goes on, and as more people use AI to pinch content and write entire articles, we're going to hand off profitable writing to only a select few big names who can weather the storm. A lot of smaller-scale websites are just going to die off. Any time you search for something, it will be those big names sprinkled in between the vast AI website farms, all with very similar, robotically plain writing styles.
Many (most?) websites make content for search engines, not for people. The Verge recently did a rather fascinating piece on this showing how websites are designed around Google, and it really is something worth scrolling through and reading.
One thing you can count on: my perfectly imperfect writing, full of terrible grammar, continuing without the use of AI. At least it's natural, right? I write as I speak, for better or worse. By humans, for humans — a tagline I plan to stick with until AI truly takes over and I have to go find a job flipping burgers or something. But then again, there will be robots for that too. I think I need to learn how to fish…
If they don't care about copyright from third parties, they shouldn't care if an employee leaks their training data and/or code.
No AI was used for the generation of this post
"…and I have to go find a job flipping burgers or something."

Come on now, you're a techie! That means you probably have some qualifications, or at least some background in a technical field such as development or systems management.
So no worries! Besides, we don't need you meatbags flipping our burgers in the future, we have AI robots for that which can do a much better and more efficient job of it.
Please don't say that you need a free pass to steal so that progress can happen; that's been proven multiple times in the past to actually mean "progress for the few, suppression for the many". Let's not have another dark age, OK?
On a lighter note, Liam, all those humanisms of yours only prove your authenticity.
"Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today's citizens."

Maybe they should've thought about that before they messed up the copyright system to cover things for a hundred-odd years at a time? Just a thought.
"I think I need to learn how to fish…"

Alas, it's not as easy as it is in Stardew Valley.
GenAI waffle is milquetoast-boring to read: it's usually wrong in some meaningful way and doesn't add any insights (how could it?). I don't want to read paraphrased press releases or transcriptions of marketing videos. I don't have time for that shit.
No, this type of AI would be much better suited to parsing through boring data (like, say, every single medical study ever done) and working as expert systems (like a glorified Google for doctors to check their patients' symptoms against all known knowledge), and things like that.
After all, AIs are just doing on a much larger scale what humans have already been doing for millennia: any creative work is inspired (consciously or not) by many other works that preceded it. If the border between inspiration and plagiarism is so difficult to define, it may be because it does not make much sense to draw one. If AIs can help the world understand that and promote other ways to generate money from creative work, it could be for the better.
"…not provide AI systems that meet the needs of today's citizens."
Thankfully not, if the "needs of today's citizens" means having no skills at all and telling AI to write anything, including a crime novel or a PhD thesis, or draw anything, including a photorealistic compromising image of the girl next door who isn't interested in you and whose life you want to ruin, and have it spat out in seconds.
Last edited by damarrin on 9 January 2024 at 3:32 pm UTC
It would be bad enough if it only meant more trash on the Internet, but you just know this widespread corruption of language is also going to influence people, especially language learners, who will be much more frequently exposed to mistranslations and unnatural expressions and adopt them naively. The effect on English speakers will likely be lesser or slower to manifest (English being the "native" tongue of most AI and being a simple language to begin with), but I weep for just about every other language.
"...it's not hard to write anybody can do it, and it's the pennical of laziness to..."

*Pinnacle ;)
"Working in translation, another worrisome trend I've noticed is content farms not just using AI to write articles, but to translate them, so we're seeing tons of poorly translated content filling up search engine results in content farms with country-specific domains (e.g. "english-sounding-name.fr"). Logically, the AI is going to continue to get trained on junk content that it (or another AI) wrote or translated itself and perpetuate, if not amplify, the errors in it by seeing them as commonly used. A race to the bottom indeed."

Duolingo recently did this: translations are now done by AI, with only a few people checking them over. Expect more of this sort of thing over time.
And for what? LLMs are just complex guessers. Sure, they guess with context, but they're still just guessing based on all the billions of documents they consumed during their (extremely intensive) training. You can't use them for research because they make shit up... because they're just guessing. It's a mess.
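That "guessing" is easy to demonstrate at toy scale. Here's a minimal sketch (my own illustration, nothing remotely like a real LLM's architecture): a bigram counter that "predicts" the next word purely from frequency statistics over its training text — which is, conceptually, what's happening under the hood, just at a vastly larger scale and with far more context.

```python
from collections import Counter, defaultdict

# Toy "training data": the model knows nothing except these words.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count how often each word follows each other word.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def guess_next(word):
    """Return the most frequent follower of `word` in the training text,
    or None if the word was never followed by anything."""
    counts = following.get(word)
    return counts.most_common(1)[0][0] if counts else None

print(guess_next("the"))  # prints "cat" — it follows "the" most often here
```

Swap in a different corpus and the "knowledge" changes completely; there's no understanding anywhere, only statistics over whatever text went in.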
I'm hoping that 2024 might see some of this novelty wear off as consumers realise how bland and uninspiring AI generated content generally is, but I suspect that real, lasting damage will have been done by then.
It has a use in enterprise settings, properly controlled, with targeted outcomes. As it stands? Total shit show.
"Well, AI may be bad right now, but I'll eat my hat if it doesn't become much less bad very quickly."

I've noticed that in this current degenerate age, most people don't even have hats. How do I know you will really eat it? You have no credibility, sir!
More seriously, I'm not sure it will improve that much that fast. This seems like a new technology because of the way it burst on the scene, but the research into this basic schtick has been going on for decades, staying quiet until they got the whole thing looking promising enough that someone was willing to sink in the cash to scale it up to really big data sets. And with these things, the size of the data set is key. So while it looks new, it may actually already be a fairly mature technology, not subject to the kind of rapid improvement you might expect from something genuinely new.
Last edited by Purple Library Guy on 9 January 2024 at 5:38 pm UTC
"P.S. According to LanguageTool, three commas were needed in the article."

Ehhh, IMO commas are kind of a "soft" punctuation mark; there are stylistic differences in how people use them. There are many situations where it's not really technically "wrong" either to use one or not, and others where it is wrong by some technical standard to do it a particular way, but doing it that "wrong" way still works given the flow of the sentence and the way people talk. Periods, for instance, are a lot clearer: if you're at the end of a sentence you should be using one, period. Well, unless you have a reason to use a question mark or exclamation point instead. But commas are comparatively mushy, and I don't trust computerized guidance about how to use them.
"Working in translation, another worrisome trend I've noticed is content farms not just using AI to write articles, but to translate them"
Given the choice between a corrupt human translation and an AI translation, which one will you choose?
Canonical recently had to take down the Ubuntu 23.10 release because a corrupt translator vandalized the Ukrainian translation. Although it's perfectly understandable why they would do so, it is no less inappropriate and disrespectful to the authors of the original text.
The anime industry has recently come under fire for that sort of localization vandalism, too. Apparently it's gotten so bad that people will celebrate when a human translator is fired and replaced with an AI.
If you are going to let AI systems steal copyrighted content, then it should also be OK for the ReactOS and Wine teams to use leaked Windows source code to build ReactOS and Wine. If they did that, Microsoft would DMCA strike the projects faster than you can say "Microsoft".
The problems with GitHub Copilot are:
1. AI models getting trained on source code that is source-available but not open source, for example the Windows Research Kernel.
2. Having a project on GitHub with no option to stop AI models from training on that project. And who gets the final decision? The project lead, or is it like trying to change the license of a project, where you need most of the contributors to agree to the change?
"P.S. According to LanguageTool, three commas were needed in the article."
All the places where LanguageTool said a comma was needed, I wouldn't care either way. However, I personally err on the side of using the commas, because they save lives after all.
This joke?
A comma is the difference between:
- Let's eat, Grandma!
and
- Let's eat Grandma!
LLMs aren't going around storing articles, code, pictures, art, etc. in their models. They're simply learning from those... with all the benefits AND drawbacks that come with that. It means bad data is also getting into many of the LLMs. I've been using Copilot to write code for a while now. It is absolutely useful, but it also gets things wrong on a regular basis.