OpenAI say it would be 'impossible' to train AI without pinching copyrighted works

By Liam Dawe - 9 Jan 2024 at 12:47 pm UTC
Last updated: 9 Jan 2024 at 2:53 pm UTC

I really fear for the internet and what it will become in even just another year, with the rise of AI writing and AI art being used in place of real people. And now OpenAI openly state they need to use copyrighted works for training material.

As reported by The Guardian, the New York Times sued OpenAI and Microsoft over copyright infringement and just recently OpenAI sent a submission to the UK House of Lords Communications and Digital Select Committee where OpenAI said pretty clearly:

Because copyright today covers virtually every sort of human expression– including blog posts, photographs, forum posts, scraps of software code, and government documents–it would be impossible to train today’s leading AI models without using copyrighted materials. Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today’s citizens.

Worth noting OpenAI put up their own news post "OpenAI and journalism" on January 8th.

Why am I writing about this here? Well, the reasoning is pretty simple. AI writing is (on top of other things) increasing the race to the bottom of content for clicks. Search engines have quickly become a mess to find what you actually want, and it's only going to continue getting far worse thanks to all these SEO (Search Engine Optimisation) bait content farms, with more popping up all the time, and we've already seen some bigger websites trial AI writing. The internet is a mess.

As time goes on, and as more people use AI to pinch content and write entire articles, we're going to hand off profitable writing to a select few big names only who can weather the storm and handle it. A lot of smaller scale websites are just going to die off. Any time you search for something, it will be those big names sprinkled in between the vast AI website farms all with very similar robotic plain writing styles.

Many (most?) websites make content for search engines, not for people. The Verge recently did a rather fascinating piece on this showing how websites are designed around Google, and it really is something worth scrolling through and reading.

One thing you can count on: my perfectly imperfect writing full of terrible grammar continuing without the use of AI. At least it's natural right? I write as I speak, for better or worse. By humans, for humans — a tagline I plan to stick with until AI truly takes over and I have to go find a job flipping burgers or something. But then again, there will be robots for that too. I think I need to learn how to fish…

Article taken from GamingOnLinux.com.

Tags: Editorial, Misc

26 Likes

About the author - Liam Dawe

I am the owner of GamingOnLinux. After discovering Linux back in the days of Mandrake in 2003, I constantly checked on the progress of Linux until Ubuntu appeared on the scene and it helped me to really love it. You can reach me easily by emailing GamingOnLinux directly. You can also follow my personal adventures on Bluesky.
See more from me

Some you may have missed, popular articles from the last month:

DOOM + DOOM II adds multiplayer mod support but it's broken on Steam Deck / Linux

Get ready for a smashing time as Wreckfest 2 comes to Early Access in March

Spooky drone-operating derelict spaceship explorer Duskers is getting a sequel

Valve use a little code from Godot for Half-Life 2, Counter-Strike: Source, Day of Defeat: Source, Team Fortress 2

The comments on this article are closed.

All posts need to follow our rules. For users logged in: please hit the Report Flag icon on any post that breaks the rules or contains illegal / harmful content. Guest readers can email us for any issues.

64 comments

Page: 1/4 »

elmapul 9 Jan 2024

Link

i can understand using it for research purposes, but as an commercial product?
if they dont care about copyright from thirdy parties they shouldnt care if an employee leak their training data and/or code.

7 Likes, Who?

doragasu 9 Jan 2024

Link

I like AI technology, I think that well used it can generate great value. But we have to stop the rampaging copyright and license infringement AI companies are perpetrating. A product that needs to systematically break the law to work is not acceptable by any means.

No AI was used for the generation of this post

0 Likes

BlackBloodRum 9 Jan 2024

Link

Supporter Plus

and I have to go find a job flipping burgers or something.

Come on now, you're a techie! That means you probably have some qualifications or at least some background in some technical field such as development or systems management or other things like that.

So no worries! Besides, we don't need you meatbags flipping our burgers in the future, we have AI robots for that which can do a much better and more efficient job of it.

2 Likes, Who?

grigi 9 Jan 2024

Link

Supporter Plus

Agree with the sentiment here. If you can't do business without stealing, then you should be described as organised crime and get eradicated for the sake of progress.
Please don't say that you need a free pass to steal so that progress can happen, that's been proven multiple times in the past to actually mean "progress for the few, suppression for the many". Let's not have another dark ages again? OK?

On a lighter note, Liam, all those humanisms of yours only prove your authenticity

14 Likes, Who?

Pengling 9 Jan 2024

Link

Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today’s citizens.

Maybe they should've thought about that before they messed up the copyright system to cover things for a hundred-odd years at a time? Just a thought.

I think I need to learn how to fish…

Alas, it's not as easy as it is in Stardew Valley.

11 Likes, Who?

neffo 9 Jan 2024

Link

The personal touch is why I read here, and follow you on the socials.

genAI waffle is milquetoast boring to read, it's usually wrong in some meaningful way and doesn't add any insights (how could it?). I don't want to read paraphrased press releases or transcriptions of marketing videos. I don't have time for that shit.

7 Likes, Who?

F.Ultra 9 Jan 2024

Link

Supporter

I think that in general the main issue is the use of AI in creative markets (aka content creation) because that is the field where it have to steal from others in every thing it creates (since it isn't a real AI, it is just a mathematical model that predicts outcome based on the input). And while this might be where all the brouhaha in the public is at the moment it should really be outlawed as the copyright infringement as it truly is.

No this type of AI should be much better suited to parse through boring data (like say every single medical study every done) and work as expert systems (like a glorified google for doctors to try their patients symptoms against all known knowledge) and things like this.

6 Likes, Who?

junibegood 9 Jan 2024

Link

Although I share your concerns about what the web might when AIs are (even more) all over the place, I see something interesting there as well : AI has the potential to render the whole idea of intellectual property obsolete. And if it's OK for AIs to disregard it, why would it be any different for humans ?

After all, AIs are just doing on a way much larger scale what humans have already been doing for millennia : any creative work is inspired (consciously or not) by many other works that preceded it. If the border between inspiration and plagiarism is so difficult to define, it may be because it does not make much sense to create one. If AIs can help the world understand that and promote other ways to generate money from creative work, it could be for the better.

2 Likes, Who?

damarrin 9 Jan 2024

Link

Supporter Plus

not provide AI systems that meet the needs of today’s citizens.

Thankfully not, if the "needs of today's citizens" is having no skills at all and telling AI to write anything, including a crime novel or a PhD thesis, or draw anything including a photorealistic compromising image of the girl next door who isn't interested in you and whose life you want to ruin and have it spat out in seconds.

Last edited by damarrin on 9 Jan 2024 at 3:32 pm UTC

3 Likes, Who?

damarrin 9 Jan 2024

Link

Supporter Plus

Well, AI may be bad right now, but I'll eat my hat if it doesn't become much less bad very quickly.

1 Likes, Who?

Purple Library Guy 9 Jan 2024

Link

Personally, I don't care about copyright per se. Current copyright laws suck. But what's at issue here really is that the current ChatGPT-type "AI" thingies (which are not AI) are being used mostly by outfits like Google, who are creating this gateway thing where their "AI" restates the internet to you so you don't have to go to any actual websites, and all the ad revenue stays with Google. Left unchecked, this will strangle the internet and all the creators on it. Copyright is maybe the only plausible weapon right now to block this, so fine I'll back copyright this one time.

7 Likes, Who?

Salvatos 9 Jan 2024

Link

Working in translation, another worrisome trend I’ve noticed is content farms not just using AI to write articles, but to translate them, so we’re seeing tons of poorly translated content filling up search engine results in content farms with country-specific domains (e.g. "english-sounding-name.fr"). Logically, the AI is going to continue to get trained on junk content that it (or another AI) wrote or translated itself and perpetuate, if not amplify, the errors in it by seeing them as commonly used. A race to the bottom indeed.

It would be bad enough if it only meant more trash on the Internet, but you just know this widespread corruption of language is also going to influence people, especially language learners, who will be much more frequently exposed to mistranslations and unnatural expressions and adopt them naively. The effect on English speakers will likely be lesser or slower to manifest (English being the "native" tongue of most AI and being a simple language to begin with), but I weep for just about every other language.

...it's not hard to write anybody can do it, and it's the pennical of laziness to...

*Pinnacle ;)

5 Likes, Who?

Liam Dawe 9 Jan 2024

Link

Admin

Working in translation, another worrisome trend I’ve noticed is content farms not just using AI to write articles, but to translate them, so we’re seeing tons of poorly translated content filling up search engine results in content farms with country-specific domains (e.g. "english-sounding-name.fr"). Logically, the AI is going to continue to get trained on junk content that it (or another AI) wrote or translated itself and perpetuate, if not amplify, the errors in it by seeing them as commonly used. A race to the bottom indeed.

Duolingo recently did this. Translations now by AI, with only a few checking them over. Expect more of this sort of thing over time.

3 Likes, Who?

scaine 9 Jan 2024

Link

Contributing Editor
Mega Supporter

I'd love to know where the money is being made with this shit. AI is not cheap to run, and while it's not proof-of-work coin-mining bad, it's still pretty bad for the environment overall, given that all the compute is running on hot-ass tensor cores guzzling electricity and cooling before it melts. Microsoft stuffed over $10B (yep, billion) into OpenAI, with another billion coming from multiple rounds of fund-raising, with OpenAI also apparently wooing the middle-east for another $8B-$12B. Meanwhile, Meta is pushing Llama2, Google is pushing Bard and Gemini, while Amazon, Google and others are all out (also to the tune of around $6B) on Anthropic.

And for what? LLMs are just complex guessers. Sure, they guess with context, but they're still just guessing based on all the billions of documents they consumed during their (extremely intensive) training. You can't use them for research because they make shit up... because they're just guessing. It's a mess.

I'm hoping that 2024 might see some of this novelty wear off as consumers realise how bland and uninspiring AI generated content generally is, but I suspect that real, lasting damage will have been done by then.

It has a use in enterprise settings, properly controlled, with targeted outcomes. As it stands? Total shit show.

7 Likes, Who?

Purple Library Guy 9 Jan 2024

Link

Well, AI may be bad right now, but I'll eat my hat if it doesn't become much less bad very quickly.

I've noticed that in this current degenerate age, most people don't even have hats. How do I know you will really eat it? You have no credibility, sir!

More seriously, I'm not sure it will improve that much that fast. This seems like a new technology because of the way it burst on the scene, but the research into this basic schtick has been going on for decades, staying quiet until they got the whole thing looking promising enough that someone was willing to sink in the cash to scale it up to really big data sets. And with these things, the size of the data set is key. So while it looks new, it may actually already be a fairly mature technology, not subject to the kind of rapid improvement you might expect from something genuinely new.

Last edited by Purple Library Guy on 9 Jan 2024 at 5:38 pm UTC

5 Likes, Who?

Purple Library Guy 9 Jan 2024

Link

P.S. According to LanguageTool, three commas were needed in the article.

Ehhh, IMO commas are kind of a "soft" punctuation mark--there are stylistic differences in how people use them. There are many situations where it's not really technically "wrong" either to use one or not to use one, and others where it is wrong by some technical standards to do it a particular way, but doing it that "wrong" way still works given the flow of the sentence and the way people talk. Periods, for instance, are a lot clearer--if you're at the end of a sentence you should be using one, period. Well, unless you have a reason to use a question mark or exclamation point instead. But commas are comparatively mushy, and I don't trust computerized guidance about how to use them.

4 Likes, Who?

Talon1024 9 Jan 2024

Link

Supporter Plus

Working in translation, another worrisome trend I’ve noticed is content farms not just using AI to write articles, but to translate them

Given the choice between a corrupt human translation and an AI translation, which one will you choose?

Canonical recently had to [take down the Ubuntu 23.10 release](https://www.techradar.com/pro/ubuntu-2310-launch-hit-by-malicious-translation-changes) because a corrupt translator vandalized the Ukrainian translation. Although it's perfectly understandable why they would do so, it is no less inappropriate and disrespectful to the authors of the original text.

The Anime industry has recently come under fire for that sort of localization vandalism, too. Apparently it's gotten so bad, that people will celebrate when a human translator is fired from the Anime industry, and replaced with an AI.

0 Likes

MadWolf 9 Jan 2024

Link

hi
if you are going to let AI systems steal copyrighted content then it is also OK for the ReactOS and wine teams to use leaked windows source code to build ReactOS and wine if they did that Microsoft will DMCA strike the projects faster that you can say Microsoft

the problems with GitHub copilot is 1. AI models getting trained on source code that is source available but not open source for example the windows research kernel

2. having a project on GitHub and not having the option not to let the AI models train on there project but who gets the final decision the project lead or is it like trying to change the license of a project where you need most of the contributors to agree to the license change

0 Likes

scaine 9 Jan 2024

Link

Contributing Editor
Mega Supporter

P.S. According to LanguageTool, three commas were needed in the article.
Ehhh, IMO commas are kind of a "soft" punctuation mark--there are stylistic differences in how people use them. There are many situations where it's not really technically "wrong" either to use one or not to use one, and others where it is wrong by some technical standards to do it a particular way, but doing it that "wrong" way still works given the flow of the sentence and the way people talk. Periods, for instance, are a lot clearer--if you're at the end of a sentence you should be using one, period. Well, unless you have a reason to use a question mark or exclamation point instead. But commas are comparatively mushy, and I don't trust computerized guidance about how to use them.

All the places where LanguageTool said a comma was needed, I wouldn't care either way. However, I personally err on the side of using the commas, because they save lives after all.

This joke?

A comma is the difference between:
- Let's eat, Grandma!
and
- Let's eat Grandma!

7 Likes, Who?

EagleDelta 9 Jan 2024

Link

I get really annoyed by this idea that LLMs are "stealing" data. It's literally the automation of what people manually do..... what we have always done in tech. My job is built around automating monotonous tasks to improve stability and reliability.

LLMs aren't going around storing articles, code, pictures, art, etc in its model. It is simply learning from those.... and all the benefits AND drawbacks that come with that. That means bad data is also getting into many of the LLMs too. I've been using CoPilot to write code for a while now. It is absolutely useful, but it also gets things wrong on a regular basis too.

1 Likes, Who?

1 /4 »

While you're here, please consider supporting GamingOnLinux on:

Reward Tiers: Patreon. Plain Donations:

PayPal.

This ensures all of our main content remains totally free for everyone! Patreon supporters can also remove all adverts and sponsors! Supporting us helps bring good, fresh content. Without your continued support, we simply could not continue!

You can find even more ways to support us on this dedicated page any time. If you already are, thank you!