OpenAI say it would be 'impossible' to train AI without pinching copyrighted works

By Liam Dawe - 9 Jan 2024 at 12:47 pm UTC
Last updated: 9 Jan 2024 at 2:53 pm UTC

I really fear for the internet and what it will become in even just another year, with the rise of AI writing and AI art being used in place of real people. And now OpenAI openly state they need to use copyrighted works for training material.

As reported by The Guardian, the New York Times sued OpenAI and Microsoft over copyright infringement, and just recently OpenAI sent a submission to the UK House of Lords Communications and Digital Select Committee in which it said pretty clearly:

Because copyright today covers virtually every sort of human expression– including blog posts, photographs, forum posts, scraps of software code, and government documents–it would be impossible to train today’s leading AI models without using copyrighted materials. Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today’s citizens.

Worth noting OpenAI put up their own news post "OpenAI and journalism" on January 8th.

Why am I writing about this here? Well, the reasoning is pretty simple. AI writing is (on top of other things) increasing the race to the bottom of content for clicks. Search engines have quickly become a mess to find what you actually want, and it's only going to continue getting far worse thanks to all these SEO (Search Engine Optimisation) bait content farms, with more popping up all the time, and we've already seen some bigger websites trial AI writing. The internet is a mess.

As time goes on, and as more people use AI to pinch content and write entire articles, we're going to hand off profitable writing to only a select few big names who can weather the storm. A lot of smaller scale websites are just going to die off. Any time you search for something, it will be those big names sprinkled in between the vast AI website farms, all with very similar robotic plain writing styles.

Many (most?) websites make content for search engines, not for people. The Verge recently did a rather fascinating piece on this showing how websites are designed around Google, and it really is something worth scrolling through and reading.

One thing you can count on: my perfectly imperfect writing full of terrible grammar continuing without the use of AI. At least it's natural right? I write as I speak, for better or worse. By humans, for humans — a tagline I plan to stick with until AI truly takes over and I have to go find a job flipping burgers or something. But then again, there will be robots for that too. I think I need to learn how to fish…

Article taken from GamingOnLinux.com.

Tags: Editorial, Misc

26 Likes

About the author - Liam Dawe

I am the owner of GamingOnLinux. After discovering Linux back in the days of Mandrake in 2003, I constantly checked on the progress of Linux until Ubuntu appeared on the scene and it helped me to really love it. You can reach me easily by emailing GamingOnLinux directly. You can also follow my personal adventures on Bluesky.

64 comments

pleasereadthemanual 9 Jan 2024

I agree with the sentiment that our public domain is not as valuable as it should be. As ever, OpenAI representatives write with the assumption that they are entitled to do whatever they want, regardless of the laws. Why do they feel the need to phrase it like that?

LLMs aren't going around storing articles, code, pictures, art, etc in their models. They are simply learning from those.... and all the benefits AND drawbacks that come with that.

Sure, but that doesn't mean OpenAI employees are now allowed to download millions of copyrighted works that have been distributed on trackers/DDL sites without permission from the copyright holder. If ChatGPT were only using Common Crawl, that's one thing, but we know they're not.

Supposedly ChatGPT's training content is carefully curated, FWIW.

Last edited by pleasereadthemanual on 9 Jan 2024 at 11:03 pm UTC

2 Likes, Who?

Kithop 9 Jan 2024

Supporter Plus

I'd love to know where the money is being made with this shit.

Venture capitalists pouring money into it in the hopes that there'll be a bump in all this 'interest' in LLMs, so they can dump it when it peaks. It's just another dot-com bubble, or 2008 bubble all over again.

I wouldn't be surprised to learn all these AI companies are just burning cash in the hopes that they're the ones 'on top' when it all comes crashing down. As to how to 'monetize' it and actually make money off of all that wasted electricity and questionable results? That's the next guy's problem.

3 Likes, Who?

Purple Library Guy 10 Jan 2024

P.S. According to LanguageTool, three commas were needed in the article.
Ehhh, IMO commas are kind of a "soft" punctuation mark--there are stylistic differences in how people use them. There are many situations where it's not really technically "wrong" either to use one or not to use one, and others where it is wrong by some technical standards to do it a particular way, but doing it that "wrong" way still works given the flow of the sentence and the way people talk. Periods, for instance, are a lot clearer--if you're at the end of a sentence you should be using one, period. Well, unless you have a reason to use a question mark or exclamation point instead. But commas are comparatively mushy, and I don't trust computerized guidance about how to use them.

All the places where LanguageTool said a comma was needed, I wouldn't care either way. However, I personally err on the side of using the commas, because they save lives after all.

But they turn pandas homicidal!
("Eats shoots and leaves" --> "Eats, shoots and leaves")

2 Likes, Who?

Purple Library Guy 10 Jan 2024

I get really annoyed by this idea that LLMs are "stealing" data. It's literally the automation of what people manually do.....

That is not as persuasive a statement as you think it is. I want to keep on doing some things manually, thanks very much. And I very much hope my wife agrees.

But in any case, that's not all it is. The fact is that these AIs essentially end up restating things that actual people said . . . which is fine in and of itself. But they are being used to redistribute revenue from the people who initially said the things, to the people who made the AI programs, by using the things the people said as input. That is not benign--and when the people saying the things go out of business and the AIs are reduced to restating each other's statements plus the one major source of statements on the internet that needs no revenue--propaganda--the results ain't gonna be pretty.

Last edited by Purple Library Guy on 10 Jan 2024 at 1:10 am UTC

9 Likes, Who?

Nod 10 Jan 2024

[Here](https://www.niemanlab.org/2023/12/the-robots-will-make-us-more-human/) is a counter opinion that might give you a bit of hope.

But there will be silver linings to The Great Robot Spam Flood of 2024. It will drive us into healthier online communities. It will spotlight and boost the value of authored creativity. And it may help give birth to a new generation of independent media.

Robots will make the internet more human.

Essentially he argues that AI content will turbo charge the already dire enshittification of content on the internet such that the experience is so bad that it drives people towards sites just like this one. Ones that prioritize content "by humans, for humans".

2 Likes, Who?

junibegood 10 Jan 2024

[Here](https://www.niemanlab.org/2023/12/the-robots-will-make-us-more-human/) is a counter opinion that might give you a bit of hope.

But there will be silver linings to The Great Robot Spam Flood of 2024. It will drive us into healthier online communities. It will spotlight and boost the value of authored creativity. And it may help give birth to a new generation of independent media.

Robots will make the internet more human.

Essentially he argues that AI content will turbo charge the already dire enshittification of content on the internet such that the experience is so bad that it drives people towards sites just like this one. Ones that prioritize content "by humans, for humans".

I wish it were true but I don't believe it.

When both smartphones and social networks appeared, the internet was suddenly flooded with pictures (and later videos) shot with cheap cameras by people who had almost never taken a picture before. That was human work, sure, but I think it's comparable to the rise of AI because we saw a brutal increase in quantity and decrease in quality. Did that raise the interest in quality pictures by photographers? Maybe for a very small fraction of humanity, yes, but the rest of us take, post and watch even more crap pictures and videos than we did 15 years ago...

Last edited by junibegood on 10 Jan 2024 at 9:15 am UTC

1 Likes, Who?

LoudTechie 10 Jan 2024

I agree with the sentiment that our public domain is not as valuable as it should be. As ever, OpenAI representatives write with the assumption that they are entitled to do whatever they want, regardless of the laws. Why do they feel the need to phrase it like that?

LLMs aren't going around storing articles, code, pictures, art, etc in their models. They are simply learning from those.... and all the benefits AND drawbacks that come with that.
Sure, but that doesn't mean OpenAI employees are now allowed to download millions of copyrighted works that have been distributed on trackers/DDL sites without permission from the copyright holder. If ChatGPT were only using Common Crawl, that's one thing, but we know they're not.

Supposedly ChatGPT's training content is carefully curated, FWIW.

A. Totally agree. We have to play by the rules or run. They have to play by the rules or run.
B. I would actually go further than that and call them unauthorized hosts of either (at the copyright holders' choice) copyrighted content or even a derivative work of all the copyrighted content in their training set.
i. Storing unauthorized copyrighted material is illegal, independent of whether or not you distribute it. This is how pirates were hunted at first, until it proved too inefficient.
Also, copyright is format-independent, and it has been shown multiple times that training data can be partly or even fully recovered from LLMs. The same has been successfully argued of JPG, PNG and other data formats; it's just easier there.
ii. It's a derivative work, because it has been made with the copyrighted data. It would have been different without it, and it has been made to mimic properties of the copyrighted data. (The drawback of this argument is that it is the same as the arguments against fan fiction, but those have held up in court, and most fan fiction organizations tend to accept that they exist by the grace of their often pretty graceful authors.)
C. This is actually the main difference between the development methods of AI with actually available data (often FOSS, not always) and proprietary AI like Bard and OpenAI. Source-available AI aggressively curates its data, because that gives a great training speed advantage and requires less data. Proprietary AI tends to use lots of training layers with lots of parameters, due to the low development cost.

1 Likes, Who?

Arehandoro 10 Jan 2024

Supporter

Because copyright today covers virtually every sort of human expression– including blog posts, photographs, forum posts, scraps of software code, and government documents–it would be impossible to train today’s leading AI models without using copyrighted materials. Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today’s citizens.

Yet their own platform, LLM, code, etc is copyrighted and not released under an open-source licence. I could potentially believe their shit if AI was to benefit everyone, not just them and/or a few companies.

Because copyright today covers virtually every sort of human expression– including blog posts, photographs, forum posts, scraps of software code, and government documents–it would be impossible to train today’s leading AI models without using copyrighted materials. Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today’s citizens.

That's like saying that I cannot profit massively without destroying the environment and exploiting employees. Oh, wait...

Because copyright today covers virtually every sort of human expression– including blog posts, photographs, forum posts, scraps of software code, and government documents–it would be impossible to train today’s leading AI models without using copyrighted materials. Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today’s citizens.

Today's citizens' needs are not a fascinating AI built on top of other people's works. Citizens today need job stability, affordable housing, free, good-quality public education and healthcare, and the list could be virtually endless.

3 Likes, Who?

Lachu 10 Jan 2024

Does Microsoft need to sell copyrighted GNU/GPL code to its customers as if it were public domain? I think so. I must remind you that some organization decided to take Microsoft to court because MS is stealing Free Software. Looking at the history of this company, it has always stolen. MS stole the GUI from Xerox, stole network protocols (Kerberos), stole many solutions from the Linux desktop, and now sells a way to steal GNU/GPL code without needing to break the law.

1 Likes, Who?

JustinWood 10 Jan 2024

Bold move to say the quiet part loudly. Stupid move too, but then again when has that ever stopped this particular brand of pond scum?

1 Likes, Who?

TheRiddick 11 Jan 2024

" I have to go find a job flipping burgers or something. "

Sorry that job has already been taken by AI. What we do have is a job where rich folk need someone to live in their toilet to wipe their asses! Requires a degree!

Last edited by TheRiddick on 11 Jan 2024 at 4:36 am UTC

2 Likes, Who?

whatever 11 Jan 2024

genAI is like an averaging filter, the internet has been gooified by it, a gray goo of mediocrity, everything is bland, a blanket of blandness is now covering everything, in all domains of artistry.
well, not really everything, GamingOnLinux is a corner of interesting stuff made by humans for humans, and these little corners must be preserved for the sake of humanity.

2 Likes, Who?

14 13 Jan 2024

Supporter Plus

I said it before, I think companies' models are going to become the sweet sauce. If OpenAI loses this copyright case, they will need to start paying for copyrighted content to be ingested into their model. It will be a content subscription just like streaming media companies.

I think there is an argument that reading copyrighted material is the same as a human doing so and then writing their own creative work; however, it doesn't stand up against companies' acceptable use policies, which often deny or limit scraping by bots. This is exactly the same as bot scraping, where the difference from typical usage is that a machine is doing it, along with the volume (resource expense).

1 Likes, Who?

scaine 14 Jan 2024

Contributing Editor
Mega Supporter

I think there is an argument that reading copyrighted material is the same as a human doing so and then writing their own creative work

When people do this, they pay for the privilege, or access libraries and can only check out books and copyrighted materials for private use. OpenAI and others aren't doing that; they're just consuming all the content on the internet, even pirated material and content behind paywalls, and using it to train their model.

Of course, proving that will be the court battle.

1 Likes, Who?

LoudTechie 14 Jan 2024

I think there is an argument that reading copyrighted material is the same as a human doing so and then writing their own creative work

Look up the legal standing of fan fiction. Then repeat that statement.
Using copyrighted "aspects" is enough to be considered a copyright violation.

0 Likes

pleasereadthemanual 14 Jan 2024

I think there is an argument that reading copyrighted material is the same as a human doing so and then writing their own creative work

Look up the legal standing of fan fiction. Then repeat that statement.
Using copyrighted "aspects" is enough to be considered a copyright violation.

I suggest looking up [Marion Zimmer Bradley](https://en.wikipedia.org/wiki/Marion_Zimmer_Bradley#Literary_career).

For many years, Bradley actively encouraged Darkover fan fiction. She encouraged submissions from unpublished authors and reprinted some of it in commercial Darkover anthologies. This ended after a dispute with a fan over an unpublished Darkover novel of Bradley's that had similarities to one of the fan's stories. As a result, the novel remained unpublished and Bradley demanded the cessation of all Darkover fan fiction

The fan threatened to take Marion Zimmer Bradley to court for infringing on the fan's copyright. The fan holds the copyright to their own prose. The fan clearly does not hold the copyright to the characters. But should the author of the original work use prose from a fan work...well, things get dicey.

You'd also expect to face some legal trouble if you ripped some fan subs and tried to pass them off as your own translation (which has been done before).

Of note is the [Organization for Transformative Works](https://www.transformativeworks.org/faq/), which works to protect fan works and has this to say:

Copyright is intended to protect the creator’s right to profit from her work for a period of time to encourage creative endeavor and the widespread sharing of knowledge. But this does not preclude the right of others to respond to the original work, either with critical commentary, parody, or, we believe, transformative works.

In the United States, copyright is limited by the fair use doctrine. The legal case of Campbell v. Acuff-Rose held that transformative uses receive special consideration in fair use analysis. For those interested in reading in-depth legal analysis, more information can be found on the Fanlore Legal Analysis page.

And:

While case law in this area is limited, we believe that current copyright law already supports our understanding of fanfiction as fair use.

We seek to broaden knowledge of fan creators’ rights and reduce the confusion and uncertainty on both fan and pro creators’ sides about fair use as it applies to fanworks. One of our models is the documentary filmmakers’ statement of best practices in fair use, which has helped clarify the role of fair use in documentary filmmaking.

It's certainly not as cut and dry as you might think.

0 Likes

BlackBloodRum 15 Jan 2024

Supporter Plus

One thing that occurred to me, they are claiming this, according to the article.

it would be impossible to train today’s leading AI models without using copyrighted materials. Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today’s citizens.

Okay. So then, in their terms of service, under "Using Our Services", source:
https://openai.com/policies/terms-of-use

It states, and I quote:

What You Cannot Do. You may not use our Services for any illegal, harmful, or abusive activity. For example, you may not:

Use our Services in a way that infringes, misappropriates or violates anyone’s rights.

Modify, copy, lease, sell or distribute any of our Services.

Attempt to or assist anyone to reverse engineer, decompile or discover the source code or underlying components of our Services, including our models, algorithms, or systems (except to the extent this restriction is prohibited by applicable law).

Automatically or programmatically extract data or Output (defined below).

Represent that Output was human-generated when it was not.

Interfere with or disrupt our Services, including circumvent any rate limits or restrictions or bypass any protective measures or safety mitigations we put on our Services.

Use Output to develop models that compete with OpenAI.

Don't you think, it's a little amusing they are claiming that you need to "copy" others to create "AI". While at the same time, trying to strictly forbid anyone else from doing the same?

4 Likes, Who?

Purple Library Guy 15 Jan 2024

One thing that occurred to me, they are claiming this, according to the article.

it would be impossible to train today’s leading AI models without using copyrighted materials. Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today’s citizens.

Okay. So then, in their terms of service, under "Using Our Services", source:
https://openai.com/policies/terms-of-use

It states, and I quote:

What You Cannot Do. You may not use our Services for any illegal, harmful, or abusive activity. For example, you may not:

Use our Services in a way that infringes, misappropriates or violates anyone’s rights.

Modify, copy, lease, sell or distribute any of our Services.

Attempt to or assist anyone to reverse engineer, decompile or discover the source code or underlying components of our Services, including our models, algorithms, or systems (except to the extent this restriction is prohibited by applicable law).

Automatically or programmatically extract data or Output (defined below).

Represent that Output was human-generated when it was not.

Interfere with or disrupt our Services, including circumvent any rate limits or restrictions or bypass any protective measures or safety mitigations we put on our Services.

Use Output to develop models that compete with OpenAI.

Don't you think, it's a little amusing they are claiming that you need to "copy" others to create "AI". While at the same time, trying to strictly forbid anyone else from doing the same?

Hee!
As usual, the real corporate rationale is "Whatever makes us money is OK, whatever might reduce our money is not!"

1 Likes, Who?

Pengling 15 Jan 2024

Hee!
As usual, the real corporate rationale is "Whatever makes us money is OK, whatever might reduce our money is not!"

And copyright is forever-and-a-day for their stuff, but if they want to use your stuff they will regardless of what the law says.

1 Likes, Who?

LoudTechie 17 Jan 2024

I think there is an argument that reading copyrighted material is the same as a human doing so and then writing their own creative work

Look up the legal standing of fan fiction. Then repeat that statement.
Using copyrighted "aspects" is enough to be considered a copyright violation.
I suggest looking up [Marion Zimmer Bradley](https://en.wikipedia.org/wiki/Marion_Zimmer_Bradley#Literary_career).

For many years, Bradley actively encouraged Darkover fan fiction. She encouraged submissions from unpublished authors and reprinted some of it in commercial Darkover anthologies. This ended after a dispute with a fan over an unpublished Darkover novel of Bradley's that had similarities to one of the fan's stories. As a result, the novel remained unpublished and Bradley demanded the cessation of all Darkover fan fiction
The fan threatened to take Marion Zimmer Bradley to court for infringing on the fan's copyright. The fan holds the copyright to their own prose. The fan clearly does not hold the copyright to the characters. But should the author of the original work use prose from a fan work...well, things get dicey.

You'd also expect to face some legal trouble if you ripped some fan subs and tried to pass them off as your own translation (which has been done before).

Of note is the [Organization for Transformative Works](https://www.transformativeworks.org/faq/), which works to protect fan works and has this to say:

Copyright is intended to protect the creator’s right to profit from her work for a period of time to encourage creative endeavor and the widespread sharing of knowledge. But this does not preclude the right of others to respond to the original work, either with critical commentary, parody, or, we believe, transformative works.

In the United States, copyright is limited by the fair use doctrine. The legal case of Campbell v. Acuff-Rose held that transformative uses receive special consideration in fair use analysis. For those interested in reading in-depth legal analysis, more information can be found on the Fanlore Legal Analysis page.
And:

While case law in this area is limited, we believe that current copyright law already supports our understanding of fanfiction as fair use.

We seek to broaden knowledge of fan creators’ rights and reduce the confusion and uncertainty on both fan and pro creators’ sides about fair use as it applies to fanworks. One of our models is the documentary filmmakers’ statement of best practices in fair use, which has helped clarify the role of fair use in documentary filmmaking.

It's certainly not as cut and dry as you might think.

Thnx.
I'm happy to be proven wrong about the legal standing of fan fiction.
I like fan fiction and it having the option of being legal is a breath of fresh air.
I don't think it helps OpenAI, because I argue they "adversely affect the sale of the original", but I admit that is a matter of interpretation.

1 Likes, Who?
