OpenAI say it would be 'impossible' to train AI without pinching copyrighted works

By Liam Dawe - 9 January 2024 at 12:47 pm UTC

I really fear for the internet and what it will become in even just another year, with the rise of AI writing and AI art being used in place of real people. And now OpenAI openly state they need to use copyrighted works for training material.

As reported by The Guardian, the New York Times sued OpenAI and Microsoft over copyright infringement and just recently OpenAI sent a submission to the UK House of Lords Communications and Digital Select Committee where OpenAI said pretty clearly:

Because copyright today covers virtually every sort of human expression– including blog posts, photographs, forum posts, scraps of software code, and government documents–it would be impossible to train today’s leading AI models without using copyrighted materials. Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today’s citizens.

Worth noting OpenAI put up their own news post "OpenAI and journalism" on January 8th.

Why am I writing about this here? Well, the reasoning is pretty simple. AI writing is (on top of other things) increasing the race to the bottom of content for clicks. Search engines have quickly become a mess to find what you actually want, and it's only going to continue getting far worse thanks to all these SEO (Search Engine Optimisation) bait content farms, with more popping up all the time, and we've already seen some bigger websites trial AI writing. The internet is a mess.

As time goes on, and as more people use AI to pinch content and write entire articles, we're going to hand off profitable writing to a select few big names only who can weather the storm and handle it. A lot of smaller scale websites are just going to die off. Any time you search for something, it will be those big names sprinkled in between the vast AI website farms all with very similar robotic plain writing styles.

Many (most?) websites make content for search engines, not for people. The Verge recently did a rather fascinating piece on this showing how websites are designed around Google, and it really is something worth scrolling through and reading.

One thing you can count on: my perfectly imperfect writing full of terrible grammar continuing without the use of AI. At least it's natural right? I write as I speak, for better or worse. By humans, for humans — a tagline I plan to stick with until AI truly takes over and I have to go find a job flipping burgers or something. But then again, there will be robots for that too. I think I need to learn how to fish…

Article taken from GamingOnLinux.com.

Tags: Editorial, Misc

About the author - Liam Dawe

I am the owner of GamingOnLinux. After discovering Linux back in the days of Mandrake in 2003, I constantly came back to check on the progress of Linux until Ubuntu appeared on the scene and it helped me to really love it. You can reach me easily by emailing GamingOnLinux directly.

68 comments

LoudTechie Jan 21

Quoting: 14
Quoting: LoudTechie
Quoting: 14
I think I was misunderstood. We all watch movies, play video games, and read books, right? That influences our imagination. So when you make your own creative work, it is influenced by all those things. This is undeniable. Sigh.

Many know you meant that (it's a common argument in these kinds of discussions).
Penglin argued in reaction that although we do as you described, we pay for the privilege of reading, watching and playing things before getting inspired, which OpenAI didn't do (they just downloaded the content from piracy sites).

I argued that making things that contain unlicensed copyrighted elements is always illegal, including for entities, and "proved" that with the legal standing of fan fiction.
Someone correctly pointed out to me that in the case of fan fiction it's sometimes not illegal (if you can show you didn't negatively affect the sale of the original).
I argued that OpenAI still wouldn't be able to claim that, because they did negatively affect the sale of the original.

Someone else (too lazy to check who) argued that this is only a persuasive argument to the law if you treat the AI as a separate, capable entity, and the law only treats citizens and some kinds of companies as entities. The AI is neither of these things.
This is an effect of the context dependence of the law, which I find hard to understand, so I'm assuming you also think so.
This is my attempt at explaining it.
You know how wNK3c5Z5 is a randomly generated string and thus useful for security? This wNK3c5Z5 isn't secure, because it's a copy of the first one and thus not randomly generated.
The strings are identical, and still one is secure and the other isn't. It's because of the context: one is randomly generated, while the other isn't.
The law deals with these kinds of differences all the time.
If I make an AI that behaves and looks exactly like some human with a driving license, the AI still can't be allowed to drive alone, because it isn't an adult citizen with a driving license and thus can't be held responsible for its deeds.
If I smash a soft wax stamp into a statue and use the stamp as a mold to make more of that statue, I'm also violating copyright, although I didn't have to make any of the movements the artist made.
Sorry you went through the trouble to make a large post, but I appreciate it.

I don't have a strong opinion in favor of AI here. But I like trying to understand why each perspective somehow makes sense to the owner of that perspective. My choice of words, "I think there is an argument", is me saying there is a compelling-enough argument to have an argument, but I don't think I'd be the one playing representative because I haven't chosen a side.

Out of the handful of debatable elements you pointed out, in my own words, I think the most compelling argument content creators of any kind have against current AI usage in terms of copyrighted material is that AI chat bots can effectively become a proxy to the same information and harm creator profits by eliminating sales of said content as well as ad traffic for "free" content. Acting as a proxy is like hijacking in a way... mm, let's say mimicking or miming. In another way, you could say it is redistribution, which is a clear topic in copyright law as far as I know. Yeah, I think if lawyers can convince judges that AI falls under redistribution of copyrighted material, that is winnable.

First. No issue.
I like writing long blog posts.
Second.
Good attempt. Something that could help is the blog post itself; it contains some legal arguing from OpenAI.
Third.
You're certainly correct. The actual legal defense of OpenAI in their post is aimed at the exact strategy you describe in this post. They claim "fair use", which means they claim they don't harm the sale of the original.

Yes, this basically means denying the allegation.
I'm curious how they will defend against Penglin's argument (downloading without paying from recognized piracy sites is piracy).
The OpenAI blog post seems to really limit their ability to defend against such an accusation, and it matters, because it's the one allegation I'm certain can get their computers seized (those things are expensive and could reveal most of their database to the public in the inevitable court case).

Last edited by LoudTechie on 21 January 2024 at 6:12 pm UTC

Purple Library Guy Jan 21

Quoting: LoudTechie
First. No issue.
I like writing long blog posts.
Second.
Good attempt. Something that could help is the blog post itself; it contains some legal arguing from OpenAI.
Third.
You're certainly correct. The actual legal defense of OpenAI in their post is aimed at the exact strategy you describe in this post. They claim "fair use", which means they claim they don't harm the sale of the original.

Yes, this basically means denying the allegation.
I'm curious how they will defend against Penglin's argument (downloading without paying from recognized piracy sites is piracy).
The OpenAI blog post seems to really limit their ability to defend against such an accusation, and it matters, because it's the one allegation I'm certain can get their computers seized (those things are expensive and could reveal most of their database to the public in the inevitable court case).

Moving briefly from the technical/legal to the political/economic, it seems to me fairly clear that whatever can be proved and whether or not they are succeeding/will succeed, the intent of, say, Google in deploying AI is precisely to gain revenue by moving it from content creators to themselves, by making it unnecessary for people to go to the actual websites producing information. Where the eyeballs are is where the advertising revenue is, so if Google keep the eyeballs for themselves they keep the revenue as well. And there doesn't seem to be anything else in the wind that would make it worth all the expense of making the AI.

If discovery drags out examples of them planning this, it could become a significant issue, both in the actual court case itself and in the broader political environment with respect to whether governments decide to make laws and regulations just to stop corporations using AI from cannibalizing the broader web.

And one clear thing about copyright law and the rhetoric around it is, it is ultimately an instrumental set of laws. They exist as they are for the purpose of achieving policy goals, such as encouraging artistic production and enriching monopolistic corporations (some of those goals conflict). There are major and ongoing attempts to craft a morality around those laws, to encourage adherence to them, but if the laws change we will end up with attempts to craft morality around whatever laws we end up with. But let's not forget that copyright as a concept is recent (unlike, say, murder, or even theft). There was no copyright in the Middle Ages or even the Renaissance. It was a concept eventually generated by the printing press and capitalism, presumably because with the printing press in a capitalist economy, you could make money printing things, and so there was a need to regulate things to make that process smoother for capitalists. So my point there is, the morality was built around the policy, not the other way around, and this will continue to be the way things work; we will end up with the policy that works (for the strongest political interests) and any moral issues will be adjusted to fit.

But AI are quite different things from printing presses; even the internet is quite different from printing presses. The strains on copyright law are increasing, and the possible policy and societal interests that could be served by different approaches are multiplying. It seems very likely that there will be things done to regulate the use of AI, that will probably have broader implications around copyright issues.

Last edited by Purple Library Guy on 21 January 2024 at 7:09 pm UTC

LoudTechie Jan 21

Quoting: Purple Library Guy
But let's not forget that copyright as a concept is recent (unlike, say, murder, or even theft). There was no copyright in the Middle Ages or even the Renaissance.

That depends on your definition of old: the printing press and copyright law were introduced in the late Middle Ages (1400-1500).
As such it's older than the USA, western hegemony, most European countries smaller than Portugal, the entire concept of a constitution, all still existing fully democratic governmental systems and the general solution to volume calculations.

Quoting: Purple Library Guy
And there doesn't seem to be anything else in the wind that would make it worth all the expense of making the AI.

There is another reason it's worth it.
Few entities have the budget to do it.
A problem with software companies is that in theory their entire advantage exists in having more and better employees and ideas than the rest, but that is a really fleeting position. Monopolizing ideas is temporary, expensive and hard to maintain internationally, and employees move a lot.
In practice they also have things like hosting costs, the network effect, etc., but it is true of this entire list that they depend on a lot of other parties.
The risk of some young upstart screwing up your entire business still exists and happens to Facebook all the time.
LLMs allow you to buy tools that nobody else can get without a massive investment and profit from it in the software space.
This notion is decaying, though: training a full competing LLM can nowadays be done with just 100,000 euros of hardware.

Purple Library Guy Jan 22

Quoting: LoudTechie
Quoting: Purple Library Guy
And there doesn't seem to be anything else in the wind that would make it worth all the expense of making the AI.

There is another reason it's worth it.
Few entities have the budget to do it.

Uh, yeah, and that makes it worth it how if there's no revenue associated? I was pointing out where the revenue seems to be coming from. You counter that not by pointing out that the stuff is expensive, which just makes the point that they better have some revenue coming in, but by pointing out an alternative source of revenue, and suggesting some reason they'd want that source instead rather than having both. (Except you shouldn't, because I'm right.)

Last edited by Purple Library Guy on 22 January 2024 at 1:47 am UTC

Purple Library Guy Jan 22

Quoting: LoudTechie
Quoting: Purple Library Guy
But let's not forget that copyright as a concept is recent (unlike, say, murder, or even theft). There was no copyright in the Middle Ages or even the Renaissance.

That depends on your definition of old: the printing press and copyright law were introduced in the late Middle Ages (1400-1500).

There were printing presses for some time before there was copyright.
Wikipedia:
"The British Statute of Anne 1710, full title "An Act for the Encouragement of Learning, by vesting the Copies of Printed Books in the Authors or purchasers of such Copies, during the Times therein mentioned", was the first copyright statute."

I'd say that, as I stated, is past the Middle Ages or even the Renaissance. Maybe not quite capitalism, but certainly starting to head that way. And 1710 did not mark full blown arrival of the copyright regime, when you consider that this is the first such statute, it was very limited compared to later ones, only applied to Britain, and I wouldn't be surprised if it took a while before anyone really started paying attention.

But even if you had been right about that detail, that would still be a rather small smidge of human history and my overall point would remain sound.

Last edited by Purple Library Guy on 22 January 2024 at 2:11 am UTC

14 Jan 22

Quoting: Purple Library Guy
So my point there is, the morality was built around the policy, not the other way around, and this will continue to be the way things work; we will end up with the policy that works (for the strongest political interests) and any moral issues will be adjusted to fit.

I don't agree with this completely. The concept of the laborer being worthy of his hire is in religious text which is pretty old. The printing press example shows how policy lags behind technology. There is a period of self governing until enough people complain of unfair or hurtful practices (which seem inevitable unfortunately), then rules get invented.

I don't want to bring religious morals versus political and economic influence into the conversation; I merely used it to point out how old the idea of protecting someone's rightful pay is.

Purple Library Guy Jan 22

Quoting: 14
Quoting: Purple Library Guy
So my point there is, the morality was built around the policy, not the other way around, and this will continue to be the way things work; we will end up with the policy that works (for the strongest political interests) and any moral issues will be adjusted to fit.

I don't agree with this completely. The concept of the laborer being worthy of his hire is in religious text which is pretty old. The printing press example shows how policy lags behind technology. There is a period of self governing until enough people complain of unfair or hurtful practices (which seem inevitable unfortunately), then rules get invented.

I don't want to bring religious morals versus political and economic influence into the conversation; I merely used it to point out how old the idea of protecting someone's rightful pay is.

Well, except frankly copyright was always about benefits for publishers, not really for authors. So the worker being worthy of their hire is a bit of a smokescreen. It is, after all, the right to make copies, which authors were not originally capable of doing; you needed a printing press. But, sure, OK, that's a real ethic getting deployed to motivate copyright, true enough--it's not completely artificial.

Still, it's very much an artifact of the particular economic system. It only makes sense if publishing is done by outfits which are both multiple and privately owned, and the authors are paid by them. So for instance, say there were a theocracy where the church was the only publisher, rights for authors wouldn't have much to do with copyright because nobody else could copy at all. Or, say the government paid authors, based on some measure of their total cultural reach (number of copies sold, amount of derivatives in other media if any, amount of criticism and other discussion, presence in education) and publishers did not, they were just allowed to publish anything they thought they could sell, then attribution would be important but copyright would not. Even across relatively minor changes in how our economy works, our sense of copyright and the ethics around it has changed quite a bit just in the last few decades. For instance, in 1970 nobody would have connected copyright with property, they were distinct concepts and the moral ideas surrounding property had not been imported into the copyright concept. And I would say that copyright, patent and trademark were much less "grouped" as concepts than they became after the invocation of "intellectual property" as a metaphor for all of them.

pleasereadthemanual Jan 22

Quoting: Purple Library Guy
Well, except frankly copyright was always about benefits for publishers, not really for authors.

Just cutting in here, without the slightest amount of tact, to say that while this was definitely true for a long time, publishing options for authors have expanded greatly. Traditional publishers require you to sell the rights to your book, but you could choose a hybrid publisher and retain your rights; you just need to pay them for their services. You can also self-publish on KDP and many other sites. Eragon is famously self-published.

So, while there weren't benefits for authors before, the landscape has changed a lot.

Quoting: Purple Library Guy
For instance, in 1970 nobody would have connected copyright with property, they were distinct concepts and the moral ideas surrounding property had not been imported into the copyright concept.

It was actually a series of GNU articles on "intellectual property" that taught me about this. I found those very enlightening.

And on a completely different subject, there was no copyright in the 1600s. Don Quixote was a really popular book at the time, but Cervantes was either taking a while to write a sequel or didn't want to. Someone (Alonso Fernández de Avellaneda) wrote an unauthorized sequel; an early example of fan fiction. It was bad. So bad, in fact, that Cervantes mocked it in the eventual sequel he wrote 10 years later. An astonishing amount of meta-fiction for the period...

It's possible that without that unauthorized sequel, Cervantes may never have written an official sequel. So the idea that the absence of copyright leads to worse work and less motivation from authors to write seems to ring false to me. What it really enforces is that authors be good at what they do, lest fans or opportunists take their audience from them. Cervantes was very good at what he did, and Avellaneda was not, so Cervantes did not struggle to capture an audience with his late sequel.

I've talked to a few fans of RWBY; it's rather astonishing how many of them have said, "yeah, I read a lot of RWBY fics but I haven't seen the series in a long time." I'm one of them, actually. I find the fan content better written than the Rooster Teeth series. I've even bought a Not This Time, Fate fan art print... It makes me wonder how different RWBY would be if copyright was weakened or didn't exist. Would Rooster Teeth feel compelled to do a better job? Would Coeur Al'Aran be producing his own RWBY series?

I'd be happy if copyright were just reduced to its original 28-year maximum term, and they can keep the later amendment to make copyright implicit to prevent stuff like Night of the Living Dead's untimely fall into the public domain from happening.

dvd Jan 22

Maybe I'm wrong, but to me what seems especially funny is how quickly they went from 'non-profit' to profit-oriented.

LoudTechie Jan 22

Quoting: Purple Library Guy
Quoting: LoudTechie
Quoting: Purple Library Guy
But let's not forget that copyright as a concept is recent (unlike, say, murder, or even theft). There was no copyright in the Middle Ages or even the Renaissance.

That depends on your definition of old: the printing press and copyright law were introduced in the late Middle Ages (1400-1500).
There were printing presses for some time before there was copyright.
Wikipedia:
"The British Statute of Anne 1710, full title "An Act for the Encouragement of Learning, by vesting the Copies of Printed Books in the Authors or purchasers of such Copies, during the Times therein mentioned", was the first copyright statute."

I'd say that, as I stated, is past the Middle Ages or even the Renaissance. Maybe not quite capitalism, but certainly starting to head that way. And 1710 did not mark full blown arrival of the copyright regime, when you consider that this is the first such statute, it was very limited compared to later ones, only applied to Britain, and I wouldn't be surprised if it took a while before anyone really started paying attention.

But even if you had been right about that detail, that would still be a rather small smidge of human history and my overall point would remain sound.

The first copyright privilege granted to a publisher was in Venice in 1489 over the work "Rerum venetarum ab urbe condita opus". Certainly, back then copyright functioned more like a patent, in that you had to apply for a copyright instead of getting it automatically.
Also, 1710 is still older than the US Constitution of 1787 (the oldest constitution still in force).
Also, my statement about it being older than the concept of a constitution is wrong. That was introduced by Aristotle in 350 BC.

Last edited by LoudTechie on 22 January 2024 at 9:53 am UTC
