Everyone is Committing Copyright Infringement

2024 through the Lens of the NYT Lawsuit

Jan 01, 2024

Looking forward, 2024 portends more change in the AI industry, with the ripples of new stones colliding with the unsettled waves from last year. Today I’d intended to go a little deeper into the first herald of the new year: the NYT lawsuit against OpenAI. However, the legal and technical issues surrounding that lawsuit are already promising to bend our attention in other directions in 2024.

Here’s a broad view of the issues surrounding the NYT lawsuit, from hot takes around the web.

A surprisingly comprehensive and aggressive layman’s editorial on copyright

Avram Piltch (EIC, Tom’s Hardware) writes:

The question isn’t just “do they plagiarize my work verbatim” but “is the very act of using these materials to create AI software an act of infringement?”
The New York Times certainly thinks that the act of “training” itself is infringement. On page 60 of the lawsuit filing, it writes that “by storing, processing, and reproducing the training datasets containing millions of copies of Times Works to train the GPT models on Microsoft’s supercomputing platform, Microsoft and the OpenAI Defendants have jointly directly infringed The Times’s exclusive rights in its copyrighted works.”

A Twitter rebuttal of the concept of LLMs “storing” and “reproducing” copyrighted works

Kevin Bryan (University of Toronto) writes that famous NYT articles (“common, unique text”) are as impossible for LLMs to ignore as the Pledge of Allegiance is for a U.S. third grader to forget:

NYT/OpenAI lawsuit completely misunderstands how LLMs work, and judges getting this wrong will do huge damage to AI.… LLMs don't and can't "memorize" facts or training text; they just predict next words based on all English text; commonly-repeated unusual text is easy to predict next word; copyright is poorly suited for this.

In other words …

Zvi Mowshowitz clarifies:

In practice, one can think of this as ChatGPT committing copyright infringement if and only if everyone else is committing copyright infringement on that exact same passage, making it so often duplicated that it learned this is something people reproduce.
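To make the duplication argument concrete, here is a toy sketch of my own, not anything from the lawsuit or from how GPT models are actually built. It trains a tiny word-count model that always predicts the most frequent next word. The two passages, the 50-to-1 duplication ratio, and the function names `train` and `complete` are all invented for illustration. The point is only that a passage repeated many times in the training text comes back verbatim from a short prompt, while a passage seen once gets crowded out.

```python
from collections import Counter, defaultdict

def train(corpus, n=2):
    """Count which word follows each n-word context (a toy n-gram model)."""
    counts = defaultdict(Counter)
    for doc in corpus:
        words = doc.split()
        for i in range(len(words) - n):
            context = tuple(words[i:i + n])
            counts[context][words[i + n]] += 1
    return counts

def complete(counts, prompt, max_words=20, n=2):
    """Greedy decoding: repeatedly append the most frequent next word."""
    words = prompt.split()
    for _ in range(max_words):
        context = tuple(words[-n:])
        if context not in counts:
            break  # the model has never seen this context; stop
        words.append(counts[context].most_common(1)[0][0])
    return " ".join(words)

# Hypothetical stand-ins: one passage copied all over the "web", one seen once.
famous = "the quick brown fox jumps over the lazy dog"
rare = "the quick brown cat naps under the old oak"

corpus = [famous] * 50 + [rare]
model = train(corpus)

# A short shared prompt follows the heavily duplicated passage.
print(complete(model, "the quick"))

# The rare passage only comes back if the prompt already contains
# enough of it to steer the model away from the common continuation.
print(complete(model, "the quick brown cat"))
```

In this toy setup the rare passage is still recoverable, but only with a prompt that already includes its distinctive words. Real LLMs are vastly more complicated than a word counter, so treat this as an analogy for the frequency argument and nothing more.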

But it’s not going to be easy to explain all the red text to a jury

Cecilia Ziniti (CEO, General Counsel AI) writes:

My take? OpenAI can't really defend this practice without some heavy changes to the instructions and a whole lot of litigating about how the tech works. It will be smarter to settle than fight. The visual evidence of copying in the complaint is stark. Copied text in red, new GPT words in black—a contrast designed to sway a jury. See Exhibit J here.

And the shape of settlements is already being determined

From The Atlantic:

Although OpenAI has an agreement to use archival material from the Associated Press, Axel Springer is reportedly the first publisher to provide ongoing news stories.… All of the websites, all of the writing: It’s plumbing for a digital faucet. With fresh content surging, ChatGPT will become more viable as a one-stop shop through which to experience the internet.

But what if NYT wants its pound of flesh?

Zvi continues sagely:

[An OpenAI settlement] presumes that The New York Times is in a settling mood and will accept a reasonable price, in a way that sets a precedent that OpenAI can afford to pay out for everyone else. If that was true, then why were things allowed to get this far? So I presume that the two sides are pretty far apart. Or that NYT is after blood, not cash.
I would be careful if I was the Times. Their reputation and that of journalism and legacy media in general is not what it once was. ChatGPT provides a lot more value to more people than it is taking away from a newspaper.… What harm is being done to the New York Times? Yes, there were times when ChatGPT would pull entire NYT articles if you knew the secret code. But those [secret] codes become invalid if a lot of people use them, so the damage will always be limited.
The flip side is that the public is very anti-AI, and most people aren’t using ChatGPT.… You need a way to support creators. You need to respect property.
Ideally we find a way that works for all, both for books and for data.
Early action says if this does not settle then NYT will likely win. I think that’s the wrong question, though? What matters is the price.

