Cointime

THE MODELS ARE YOURS: THE PUBLIC'S LEVERAGE IN AI

Matt Prewitt

January 8, 2024

On December 27, the New York Times filed a lawsuit, claiming that Microsoft and OpenAI infringed the Times’ copyrights by using its writings to train GPT-4 and other AI models. It follows a series of similar lawsuits from the Authors Guild, writers Michael Chabon and Sarah Silverman, and more.

These cases are asking one monumental question: whether training an AI model on copyrighted material violates the copyright. If copyrights are infringed by training models on them, then everyone with a digital footprint has probably already had their rights violated, and will enjoy significant leverage over the future of this transformative technology.

WHAT’S AT STAKE IN THE NEW COPYRIGHT BATTLES

It is a starkly binary legal question: yes or no. Training AI models on copyrighted material either infringes the copyright or doesn’t. And the two possible outcomes point toward very different future worlds.

What would it mean if the AI companies’ lawyers prevail, and courts find that AI models may freely ingest copyright-protected material? Having been trained on authors’ work, AI models will sooner or later be able to do almost exactly what all those authors do, and more, millions of times faster. So AI models’ owners, and not authors, will own most of the fruits of creative labor.

At that point things could get very weird. Disenfranchised authors might start sharing their work only in the shadows by forbidding recordings, banning reviews, and so on. Luddite subcultures might form around efforts to keep creative work off the record and out of the systems.

On the other hand, suppose that courts find that the AI companies have violated authors’ rights. Their potential liability, civil and possibly even criminal, could be as unprecedented as the technology itself. This is because copyright provides for steep statutory damages: a minimum of $750 in damages per copyright violation.[1] Given the unimaginable reams of copyrighted works that have presumably already been incorporated into systems like GPT-4 and Claude, you don’t even need to do the math. AI companies, even gigantic ones, could be bankrupted by the damages they owe to, well, all of us.
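For readers who do want to do the math, here is a rough sketch of the arithmetic. The per-violation figures are the ones cited in this article and its first note; the count of violations is purely a hypothetical placeholder, not a number drawn from any lawsuit or dataset, and the function name is invented for illustration.

```python
# Statutory damages figures as cited in this article (see note 1).
MIN_PER_VIOLATION = 750
MAX_PER_VIOLATION = 30_000
MAX_WILLFUL_PER_VIOLATION = 150_000


def statutory_exposure(num_violations: int) -> dict[str, int]:
    """Return total exposure in dollars at each per-violation figure."""
    return {
        "minimum": num_violations * MIN_PER_VIOLATION,
        "maximum": num_violations * MAX_PER_VIOLATION,
        "maximum_willful": num_violations * MAX_WILLFUL_PER_VIOLATION,
    }


# ASSUMPTION: a made-up count chosen only to show the scale of the totals.
HYPOTHETICAL_VIOLATIONS = 1_000_000

if __name__ == "__main__":
    for label, total in statutory_exposure(HYPOTHETICAL_VIOLATIONS).items():
        print(f"{label}: ${total:,}")
```

Even at the statutory minimum, a hypothetical one million violations comes to $750 million, and the totals scale linearly from there.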

Of course, destroying the companies is not what most stakeholders want. But the public and the government should appreciate just how much leverage they might have to achieve public-interest outcomes, like a grand settlement resulting in some kind of public governance rights or equity stake.

Leaving aside what companies might already owe, if copyrights are infringed by AI training, the future simply looks different. Content creators, including ordinary people producing copyrightable digital footprints (students, employees, social media users, etc.), could have huge leverage over the future of the technology. If they organize and bargain collectively (instead of getting “picked off” by individual agreements) they will hold the strings to datasets that are necessary ingredients to the world’s most powerful AIs. The public will have a seat at the table.

Europe has already given us one sketchy glimpse of what that might look like. Drafts of the EU’s AI Act, now jeopardized by stalled negotiations, have suggested the bloc may give copyright holders the ability to programmatically opt out of their works’ use in AI training. The artists Holly Herndon and Mat Dryhurst have already set up an organization, Spawning, through which many artists have done just that. It could be a sign of things to come, and the EU’s regulations will be an important factor in this conversation.

Another possibility must be noted. If it becomes clear that AI cannot be lawfully trained on all publicly available information, it could create an opening for actors beyond the reach of the law. Given the possible military applications of the technology, state actors will not want that to happen. This would nudge the state security apparatus even further into the AI business.

WHY TRAINING ON COPYRIGHTED MATERIAL IS INFRINGEMENT: IT’S LIKE PLAYER PIANOS

With all those considerations lurking, how will courts resolve the key question?[2] Namely: under US law, does training AI on copyrighted materials constitute infringement, or is it fair use?

Courts look at four factors to determine whether a use of copyrighted material is excused as “fair use”. They are:

  1. the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;
  2. the nature of the copyrighted work, that is, whether it is more “expressive” or factual in nature;
  3. the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
  4. the effect of the use upon the potential market for or value of the copyrighted work.

First, some relevant facts. Large Language Models (the technology which underpins the commercially available generative AI products) can be thought of as a type of data compression (researchers have even experimentally shown that LLMs can serve as compressors of audio files, much as MP3 encoders do).[3] When a model trains on a copyrighted text, it stores information about the text in the form of statistical weights relating “tokens” (words, letters, or phrases) to one another. These weights embody information about the statistical relationships between those tokens in the text. This information takes the form of numbers representing the probability that, say, word C will appear after word B if word B is preceded by word A. This data is not stored in silos corresponding to particular texts; instead the model as a whole simply uses the information from each text to modify the information culled from all the other texts it trained on. Humans cannot directly make sense of this statistical data, but they can, with a little effort, use it to reconstruct something very close to the original. The providers of these models install secondary safeguards designed to make this kind of exact reconstruction more difficult; but the fundamental capacity to perform such reconstructions is latent in the technology.
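To make the "word C after word B after word A" description concrete, here is a toy sketch. It is emphatically not how GPT-4 or any production model works: real LLMs learn neural network weights rather than keeping a table of counts, and the function names here are invented for illustration. It only shows, under those simplifying assumptions, how purely statistical information about token sequences can be enough to regenerate a training text.

```python
from collections import Counter, defaultdict


def train_trigram(tokens):
    """Count how often each token follows each pair of preceding tokens."""
    counts = defaultdict(Counter)
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        counts[(a, b)][c] += 1
    return counts


def next_token_probs(counts, a, b):
    """Turn raw counts into the probability of each token following (a, b)."""
    following = counts.get((a, b), Counter())
    total = sum(following.values())
    return {tok: n / total for tok, n in following.items()}


def greedy_continue(counts, a, b, max_tokens=50):
    """Starting from (a, b), repeatedly emit the most probable next token."""
    out = [a, b]
    for _ in range(max_tokens):
        following = counts.get((out[-2], out[-1]))
        if not following:
            break  # no statistics for this context, so stop
        out.append(following.most_common(1)[0][0])
    return out


# A stand-in "training text"; any token sequence would do.
tokens = "the quick brown fox jumps over the lazy dog".split()
counts = train_trigram(tokens)

print(next_token_probs(counts, "the", "quick"))  # {'brown': 1.0}
print(" ".join(greedy_continue(counts, "the", "quick")))
# the quick brown fox jumps over the lazy dog
```

The table holds no sentences, only counts, yet a two-word prompt is enough to recover the whole input. When many texts are mixed together the counts blend, which is the sense in which the data is not stored in silos.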

A trained model thus contains information amounting to a “lossy” compression of all of the copyrighted input. This is significant because other forms of “lossy” information compression are clearly copies. For example, an MP3 file compresses the information in a master tape, discarding much of the original recording data. And a human cannot read the binary code of an MP3 file and recognize it as a song. But MP3s are obviously not “fair uses” of recordings. The compressed file can be decompressed, or played back, in a form that sounds similar to the master tapes, a fact sufficient to prove that MP3s are copies.

Now let us return to the fair use factors. The third factor, “substantiality”, clearly weighs against the AI companies because whole unaltered texts are fed into the models. The second factor, “expressivity”, does too, since all sorts of works are used in training, including paradigmatically expressive ones.

The fourth factor looks scarcely better for the AI companies. Their models have obvious potential to harm the market for original works. Users can consult models trained on an author’s work to obtain not only information about that work’s contents, but also a rich experience of its style and character. In many cases, consulting a model might be more efficient and satisfying than consuming the source material itself. This can and does constitute a reason not to buy the book, or subscribe to the magazine, or (soon) watch the movie: core expressive aspects of works can be substantially appreciated through the models alone. Search engines headed off a similar copyright issue when publishing excerpts of news articles, by saying the search engine was driving traffic and revenue to the original authors. But in the case of generative AI, the competitiveness with the original is clearer, and it would be surprising if any court were persuaded that the authors’ works were not being in some important sense superseded.

Now to the first and most important factor. Obviously the uses of models are commercial; but are they transformative in character? This is the argument AI companies will likely end up relying upon.

On the surface, AI models may seem to have transformed expressive training material into bleep-bloop numerical arrays inside an AI model. But these numerical arrays are compressed copies of original works, capable of being transformed right back into works of a near-identical character, just as MP3s are copies of master tapes.

Courts should not be confused by the fact that AI companies package their models with secondary safeguards designed to frustrate exact reconstructions of the source material. The outputs in the chatbox are pastiches, usually “transformative” ones, but those pastiches are not the relevant “copies”. The “copies” are the models themselves: the extraordinarily powerful pastiche-generators capable of rendering outputs that supersede their source material, whose power depends on containing compressed copies of that source material.

New information compression technologies have always affected the nature of whatever they compress. For example, when music recording was invented, music itself changed. In the early 1900s, player piano rolls and phonograph recordings were not legally recognized as unlawful “copies” of protected musical writings. A composer’s work consisted only of musical notation and lyrics; only the sheet music was subject to rights related to reproduction. The Supreme Court affirmed as much in the 1908 case of White-Smith Music Publishing Co. v. Apollo Co. But even at the time, that didn’t make sense: the case is widely remembered as a judicial misfire. And the copyright law’s period of adjustment to new technology was mercifully brief. Recognizing that piano rolls and audio recording had changed the nature of musical expression, Congress responded by passing the Copyright Act of 1909, which gave musical authors rights in recordings, the so-called mechanical rights.

Generative AI is actually very much like player pianos—even down to the eerie, mistaken attribution of disembodied agency. Just as in 1908 recordings were emerging as the predominant artifacts of musical production, generative AI outputs are now emerging as the definitive artifacts of all recorded human expression. This shift will intensify rapidly.

Is our legal and political system still capable of rapidly responding with laws that guarantee authors (even nonprofessional ones) a seat at the table?

A NEW STRATEGY—A NEW COPYRIGHT LAW

As I said, the courts will find either that AI training violates copyrights, or that it doesn’t. Either way, a radically new copyright doctrine will need to be worked out legislatively. But it would be an auspicious start for courts to find against the AI companies now—first, because this is the best interpretation of the current law, and second, because it will rightfully strengthen authors’ bargaining position in any subsequent political settlement.

Across society, we should be organizing to meet the moment and guide our politicians. The tech companies’ lawyers certainly are; the rest of us can’t afford to be years behind them. How should we organize?

First, coalition-building. SAG-AFTRA and the Writers Guild have brought this issue to national attention; they should not be fighting alone. Where are the school systems, the universities, the religious organizations, the podcasters, the political movements, and others who have great influence and stake in important and protected data? They should be joining forces, collaborating with Spawning and others. This isn’t a partisan cause, and it isn’t anti-AI; it’s a simple matter of public empowerment.

Second, lawyers, academics, and technologists need to come together to debate and draft the legal resettlement we need. What are the deep principles and common values that we want intellectual property law to protect? Have we, perhaps, been underestimating the diffuse social contributions to “individual” intellectual work for some time; and can we devise a sensible way for the law to now correct this error? Can automatically-created mechanical licenses to musical recordings serve as a template for a new regime that gives everyone a stake in the AI models based upon their work?

The tech lobbyists will surely hand finished text to our representatives. Where is the countervailing proposal?

Special thanks to Lucas Geiger for helpful comments and edits.

Notes

  1. Maximum per-violation damages are capped at $30,000, or $150,000 if the infringement was willful. The latter is not out of the question. Willful violations need not be “knowing”; they can be merely “reckless”. And the AI companies have taken many actions indicating that they knew they might be violating copyrights, such as falsely claiming that they did not train on copyrighted material. This is evidence of recklessness. ↩︎
  2. There have been some early setbacks for plaintiffs in these lawsuits, but these are mostly procedural; it is much too early to say that the AI companies will defeat the claims I am focusing on here. ↩︎
  3. Language Modeling Is Compression, https://huggingface.co/papers/2309.10668 ↩︎