This is the second in a two-part series exploring intellectual property laws, their issues and how they are being impacted by the development and applications of generative AI systems.
As generative artificial intelligence technologies continue to grow, developers and copyright holders remain at odds regarding the use of copyrighted material in training data.
Copyright owners believe they should be fairly compensated for use of their work — as they contribute to the capabilities and quality of generative AI models — while some technologists argue imposing the cost of actual or potential copyright liability on the creators of AI models will "kill or significantly hamper" AI development.
We previously considered the issues around web scraping for protected data in AI training datasets and determined that, pending rulings on whether web scraping qualifies as reproduction, a fair use defense may still be necessary for companies accused of violating copyright through this use of protected data.
To evaluate the fair use exception specific to AI, we consider a court's analysis of the traditional four factors through the lens of the recent case, Thomson Reuters v. ROSS Intelligence. This case provides pivotal insights into how future courts may approach copyrighted works used in AI training models. We also consider the questions around copyright protections for authorship by generative AI models, or works created by a combination of human and AI systems.
Does current fair use analysis apply to copyrighted materials used to train AI?
A court's review of a fair use defense considers four factors collectively: the purpose and character of the use, the nature of the copyrighted work, the amount and substantiality of the portion used, and the effect of the use on the potential market for the original work.
So far, courts have decided few digital-content cases that can usefully inform fair use analysis of AI training data. In one that may be applicable, Authors Guild v. Google, the U.S. Court of Appeals for the 2nd Circuit found Google's book scanning and snippet display functions were sufficiently transformative because the service augments "public knowledge … without providing the public with a substantial substitute" for the original works.
However, fair use analysis is always case-dependent, so this prior outcome does not guarantee the same conclusion for the present questions. The prior court reasoned Google's use made it easier to find original works, but current generative AI programs generate outputs without direct reference or attribution to their sources.
A broader review of the U.S. Supreme Court's historical fair use analyses around transformative uses also provides few clues to indicate a specific outcome for AI. The court's foundational transformative use analysis in the 1994 case of Campbell v. Acuff-Rose Music found that, while transformative use is not mandatory to establish fair use, "the goal of copyright … is generally furthered by the creation of transformative works." The court also said the fair use doctrine "permits [and requires] courts to avoid rigid application of the copyright statute" when it would stifle the very creativity the law is designed to foster.
On the other hand, in its most recent fair use case — Warhol v. Goldsmith — the Supreme Court determined transformative use is a matter of degree and not dispositive over the other three factors, especially in an analysis with commercial implications. For now, however, the challenge falls to various lower courts to respond to the ongoing litigation over how to apply this analysis to generative AI models.
Multiple lawsuits have been initiated against generative AI companies regarding copyrighted data used for model training, alleging the defendants infringed copyright by ingesting works for AI models that are subsequently capable of generating outputs that mimic, compete with or reproduce their works.
To establish infringement, plaintiffs must demonstrate two key elements: that they possess a valid copyright in the work — often not the point at issue — and that the defendant copied their work, typically shown by demonstrating the defendant likely had access to the copyrighted work and created something substantially similar.
To establish likely access, plaintiffs are relying on the lack of detailed disclosure of training data and the scale of web scraping to show a high probability of access to their work. That is, LLMs are known to be trained on mass collections of data, and many providers do not disclose enough information to support the exclusion of particular sources.
However, proof of copying is where plaintiffs struggle the most regarding generative AI. As discussed previously, the technology around data collection for training does not map neatly onto a traditional understanding of what it means to copy an original work. If no image or text is directly collected or retained by design, the remaining option is to consider the substantial similarity of outputs, a detailed process in traditional copyright law.
It is clear generative AI models can create artworks or writings in the style of a specific artist or author, but since copyright does not protect style, this is not an infringement. The unique capabilities of these models make it hard to meet the threshold required for substantial similarity of output to a singular original input. Already, some claims in lawsuits regarding model outputs have been disposed of on these grounds. To solve this issue at scale, a one-by-one comparison of individual outputs will not be sufficient. A focus on the training stage of AI models and the common aspects of training data writ large will be needed.
The central question, therefore, is whether using mass quantities of copyrighted material to train these models constitutes direct copyright infringement of each individual input and, if so, whether that use could be considered a fair use exception. In 2023, the U.S. District Court for the District of Delaware was the first to address this precise issue, via summary judgment in the case of Thomson Reuters Enterprise Centre et al. v. ROSS Intelligence.
Reuters alleged a generative AI model developed by ROSS Intelligence was trained on proprietary headnotes from Reuters' legal research service, Westlaw. In denying summary judgment, the court held a jury must decide whether the inputs, or headnotes, were protected by copyright and, if so, whether ROSS's copying was nevertheless covered by fair use. But the court still included an extensive fair use discussion, which provides the first clear insights into how courts may approach this question.
Factor 1: Purpose and character of the use. Citing Warhol v. Goldsmith, Reuters argued commercial use weighs heavily against a fair use defense.
The court, however, found the Google v. Oracle case to be more applicable in dealing with new technology. In Google v. Oracle, the Supreme Court ruled Google's limited copying of Oracle's Java application programming interface was sufficiently transformative to qualify as fair use. The court said that, while the current use was indeed commercial, ROSS's copying was transformative to a significant extent "if ROSS's AI only studied the language patterns in the headnotes to learn how to produce judicial opinion quotes."
Similar reasoning might favor fair use in the AI training context if, for example, courts decided generative AI programs studied copyrighted training data to learn how to produce similar types or categories of outputs, such as fiction or impressionistic art.
Factor 2: Nature of the copyrighted work. The more creative the work, the more this factor favors the plaintiff, the original creator.
In the Reuters case, it will be up to the jury to decide how creative the case headnotes are, but the court indicated that, in its own view, this factor would likely support a fair use exception. The court deemed Westlaw's Key Number System more informational than creative, "merely a way to arrange 'informational' material. So the system inherently involves significantly less creative or original expression than traditionally protected materials …"
This weighing would therefore vary greatly across the broad variety of training data in generative AI systems — some of which might be considered informational, but much of which would surely be recognized as almost wholly creative work. It might be difficult, therefore, to state categorical rules that cover mass quantities of inputs obtained from huge numbers of sources.
Factor 3: Amount or substantiality of the portion used. Although a general copyright analysis says the greater the portion of the original used, the more this factor weighs against fair use, the Reuters court relied on the prior 2nd Circuit decision in Authors Guild v. Google, which stated even verbatim copying has consistently been upheld under fair use if the copy is not revealed to the public.
Since generative AI training data is normally confidential, this might carry forward to the questions about whether such collection — even if considered a copy — is still allowable.
The court also directed that, if the headnotes are protected, the jury must determine whether the scale of copying was necessary to advance the transformative goals. It seemed to expect it might be.
This analysis might apply similarly to future generative AI cases, given the parallel that training data is ingested at scale but is not itself revealed in a model's outputs.
Factor 4: Effect on potential market. Traditionally, direct competition between the new material and the original almost always weighs in favor of the original creator. The more the purpose of the new content diverges from that of the original, the more likely a fair use exception will be allowed. In this case, the court considered two relevant markets in its analysis: the market for Westlaw as a legal research platform and the demand for Westlaw's data.
The court theorized ROSS would not be a market substitute for Westlaw if it created a new platform that served a different commercial purpose. However, it left it to the jury to determine if ROSS's use would affect the potential market for Reuters' Westlaw content. This factor is possibly the hardest to predict in terms of impact on generative AI services.
The ability for more people to generate content will almost certainly impact the overall market demand for particular or specialized artwork, but the overall purpose of a large language model is not to create commercial — or even personal — art, nor solely to generate text such as fictional writing or documentary analysis. These platforms are designed to support diverse purposes across almost limitless markets, many of which bear no relevance at all to the full scope of their inputs, much less compete with them directly.
There is no clear conclusion to be reached from this case that can be simply applied to future training data challenges. But it is at least clear it won't be a slam-dunk to protect copyrighted training data from all such uses.
Therefore, if courts want to find ways to support the spirit of copyright protections — to enable and protect human creators — they may have to look beyond restricting training data on copyright grounds. Effectively protecting markets for human content creators may require other strategies entirely.
One of those strategies might be limiting applications or use cases for AI outputs, starting with denying them copyright protections of their own.
Another significant copyright question raised by the advent of generative AI models and LLM platforms has to do with what protections will be applied or afforded to their outputs. If not found to infringe on existing copyrighted works, they would constitute new material — a created work.
The incentive under the Intellectual Property Clause of the U.S. Constitution is to protect authors' writings and inventors' discoveries. However, the U.S. only recognizes human creativity. The Copyright Act extends protection to "original works of authorship." While the term human is not used, the U.S. Copyright Office and courts have limited "works of authorship" to human authors, refusing, for example, to extend copyright protection to works attributed to monkeys or divine beings.
AI-generated works raise conflicting authorship claims. Some argue AI-generated works were authored by the human who designed the prompts necessary to create the output. Artists use tools like Adobe Photoshop and Illustrator to create visual art, so it may be challenging to draw clear definitions and boundaries showing how generative AI art differs.
Further, many artists and writers may take the output of the AI model and continue to adapt, edit or refine it. Nevertheless, the concept that an AI cannot be granted a copyright is the existing status in the U.S. and may be a policy decision affirmed in the future because it supports larger social goals.
Currently applicable reviews of technology and authorship include the U.S. Copyright Office's refusal to register a work created autonomously by an AI system, affirmed in Thaler v. Perlmutter on the ground that human authorship is a bedrock requirement of copyright, and its decision in Zarya of the Dawn, which registered the human-authored text and arrangement of a graphic novel while excluding its AI-generated images.
Other countries are considering these questions as well, with widely divergent decisions. For example, in the U.K., the Copyright, Designs and Patents Act allows protection for computer-generated works without a human author. In addition, in Japan, Article 30-4 of the Copyright Act allows AI to use copyrighted works as training data without permission, since this falls within "the purpose of information analysis." As long as the use is "not intended for the enjoyment of ideas or emotions expressed in a work," the work can be used without prior approval.
Courts are just starting to grapple with applying the fair use doctrine to emerging generative AI technologies. While courts have recognized transformative uses for web scraping in one context, it is unclear whether training data for LLMs will fit within the same analysis.
The recent case of Thomson Reuters v. ROSS Intelligence speaks to the complex intersection of generative AI, copyright law and fair use exceptions, where the court weighed the transformative use of AI against the market impact and did not find clear support to disallow this use.
As specific cases are decided, however, courts will continue to seek a balance between promoting human creativity for the betterment of society and embracing the capabilities offered by new technologies.
Originally published on the IAPP website on August 7, 2024