Generative AI and intellectual property: The evolving copyright landscape

Brenda Leong
Ekene Chuks-Okeke
Natalie Linero
July 31, 2024

This is the first in a two-part series exploring intellectual property laws, their issues and how they are impacted by the development and applications of generative AI systems.

The Intellectual Property clause in the U.S. Constitution empowers Congress to advance innovation by protecting authors' writings and inventors' discoveries. Emerging generative AI capabilities present a unique challenge to this long-standing legal framework.

While technology expands scientific progress, it impacts authors and artists as a potentially competitive, nonhuman provider of original content, sparking global debates around the nature of creativity and the role of humans in art, along with the laws that protect them.

International IP law protects the creations of the human mind, such as "inventions; literary and artistic works; designs; and symbols, names, and images used in commerce." Copyright, patent, trademark and trade secret categories represent long-standing and globally respected IP rights, and others are recognized as well.

IP contributes to business value as an intangible asset and can be harnessed strategically to a business's competitive advantage. IP rights protect business assets, but also provide protection for individual creators not associated with a business enterprise.

Artificial intelligence systems are specific types of computer programs commonly associated with tasks that traditionally require human intelligence to accomplish. They have been developed for over seven decades, but exploded into the mainstream after the public launch of ChatGPT and other generative AI tools.

Generative AI is a new and revolutionary subset of AI and machine learning technology that generates original outputs, but also raises a variety of broad legal, ethical and privacy concerns. It has, among other things, created havoc in the legal and commercial understanding of IP and the associated rights of creators.

Fundamentals of copyright

Copyright protects "original works of authorship fixed in any tangible medium of expression," including literary, dramatic, musical and artistic works. It does not protect facts, ideas, systems or methods of operation but may protect the form of their expression.

Copyright is not a single right but a bundle of exclusive, time-limited rights granted to authors to control the reproduction, adaptation, distribution, performance and display of their work. Carrying out any of these actions without permission may constitute copyright infringement unless a legal exception applies.

Copyright protects the works of people who are not professional or even intentional creators. In addition to traditional media, this protection has long been established to include digital content such as blog posts, journal entries and photo images.

Fixing a work means making it available in a tangible medium that can be perceived. Writing on a napkin, printing or posting a photo, writing blogs, reporting news and all other digitally accessible versions of traditional media are included under copyright law. Notably, voices, live performances and the characteristics of particular sounds cannot be copyrighted, although the specific recording of a speech or performance is covered because it has been "fixed" in the recording.

Since original digital content is protected by copyright, let us consider if using such protected content to train AI is lawful or if it infringes on copyright protections — and on what basis.

IP rights for content used as training data

Training data is the foundation of any AI model's performance, accuracy and reliability.

Developers of these systems publicly disclose the broad categories of data sources they use, which include massive public datasets as well as data collected via targeted web scraping, a technique for quickly and accurately extracting relevant data from a website and exporting it in a structured format.

While it would seem simple to demand developers clearly identify these sources, detailed disclosure of training data may be limited at least partially due to the model developers' own IP rights, specifically around trade-secret protection for their dataset.

Nevertheless, both copyright claims and pending legislation may require developers to make further disclosures. For example, the EU AI Act will require regulated AI providers to document a summary of any copyrighted information from their training data.

Without such further transparency disclosures, artists have limited ability to know when and where their work may have been collected. Resources like haveibeentrained.com help artists search databases, identify if their work has been used and enable them to flag their works for removal.

Evidence from this site has been accepted in cases like Silverman et al. v. OpenAI, a lawsuit filed by several authors for vicarious copyright infringement. The court ruled the summaries produced by ChatGPT are not considered copyright infringement because the output is not substantially similar to the plaintiffs' material. The output is a mix of "expressive material derived from many sources," making it protected under copyright law.

Web scraping for data training

From the copyright bundle, the right of reproduction is the aspect at issue around the use of protected materials for training data in AI models. The protections around reproduction are what control the making of copies of the original work, including photocopying, scanning, and uploading or downloading content.

Intuitively, collecting data via web scraping seems to presuppose that some form of reproduction has occurred. However, technical experts on training data and large language models challenge the assumption that this is a reproduction in the traditional sense. While humans see the original expression in a copyrighted work, AI and machine learning systems process the work as raw material for computing, such as vectors, tokens and data points, not as unique expression embodied in text or image.
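To make the "raw material" point concrete, the following is a minimal Python sketch of how a sentence can be reduced to tokens and integer IDs before a model sees it. The word-level splitting and the helper names (`tokenize`, `build_vocab`, `encode`) are illustrative assumptions, not the method of any particular AI system; production models typically use subword tokenizers and learned vector embeddings.

```python
import re

def tokenize(text):
    """Split text into lowercase word and punctuation tokens (a toy scheme)."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def build_vocab(tokens):
    """Assign each distinct token an integer ID in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

def encode(text, vocab):
    """Map text to the list of integer IDs that a model would process."""
    return [vocab[tok] for tok in tokenize(text)]

# A public-domain sentence used purely as sample input.
sentence = "It was the best of times, it was the worst of times."
vocab = build_vocab(tokenize(sentence))
ids = encode(sentence, vocab)
print(ids)  # [0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 7, 4, 5, 8]
```

The printed list contains no words at all, only numbers, yet the original sentence can be recovered from it by reversing the vocabulary lookup. That is one way to see why the question of whether such a representation is a "copy" is not settled by the format alone.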

AI systems use the data extracted from copyrighted works in a machine-readable format, and the question of whether this constitutes a copy of the original remains to be resolved.

Further, AI models do not, by design, retain training data in the traditional sense. Information is not transferred or copied into the model. However, research shows LLMs may effectively memorize an extensive amount of training data in the context of useful output for certain queries. Unfortunately, there is no way to ascertain whether this has occurred without its discovery via response experimentation.
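As a rough illustration of what "response experimentation" can involve, the sketch below measures how many of the word sequences in a model's response also appear verbatim in a reference text. The function names, the five-word window and the two sample responses are assumptions made for this example; they are hard-coded strings, not output from any real system, and actual memorization research uses more elaborate methods.

```python
import re

def word_ngrams(text, n):
    """Return the set of n-word sequences in text, ignoring case and punctuation."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def verbatim_overlap(reference, response, n=5):
    """Fraction of the response's n-word sequences that also occur in the reference."""
    response_grams = word_ngrams(response, n)
    if not response_grams:
        return 0.0
    shared = response_grams & word_ngrams(reference, n)
    return len(shared) / len(response_grams)

# Public-domain reference text (opening of Moby-Dick), used as sample data.
reference = ("Call me Ishmael. Some years ago, never mind how long precisely, "
             "having little or no money in my purse")

# Hypothetical model responses, hard-coded for illustration only.
regurgitated = "Call me Ishmael. Some years ago, never mind how long"
paraphrased = "The narrator introduces himself and explains that he went to sea"

print(verbatim_overlap(reference, regurgitated))  # 1.0
print(verbatim_overlap(reference, paraphrased))   # 0.0
```

A high score suggests the response reproduces the reference word for word, while a low score is consistent with paraphrase or unrelated text. A check like this only detects memorization for the prompts someone happens to try, which is the limitation the paragraph above describes.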

Jurisdictions like Japan and the EU already provide copyright exceptions that allow the reproduction and extraction of works for text and data mining, or web scraping, purposes.

In the EU, certain organizations can carry out text and data mining of works that they can lawfully access for scientific research. For nonscientific research purposes, text and data mining is permitted unless rights holders expressly reserve their rights. This requires respecting opt-outs from rights holders, which can be challenging given the lack of transparency around training data.
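One common machine-readable signal a site can use to tell crawlers to stay away is a robots.txt file. The sketch below uses Python's standard `urllib.robotparser` module to check such a file before collecting a page. Treating robots.txt as the opt-out mechanism is an assumption for illustration, since the discussion above does not name one and its legal sufficiency is not addressed here. The crawler names `ExampleAIBot` and `ExampleSearchBot` and the URL are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one AI crawler, allows everyone else.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""

def may_collect(robots_txt, user_agent, url):
    """Return True if the robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

url = "https://example.com/essays/my-essay.html"
print(may_collect(ROBOTS_TXT, "ExampleAIBot", url))      # False
print(may_collect(ROBOTS_TXT, "ExampleSearchBot", url))  # True
```

A check like this only works when the crawler identifies itself honestly and the rights holder knows which crawler names to list, which ties back to the transparency problem noted above.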

In the U.S., a district court held in 2006 that Google did not infringe on the plaintiff's copyright by indexing and caching his online story with a web crawler, but whether this will hold for the broader use cases that generative AI requires remains to be seen.

Potential exceptions to copyright protections: The fair use defense

If collection via web scraping, or other manner of obtaining training data, is determined to constitute a copy of the original content, there may still be limits to the copyright protections for the original work.

Copyright law has evolved to acknowledge societal benefits, such as promoting speech, education, creation of new works and cultural expression. In so doing, it has created exceptions for use, such as by libraries, and exceptions under the fair use defense in the U.S.

In the U.S., the fair use defense to claims of copyright violation protects defendants from liability for the unauthorized use of copyrighted works when that use meets certain specified criteria. A fair use allowance for protected content cannot be assumed beforehand. Courts evaluate four factors collectively and grant the fair use exception on a case-by-case basis.

Factor 1: Purpose and character of the use. Courts consider whether the particular use is transformative, noncommercial, educational or necessary for criticism/commentary. Transformative uses give new expression or meaning to the original work. Nonprofit and educational uses are often favorably considered, while commercial uses typically require permission from the copyright owner but are not automatically infringing. For generative AI, training for a proprietary model with licensed services is commercial use. An open-source model that does not charge for access may also be considered commercial, given the economic benefits that accrue.

Factor 2: Nature of the copyrighted work. This factor distinguishes between factual and highly creative or imaginative works. Factual works are more likely to allow for fair use exceptions than highly creative or unpublished works. Thus, including creative works like art and literature in training datasets is likely less favorable for fair use than scraping corporate data. Facts are not protected by copyright and may be reproduced. Still, facts compiled into databases are protected as "compilations" under U.S. copyright law if the process of selecting and compiling the information involves a sufficient level of creativity or originality. Databases enjoy more robust protection under European copyright law and the World Intellectual Property Organization Copyright Treaty.

Factor 3: Amount and substantiality of the portion used. This factor assesses how much of the original work was used in relation to the purpose of the use. Using a relatively small portion of the original, or using only what is necessary for achieving the transformative purpose, tends to support a fair use interpretation. If a substantial part of the work is used or if the part used is deemed central to the original work, this would weigh against fair use. LLMs are trained on immense troves of data and therefore any one element of training data would be relatively insubstantial to the whole. But this factor also considers how much of the original content is included, so if the entirety of the original work is present in training data, then even as a minute part of the overall dataset, at the individual level, it would be a large or central percentage. Unless protected data is revealed as an output, it would be difficult for individual creators to establish how much of their work was used for training or the significance of their own work to the training dataset.

Factor 4: Effect on market potential. The final factor evaluates the impact of the new content on the market for the original work. If the new use seems to directly compete with or diminish the market for the original work, it is less likely to qualify for a fair use exception. For generative AI, artists and authors could reasonably argue that using their works to train systems affects the market for their original creations, but this argument likely has limits, as the systems are more likely to compete at the general market level than with the specific work at issue.

Conclusion

Companies involved in developing and training generative AI technologies face increasing IP challenges surrounding potential copyright infringement.

However, the specifics of how copyright protections will be weighed and applied in this new context remain uncertain. Both the technical aspects, such as the ways in which AI models access, interpret and retain protected data, and the way the new uses rank under existing copyright exceptions and tests will need careful legal and policy assessment.

Legal copyright protections exist to support the overall social value of human creators adding to our stores of art and knowledge. Whether those legal frameworks are the best way to protect that value from the impacts of new technologies remains to be determined.

Originally published on the IAPP website on July 31, 2024