Blog

A Guide to Red Teaming GenAI, Part 1

Andrew Burt
Brenda Leong
June 20, 2024

At Luminos.Law, we are deeply involved in de-risking generative AI models for our clients—large and small and in nearly every sector. A huge percentage of that work focuses on red teaming, involving direct stress-testing of GenAI systems, as well as teaching stakeholders within our client companies how to do the same. This sort of testing is required up front, but also during ongoing governance once the systems using these models are operational.

In that process, we’ve discovered some approaches and strategies that are most successful in assessing GenAI performance and reliability, along with some of the most common mistakes as companies seek to effectively and efficiently manage the risks these systems can introduce. We’re taking to our blog to share some of our learnings! 

This is the first in a series of “lessons learned” from our experiences helping our clients manage GenAI liabilities—in particular, red teaming GenAI systems. For context, we find ourselves testing these at every level, from foundational models, to products that incorporate GenAI systems, as well as fine-tuned open-source systems, models and products from vendors, and more.

Note that we are publishing these blog posts to help organizations adopting GenAI systems more robustly manage and test their systems and to ensure their safeguards are sufficient and working as intended. No information included in this blog series directly relates to, or identifies, any Luminos.Law client or client matter. 

Getting Started: Degradation Objectives

How do you know what to test for? Setting the right degradation objectives is one of the most critical steps in red teaming—the more precise the degradation objectives, the more successful the red teaming is likely to be. We think of degradation objectives as the most concerning harms the system might cause. Red teaming is intended to assess and ultimately mitigate each of these harms. These objectives are so critical because the “attack surface” of GenAI systems is so large that not every vulnerability can be assessed—indeed, these are probabilistic systems that are guaranteed to behave in unintended ways. This means that identifying the foreseeable risks and prioritizing the most important sets the foundation for successful red teaming. 

For that reason, degradation objectives should be carefully outlined at the beginning of any red teaming, forming a prioritized list of harms that could negatively impact the business, impact its customers, or create other legal liabilities. In practice, this means that the degradation objectives should be determined based on inputs from all relevant parties: lawyers, data scientists, project managers, procurement teams, and the business units (or others) adopting and deploying each GenAI system. 

There are many possible degradation objectives. This is a list of some of the most common that we’ve found ourselves prioritizing with our Clients: 

  • Copyright and IP Issues—Model Output: GenAI systems are intended to generate new content—in text, code, audio, images, or a combination. But because of the way they are trained, it’s not uncommon for them to generate model output that potentially violates copyright, as when a model generates all or some portions of copyrighted text (an article, for example, or marketing materials or images from a competitor). It’s still unclear how much is too much when it comes to what degree of similarity is legally permissible for GenAI to reflect in its outputs—the international intellectual property laws are still developing as to whether one or two sentences are permissible, or if two highly similar images are permissible, etc. But for the operating company, testing for the model’s potential to create this content is important in minimizing the risk and associated liability once it’s deployed.
  • Copyright and IP Issues—Training Data: It’s not just the outputs that raise concerns. Because GenAI systems are trained on vast amounts of data, it is almost certain that the training data will include copyrighted or otherwise protected information. As with the outputs, it is still unclear from a legal standpoint the extent to which using this type of data for training may constitute a “fair use exception.” But even that defense may not help in potential trademark or patent violations, so companies should have a clear understanding of these issues and test their systems accordingly. If, for example, they decide that some datasets should not be included in model training, it can be important for red teams to test that this strategy was successful by testing the system to determine if prohibited data was used to train the model. (There are a variety of attacks that can be used to test for this, but we’ll leave that subject for another blog post.)
  • Privacy—Personal Identifying Information: Many data protection frameworks are contingent on the ability to connect data to specific individuals, meaning that if the model generates data that includes nonpublic information that could be used to identify specific persons, privacy laws may apply. There are two major sources of liability in this regard—regarding both inputs and outputs. First, the question of whether there is any identifying information reflected in model output. Many years ago, for example, a popular South Korean chatbot was shut down for exactly this issue after it was found exposing sensitive personal information during conversations with its users. The second type of liability arises from including personal or sensitive data in the training sets for GenAI systems without the consent of the individuals whose personal data is included or some other legal basis for its use. This is what led to the algorithmic disgorgement of one model by the FTC in 2021, forcing a company to stop using an AI model entirely because it was training on personal data without consent.
  • Privacy—Location Awareness: Other data protection violations can involve a model’s ability to determine its users’ location without their consent. Indeed, many models have explicit protections that prevent them from knowing a user’s location without that user first telling the model where they are. These limitations are frequently conveyed to the users themselves in the form of privacy policies, contracts, or direct wording on the user interface explaining that the model does not have access to user locations. Nevertheless, many models deduce where users are and undercut these safeguards—being able, for example, to provide directions to the nearest McDonald’s even when the user hasn’t divulged their location. Testing for this type of vulnerability is important in evaluating how effective the safeguards are.
  • Bias and Fairness: Inequities are so deeply embedded in our society that nearly every aspect of model development and deployment involves unfairness issues. For example, training data from historical sources is likely biased toward particular demographic groups, either by not sufficiently representing these groups (as when training data does not include sufficient information about women, for example, as is common in the healthcare space) or when training data doesn’t portray specific demographic groups appropriately (as when training data is scraped from oftentimes biased internet comment boards). There are so many ways that GenAI systems can lead to liabilities related to impermissible bias that we could write a separate blog post—or law review article, or even book—so we can only highlight here how easily these liabilities can arise. From a company’s particular standpoint, however, red teams should outline and prioritize the specific fairness issues related to their model that can serve as the most important degradation objectives to test. Among the few public examples of this type of work is a public audit we conducted with In-Q-Tel of the language model RoBerta that contained bias testing focused on language models.
  • Unfair and Deceptive Acts and Practices (UDAP): The FTC has been increasingly focused on GenAI systems, and in particular its authority is based on evaluations of the deception or unfairness that can occur when using these models. UDAP issues oftentimes arise related to the disclosures made to consumers, or around violations of stated policy related to GenAI models (such as when the fact that the GenAI system is not a human is not fully disclosed), or when companies make promises about their models’ capabilities or performance that are simply not true (something we see over and over related to marketing claims about GenAI systems). From a red teaming context, companies should examine the information they are publicly disclosing about the model—ranging from what their policies say about model use to marketing materials and language provided in the UI—and ensure that these statements are all consistent with model behavior as confirmed by targeted testing. 
  • Other Harms: In addition to the above degradation objectives, it’s worth noting that there are many other possible harms that companies should consider. These include potential reputational harms that are not necessarily connected directly to legal concerns, as well as issues related to hallucinations. Importantly, we’ve seen concerning outcomes that are not directly related to laws or regulations lead to increased attention from agencies like the FTC, which in turn increases the chance of other types of legal oversight and concrete legal penalties.

If you’re interested in learning more about understanding and identifying degradation objectives and how important they are to successful red teaming of GenAI systems, please reach out to us at contact@luminos.law or click on the “Get Started” button above. We are experienced specialists in managing GenAI liabilities and we’re happy to share our knowledge with you!