Joshua Nathaniel Williams
Understanding Representations of Humans in Generative Image Modeling Through Discrete Counterfactual Prompt Optimization
Abstract
Text-to-image (T2I) models are a common, publicly accessible class of generative model. Due to their widespread use, it is crucial to develop tools and methods that allow us to better understand how these models decide to represent their subjects, particularly human subjects. By comparing generated images across sets of carefully constructed prompts, we may uncover patterns in how these models represent various groups of people. These analyses often surface specific prompts that elicit representational asymmetries; for example, the prompt "A person with glasses." may be more likely to generate a male-presenting person than a female-presenting one.
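As an illustrative sketch of this kind of comparison (the model names and the CLIP-based attribute probe below are assumptions for illustration, not the specific tools used in this thesis), one could generate images for a pair of counterfactual prompts and tally a zero-shot estimate of a perceived attribute:

```python
# Illustrative sketch: compare perceived-gender rates across a pair of prompts.
# The T2I model, the CLIP probe, and the label phrasings are assumptions.
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
t2i = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a photo of a male-presenting person", "a photo of a female-presenting person"]

def attribute_rates(prompt: str, n: int = 32) -> torch.Tensor:
    """Generate n images for `prompt` and return the mean CLIP zero-shot
    probability assigned to each perceived-gender label."""
    probs = []
    for _ in range(n):
        image = t2i(prompt).images[0]
        inputs = proc(text=labels, images=image, return_tensors="pt", padding=True).to(device)
        with torch.no_grad():
            logits = clip(**inputs).logits_per_image  # shape: (1, len(labels))
        probs.append(logits.softmax(dim=-1))
    return torch.cat(probs).mean(dim=0)

# Counterfactual pair: identical except for the attribute under study.
print(attribute_rates("A person with glasses."))
print(attribute_rates("A person."))
```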
While many such patterns are innocuous, some harmful representational biases emerge that require an intervention by developers. Approaches that rely on predefined prompt templates or fixed identity categories are effective for benchmarking known issues, yet they may unintentionally create blind spots shaped by the researchers' own backgrounds and experiences. One person's life experiences may lead them to expect (and therefore design experiments to evaluate) specific representations by the model, while another person may expect a completely different set of representations and harms that the former would not consider. These differences in experience result in a wide range of potential blind spots in safety evaluations.
This thesis develops a variety of approaches, grounded in counterfactual and contrastive analyses, that act as general tools for surfacing new hypotheses about representational asymmetries and harms in generative modeling, addressing these blind spots and complementing existing evaluations. We first demonstrate that effective explanations for simple classifiers require incorporating knowledge of the underlying ground-truth data distribution, without which explanations and discoveries are prone to spurious insights. We posit a simple change to the implicit graphical model that underlies counterfactual explainability and propose a new metric that explicitly incorporates this distributional awareness.
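A minimal sketch of the underlying idea, assuming a toy classifier and a kernel density estimate as a stand-in for the ground-truth distribution (this is not the exact metric proposed in the thesis):

```python
# Sketch of a distribution-aware counterfactual score: a counterfactual should
# flip the classifier's decision *and* remain plausible under the data distribution.
import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

clf = LogisticRegression().fit(X, y)
density = KernelDensity(bandwidth=0.5).fit(X)  # proxy for the ground-truth distribution

def counterfactual_score(x: np.ndarray, x_cf: np.ndarray) -> float:
    """Reward label flips, penalize counterfactuals that fall off the data manifold."""
    flipped = clf.predict(x.reshape(1, -1))[0] != clf.predict(x_cf.reshape(1, -1))[0]
    log_density = density.score_samples(x_cf.reshape(1, -1))[0]
    return float(flipped) + log_density

x = X[0]
print(counterfactual_score(x, x + np.array([2.0, 0.0])))   # plausible flip
print(counterfactual_score(x, x + np.array([50.0, 0.0])))  # flip, but implausible
```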
The insights from this method then guide our approach to counterfactual explainability in the T2I setting. By reviewing a variety of discrete prompt optimization methods, we show how to define and encode this distributional awareness of captioned data in the optimization process. We support these methods by introducing an approach for multiobjective optimization across multiple language models, each with its own discrete tokenizer and text embeddings. Using the insights and methods developed throughout this thesis, we conclude by presenting an unsupervised strategy for discovering candidate prompts that encode representational asymmetries, many of which have not yet been discussed in the broader literature. Relating the learned speech and writing patterns of generative models to their outputs allows us to better understand why models represent people the way they do and improves our ability to target specific behaviors as we train and evaluate generative models.
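The sketch below illustrates one plausible form of such a multiobjective prompt search, assuming two off-the-shelf causal language models (each with its own tokenizer) and a simple scalarized objective over per-model likelihoods; it is not the optimizer developed in this thesis.

```python
# Hedged sketch of multiobjective discrete prompt search: score candidate prompts
# under several frozen language models and keep the best under an aggregate score.
# The model choices and the greedy word-substitution search are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_names = ["gpt2", "distilgpt2"]
models = {n: AutoModelForCausalLM.from_pretrained(n).eval() for n in model_names}
tokenizers = {n: AutoTokenizer.from_pretrained(n) for n in model_names}

@torch.no_grad()
def log_likelihood(name: str, text: str) -> float:
    """Average per-token log-likelihood of `text` under model `name`."""
    ids = tokenizers[name](text, return_tensors="pt").input_ids
    out = models[name](ids, labels=ids)
    return -out.loss.item()

def scores(text: str) -> list[float]:
    # One objective per model; a full optimizer would also include task-specific losses.
    return [log_likelihood(n, text) for n in model_names]

# Greedy single-word substitution over a tiny candidate vocabulary.
base = "A person with glasses"
candidates = ["scientist", "teacher", "nurse", "engineer"]
best, best_score = base, sum(scores(base))
for word in candidates:
    variant = base.replace("person", word)
    s = sum(scores(variant))  # simple scalarization of the per-model objectives
    if s > best_score:
        best, best_score = variant, s
print(best, best_score)
```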