Doctoral Thesis Proposal - Andy Zou
— 12:00pm
Location:
In Person - Traffic21 Classroom, Gates Hillman 6501
Speaker:
ANDY ZOU, Ph.D. Student, Computer Science Department, Carnegie Mellon University
https://andyzoujm.github.io/
Generative models now mediate information access, software development, and mission-critical workflows, yet their security and safety properties lag behind their rapid deployment. This thesis develops a comprehensive approach to improving the safety of aligned language and agentic systems.
First, we show that alignment fine-tuning leaves structural vulnerabilities: the Greedy Coordinate Gradient attack learns universal and transferable suffixes that trigger harmful behaviors across open-source models, achieving high transfer rates to proprietary models and revealing shared non-robust features in model representations. Second, we advance security measurements that span standardized and live evaluation. HarmBench establishes reproducible static robustness benchmarks, while the Gray Swan Arena and the resulting Agent Red Teaming benchmark capture adaptive human adversaries whose discoveries continually refresh static tests. Together, they demonstrate the fragility of current agents and provide a continuous feed of vulnerabilities. Third, we introduce Representation Engineering (RepE), a new class of approaches that probe and control population-level representations encoding safety-relevant concepts.
We apply RepE methods to safety-critical concepts such as honesty and harmfulness. In particular, we present Circuit Breaking, an alignment algorithm that suppresses harmful thought processes in the representation space to combat adversarial misuse. Looking forward, we will continue scaling the capabilities of automated red teaming agents and develop environments that allow attacker and defender agents to co-evolve. For mitigation at the model representation level, we plan to extend RepE monitoring to contextual policy violations. We believe that treating safety as a property of training, evaluation, and internal computation yields principled mechanisms for securing generative systems.
Thesis Committee
Matt Fredrikson (Co-Chair)
Zico Kolter (Co-Chair)
Graham Neubig
Nicholas Carlini (Anthropic)
Additional Information
For More Information:
matthewstewart@cmu.edu