Computer Science 5th Year Master's Thesis Presentation

— 12:00pm

Location:
In Person - Wean Hall 5328

Speaker:
SUHAS KOTHA, Master's Student, Computer Science Department, Carnegie Mellon University
https://www.andrew.cmu.edu/user/suhask/


Fine-tuning Does Not Remove Language Model Capabilities

Fine-tuned language models catastrophically forget tasks outside the fine-tuning distribution. On the flip side, fine-tuning is often used to remove unsafe behaviors such as toxic content generation. Both the failure mode and the intended success depend on fine-tuning actually removing a capability from the model. We show that fine-tuning does not remove such capabilities, which is encouraging for reducing forgetting but concerning for defending against jailbreaks.

Through synthetic experiments, we hypothesize that language models implicitly infer the task posed by a prompt and that fine-tuning skews this inference towards tasks from the fine-tuning distribution. To test this, we propose Conjugate Prompting, which transforms a prompt so the task looks farther from the fine-tuning distribution while still requiring the same capability. We find that this recovers in-context learning abilities lost via instruction tuning and natural language reasoning capabilities lost during code fine-tuning. More concerningly, conjugate prompting can recover harmful content generation suppressed by safety fine-tuning in chatbots like ChatGPT. Can algorithms like fine-tuning and input defenses reliably remove unwanted behavior? We find that the best fine-tuning and input defenses cannot enforce one of the simplest, perfectly defined behaviors: never output the word "purple".
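To make the idea concrete, the sketch below shows one way a conjugate-prompting evaluation could be wired up, using translation into another language as the capability-preserving transform that pushes a prompt away from the fine-tuning distribution. This is a minimal illustration assuming access to a machine-translation helper and a model-query function; translate, query_model, and conjugate_prompt are illustrative names, not part of the speaker's released code.

def translate(text: str, source: str, target: str) -> str:
    """Illustrative placeholder: translate text with any machine translation system."""
    raise NotImplementedError

def query_model(prompt: str) -> str:
    """Illustrative placeholder: query the fine-tuned language model."""
    raise NotImplementedError

def conjugate_prompt(prompt: str, pivot_language: str = "fr") -> str:
    """Evaluate a prompt under a capability-preserving transform (here, translation)."""
    # Move the prompt away from the fine-tuning distribution while keeping
    # the underlying task the same.
    transformed = translate(prompt, source="en", target=pivot_language)
    # Query the fine-tuned model on the transformed prompt.
    answer = query_model(transformed)
    # Invert the transform so the output can be compared against the model's
    # behavior on the original, untransformed prompt.
    return translate(answer, source=pivot_language, target="en")

If the fine-tuned model answers the transformed prompt with a capability it refuses or "forgets" on the original prompt, that is evidence the capability was never removed, only masked by the skewed task inference.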

Both forgetting and jailbreaking demonstrate that fine-tuning currently does not fully remove or change model capabilities. We propose future directions for improving capabilities by investigating length generalization, and for reliably removing capabilities via machine unlearning.

[1]  Understanding Catastrophic Forgetting in Language Models via Implicit Inference
[2]  Jailbreaking is Best Solved by Definition

Thesis Committee

Aditi Raghunathan (Chair)
Daphne Ippolito
