CSD Researchers Discover Vulnerability in Large Language Models
Tuesday, August 1, 2023 - by Ryan Noone
Generally, chatbots such as ChatGPT, Claude and Google Bard won't create offensive content, and hacking them requires effort and ingenuity. But researchers at Carnegie Mellon University's Computer Science Department, the CyLab Security and Privacy Institute, the Center for AI Safety in San Francisco, and the Bosch Center for AI have uncovered a new vulnerability: a simple and effective attack method that causes aligned language models to produce objectionable content at a high success rate.
In their latest study, "Universal and Transferable Adversarial Attacks on Aligned Language Models," Computer Science Department faculty members Matt Fredrikson and Zico Kolter, CSD Ph.D. student Andy Zou, and ECE alum Zifan Wang found a suffix that, when attached to a wide range of queries, significantly increases the likelihood that both open- and closed-source large language models (LLMs) will produce affirmative responses to queries they would otherwise refuse. Rather than relying on manual engineering, their approach generates these adversarial suffixes automatically through a combination of greedy and gradient-based search techniques.
"At the moment, the direct harms to people that could be brought about by prompting a chatbot to produce objectionable or toxic content may not be especially severe," said Fredrikson, an associate professor in the Computer Science Department and Software and Societal Systems Department. "The concern is that these models will play a larger role in autonomous systems that operate without human supervision."
For more information, contact:
Aaron Aupperlee | 412-268-9068 | email@example.com