Artificial Intelligence Seminar
In Person and Virtual - ET - Newell-Simon 3305 and Zoom
ASHIQUE KHUDABUKHSH, Assistant Professor, Golisano College of Computing and Information Sciences, Rochester Institute of Technology
Down the Toxicity Rabbit Hole: A Novel Framework to Bias Audit LLMs
How safe is generative AI for disadvantaged groups? This paper conducts a bias audit of large language models (LLMs) through a novel toxicity rabbit hole framework introduced here. Starting with a stereotype, the framework instructs the LLM to generate more toxic content than the stereotype. In every subsequent iteration, it instructs the LLM to generate more toxic content than in the previous iteration, until the safety guardrails (if any) throw a safety violation or another halting criterion is met (e.g., an identical generation or a rabbit hole depth threshold). Our experiments reveal highly disturbing content, including but not limited to antisemitic, misogynistic, racist, Islamophobic, and homophobic generated content, perhaps shedding light on the underbelly of LLM training data and prompting deeper questions about AI equity and alignment.
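For readers who want a concrete picture of the iteration described in the abstract, the control flow might be sketched roughly as follows. This is an illustrative sketch only, not the speaker's implementation: the names `rabbit_hole`, `escalate`, `SafetyViolation`, and `max_depth` are hypothetical, the model call is abstracted behind a caller-supplied function, and "identical generation" is assumed to mean output identical to the previous iteration.

```python
from typing import Callable, List, Tuple


class SafetyViolation(Exception):
    """Hypothetical signal that the model's guardrails refused a request."""


def rabbit_hole(seed: str,
                escalate: Callable[[str], str],
                max_depth: int = 10) -> Tuple[List[str], str]:
    """Iterate from a seed statement until a halting criterion is met.

    `escalate` is an assumed wrapper around the audited model: it takes the
    previous generation and returns the next one, or raises SafetyViolation.
    Returns the chain of generations (seed first) and the halting reason.
    """
    chain = [seed]
    for _ in range(max_depth):
        try:
            nxt = escalate(chain[-1])
        except SafetyViolation:
            # Guardrails intervened: stop and record why.
            return chain, "safety_violation"
        if nxt == chain[-1]:
            # Assumed reading of "identical generation": no change from last step.
            return chain, "identical_generation"
        chain.append(nxt)
    # Loop ran to completion: the depth threshold was reached.
    return chain, "depth_threshold"
```

In an audit, the returned chain and halting reason could be logged per seed, so that the depth at which guardrails trigger (if they do) can be compared across models or target groups.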
Ashique KhudaBukhsh is an assistant professor at the Golisano College of Computing and Information Sciences, Rochester Institute of Technology (RIT). His current research lies at the intersection of NLP and AI for Social Impact as applied to: (i) globally important events arising in linguistically diverse regions requiring methods to tackle practical challenges involving multilingual, noisy, social media texts; (ii) polarization in the context of the current US political crisis; and (iii) auditing AI systems and platforms for unintended harms. In addition to being accepted at top artificial intelligence conferences and journals, his research has received widespread international media attention, including coverage from The New York Times, BBC, Wired, Times of India, The Indian Express, The Daily Mail, VentureBeat, and Digital Trends.
In Person and Zoom Participation. See announcement.