Google
Aug 23, 2022We describe our early efforts to red team language models in order to simultaneously discover, measure, and attempt to reduce their potentially harmful outputs.
Red teaming represents one useful tool to identify and mitigate the potential harms of AI systems through a form of automated or manual adversarial testing.
It is found that the RLHF models are increasingly difficult to red team as they scale, and a flat trend with scale for the other model types is found.
This paper from Anthropic is well written, simple to read paper, that provides a detailed overview of AI red teaming.
Aug 23, 2022We describe our early efforts to red team language models in order to simultaneously discover, measure, and attempt to reduce their�...
We explore several methods, from zero-shot generation to reinforcement learning, for generating test cases with varying levels of diversity and difficulty.
Introduced by Ganguli et al. in Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
This work automatically finds cases where a target LM behaves in a harmful way, by generating test cases (“red teaming”) using another LM.
Aug 23, 2022Red teaming language models to reduce harms: methods, scaling behaviors, and lessons learned. The full paper is available here.
We describe our early efforts to red team language models in order to simultaneously discover, measure, and attempt to reduce their potentially harmful outputs.