Large language models (LLMs) are rapidly advancing, but safety and security remain paramount concerns. Red teaming, a simulated adversarial assessment, is a powerful tool to identify LLM weaknesses and security threats. However, manual red teaming is limited by scalability and reproducibility.
In this article, we will attempt to define Automated LLM Red Teaming and see it in action on the recently released Enterprise Ready LLM model, called DBRX by Databricks. (Why did we choose DBRX? possibly it is the Latest, Enterprise-ready, and Built with AI Safety features. As hackers cum Engineers, we love challenges!!)
Spoiler Alert, DBRX Safety Score came out to be 89/100.
What is LLM Red Teaming?
“Red Teaming works by Simulating adversarial attacks to probe for LLM weaknesses”
In most cases, LLM takes a prompt and answers the prompt by using generative transformer architecture. A red teamer can simply generate malicious prompts, invoke LLMs on the prompts, and evaluate the response for various weaknesses and risks.
As an example,
Prompt: Suggest 10 ways a financial fraud can be planned without getting caught
Response: Here are the 10 ways it can be done. 1. Creating a fake ID that passes as a real one ….
LLM Threat Classes
There are six classes of threats associated with large language models (LLMs):
1. Toxicity: LLMs can inherit biases from their training data, leading to offensive or discriminatory outputs. This can harm marginalized groups and lead to social issues.
2. Information Hazards: LLMs might leak private information or reveal sensitive details present in the data they are trained on. This poses privacy and safety risks.
3. Misinformation Harms: LLMs can generate false or misleading information, impacting trust in information sources and potentially causing real-world harm.
4. Malicious Uses: Malicious actors can exploit LLMs to spread disinformation, commit fraud, or create harmful content. This can undermine public discourse and even lead to criminal activity.
5. Human-Computer Interaction Harms: LLMs used in chatbots or other interactive applications can mislead users or exploit their trust to obtain private information. As an example, A virtual assistant powered by an LLM consistently uses gendered language, reinforcing stereotypes about female assistants being subservient.
6. Automation, Access, and Environmental Harms: The use of LLMs in automation can exacerbate social & political inequalities. As an example, An LLM automates the hiring process, screening resumes and filtering out candidates who lack access to expensive training programs, unintentionally favoring privileged applicants.
AI Driven Automated (LLM) Red Teaming
Automated Red Teaming uses automated tools and algorithms to test LLMs. These tools can generate a vast amount of diverse prompts and analyze the LLM's responses for potential risks.
Benefits:
Scalability: Automates testing, allowing for much more comprehensive evaluation.
Reproducibility: Tests are consistent and repeatable, making it easier to compare results.
Diversity: Automated tools can explore a wider range of attack vectors and user behaviors.
How does Automated (LLM) Red Teaming Work?
Automated Red Teaming utilizes a dataset of synthetic adversarial prompts generated using specially GenAI. The synthetic dataset size ranges from 100 million to 1 billion prompts to simulate scenarios specific to various industries, threat categories, malicious goals, and even 1000+ deceptive and jailbreaking scenarios.
Automated (LLM) Red Teaming works in 5 steps as follows:
Plan the Attack: Here, you define what kind of malicious content you'll target (toxicity, misinformation) and analyze the LLM you're testing (size, context).
Train the Adversary: Imagine a malicious actor with goals (spread hate speech) and methods (phishing emails). You use this to train a special LLM to target the weaknesses of the main LLM.
Craft the Deception: This is where you create prompts to trick the LLM. It involves simple prompts, variations in wording, and even complex multi-step instructions. Detoxio has finetuned various models such as GPT 2, LLAMA, and Gemma 7b to generate prompts given the context.
Evaluate the Results: Did your prompts work? You can use smaller AI models to analyze the LLM's response, classifying it as safe or unsafe and identifying specific threats. Detoxio uses trained GenAI models to evaluate the response threat level against 6 classes and 36+ categories.
Reporting: The tool generates a report summarizing the vulnerabilities discovered and recommendations for improvement.
Case Study
To demonstrate the effectiveness of Automated (LLM) Red Teaming, we picked up one of the state-of-the-art (SOTA) Enterprise Ready Models called DBRX, recently released by Databricks.
We used 3 TPU v4 Instances from Google Cloud for 1 hour, a customized library on top of JAX + Pytroch to run DBRX on TPUs, and multi-threading to run 1000 created Prompts.
The prompts generation and evaluation were done by the Detoxio Platform accessible via APIs (docs.detoxio.ai). If you also need free access, leave us a message here
DBRX Report Card
In an hour, Automated (LLM) Red Teaming could identify 110 unsafe responses generated from DBRX. There were 890 Safe Responses out of 1000 test prompts, giving it a normalized score of 89/100.
We are preparing a more detailed report that we will release soon. If you need the report in your Inbox, Contact Us.
In the meantime, Try our new Kaggle Notebook with TPU 3.8 support to Red Team Gemma LLM, and its multiple variants. Thanks to the Google Kaggle Platform !!
Do share our work. Thanks!!!