This ongoing research makes use of the Security pillar of AIMI for Gen AI (which also has pillars for Fairness, Toxicity and Privacy). It tests a range of leading language models, both large proprietary models and easily accessible, smaller open models and is regularly updated with new AI models.
The Security pillar tests models & associated guardrails for nefarious activity across a range of harm categories. If safe, AI systems should not comply with the nefarious, harmful prompts.
Contemporary models aim to provide a layer of AI safety, meaning that the developers of the models have built a layer into the models to detect and reject nefarious activity. Deployers of AI systems also add a layer of safety controls to the deployed AI system, outside of the model, aimed at catching and blocking nefarious activity. Collectively these safety controls are known as guardrails.
However, like all technology systems, these guardrails may have weaknesses that can be exploited and manipulated. AI safety testing independently checks the deployed AI system (including the model, guardrails and any other controls placed in the inference flow) for safety risks.
The Security pillar therefore goes through multiple jailbreaking processes to pinpoint weaknesses which cause the guardrails to fail, which in turn causes the model to comply with the nefarious prompt. AIMI will automatically run through all the prompts, harm categories and jailbreaks at scale, with automatic detection of model rejection or compliance.
Model Developer |
Model | Fraud | Hate Speech |
Illegal Activity |
Misinformation | Security & Malware |
Self Harm | Sexually Explicit |
Violence | Model Execution Environment |
---|---|---|---|---|---|---|---|---|---|---|
![]() |
Qwen 2.5 Max |
|
|
|
|
|
|
|
|
Accessed via Alibaba Cloud's API |
|
nova-micro-v1:0 |
|
|
|
|
|
|
|
|
Accessed via AWS' cloud API |
|
nova-lite-v1:0 |
|
|
|
|
|
|
|
|
Accessed via AWS' cloud API |
|
nova-pro-v1:0 |
|
|
|
|
|
|
|
|
Accessed via AWS' cloud API |
|
OLMo 1b SFT DPO |
|
|
|
|
|
|
|
|
Running locally on LM Studio |
|
Claude 3.5 Haiku 20241022 |
|
|
|
|
|
|
|
|
Accessed via Anthropic’s cloud API |
|
Claude 3.5 Sonnet 20240620 |
|
|
|
|
|
|
|
|
Accessed via Anthropic’s cloud API |
|
Claude 3.5 Sonnet 20241022 |
|
|
|
|
|
|
|
|
Accessed via Anthropic’s cloud API |
|
Command R |
|
|
|
|
|
|
|
|
Accessed via Cohere’s cloud API |
|
Command R7B 12-2024 |
|
|
|
|
|
|
|
|
Accessed via Cohere’s cloud API |
|
DBRX Instruct v3 |
|
|
|
|
|
|
|
|
Accessed via Databricks’ cloud API |
![]() |
Reasoner (R1) |
|
|
|
|
|
|
|
|
Accessed via DeepSeek's cloud API |
|
Jais 7b Adapted Chat Q8_0 GGUF |
|
|
|
|
|
|
|
|
Running locally on LM Studio |
|
Gemini 1.5 Flash 002 |
|
|
|
|
|
|
|
|
Accessed via Google Cloud’s Vertex AI API |
|
Gemini 2.0 Flash Exp |
|
|
|
|
|
|
|
|
Accessed via Google Cloud's Vertex AI API |
|
Gemma 2 2b Instruct |
|
|
|
|
|
|
|
|
Running locally on LM Studio |
|
Llama 3.1 8B Instruct |
|
|
|
|
|
|
|
|
Accessed via Hugging Face’s Dedicated Instance cloud API |
|
Llama 3.3 70b Instruct |
|
|
|
|
|
|
|
|
Accessed via Hugging Face’s serverless cloud API |
|
Phi 3.5 Mini Instruct (3.8b) |
|
|
|
|
|
|
|
|
Running locally on Hugging Face Text Generation Inference in Docker |
|
Phi 4 14B Q8eetq |
|
|
|
|
|
|
|
|
Running locally on Hugging Face Text Generation Inference in Docker |
|
7b Instruct v0.3 |
|
|
|
|
|
|
|
|
Running locally on Hugging Face Text Generation Inference in Docker |
|
Nemotron 70b |
|
|
|
|
|
|
|
|
Accessed via nvidia's cloud API |
|
GPT-4o |
|
|
|
|
|
|
|
|
Accessed via OpenAI’s cloud API |
|
o1-preview |
|
|
|
|
|
|
|
|
Accessed via OpenAI’s cloud API |
|
O3 Mini (Low Reasoning) |
|
|
|
|
|
|
|
|
Accessed via OpenAI’s cloud API |
|
O3 Mini (Medium Reasoning - Default) |
|
|
|
|
|
|
|
|
Accessed via OpenAI’s cloud API |
|
O3 Mini (High Reasoning) |
|
|
|
|
|
|
|
|
Accessed via OpenAI’s cloud API |
|
Allam 2 7b Instruct |
|
|
|
|
|
|
|
|
Accessed via MS Azure’s cloud API |
|
Falcon 7b instruct |
|
|
|
|
|
|
|
|
Accessed via Cloudflare’s cloud API |
![]() |
Model Qwen 2.5 Max |
Fraud |
|
Hate Speech |
|
Illegal Activity |
|
Misinformation |
|
Security & Malware |
|
Self Harm |
|
Sexually Explicit |
|
Violence |
|
Model Execution Environment |
Accessed via Alibaba Cloud's API |
|
Model nova-micro-v1:0 |
Fraud |
|
Hate Speech |
|
Illegal Activity |
|
Misinformation |
|
Security & Malware |
|
Self Harm |
|
Sexually Explicit |
|
Violence |
|
Model Execution Environment |
Accessed via AWS' cloud API |
|
Model nova-lite-v1:0 |
Fraud |
|
Hate Speech |
|
Illegal Activity |
|
Misinformation |
|
Security & Malware |
|
Self Harm |
|
Sexually Explicit |
|
Violence |
|
Model Execution Environment |
Accessed via AWS' cloud API |
|
Model nova-pro-v1:0 |
Fraud |
|
Hate Speech |
|
Illegal Activity |
|
Misinformation |
|
Security & Malware |
|
Self Harm |
|
Sexually Explicit |
|
Violence |
|
Model Execution Environment |
Accessed via AWS' cloud API |
|
Model OLMo 1b SFT DPO |
Fraud |
|
Hate Speech |
|
Illegal Activity |
|
Misinformation |
|
Security & Malware |
|
Self Harm |
|
Sexually Explicit |
|
Violence |
|
Model Execution Environment |
Running locally on LM Studio |
|
Model Claude 3.5 Haiku 20241022 |
Fraud |
|
Hate Speech |
|
Illegal Activity |
|
Misinformation |
|
Security & Malware |
|
Self Harm |
|
Sexually Explicit |
|
Violence |
|
Model Execution Environment |
Accessed via Anthropic’s cloud API |
|
Model Claude 3.5 Sonnet 20240620 |
Fraud |
|
Hate Speech |
|
Illegal Activity |
|
Misinformation |
|
Security & Malware |
|
Self Harm |
|
Sexually Explicit |
|
Violence |
|
Model Execution Environment |
Accessed via Anthropic’s cloud API |
|
Model Claude 3.5 Sonnet 20241022 |
Fraud |
|
Hate Speech |
|
Illegal Activity |
|
Misinformation |
|
Security & Malware |
|
Self Harm |
|
Sexually Explicit |
|
Violence |
|
Model Execution Environment |
Accessed via Anthropic’s cloud API |
|
Model Command R |
Fraud |
|
Hate Speech |
|
Illegal Activity |
|
Misinformation |
|
Security & Malware |
|
Self Harm |
|
Sexually Explicit |
|
Violence |
|
Model Execution Environment |
Accessed via Cohere’s cloud API |
|
Model Command R7B 12-2024 |
Fraud |
|
Hate Speech |
|
Illegal Activity |
|
Misinformation |
|
Security & Malware |
|
Self Harm |
|
Sexually Explicit |
|
Violence |
|
Model Execution Environment |
Accessed via Cohere’s cloud API |
|
Model DBRX Instruct v3 |
Fraud |
|
Hate Speech |
|
Illegal Activity |
|
Misinformation |
|
Security & Malware |
|
Self Harm |
|
Sexually Explicit |
|
Violence |
|
Model Execution Environment |
Accessed via Databricks’ cloud API |
![]() |
Model Reasoner (R1) |
Fraud |
|
Hate Speech |
|
Illegal Activity |
|
Misinformation |
|
Security & Malware |
|
Self Harm |
|
Sexually Explicit |
|
Violence |
|
Model Execution Environment |
Accessed via DeepSeek's cloud API |
|
Model Jais 7b Adapted Chat Q8_0 GGUF |
Fraud |
|
Hate Speech |
|
Illegal Activity |
|
Misinformation |
|
Security & Malware |
|
Self Harm |
|
Sexually Explicit |
|
Violence |
|
Model Execution Environment |
Running locally on LM Studio |
|
Model Gemini 1.5 Flash 002 |
Fraud |
|
Hate Speech |
|
Illegal Activity |
|
Misinformation |
|
Security & Malware |
|
Self Harm |
|
Sexually Explicit |
|
Violence |
|
Model Execution Environment |
Accessed via Google Cloud’s Vertex AI API |
|
Model Gemini 2.0 Flash Exp |
Fraud |
|
Hate Speech |
|
Illegal Activity |
|
Misinformation |
|
Security & Malware |
|
Self Harm |
|
Sexually Explicit |
|
Violence |
|
Model Execution Environment |
Accessed via Google Cloud's Vertex AI API |
|
Model Gemma 2 2b Instruct |
Fraud |
|
Hate Speech |
|
Illegal Activity |
|
Misinformation |
|
Security & Malware |
|
Self Harm |
|
Sexually Explicit |
|
Violence |
|
Model Execution Environment |
Running locally on LM Studio |
|
Model Llama 3.1 8B Instruct |
Fraud |
|
Hate Speech |
|
Illegal Activity |
|
Misinformation |
|
Security & Malware |
|
Self Harm |
|
Sexually Explicit |
|
Violence |
|
Model Execution Environment |
Accessed via Hugging Face’s Dedicated Instance cloud API |
|
Model Llama 3.3 70b Instruct |
Fraud |
|
Hate Speech |
|
Illegal Activity |
|
Misinformation |
|
Security & Malware |
|
Self Harm |
|
Sexually Explicit |
|
Violence |
|
Model Execution Environment |
Accessed via Hugging Face’s serverless cloud API |
|
Model Phi 3.5 Mini Instruct (3.8b) |
Fraud |
|
Hate Speech |
|
Illegal Activity |
|
Misinformation |
|
Security & Malware |
|
Self Harm |
|
Sexually Explicit |
|
Violence |
|
Model Execution Environment |
Running locally on Hugging Face Text Generation Inference in Docker |
|
Model Phi 4 14B Q8eetq |
Fraud |
|
Hate Speech |
|
Illegal Activity |
|
Misinformation |
|
Security & Malware |
|
Self Harm |
|
Sexually Explicit |
|
Violence |
|
Model Execution Environment |
Running locally on Hugging Face Text Generation Inference in Docker |
|
Model 7b Instruct v0.3 |
Fraud |
|
Hate Speech |
|
Illegal Activity |
|
Misinformation |
|
Security & Malware |
|
Self Harm |
|
Sexually Explicit |
|
Violence |
|
Model Execution Environment |
Running locally on Hugging Face Text Generation Inference in Docker |
|
Model Nemotron 70b |
Fraud |
|
Hate Speech |
|
Illegal Activity |
|
Misinformation |
|
Security & Malware |
|
Self Harm |
|
Sexually Explicit |
|
Violence |
|
Model Execution Environment |
Accessed via nvidia's cloud API |
|
Model GPT-4o |
Fraud |
|
Hate Speech |
|
Illegal Activity |
|
Misinformation |
|
Security & Malware |
|
Self Harm |
|
Sexually Explicit |
|
Violence |
|
Model Execution Environment |
Accessed via OpenAI’s cloud API |
|
Model o1-preview |
Fraud |
|
Hate Speech |
|
Illegal Activity |
|
Misinformation |
|
Security & Malware |
|
Self Harm |
|
Sexually Explicit |
|
Violence |
|
Model Execution Environment |
Accessed via OpenAI’s cloud API |
|
Model O3 Mini (Low Reasoning) |
Fraud |
|
Hate Speech |
|
Illegal Activity |
|
Misinformation |
|
Security & Malware |
|
Self Harm |
|
Sexually Explicit |
|
Violence |
|
Model Execution Environment |
Accessed via OpenAI’s cloud API |
|
Model O3 Mini (Medium Reasoning - Default) |
Fraud |
|
Hate Speech |
|
Illegal Activity |
|
Misinformation |
|
Security & Malware |
|
Self Harm |
|
Sexually Explicit |
|
Violence |
|
Model Execution Environment |
Accessed via OpenAI’s cloud API |
|
Model O3 Mini (High Reasoning) |
Fraud |
|
Hate Speech |
|
Illegal Activity |
|
Misinformation |
|
Security & Malware |
|
Self Harm |
|
Sexually Explicit |
|
Violence |
|
Model Execution Environment |
Accessed via OpenAI’s cloud API |
|
Model Allam 2 7b Instruct |
Fraud |
|
Hate Speech |
|
Illegal Activity |
|
Misinformation |
|
Security & Malware |
|
Self Harm |
|
Sexually Explicit |
|
Violence |
|
Model Execution Environment |
Accessed via MS Azure’s cloud API |
|
Model Falcon 7b instruct |
Fraud |
|
Hate Speech |
|
Illegal Activity |
|
Misinformation |
|
Security & Malware |
|
Self Harm |
|
Sexually Explicit |
|
Violence |
|
Model Execution Environment |
Accessed via Cloudflare’s cloud API |
If you would like to use AIMI to test your fine-tuned AI model and associated guardrails, please get in touch.