AI Safety

What is this AI
Safety research?

This ongoing research makes use of the Security pillar of AIMI for Gen AI (which also has pillars for Fairness, Toxicity and Privacy). It tests a range of leading language models, both large proprietary models and easily accessible, smaller open models and is regularly updated with new AI models.

The Security pillar tests models & associated guardrails for nefarious activity across a range of harm categories. If safe, AI systems should not comply with the nefarious, harmful prompts.

Guardrails vs AI Safety Testing

Contemporary models aim to provide a layer of AI safety, meaning that the developers of the models have built a layer into the models to detect and reject nefarious activity. Deployers of AI systems also add a layer of safety controls to the deployed AI system, outside of the model, aimed at catching and blocking nefarious activity. Collectively these safety controls are known as guardrails.

However, like all technology systems, these guardrails may have weaknesses that can be exploited and manipulated. AI safety testing independently checks the deployed AI system (including the model, guardrails and any other controls placed in the inference flow) for safety risks.

The Security pillar therefore goes through multiple jailbreaking processes to pinpoint weaknesses which cause the guardrails to fail, which in turn causes the model to comply with the nefarious prompt. AIMI will automatically run through all the prompts, harm categories and jailbreaks at scale, with automatic detection of model rejection or compliance.

Results

Model
Developer
Model Fraud Hate
Speech
Illegal
Activity
Misinformation Security &
Malware
Self Harm Sexually
Explicit
Violence Model Execution
Environment
Qwen 2.5 Max Accessed via Alibaba Cloud's API
nova-micro-v1:0 Accessed via AWS' cloud API
nova-lite-v1:0 Accessed via AWS' cloud API
nova-pro-v1:0 Accessed via AWS' cloud API
OLMo 1b SFT DPO Running locally on LM Studio
Claude 3.5 Haiku 20241022 Accessed via Anthropic’s cloud API
Claude 3.5 Sonnet 20240620 Accessed via Anthropic’s cloud API
Claude 3.5 Sonnet 20241022 Accessed via Anthropic’s cloud API
Command R Accessed via Cohere’s cloud API
Command R7B 12-2024 Accessed via Cohere’s cloud API
DBRX Instruct v3 Accessed via Databricks’ cloud API
Reasoner (R1) Accessed via DeepSeek's cloud API
Jais 7b Adapted Chat Q8_0 GGUF Running locally on LM Studio
Gemini 1.5 Flash 002 Accessed via Google Cloud’s Vertex AI API
Gemini 2.0 Flash Exp Accessed via Google Cloud's Vertex AI API
Gemma 2 2b Instruct Running locally on LM Studio
Llama 3.1 8B Instruct Accessed via Hugging Face’s Dedicated Instance cloud API
Llama 3.3 70b Instruct Accessed via Hugging Face’s serverless cloud API
Phi 3.5 Mini Instruct (3.8b) Running locally on Hugging Face Text Generation Inference in Docker
Phi 4 14B Q8eetq Running locally on Hugging Face Text Generation Inference in Docker
7b Instruct v0.3 Running locally on Hugging Face Text Generation Inference in Docker
Nemotron 70b Accessed via nvidia's cloud API
GPT-4o Accessed via OpenAI’s cloud API
o1-preview Accessed via OpenAI’s cloud API
O3 Mini (Low Reasoning) Accessed via OpenAI’s cloud API
O3 Mini (Medium Reasoning - Default) Accessed via OpenAI’s cloud API
O3 Mini (High Reasoning) Accessed via OpenAI’s cloud API
Allam 2 7b Instruct Accessed via MS Azure’s cloud API
Falcon 7b instruct Accessed via Cloudflare’s cloud API

Model

Qwen 2.5 Max

Fraud
Hate
Speech
Illegal
Activity
Misinformation
Security &
Malware
Self Harm
Sexually
Explicit
Violence
Model Execution
Environment
Accessed via Alibaba Cloud's API

Model

nova-micro-v1:0

Fraud
Hate
Speech
Illegal
Activity
Misinformation
Security &
Malware
Self Harm
Sexually
Explicit
Violence
Model Execution
Environment
Accessed via AWS' cloud API

Model

nova-lite-v1:0

Fraud
Hate
Speech
Illegal
Activity
Misinformation
Security &
Malware
Self Harm
Sexually
Explicit
Violence
Model Execution
Environment
Accessed via AWS' cloud API

Model

nova-pro-v1:0

Fraud
Hate
Speech
Illegal
Activity
Misinformation
Security &
Malware
Self Harm
Sexually
Explicit
Violence
Model Execution
Environment
Accessed via AWS' cloud API

Model

OLMo 1b SFT DPO

Fraud
Hate
Speech
Illegal
Activity
Misinformation
Security &
Malware
Self Harm
Sexually
Explicit
Violence
Model Execution
Environment
Running locally on LM Studio

Model

Claude 3.5 Haiku 20241022

Fraud
Hate
Speech
Illegal
Activity
Misinformation
Security &
Malware
Self Harm
Sexually
Explicit
Violence
Model Execution
Environment
Accessed via Anthropic’s cloud API

Model

Claude 3.5 Sonnet 20240620

Fraud
Hate
Speech
Illegal
Activity
Misinformation
Security &
Malware
Self Harm
Sexually
Explicit
Violence
Model Execution
Environment
Accessed via Anthropic’s cloud API

Model

Claude 3.5 Sonnet 20241022

Fraud
Hate
Speech
Illegal
Activity
Misinformation
Security &
Malware
Self Harm
Sexually
Explicit
Violence
Model Execution
Environment
Accessed via Anthropic’s cloud API

Model

Command R

Fraud
Hate
Speech
Illegal
Activity
Misinformation
Security &
Malware
Self Harm
Sexually
Explicit
Violence
Model Execution
Environment
Accessed via Cohere’s cloud API

Model

Command R7B 12-2024

Fraud
Hate
Speech
Illegal
Activity
Misinformation
Security &
Malware
Self Harm
Sexually
Explicit
Violence
Model Execution
Environment
Accessed via Cohere’s cloud API

Model

DBRX Instruct v3

Fraud
Hate
Speech
Illegal
Activity
Misinformation
Security &
Malware
Self Harm
Sexually
Explicit
Violence
Model Execution
Environment
Accessed via Databricks’ cloud API

Model

Reasoner (R1)

Fraud
Hate
Speech
Illegal
Activity
Misinformation
Security &
Malware
Self Harm
Sexually
Explicit
Violence
Model Execution
Environment
Accessed via DeepSeek's cloud API

Model

Jais 7b Adapted Chat Q8_0 GGUF

Fraud
Hate
Speech
Illegal
Activity
Misinformation
Security &
Malware
Self Harm
Sexually
Explicit
Violence
Model Execution
Environment
Running locally on LM Studio

Model

Gemini 1.5 Flash 002

Fraud
Hate
Speech
Illegal
Activity
Misinformation
Security &
Malware
Self Harm
Sexually
Explicit
Violence
Model Execution
Environment
Accessed via Google Cloud’s Vertex AI API

Model

Gemini 2.0 Flash Exp

Fraud
Hate
Speech
Illegal
Activity
Misinformation
Security &
Malware
Self Harm
Sexually
Explicit
Violence
Model Execution
Environment
Accessed via Google Cloud's Vertex AI API

Model

Gemma 2 2b Instruct

Fraud
Hate
Speech
Illegal
Activity
Misinformation
Security &
Malware
Self Harm
Sexually
Explicit
Violence
Model Execution
Environment
Running locally on LM Studio

Model

Llama 3.1 8B Instruct

Fraud
Hate
Speech
Illegal
Activity
Misinformation
Security &
Malware
Self Harm
Sexually
Explicit
Violence
Model Execution
Environment
Accessed via Hugging Face’s Dedicated Instance cloud API

Model

Llama 3.3 70b Instruct

Fraud
Hate
Speech
Illegal
Activity
Misinformation
Security &
Malware
Self Harm
Sexually
Explicit
Violence
Model Execution
Environment
Accessed via Hugging Face’s serverless cloud API

Model

Phi 3.5 Mini Instruct (3.8b)

Fraud
Hate
Speech
Illegal
Activity
Misinformation
Security &
Malware
Self Harm
Sexually
Explicit
Violence
Model Execution
Environment
Running locally on Hugging Face Text Generation Inference in Docker

Model

Phi 4 14B Q8eetq

Fraud
Hate
Speech
Illegal
Activity
Misinformation
Security &
Malware
Self Harm
Sexually
Explicit
Violence
Model Execution
Environment
Running locally on Hugging Face Text Generation Inference in Docker

Model

7b Instruct v0.3

Fraud
Hate
Speech
Illegal
Activity
Misinformation
Security &
Malware
Self Harm
Sexually
Explicit
Violence
Model Execution
Environment
Running locally on Hugging Face Text Generation Inference in Docker

Model

Nemotron 70b

Fraud
Hate
Speech
Illegal
Activity
Misinformation
Security &
Malware
Self Harm
Sexually
Explicit
Violence
Model Execution
Environment
Accessed via nvidia's cloud API

Model

GPT-4o

Fraud
Hate
Speech
Illegal
Activity
Misinformation
Security &
Malware
Self Harm
Sexually
Explicit
Violence
Model Execution
Environment
Accessed via OpenAI’s cloud API

Model

o1-preview

Fraud
Hate
Speech
Illegal
Activity
Misinformation
Security &
Malware
Self Harm
Sexually
Explicit
Violence
Model Execution
Environment
Accessed via OpenAI’s cloud API

Model

O3 Mini (Low Reasoning)

Fraud
Hate
Speech
Illegal
Activity
Misinformation
Security &
Malware
Self Harm
Sexually
Explicit
Violence
Model Execution
Environment
Accessed via OpenAI’s cloud API

Model

O3 Mini (Medium Reasoning - Default)

Fraud
Hate
Speech
Illegal
Activity
Misinformation
Security &
Malware
Self Harm
Sexually
Explicit
Violence
Model Execution
Environment
Accessed via OpenAI’s cloud API

Model

O3 Mini (High Reasoning)

Fraud
Hate
Speech
Illegal
Activity
Misinformation
Security &
Malware
Self Harm
Sexually
Explicit
Violence
Model Execution
Environment
Accessed via OpenAI’s cloud API

Model

Allam 2 7b Instruct

Fraud
Hate
Speech
Illegal
Activity
Misinformation
Security &
Malware
Self Harm
Sexually
Explicit
Violence
Model Execution
Environment
Accessed via MS Azure’s cloud API

Model

Falcon 7b instruct

Fraud
Hate
Speech
Illegal
Activity
Misinformation
Security &
Malware
Self Harm
Sexually
Explicit
Violence
Model Execution
Environment
Accessed via Cloudflare’s cloud API
Model successfully rejects all prompts and attacks in this harm category
Model complies with one or more prompts or attacks in this harm category

If you would like to use AIMI to test your fine-tuned AI model and associated guardrails, please get in touch.