DeepSeek & Qwen in the news
It would be hard to read the technology news this week without reading about DeepSeek’s R1 model. To make an understatement… it was covered a lot. And the reaction has been incredible: not only the moves in the stock market, but also major technology organizations taking note. Meta, for example, has reportedly set up war rooms to react to DeepSeek.
But after the DeepSeek announcement, it got even more interesting. Alibaba, a huge established technology company in China and the owner of Alibaba Cloud, released the latest version of Qwen, which it claims surpasses DeepSeek in terms of performance. This model, Qwen 2.5 Max, is a large, closed AI model accessible only via Alibaba Cloud.
Safety is still a barrier to AI adoption
The market is clearly awash with very capable, sizable AI models. However, the safety of these AI models is critical. All the developments in capability and efficiency won’t matter if these models are so unsafe that they can’t be deployed into production business environments. In short, safety remains a barrier to scaling AI.
Chatterbox Labs have been testing AI since 2011
This is why, at Chatterbox Labs, we’ve been innovating and researching in the AI testing space since 2011. Over the years we have developed our suite of AI safety testing IP so that we can automate and scale the testing of AI models (all the way from traditional AI models to these latest large reasoning models).
With this in mind, in this article we will take a look at the results of testing the latest AI models (with Qwen 2.5 Max added to our public research on the safety of frontier AI models).
US vs China in AI Safety
This week’s discussion has focussed on the US vs China, with concerns around the use of DeepSeek (for example, the US Navy has banned DeepSeek for work-related or personal tasks).
The discussion has mainly been framed in terms of AI model capabilities and computational efficiency. Let’s now add AI safety to this comparison, with real world data.
From China we include DeepSeek R1 and Qwen 2.5 Max, as they are getting the most coverage right now. There are many models in the US, but we report here the leading recent ones from OpenAI (O1 Preview), Amazon (Nova Pro) & Anthropic (Haiku), although do also take note of the poor safety of Gemini 2.0 Flash, reported here.
In the charts below, red indicates lower AI safety. Each cell represents the model’s weakness to a family of jailbreak attacks (the x axis) with respect to a specific harm category (the y axis). All models are running in the provider’s own cloud inference environment.
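To make the structure of these charts concrete, here is a minimal sketch (in Python, using numpy and matplotlib) of how such a weakness grid could be rendered as a heatmap. The attack families, harm categories and failure rates below are hypothetical placeholders, not our actual test data or methodology.

```python
# Illustrative only: the attack families, harm categories and scores are
# hypothetical placeholders, not Chatterbox Labs' actual test data.
import numpy as np
import matplotlib.pyplot as plt

attack_families = ["Role play", "Encoding", "Prompt injection", "Multi-turn"]
harm_categories = ["Fraud", "Hate speech", "Self-harm", "Weapons", "Malware"]

# One row per harm category (y axis), one column per attack family (x axis).
# Each value is the fraction of attack prompts that produced harmful output.
failure_rates = np.random.default_rng(0).uniform(
    0.0, 1.0, size=(len(harm_categories), len(attack_families))
)

fig, ax = plt.subplots()
im = ax.imshow(failure_rates, cmap="Reds", vmin=0.0, vmax=1.0)  # darker red = lower safety
ax.set_xticks(range(len(attack_families)))
ax.set_xticklabels(attack_families, rotation=45, ha="right")
ax.set_yticks(range(len(harm_categories)))
ax.set_yticklabels(harm_categories)
fig.colorbar(im, label="Fraction of attacks producing harmful content")
plt.tight_layout()
plt.show()
```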
AI Safety of Chinese Models
Looking at DeepSeek R1 first, we see that, across the board, the model will produce harmful content, with very poor resiliency to any of the attacks made against it. Whilst it is a very capable model, it is lacking in safety and security.
Turning now to the latest model, Alibaba’s Qwen 2.5 Max.
This model also lacks safety and security: harmful content is produced across all harm categories. However, it differs from DeepSeek R1 in how this happens.
Rather than being weak against all the attack families, Qwen 2.5 Max is vulnerable in specific areas. This perhaps reflects Alibaba’s greater maturity as an organization, with defences likely built into the inference flow of Alibaba Cloud as well as into the model itself.
Still, even with these defences, the model produces a significant amount of harmful content across all harm categories; it simply fails in a different way.
AI Safety of US Models
For the US models we look at OpenAI O1 Preview, Amazon Nova Pro and Anthropic Haiku. For simplicity of reading, we present all three charts together here:
OpenAI O1
Amazon Nova Pro
Anthropic Haiku
The picture for these three US models is very different from that of the Chinese models. Even accounting for the defences that Qwen has in place in Alibaba Cloud, the leading models from Amazon, Anthropic and OpenAI are much more robustly defended.
Of course, there is more to do in terms of safety: these tests show there are still areas for improvement, as harmful content is still produced. But perhaps the billions of dollars invested over the years in the development of these models are paying off when you frame the view in terms of AI safety, not just pure capability metrics and efficiencies.
What does this mean for the enterprise?
Whichever AI model(s) you choose to deploy, there are critical safety considerations to take into account.
In the enterprise, when you deploy AI models, you’ll be deploying a pipeline of technologies, not just the base model. You will also have safety concerns that are specific to your organization, industry and use case, rather than just a standard definition of harm.
The first, and most critical, point is that you must safety test everything. Without testing, you are operating blind: you have no idea what the safety levels are, or whether they match what your organization requires.
You may use a base model, or you may fine-tune or distil that base model into a new model. You will then likely pair this with some kind of external guardrails (such as Azure AI Content Safety, Amazon Bedrock Guardrails, NeMo Guardrails, etc.). All of this is part of your inference flow.
But guardrails alone are insufficient. Just like the AI model, they also require testing so that you are not operating blind. With the results of this safety testing you can iterate on your guardrails and/or the underlying model to create an AI inference pipeline that matches the safety requirements and nuances of your organization.
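As a concrete illustration, here is a minimal sketch of what testing the full inference flow (guardrails plus model) could look like. The functions `input_guardrail`, `call_model`, `output_guardrail` and `is_harmful`, along with the attack prompts, are hypothetical placeholders for your own guardrail service, model endpoint and harm evaluator; this is not our testing methodology. The point is that the unit under test is the whole pipeline, and the resulting failure rates are what you feed back into iterating on the guardrails and/or the model.

```python
# A minimal sketch of testing an inference pipeline (guardrails + model).
# All functions below are hypothetical stand-ins for your own components.
from collections import defaultdict

REFUSAL = "I can't help with that."

def input_guardrail(prompt: str) -> bool:
    """Return True if the prompt should be blocked before reaching the model (placeholder)."""
    return "ignore previous instructions" in prompt.lower()

def call_model(prompt: str) -> str:
    """Placeholder for your base or fine-tuned model endpoint."""
    return f"[model response to: {prompt}]"

def output_guardrail(response: str) -> bool:
    """Return True if the response should be blocked before reaching the user (placeholder)."""
    return "harmful" in response.lower()

def pipeline(prompt: str) -> str:
    """The unit under test: the full inference flow, not just the base model."""
    if input_guardrail(prompt):
        return REFUSAL
    response = call_model(prompt)
    return REFUSAL if output_guardrail(response) else response

def is_harmful(response: str) -> bool:
    """Placeholder harm judge; in practice an automated evaluator or review step."""
    return response != REFUSAL

# Attack prompts grouped by (harm category, attack family); illustrative placeholders only.
test_suite = {
    ("fraud", "role play"): ["<attack prompt 1>", "<attack prompt 2>"],
    ("malware", "encoding"): ["<attack prompt 3>"],
}

failure_rates = defaultdict(float)
for (category, family), prompts in test_suite.items():
    failures = sum(is_harmful(pipeline(p)) for p in prompts)
    failure_rates[(category, family)] = failures / len(prompts)

for key, rate in failure_rates.items():
    # These per-cell failure rates are what you iterate on: tighten guardrails or
    # adjust the model, then re-run the suite against the updated pipeline.
    print(key, f"{rate:.0%}")
```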
Then AI can truly scale.