What makes machine learning and AI different from traditional rules-based systems?  Among other things, a key point is that AI learns from a training dataset rather than being explicitly coded with rules to follow.

There is a huge focus on Responsible AI (aka Ethical or Trustworthy AI) – and rightly so.  However, much of this discussion focusses simply on the AI model itself.  This isolated view ignores a huge chunk of the AI development process – in particular, the data that is used to train the AI model.

In fact, a significant portion of the ethical issues occur with the data, even before a model has been trained.  Is the data biased?  Are there privacy issues?  Is the data complete and representative?  An AI model trained on biased or poor data will very likely end up being biased or performing poorly.  Social issues should be addressed too – do we have the right to use this data?  Should we be using AI for this task at all?
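To make the bias question concrete, one simple quantitative check that can be run on training data before any model exists is a selection-rate comparison between groups (a rough "four-fifths rule" style test).  This is a minimal illustrative sketch using synthetic data and hypothetical group labels, not a description of how AIMI itself works:

```python
# Minimal sketch: compare positive-label rates between groups in a
# training dataset. Data and the 0.8 threshold are illustrative.
from collections import defaultdict

def selection_rates(records):
    """Return the positive-label rate for each group."""
    totals, positives = defaultdict(int), defaultdict(int)
    for group, label in records:
        totals[group] += 1
        positives[group] += label
    return {g: positives[g] / totals[g] for g in totals}

def disparate_impact(records, privileged, unprivileged):
    """Ratio of unprivileged to privileged selection rates.
    Values below ~0.8 are a common rough warning threshold."""
    rates = selection_rates(records)
    return rates[unprivileged] / rates[privileged]

# Hypothetical legacy decisions: (group, approved?)
data = [("A", 1)] * 80 + [("A", 0)] * 20 + [("B", 1)] * 50 + [("B", 0)] * 50

ratio = disparate_impact(data, privileged="A", unprivileged="B")
print(round(ratio, 3))  # 0.5 / 0.8 = 0.625, well below the 0.8 guideline
```

A check like this says nothing on its own about why the disparity exists, but it flags datasets that deserve scrutiny before training begins.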

Looking at the data is also important because, whilst some organizations may have tens or hundreds of models already in production, others are still early on their AI journey.

Therefore, at Chatterbox Labs, we apply our patented AIMI platform, not just to the AI model, but critically to the data as well.  It’s very often used even before a model has been built.

The analysis typically focusses on two core aspects:

Fairness (aka bias)

The training data likely comes from a legacy manual process that is to be automated with AI.  Many questions can be answered in this area.  Is the data biased already? If your teams applied debiasing techniques, did they work?  If you remove sensitive attributes (such as gender and race) could this information still be captured in an AI model from proxy variables?  Does your data sufficiently represent all groups or are some groups under-represented?  These are key questions to address before training an AI model.
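The proxy-variable question above can be illustrated with a small sketch: even after a sensitive attribute is removed, a remaining column may still recover it.  The column names, data and the simple majority-vote measure below are all illustrative assumptions, not the platform's actual method:

```python
# Minimal sketch: measure how accurately a remaining "proxy" column
# predicts a dropped sensitive attribute, by assigning each proxy value
# the most common sensitive value seen with it. Data is synthetic.
from collections import Counter, defaultdict

def proxy_recovery_accuracy(rows, proxy_key, sensitive_key):
    """Accuracy of recovering the sensitive attribute from the proxy alone."""
    by_proxy = defaultdict(Counter)
    for row in rows:
        by_proxy[row[proxy_key]][row[sensitive_key]] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in by_proxy.values())
    return correct / len(rows)

rows = [
    {"postcode": "AB1", "gender": "F"}, {"postcode": "AB1", "gender": "F"},
    {"postcode": "AB1", "gender": "M"}, {"postcode": "CD2", "gender": "M"},
    {"postcode": "CD2", "gender": "M"}, {"postcode": "CD2", "gender": "F"},
]

acc = proxy_recovery_accuracy(rows, "postcode", "gender")
print(round(acc, 3))  # 4/6 ≈ 0.667: postcode partially recovers gender
```

If a score like this is well above the base rate, simply deleting the sensitive column has not removed the information – a model could still learn it from the proxy.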


Privacy
What exposure does an organization open itself up to in the event of a data leak?  Privacy isn’t binary (that is, a dataset isn’t simply private or not private – there’s always residual risk).  When the data represents people, businesses or military establishments, questions should be asked such as: Can people (or businesses/military establishments) be reidentified?  To what extent could this happen?  What additional information would an attacker need to be able to reidentify people?  Could sensitive information be inferred from the data?  Could we determine that a person is part of a sensitive dataset (think private healthcare information)?  It’s important to go beyond the binary notion of private vs not private to get a full picture of these residual risks.
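One classic way to quantify the reidentification question is a k-anonymity check: count how many records share each combination of quasi-identifiers; a record whose combination is unique (k = 1) could be singled out by an attacker who knows those attributes.  This is a minimal sketch with invented field names and data, shown only to make the idea concrete:

```python
# Minimal sketch of a reidentification check: for each record, count how
# many records share its quasi-identifier values (k-anonymity).
# Records with k = 1 are uniquely identifiable from those attributes.
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return, for each record, the size of its quasi-identifier group."""
    key = lambda r: tuple(r[q] for q in quasi_identifiers)
    counts = Counter(key(r) for r in records)
    return [counts[key(r)] for r in records]

records = [
    {"age_band": "30-39", "postcode": "AB1", "condition": "x"},
    {"age_band": "30-39", "postcode": "AB1", "condition": "y"},
    {"age_band": "60-69", "postcode": "CD2", "condition": "z"},  # unique
]

ks = k_anonymity(records, ["age_band", "postcode"])
at_risk = sum(1 for k in ks if k == 1)
print(ks, at_risk)  # [2, 2, 1] 1 -> one record is uniquely identifiable
```

Even this crude measure goes beyond the binary "private or not" view: it puts a number on how exposed each record is, given an assumed set of attributes an attacker might know.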

AI regulation is a key focus of discussion in the AI community.  There’s emerging AI regulation, guidance and standards in almost all jurisdictions.  But do not forget that data regulation and human rights laws are already in place (and enforceable) and have been for some time.  Prominent examples are the California Consumer Privacy Act and the GDPR in the EU, but each jurisdiction has its own applicable laws too.

Chatterbox Labs’ patented AI Model Insights platform can answer all the quantitative questions highlighted above on the data, before you train your AI model.  Once you have a model in development or deployed in production, even greater insight can be delivered.
