No Data, Small Data or Poor Data: Enterprises Need Synthetic Data

  • Stuart Battersby
  • January 31, 2019

As Artificial Intelligence and Machine Learning have been gaining a strong foothold across the Enterprise in recent years, corporations have realized there is a new problem in town: the data. Machine Learning systems need good data, at scale, in order to learn hidden patterns during their training phases. In real-world business scenarios there are often problems with data, including:

  1. No Data (Access rights to data) – Often data simply can’t be moved outside of an authorized area or used by unauthorized people. This causes challenges for the global organization that needs to freely move data between teams and locations.
  2. Small Data (Quantity of data) – Whilst there is a huge trend to discuss Big Data, we often see the opposite problem: data volumes are too low or, in some instances, the data simply does not exist.
  3. Poor Data (Time spent preparing data) – Even when data does exist, supervised machine learning systems require this data to be labelled, or annotated. This process of producing what’s known as Gold Standard data can be too time-consuming, meaning AI projects don’t start.

Chatterbox Labs is an Enterprise AI software company. Alongside our AutoML and Explainable AI products, we have a portfolio of software products targeted at Synthetic Data Generation. These products each differ slightly and aim to alleviate some of the challenges listed above.

SRMI – Observe & Synth

Our SRMI product is targeted at numerical, binary and categorical data, and excels in instances where a dataset exists within your business but either cannot be used due to legal constraints or is too small. Our software first runs an observe step that understands the properties of your dataset and the relationships between its variables. It learns the recipe needed to create a similar dataset.

These relationships are then extracted (leaving the data behind) and stored in a package file. You can then play these relationships forwards during the synth step in order to generate new data. This can be a limitless process, allowing you to generate millions of new data points in your accessible environments.
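As a rough illustration of this observe-and-synth pattern (this is not the SRMI product’s API; every function, file and statistic here is a hypothetical stand-in), the sketch below learns simple summary statistics from a table, stores them in a package file with no raw data inside, and then samples brand new rows from that recipe:

```python
import json
import numpy as np
import pandas as pd

# Hypothetical observe/synth sketch, not the SRMI API: "observe" extracts a
# statistical recipe (means, covariance, category frequencies) and writes it
# to a package file, leaving the original data behind; "synth" plays that
# recipe forwards to generate as many new rows as needed.

def observe(df: pd.DataFrame, package_path: str) -> None:
    numeric = df.select_dtypes(include="number")
    categorical = df.select_dtypes(exclude="number")
    recipe = {
        "numeric_columns": list(numeric.columns),
        "means": numeric.mean().tolist(),
        "covariance": numeric.cov().values.tolist(),
        "categorical": {
            col: {str(k): float(v)
                  for k, v in categorical[col].value_counts(normalize=True).items()}
            for col in categorical.columns
        },
    }
    with open(package_path, "w") as f:
        json.dump(recipe, f)

def synth(package_path: str, n_rows: int, seed: int = 0) -> pd.DataFrame:
    with open(package_path) as f:
        recipe = json.load(f)
    rng = np.random.default_rng(seed)
    rows = rng.multivariate_normal(recipe["means"], recipe["covariance"], size=n_rows)
    out = pd.DataFrame(rows, columns=recipe["numeric_columns"])
    for col, freqs in recipe["categorical"].items():
        out[col] = rng.choice(list(freqs.keys()), size=n_rows, p=list(freqs.values()))
    return out
```

The sketch only captures simple statistics; the key point is the shape of the workflow: observe runs where the data lives, only the package file moves, and synth can then generate data freely in your accessible environments.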

NLP Text Generator – Synthetic text at scale

The NLP Text Generator allows you to address the no-data case. Rather than using a machine learning approach, the NLP Text Generator follows a templating system design. Your Subject Matter Experts encapsulate their domain knowledge inside a series of simple templates, and the NLP Text Generator handles all of the linguistics. The system creates human-readable sentences, with variation, at scale. As a benchmark, 1 million sentences can be created in just 1 minute.
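To make the templating idea concrete, here is a minimal slot-filling sketch; the template syntax, slot values and function names are invented for illustration and are not the NLP Text Generator’s actual template format:

```python
import random

# Illustrative slot-filling generator (not the NLP Text Generator's template
# format): domain experts write templates with named slots, and varied,
# human-readable sentences are produced by sampling slot values.

TEMPLATES = [
    "The customer {action} their {product} on {day}.",
    "On {day}, a customer {action} their {product}.",
]

SLOTS = {
    "action": ["upgraded", "cancelled", "renewed", "queried"],
    "product": ["savings account", "credit card", "insurance policy"],
    "day": ["Monday", "Wednesday", "Friday"],
}

def generate(n: int, seed: int = 0) -> list:
    """Sample a template and fill each slot to produce n synthetic sentences."""
    rng = random.Random(seed)
    sentences = []
    for _ in range(n):
        template = rng.choice(TEMPLATES)
        values = {slot: rng.choice(options) for slot, options in SLOTS.items()}
        sentences.append(template.format(**values))
    return sentences

if __name__ == "__main__":
    for sentence in generate(5):
        print(sentence)
```

A production system also has to layer linguistic handling (agreement, inflection, punctuation) on top of the templates; that is the part the product takes care of, so Subject Matter Experts only write the domain content.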

Boost with Reinforcement Learning – Addressing data quality

Labelling training data is burdensome and time-consuming, yet the training data is a critical aspect of any supervised machine learning system. Using Patented Reinforcement Learning methods, we synthetically generate pairs of data and training labels ready for a supervised machine learning system to use. The underlying system takes advantage of an unlabelled data export, paired with a small labelled training dataset. Using this approach, we are seeing time savings of up to 95% in preparing training data.
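The patented Reinforcement Learning approach itself isn’t reproduced here, but the data setup it starts from can be illustrated with a much simpler, well-known alternative, self-training (pseudo-labelling): a model fitted on the small labelled dataset assigns provisional labels to the unlabelled export, yielding data and label pairs for a downstream supervised system. The sketch below uses scikit-learn and toy data purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Self-training (pseudo-labelling) baseline for illustration only; this is
# not the patented Reinforcement Learning method. The inputs mirror the
# setup described above: a small labelled dataset plus a larger unlabelled
# data export.

def pseudo_label(X_small, y_small, X_unlabelled, confidence=0.9):
    """Fit on the small labelled set, then keep only high-confidence
    predictions on the unlabelled export as synthetic data/label pairs."""
    model = LogisticRegression(max_iter=1000).fit(X_small, y_small)
    probs = model.predict_proba(X_unlabelled)
    confident = probs.max(axis=1) >= confidence
    labels = model.classes_[probs[confident].argmax(axis=1)]
    return X_unlabelled[confident], labels

# Toy data: 50 labelled rows and a 5,000-row unlabelled export.
rng = np.random.default_rng(0)
X_small = rng.normal(size=(50, 4))
y_small = (X_small[:, 0] > 0).astype(int)
X_unlabelled = rng.normal(size=(5000, 4))

X_new, y_new = pseudo_label(X_small, y_small, X_unlabelled)
print(f"Kept {len(y_new)} confidently labelled examples from the unlabelled export")
```

The principle is the same in both cases: a small amount of human-labelled Gold Standard data is stretched across a much larger unlabelled export, which is where the savings in data preparation time come from.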

If you’re interested in learning more about our technologies you can find our technical documentation in the Documentation Portal or please get in touch.
