Synthetic Data: The Data Revolution No One is Talking About

Synthetic data is estimated to comprise 60% of all data used in the development of AI by 2024

Elliot Leavy
06/21/2022

If oil was the resource of the 20th century, certainly data has so far proven to be the resource of the 21st. Every industry, from agriculture to space, relies deeply on data in order to understand and react to an increasingly complex world, making detailed datasets some of the most valuable resources a business can ever hope to obtain. 

This is what concerns many businesses today because although companies like Amazon and Google have a world of data at their fingertips, it doesn’t seem likely that they will be sharing it with anyone else anytime soon. 

But what if a business could make its own dataset from scratch? Not relying on historical, real-world data but instead digitally generating artificial data to feed its artificial intelligence models, leapfrogging ahead of (or at least catching up with) the tech giants who have such a colossal head start?

This is just one of the many promises of synthetic data, an approach that until recently was viewed as too good to be true but is now being harnessed in almost every industry. In a nutshell, synthetic data is data created by a person or a computer rather than collected through traditional monitoring. In practice, this most often means finding the patterns in existing data and recreating them with an algorithm or artificial intelligence.
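
To make the core idea concrete, here is a minimal, hypothetical sketch of "finding the patterns and recreating them": fit a simple statistical model (a multivariate Gaussian) to a toy tabular dataset and then sample brand-new, artificial records from it. The columns and numbers are invented for illustration; real synthetic-data tools use far richer generative models.

```python
import numpy as np

# Hypothetical "real" dataset with two correlated columns (e.g. age and income).
# In practice this would be loaded from a company's own records.
rng = np.random.default_rng(seed=0)
age = rng.normal(40, 10, size=1000)
income = 1_000 * age + rng.normal(0, 5_000, size=1000)
real = np.column_stack([age, income])

# Learn the statistical pattern of the real data: here, just its mean vector
# and covariance matrix.
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# Generate synthetic records that follow the same pattern but describe
# no real individual.
synthetic = rng.multivariate_normal(mean, cov, size=1000)
print(synthetic[:3])
```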

Goodfellow's GANs

How does this work? To begin, we need to look at generative adversarial networks (GANs), which were invented by Ian Goodfellow in 2014. GANs are powerful neural networks that allow for unsupervised learning. Goodfellow's key breakthrough was to build GANs from two separate neural networks that are pitted against one another.

In particular, GANs work by taking a given dataset (for example, a collection of photos of human faces) and having one neural network (the "generator") create new images that are mathematically similar to the existing ones. Meanwhile, a second neural network (the "discriminator") is fed photos without being told whether they come from the original dataset or from the generator's output; its task is to identify which photos have been synthetically generated.

Over time, these two models hone each other’s abilities and eventually the discriminator’s classification success rate falls to 50%, no better than random guessing, meaning that the synthetically generated photos have become indistinguishable from the originals. 
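
Below is a heavily simplified sketch of that adversarial loop, written in PyTorch (an assumption on my part; the article names no framework) and using 1-D toy numbers in place of face photos. It is illustrative only, but it shows the two-network structure and the 50% endpoint described above.

```python
import torch
import torch.nn as nn

# Toy stand-in for "photos of faces": samples from a 1-D normal distribution.
def real_batch(n=64):
    return torch.randn(n, 1) * 2 + 3

generator = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
discriminator = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())

g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    # 1) Train the discriminator: label real samples 1, generated samples 0.
    real = real_batch()
    fake = generator(torch.randn(64, 8)).detach()
    d_loss = bce(discriminator(real), torch.ones(64, 1)) + \
             bce(discriminator(fake), torch.zeros(64, 1))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # 2) Train the generator: try to make the discriminator output 1 for fakes.
    fake = generator(torch.randn(64, 8))
    g_loss = bce(discriminator(fake), torch.ones(64, 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()

# When training works, the discriminator is right only about half the time:
# the generated samples have become statistically indistinguishable from real ones.
```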

New Models

Since Goodfellow's GANs, there have been several other developments in synthetic data generation, particularly with regard to visual data. The main ones are the emergence of diffusion models and neural radiance fields (NeRFs).

Diffusion models have gained a lot of online traction of late in the form of OpenAI's DALL-E 2 and Google's Imagen. They work by corrupting their training data with incrementally added noise and then learning how to reverse this noising process to recover the original image. Once trained, a diffusion model can apply these denoising methods to synthesize novel "clean" data from random input.
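
The sketch below illustrates only the "corrupt with incremental noise" half of that process, using PyTorch and a made-up stand-in image; the schedule values follow the common DDPM-style convention and are an assumption, not a description of DALL-E 2 or Imagen internals.

```python
import torch

# Noise schedule: alpha_bar[t] controls how much of the original signal survives
# after t corruption steps.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

def noisy_sample(x0, t):
    """Forward (noising) process: blend the clean sample x0 with Gaussian noise."""
    noise = torch.randn_like(x0)
    x_t = alpha_bar[t].sqrt() * x0 + (1.0 - alpha_bar[t]).sqrt() * noise
    return x_t, noise

x0 = torch.rand(1, 3, 8, 8)            # stand-in for a training image
x_t, eps = noisy_sample(x0, t=500)     # heavily corrupted version seen during training

# Training (not shown): a network learns to predict `eps` from `x_t` and `t`,
# i.e. to undo the corruption step by step. Sampling then starts from pure noise
# and repeatedly applies the learned denoiser to produce a novel "clean" image.
```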

NeRFs, meanwhile, are a powerful new method that quickly and accurately turns sets of two-dimensional images into complex three-dimensional scenes, which can then be manipulated and navigated to produce diverse, high-fidelity synthetic data.

What's Next?

So there have been many developments over the past decade that give credence to the idea that, by 2024, 60% of the data used in artificial intelligence development will be synthetic. Already, 96% of teams working on computer vision rely on synthetic data, and another analysis suggests that the number of companies focused on supplying synthetic data nearly doubled between 2019 and 2020 alone.

This is a data revolution happening across industries, from insurance to army intelligence to mobility and health care. The reasons are manifold. Not only does synthetic data lower the cost of training AI models, it also largely sidesteps privacy concerns because the records do not describe any real individual.

This is why the National Institutes of Health used synthetic data to replicate their database of more than 2.7 million COVID-19 patient records, creating a dataset with the same properties for analysis but without the information to identify them. Indeed, cybersecurity companies often find themselves the biggest supporters of these new datasets thanks to their implicit anonymity.

Then there are the ethical implications of synthetic data. While there is always a risk that synthetic data compounds bias, as there is with any dataset, in many ways it can reduce implicit biases such as underrepresentation, simply by generating additional samples for groups the original data under-covers.
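
As a small, hypothetical illustration of that rebalancing idea (invented numbers, naive jitter-based oversampling rather than any particular vendor's method):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical imbalanced training set: group A has 950 records, group B only 50.
group_a = rng.normal([0.0, 0.0], 1.0, size=(950, 2))
group_b = rng.normal([3.0, 3.0], 1.0, size=(50, 2))

# Naive synthetic oversampling: resample the underrepresented group with small
# random perturbations until both groups are equally represented.
idx = rng.integers(0, len(group_b), size=900)
extra_b = group_b[idx] + rng.normal(0.0, 0.1, size=(900, 2))

balanced_b = np.vstack([group_b, extra_b])
print(group_a.shape, balanced_b.shape)   # (950, 2) (950, 2)
```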

But ethics is not the only reason synthetic data is so useful to businesses; it is being put to work in numerous practical ways. The tractor manufacturer John Deere, for example, is helping its machines think like human farmers by creating synthetic images of plants in different weather conditions to improve the computer-vision systems that will eventually let its tractors spot weeds and spray weed killer on them.

Meanwhile, car manufacturers have for some time been using synthetic data to train their driverless vehicles, because such datasets allow a vast range of situations to be simulated, improving safety in scenarios that have yet to occur naturally in real life. In fact, late last year NVIDIA announced DRIVE Sim, which generates "synthetic ground-truth data for training deep neural networks that make up perception in autonomous vehicles", accelerating autonomous vehicle (AV) development. DRIVE Sim is part of NVIDIA's wider Omniverse offering, a software package for 3D workflows.

Synthetic data also has the upper hand when it comes to more complex data types; not everything is comprehensible to the human eye, for instance. "As we move away from images that are easy for humans to interpret, it becomes much harder to build labeled datasets that can be used in artificial intelligence training," explains Nathan Kundtz, co-founder and CEO of Rendered, a cloud-based platform for high-volume synthetic data generation. "This has historically meant that these sensor types have largely been frozen out of the AI revolution." Not so with synthetic data: being better able to represent various kinds of satellite data synthetically will make it easier to train AI to automatically detect when farm yields drop, CO2 levels rise or coral reefs suffer.

Simulation Scenarios

Of course, synthetic data is not the only tool available to businesses in learning from ‘what if’ scenarios. Digital twinning - the recreation of physical objects or systems virtually - is also an increasingly important way of doing business and improving processes. 

Although the two are quite different (one simulates data for AI, the other simulates how a modeled system behaves for people and AI), digital twinning is more than capable of working concurrently with synthetic data. In short, both digital twins and synthetic data use algorithms to simulate a physical entity. Synthetic data can be used to kick-start the initial digital twin; data captured from the physical twin then allows the digital twin to improve over time. In this sense, the digital twin can be used to enhance the quality of synthetic data in a cycle of continuous improvement.

Whatever your needs, businesses today have a wide selection of tools at their disposal, and synthetic data may just be one of the most important ones yet. It is the data revolution no one is talking about, and it will accelerate every industry that harnesses it.

