Is synthetic data the future of AI?

Welcome to the world of artificial…data! In what may seem slightly dystopian to some, the era of synthetic data may be coming, and soon. Let’s parse through this phenomenon and understand the implications.

As the name indicates, synthetic data is data that is artificially generated, as distinct from data gathered in the real world from observations of real-world happenings, actions and reactions. A layperson might ask why on earth anyone needs synthetic data. The answer lies in the difficulty inherent in acquiring real-world data: despite the billions of Internet of Things (IoT) devices surrounding us, collecting much of it is expensive and cumbersome. The data thus acquired may also be imbalanced or biased (skewed towards a particular condition), unavailable when required, corrupted, or otherwise unusable. In many cases, one may well ask, of what use is such data?

Enter synthetic data: better-annotated data that, when combined with real data, creates an enhanced dataset that is usable and extensible for building AI models. Synthetic data, in other words, makes real data usable.

Synthetic data is a useful tool in many scenarios. Consider, for example, the case where a new system is to be tested and either no data exists or the data that exists is biased. Sounds far-fetched? Consider the safety testing of new car models where all existing data comes from full-grown men driving the car, and none from women. Such a terribly skewed dataset would result in incomplete testing for half the population!

Synthetic data may also be used to supplement small datasets that might otherwise be ignored. Here, think of the real-world scenario of rare diseases: diseases that are ignored by big pharma because of their complexity and rarity. Until recently, it just didn’t make economic sense to do drug discovery for rare diseases, resulting in an ‘orphan drug’ market.

The arrival of AI and synthetic data has transformed this sector: enhanced datasets can now be used to derive possible candidate drugs to treat rare diseases. For example, synthetic data has been used to develop drugs that treat cystic fibrosis, a rare genetic disease. With better patient matching and stratification techniques and precision medicine, the treatment of such rare diseases is attracting more attention. Idiopathic pulmonary fibrosis is another rare disease that has benefited from patient stratification made possible by synthetic data.

Other use cases where synthetic data may be useful include hackathons, product demos, simulations and prototypes. Synthetic data may also be used to train models and make them more accurate. After all, real-world data may not always cover all the edge cases needed to give models the well-rounded training that accuracy requires.

So how does one go about generating this synthetic data? There are many techniques. It could involve, for example, data augmentation (applying transformations to existing data), simulation (using mathematical or statistical models), generative models (such as generative adversarial networks, or GANs) or transfer learning (training a model on one dataset and then using that model to generate new data). The methods vary, depending on the application and need.
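To make two of the techniques named above a little more concrete, here is a minimal Python sketch of data augmentation and simulation. It is only an illustration under stated assumptions: the height values are made up, the function names `augment` and `simulate` are hypothetical, and the choice of Gaussian noise and a fitted normal distribution is one simple option among many, not a recommended recipe.

```python
import random
import statistics


def augment(values, noise_scale=0.05, copies=2, seed=0):
    """Data augmentation: jitter each real value with small Gaussian noise.

    noise_scale is an assumed fraction of the data's standard deviation.
    """
    rng = random.Random(seed)
    spread = statistics.stdev(values)
    return [
        v + rng.gauss(0.0, noise_scale * spread)
        for v in values
        for _ in range(copies)
    ]


def simulate(values, n, seed=0):
    """Simulation: fit a normal distribution to the real data, then sample it."""
    rng = random.Random(seed)
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [rng.gauss(mu, sigma) for _ in range(n)]


# Made-up "real" measurements, purely for illustration.
real_heights_cm = [158.0, 162.5, 171.0, 175.5, 180.0, 183.5]

augmented = augment(real_heights_cm)          # 2 noisy copies per real value
simulated = simulate(real_heights_cm, n=100)  # 100 draws from the fitted model

# The enhanced dataset is a blend of real and synthetic records.
blended = real_heights_cm + augmented + simulated
print(len(augmented), len(simulated), len(blended))  # prints: 12 100 118
```

Augmentation stays close to the records that already exist, while simulation can produce values the real sample never contained, which is why the two are often combined.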

Nothing comes without risks, of course, and synthetic data is no exception. The quality of a model trained this way hinges on the quality of the synthetic data it is fed. Employing such data necessitates extra verification to ensure that it aligns with the real world. As it gains acceptance, synthetic data will also have to combat user scepticism, disbelief or opposition: perceptions that may be addressed by increased transparency about data generation techniques.
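One simple form that this extra verification could take is comparing the distribution of a synthetic column against its real counterpart. The sketch below computes the two-sample Kolmogorov-Smirnov statistic, which is the largest gap between the two empirical distribution functions. Everything here is an assumption for illustration: the data is randomly generated, the names `ks_statistic`, `faithful` and `shifted` are hypothetical, and a real validation would look at far more than one column.

```python
import random
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest absolute gap between the two empirical CDFs:
    0.0 means the samples are distributed identically, 1.0 means they
    do not overlap at all.
    """
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in a + b:  # the largest gap always occurs at an observed value
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


rng = random.Random(42)

# Stand-in for real data, plus two hypothetical synthetic datasets.
real = [rng.gauss(170.0, 10.0) for _ in range(500)]
faithful = [rng.gauss(170.0, 10.0) for _ in range(500)]  # same distribution
shifted = [rng.gauss(185.0, 10.0) for _ in range(500)]   # drifted from reality

print("faithful vs real:", round(ks_statistic(real, faithful), 3))
print("shifted  vs real:", round(ks_statistic(real, shifted), 3))
```

A small statistic suggests the synthetic column tracks the real one, while a large one flags drift. What counts as small enough is a judgment call that depends on sample size and on the application.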

So, we come to the question: are we staring at the future of AI? I think we know enough now to say we are looking at a very probable future, and a promising one at that. While the objections to using synthetic data do exist, they are surmountable – the benefits far outweigh the risks. In an era where some AI forecasters believe we are running out of data to train models on, synthetic data can help us build well-rounded, well-trained models that can benefit research and development in multiple fields.

The future of data, it is safe to say, is augmented: a blend of real and synthetic! A possible utopian future indeed.

