Generation and use of synthetic data

In a digitalized world, data is the new oil, but only a few organizations have it in adequate quantity and quality, and in time, to run artificial intelligence and data science algorithms that produce satisfactory results. As a result, more and more people are producing their own oil, effectively and at a much lower cost. We are talking about synthetic data.

Synthetic data generation was born as a methodology for preserving the privacy of personal data, especially data with a clear ethical and legal component, as in clinical and health settings. Whether through total replacement using a simulation model, or through data imputation that substitutes part of the real-world information, synthetic data allows us to avoid using the original data without losing algorithmic reliability. Following this thread, synthetic data also began to be used to increase the size of datasets in computer vision, by applying plausible transformations to real-world images: random changes in lighting, object color, displacements and rotations, and image superposition.
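The image transformations described above can be sketched in a few lines. This is only a minimal illustration using the Python standard library, assuming a small grayscale image stored as a list of rows with pixel values between 0.0 and 1.0. The function names (augment, shift, rotate90) and the parameters max_shift and max_gain are hypothetical, and rotations are restricted to multiples of 90 degrees to keep the example short.

```python
import random

def rotate90(image):
    """Rotate a list-of-rows image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def shift(image, dx, dy, fill=0.0):
    """Displace the image by (dx, dy) pixels; uncovered pixels get `fill`."""
    height, width = len(image), len(image[0])
    out = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            ny, nx = y + dy, x + dx
            if 0 <= ny < height and 0 <= nx < width:
                out[ny][nx] = image[y][x]
    return out

def augment(image, rng, max_shift=1, max_gain=0.2):
    """Return a randomly transformed copy of a grayscale image."""
    # Random lighting change: one common gain, clamped to the valid range.
    gain = 1.0 + rng.uniform(-max_gain, max_gain)
    lit = [[min(1.0, max(0.0, p * gain)) for p in row] for row in image]
    # Random displacement by whole pixels.
    moved = shift(lit,
                  rng.randint(-max_shift, max_shift),
                  rng.randint(-max_shift, max_shift))
    # Random rotation by 0, 90, 180 or 270 degrees (a simplification).
    for _ in range(rng.randrange(4)):
        moved = rotate90(moved)
    return moved

rng = random.Random(0)  # fixed seed so the run is repeatable
original = [[0.0, 0.5, 1.0],
            [0.2, 0.4, 0.6],
            [0.9, 0.1, 0.3]]
augmented = [augment(original, rng) for _ in range(5)]
```

Each call produces a new variant of the same annotated image, so the label attached to the original can be reused for every variant. In practice a computer vision library would normally handle this step.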

It was soon demonstrated that this annotated information, generated from simulations or algorithms, was indeed a good alternative to real-world data. Despite being artificially generated, synthetic data can pass statistical tests that compare the probability distribution of the real data with the distribution of the synthetic data. Since most discriminative or generative systems aim only for this statistical degree of accuracy, developers of deep neural networks have for some years now been using synthetic data on a massive scale to train their models.
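One common way to compare the distribution of real and synthetic data is a two-sample Kolmogorov-Smirnov test, shown here as an illustrative choice since the text does not name a specific test. The sketch below uses only the standard library. The function names ks_statistic and ks_critical and the example samples are assumptions made for illustration, and the critical value is the usual large-sample approximation.

```python
import bisect
import math
import random

def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    gap = 0.0
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

def ks_critical(n, m, alpha=0.05):
    """Large-sample critical value for the two-sample KS test."""
    c = math.sqrt(-0.5 * math.log(alpha / 2))
    return c * math.sqrt((n + m) / (n * m))

rng = random.Random(1)  # fixed seed so the run is repeatable
real = [rng.gauss(0.0, 1.0) for _ in range(500)]
# A synthetic sample drawn from the same distribution as the real one...
faithful = [rng.gauss(0.0, 1.0) for _ in range(500)]
# ...and one whose mean is clearly off.
shifted = [rng.gauss(1.0, 1.0) for _ in range(500)]

threshold = ks_critical(len(real), len(shifted))
for name, sample in (("faithful", faithful), ("shifted", shifted)):
    d = ks_statistic(real, sample)
    verdict = "rejected" if d > threshold else "not rejected"
    print(f"{name}: D = {d:.3f}, same-distribution hypothesis {verdict}")
```

A synthetic sample passes when its statistic stays below the threshold, meaning the test cannot tell it apart from the real data at the chosen significance level. This checks one variable at a time, so it says nothing about relationships between variables.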

In fact, the current paradigm argues that only by using synthetic data to train your algorithms is it possible to build valuable, high-quality artificial intelligence models. Synthetic data makes it possible to generate information with noise, or even to explore regions of the data space where no real data is available, which helps algorithms produce more complete and robust models than those built using only real-world data.

We have already discussed some ways of creating synthetic data: from simulation models, through plausible random changes, or using a rule base, all of which are basic generating systems. But, taking this one step further, it is also possible to generate synthetic data using other artificial intelligence generative systems, such as generative adversarial networks (GANs) or variational autoencoders (VAEs). In this case, the synthetic data generator ends up being reduced to a random vector generator capable of synthesizing information statistically equivalent to the real data used as its basis.
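The idea of a generator that turns random vectors into statistically equivalent data can be illustrated without a neural network. The sketch below is not a GAN or a variational autoencoder: it is a deliberately simple linear stand-in that fits the mean and covariance of two-dimensional data and maps standard normal random vectors onto them. The function name fit_generator and the example data are hypothetical.

```python
import math
import random

def fit_generator(rows):
    """Fit mean and covariance of 2-D data; return a function z -> sample."""
    n = len(rows)
    mx = sum(r[0] for r in rows) / n
    my = sum(r[1] for r in rows) / n
    sxx = sum((r[0] - mx) ** 2 for r in rows) / (n - 1)
    syy = sum((r[1] - my) ** 2 for r in rows) / (n - 1)
    sxy = sum((r[0] - mx) * (r[1] - my) for r in rows) / (n - 1)
    # Cholesky factor of the covariance matrix [[sxx, sxy], [sxy, syy]].
    l11 = math.sqrt(sxx)
    l21 = sxy / l11
    l22 = math.sqrt(max(syy - l21 ** 2, 0.0))

    def generate(z):
        """Map a random vector z = (z1, z2) to a synthetic data point."""
        z1, z2 = z
        return (mx + l11 * z1, my + l21 * z1 + l22 * z2)

    return generate

rng = random.Random(2)  # fixed seed so the run is repeatable
# Stand-in for real data: two correlated measurements per record.
real = []
for _ in range(2000):
    x = rng.gauss(5.0, 2.0)
    real.append((x, 0.5 * x + rng.gauss(0.0, 1.0)))

generate = fit_generator(real)
# Synthetic records come purely from random vectors fed to the generator.
synthetic = [generate((rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)))
             for _ in range(len(real))]
```

A GAN generator or a VAE decoder plays the same role as generate here, with a learned nonlinear mapping in place of the linear one, which lets it reproduce far richer distributions than a single Gaussian.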

Finally, in the last two years, the generation of synthetic data, whether complete or by imputation of part of the information, has been proving itself a valid path toward secure federated learning environments. Learning systems act locally on their own data and condition their training with synthetic data imported from other data space providers. This guarantees a secure data space, with sharing based on the specifications agreed upon between the providers and the user.

Prof. Cecilio Angulo

Founder of IDEAI-UPC and President of the ACIA.
