Generation and use of synthetic data

Prof. Cecilio Angulo, Founder of IDEAI-UPC and President of the ACIA.

In a digitized world, data is the new oil, but only a few have it in the right quantity and quality, and in time, to use artificial intelligence and data science algorithms that produce satisfactory results. For this reason, more and more people are producing their own oil, effectively and at a much lower cost. We are talking about synthetic data.

The generation of synthetic data was born as a methodology to preserve the privacy of personal data, especially data with a clear ethical and legal component, as in clinical and health settings. Whether through full replacement by a simulation model or through data imputation that substitutes part of the real-world information, synthetic data makes it possible to avoid using the original data without losing algorithmic reliability. Following this thread, synthetic data also began to be used to increase the size of databases in computer vision, by modifying the original images with transformations that mimic real-world variation: random changes in lighting and object color, displacements and rotations, and overlapping of images.
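The random image transformations described above can be sketched in a few lines. This is only an illustration under stated assumptions: the image is a small grayscale grid stored as a list of rows, the function name `augment` and the lighting range of 0.7 to 1.3 are hypothetical choices, and color changes and image overlapping are left out.

```python
import random

def augment(image, rng):
    """Return a randomly transformed copy of a grayscale image.

    `image` is a list of rows with pixel values in 0..255 (an assumed format).
    """
    # Random lighting change: scale all pixels by one factor, clipped to 0..255.
    factor = rng.uniform(0.7, 1.3)
    out = [[min(255, max(0, round(p * factor))) for p in row] for row in image]

    # Random displacement: shift every row cyclically by the same offset.
    shift = rng.randrange(len(out[0]))
    out = [row[shift:] + row[:shift] for row in out]

    # Random rotation: apply zero to three clockwise quarter turns.
    for _ in range(rng.randrange(4)):
        out = [list(col) for col in zip(*out[::-1])]
    return out

rng = random.Random(0)  # fixed seed so the sketch is reproducible
original = [[0, 50, 100], [150, 200, 250], [10, 20, 30]]
synthetic = [augment(original, rng) for _ in range(5)]
print(len(synthetic), "augmented images from one original")
```

Each call produces a new labeled example from the same original, which is how a small image database can be enlarged without collecting new data.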

It was soon shown that this annotated information, generated from simulations or algorithms, was indeed a good alternative to real-world data. Despite being artificially generated, the data can pass statistical tests that compare the probability distribution of the real data with that of the synthetic data. Since most discriminative or generative systems aim to reach only this statistical degree of accuracy, deep neural network developers have for some years been using synthetic data massively to train models.
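One common way to compare the distribution of real and synthetic values is the two-sample Kolmogorov-Smirnov test. The article does not name a specific test, so the following is a minimal sketch under that assumption: the function names are hypothetical, the data is one-dimensional, and the decision uses the usual large-sample critical value rather than an exact p-value.

```python
import bisect
import math
import random

def ks_statistic(a, b):
    """Largest gap between the empirical distribution functions of a and b."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for x in a + b:
        # Fraction of each sample that is <= x.
        fa = bisect.bisect_right(a, x) / len(a)
        fb = bisect.bisect_right(b, x) / len(b)
        d = max(d, abs(fa - fb))
    return d

def same_distribution(a, b, alpha=0.05):
    """True if the test does not reject equality at level alpha (asymptotic)."""
    n, m = len(a), len(b)
    critical = math.sqrt(-0.5 * math.log(alpha / 2)) * math.sqrt((n + m) / (n * m))
    return ks_statistic(a, b) <= critical

rng = random.Random(42)  # fixed seed so the sketch is reproducible
real = [rng.gauss(0.0, 1.0) for _ in range(500)]
synthetic = [rng.gauss(0.0, 1.0) for _ in range(500)]  # drawn from the same model
shifted = [rng.gauss(1.0, 1.0) for _ in range(500)]    # deliberately different
print("synthetic vs real, D =", round(ks_statistic(real, synthetic), 3))
print("shifted vs real, D =", round(ks_statistic(real, shifted), 3))
```

A small statistic means the synthetic sample is hard to tell apart from the real one on this criterion. It says nothing about properties the test does not measure, such as privacy.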

In fact, the current paradigm holds that valuable, high-quality artificial intelligence models can only be built by using synthetic data to train the algorithms. Synthetic data makes it possible to generate noisy information, or even to explore regions of the data space where no real data is available, helping algorithms to create more complete and robust models than those built using only real-world data.

We have already discussed some ways of creating synthetic data: from models, by means of plausible random changes, or using a rule base, all of them basic generating systems. But, going one step further, it is also possible to generate synthetic data using other generative artificial intelligence systems, such as generative adversarial networks (GANs), autoencoder systems or variational autoencoders (VAEs). In this case, the synthetic data generator ends up being reduced to a random vector generator whose output is transformed into information that is statistically equivalent to the real data used as a basis.
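The idea of reducing a generator to a random vector plus a learned transformation can be illustrated without training a neural network. The sketch below is a deliberately simple stand-in, not a GAN or a VAE: it assumes two-dimensional data with nonzero variance in the first coordinate, fits a mean and covariance, and maps a standard normal vector `z` to a synthetic point through a linear transformation (a hand-written Cholesky factor). The name `fit_generator` is hypothetical.

```python
import math
import random

def fit_generator(points):
    """Fit a 2-D Gaussian to `points` and return a function z -> synthetic point."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Sample variances and covariance.
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    # Cholesky factor of the 2x2 covariance matrix.
    l11 = math.sqrt(sxx)
    l21 = sxy / l11
    l22 = math.sqrt(max(0.0, syy - l21 ** 2))

    def generate(z):
        # The random vector z is the only source of variation.
        return (mx + l11 * z[0], my + l21 * z[0] + l22 * z[1])

    return generate

rng = random.Random(7)  # fixed seed so the sketch is reproducible
real = []
for _ in range(1000):
    x = rng.gauss(0.0, 1.0)
    real.append((x, 2.0 * x + rng.gauss(0.0, 0.5)))  # correlated toy data

generate = fit_generator(real)
synthetic = [generate((rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))) for _ in range(1000)]
print("first synthetic point:", synthetic[0])
```

In a GAN or a VAE the linear map is replaced by a trained network, but sampling works the same way: draw a random vector, then transform it.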

Finally, over the last two years the generation of synthetic data, either complete or by imputation of part of the information, has proved to be a valid path toward safe federated learning environments. The learning systems act locally on their own data and condition their training with synthetic data imported from other providers in the data space. In this way, a safe and shared data space is guaranteed, based on the specifications agreed between the providers and the user.
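The exchange described above can be sketched as a toy model in which real values never leave their provider and only synthetic samples are shared. Everything here is an assumption made for illustration: the `Provider` class and its methods are hypothetical, the local model is a one-dimensional Gaussian, and "training" is reduced to estimating a mean. It is not an implementation of any particular federated learning protocol.

```python
import random
import statistics

class Provider:
    """A participant in the data space that keeps its real data local."""

    def __init__(self, name, real_values):
        self.name = name
        self._real = list(real_values)  # never exported

    def export_synthetic(self, n, rng):
        # Share samples from a model fitted locally, not the real records.
        # Needs at least two real values to estimate the spread.
        mu = statistics.mean(self._real)
        sigma = statistics.stdev(self._real)
        return [rng.gauss(mu, sigma) for _ in range(n)]

    def train(self, imported_batches):
        # Local "training": estimate a mean from own real data plus
        # the synthetic batches imported from other providers.
        data = self._real + [v for batch in imported_batches for v in batch]
        return statistics.mean(data)

rng = random.Random(3)  # fixed seed so the sketch is reproducible
hospital_a = Provider("A", [rng.gauss(5.0, 1.0) for _ in range(200)])
hospital_b = Provider("B", [rng.gauss(7.0, 1.0) for _ in range(200)])

shared_by_b = hospital_b.export_synthetic(200, rng)
print("A alone:", round(hospital_a.train([]), 2))
print("A with synthetic data from B:", round(hospital_a.train([shared_by_b]), 2))
```

The agreed specification between providers would, in practice, cover what kind of generator is used and which statistics the exported data must preserve.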
