Generating synthetic data

Publication date: March 15, 2024

This article was prepared by KIELTYKA GLADKOWSKI based on legal assistance provided to clients operating in language models and synthetic data in technology transactions.

Natural language generation

Natural language processing (NLP) involves enabling computers to understand human language in written and oral form. This process covers many fields such as computer science, artificial intelligence, linguistics and data science.

Natural language processing can be divided into two parts: natural language understanding and natural language generation. The task of natural language understanding (NLU) is to capture the meaning of information obtained from speech or text, while the task of natural language generation (NLG) is to create, from data, a text that can be understood and that forms a meaningful narrative. Natural language generation builds sentences and phrases that summarize, explain and describe data in a way similar to how a human would. Both NLG and NLU must take into account language rules that are based on syntax, lexicons, morphology and semantics.

Content generated using NLG should include only the most relevant information and should be presented in a logical manner, with appropriate structure and organization. All these factors make natural language generation a very complex procedure in which every single step brings us closer to natural-sounding content. Over the last few years NLG has made great strides, but it still cannot compare to human texts, which are characterized by creativity and emotional charge. Today, natural language generation systems can transform data into narratives through templates or dynamic document creation. In the template technique, texts are structured with empty slots that are filled with data. Thanks to, among others, Markov chains, recurrent neural networks and Transformer architectures, NLG systems have made progress towards dynamic content creation: sentences are built from semantic representations and the desired linguistic structure, and the next step is organizing them into a logical whole that is adapted to the communication purpose.

Natural language processing plays an important role in many commonly used systems and technologies. However, the development of natural language generation has its limits. Systems are often developed for English, and progress rarely translates to other languages because they have different grammatical structures. There are also translation-based NLG approaches, but they are suboptimal: processing slows down significantly, by one to three orders of magnitude, which makes them unprofitable from a business point of view.

Some examples of possible applications of NLG:

Creating responses for chatbots, voice assistants and other AI conversation systems (an example of a voice assistant is Siri from Apple);
Suggesting texts in emails to clients and personalizing responses to customer messages;
Generating product descriptions for e-commerce websites;
Machine translations;
Generating a coherent narrative from a set of premises or a short summary;
Creating reports on the status of Internet of Things devices;
Generating and personalizing scenarios used in customer service;
Creating text narratives based on structured data, turning business data, financial reports and other types of data into easy-to-understand language;
Transcript generation (combination of NLG, speech recognition and audio understanding to transform audio into text);
Paraphrasing, transforming a sentence in a natural language into a new one that has the same semantic meaning but a different lexical or syntactic form.

Methods of generating text data

Many methods are used to generate natural language, depending on the intended goal. One branch of development is the creation of metrics that best reflect the quality of the text. The generated texts should be linguistically correct and similar to human texts in terms of syntax, grammar and tone. Drawing on information theory, other aspects of the generated statements are also analyzed. The best-known decoding methods include greedy search (a greedy algorithm that always picks the most probable next word) and top-k sampling (drawing the next word from the k most probable candidates); however, there is also progress on methods aimed at the informativeness of the text. These methods let current and future text generation models carry an amount of information that is inversely proportional to the number of words and controlled by an adjustable parameter. Combinations of text generation methods, trained using constantly improved metrics, offer increasingly sophisticated control over sequences of words. Despite this continuous development, the simplest methods are still used in specific cases.
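
The difference between greedy search and top-k can be illustrated with a minimal sketch. The word probabilities below are invented for illustration, and the helper names greedy_pick and top_k_pick are hypothetical; a real system would obtain the distribution from a language model.

```python
import random

def greedy_pick(probs):
    """Return the word with the highest probability."""
    return max(probs, key=probs.get)

def top_k_pick(probs, k, rng):
    """Sample one word among the k most probable, weighted by probability."""
    top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    words = [word for word, _ in top]
    weights = [p for _, p in top]
    return rng.choices(words, weights=weights, k=1)[0]

# Hypothetical next-word distribution, invented for illustration.
next_word_probs = {"data": 0.5, "text": 0.3, "images": 0.15, "noise": 0.05}

rng = random.Random(0)
greedy_word = greedy_pick(next_word_probs)
sampled_words = [top_k_pick(next_word_probs, 2, rng) for _ in range(20)]
print(greedy_word)
print(sampled_words)
```

Greedy search always returns the same word, while top-k with k = 2 varies its choice between the two most probable candidates and never picks the unlikely ones.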

Markov chains

Markov chains are one of the first methods used to generate natural language. A Markov chain is a mathematical model that describes a stochastic process in which the probability of the next event depends only on the current state, i.e. it does not depend on the transitions that led to the current state. The Markov chain considered here consists of a finite number of states and probabilistic rules Pij, i.e. the probabilities of the process moving from state i to state j, which are written in the form of a transition matrix. The system moves from one state to another according to these rules. The states can represent anything; in text generation they are words: the model predicts the next word in the sentence based only on the last word typed. This model was used, among other things, to suggest words while typing in the keyboard applications of early smartphones.
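
A word-level Markov chain of this kind can be sketched in a few lines. The toy corpus and the function names build_transitions and generate are assumptions made for illustration only; the transition probabilities are simply relative frequencies of observed word pairs.

```python
import random
from collections import defaultdict

def build_transitions(words):
    """Count word -> next-word transitions and normalize them into probabilities."""
    counts = defaultdict(lambda: defaultdict(int))
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    transitions = {}
    for current, followers in counts.items():
        total = sum(followers.values())
        transitions[current] = {w: c / total for w, c in followers.items()}
    return transitions

def generate(transitions, start, length, rng):
    """Walk the chain: each next word depends only on the current word."""
    word, output = start, [start]
    for _ in range(length - 1):
        followers = transitions.get(word)
        if not followers:  # dead end: no observed successor
            break
        word = rng.choices(list(followers), weights=list(followers.values()), k=1)[0]
        output.append(word)
    return output

# Toy corpus, invented for illustration.
corpus = "the cat sat on the mat and the cat ate the fish".split()
transitions = build_transitions(corpus)
sentence = generate(transitions, "the", 8, random.Random(1))
print(transitions["the"])
print(" ".join(sentence))
```

Each row of the transitions dictionary corresponds to one row of the transition matrix: its values sum to 1 and give the probability of moving from the current word to each possible next word.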

Recurrent neural network (RNN)

A recurrent neural network (RNN) can be used to process sequential data such as audio signals, time series, text data, images and object movement. For text data, an RNN uses the sequential nature of text, remembering previous words in order to predict future ones. It passes each element of the sequence through a structure with feedback: each output of the model serves as input when the model creates the next element of the sequence. In this way, information from previous steps is carried along the sequence and forms the memory of the RNN. During each iteration, the model remembers previously encountered words. Each candidate word is assigned a probability of occurring (based on the previous words), and the word with the highest probability is selected. As the sequence length increases, however, the capabilities of a recurrent neural network become limited.

RNNs have problems with unstable gradients: both the “exploding” gradient problem and the vanishing gradient problem. Gradient descent is used to search for a minimum of the cost function of a neural network, and thus for a good network configuration. When the cost is calculated for an RNN, the error is propagated backwards from the output layer towards the input layer, as well as through the time steps of the recurrent layer. As the backpropagation algorithm passes backwards through all the neurons of the network to update their weights, the gradient tends to flatten and gradually decay because of repeated multiplication. When multiplication by a small number is performed repeatedly, the gradient value decreases very quickly until it approaches zero. The further back the algorithm goes through the network, the smaller the gradient and the more difficult it is to update the weights. Eventually, the weights are no longer updated, or the updates are insignificant, which means that the network’s ability to learn is paralyzed. At this point the network is unable to transmit useful information from the model output back to the layers near the model input. Because of these limitations, a simple recurrent neural network is unable to remember long sequences and create complex and coherent sentences.
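
The effect of repeated multiplication can be seen with plain arithmetic. In the sketch below the per-step factors 0.5 and 1.5 are illustrative assumptions, not values taken from a trained network, and gradient_after is a hypothetical helper.

```python
def gradient_after(steps, factor):
    """Magnitude of a gradient that is multiplied by `factor` at each time step."""
    grad = 1.0
    for _ in range(steps):
        grad *= factor
    return grad

# The factors 0.5 and 1.5 are illustrative assumptions.
for steps in (1, 10, 50):
    print(steps, gradient_after(steps, 0.5), gradient_after(steps, 1.5))

vanishing = gradient_after(50, 0.5)   # shrinks towards zero
exploding = gradient_after(50, 1.5)   # grows very large
```

After 50 steps the first gradient is far too small to change the weights in any meaningful way, while the second has grown by many orders of magnitude.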

LSTM networks

LSTM (long short-term memory) networks were created as a solution to the problem of remembering long sequences. LSTM networks are a special variant of RNN. They have gained popularity in NLP tasks because they are able to learn from context, which is necessary when processing long sequences of text data. A typical recurrent unit has a single activation function, whereas an LSTM unit has as many as three gates: an input gate, a forget gate and an output gate. At each point in time, the gates determine what past information must be retained and what should be deleted. This procedure limits the number of previous sequence elements that influence the current state. The input gate controls the signals that are stored in the internal state of the unit, and the forget gate controls the effect of the previous state on the current state. Together, these gates decide what is to be recorded and what is to be forgotten. The output gate, in turn, controls the amount of information flowing from the internal state to the output of the unit and then to the next layer. Controlling the flow of information makes it possible to remember relevant words and forget those that are currently unnecessary. Thanks to this, the network tracks information selectively, propagating only important information backwards, which reduces the risk of a vanishing gradient. However, LSTM networks are not without drawbacks. The complexity of their internal paths leads to high computational requirements, which makes model training difficult and creates a need for parallel computation. This complexity also affects the network’s memory, which is limited to a few hundred words.
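
The role of the three gates can be sketched as a single time step of an LSTM unit in NumPy. This is a minimal illustration of the commonly used formulation, with random untrained weights and toy sizes chosen as assumptions; lstm_step is a hypothetical helper, not a library function.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step. W maps [h_prev, x] to four stacked pre-activations."""
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x]) + b
    i = sigmoid(z[0 * hidden:1 * hidden])   # input gate: what to write
    f = sigmoid(z[1 * hidden:2 * hidden])   # forget gate: how much old state to keep
    o = sigmoid(z[2 * hidden:3 * hidden])   # output gate: how much state to expose
    g = np.tanh(z[3 * hidden:4 * hidden])   # candidate values for the internal state
    c = f * c_prev + i * g                  # updated internal (cell) state
    h = o * np.tanh(c)                      # output passed to the next step or layer
    return h, c

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4              # toy sizes chosen for illustration
W = rng.normal(scale=0.1, size=(4 * hidden_size, hidden_size + input_size))
b = np.zeros(4 * hidden_size)

h = np.zeros(hidden_size)
c = np.zeros(hidden_size)
for x in rng.normal(size=(5, input_size)):  # a random 5-step input sequence
    h, c = lstm_step(x, h, c, W, b)
print(h, c)
```

The line that updates c shows the selective memory described above: the forget gate scales the previous state and the input gate scales what is newly written.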

Transformers

Transformers are neural networks that combine the encoder-decoder structure with the attention mechanism. Such a network consists of a stack of encoders for processing input data of any length and a stack of decoders for outputting the generated sentences. The attention mechanism makes it possible to take into account the dependencies between all elements of a sequence, regardless of the distance separating them. The Transformer is therefore able to model dependencies in longer sequences, which means it can remember the context of a given word. It starts by processing all words in the input sequence, and modeling the relationships between them, at the same time. To account for how words relate to each other, relationship information is incorporated into the vector representation. Each word is represented individually in the vector space, without having to reduce all the information to a single vector of fixed length. This makes it possible to model longer sentences and far-reaching linguistic dependencies. The Transformer architecture also lends itself to parallel computation. Using it has significantly improved the performance of NLP solutions, especially NLG solutions, and models based on the attention mechanism rank highest in NLP and NLG benchmarks. The most famous language models that use the Transformer architecture are GPT, BERT, XLNet, T5 and BART.
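
The attention mechanism can be sketched as scaled dot-product self-attention in NumPy. This is a simplified illustration under stated assumptions: the word vectors are random stand-ins, the sizes are toy values, and the learned projection matrices of a real Transformer are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: every position attends to every position."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # pairwise similarity between positions
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8                      # toy sizes chosen for illustration
X = rng.normal(size=(seq_len, d_model))      # stand-in for word vectors
output, weights = attention(X, X, X)         # self-attention, no learned projections
print(weights.round(2))
```

Each row of weights says how strongly one word attends to every other word, whatever the distance between them, and each word keeps its own output vector rather than being compressed into one fixed-length summary.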

GPT

The GPT (generative pre-trained transformer) model, which is very popular nowadays, is a generative model for natural language created by the team at OpenAI. Its goal is to predict which word will appear next in an incomplete sentence, taking into account all the previous words. GPT uses Transformer decoder blocks in its architecture. The attention mechanism in this model takes into account words, word pairs, etc., but only from the sequence preceding the current word. The sequence, extended with the new word, goes back to the model input and the next word is predicted. This cycle continues until the text is complete. In this way the GPT model continues the input sentences with text that stays consistent with the topic. Before GPT appeared, NLP models were trained on large amounts of carefully annotated text data, which severely limited their development: most publicly available data is unlabeled, and annotating it requires a lot of time. Using unlabeled data, the GPT-1 model was built in 2018, and users could then tune the model to the task they were performing. Through transfer learning, GPT-1 became a tool that facilitates NLP tasks, and its potential could be increased by combining it with other models and by enlarging the data and parameter sets. The subsequent GPT-2 and GPT-3 models used larger data sets and more parameters; the purpose of these changes was to build stronger language models capable of even better inference and modeling. For example, GPT-3 was trained on a corpus of approximately 500 billion tokens, most of it from Common Crawl, whose texts come from a wide variety of websites. GPT-3 can automatically create unique texts that are creative, fit the context and look as if they were written by a human. Such capabilities have been used to generate questions and answers, create reports and code, search documents, and the like. Undoubtedly, GPT-3 has enormous potential, but currently its use is expensive and resource-intensive.
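
The restriction that attention looks only at the preceding sequence is usually implemented with a causal mask. The sketch below is a minimal illustration, assuming uniform toy scores for a four-word sequence; causal_attention_weights is a hypothetical helper.

```python
import numpy as np

def causal_attention_weights(scores):
    """Mask out future positions, then normalize each row with a softmax."""
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
    masked = np.where(future, -np.inf, scores)
    masked = masked - masked.max(axis=-1, keepdims=True)
    e = np.exp(masked)                                  # exp(-inf) evaluates to 0
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))          # uniform toy scores for a 4-word sequence
weights = causal_attention_weights(scores)
print(weights)
```

Row i of the result spreads its weight only over positions 0 to i, so the prediction for each word can use the words before it but none of the words after it.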

BERT

BERT (Bidirectional Encoder Representations from Transformers) is a language model developed by researchers at Google AI that uses the encoder mechanism from the Transformer architecture. BERT learns natural language from syntactic and semantic information. During training, missing words are predicted from the other words in the sentence, and the vector representations of words are context-sensitive: the same word gets different representations in different sentences, depending on the context in which it appears. BERT was trained on English text from BooksCorpus and Wikipedia. To use BERT, you add one or more network layers to the pre-trained model and train the network for your own task. The size of the model is problematic, so smaller or otherwise modified versions have been created for commercial applications, for example DistilBERT, ALBERT, RoBERTa or HuBERT. BERT is also the basis for many attempts to adapt language models to Polish. BERT can be used to solve many language tasks: answer generation, sentiment analysis, text classification, summarization, and content generation from just a few starting sentences. The model is trained using two unsupervised strategies:

MLM (masked language model) teaches the model to understand the relationships between words. It removes the one-way limitation and makes it possible to predict masked tokens based on both their predecessors and successors. Words in the input sentence are randomly masked, and the model is trained to predict the missing words using the context on both the left and the right.

NSP (next sentence prediction) consists of pre-training representations of text pairs. It teaches the model to understand relationships between sentences by predicting whether a particular sentence is likely to follow a given sentence.
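
The data preparation step for masked language modeling can be sketched as follows. This is a simplified illustration: the function mask_tokens is hypothetical, the [MASK] placeholder string and the masking probability of 0.3 are assumptions chosen so that a short sentence shows a few masks, and no model is trained here.

```python
import random

def mask_tokens(tokens, mask_prob, rng, mask_token="[MASK]"):
    """Randomly hide tokens; return the masked input and the labels to predict."""
    masked, labels = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels[position] = token   # the model must recover this word
        else:
            masked.append(token)
    return masked, labels

tokens = "the model learns to predict the missing words from context".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, rng=random.Random(3))
print(masked)
print(labels)
```

The model sees the masked sentence in full, so it can use words on both sides of each hidden position when predicting the entries stored in labels.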

XLNet

XLNet is a model developed by researchers at Carnegie Mellon University and Google Brain, designed with the advantages and disadvantages of both autoregressive (AR) language models and the BERT model in mind. AR models aim to estimate the probability distribution of a text corpus by predicting the next token based on all previous ones. They only take one-way context into account, so they are not very effective in language comprehension tasks that require two-way context. BERT, in turn, provides bidirectional context and better performance, but it still has some limitations. XLNet combines BERT’s ability to learn bidirectional contexts with a generalized autoregressive pre-training method built on the Transformer-XL model. XLNet is not trained on corrupted (masked) input, so it avoids the limitations of BERT’s data masking. The model also allows the joint probability of the predicted words to be calculated, eliminating the independence assumption made in BERT. Using the PLM (permutation language modeling) mechanism, XLNet learns bidirectional context by training over the possible permutations of the order in which the words of a sentence are predicted. For this purpose, the expected log-probability of the sequence is maximized over all possible permutations of the factorization order. To improve the network architecture, XLNet also integrates the segment recurrence mechanism and the relative positional encoding scheme of Transformer-XL. The resulting model achieves better results than BERT in NLP tasks.
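
The permutation language modeling objective can be written compactly. In the notation assumed here, x = (x_1, ..., x_T) is the input sequence, Z_T is the set of all permutations of the positions 1..T, z is one such permutation, z_t is its t-th element and z_{<t} denotes its first t-1 elements:

```latex
\max_{\theta} \; \mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T}
\left[ \sum_{t=1}^{T} \log p_{\theta}\!\left( x_{z_t} \mid \mathbf{x}_{\mathbf{z}_{<t}} \right) \right]
```

Because every position can appear late in some permutation, each word is eventually predicted from context on both its left and its right, while each individual factorization remains autoregressive.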

T5

The T5 (Text-to-Text Transfer Transformer) model saw the light of day in 2020, presented by a team at Google. It proposes transforming all NLP tasks into one unified text-to-text format, where input and output are always text strings. Using the text-to-text format means that the T5 model can be used for many NLP tasks with the same hyperparameters, loss function and decoding procedure. This approach can be successfully used to answer questions, generate abstractive summaries, solve classification tasks, perform natural language inference and even regression. To tell the model which task to perform, a text prefix specifying the task is added to the original input data. T5 uses the original Transformer structure. Like BERT, it uses MLM, with a slight modification: BERT replaces each masked word with a separate symbol, while T5 replaces a sequence of consecutive masked words with one symbol. The T5 model is available in as many as 5 sizes, each with a different number of parameters: T5-small (60 million parameters), T5-base (220 million parameters), T5-large (770 million parameters), T5-3B (3 billion parameters) and T5-11B (11 billion parameters). All these models were trained on approximately 1 trillion tokens. The unlabeled data comes from the C4 set (Colossal Clean Crawled Corpus), which contains approximately 750 GB of text; this corpus is a cleaned-up version of the Common Crawl corpus.
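
The way a run of consecutive masked words is replaced by one symbol can be sketched as follows. The function corrupt_spans and the sentinel names such as <S0> are hypothetical placeholders for illustration, and the spans are given by hand as (start, end) index pairs with an exclusive end, whereas a real pipeline would choose them at random.

```python
def corrupt_spans(tokens, spans):
    """Replace each (start, end) span with one sentinel; build the matching target."""
    inputs, targets = [], []
    cursor = 0
    for index, (start, end) in enumerate(sorted(spans)):
        sentinel = f"<S{index}>"            # hypothetical sentinel naming
        inputs.extend(tokens[cursor:start])
        inputs.append(sentinel)             # one symbol stands for the whole span
        targets.append(sentinel)
        targets.extend(tokens[start:end])   # the words the model must produce
        cursor = end
    inputs.extend(tokens[cursor:])
    return inputs, targets

tokens = "thank you for inviting me to your party last week".split()
inputs, targets = corrupt_spans(tokens, [(2, 4), (8, 9)])
print(" ".join(inputs))
print(" ".join(targets))
```

Both the corrupted input and the target are plain text strings, which is what lets the same text-to-text setup serve pre-training and downstream tasks alike.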

BART

BART is a denoising autoencoder designed for pre-training sequence-to-sequence (Seq2Seq) models, which take a sequence of elements and turn it into another sequence of elements. The model uses the standard Transformer architecture and can be seen as a generalization of BERT and GPT: it uses a bidirectional encoder and a one-way (left-to-right) decoder. The training data contains corrupted texts, which BART learns to reconstruct. There is complete freedom in choosing the method of introducing noise. The authors considered methods such as token masking, token deletion, sentence permutation, text infilling (where fragments of text are replaced with a single token) and document rotation (the text is rotated so that it starts with a randomly selected word). The model is particularly effective when it is fine-tuned for text generation. It also works well in NLU tasks and is useful for machine translation, question answering, text summarization and classification of sequences (sentences or tokens), among others.

Generating images and image data

Using AI systems to create images is already a well-known practice in the computer vision community. The development of image-generating systems has been made possible by deep learning and artificial intelligence techniques, which have led to numerous methods that achieve impressive results.

One of them is the super resolution technique, which increases the resolution of the image while maintaining its details. There are various approaches to solving this problem, and most rely on deep generative networks such as VAE and GAN. Text-to-image generators such as DALL-E from OpenAI and Imagen from Google are able to create photorealistic images, blurring the line between real and synthetic images. Synthetic images generated using the discussed models have become a promising perspective for supplementing, diversifying and creating representative data sets, offering very satisfactory results.

Image-generating models can be distinguished by their structure and architecture, and by the specific problems that particular varieties are meant to solve. Generative networks and autoencoders are very popular. They have a wide range of applications in image creation, but they do not offer much control over the generated images. Moreover, these models require a large database of reference images and are unable to link more than two concepts. An alternative to generative networks and autoencoders may be models using the Transformer structure and diffusion models.

Variational autoencoders (VAE)

One of the generative models capable of synthetically generating new data that are supposed to resemble those contained in the training set is a variational autoencoder. Its structure consists of two elements – an encoder and a decoder. The encoder’s task is to take input images and generate their representations, and the decoder’s task is to reconstruct the input data. However, appropriate modifications are introduced to the encoder that enable the generation of new images, rather than the reconstruction of the input images. The mechanism of operation of the autoencoder is based on learning the best encoding and decoding scheme. Due to the regularization structure of this model, it does not work well in generating a large amount of detail in the image. Autoencoders are useful when we want to reduce details in an image or detect anomalies in it. Another undoubted advantage is that, thanks to their simplicity, they can be trained quickly. They are also often used in combination with other techniques or as elements of more complex structures.

Generative adversarial networks (GANs)

The GAN mechanism includes two neural networks, a generator and a discriminator, which compete with each other in a zero-sum game. The generator creates new, plausible image examples by taking a vector from a selected random distribution and generating samples from it; these become the fake training examples for the discriminator. Both the samples created by the generator and the real data go to the discriminator, which learns to classify the examples provided to it as real or fake. The generator and the discriminator compete with each other and improve through so-called adversarial training. As training progresses, the discriminator becomes increasingly worse at distinguishing artificial from real images, and its accuracy decreases. When the discriminator is forced to guess, the generator is creating near-perfect imitations of images and the GAN has reached the so-called Nash equilibrium. The standard technique for training GAN models is based on gradient optimization in a high-dimensional real space. One problem with this technique is mode collapse, a situation in which the generator produces only a small variety of samples that differ little from one another. One of the improved variants of GAN is cGAN (conditional generative adversarial network), which allows for conditional image generation. In these networks, the generator and the discriminator are conditioned on auxiliary information supplied as additional input. In the field of super resolution, deep convolutional GANs are also popular, enabling the creation of larger, higher-quality images.
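
The competition between the two networks can be illustrated by the loss values alone, without training anything. The sketch below assumes the common binary cross-entropy formulation and the non-saturating variant of the generator loss; the discriminator outputs are invented numbers, and the function names are hypothetical.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: real samples should score 1, generated samples 0."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def generator_loss(d_fake):
    """Non-saturating generator loss: the generator wants d_fake close to 1."""
    return -math.log(d_fake)

# Hypothetical discriminator outputs (probability that a sample is real).
early = discriminator_loss(d_real=0.9, d_fake=0.1)      # discriminator wins easily
balanced = discriminator_loss(d_real=0.5, d_fake=0.5)   # discriminator can only guess
print(early, balanced, generator_loss(0.5))
```

When the discriminator outputs 0.5 for every sample, its loss equals 2 ln 2, which corresponds to the guessing situation described above.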

CycleGAN

CycleGAN is a neural network model that enables image-to-image translation without access to pairs of corresponding images from the two domains. The model is able to learn the relationship between these domains and transform the characteristics of one domain into those of the other. It learns two mappings that are meant to be inverses of each other, and therefore bijections. Importantly, CycleGAN does not need paired training examples for learning. To avoid the problem of mode collapse, an optimization term (the cycle-consistency loss) is used that encourages the two mappings to be bijective and mutually inverse. CycleGAN has many applications, such as transferring style, changing the season, or replacing the color, texture or contour of an object in an image. The model can also be used to turn paintings into photos.
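
The penalty that pushes the two mappings to be inverses of each other, usually called a cycle-consistency loss, can be sketched with toy functions in place of neural networks. Everything below is an illustrative assumption: the "domains" are plain numbers, G and F are hand-written functions, and the loss is a mean absolute error over a round trip.

```python
import numpy as np

def cycle_consistency_loss(x, y, G, F):
    """Mean absolute error after the round trips X -> Y -> X and Y -> X -> Y."""
    forward = np.abs(F(G(x)) - x).mean()
    backward = np.abs(G(F(y)) - y).mean()
    return forward + backward

# Toy mappings: G scales and shifts values, F_good undoes it, F_bad does not.
G = lambda x: 2.0 * x + 1.0
F_good = lambda y: (y - 1.0) / 2.0
F_bad = lambda y: y - 1.0

x = np.array([0.0, 1.0, 2.0])
y = G(x)
loss_good = cycle_consistency_loss(x, y, G, F_good)
loss_bad = cycle_consistency_loss(x, y, G, F_bad)
print(loss_good, loss_bad)
```

The loss is zero only when translating to the other domain and back returns the original input, which is why minimizing it favors mappings that are mutual inverses.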

StyleGAN

StyleGAN, as the name suggests, uses style transfer techniques. It consists of a mapping network and a synthesis network. The former maps the latent code to an intermediate vector, which is then subjected to affine transformations, creating vectors that define the styles of the generated image; these then go to the synthesis network, which creates new images. StyleGAN uses convolutional layers and adds some noise to them, which means that the photos it generates boast a high level of realism. It separates abstract features of images very well, which makes it suitable for merging high-level features of different photos.

Transformers

Transformers are based on an architecture similar to the previously mentioned autoencoders. They have an encoder and a decoder, and their distinguishing feature is the attention mechanism, which captures the dependencies between individual vectors. At first they were used for text data, but as their popularity grew, people began to look for ways to use this model for image data as well. Thanks to this, they now make it possible to combine both text and image data in one model.

Image GPT (iGPT)

This is an image completion model created by OpenAI. It is built on the Transformer structure and operates on sequences of pixels. Given the initial pixels, it is able to complete the image. To do this, the resolution of the images is reduced and they are turned into one-dimensional vectors. Models trained with BERT-style or GPT-style objectives then predict the next pixels. The resulting representations are evaluated using linear probing or fine-tuning. The former uses the model as a feature extractor, while the latter tunes the whole model to classify images. The undisputed advantage of iGPT is its impressive results, but they come at the cost of low computational efficiency.

DALL-E and DALL-E 2

Another model created by OpenAI is DALL-E, capable of generating images based on text descriptions. The creators of DALL-E report that their model contains 12 billion parameters and was trained on 250 million pairs of images and texts. For DALL-E, the input is the given text, based on which it generates images. The model is built on a Transformer structure using decoder blocks, but also contains elements of a variational autoencoder. By analyzing descriptions, the Transformer learns the correlation between the language and what the image represents, which lets it produce satisfactory results. After receiving a text description, it creates 512 images from which the best ones are selected. In 2022, DALL-E 2 was presented, which, thanks to a larger photo database and improved network architecture, is able to create more precise and realistic images. DALL-E 2 also implements the “zero-shot” technique, which allows generating images based on text descriptions that were not used during the model training process. This allows DALL-E 2 to generate images for descriptions it has never seen before. The CLIP model, consisting of a text encoder and an image encoder, is used to create an image representation based on the text description; for this step the authors used a diffusion model, which was experimentally found to work better. The representation is then passed to the decoder, which turns it into a synthetic image, again using a diffusion model. The decoder is a modified version of GLIDE, and the developers called the complete system unCLIP because it reverses the mapping learned by the CLIP image encoder. Despite the impressive results, the model was not initially made publicly available due to the creators’ concerns about its misuse. It is still being developed, especially in terms of neutralizing the potential negative effects of its use.

Diffusion models

The diffusion method is used to generate images from text descriptions. It is based on the assumption that it is possible to build a model that can reverse a process of systematic information decay and recover information from noise. The system is therefore designed to generate new data from pure noise: a forward diffusion process gradually adds noise to the data, and a reverse diffusion process progressively denoises it, transforming the noise back into data from the target distribution. In its basic form, the model generates data from a randomly sampled latent space. To supervise the image creation process, the model was extended by conditioning on classifier labels, which makes it possible to control what is generated. There are also diffusion models that do not require a classifier (e.g. GLIDE: Guided Language to Image Diffusion for Generation and Editing). In this case, the model learns a gradient vector that allows it to go through the subsequent stages of reverse diffusion. These models can be guided using text vectors. Another way to direct generation in diffusion models is CLIP guidance, one of the variants explored for GLIDE. It involves maximizing the dot product of the text and image vectors, which allows images to be steered by arbitrary text.
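
The gradual decay of information into noise can be sketched numerically. The example assumes the widely used Gaussian formulation with a linear noise schedule of 1000 steps; the schedule values, the stand-in data and the helper forward_diffusion are illustrative assumptions, and the learned reverse (denoising) process is not shown.

```python
import numpy as np

def forward_diffusion(x0, t, betas, rng):
    """Sample x_t from x_0 in closed form: signal shrinks, Gaussian noise grows."""
    alpha_bar = np.cumprod(1.0 - betas)[t]
    noise = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise, alpha_bar

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)        # assumed linear noise schedule
x0 = np.ones(10_000)                         # stand-in for image pixels

for t in (0, 499, 999):
    xt, alpha_bar = forward_diffusion(x0, t, betas, rng)
    print(t, round(float(alpha_bar), 4), round(float(xt.mean()), 3), round(float(xt.std()), 3))

signal_left = np.cumprod(1.0 - betas)[-1]    # share of the original signal at the end
```

By the last step almost none of the original signal remains and the sample is close to standard Gaussian noise; a generative diffusion model is trained to undo these steps one at a time.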

Imagen

Imagen was developed by a team of researchers at Google Brain as a response to DALL-E 2. It is also a model for generating realistic images from text descriptions, using language models and diffusion models. Imagen has a great track record and has achieved a state-of-the-art FID score in image generation (for FID, lower is better). Unlike DALL-E 2, Imagen uses the ready-made T5-XXL text encoder, which stays frozen while the generative model is trained. This allows the model to better understand the context of the descriptions, so the images generated by Imagen reflect the intentions contained in the text more precisely. The model generates images of increasing resolution using a sequence of conditional diffusion models and an improved neural network architecture called Efficient U-Net. The authors, like the developers of DALL-E 2, did not publicly release Imagen, fearing misuse of the technology. However, the creators hope that their work will serve as inspiration for the creation of further image generators.

Generating image data – methods

Text to image generation

Generating images in this way combines elements of NLP and image recognition. Its goal is to create an image that is as faithful to the text description as possible. This process requires a powerful generative model that can understand and create connections between different kinds of perceptual data. The model has to understand the input text and then match the terms used in it to the appropriate objects, while preserving the complexity of shapes and colors at the pixel level and handling the overlap of various objects or features. Building and developing precise AI capable of understanding such connections is extremely important in the context of overall development.

Generating tabular data

Methods of generating tabular data fall into two main categories: statistical techniques and machine learning methods. In both, synthetic data results from modeling real data sets. The created model is used to produce new values with statistical properties similar to those present in the real data. Having similar statistical properties means that the model must reproduce the distribution to such an extent that an analyst working with the synthetic data set obtains results similar to those from the real data; the conclusions should be the same in both versions. Without having the data available, but knowing what their distribution would look like, the analyst can generate a random sample from any probability distribution. How useful such a sample is depends on the knowledge available about the topic. Where real data is available, synthetic data is generated from the distributions that best fit the real data. It is worth noting that when building models that create data, one should focus not on the actual data itself but on the processes that lead to its creation. This makes it possible to achieve a probabilistic approximation of the real data that contains no identifying information while maintaining the properties of the original.
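
A very simple statistical approach of this kind can be sketched with NumPy. The example assumes that a multivariate normal distribution is an adequate model of the table, which is a strong simplification; the "real" table, its column meanings and all numeric values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a real table with two correlated numeric columns (invented values).
age = rng.normal(40, 10, size=5_000)
income = 1_000 * age + rng.normal(0, 5_000, size=5_000)
real = np.column_stack([age, income])

# Model the real data by its estimated mean vector and covariance matrix.
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# Draw new rows from the fitted distribution; no original row is copied.
synthetic = rng.multivariate_normal(mean, cov, size=5_000)

real_corr = np.corrcoef(real, rowvar=False)[0, 1]
synthetic_corr = np.corrcoef(synthetic, rowvar=False)[0, 1]
print(real_corr, synthetic_corr)
```

The synthetic rows are new draws, yet the correlation between the two columns is close to the one in the original table, so an analysis of that relationship would lead to similar conclusions on either data set.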

Monte Carlo method

The Monte Carlo method (MC) is one of the oldest and most frequently used statistical techniques. It denotes a class of computational algorithms that use repeated random sampling to model probabilistic systems. The aim is to approximate the likelihood of a random event by estimating the probabilities of its various outcomes. MC takes a range of values defined in the problem domain, together with given probability distributions, and builds a model of possible outcomes for any random variable. The calculation is repeated over a number of trials, and each draw uses different random values from the domain. The simulated sample makes it possible to estimate the expected value of the probabilistic component of a given process. This is justified by the law of large numbers, according to which the average of the results approaches the expected value as the number of repetitions of the experiment increases. During an MC simulation, the model generates multiple data sets that can be viewed as realistic variations of the original data set. MC is a simple choice when the goal is to model the probabilistic nature of a phenomenon that is too complex for other methods to be used easily, or at all. However, the Monte Carlo method can be less accurate than other methods, because realistic simulation results require precise empirical data to specify the problem domain.
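
A minimal Monte Carlo sketch in Python is shown here. The scenario (three tasks whose durations follow triangular distributions) and all numbers are hypothetical; the point is only to show repeated random sampling and the averaging that the law of large numbers justifies.

```python
import random

# (low, high, mode) in days for three hypothetical tasks; values are invented.
TASKS = [(2, 6, 3), (1, 4, 2), (3, 9, 5)]

def simulate_total_duration(rng):
    """One trial: draw a duration for every task and add them up."""
    return sum(rng.triangular(low, high, mode) for low, high, mode in TASKS)

def monte_carlo(n_trials, seed=0):
    """Repeat the trial many times and return all simulated totals."""
    rng = random.Random(seed)
    return [simulate_total_duration(rng) for _ in range(n_trials)]

totals = monte_carlo(20_000)

# The average of many trials approximates the expected total duration.
expected_duration = sum(totals) / len(totals)

# The share of trials above a threshold approximates the probability of that outcome.
prob_over_13_days = sum(t > 13 for t in totals) / len(totals)

print(f"estimated expected duration: {expected_duration:.2f} days")
print(f"estimated P(total > 13 days): {prob_over_13_days:.3f}")
```

For this toy model the exact expected value is known (the mean of a triangular distribution is the average of its three parameters, giving 35/3 days in total), so the estimate can be checked against it. In realistic problems no such closed form is available, which is where the simulation becomes useful.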

MCMC and Gibbs sampling

As a rule, classic MC methods create samples consisting of independent observations. Markov Chain Monte Carlo (MCMC) is a variant that generates sequences of dependent observations; as the name suggests, these sequences are Markov chains. Combining the Monte Carlo method with Markov chains makes it possible to sample from high-dimensional probability distributions while taking the dependencies between samples into account. Various algorithms define how the chains are constructed. One of them is Gibbs sampling, which draws each variable in turn from its conditional distribution. This is very useful when the joint distribution of the variables is unknown or difficult to sample from, but the conditional distribution of each variable is known. Sampling iteratively from these conditional distributions, each conditioned on the current values of the other variables, yields an approximate sample from the joint distribution.
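
Gibbs sampling can be sketched for a simple case in which the conditional distributions are known: two standard normal variables with correlation rho. In that case each variable, given the other, is normally distributed with mean rho times the other value and variance 1 - rho^2. The function names, the burn-in length and the chosen correlation are assumptions made for this illustration.

```python
import math
import random

def gibbs_bivariate_normal(rho, n_samples, burn_in=500, seed=0):
    """Sample (x, y) pairs by alternating draws from the two conditionals."""
    rng = random.Random(seed)
    cond_std = math.sqrt(1.0 - rho * rho)
    x, y = 0.0, 0.0
    samples = []
    for step in range(burn_in + n_samples):
        x = rng.gauss(rho * y, cond_std)  # draw x given the current y
        y = rng.gauss(rho * x, cond_std)  # draw y given the new x
        if step >= burn_in:               # discard the early, unconverged part
            samples.append((x, y))
    return samples

def correlation(pairs):
    """Plain Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs) / n
    var_y = sum((y - mean_y) ** 2 for _, y in pairs) / n
    return cov / math.sqrt(var_x * var_y)

samples = gibbs_bivariate_normal(rho=0.8, n_samples=20_000)
print(f"empirical correlation: {correlation(samples):.3f}")
```

The sampler never evaluates the joint distribution directly, yet the empirical correlation of the generated pairs should be close to the target value. Successive pairs are dependent on each other, which is the defining property of a Markov chain.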

Bayesian networks

Bayesian networks (BN) are probabilistic graphical models of the joint probability distribution of a set of variables. Each node represents a variable, and the edges between nodes represent the probabilistic relationships between them. A BN consists of two parts: a network structure and a set of local probability distributions. The network structure takes the form of a directed acyclic graph that encodes the conditional dependencies between variables. The local probability distributions give the conditional distribution of each variable given its parents in the graph. The joint probability distribution of a Bayesian network can be written with the chain rule as the product of these local distributions, which makes it possible to create samples factor by factor, i.e. from the conditional distributions. This factorization scales well to many dimensions and is computationally efficient. Bayesian networks provide good intuition for modeling and synthesizing population data. Despite their limitations, BNs are a popular method for generating synthetic data sets.
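
The chain rule and sampling from the conditional distributions can be sketched with a toy network of three variables: age group, employment (which depends on age) and car ownership (which depends on both). The structure and all probabilities are invented for illustration only.

```python
import random

# Hypothetical local distributions; every probability below is invented.
P_AGE = {"young": 0.4, "older": 0.6}
P_EMPLOYED_GIVEN_AGE = {"young": 0.7, "older": 0.5}  # P(employed=True | age)
P_CAR_GIVEN_AGE_EMPLOYED = {                         # P(car=True | age, employed)
    ("young", True): 0.5, ("young", False): 0.2,
    ("older", True): 0.8, ("older", False): 0.6,
}

def joint_probability(age, employed, car):
    """Chain rule: P(age) * P(employed | age) * P(car | age, employed)."""
    p_emp = P_EMPLOYED_GIVEN_AGE[age]
    p_car = P_CAR_GIVEN_AGE_EMPLOYED[(age, employed)]
    return (P_AGE[age]
            * (p_emp if employed else 1 - p_emp)
            * (p_car if car else 1 - p_car))

def sample_record(rng):
    """Draw one record by visiting nodes in topological order, parents first."""
    age = "young" if rng.random() < P_AGE["young"] else "older"
    employed = rng.random() < P_EMPLOYED_GIVEN_AGE[age]
    car = rng.random() < P_CAR_GIVEN_AGE_EMPLOYED[(age, employed)]
    return age, employed, car

rng = random.Random(1)
synthetic_population = [sample_record(rng) for _ in range(50_000)]

target = ("older", True, True)
exact = joint_probability(*target)
empirical = synthetic_population.count(target) / len(synthetic_population)
print(f"exact={exact:.3f}, empirical={empirical:.3f}")
```

Each synthetic record is produced from small local tables rather than from one large joint table, which is why the approach remains manageable as the number of variables grows.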

Generative adversarial networks (GANs)

Building generative models with neural networks, such as generative adversarial networks, appears to be a sound choice in terms of performance and flexibility in representing and creating realistic, high-quality synthetic data. As described earlier, a GAN consists of a generator and a discriminator. Most GAN research concerns images, but there is a growing tendency to look for applications in other areas, one of which is tabular data. At first glance, training models on images may seem more difficult than learning from structured data, but the algorithms needed to create tabular data can become complex. The complexity stems from the different data types that occur in tables. The models are trained on real samples and learn to approximate them in order to generate synthetic data. The challenge is to generate data that reproduces both the structural and the statistical properties of real data, but whose values are not obtained by direct observation of the generative process. This creates another problem: the quality of synthetic tabular data is difficult to assess. Despite these disadvantages, reported results indicate that GANs can produce realistic synthetic tabular data that can be used in selected business cases. Some GAN models have been created for specific applications.

table-GAN

table-GAN is a model built on the GAN architecture for table data synthesis, able to generate data with properties similar to those of real data. It was created to protect data and reduce the risk of potential data leakage. A third element, a classifier, has been added to the basic elements of a GAN, the generator and the discriminator. The discriminator tries to distinguish real records from synthetic ones, while the generator creates increasingly realistic records, which makes the discriminator’s task more difficult. The task of the classifier is to increase the semantic integrity of the generated records. It is trained on real data to learn the real correlations between labels and other table attributes, and it uses this knowledge to check whether the generated records are semantically correct. The discriminator can assess the semantic integrity of generated records on its own, but it can be fooled; the classifier guards against this. The authors reported that models trained on synthetic tables show results similar to those of models trained on real tables. This is a promising sign for the further development of the technology.

CTGAN

CTGAN (Conditional Tabular Generative Adversarial Network) is another model based on the GAN architecture. It introduces changes that address problems such as the presence of different data types, non-Gaussian and multimodal distributions, and imbalanced categorical attributes. Model training was extended with mode-specific normalization, a type of normalization that transforms continuous values of any range and distribution into a bounded vector representation suitable for neural networks. In addition, a conditional generator and training-by-sampling are used to overcome difficulties with imbalanced training data. The aim is for attribute categories to be sampled evenly during training while the actual distribution of the data can still be recovered. These changes allow CTGAN to produce high-quality synthetic tables.
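
The idea behind mode-specific normalization can be sketched as follows. This is a simplified illustration and not the CTGAN implementation: the modes of the column are assumed to be already known (CTGAN estimates them with a Gaussian mixture model, a step skipped here), and all parameter values and function names are hypothetical. Each value is represented by an indicator of the mode it is assigned to and by its scaled offset within that mode.

```python
import math
import random

# Hypothetical modes of one continuous column, assumed to be already estimated.
MODES = [
    {"weight": 0.6, "mean": 20.0, "std": 2.0},
    {"weight": 0.4, "mean": 80.0, "std": 5.0},
]

def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def encode(value, rng):
    """Return (alpha, one_hot): scaled offset within a sampled mode, and the mode indicator."""
    # Weighted density of the value under every mode.
    densities = [m["weight"] * normal_pdf(value, m["mean"], m["std"]) for m in MODES]
    # Pick a mode at random, proportionally to those densities.
    k = rng.choices(range(len(MODES)), weights=densities)[0]
    # Dividing by 4 standard deviations keeps alpha roughly within [-1, 1].
    alpha = (value - MODES[k]["mean"]) / (4.0 * MODES[k]["std"])
    one_hot = [1 if i == k else 0 for i in range(len(MODES))]
    return alpha, one_hot

def decode(alpha, one_hot):
    """Invert the encoding to recover the original value."""
    k = one_hot.index(1)
    return alpha * 4.0 * MODES[k]["std"] + MODES[k]["mean"]

rng = random.Random(0)
alpha, one_hot = encode(21.0, rng)
print(alpha, one_hot, decode(alpha, one_hot))
```

Values from very different ranges (around 20 and around 80 in this example) end up as small numbers of comparable size plus a mode indicator, which is easier for a neural network to work with than the raw column.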

CTAB-GAN

CTAB-GAN can model various types of data with complex distributions. In designing it, the authors again took the problems of earlier models into account. A classifier has been added to the conditional GAN framework, together with a classifier loss function that measures the discrepancy between the generated and predicted classes, which helps increase the semantic integrity of the generated records. A conditional encoding scheme has also been added, which allows efficient encoding of mixed variables and handles highly skewed distributions of continuous variables. Like table-GAN, CTAB-GAN consists of a generator, a discriminator and a classifier. Reported tests show that CTAB-GAN outperforms previous methods in modeling mixed variables and can generate imbalanced categorical variables and continuous variables with complex distributions. Further research resulted in an extended version, CTAB-GAN+. The task has not changed: the goal of the new algorithm is still to improve the quality of synthetic data in terms of machine learning utility and statistical similarity, and to implement differential privacy effectively in GAN training so that performance can be controlled at a given privacy level.

TimeGAN

A time series is a sequence of observations ordered in time. To model time series effectively, one needs to capture both the distribution of variables at each point in time and the complex dynamics over time. TimeGAN (Time Series Generative Adversarial Network) is a generative model designed to preserve temporal dynamics in synthetic data. It can generate realistic time series data from various domains, such as stock prices. The model uses four components: an embedding function, a recovery function, a sequence generator and a sequence discriminator. All components are trained jointly so that the model learns to encode features, generate representations, and iterate over time. The embedding and recovery functions provide the mapping between the feature space and the latent space, enabling the adversarial network to learn the underlying temporal dynamics of the data using lower-dimensional representations. The discriminator and generator operate in the latent space, while the latent dynamics of real and synthetic data are synchronized by a loss function. TimeGAN offers improvements over other models in generating realistic time series. Work on improving the model is still ongoing.

Selected tools for generating synthetic data

Pydbgen

Pydbgen is a good choice if you want your synthetic data to contain common variables with some degree of customization, but without reflecting major dependencies between them. It is a simple tool for randomly generating user-specified data such as first and last names, dates, times or license plate numbers. The data can be saved as a Pandas DataFrame, as an SQLite table in a database file, or as an MS Excel file. With a few lines of code, it is possible to create a data set of any size with tables filled with user-defined random data.

Faker

The Faker library generates artificial data that can be used to test applications, to run them when no real data is available, and to preserve the anonymity of participants. It was created to make generating data easier. Faker offers many methods that make it possible to generate data meeting specific requirements quickly and with little effort. It can also create different types of information in formats specific to particular countries.

Mimesis

This library is similar to pydbgen and Faker but more complete. It is a high-performance artificial data generator and is faster than Faker. It can create data related to people, food, transport, addresses, computer hardware and the like. It supports various locales, so that different types of information can be generated for specific countries. Mimesis also offers methods for creating context columns, which makes it a good tool for creating correct and at the same time diverse synthetic data sets.

Mesa

Mesa makes it possible to build agent-based models (agent-based modeling, ABM). Such modeling involves simulating the actions and interactions of agents in order to assess their impact on the system. Mesa enables the generation of synthetic data from complex scenarios by creating an artificial environment in which agents can interact with the environment and with each other. Agents can represent living cells, animals, individual people or organizations. The main aim is to gain explanatory insight into how agents behave under a given set of rules. The effects of interactions between agents are also of interest. Using Mesa's built-in core components or custom implementations, it is possible to build ABM models quickly and to visualize and analyze their results with tools available in Python. The DataCollector component of the datacollection module provides an easy way to collect the data generated by the models. In particular, it records tables of variables collected at the model level and at the agent level, i.e. values computed for the model or for each agent in its current state.
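
The concept of an agent-based model with data collected at the model and agent level can be sketched in plain Python. The sketch below deliberately does not use Mesa or its API; the class names, the wealth-exchange rule and all parameters are hypothetical and serve only to show the mechanics.

```python
import random

class WealthAgent:
    """An agent that starts with one unit of wealth."""
    def __init__(self, agent_id):
        self.agent_id = agent_id
        self.wealth = 1

    def step(self, model):
        # Rule: an agent that has money gives one unit to a randomly chosen agent.
        if self.wealth > 0:
            other = model.rng.choice(model.agents)
            other.wealth += 1
            self.wealth -= 1

class WealthModel:
    """Holds the agents, advances the simulation and records data."""
    def __init__(self, n_agents, seed=0):
        self.rng = random.Random(seed)
        self.agents = [WealthAgent(i) for i in range(n_agents)]
        self.step_count = 0
        self.agent_records = []  # agent-level table: (step, agent_id, wealth)
        self.model_records = []  # model-level table: (step, max_wealth)

    def collect(self):
        for agent in self.agents:
            self.agent_records.append((self.step_count, agent.agent_id, agent.wealth))
        richest = max(agent.wealth for agent in self.agents)
        self.model_records.append((self.step_count, richest))

    def step(self):
        order = self.agents[:]
        self.rng.shuffle(order)  # activate agents in random order
        for agent in order:
            agent.step(self)
        self.step_count += 1
        self.collect()

model = WealthModel(n_agents=50)
for _ in range(100):
    model.step()

print(model.model_records[-1])
```

The two record lists play the role of the model-level and agent-level tables described above, and they are the synthetic data set that the simulation produces.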

DataSynthesizer

DataSynthesizer takes an input data set and generates a structurally and statistically similar synthetic data set with a strong privacy guarantee. The system consists of three modules: DataDescriber, DataGenerator, and ModelInspector. First, DataDescriber processes the input set. For each categorical attribute, it computes the frequency distribution of values, represented as a bar chart (histogram), from which DataGenerator later draws samples when creating the synthetic set. DataGenerator uses a file that stores the attribute domains and estimated distributions inferred from the real data. It offers three modes of data generation: random, independent and correlated. ModelInspector checks the similarity between the input and output sets. The user specifies the size of the output.
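
The describe, generate and inspect steps can be sketched in plain Python for the simplest case, in which every attribute is sampled independently from its own frequency distribution. This is not the DataSynthesizer API; the function names and the input table are hypothetical, and privacy mechanisms are left out entirely.

```python
import random
from collections import Counter

# Hypothetical input table with two categorical attributes; values are invented.
real_rows = (
    [{"city": "Krakow", "plan": "basic"}] * 5
    + [{"city": "Krakow", "plan": "premium"}] * 1
    + [{"city": "Warsaw", "plan": "basic"}] * 2
    + [{"city": "Warsaw", "plan": "premium"}] * 1
    + [{"city": "Gdansk", "plan": "premium"}] * 1
)

def describe(rows):
    """Compute the frequency distribution (histogram) of every attribute."""
    attributes = list(rows[0].keys())
    return {attr: Counter(row[attr] for row in rows) for attr in attributes}

def generate(description, n_rows, seed=0):
    """Sample every attribute independently from its histogram."""
    rng = random.Random(seed)
    columns = {
        attr: rng.choices(list(counts), weights=list(counts.values()), k=n_rows)
        for attr, counts in description.items()
    }
    return [{attr: columns[attr][i] for attr in columns} for i in range(n_rows)]

def compare_share(real, synthetic, attr, value):
    """Share of rows with a given attribute value in the real and synthetic sets."""
    real_share = sum(row[attr] == value for row in real) / len(real)
    synthetic_share = sum(row[attr] == value for row in synthetic) / len(synthetic)
    return real_share, synthetic_share

description = describe(real_rows)
synthetic_rows = generate(description, n_rows=5000)
real_share, synthetic_share = compare_share(real_rows, synthetic_rows, "city", "Krakow")
print(f"share of Krakow: real={real_share:.2f}, synthetic={synthetic_share:.2f}")
```

Because the attributes are sampled independently, the per-attribute frequencies are preserved but any relationship between city and plan is lost. Preserving such relationships requires modeling the dependencies between attributes.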

Synthetic Data Vault

Synthetic Data Vault is a collection of libraries for generating synthetic data. It allows the user to model single-table, multi-table, and even time-series data sets, and then create synthetic data with the same format and statistical properties as the real data sets. It builds these sets using statistical methods and deep learning models such as GANs, in particular CTGAN. The environment can handle files containing mixed data types and missing values. Synthetic Data Vault also offers a set of tools for running data set generators, applying them to multiple data sets, and applying dedicated metrics to assess the quality of the generated data.
