LICENSING DATASETS FOR AI TRAINING
Publication date: October 23, 2025
DATASETS AND SYNTHETIC DATA
Creating and developing an AI model requires a very large amount of data. Once the data has been input, the model analyzes it, performs calculations, and draws conclusions that inform its future operations. By analogy with humans, one could say that this is how AI “learns”: the system is trained on numerous examples and extracts patterns from them, which allows it to predict correct solutions. This process is called “AI training”.
Data is currently so expensive and difficult to access that, according to some estimates, it may be in short supply by 2032. One answer to these problems is synthetic data. This data is generated by AI itself, which takes parameters from real-world data and randomly generates further scenarios. These scenarios are designed to faithfully reproduce the properties, complexities, and relationships observed in the original data from which they were generated.
Synthetic data carries certain risks. The main one is that it may be built on erroneous assumptions drawn from the real data. When new real data is introduced continuously, earlier errors can be corrected and the AI can “unlearn” them. If, however, the first set of synthetic data is “contaminated”, every subsequent set generated from it will contain the same erroneous information. The undoubted advantage of synthetic data is that it can be used to generate scenarios that occur very rarely in reality or not at all. It is used in industries such as automotive (simulating traffic scenarios), finance (detecting fraud), and healthcare (detecting and treating rare conditions).
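The mechanism described above, and the contamination risk in particular, can be illustrated with a minimal sketch. It assumes the simplest possible generator: a normal distribution whose parameters (mean and standard deviation) are estimated from the source data. The function names, the example values, and the +10 recording error are hypothetical and chosen only for illustration; real synthetic data generators are far more complex.

```python
import random
import statistics


def fit_parameters(samples):
    """Estimate the parameters (mean, standard deviation) of the data."""
    return statistics.mean(samples), statistics.stdev(samples)


def generate_synthetic(samples, n, rng):
    """Draw n new values from a normal distribution fitted to samples."""
    mean, stdev = fit_parameters(samples)
    return [rng.gauss(mean, stdev) for _ in range(n)]


def generations(seed_data, rounds, n, rng):
    """Generate each round only from the output of the previous round."""
    data = seed_data
    means = []
    for _ in range(rounds):
        data = generate_synthetic(data, n, rng)
        means.append(statistics.mean(data))
    return means


rng = random.Random(0)

# Hypothetical "real" measurements, e.g. vehicle speeds in km/h.
real = [rng.gauss(50.0, 5.0) for _ in range(1000)]

# The same measurements with a systematic recording error of +10.
contaminated = [x + 10.0 for x in real]

clean_means = generations(real, rounds=5, n=1000, rng=rng)
biased_means = generations(contaminated, rounds=5, n=1000, rng=rng)

print("means from clean data:       ", [round(m, 1) for m in clean_means])
print("means from contaminated data:", [round(m, 1) for m in biased_means])
```

In this sketch the synthetic data generated from the contaminated source stays close to the erroneous value in every round, because no new real data is introduced that could correct it.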