Publication date: October 23, 2025
DATASETS AND SYNTHETIC DATA
Creating and developing an AI model requires an enormous amount of data. Once the data is input, the model analyzes it, performs calculations, and draws conclusions that inform its future operations. Comparing this process to human cognition, one could say that AI “learns” in this way: it is trained on numerous examples and extracts patterns from them, allowing it to predict correct solutions. This process is called “AI training”. High-quality data is currently so expensive and difficult to access that, by some estimates, it may be in short supply by 2032.

The answer to these problems is synthetic data. Such data is generated by the AI itself, which takes parameters estimated from real-world data and randomly generates new scenarios designed to faithfully reproduce the properties, complexity, and relationships observed in the original data.

This approach carries certain risks, chief among them generating synthetic data from erroneous assumptions drawn from the real data. When new real data is continuously introduced, earlier errors can be corrected and the model can “unlearn” them; if, however, the first generation of synthetic data is “contaminated”, each subsequent generation will also carry the erroneous information. The undoubted advantage of synthetic data is that it can be used to generate scenarios that occur rarely, or not at all, in reality. This is exploited in industries such as automotive (simulating traffic scenarios), finance (detecting fraud), and healthcare (detecting and treating rare conditions).
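The parametric generation described above can be sketched in a few lines. This is a deliberately minimal illustration, assuming the real data follows a simple Gaussian distribution whose parameters are estimated from the original records; production systems use far richer generative models.

```python
import random
import statistics

def fit_parameters(real_data):
    """Estimate simple distribution parameters (mean, std dev) from real-world data."""
    return statistics.mean(real_data), statistics.stdev(real_data)

def generate_synthetic(real_data, n, seed=0):
    """Sample n new records that mimic the statistical properties of the originals."""
    mu, sigma = fit_parameters(real_data)
    rng = random.Random(seed)  # seeded for reproducibility
    return [rng.gauss(mu, sigma) for _ in range(n)]

# Hypothetical real-world measurements (e.g., heights in cm)
real = [172.0, 168.5, 181.2, 175.4, 169.9, 178.3, 174.1, 171.6]
synthetic = generate_synthetic(real, 1000)
```

Note the risk described above is visible even here: if `real` contains errors, every synthetic sample inherits them through the fitted parameters.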
Copyright
Training AI requires vast amounts of data. The resources used for this purpose are often copyrighted works within the meaning of the Act of 4 February 1994 on Copyright and Related Rights (Journal of Laws of 2025, item 24, as amended). Legal acts issued by the European Union also explicitly refer to the use of data for the purposes of AI training – the most important of which is Directive (EU) 2019/790 of the European Parliament and of the Council of 17 April 2019 on copyright and related rights in the Digital Single Market and amending Directives 96/9/EC and 2001/29/EC (OJ EU L 130 of 2019, p. 92), which was implemented into Polish law by the Act of 26 July 2024 amending the Act on Copyright and Related Rights, the Act on the Protection of Databases and the Act on Collective Management of Copyright and Related Rights (Journal of Laws, item 1254). The new regulations introduced exceptions allowing the use of copyrighted data for the purposes of AI training.
Article 6, Section 1, Item 22 of the Copyright Act explicitly addresses this issue, introducing a statutory definition according to which text and data mining is the analysis of texts and data, exclusively using automated techniques for analyzing texts and data in digital form, in order to generate specific information, including, in particular, patterns, trends, and correlations. The concept of mining thus refers to the use of an automated tool – a computer program based on generative technology. The word “exclusively” does not mean that a human (e.g., someone who issues commands to the AI) cannot participate in the process. The legislator used the word “analysis” functionally: it refers to accessing specific material and processing it to generate new products. The purpose of mining is to obtain new knowledge from existing databases.[1]
The above definition is used in Article 26(3), Section 1 of the Copyright Act, which provides that it is permitted to reproduce disseminated works for the purpose of text and data mining, unless the rightholder has stipulated otherwise. This means that using copyrighted works to train AI falls within the scope of fair use. For the provision in question, the purpose of the mining is irrelevant – what matters is only that it is lawful. Mining may be carried out for both commercial and non-commercial purposes; the only distinction arises when AI is trained by scientific and cultural institutions. To the extent that they provide research services commercially, the aforementioned provision applies. For non-commercial research, these institutions may rely on Article 26(2); the practical difference is that in such a situation the rightholder cannot make the reservation referred to in Article 26(3), Section 1. That reservation is a declaration of intent by the rightholder within the meaning of the Copyright Act, by which they can exclude a specific aspect of permitted use. Given its nature, the declaration can take various forms; what matters is that the rightholder can prove the declaration was made and that it reached the intended recipient.
Fair use, as referred to in the Polish Copyright Act, is neither subjectively nor objectively limited – it applies to all texts or data that constitute “works” within the meaning of Article 1. The term “disseminated works” should not be understood as all works available to the public; it refers to works to which access has been granted with the consent of the rightholder or a person authorized by them. It is therefore not possible to use materials made available illegally, that is, without the author’s consent. The provision on mining permits only “reproduction”, which covers various forms of fixation of the work. Fixation may consist of embodying the work in a physical object, for example saving it on a flash drive, but fixation in cloud storage, or in another way that does not involve physical copies, is also possible. Making such fixed copies publicly available, however, is excluded, and copying works in quantities beyond what the mining requires exceeds the limits of fair use.[2] It is also worth referring to Article 35 of the Act, which states that fair use must not infringe upon the normal exploitation of the work or the legitimate interests of the creator.
Analogous provisions, likewise implementing the aforementioned directive, were also introduced into the Polish Act of 27 July 2001 on the Protection of Databases (consolidated text: Journal of Laws of 2024, item 1769). It is therefore important to outline the differences between the two database protection regimes.
A database can therefore be protected only under copyright law (if it is creative in nature and its creation did not require significant investment), only under the Database Protection Act (if its creation required significant investment but it is not creative in nature), or under both regimes (if it is creative in nature and its creation required significant investment). Referring to the concept of synthetic data discussed above, it should be noted that synthetic data, as products of artificial intelligence, are not works within the meaning of the Copyright Act and are therefore not subject to its protection. The opposite may be true for synthetic databases – if such a database is creative in nature and created by a human, it will be a work within the meaning of the Act. Under the Database Protection Act regime, by contrast, neither creative character nor human involvement in the creation of the database is required – significant investment is sufficient.
GDPR
If personal data are used to train artificial intelligence, Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation, GDPR) will apply. In such a situation, a legal basis for processing is required under Article 6 (e.g., the data subject’s consent) or Article 9 (for so-called special categories of data). Wherever possible, the entity training the AI should process the data entered into the model in a way that prevents the identification of a natural person (e.g., in anonymised or pseudonymised form).
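As an illustration of pseudonymisation before training, the sketch below replaces a direct identifier with a salted hash. The field names (`name`, `subject_id`) and salt handling are hypothetical; note that under the GDPR, pseudonymised data remain personal data as long as a key allowing re-identification (here, the salt) exists anywhere.

```python
import hashlib

def pseudonymise(record, salt):
    """Return a copy of the record with the direct identifier replaced by a
    salted SHA-256 hash, so the training pipeline never sees the name."""
    out = dict(record)  # copy; the original record is left untouched
    name = out.pop("name")
    digest = hashlib.sha256((salt + name).encode("utf-8")).hexdigest()
    out["subject_id"] = digest[:16]  # truncated hash as a stable pseudonym
    return out

# Hypothetical patient record
patient = {"name": "Jan Kowalski", "age": 54, "diagnosis": "hypertension"}
pseudonymised = pseudonymise(patient, salt="keep-this-secret")
```

Whoever holds the salt can re-link pseudonyms to people, which is precisely why pseudonymisation, unlike true anonymisation, does not take the data outside the GDPR.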
Data Act
Regulation (EU) 2023/2854 of the European Parliament and of the Council of 13 December 2023 on harmonised rules on fair access to and use of data and amending Regulation (EU) 2017/2394 and Directive (EU) 2020/1828 (Data Act) introduces numerous new regulations that enable data acquisition. The Data Act is crucial for the process of training artificial intelligence because it affects how data is acquired, shared, and used in machine learning. Its provisions regulate, among other things, fair access to and the sharing of data generated by connected products and related services.
Artificial Intelligence Act (AI Act)
Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 laying down harmonised rules on artificial intelligence (“Artificial Intelligence Act”) will be of key importance for the development of AI (including model training) in the Member States of the European Union in the coming years. The aim of this act is to increase the competitiveness of European companies in relation to foreign technology giants, while respecting the rights and values of members of the European community.
First, it is worth mentioning that the regulation imposes on providers of general-purpose AI models (e.g., ChatGPT, Gemini) an information obligation covering the data used for AI training and validation (validation data being the real-world data used to evaluate the model and correct its errors, including errors propagated from synthetic data). This obligation covers, among other things, the type and provenance of the data, its scope and characteristics, and the measures taken to detect unsuitable data sources that may lead to erroneous synthetic data or model bias. This information is provided to the AI Office and the competent national authorities upon request (Annex XI in conjunction with Article 53(1)(a)). Providers are also required, among other things, to supply downstream providers with information enabling efficient integration (Article 53(1)(b)) and to publicly disclose a summary of the data used to train the model (discussed below).
Moreover, the AI Act strengthens copyright protection for works that may be used to train artificial intelligence, primarily in the context of the rightholder’s objection described above. Article 26(3), Section 2 of the Copyright Act, implementing the Directive on Copyright in the Digital Single Market, introduces the so-called opt-out mechanism: for works made publicly available, the objection is expressed in a machine-readable format (i.e., using metadata). The idea is that, upon encountering a work equipped with an opt-out signal indicating that the rightholder objects to the use of its content for AI training, the AI system itself will know that it cannot use that work. This regulation (which, after all, arises from EU law) is referenced in Article 53(1)(c) of the AI Act. Under this provision, providers of general-purpose AI models are required to put in place a policy to comply with EU law on copyright and related rights, in particular with a view to identifying and complying with reservations of rights, including through state-of-the-art technologies (i.e., opt-out mechanisms using metadata).
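A simplified sketch of how a crawler might honour such a machine-readable reservation. The `tdm-reservation` HTTP header is modelled on the W3C TDM Reservation Protocol community specification, but real-world opt-out signals vary (robots.txt rules, HTML metadata), and the parsing below is deliberately reduced to the core idea.

```python
def may_mine(http_headers):
    """Return False if the rightholder has reserved text-and-data-mining
    rights via a machine-readable opt-out header; True otherwise."""
    # "1" signals that TDM rights are reserved (opt-out); absence or "0"
    # means no reservation was expressed through this channel.
    value = http_headers.get("tdm-reservation", "0").strip()
    return value != "1"

# Hypothetical responses from two web pages
page_with_optout = {"content-type": "text/html", "tdm-reservation": "1"}
page_without = {"content-type": "text/html"}
```

A compliant pipeline would run a check like this before any copy of the work is fixed for mining, since the reservation removes the work from the scope of the fair-use exception described above.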
An important requirement is also introduced by Article 53(1)(d) of the AI Act. Under this provision, providers of general-purpose AI models are required to prepare and make publicly available a summary of the content used to train the model, in accordance with a template provided by the AI Office. The intent of this requirement is to improve the transparency of training datasets and to facilitate the exercise and enforcement of rights under EU law by parties with a legitimate interest, including copyright holders. The summary should list the datasets used by the model, such as public datasets, licensed private data, content scraped from the internet, and user data. According to the European Commission’s guidelines, the information in the summary should cover the data used at all stages of training, from initial to final, including model alignment and fine-tuning. Interestingly, the template requires disclosure of a summary list of the most popular internet domains; it does not, however, require disclosure of details of the specific data and works used to train the model, as this would go beyond the requirement of Article 53(1)(d), which concerns only a “summary” – one that, according to Recital 107 of the AI Act, must be “comprehensive and sufficiently detailed” but not “technically detailed”, since such details are trade secrets. To protect them, the template requires varying levels of detail depending on the data source (e.g., limited disclosure for licensed data). The summary must also be updated whenever the provider introduces new data into the training process – at least every six months, or immediately if the new data is particularly significant. It must be made publicly available (e.g., on the model provider’s website) at the latest when the model is placed on the EU market.
Existing models must comply with the above-mentioned regulations by August 2, 2027 – otherwise, the AI Office will be able to impose fines of up to 3% of the company’s global annual turnover.
One way in which the regulation seeks to increase the competitiveness of AI models trained in Europe is the rule that any entity offering an AI model in Europe must train it in accordance with EU law, regardless of the territory in which the training takes place. The AI Act also prohibits, in Article 5(1)(e), the placing on the market or putting into service of AI systems that create or expand facial recognition databases through untargeted scraping of facial images from the internet or CCTV footage. This is intended to strengthen the protection of personal data in the training of artificial intelligence models.
AI Training Dataset Licensing
The primary legal act regulating the protection and licensing of databases is the Copyright and Related Rights Act. As mentioned earlier, if a database meets the characteristics of a work within the meaning of Article 3 of the Act (i.e., its selection, arrangement, or composition is creative in nature), it will be subject to the provisions of this Act, including the provisions governing the license agreement. Databases for AI training may also fall under this scope if their creative, individual character is demonstrated (in Polish case law, it is very easy to demonstrate such character and, consequently, to recognize a given collection as a work). Pursuant to Article 66, paragraph 1, a license agreement entitles the licensee to use the work for a period of five years in the territory of the country where the licensee has its registered office or place of residence, unless the agreement provides otherwise. Importantly, the license agreement does not transfer the copyright to the licensee. The creator may grant authorization to use the work in the fields of exploitation specified in the agreement, specifying the scope, location, and duration of such use – thus, it is possible to restrict the use of data from the database only to AI training. Unless the agreement provides otherwise, the licensee may not authorize another person to use the work within the scope of the obtained license. An exclusive license agreement (i.e., one that reserves the exclusive right to use the work in a specific manner) must be in writing under pain of nullity. If no exclusive license agreement is concluded, the granting of a license does not limit the creator’s authorization to use the work by others in the same field of exploitation.
The Database Protection Act, on the other hand, does not regulate license agreements at all. It might seem, therefore, that if a database is not individually creative in nature and is not subject to copyright law, the Copyright Act’s licensing provisions do not apply to it either. In practice, access to such databases is granted to companies training AI models under a so-called innominate (unnamed) agreement, structured similarly to a license agreement under the Copyright Act.
Because of the limited amount of data available under them, open licenses are generally not used as a source of AI training data. Such licenses effectively constitute an offer by the copyright holder to make the work (or database) available to everyone on the terms set out in the specific license, and the works are available on identical terms to all interested parties. The license agreement is concluded implicitly (per facta concludentia), by the licensee simply commencing use of the collections covered by the license.
TRAINED DATASETS
Trained datasets are, in a sense, databases composed of the “results” of AI training. For example, by feeding the model medical data on patients with a specific disease, the model can determine the most common causes and symptoms of the disease (generate statistics) and suggest the most effective treatment options. These products of AI training, resulting from its calculations, constitute trained datasets. As AI outputs, they are not themselves subject to copyright protection, but once a human arranges them into an individual, creative system or set, the result can constitute a database within the meaning of the Copyright Act, and thus be subject to protection and licensing. As with synthetic data, if a database of trained data is created with significant investment, it will be protected under the Database Protection Act.
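To make the medical example concrete, the hypothetical sketch below aggregates raw patient records into derived symptom statistics – a toy stand-in for the kind of “results” that make up a trained dataset. All field names and records are invented for illustration.

```python
from collections import Counter

def build_trained_dataset(records):
    """Aggregate raw patient records into derived statistics: the relative
    frequency of each symptom across the cohort."""
    symptom_counts = Counter(s for r in records for s in r["symptoms"])
    total = len(records)
    return {symptom: count / total for symptom, count in symptom_counts.items()}

# Hypothetical raw input data (the material discussed in the copyright sections)
records = [
    {"symptoms": ["cough", "fever"]},
    {"symptoms": ["cough"]},
    {"symptoms": ["fever", "fatigue"]},
]

# The derived output - a candidate "trained dataset"
stats = build_trained_dataset(records)
```

The legal point above maps onto the code: `records` is the (potentially protected) input, while `stats` is a machine-generated derivative that only gains copyright-style protection once a human shapes it into a creative arrangement, or database-right protection if its compilation involved significant investment.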
Sources:
[1] A. Niewęgłowski [in:] Copyright Law. Commentary, 2nd ed., Warsaw 2025, Article 6.
[2] A. Niewęgłowski [in:] Copyright Law. Commentary, 2nd ed., Warsaw 2025, Article 26(3).