<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>AI TRAINING - KIELTYKA GLADKOWSKI LEGAL | CROSS BORDER POLISH LAW FIRM RANKED IN THE LEGAL 500 EMEA SINCE 2019</title>
	<atom:link href="https://www.kg-legal.eu/info/tag/ai-training/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.kg-legal.eu/info/tag/ai-training/</link>
	<description>KIELTYKA GLADKOWSKI LEGAL &#124; CROSS BORDER POLISH LAW FIRM RANKED IN THE LEGAL 500 EMEA SINCE 2019</description>
	<lastBuildDate>Tue, 28 Oct 2025 20:18:58 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	
	<item>
		<title>LICENSING DATASETS FOR AI TRAINING</title>
		<link>https://www.kg-legal.eu/info/it-new-technologies-media-and-communication-technology-law/licensing-datasets-for-ai-training/</link>
					<comments>https://www.kg-legal.eu/info/it-new-technologies-media-and-communication-technology-law/licensing-datasets-for-ai-training/#respond</comments>
		
		<dc:creator><![CDATA[jakub]]></dc:creator>
		<pubDate>Tue, 28 Oct 2025 20:18:58 +0000</pubDate>
				<category><![CDATA[IT, NEW TECHNOLOGIES, MEDIA AND COMMUNICATION TECHNOLOGY LAW]]></category>
		<category><![CDATA[AI TRAINING]]></category>
		<category><![CDATA[DATASETS]]></category>
		<category><![CDATA[LICENSING DATASETS]]></category>
		<category><![CDATA[SYNTHETIC DATA]]></category>
		<guid isPermaLink="false">https://www.kg-legal.eu/?p=8448</guid>

					<description><![CDATA[<p>Publication date: October 23, 2025 DATASETS AND SYNTHETIC DATA Creating and developing an AI model requires an unimaginable amount of data. Once input, the model analyzes the information, performs calculations, and draws conclusions based on this data that informs its future operations. Comparing this process to that of humans, one could say that AI &#8220;learns&#8221; [&#8230;]</p>
<p>The article <a href="https://www.kg-legal.eu/info/it-new-technologies-media-and-communication-technology-law/licensing-datasets-for-ai-training/">LICENSING DATASETS FOR AI TRAINING</a> originally appeared on <a href="https://www.kg-legal.eu">KIELTYKA GLADKOWSKI LEGAL | CROSS BORDER POLISH LAW FIRM RANKED IN THE LEGAL 500 EMEA SINCE 2019</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p><strong><mark><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-vivid-cyan-blue-color">Publication date: October 23, 2025</mark></mark></strong></p>



<p><strong>DATASETS AND SYNTHETIC DATA</strong></p>



<p>Creating and developing an AI model requires an enormous amount of data. Once the data are input, the model analyzes the information, performs calculations, and draws conclusions that inform its future operations. By analogy with humans, one could say that AI &#8220;learns&#8221; in this way: AI systems are trained on numerous examples and derive patterns from them, allowing them to predict correct solutions. This process is called &#8220;AI training&#8221;. Data are currently so expensive and difficult to access that, by some estimates, suitable training data may be in short supply by 2032. One answer to this problem is <strong>synthetic data</strong>: data generated by the AI itself, which takes parameters from real-world data and randomly generates new scenarios. These scenarios are designed to faithfully reproduce the properties, complexities, and relationships observed in the original data from which they were derived. Certain risks are involved. Above all, synthetic data may be generated from erroneous assumptions drawn from real data. With the continuous introduction of new real data, previous errors can be corrected and the model can &#8220;unlearn&#8221; them; if, however, the first generation of synthetic data is &#8220;contaminated&#8221;, each subsequent generation will also contain the erroneous information. The undoubted advantage of synthetic data is that it can be used to generate scenarios that occur rarely or not at all in reality. This is exploited in industries such as automotive (simulating traffic scenarios), finance (detecting fraud), and healthcare (detecting and treating rare conditions).</p>
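As a loose illustration of the process described above (my sketch, not from the article; all names are hypothetical), synthetic data can be produced by estimating simple parameters from real records and then sampling new values that reproduce the fitted distribution:

```python
import random
import statistics

def fit_parameters(real_values):
    """Estimate mean and standard deviation from real-world data."""
    return statistics.mean(real_values), statistics.stdev(real_values)

def generate_synthetic(mean, stdev, n, seed=0):
    """Sample n synthetic values that mimic the fitted distribution."""
    rng = random.Random(seed)  # fixed seed: reproducible scenarios
    return [rng.gauss(mean, stdev) for _ in range(n)]

# Hypothetical real-world measurements (e.g. anonymized sensor readings)
real = [71.2, 68.5, 74.1, 69.9, 72.3, 70.4]
mu, sigma = fit_parameters(real)
synthetic = generate_synthetic(mu, sigma, n=1000)
```

Note that the sketch also shows the risk mentioned above: if `real` contains erroneous values, every synthetic sample inherits the flawed parameters.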



<span id="more-8448"></span>



<h2 class="wp-block-heading"><strong>LEGAL ISSUES RELATED TO AI TRAINING</strong></h2>



<p><strong>Copyright</strong></p>



<p>Training AI requires vast amounts of data. The resources used for this purpose are often copyrighted works within the meaning of <strong>the Act </strong>of 4 February 1994 <strong>on Copyright and Related Rights </strong>(Journal of Laws of 2025, item 24, as amended). Legal acts issued by the European Union also explicitly address the use of data for AI training purposes &#8211; the most important being <strong>Directive (EU) </strong>2019/790 of the European Parliament and of the Council of 17 April 2019 <strong>on copyright and related rights in the Digital Single Market </strong>and amending Directives 96/9/EC and 2001/29/EC (OJ EU L 130 of 2019, p. 92), which was implemented into Polish law by the Act of 26 July 2024 amending the Act on Copyright and Related Rights, the Act on the Protection of Databases and the Act on Collective Management of Copyright and Related Rights (Journal of Laws, item 1254). The new regulations introduced exceptions allowing the use of copyrighted data for AI training purposes.</p>



<p>Article 6, Section 1, Item 22 of the Copyright Act explicitly addresses this issue, introducing <strong>a statutory definition of AI training</strong>, according to which it is <strong>the analysis of texts and data exclusively using automated techniques for analyzing texts and data in digital form in order to generate specific information, including, in particular, patterns, trends, and correlations</strong>. The Act calls this process <strong>text and data mining</strong>. The notion of mining refers exclusively to the use of an automated tool &#8211; a computer program based on generative technology. The word &#8220;exclusively&#8221; does not mean that a human (e.g., someone who issues commands to the AI) cannot participate in the process. The legislator used the word &#8220;analysis&#8221; functionally: it refers to accessing specific material and processing it in order to generate new products. The purpose of mining is to obtain new knowledge from existing databases.<a href="#_ftn1" id="_ftnref1">[1]</a></p>



<p>The above definition is used in Article 26<sup>3 </sup>Section 1 of the Copyright Act, which provides that <strong>it is permitted to reproduce disseminated works for the purpose of text and data mining, unless the rightholder has stipulated otherwise</strong>. This means that <strong>using copyrighted works to train AI falls within the scope of fair use</strong>. For this provision, the purpose of the mining is irrelevant, provided it is lawful: mining may be carried out for both commercial and non-commercial purposes. The only distinction arises for AI training by scientific and cultural institutions. To the extent that such institutions provide research services commercially, the aforementioned provision applies. For non-commercial research, they may rely on Article 26<sup>2 </sup>&#8211; <strong>the practical difference being that in that situation the rightholder cannot submit the reservation referred to in Article 26<sup>3</sup>, Section 1</strong>. The reservation is a declaration of intent made by the rightholder within the meaning of the Copyright Act, by which they can exclude a specific aspect of permitted use. Given its nature, the declaration can take various forms; what matters is that the rightholder can prove that the declaration was made and that it reached the intended recipient.</p>



<p>Fair use, as referred to in the Polish Copyright Act, is limited neither subjectively nor objectively &#8211; it applies to all texts or data that constitute &#8220;works&#8221; within the meaning of Article 1. <strong>The term &#8220;disseminated works&#8221; should not be understood as all works available to the public. </strong>It refers to works to which access is granted with the consent of the rightholder or a person authorized by them &#8211; so it is not possible to use materials made available illegally, that is, without the author&#8217;s knowledge. <strong>The provision on text and data mining permits only &#8220;reproduction&#8221;</strong>, which covers various forms of fixing the work: incorporating it into a physical medium (for example, saving it on a flash drive), but also fixation in cloud storage or in other ways that do not produce physical copies. <strong>Making the recorded works publicly available, however, is excluded.</strong> Copying works in quantities exceeding what the mining requires oversteps the limits of fair use.<a href="#_ftn2" id="_ftnref2">[2]</a> It is also worth noting Article 35 of the Act, under which fair use must not infringe the normal exploitation of the work or the legitimate interests of the creator.</p>



<p>Analogous provisions were also introduced into the Polish<strong> Act </strong>of 27 July 2001 <strong>on the Protection of Databases </strong>(consolidated text: Journal of Laws of 2024, item 1769), likewise implementing the aforementioned directive. It is therefore important to outline the differences between the two database protection regimes:</p>



<ol class="wp-block-list">
<li>Copyright and Related Rights Act &#8211; Under Article 3, <strong>a database is subject to copyright </strong>if it meets the characteristics of a work, even if it contains unprotected materials, <strong>provided that the selection, arrangement, or composition thereof is creative</strong>. If a database can be considered a work based on these criteria, it will be protected by copyright.</li>



<li>Database Protection Act – <strong>According to Article 1, databases are subject to protection specified in the Act, regardless of the protection granted under the Copyright and Related Rights Act</strong>. According to this Act, a database is a set of data or any other materials and elements collected according to a specific system or method, individually accessible in any manner, including electronic means, <strong>requiring significant investment, whether qualitative or quantitative, in order to create, verify, or present its contents</strong>. As can be seen, database protection under this Act does not require the database to be creative; the emphasis is on the investment costs associated with creating, verifying, or presenting its contents.</li>
</ol>



<p>A database may therefore be protected only under copyright law (if it is creative in nature and its creation did not require significant investment), only under the Database Protection Act (if its creation required significant investment but it is not creative in nature), or under both regimes (if it is creative in nature and its creation required significant investment). Referring to the concept of synthetic data discussed above, it should be noted that <strong>synthetic data, as products of artificial intelligence, are not works within the meaning of the Copyright Act and are therefore not subject to its protection</strong>. The opposite may be true for <strong>synthetic databases </strong>&#8211; if such a database is creative in nature and created by a human, it will be a work within the meaning of the Act. Under the Database Protection Act regime, by contrast, neither a creative character nor human involvement in the creation of the database is required &#8211; a significant investment suffices.</p>



<p><strong>GDPR</strong></p>



<p>If personal data are used to train artificial intelligence, Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (<strong>General Data Protection Regulation</strong>) will apply. In such a situation, a legal basis for data processing is required in accordance with Article 6 (e.g. the consent of the data subject) or Article 9 (in the case of so-called special category data). Wherever possible, the entity training artificial intelligence should process the data entered into the model in a way that prevents the identification of a natural person (e.g. in anonymized or pseudonymized form).</p>
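A minimal sketch of the pseudonymization mentioned above (my illustration, not from the article; the key and field names are hypothetical): direct identifiers are replaced with keyed hash tokens, so records can still be linked within a training set without exposing the person's identity.

```python
import hashlib
import hmac

# Hypothetical secret key; under pseudonymization it must be stored
# separately from the dataset, since the mapping remains re-identifiable
# for whoever holds the key.
SECRET_KEY = b"keep-this-key-separate-from-the-dataset"

def pseudonymize(identifier: str) -> str:
    """Replace a direct identifier with a stable, non-reversible token.

    HMAC-SHA256 keeps the mapping reproducible for the controller while
    preventing reversal from the token alone."""
    return hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256).hexdigest()[:16]

record = {"patient_id": "PESEL-90010112345", "diagnosis": "J45"}
record["patient_id"] = pseudonymize(record["patient_id"])
```

The same input always yields the same token, so longitudinal records stay linkable; true anonymization would additionally require discarding the key and generalizing the remaining quasi-identifiers.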



<p><strong>Data Act</strong></p>



<p>Regulation (EU) 2023/2854 of the European Parliament and of the Council of 13 December 2023 on harmonised rules on fair access to and use of data and amending Regulation (EU) 2017/2394 and Directive (EU) 2020/1828 (<strong>Data Act</strong>) introduces numerous new regulations that enable data acquisition. The Data Act is crucial for the process of training artificial intelligence (AI) because it affects how data is acquired, shared, and used in machine learning. These provisions regulate, among other things:</p>



<ul class="wp-block-list">
<li>The right to access data: users of devices generating data (e.g. smartphones, vehicles or industrial machines) gain the right to read it and transfer it to other entities, which facilitates the creation and training of AI models.</li>



<li>Obligations of service providers: entities processing data must ensure their availability in a transparent and non-discriminatory manner, which has a direct impact on compliance with the principles of ethical use of data by AI algorithms.</li>



<li>Data sharing policies: Companies are encouraged (or required) to share data with each other in certain situations, which can spur the development of the AI sector by increasing the availability of diverse datasets.</li>
</ul>



<p><strong>Artificial Intelligence Act (AI Act)</strong></p>



<p>Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 laying down harmonised rules on artificial intelligence (&#8220;<strong>Artificial Intelligence Act</strong>&#8221;) will be of key importance for the development of AI (including model training) in the Member States of the European Union in the coming years. The aim of this act is to increase the competitiveness of European companies in relation to foreign technology giants, while respecting the rights and values of members of the European community.</p>



<p>First, it is worth mentioning that the regulation imposes on developers of general-purpose AI models (e.g., ChatGPT, Gemini, etc.) an information obligation covering the data used for AI training and validation (i.e., the real-world data used to verify and correct errors in synthetic data generated by the AI). This obligation covers, among other things, the type and origin of the data, its scope and properties, as well as measures to detect inappropriate data sources that could lead to erroneous synthetic data or model bias. <strong>This information is to be provided to the AI Office and the competent national authorities upon request </strong>(Annex XI in conjunction with Article 53(1)(a)). Providers are also required, among other things, to provide downstream providers with information enabling efficient integration (Article 53(1)(b)) and to publicly disclose a summary of the data used to train the model (Article 53(1)(d)).</p>



<p>Moreover, the AI Act strengthens copyright protection for works that may be used to train artificial intelligence, primarily in the context of the rightholder&#8217;s objection described above. Article 26<sup>3 </sup>(2) of the Polish Copyright Act, implementing the Directive on copyright in the Digital Single Market, introduces the so-called opt-out mechanism, under which, for works made publicly available, the objection is expressed in a machine-readable format (i.e., using metadata). The idea is that, upon encountering a work equipped with an opt-out signal informing it that the rightholder objects to the use of the content for AI training, the AI system will itself know that it cannot use that work. This regulation (which, after all, arises from EU law) is referenced in Article 53(1)(c) of the AI Act. Under this provision, providers of general-purpose AI models are required to put in place policies to ensure compliance with EU law on copyright and related rights, in particular with a view to identifying and complying with reservations of rights, including through state-of-the-art technologies (i.e., opt-out mechanisms using metadata).</p>
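By way of illustration only (the article does not prescribe any particular technology), one machine-readable opt-out convention is the W3C TDM Reservation Protocol, which signals a rights reservation via an HTML meta tag named <code>tdm-reservation</code>. A training-data crawler might check for it roughly like this:

```python
from html.parser import HTMLParser

class TDMReservationParser(HTMLParser):
    """Detect a <meta name="tdm-reservation" content="1"> opt-out signal."""

    def __init__(self):
        super().__init__()
        self.reserved = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)  # HTMLParser lowercases tag and attribute names
        if tag == "meta" and a.get("name") == "tdm-reservation":
            self.reserved = a.get("content") == "1"

def tdm_reserved(html: str) -> bool:
    """Return True if the page declares a text-and-data-mining reservation."""
    parser = TDMReservationParser()
    parser.feed(html)
    return parser.reserved

# Hypothetical page carrying the reservation
page = '<html><head><meta name="tdm-reservation" content="1"></head></html>'
```

A compliant crawler would skip pages for which `tdm_reserved` returns `True`; real deployments also consult the protocol's site-wide policy file rather than page metadata alone.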



<p>An important requirement is also introduced by Article 53(1)(d) of the AI Act. Under this provision, developers of general-purpose AI models must prepare and make publicly available summaries of the content used to train the model, in accordance with a template provided by the AI Office. The intent is to improve the transparency of datasets and to make it easier for parties with a legitimate interest, including copyright holders, to exercise and enforce their rights under EU law. The summary should list the datasets used by the model, such as public datasets, licensed private data, content sourced from the internet, and user data. According to the European Commission&#8217;s guidelines, the information in the summary should cover the data used at all stages of model training, from initial training to the final stages, including model fine-tuning. Interestingly, the template requires the disclosure of a summary list of the most popular internet domains; it does not, however, require disclosure of details about the specific data and works used to train the model, as this would go beyond Article 53(1)(d), which calls only for a &#8220;summary&#8221; &#8211; one that, according to Recital 107 of the AI Act, must be &#8220;comprehensive and sufficiently detailed&#8221; but not &#8220;technically detailed&#8221;, since such data constitute trade secrets. To protect them, the template requires varying levels of detail depending on the source of the data (e.g., limited disclosure is required for licensed data). The summary should be updated whenever the provider introduces new data into the training process &#8211; at least every six months, or immediately if the data are particularly significant &#8211; and should be made publicly available (e.g., on the model developer&#8217;s website) at the latest when the model is placed on the EU market.</p>



<p>Existing models must comply with the above-mentioned regulations by August 2, 2027 – otherwise, the AI Office will be able to impose fines of up to 3% of the company&#8217;s global annual turnover.</p>



<p>One measure intended to increase the competitiveness of AI models trained in Europe is the rule that <strong>any entity offering an AI model in Europe must first train it in accordance with EU law, regardless of the territory in which the training takes place. </strong>The AI Act, in Article 5(1)(e), also prohibits the placing on the market or putting into service of AI systems that create or expand facial recognition databases through the untargeted scraping of facial images from the internet or CCTV footage. This is intended to strengthen the protection of personal data during the training of artificial intelligence models.</p>



<p><strong>AI Training Dataset Licensing</strong></p>



<p>The primary legal act regulating the protection and licensing of databases is the Copyright and Related Rights Act. As mentioned earlier, if a database meets the characteristics of a work within the meaning of Article 3 of the Act (i.e., its selection, arrangement, or composition is creative in nature), it is subject to the provisions of this Act, including those governing the license agreement. Databases for AI training may also fall within this scope if their creative, individual character is demonstrated (under Polish case law, it is very easy to demonstrate such character and, consequently, to have a given collection recognized as a work). Pursuant to Article 66, paragraph 1, a license agreement entitles the licensee to use the work for a period of five years in the territory of the country where the licensee has its registered office or place of residence, unless the agreement provides otherwise. Importantly, a license agreement does not transfer the copyright to the licensee. The creator may authorize use of the work in the fields of exploitation specified in the agreement, defining the scope, location, and duration of such use &#8211; so it is possible to restrict the use of data from the database to AI training alone. Unless the agreement provides otherwise, the licensee may not authorize another person to use the work within the scope of the obtained license. An exclusive license agreement (i.e., one reserving to the licensee the exclusive right to use the work in a specific manner) must be made in writing under pain of nullity. Where the license is non-exclusive, its grant does not limit the creator&#8217;s right to authorize others to use the work in the same field of exploitation.</p>



<p>The Database Protection Act, by contrast, does not regulate license agreements at all. It might therefore seem that if a database is not creative in nature and is not protected by copyright, the licensing provisions do not apply to it either. In practice, access to such databases is granted to companies training AI models under a so-called innominate agreement, structured similarly to the license agreement under the Copyright Act.</p>



<p>Because of the limited amount of data they cover, openly licensed collections are generally not used to train AI models. An open license effectively constitutes an offer by the rightholder to make the work (database) available to everyone on the terms of that specific license, so the works are available on the same terms to all interested parties. The license agreement is concluded implicitly (per facta concludentia), by beginning to use the collections covered by the license.</p>



<p><strong>TRAINED DATASETS</strong></p>



<p>Trained datasets are, in a sense, databases composed of the &#8220;results&#8221; of AI training. For example, by feeding the model medical data on patients with a specific disease, the model can determine the most common causes and symptoms of the disease (generate statistics) and suggest the most effective treatment options. These AI training products, resulting from its calculations, constitute trained datasets. As AI outputs, they are not in themselves subject to copyright protection, but once a human arranges them into an individual, creative system or set, they can be considered a database within the meaning of the Copyright Act and thus be subject to protection and licensing. As with synthetic data, if a database of trained data is created with significant investment, it will be protected under the Database Protection Act.</p>



<p>Sources:</p>



<ol class="wp-block-list">
<li>Act of 4 February 1994 on copyright and related rights (consolidated text: Journal of Laws of 2025, item 24, as amended).</li>



<li>Act of 27 July 2001 on the protection of databases (consolidated text: Journal of Laws of 2024, item 1769).</li>



<li>Directive (EU) 2019/790 of the European Parliament and of the Council of 17 April 2019 on copyright and related rights in the Digital Single Market and amending Directives 96/9/EC and 2001/29/EC (OJ EU L 130, 2019, p. 92)</li>



<li>Act of 26 July 2024 amending the Act on Copyright and Related Rights, the Act on Database Protection and the Act on Collective Management of Copyright and Related Rights (Journal of Laws, item 1254)</li>



<li>Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation) (OJ EU L. of 2016, No. 119, p. 1, as amended).</li>






<li>Regulation (EU) 2023/2854 of the European Parliament and of the Council of 13 December 2023 on harmonised rules on fair access to and use of data and amending Regulation (EU) 2017/2394 and Directive (EU) 2020/1828 (Data Act)</li>



<li>Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 laying down harmonised rules on artificial intelligence and amending Regulations (EC) No 300/2008, (EU) No 167/2013, (EU) No 168/2013, (EU) 2018/858, (EU) 2018/1139, (EU) 2019/2144 and Directives 2014/90/EU, (EU) 2016/797 and (EU) 2020/1828 (Artificial Intelligence Act)</li>



<li>A. Niewęgłowski [in:] Copyright. Commentary, 2nd ed., Warsaw 2025</li>



<li>Annex to the Communication to the Commission &#8211; Approval of the content of the draft Communication from the Commission &#8211; Explanatory Notice and Template for the Public Summary of Training Content for general-purpose AI models required by Article 53(1)(d) of Regulation (EU) 2024/1689 (AI Act)</li>
</ol>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><a href="#_ftnref1" id="_ftn1">[1]</a> A. Niewęgłowski [in:] <em>Copyright. Commentary</em>, 2nd ed., Warsaw 2025, Article 6.</p>



<p><a href="#_ftnref2" id="_ftn2">[2]</a> A. Niewęgłowski [in:] <em>Copyright. Commentary</em>, 2nd ed., Warsaw 2025, Article 26<sup>3</sup>.</p>
<p>The article <a href="https://www.kg-legal.eu/info/it-new-technologies-media-and-communication-technology-law/licensing-datasets-for-ai-training/">LICENSING DATASETS FOR AI TRAINING</a> originally appeared on <a href="https://www.kg-legal.eu">KIELTYKA GLADKOWSKI LEGAL | CROSS BORDER POLISH LAW FIRM RANKED IN THE LEGAL 500 EMEA SINCE 2019</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.kg-legal.eu/info/it-new-technologies-media-and-communication-technology-law/licensing-datasets-for-ai-training/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>USE OF MEDICAL DATA FOR AI TRAINING</title>
		<link>https://www.kg-legal.eu/info/pharmaceutical-healthcare-life-sciences-law/use-of-medical-data-for-ai-training/</link>
					<comments>https://www.kg-legal.eu/info/pharmaceutical-healthcare-life-sciences-law/use-of-medical-data-for-ai-training/#respond</comments>
		
		<dc:creator><![CDATA[jakub]]></dc:creator>
		<pubDate>Tue, 21 Oct 2025 19:03:20 +0000</pubDate>
				<category><![CDATA[IT, NEW TECHNOLOGIES, MEDIA AND COMMUNICATION TECHNOLOGY LAW]]></category>
		<category><![CDATA[PHARMACEUTICAL, HEALTHCARE & LIFE SCIENCES LAW]]></category>
		<category><![CDATA[AI TRAINING]]></category>
		<category><![CDATA[MEDICAL DATA]]></category>
		<guid isPermaLink="false">https://www.kg-legal.eu/?p=8367</guid>

					<description><![CDATA[<p>Publication date: October 21, 2025 Under EU Law, namely Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation) (hereinafter [&#8230;]</p>
<p>The article <a href="https://www.kg-legal.eu/info/pharmaceutical-healthcare-life-sciences-law/use-of-medical-data-for-ai-training/">USE OF MEDICAL DATA FOR AI TRAINING</a> originally appeared on <a href="https://www.kg-legal.eu">KIELTYKA GLADKOWSKI LEGAL | CROSS BORDER POLISH LAW FIRM RANKED IN THE LEGAL 500 EMEA SINCE 2019</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-vivid-cyan-blue-color"><strong>Publication date: October 21, 2025</strong></mark></p>



<p>Under EU law, namely Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation) (hereinafter &#8220;GDPR&#8221;) and the pending entry into application of Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 laying down harmonised rules on artificial intelligence and amending Regulations (EC) No 300/2008, (EU) No 167/2013, (EU) No 168/2013, (EU) 2018/858, (EU) 2018/1139 and (EU) 2019/2144 and Directives 2014/90/EU, (EU) 2016/797 and (EU) 2020/1828 (Artificial Intelligence Act) (hereinafter &#8220;AIA&#8221;), the use of sensitive data (including medical data) for AI training is possible only after obtaining consent, in cases specified by law, or when anonymized data are used. The AIA is not a lex specialis vis-à-vis the GDPR, so training an AI model on <strong>personally identifiable data</strong> requires meeting the requirements of both acts.</p>



<span id="more-8367"></span>



<h2 class="wp-block-heading">Anonymized data</h2>



<p>First, it should be noted that the GDPR, in accordance with its Recital 26 and Article 4(1), applies to personal data, meaning data that allow for the identification of a data subject, and should therefore not apply to anonymized information. The AIA&#8217;s definition of &#8220;special categories of personal data&#8221; refers back to the GDPR definition, so the AIA will not apply to anonymized data either. It can therefore be concluded that the use of anonymized data for training AI models is permissible under both acts.</p>



<p>Pursuant to Article 11 of the GDPR, if the purposes for which a controller processes personal data do not or no longer require the identification of a data subject by the controller, the controller shall not be obliged to process additional information in order to identify the data subject for the sole purpose of complying with the GDPR.</p>



<h2 class="wp-block-heading">GDPR</h2>



<p>The GDPR states that personal data may be processed only in strictly defined cases and in compliance with certain standards. Generally, pursuant to Article 6(1)(a) of the GDPR, consent is required for the lawfulness of personal data processing (see exceptions below). Consent to data processing must be given for a specific purpose. It must also be freely given, specific, informed, and unambiguous. Consent cannot be presumed, and the data controller bears the burden of demonstrating that the data subject has consented to processing (Article 7(1) of the GDPR).</p>



<p>Article 5 of the GDPR stipulates that personal data must be processed lawfully, fairly, and in a transparent manner for the data subject. Personal data must be adequate, relevant, and limited to the purposes for which they are processed. They must also be accurate and, where necessary, kept up to date. The controller must take all reasonable steps to ensure that personal data that are inaccurate in relation to the purposes of processing are promptly erased or rectified.</p>



<p>The literature indicates that the purpose of processing must be specific and clear &#8211; it cannot be abstract &#8211; and that processing data for a purpose other than that for which they were collected is possible only on the basis of consent or a legal provision. Consequently, processing data collected, with consent, for a different purpose (for example, the provision of a medical service) in order to train artificial intelligence would be inadmissible without the additional consent of the data subject.</p>



<p>The list of cases in which data processing is permitted without the data subject&#8217;s consent is contained in Article 6(1)(b)&#8211;(f) of the Regulation. It would be difficult to justify the use of data for AI training under any of those grounds other than the one described in point (f). However, medical data fall into the category of so-called sensitive data described in Article 9 of the GDPR, which prohibits their processing and establishes a separate list of exceptions in its paragraph 2 &#8211; a list that does not include &#8220;legitimate interests&#8221;. It would therefore not be permissible to use such data without consent for purposes such as commercial ones by invoking the processor&#8217;s &#8220;legitimate interest&#8221;.</p>



<h2 class="wp-block-heading">AI ACT</h2>



<p>Meanwhile, the AI Regulation introduces the possibility of using sensitive data (understood in the same way as under the GDPR) when developing AI systems. According to Article 10(5) of the regulation (to be applied from August 2, 2026), AI system providers may exceptionally use this data if strictly necessary for the purpose of detecting and correcting bias in high-risk AI systems. This requires compliance with the GDPR requirements (including consent) and the following conditions:</p>



<p>(a) it is not possible to effectively detect and correct bias by processing other data;</p>



<p>(b) special categories of personal data are subject to technical restrictions on the re-use of personal data and state-of-the-art security and privacy measures, including pseudonymisation;</p>



<p>(c) special categories of personal data are secured, protected and subject to appropriate safeguards, including strict access controls and documentation, to avoid abuse and ensure that only authorised persons, subject to appropriate confidentiality obligations, have access to such data;</p>



<p>(d) such data may not be sent, transferred or otherwise made available to other entities;</p>



<p>(e) special categories of personal data shall be deleted once the bias has been corrected or after the personal data retention period has expired, whichever comes first;</p>



<p>(f) records of processing activities must include a justification as to why the processing of special categories of personal data was strictly necessary to detect and correct bias and why this purpose could not be achieved by processing other data.</p>



<p>This provision appears to envisage an exceptional case of using such data; where the conditions described in this article are not met, the use of sensitive data for training AI models will not be permissible.</p>



<h2 class="wp-block-heading">Sale of personal data</h2>



<p>As mentioned above, the GDPR does not apply to anonymised data, so their sale or other &#8220;commercial use&#8221; is not regulated by the GDPR, although it may still encounter legal obstacles arising from regulations other than the GDPR, for example sector-specific ones.</p>



<p>In the case of selling personal data (not anonymised), consent or the fulfilment of one of the conditions of Article 9(2) would be required; since we are dealing with sensitive data, it would not be possible to invoke the &#8220;legitimate interest&#8221; of the &#8220;seller&#8221;.</p>



<h2 class="wp-block-heading">Summary</h2>



<p>In summary, the use of anonymized data for training AI models is permissible (unless prohibited by specific regulations). In the case of non-anonymized or merely pseudonymized personal data, compliance with the requirements of the GDPR and the AIA, including consent, would be required.</p>
<p>The article <a href="https://www.kg-legal.eu/info/pharmaceutical-healthcare-life-sciences-law/use-of-medical-data-for-ai-training/">USE OF MEDICAL DATA FOR AI TRAINING</a> comes from <a href="https://www.kg-legal.eu">KIELTYKA GLADKOWSKI LEGAL | CROSS BORDER POLISH LAW FIRM RANKED IN THE LEGAL 500 EMEA SINCE 2019</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.kg-legal.eu/info/pharmaceutical-healthcare-life-sciences-law/use-of-medical-data-for-ai-training/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
