Publication date: October 21, 2025
Useful data today includes not only information organized into rows, columns, or databases, but also data with no predefined structure. Such unstructured data constitutes the majority of the data we encounter, including images and text documents such as tweets and blog posts. Thousands of individuals and organizations generate it daily, with little regard for how it might later be used. It is precisely unstructured data that has made such rapid AI development possible through machine learning, that is, training algorithms to find patterns and correlations in large data sets.
Despite the lack of a specific structure or predefined template, unstructured data, like structured data, can contain personal data, thus falling under the GDPR. However, most anonymization techniques apply to structured data. While some techniques can be adapted appropriately, the scale and lack of a specific format significantly limit their applicability.
A significant challenge in anonymizing unstructured data is the lack of a fixed template. Whether a text file reveals personal data depends not only on the words themselves, but also on context and ambiguous phrasing. Photos, in turn, can contain data as obvious as a face, yet analyses of online activity have shown that someone’s address can be identified from details as small as a specific street layout or a distinctive sign.
Unstructured data is also multidimensional by nature. An image, after all, is composed of individual pixels, and a tweet of specific words arranged in a logical order. Taken separately, these components do not point to a specific person, but combined they can pose a privacy threat. Because of this complexity, techniques for anonymizing unstructured data rely primarily on machine learning, which is itself still maturing.
Text data is anonymized primarily by detecting personal and sensitive data. One way to find it is by matching regular expressions: sets of rules that a given string of characters must satisfy. A properly constructed expression can detect, for example, amounts or identification numbers. Another method is dictionary search, which compares a given word or phrase against a previously prepared dictionary. For greater effectiveness, matching can be made case-insensitive and combined with so-called fuzzy matching, which tolerates a certain difference between a dictionary entry and a potential typo in the text. The last method is the use of trained artificial intelligence models to detect personal data.
This most complex method uses pre-trained artificial intelligence to detect complex expressions based on part of speech, word form, and context.
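The first two techniques can be sketched in a few lines of standard-library Python. The patterns and the name dictionary below are purely illustrative (real deployments need locale-specific rules, e.g. for national ID numbers), and the 0.8 similarity cutoff is an arbitrary choice:

```python
import re
import difflib

# Illustrative regular expressions for common identifiers.
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "phone": re.compile(r"(?<!\w)\+?\d[\d -]{7,}\d\b"),
}

# A hypothetical dictionary of known sensitive terms, e.g. first names.
NAME_DICTIONARY = ["john", "maria", "katarzyna"]

def regex_matches(text):
    """Detect identifiers by matching regular expressions."""
    return [(label, m.group())
            for label, rx in PATTERNS.items()
            for m in rx.finditer(text)]

def fuzzy_dictionary_matches(text, cutoff=0.8):
    """Compare each word with the dictionary, tolerating typos."""
    hits = []
    for word in re.findall(r"[A-Za-z]+", text):
        close = difflib.get_close_matches(word.lower(), NAME_DICTIONARY,
                                          n=1, cutoff=cutoff)
        if close:
            hits.append((word, close[0]))
    return hits

def anonymize(text):
    """Replace every detected span with a placeholder token."""
    for rx in PATTERNS.values():
        text = rx.sub("[REDACTED]", text)
    for word, _ in fuzzy_dictionary_matches(text):
        text = re.sub(rf"\b{re.escape(word)}\b", "[REDACTED]", text)
    return text
```

Fuzzy matching here is what lets the misspelling “Katarzina” still be caught against the dictionary entry “katarzyna”; pre-trained NER models go further by also weighing part of speech and surrounding context.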
Separate techniques are used to anonymize image data. These primarily involve irreversibly distorting the image in a way that prevents recovery of the personal data. The simplest technique is obfuscation, which covers the sensitive area with a solid color. Another is pixelation, which reduces the resolution of a given area until the data becomes unrecognizable. The last is blurring, which applies filters that smooth specific areas. Techniques for detecting environmental elements that could lead to the identification of individuals are also under constant development. This is accomplished primarily through deep learning, particularly convolutional neural networks, which process large amounts of grid-formatted data and then extract the detailed features relevant for classification and detection. Such methods can hide sensitive data from a viewer, but the transformations may remain technically reversible. This problem is addressed by GANs (generative adversarial networks), which replace a potentially sensitive object with an artificially generated one. There is, however, a chance that the newly generated object will resemble another, already existing one, which is particularly problematic when generating faces. Furthermore, removing some objects can render the images worthless.
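As a minimal sketch of one of these techniques, the snippet below pixelates a region of a grayscale image (represented simply as a list of rows) by replacing each block with its average intensity. This is an illustration only: real tools operate on color images, and overly small block sizes have been shown to leave pixelation partially reversible, which is why the block size matters:

```python
def pixelate(image, top, left, height, width, block_size=2):
    """Replace each block inside the region with its average intensity,
    reducing the effective resolution of that area."""
    out = [row[:] for row in image]  # work on a copy of the image
    for by in range(top, top + height, block_size):
        for bx in range(left, left + width, block_size):
            ys = range(by, min(by + block_size, top + height))
            xs = range(bx, min(bx + block_size, left + width))
            block = [image[y][x] for y in ys for x in xs]
            avg = sum(block) // len(block)  # averaging discards detail
            for y in ys:
                for x in xs:
                    out[y][x] = avg
    return out
```

Because averaging discards the original pixel values, the exact input cannot be recovered from the output, in line with the goal of irreversible distortion.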
Big data processing is a phenomenon that could significantly impact how information is managed and received, both in the private and public sectors. The amount of unstructured data currently at our fingertips is enormous, and it is crucial to ensure that sensitive data remains protected even in such conditions. Properly trained artificial intelligence comes to the rescue here, as it seems to be the only way to meaningfully sort through such vast amounts of data. Therefore, the development of AI should undoubtedly be of interest to every personal data controller.
On September 12, 2025, the provisions of Regulation (EU) 2023/2854 of December 13, 2023, on harmonised rules on fair access to and use of data and amending Regulation (EU) 2017/2394 and Directive (EU) 2020/1828 (the Data Act) became applicable. The Regulation specifically governs access to data generated by so-called connected products, which are defined under the Regulation as “a thing that acquires, generates or collects accessible data relating to its use or its environment and that is capable of communicating data from the product by means of an electronic communications service, a physical link or device access, and whose primary function is not to store, process or transmit data on behalf of a party other than the user”[1]. This applies above all to products known as the Internet of Things (IoT), which use the internet to collect and transmit data about their surroundings: household appliances, vehicles, watches, medical and agricultural devices, monitoring systems, temperature control systems, lighting systems, and so on. The most important change introduced by the Regulation is the requirement that products be designed to give users access to, and control over, the data these devices generate.
The Data Act does not focus directly on unstructured data, but on data as such, defined in Article 2(1) of the Regulation as “any digital representation of actions, facts, or information, and any compilation of such actions, facts, or information, including in the form of an audio, visual, or audiovisual recording”. Article 3(1) indicates, however, that “connected products shall be designed and manufactured, and related services shall be designed and provided, in such a way that data from the product and the related service, including relevant metadata necessary for the interpretation and use of the data, are easily, securely, free of charge by default, in a complete, structured, commonly used and machine-readable format, and, where appropriate and technically feasible, directly accessible to the user”. The provisions thus refer in particular to structured data: in principle, device manufacturers should design their devices so that the data they generate are structured, easily and universally accessible, and machine-readable.
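As an illustration only (the Data Act prescribes no concrete schema, and every field name below is hypothetical), a connected product’s export in a “structured, commonly used and machine-readable format” could take the shape of a JSON document in which a metadata block carries the information needed to interpret the readings:

```python
import json

# Hypothetical export from a connected thermostat. The Regulation requires
# the format to be structured, commonly used and machine-readable; JSON with
# explicit interpretive metadata is one common way to satisfy that.
export = {
    "device_id": "thermostat-0042",  # illustrative identifier
    "metadata": {                    # metadata needed to interpret the data
        "temperature_unit": "celsius",
        "sampling_interval_seconds": 60,
        "firmware_version": "1.4.2",
    },
    "readings": [
        {"timestamp": "2025-09-12T10:00:00Z", "temperature": 21.5},
        {"timestamp": "2025-09-12T10:01:00Z", "temperature": 21.6},
    ],
}

machine_readable = json.dumps(export, indent=2)
```

Any other widely supported structured format (CSV, XML, etc.) could serve the same purpose; the point is that the data and its interpretive metadata travel together in a form other systems can parse.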
The scope of data covered by the sharing and transfer obligations is specified in Recitals 15 and 16. Recital 15 states that information derived or inferred from such data, being the result of additional investment in assigning value to the data or extracting knowledge from it (in particular through complex proprietary algorithms, including those forming part of proprietary software), should not fall within the scope of the Regulation, and that the data holder’s obligation to make information available to the user or data recipient should therefore not cover it, unless the user and the data holder agree otherwise. Such data may include, in particular, information inferred through sensor fusion, which derives or infers data collected from multiple sensors in a connected product using complex proprietary algorithms and which may be protected by intellectual property rights. In short, the sharing obligation covers “raw” and pre-processed data, but not derived, inferred, or enriched data, especially where it results from significant investment. This is confirmed by the latest version of the FAQ on the Data Act, published by the European Commission on September 12, 2025[2].
In Recital 16, the European legislator identified another category largely excluded from the obligations of Chapter II of the Regulation: content. The recital states that content should be understood in terms of its form; it can be textual, audio, visual, or audiovisual, and is often protected by intellectual property rights. The FAQ published by the European Commission explains that content should be understood as the result of a creative and imaginative process that is then made available for others to view. The document cites the example of camera manufacturers, who are required to provide various information in the form of parameters (event logs, timestamps, battery charge levels, locations, etc.) but are not obliged to provide the audiovisual content itself. It is possible, however, for a camera to be connected to software that brings its functionality closer to that of an advanced sensor capable of interpreting the images it captures; cameras installed in cars, for example, work this way, assisting with parking or warning of a potential collision. The provisions of the Regulation will certainly apply to data generated in this way.
The Data Act is not intended to replace GDPR regulations in any way, but merely to supplement them. The Data Act should be applied in full compliance with the standards arising from the GDPR, which primarily means that any sharing of data containing information about specific individuals must be done in accordance with the GDPR. The regulation expands the scope of protection and rights of individuals so that, in addition to the rights arising from the GDPR (requesting access to data, rectification, deletion, etc.), users will also be able to obtain information about non-personal data, most often technical data generated by devices.
Users can require the data holder to provide readily available data from the product or service they use, together with the relevant metadata necessary to interpret and use that data. The data and metadata must be of the same quality as that available to the data holder. Disclosure should be made without undue delay, in an easy and secure manner, free of charge to the user, in a comprehensive, structured, commonly used, and machine-readable format and, where appropriate and technically feasible, continuously and in real time. The transfer of data to third parties is regulated in Article 5 of the Regulation. This mechanism serves one of the objectives set by the European legislator: stimulating innovation in “aftermarkets” and the development of entirely new services using data from various connected products or related services (Recital 32). In certain cases of exceptional need, an obligation to make data available to a public authority is also provided for (Articles 14 and 15).
In addition to the previously mentioned Article 3, which provides the basis for data disclosure, Article 4 is also crucial: it governs situations in which direct access cannot be provided, i.e., the Article 3 procedure cannot be used. In such cases, the user is entitled to submit a request to the data holder to provide the so-called readily available data.
[1] Art. 2(5) of the Data Act.
[2] Source: https://digital-strategy.ec.europa.eu/en/library/commission-publishes-frequently-asked-questions-about-data-act , p. 6 of the document.