A few comments on pseudo-anonymization: secret-key encryption, hash functions, keyed hash functions, deterministic encryption and tokenization

Publication date: April 16, 2024

Under the GDPR, the principles of data protection do not apply to data from which the link between the information and the persons concerned has been irreversibly removed. Consequently, anonymized data is not subject to the provisions of the GDPR. The measure that the GDPR explicitly names as a method of data protection, however, is pseudo-anonymization. In this case, personal data is replaced in such a way that the persons to whom the data relates can still be identified, but only with the use of appropriate additional information.

The main mechanism behind pseudo-anonymization is the replacement of sensitive information with identifiers that are protected, for example by encryption, in a way that still allows the procedure to be reversed. The process therefore yields two sets of information: a set of data that cannot be associated with any natural person on its own, and a set of identifiers that allows the data to be assigned to a specific person. Only authorized users are able to reconstruct the original data. According to the findings of the Article 29 Working Party, we can distinguish five pseudo-anonymization techniques: secret-key encryption, hash function, hash function with key, deterministic encryption and tokenization.

In the secret-key encryption method, an algorithm transforms the data into ciphertext that can only be read by people who hold the key. The main objection to this method is that the key must be in the possession of both the sender and the recipient of the data, which requires a sufficiently secure method of key distribution. The key itself is a generated sequence of bits. Its length matters, because it determines how easy it is to recover the key with a brute-force attack, which consists of trying every possible key until the correct one is found.
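
The effect of key length on a brute-force attack can be made concrete with a little arithmetic. The sketch below is only an illustration: the function name `brute_force_years` and the guessing rate of 10^12 keys per second are assumptions chosen for the example, not figures from the Article 29 Working Party opinion.

```python
import secrets


def brute_force_years(key_bits: int, guesses_per_second: float) -> float:
    """Worst-case time, in years, needed to try every key of the given length."""
    seconds_per_year = 365 * 24 * 3600
    return 2 ** key_bits / guesses_per_second / seconds_per_year


# A key is simply a sequence of random bits; here 128 bits (16 bytes).
key = secrets.token_bytes(16)

# Assumed attacker speed: 10**12 guesses per second (hypothetical).
for bits in (40, 56, 128):
    years = brute_force_years(bits, 1e12)
    print(f"{bits}-bit key: about {years:.3g} years to try every key")
```

Every additional bit doubles the number of keys that have to be tried, which is why a 40-bit key falls within seconds under this assumption while a 128-bit key does not fall in any practical time.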

Another method of pseudo-anonymization is the hash function, which transforms input of any size into a fixed-length string. The probability of getting the same hash for two different values is low, and any change to the value changes the hash as well. Calculating the input value from the hash, i.e. reversing the transformation, is not directly possible. The hash does, however, make it possible to tell whether a value has changed. The level of protection is therefore high, provided that the set of possible input values is large; otherwise an attacker can hash every candidate value and compare the results. In return, recovering the original data is very complicated.
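
A minimal sketch of both the plain and the keyed hash function, using Python's standard `hashlib` and `hmac` modules. The function names, the sample e-mail addresses and the key are hypothetical, and SHA-256 is an assumed choice of hash function rather than one prescribed by the sources.

```python
import hashlib
import hmac


def pseudonymize(value: str) -> str:
    """Plain hash: always a 64-character hex string, whatever the input length."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def pseudonymize_keyed(value: str, key: bytes) -> str:
    """Hash function with key (HMAC): the result also depends on a secret key."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


a = pseudonymize("jan.kowalski@example.com")
b = pseudonymize("jan.kowalski@example.org")  # a small change gives a different hash

# If the possible inputs can be listed, a plain hash can be matched by trial.
candidates = ["anna.nowak@example.com", "jan.kowalski@example.com"]
recovered = [c for c in candidates if pseudonymize(c) == a]

# Without the (hypothetical) key, the same trial does not work on the keyed hash.
keyed = pseudonymize_keyed("jan.kowalski@example.com", b"hypothetical-secret-key")
```

The trial over `candidates` shows why a plain hash of a value from a small, guessable set offers limited protection, and why the keyed variant appears as a separate technique.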

Deterministic encryption always returns the same ciphertext for the same input and key. The problem is that by performing statistical analysis, or by building a dictionary of pairs of plain and scrambled values, it is possible to find correlations between the scrambled values and, as a result, to discover the true values. The Article 29 Working Party equates this technique, when the key is deleted afterwards, to selecting a random number as a pseudonym for each attribute in the database and then deleting the correspondence table. Cryptographers have also proposed the notion of probabilistic encryption. As the name suggests, it introduces an element of chance: the same true value is represented by one of many randomly selected ciphertexts, which prevents such connections from being made. Deterministic encryption can be turned into probabilistic encryption by appending a random value to the true value before encryption and removing it after decryption.
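
The frequency-analysis problem can be illustrated with a short sketch. Python's standard library contains no cipher, so a keyed hash (HMAC) stands in here for deterministic encryption; this is an assumption made for the example, and unlike real encryption the output cannot be decrypted. The key, the function names and the sample diagnoses are all hypothetical.

```python
import hashlib
import hmac
import secrets
from collections import Counter

KEY = secrets.token_bytes(32)  # hypothetical secret key


def deterministic(value: str) -> str:
    """Same input and key always give the same scrambled value."""
    return hmac.new(KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def randomized(value: str) -> str:
    """A fresh random value is mixed in, so repeated inputs look unrelated."""
    nonce = secrets.token_bytes(16)
    return hmac.new(KEY, nonce + value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


diagnoses = ["flu", "flu", "flu", "asthma", "flu", "diabetes"]

det = [deterministic(d) for d in diagnoses]
rnd = [randomized(d) for d in diagnoses]

# The deterministic column still reveals that one value occurs four times.
most_common_count = Counter(det).most_common(1)[0][1]
print("distinct deterministic values:", len(set(det)))
print("distinct randomized values:", len(set(rnd)))
```

An observer who knows that flu is the most common diagnosis can guess which scrambled value stands for it in the deterministic column, whereas the randomized column gives no such foothold. In real probabilistic encryption the random value travels with the ciphertext so that it can be removed after decryption.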

The last of the described techniques, tokenization, is well known from its use in the financial sector, where card identification numbers are replaced with values (tokens) that are of limited use to an attacker. It is most often based on one-way encryption mechanisms, or on assigning, through an index function, a sequence number or a randomly generated number that is not mathematically derived from the raw data. In the latter case, the original data is replaced with tokens without the use of any formula, so the true values are de facto impossible to reconstruct from the tokens alone. The original data is, however, stored in a so-called vault, and this is the only way to associate personal data with tokens.
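
A minimal sketch of vault-based tokenization, assuming an in-memory dictionary as the vault. The class name `TokenVault`, its methods and the sample card number are hypothetical; a real vault would be a separately secured and access-controlled store.

```python
import secrets


class TokenVault:
    """Maps random tokens to original values; the token is not derived from the data."""

    def __init__(self) -> None:
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        # Reuse the existing token so one value always maps to one token.
        if value in self._value_to_token:
            return self._value_to_token[value]
        token = secrets.token_hex(8)  # 16 random hex characters
        while token in self._token_to_value:  # extremely unlikely collision
            token = secrets.token_hex(8)
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str) -> str:
        # Only code with access to the vault can recover the original value.
        return self._token_to_value[token]


vault = TokenVault()
card_number = "4111 1111 1111 1111"  # sample value, not real card data
token = vault.tokenize(card_number)
original = vault.detokenize(token)
```

Because the token comes from a random generator, nothing about the card number can be computed from it; the link exists only as an entry in the vault.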

For the most part, pseudo-anonymization methods protect our data effectively while still allowing the depersonalization process to be reversed. The fact that the concept has been introduced into law is a clear advantage of these methods, because it makes it possible to demand legal liability and compensation under the regulation. In times dominated by a huge flow of data, knowledge of depersonalization techniques and of the laws associated with them is extremely important.

SOURCES:
Analiza-rozwiazan-w-zaresie-anonymizator-danych-i-generowania-danych-syntetyczn.html
/justice/article-29/documentation/opinion-recommendation/files/2014/wp216_en.pdf