Data anonymization under GDPR

Since the General Data Protection Regulation (GDPR) took effect in 2018, companies must be doubly vigilant about circulating internal documents that contain personal data, whether that data concerns employees, customers or any other natural person. Article 35 of the Regulation requires them to carry out a data protection impact assessment where processing is likely to result in a high risk to individuals. This assessment must define protection measures in line with the risk analysis. One measure that naturally comes to mind is to anonymize the names in documents before releasing them to anyone other than the recipients of the full version.

In France, the 2018-2022 Programming and Justice Reform Act recently specified the pseudonymization rules that apply to court decisions before they are made available to the public (open data). The surnames and forenames of the parties and of third parties, with the exception of magistrates and clerks of the court, must be concealed.

Within companies too, anonymization rules must be adapted to the nature of the document, the type of natural person mentioned, and the audience the document is meant for. Systematic anonymization could harm the intelligibility of the resulting document and be disproportionate to the risks incurred. For example, to exploit a corpus of life insurance contracts for cross-selling purposes, a company could anonymize the names of the beneficiaries while leaving the name of the customer, which is the only one needed for marketing purposes.

Where the Kairntech platform comes in…

This is where AI and the Kairntech platform come in. Creating a dataset lets you manually annotate the names of individuals according to their semantic context. The tool thus learns when a name should be replaced by initials and when it should be left intact. The quality of anonymization is measured with a battery of indicators, such as accuracy and recall.
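To make the replacement step concrete, here is a minimal Python sketch of how annotated names could be turned into initials once the spans are known. It is an illustration only, not the Kairntech implementation: the `Annotation` class, the `PARTY` and `MAGISTRATE` labels, and the `pseudonymize` and `annotate` helpers are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str  # hypothetical label, e.g. "PARTY" or "MAGISTRATE"


def to_initials(name: str) -> str:
    """Turn 'Jean Dupont' into 'J. D.'."""
    return " ".join(f"{part[0].upper()}." for part in name.split())


def pseudonymize(text: str, annotations: list[Annotation], labels_to_mask: set[str]) -> str:
    """Replace every span whose label is in labels_to_mask by initials."""
    out = text
    # Work from the end of the text so earlier offsets stay valid.
    for ann in sorted(annotations, key=lambda a: a.start, reverse=True):
        if ann.label in labels_to_mask:
            out = out[:ann.start] + to_initials(out[ann.start:ann.end]) + out[ann.end:]
    return out


def annotate(text: str, name: str, label: str) -> Annotation:
    """Helper for the example: locate the first occurrence of a name in the text."""
    start = text.index(name)
    return Annotation(start, start + len(name), label)


text = "Judge Anne Martin ruled that Jean Dupont owes Paul Durand 500 euros."
annotations = [
    annotate(text, "Anne Martin", "MAGISTRATE"),
    annotate(text, "Jean Dupont", "PARTY"),
    annotate(text, "Paul Durand", "PARTY"),
]

# Assumed policy for this example: mask parties, keep magistrates.
result = pseudonymize(text, annotations, labels_to_mask={"PARTY"})
print(result)  # Judge Anne Martin ruled that J. D. owes P. D. 500 euros.
```

In this sketch the `labels_to_mask` set plays the role of the anonymization policy, so the same annotated document could be released with different rules depending on its audience.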

Dataset creation on case law for the anonymization use case.

The essential parameters that determine the overall level of quality are then the size of the dataset (number of pages and number of documents), the number, variety and relative semantic complexity of the labels, and the number of annotations carried by each label.
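As a rough illustration, these dataset parameters can be tallied with a few lines of Python. The document structure used here (a dict with a `text` field and a list of `(start, end, label)` annotations) and the label names are assumptions made for this sketch, not the platform's export format.

```python
from collections import Counter


def dataset_summary(documents: list[dict]) -> dict:
    """Count documents, annotations, and annotations per label."""
    label_counts = Counter(
        label for doc in documents for _start, _end, label in doc["annotations"]
    )
    return {
        "documents": len(documents),
        "annotations": sum(label_counts.values()),
        "per_label": dict(label_counts),
    }


# Hypothetical two-document corpus.
corpus = [
    {"text": "Jean Dupont v. Paul Durand",
     "annotations": [(0, 11, "PARTY"), (15, 26, "PARTY")]},
    {"text": "Judge Anne Martin presiding",
     "annotations": [(6, 17, "MAGISTRATE")]},
]

summary = dataset_summary(corpus)
print(summary)
```

A label with very few annotations in such a summary is a hint that the model may not have seen enough examples to learn it reliably.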

The Kairntech platform lets you not only create a high-quality dataset but also find the best machine learning algorithm by automatically comparing quality in a few clicks, without coding. Today it is reasonable to expect overall accuracy close to 95% at document level with the most recent neural network algorithms, far beyond what traditional rule-based approaches achieved a few years ago (about 70% accuracy at document level).
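The text does not define how document-level accuracy is computed, so the following Python sketch makes an explicit assumption: a document counts as correct only if the predicted spans match the reference spans exactly. It also shows span-level precision and recall for comparison. The function names and sample data are hypothetical.

```python
def span_scores(gold: list, predicted: list) -> tuple[float, float]:
    """Precision and recall over exact-match spans."""
    gold_set, pred_set = set(gold), set(predicted)
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 1.0
    recall = true_pos / len(gold_set) if gold_set else 1.0
    return precision, recall


def document_accuracy(gold_docs: list[list], predicted_docs: list[list]) -> float:
    """Share of documents whose predicted spans equal the gold spans (assumed strict definition)."""
    correct = sum(set(g) == set(p) for g, p in zip(gold_docs, predicted_docs))
    return correct / len(gold_docs)


def flatten(docs: list[list]) -> list:
    """Tag each span with its document index so spans from different documents never collide."""
    return [(i, span) for i, doc in enumerate(docs) for span in doc]


# Hypothetical reference and predicted annotations for three documents.
gold_docs = [[(0, 5, "PARTY"), (10, 15, "PARTY")], [(3, 8, "PARTY")], []]
pred_docs = [[(0, 5, "PARTY")], [(3, 8, "PARTY")], []]

precision, recall = span_scores(flatten(gold_docs), flatten(pred_docs))
doc_acc = document_accuracy(gold_docs, pred_docs)
print(f"precision={precision:.2f} recall={recall:.2f} document accuracy={doc_acc:.2f}")
```

Under this strict definition a single missed name makes the whole document count as wrong, which is why document-level figures are typically lower than span-level ones.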

Conclusion

The Kairntech platform does not only anonymize documents. It also produces quality indicators for this processing. These indicators feed into, and later help revise, the impact assessment, and so help justify the anonymization algorithm used in the event of an audit or litigation.