Anonymization of court decisions conceals and protects information about an individual whose disclosure could be harmful. In accordance with the legislation, all data that enable the unique identification of an individual must be anonymized.
Court decisions are mostly textual. Identifying entities that need anonymization therefore requires an understanding of the language and content of the text, including the context in which individual words are used. This makes anonymization of court decisions difficult. In my thesis, I focus on the identification of entities that need anonymization.
I obtained the data from the IUS-INFO case-law portal and used a deep neural network based on the BERT model to process it. I classified words as "anonymize" or "do not anonymize".
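The two-class labeling described above can be illustrated with a minimal sketch. It assumes that annotations are given as character-offset spans and that words are split on whitespace; the function name `label_words`, the span format, and the example sentence are hypothetical and are not taken from the thesis data. The BERT-based classifier itself is not shown, only the shape of the labels it would be trained to predict.

```python
import re

ANONYMIZE = "anonymize"
KEEP = "do not anonymize"


def label_words(text, spans):
    """Label each whitespace-delimited word in `text`.

    `spans` is a list of (start, end) character offsets (end exclusive)
    marking text that must be anonymized. A word gets the ANONYMIZE label
    if it overlaps any span, otherwise KEEP. (Hypothetical helper.)
    """
    labeled = []
    for match in re.finditer(r"\S+", text):
        overlaps = any(
            match.start() < end and start < match.end()
            for start, end in spans
        )
        labeled.append((match.group(), ANONYMIZE if overlaps else KEEP))
    return labeled


# Illustrative sentence with a placeholder name, not real case data.
text = "The court heard Janez Novak in Ljubljana."
name = "Janez Novak"
start = text.index(name)
labels = label_words(text, [(start, start + len(name))])

for word, label in labels:
    print(f"{word}\t{label}")
```

Labeling at the word level keeps the task a plain binary sequence-labeling problem, so any token classifier that outputs one of two classes per word could be trained on such pairs.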
Existing anonymization systems use manually extracted features. I show that anonymization is more successful using the vector inputs of the BERT model, which performed well even with only a small learning set designed for identifying named entities. Anonymization was even better using the learning set built from annotated court decisions.