Updating structured vocabularies: a necessary but difficult task

Structured vocabularies (lexicons, dictionaries, thesauri, taxonomies…) and, more broadly, knowledge bases play an important role in many applications that organize information or make it accessible.

But they need to be constantly updated, which is a long, tedious and difficult task.

A good example is the well-known MeSH medical thesaurus, which facilitates searching for and accessing medical articles. Enriching scientific content with MeSH terms ensures that an article on a specific topic, for example “Type 2 Diabetes Mellitus”, can be found even if the author has used one of the many synonyms for this disease, such as “NIDDM” or “Adult Onset Diabetes Mellitus”.
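To make the idea concrete, this kind of synonym normalization amounts to mapping variant terms onto a preferred heading. The snippet below is a minimal illustration: the mapping covers only the terms from the example above, not the full MeSH entry-term list, and the preferred_heading helper is hypothetical.

```python
# Minimal illustration of synonym normalization against a thesaurus.
# Only the example terms are listed; real MeSH has many more entry terms.
MESH_SYNONYMS = {
    "niddm": "Diabetes Mellitus, Type 2",
    "adult onset diabetes mellitus": "Diabetes Mellitus, Type 2",
    "type 2 diabetes mellitus": "Diabetes Mellitus, Type 2",
}

def preferred_heading(term):
    """Return the preferred MeSH heading for a known synonym, else None."""
    return MESH_SYNONYMS.get(term.lower())

print(preferred_heading("NIDDM"))  # -> Diabetes Mellitus, Type 2
```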

Legal intelligence monitoring: the case for assistance

Take, for example, the field of legal intelligence: every company must respect the law and comply with regulations. For an information professional, it is essential to keep up with relevant developments in legislation, regulation, case law and even legal doctrine.

However, legal concepts and notions evolve constantly: what are the emerging legal concepts, the new acronyms, the new terms used by lawyers in labor law, commercial law, intellectual property or any other field of law?

Remember that France has more than 75 legal codes, all of them constantly evolving. Tracking these changes manually quickly becomes impossible, especially if you also follow other, equally complex subjects.

Assistance therefore becomes essential for acquiring new information that is actually relevant to your business activity.

Using a seed vocabulary

The approach we describe here assumes that a seed vocabulary exists at the outset. In our example, this is a list or thesaurus of legal concepts. We also need a corpus of legal documents.
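For illustration, a seed vocabulary can be as simple as preferred labels paired with their known variants. The legal concepts below are examples chosen for this sketch, not an extract from an actual thesaurus.

```python
# A seed vocabulary in its simplest form: preferred labels mapped to
# known variants. These entries are illustrative examples only.
SEED_VOCABULARY = {
    "clause de non-concurrence": ["clause de non concurrence"],
    "rupture conventionnelle": ["rupture conventionnelle collective"],
    "droit à la déconnexion": [],
}
```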

The challenge is to automatically analyze the content of these documents and ensure that new concepts worth adding to the vocabulary are reliably identified and brought to the expert's attention.

How to proceed?

Kairntech’s response to this challenge encompasses the following steps:

1. Create a project on the Kairntech platform and import the above-mentioned seed vocabulary and document corpus.

2. Configure a text annotator with the seed vocabulary, then automatically annotate the document corpus. This annotation automatically locates and marks the seed-vocabulary terms in the documents, regardless of case (upper or lower), inflected word forms (plurals…) or the insertion of stop words within noun phrases.
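The Kairntech annotator itself is not shown here; as a rough sketch of this kind of dictionary-based matching, here is an equivalent built with spaCy's PhraseMatcher, assuming the fr_core_news_sm model is installed.

```python
# Sketch of dictionary-based annotation with spaCy's PhraseMatcher.
# This illustrates the matching behavior described above; it is not
# Kairntech's actual annotator.
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("fr_core_news_sm")  # assumes the French model is installed
seed_terms = ["clause de non-concurrence", "rupture conventionnelle"]

# attr="LOWER" makes matching case-insensitive; attr="LEMMA" (with nlp()
# instead of make_doc) would also cover inflected forms such as plurals.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("SEED_TERM", [nlp.make_doc(t) for t in seed_terms])

doc = nlp("La Clause de non-concurrence doit être limitée dans le temps.")
for _, start, end in matcher(doc):
    print(doc[start:end].text)  # -> Clause de non-concurrence
```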

3. Check the quality of the annotated corpus and make any necessary corrections, manually or with assistance. The annotated corpus then becomes a training dataset.

A training dataset is built automatically from a seed vocabulary.
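The platform's internal format is not shown here; purely as an illustration, a checked corpus can be represented as character-offset annotations, a format many NER toolkits accept. The LEGAL_CONCEPT label is hypothetical.

```python
# One possible representation of the checked corpus: (text, entities)
# pairs with character offsets. The LEGAL_CONCEPT label is illustrative.
TRAIN_DATA = [
    (
        "La clause de non-concurrence doit être limitée dans le temps.",
        {"entities": [(3, 28, "LEGAL_CONCEPT")]},
    ),
    (
        "Une rupture conventionnelle a été signée.",
        {"entities": [(4, 27, "LEGAL_CONCEPT")]},
    ),
]
```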

4. Create machine learning experiments with state-of-the-art Machine Learning and Deep Learning (neural network) engines such as CRFsuite, spaCy, Flair, DeLFT (BiLSTM), scikit-learn, Trankit…, possibly combined with word embeddings such as ELMo or BERT. These engines are provided by the platform and updated regularly. The user can thus experiment with the different algorithms by creating training models, evaluate their respective quality and select the one offering the best performance, which may be a compromise between extraction quality and annotation speed, for example.

In our example, just 260 annotated documents were enough to generate a model exceeding 85% average quality. Thanks to recent algorithms and word embeddings, a few hundred or at most a few thousand annotations can be sufficient to put a system into production, so building a training dataset becomes a matter of days.

Experimentation with different algorithms by creating training models and assessing their respective quality.
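Outside the platform, a single such experiment can be sketched with one of the engine families mentioned above, here CRFsuite through the sklearn-crfsuite Python binding. The feature set is deliberately minimal, and the one-sentence dataset exists only to make the snippet self-contained.

```python
# Sketch of one experiment: train a CRF tagger and score it.
# sklearn-crfsuite wraps CRFsuite, one of the engines listed above.
import sklearn_crfsuite
from sklearn_crfsuite import metrics

def word_features(sent, i):
    word = sent[i][0]
    return {
        "lower": word.lower(),
        "is_title": word.istitle(),
        "prev": sent[i - 1][0].lower() if i > 0 else "<s>",
    }

def featurize(sent):
    return [word_features(sent, i) for i in range(len(sent))]

# Tokens paired with IOB labels, as derived from the annotated corpus.
train_sents = [
    [("La", "O"), ("clause", "B-LEGAL"), ("de", "I-LEGAL"),
     ("non-concurrence", "I-LEGAL"), ("est", "O"), ("valide", "O")],
]
X = [featurize(s) for s in train_sents]
y = [[label for _, label in s] for s in train_sents]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
crf.fit(X, y)
print(metrics.flat_f1_score(y, crf.predict(X), average="weighted"))
```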

5. Build a processing chain (NLP pipeline) combining (i) the text annotator built from the seed vocabulary, (ii) the selected training model and (iii) a reconciliation component that automatically distinguishes known terms from new ones. Finally, test the pipeline on a new text and verify that it extracts new terms.

Automatic detection of a relevant new term in a document (annotated in red)
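The reconciliation step itself boils down to comparing extracted terms with the seed vocabulary. A minimal sketch, where extracted_terms stands in for the combined output of the annotator and the trained model:

```python
# Sketch of reconciliation: anything the pipeline extracts that is not
# already in the seed vocabulary becomes a candidate for expert review.
def reconcile(extracted_terms, seed_vocabulary):
    known = {t.lower() for t in seed_vocabulary}
    return sorted({t for t in extracted_terms if t.lower() not in known})

seed = ["clause de non-concurrence", "rupture conventionnelle"]
extracted = ["clause de non-concurrence", "droit à la déconnexion"]
print(reconcile(extracted, seed))  # -> ['droit à la déconnexion']
```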

A scalable system that generates new insights

This system can be used on a larger scale:

  • via the programming interface (REST API) of the Kairntech platform (see the sketch after this list);
  • or by creating a new project on the platform, importing a new set of documents and automatically annotating it with the processing chain already in place. It is then possible to search, browse and filter the corpus and discover new terms.
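As a purely hypothetical sketch of the REST usage: the URL, route, payload and token below are placeholders, not the actual Kairntech API, whose routes and schemas are described in the platform documentation.

```python
# Hypothetical REST call to an annotation pipeline; every identifier
# below (URL, route, token) is a placeholder, not the real Kairntech API.
import requests

resp = requests.post(
    "https://<your-instance>/api/projects/legal/annotate",  # placeholder
    headers={"Authorization": "Bearer <token>"},  # placeholder credentials
    json={"text": "Le droit à la déconnexion s'impose aux employeurs."},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # annotations, including candidate new terms
```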

Although none of this represents a major technological obstacle today, the requirement pursued by Kairntech is that these steps can be carried out by non-computer scientists, i.e. experts with thorough knowledge of the domain but not necessarily programming skills.

Executing these steps produces new terms that not only enrich the initial vocabulary but also highlight recent legal concepts that needed to be added.

Conclusion: a valuable aid accessible to non-computer scientists

Among the many tasks involved in managing and analyzing large quantities of documents are the creation, management and updating of business vocabularies, knowledge bases and dictionaries for a given domain.

The approach described above can be carried out simply and quickly, without any particular programming skills.

This saves the user a considerable amount of time that would otherwise be spent on long, tedious and often very complex work.