Introduction

Structured vocabularies (thesauri, taxonomies…) play an important role in many applications where complex, large and volatile information needs to be organized and made accessible. A fine example is the well-known MeSH thesaurus, which facilitates search and access on medical topics. Enriching scientific content with MeSH terms guarantees that content on a specific topic (say “Diabetes Mellitus, Type 2”) can be found even if only one of its many frequently used synonyms appears in a given paper (such as “NIDDM”, “Adult Onset Diabetes Mellitus”, etc.).

So many new terms, so little time

Thesauri, however, need to be constantly updated as their subject evolves: new terms become relevant as new scientific discoveries are made and new technology emerges. Updating vocabularies can be a cumbersome and time-consuming task when there are tens of thousands of topics to manage.

The case of technology watch

Kairntech client TecIntelli serves its customers with detailed insights into highly competitive and volatile technological markets. Take the highly relevant field of battery technologies: batteries have become an essential ingredient of tomorrow’s mobility, and technological progress is happening relentlessly, at a fast pace and in many places around the world. Who is proposing which battery technology? Where are the key hotspots? Which players form alliances or announce new benchmarks? Tracking these questions with purely manual approaches quickly becomes infeasible, and when you serve one client on this technological field and others on equally complex topics, it is evident that AI support is dearly needed.

The approach with seed vocabulary

The approach we describe here rests on the assumption that a seed vocabulary of domain-specific terms (in our case, known battery types and technologies) exists. We also assume a large body of domain-specific publications in which battery technologies and markets are discussed. The challenge is to ensure that, as the field evolves, new technologies that merit inclusion in the thesaurus are identified and brought to the attention of the expert.

How to proceed?

The Kairntech answer to this challenge comprises the following steps:

Create a project of NER type (Named Entity Recognition) and import the corpus of content.

Import the seed vocabulary in formats like xls, csv, skos, txt…
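
There is no single required file layout; as a rough code-level illustration, here is a minimal sketch of loading a seed vocabulary from a CSV export, assuming one concept per row with a preferred label and pipe-separated synonyms (the file name and column names are illustrative, not a format prescribed by the platform):

```python
import csv

# Hypothetical CSV export of the seed vocabulary, e.g.:
#   prefLabel,altLabels
#   Lithium-ion battery,Li-ion|LIB
#   Solid-state battery,SSB
seed_terms = {}
with open("battery_seed_vocabulary.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        synonyms = [s.strip() for s in row.get("altLabels", "").split("|") if s.strip()]
        seed_terms[row["prefLabel"]] = synonyms

print(f"{len(seed_terms)} concepts loaded")
```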

Search & browse your seed vocabulary.

Configure a text annotator (“Gazetteer”) with the seed vocabulary using the PhraseMatcher engine provided by the platform and sync it up to get ready for annotation!

Automatically annotate the whole corpus with the PhraseMatcher Gazetteer.
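
The Gazetteer itself is configured and run inside the platform. To give a feel for the underlying idea, here is a minimal sketch using spaCy’s PhraseMatcher outside the platform, reusing the seed_terms dictionary from the previous sketch to label every occurrence of a known term in a small document list:

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")                            # tokenizer only, no trained model needed
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")   # case-insensitive matching

# Register every preferred label and synonym from the seed vocabulary as a pattern.
patterns = [nlp.make_doc(t) for label, syns in seed_terms.items() for t in [label, *syns]]
matcher.add("BATTERY_TECH", patterns)

documents = ["The prototype pairs a solid-state battery with a classical Li-ion cell."]

annotations = []
for i, doc in enumerate(nlp.pipe(documents)):
    for match_id, start, end in matcher(doc):
        span = doc[start:end]
        annotations.append((i, span.start_char, span.end_char, span.text))

print(annotations)
```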

Review the dataset quality using the search engine, the navigation and filtering features and various statistics. Spot possible inconsistencies in the annotations, which may affect the quality of the model you are going to train!
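
A simple sanity check of this kind can also be scripted. The sketch below, reusing the annotations list from the previous example, counts how often each surface form was matched and flags seed concepts that never occur; such gaps can hint at spelling variants or vocabulary drift:

```python
from collections import Counter

# How often was each surface form matched by the gazetteer?
counts = Counter(surface.lower() for _, _, _, surface in annotations)
print(counts.most_common(10))

# Seed concepts whose label and synonyms never matched may hint at
# spelling variants, tokenization issues or vocabulary drift.
never_matched = [
    label for label, syns in seed_terms.items()
    if not any(t.lower() in counts for t in [label, *syns])
]
print("never matched:", never_matched[:20])
```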

Train a machine learning model on the annotated corpus using one of the engines provided by the platform (CRF-Suite, spaCy, Delft, Flair). Let’s configure a first experiment with Flair.
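
In the platform this is a matter of a few clicks. To give a feel for what a comparable Flair NER experiment looks like in code, here is a minimal sketch assuming the annotated corpus has been exported to CoNLL-style column files (folder and file names are illustrative):

```python
from flair.datasets import ColumnCorpus
from flair.embeddings import TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# CoNLL-style export of the auto-annotated corpus: one token and one BIO tag per line.
columns = {0: "text", 1: "ner"}
corpus = ColumnCorpus("export/", columns,
                      train_file="train.txt", dev_file="dev.txt", test_file="test.txt")

label_type = "ner"
label_dict = corpus.make_label_dictionary(label_type=label_type)

tagger = SequenceTagger(
    hidden_size=256,
    embeddings=TransformerWordEmbeddings("bert-base-cased"),
    tag_dictionary=label_dict,
    tag_type=label_type,
)

ModelTrainer(tagger, corpus).train(
    "models/battery-ner", learning_rate=0.1, mini_batch_size=32, max_epochs=10
)
```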

Launch the training process; it runs in the cloud, so there is nothing to take care of on your side.

Build an annotation pipeline combining the PhraseMatcher Gazetteer and your Flair model, so as to extract both known entities (from the seed vocabulary) and unknown ones.
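
Within the platform, the pipeline is assembled without code. Conceptually, it boils down to merging the gazetteer matches with the model predictions, for instance giving the gazetteer precedence where spans overlap. A sketch of that idea, reusing the nlp and matcher objects introduced above:

```python
from flair.data import Sentence
from flair.models import SequenceTagger

flair_tagger = SequenceTagger.load("models/battery-ner/final-model.pt")

def annotate(text):
    """Return (start, end, surface, source) spans from gazetteer and model."""
    spans = []

    # 1) Known entities from the seed vocabulary (gazetteer).
    doc = nlp(text)
    for _, start, end in matcher(doc):
        s = doc[start:end]
        spans.append((s.start_char, s.end_char, s.text, "gazetteer"))

    # 2) Model predictions, kept only where they do not overlap a gazetteer match.
    sentence = Sentence(text)
    flair_tagger.predict(sentence)
    for ent in sentence.get_spans("ner"):
        overlaps = any(ent.start_position < e and ent.end_position > s
                       for s, e, _, _ in spans)
        if not overlaps:
            spans.append((ent.start_position, ent.end_position, ent.text, "model"))

    return sorted(spans)
```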

Test the annotation pipeline (annotation plan) on the Test page with a sample text and check the results.

If you are happy with unit tests, you can use this annotation pipeline at a larger scale:

  • through the REST API (see the sketch below);
  • or by creating a new project in the Kairntech platform, importing a corpus and annotating it automatically with the pipeline. You will then be able to search, navigate & filter your corpus and discover new terms.
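
The Kairntech REST API exposes such pipelines programmatically; the exact endpoint and payload schema depend on your installation, so the URL, route and field names below are placeholders rather than the documented API:

```python
import requests

# Placeholder URL, route and payload — check the API documentation of your
# Kairntech installation for the actual endpoint and schema.
API_URL = "https://your-kairntech-server/api/projects/battery-watch/annotate"

response = requests.post(
    API_URL,
    headers={"Authorization": "Bearer <your-token>"},
    json={"text": "The cell pairs a silicon anode with a sulfide solid electrolyte."},
    timeout=30,
)
response.raise_for_status()
print(response.json())
```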

While none of this is a major technological obstacle anymore today, the real requirement for us at Kairntech is to ensure that these steps are also applicable by non-programmers, i.e. domain experts with a thorough knowledge of their domain but not necessarily with data science expertise.

Executing these steps yields a list of newly found terms that potentially extend the imported vocabulary; some of them point to recent technologies that indeed represent valid updates and extensions of that vocabulary.
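
One straightforward way to turn the pipeline output into such a candidate list is to keep only model-detected spans whose surface form is not already in the seed vocabulary and to rank them by frequency, so that the expert reviews the most prominent candidates first. A sketch, again reusing the names from the examples above:

```python
from collections import Counter

known = {t.lower() for label, syns in seed_terms.items() for t in [label, *syns]}

candidates = Counter()
for text in documents:
    for _, _, surface, source in annotate(text):
        if source == "model" and surface.lower() not in known:
            candidates[surface.lower()] += 1

# The most frequent unknown terms are shown to the domain expert as
# candidate additions to the thesaurus.
for term, freq in candidates.most_common(25):
    print(f"{freq:4d}  {term}")
```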

Conclusion

Among the many scenarios that experts face today when large volumes of textual content need to be managed and analysed is the efficient creation, management and updating of domain-specific vocabularies. The approach described here can be applied without programming and saves the user considerable effort that would otherwise have to be invested in manual work.