Maintain business vocabularies with AI
Structured business vocabularies (thesauri, taxonomies…) play an important role in many applications where complex, large and volatile information needs to be organized and made accessible. A fine example is the well-known MeSH thesaurus, which facilitates search and access on medical topics. Enriching scientific content with MeSH terms helps ensure that content on a specific topic (say “Diabetes Mellitus, Type 2”) can be found even if a given paper uses only one of its many frequently used synonyms (such as “NIDDM”, “Adult Onset Diabetes Mellitus”, etc.).
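To make the synonym idea concrete, here is a minimal sketch of how a thesaurus can map every label to one preferred term. The dictionary layout and the helper names build_index and normalize are assumptions made for illustration; they do not reflect the actual MeSH data format.

```python
# Toy thesaurus: preferred term -> synonyms (entries taken from the example above).
THESAURUS = {
    "Diabetes Mellitus, Type 2": ["NIDDM", "Adult Onset Diabetes Mellitus"],
}


def build_index(thesaurus):
    """Map every label (preferred or synonym), lowercased, to its preferred term."""
    index = {}
    for preferred, synonyms in thesaurus.items():
        for label in [preferred, *synonyms]:
            index[label.lower()] = preferred
    return index


def normalize(term, index):
    """Return the preferred term for a label, or None if the label is unknown."""
    return index.get(term.strip().lower())


INDEX = build_index(THESAURUS)
print(normalize("NIDDM", INDEX))  # Diabetes Mellitus, Type 2
```

A document tagged with the preferred term can then be retrieved whichever synonym the searcher or the author happened to use.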
So many new terms, so little time
Thesauri, however, often need to be updated constantly as their subject evolves: new terms become relevant as new scientific discoveries are made and new technology emerges. Updating vocabularies can be a cumbersome and time-consuming task when there are tens of thousands of topics to manage.
The case of technology watch
Kairntech client TecIntelli provides its customers with detailed insights into highly competitive and volatile technology markets. Take the field of battery technologies: batteries have become an essential ingredient of tomorrow’s mobility, and technological progress is happening at a fast pace, relentlessly and in many places around the world. Who is proposing which battery technology? Where are the key hotspots? Which players form alliances or announce new benchmarks? Tracking these questions with purely manual approaches quickly becomes infeasible, and when you serve one client in this field and others on equally complex topics, it is evident that AI support is sorely needed.
The seed vocabulary approach
The approach described here rests on the assumption that a seed vocabulary of domain-specific terms (in our case, known battery types and technologies) exists. We also have a large body of domain-specific publications in which battery technologies and markets are discussed. The challenge is to ensure that, as the field evolves, new technologies that merit inclusion in the thesaurus are identified and brought to the attention of the expert.
How to proceed?
The Kairntech answer to this challenge comprises the following steps:
Create a project of type NER (Named Entity Recognition) and import the corpus of content.
Import the seed vocabulary in formats like xls, csv, skos, txt…
Search & browse your seed vocabulary.
Configure a text annotator (“Gazetteer”) with the seed vocabulary using the PhraseMatcher engine provided by the platform, and synchronize it to get ready for annotation.
Automatically annotate the whole corpus with the PhraseMatcher Gazetteer.
Review the dataset quality using the search engine, the navigation and filtering features and the various statistics. Spot possible inconsistencies in the annotations that may affect the quality of the model you are going to train.
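For readers who want to see what gazetteer-based annotation amounts to, the steps above can be sketched in a few lines of plain Python. This is a simplified stand-in built on regular expressions, not the platform's PhraseMatcher engine; the function names, the KNOWN_TERM label and the seed terms are hypothetical and chosen only for illustration.

```python
import re


def build_gazetteer(terms):
    """Compile one case-insensitive pattern; longer terms come first so they win."""
    ordered = sorted(set(terms), key=len, reverse=True)
    alternatives = "|".join(re.escape(t) for t in ordered)
    # Lookarounds ensure a term is not matched inside a longer word.
    return re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)", re.IGNORECASE)


def annotate(text, gazetteer, label="KNOWN_TERM"):
    """Return (start, end, label, surface form) for every seed-term match."""
    return [(m.start(), m.end(), label, m.group(0)) for m in gazetteer.finditer(text)]


# Hypothetical seed vocabulary and sample sentence.
seed_terms = ["lithium-ion battery", "lithium-ion", "solid-state battery"]
gazetteer = build_gazetteer(seed_terms)
text = "The new solid-state battery outperforms a lithium-ion battery."
annotations = annotate(text, gazetteer)
for annotation in annotations:
    print(annotation)
```

Annotations produced this way contain only terms that are already in the vocabulary, which is why a trained model is needed in the next steps to generalize to unseen terms.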
Pre-packaged machine and deep learning models
Train a machine learning model on the annotated corpus using one of the engines provided by the platform (CRF-Suite, spaCy, DeLFT, Flair). Let’s configure a first experiment with Flair:
Then launch the cloud-based training process.
Then build an annotation pipeline combining the PhraseMatcher Gazetteer and your Flair model, so that you can extract both known entities (from the seed vocabulary) and unknown ones.
Test the annotation pipeline (annotation plan) on the Test page with a sample text and check the results:
If you are happy with the results of these individual tests, you can use the annotation pipeline at a larger scale:
- through the REST API;
- or by creating a new project in the Kairntech platform, importing a corpus and automatically annotating it with the pipeline. You will then be able to search, navigate and filter your corpus and discover new terms.
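Combining the two annotators raises the question of what to do when both mark the same passage. The sketch below shows one possible policy, assumed here for illustration: gazetteer spans are always kept, and model spans are added only where they do not overlap a gazetteer span. The span format, the labels and the sample spans are hypothetical and are not taken from the Kairntech platform.

```python
def overlaps(a, b):
    """True if two (start, end, ...) spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def merge_annotations(known, predicted):
    """Keep all gazetteer spans; add model spans that overlap none of them."""
    merged = list(known)
    for span in predicted:
        if not any(overlaps(span, k) for k in known):
            merged.append(span)
    return sorted(merged, key=lambda s: s[0])


# Hypothetical annotator outputs for "sodium-ion cells vs lithium-ion cells".
known = [(20, 31, "KNOWN_TERM", "lithium-ion")]
predicted = [
    (0, 10, "CANDIDATE", "sodium-ion"),
    (20, 31, "CANDIDATE", "lithium-ion"),
]
merged = merge_annotations(known, predicted)
for span in merged:
    print(span)
```

With this policy, everything still labelled CANDIDATE after merging is by construction a term the seed vocabulary does not cover at that position.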
None of the above is a major technological obstacle today. However, the Kairntech requirement is to ensure that these steps are applicable for non-programmers, that is, domain experts with a thorough knowledge of the domain but not necessarily with data science expertise.
Executing these steps yields a list of newly found terms that potentially extend the imported vocabulary. Some of them point to recent technologies that indeed represent valid updates and extensions of the imported vocabulary.
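How such a list of candidates might be derived from the pipeline output can be sketched as follows: extracted mentions that are absent from the seed vocabulary are counted and ranked by frequency. The function name candidate_terms, the min_count threshold and the sample mentions are assumptions for illustration only.

```python
from collections import Counter


def candidate_terms(entity_mentions, seed_vocabulary, min_count=2):
    """Count mentions absent from the seed vocabulary, most frequent first."""
    known = {t.lower() for t in seed_vocabulary}
    counts = Counter(m.lower() for m in entity_mentions if m.lower() not in known)
    return [(term, n) for term, n in counts.most_common() if n >= min_count]


# Hypothetical seed vocabulary and extracted mentions.
seed = ["lithium-ion", "solid-state battery"]
mentions = ["Lithium-ion", "sodium-ion", "Sodium-ion", "lithium-sulfur", "sodium-ion"]
print(candidate_terms(mentions, seed))  # [('sodium-ion', 3)]
```

A frequency threshold is only one way to filter noise; the final decision on whether a candidate enters the thesaurus remains with the domain expert.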
Conclusion
Maintaining domain-specific business vocabularies with AI support makes particular sense when large volumes of textual content are involved. The approach described above is accessible to business users without programming expertise. The Kairntech solution shortens time-to-value with pre-packaged NLP components.