Introduction

The Kairntech software comes with a built-in entity extraction component that users can add to their projects with just a few mouse clicks. It offers a powerful additional option to analyse and annotate text content using existing world knowledge: The component gives you access to tens of millions of concepts in several languages and on almost all imaginable topics.

In this document we first describe the component, the underlying data repository and how to add it to your project. We then show how you can customize the component to reflect your specific needs.

Entity Extraction using Wikidata

One important functionality implemented in Kairntech is a large, general-purpose, prepackaged Entity Extraction component. If, for instance, your task is to recognize the locations, animals or proteins in a text corpus, the good news is that Kairntech comes with all of this (and many more such concepts) prepackaged and ready to be used. Kairntech contains an instance of the Wikidata database. To make this vast knowledge resource operational, we embed the “Entity Fishing” system, implemented by Kairntech’s Chief Machine Learning expert Patrice Lopez.

This setup provides Kairntech with a very versatile entity extraction capability: entities from almost any topic are recognized, in many languages (English, German, French, Arabic, Italian and Spanish are available by default, others on request). Entities are disambiguated: the component picks the proper meaning depending on the context, without the need for manual rule-writing. Results are linked to background information, and Kairntech takes care to constantly update the underlying database.

Entity Extraction on just about any topic

As a result, users can annotate & enrich their content out-of-the-box. In the example below, note how the occurrences such as “Uranus” are linked to background information about this entity. 

In what we show above, we have just defined a new default Entity Recognizer: all the concepts that the component recognizes are mapped to the same label, “DefaultLabel”, no matter whether the concept is a planet like “Uranus”, an organization like “NASA” or a spacecraft such as “Voyager 2”. Often this type of extraction will be sufficient, but there will be cases where you want more specific results: all the places, all the proteins or all the composers mentioned in a document. The next section shows how to customize the extraction for this type of scenario.

Setting up specific extraction components: Making use of higher-level Wikidata types

Say, we need to annotate biomedical content with only the proteins mentioned in the text. We don’t care about planets and space probes in this context. So we want to use the Entity Recognition approach from above but restrict it to only return proteins. Fortunately, this is straightforward in Kairntech and requires only a few steps to set up.

  • We start by checking in Wikidata what information we can use for that filter. For instance, we search for the protein “tumor necrosis factor” (or any other protein). We find that this protein is known in Wikidata to be an “instance of” the type “protein”.

  • Here we see that this is correctly declared as an “instance of” the concept “protein”. Moving the mouse over the word “protein” in the browser shows us that the identifier of protein is Q8054.

  • Let’s make use of this information and proceed in the Kairntech software: we define a new label “Protein”.

  • And then define a new annotator that takes care of the proteins. In order to do this, we click the gear icon on the bottom left and create the new annotator of type “Entity Fishing” and give it an appropriate name, say “Proteins”.

  • We then need to specify what this annotator should process. In “mapped labels” we select the “Protein” label we have defined above, and in the text field to its right we define what counts as a protein for us. Clicking on the pen icon allows us to edit this.
  • We can then define our protein concept:

  • Voilà, we have a new protein annotation component. Let’s try it out. Let’s go to the test page, enter a test sentence containing a protein, select the new “Protein” annotator and press annotate.
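
Conceptually, the restriction configured in the steps above amounts to keeping only those recognized entities whose “instance of” types include Q8054 and attaching the “Protein” label to them. The following is a minimal sketch of that idea, not Kairntech's or Entity Fishing's actual implementation: the Entity structure, the label_entities function and the sample data are all hypothetical, and “Q_PLANET” and “Q_ORG” are placeholders rather than real Wikidata identifiers. Only Q8054 comes from the text.

```python
from dataclasses import dataclass

PROTEIN_QID = "Q8054"  # Wikidata identifier of "protein", as found in the steps above


@dataclass
class Entity:
    """A recognized mention. Hypothetical structure, not Kairntech's output format."""
    text: str
    instance_of: frozenset = frozenset()  # Wikidata types the entity is an "instance of"


def label_entities(entities, type_to_label):
    """Return (text, label) pairs for entities whose types match a mapped label.

    Entities that match none of the mapped types are dropped.
    """
    labelled = []
    for entity in entities:
        for type_qid, label in type_to_label.items():
            if type_qid in entity.instance_of:
                labelled.append((entity.text, label))
                break  # one label per entity is enough for this sketch
    return labelled


# Made-up sample input; "Q_PLANET" and "Q_ORG" are placeholders, not real IDs.
sample = [
    Entity("tumor necrosis factor", frozenset({PROTEIN_QID})),
    Entity("Uranus", frozenset({"Q_PLANET"})),
    Entity("NASA", frozenset({"Q_ORG"})),
]

proteins = label_entities(sample, {PROTEIN_QID: "Protein"})
print(proteins)
```

With this sample, only the protein mention survives the filter; the planet and the organization are discarded because their types are not mapped to any label.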

So now we have defined a new annotator that will recognize countless protein names and their many synonyms and will link each of them to background information about the recognized entity. We did this by accessing Wikidata.org and checking information about known proteins and their associated properties, in particular the fact that protein instances share the Wikidata property of being an “instance of” the Wikidata concept “Protein” (Q8054).
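
The manual check on Wikidata.org can also be done with a query. As a hedged sketch, the snippet below builds a SPARQL query that lists a few items declared as “instance of” (Wikidata property P31) a given type, together with a request URL for the public Wikidata Query Service. It only constructs the query and the URL and sends nothing over the network. The function names are our own, and none of this is part of the Kairntech software.

```python
from urllib.parse import urlencode

INSTANCE_OF = "P31"     # Wikidata property "instance of"
PROTEIN_QID = "Q8054"   # Wikidata item "protein", as identified in the text
ENDPOINT = "https://query.wikidata.org/sparql"  # public Wikidata Query Service


def build_instances_query(type_qid, limit=5):
    """Build a SPARQL query listing items that are an instance of type_qid."""
    return (
        "SELECT ?item ?itemLabel WHERE {\n"
        f"  ?item wdt:{INSTANCE_OF} wd:{type_qid} .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}\n"
        f"LIMIT {limit}"
    )


def build_request_url(query):
    """Encode the query as a GET URL that asks the service for JSON results."""
    return ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


query = build_instances_query(PROTEIN_QID)
url = build_request_url(query)
print(query)
print(url)
```

Pasting the printed query into the Wikidata Query Service web interface should return a handful of proteins, which is a quick way to confirm that a type identifier selects the kind of entity you expect before using it in an annotator.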

Conclusion

We have outlined how to easily define an annotator in Kairntech that can annotate concepts and entities on almost any topic, in many languages, disambiguated, typed, scored and linked and constantly updated to take new concepts into account as the world around us changes. 

We then used this approach to define specific, custom annotation components that return precisely the concepts (proteins, aircraft manufacturers or planets) required for our use case.

This capability, which lets users not only train their own annotators based on examples but also benefit from vast amounts of public knowledge conveniently packaged in the software, makes Kairntech even more readily usable for a wide range of scenarios.

We have omitted some detailed steps and recommendations here for the sake of brevity. If you are looking for a more step-by-step tutorial on how to set this up as a Kairntech user, please check the explanations here. If you want to learn more about Kairntech Entity Extraction or Kairntech in general, don’t hesitate to get in touch with us at info@kairntech.com.