A built-in entity extraction component offers a powerful way to analyze and annotate text content using existing world knowledge. This component gives you access to tens of millions of concepts in several languages and on almost every imaginable topic.
In this document we first describe the component, the underlying data repository and how to add it to your project, and then show how you can customize the component to reflect your specific needs.
Entity extraction using Wikidata
The initial objective is to train machine learning models to recognize and extract information from text. It is often sufficient to “show” the system a few dozen examples (i.e. annotate text with the mouse and let the system learn from it) in order to be able to reliably process content.
However, in many cases it is not really necessary to start learning a new concept from scratch, because the respective knowledge already exists in the public domain.
If, for instance, your objective is to recognize locations, animals or proteins in a text corpus, a prepackaged model exists to extract information from the Wikidata knowledge base. See also What is Knowledge Base?
To make this vast knowledge resource operational, the so-called “entity fishing” model is used:
- Entities from a wide range of topics are recognized
- In many different languages: English, German, French, Arabic, Dutch, Italian and Spanish are available by default, other languages on request
- Results are disambiguated: for instance, “tornado” may be a kind of storm or a military aircraft. The software picks the correct meaning depending on the context without manual rule writing.
- Results are linked: for many concepts additional valuable information is known. A location has a geographical latitude and longitude, a company may have a domain, a protein an amino acid sequence.
- Results are regularly updated: the world around us is constantly evolving. New products are developed, new people are elected into office, new companies are founded. The underlying knowledge sources are regularly updated so that your analyses keep up with a fast-changing world.
Adding Wikidata to your project
Perform the following steps:
- Create an Entity detection project (or open an existing project). It is not mandatory to import documents if you only want to use Wikidata.
- Create a new label which will refer to all these concepts, say “DefaultLabel”
- Go to the Settings menu in the lower left corner
- … and create a new Suggestion Producer of the type EntityFishing
- Specify a name for the new component (say “MyDefaultEntityExtractor”) and then select the “DefaultLabel” in the “default_label” line. Do not forget to save your new component at the bottom of the page.
- You can test the entity extraction by going to the annotation tests view. Copy and paste a piece of text into the text field, then select the newly created “MyDefaultEntityExtractor” and press Annotate.
- Text is now annotated with known concepts, also referred to as world knowledge. Please note that occurrences such as “Uranus” are correctly linked to the meaning of the entity.
So far only default values were selected: all the concepts that the component recognizes are mapped to the same label, “DefaultLabel”. The default settings also include, for instance, the minimal score a concept must reach in order to be extracted.
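The effect of these defaults can be illustrated with a small Python sketch. It is not the component's implementation: the `Candidate` class, the `apply_defaults` function, the threshold of 0.5, the scores and the placeholder identifiers are all assumptions made up for illustration. It only shows the idea that candidates below a minimal score are dropped and every remaining one receives the same label.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A recognized mention (hypothetical structure, for illustration only)."""
    text: str
    wikidata_id: str  # placeholder ids below, not real Wikidata identifiers
    score: float      # made-up confidence values


def apply_defaults(candidates, default_label="DefaultLabel", min_score=0.5):
    """Drop candidates below min_score and map the rest to one default label."""
    return [
        {"text": c.text, "wikidata_id": c.wikidata_id, "label": default_label}
        for c in candidates
        if c.score >= min_score
    ]


candidates = [
    Candidate("Uranus", "Q-PLANET-PLACEHOLDER", 0.91),
    Candidate("tornado", "Q-STORM-PLACEHOLDER", 0.32),
]
annotations = apply_defaults(candidates)
for annotation in annotations:
    print(annotation)
```

With these invented scores only the first candidate survives the threshold, and it carries the single default label.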
Often this type of extraction will be sufficient, but there will be cases where you may want to get more specific results: all the places, all the proteins or all the composers mentioned in a document.
Extraction with higher abstraction level Wikidata types
As an example, suppose that only proteins need to be analyzed.
- Check in Wikidata (wikidata.org) what information we can use to filter on. For instance, if we search for the protein “tumor necrosis factor” we find that this protein is known in Wikidata to be an instance of the type protein.
- The protein is correctly declared as an instance of the concept protein. Moving the mouse over the word protein in the browser shows us that the identifier of this concept is Q8054.
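The same check can be done programmatically once an entity record is available. As a minimal sketch, assuming the record follows Wikidata's standard JSON layout, in which “instance of” is property P31, the following Python function lists the types an entity is declared to be an instance of. The sample record is hand-trimmed for illustration and its id “Q-EXAMPLE” is a placeholder, not the real identifier of the protein; only Q8054 comes from the text above.

```python
def instance_of_ids(entity: dict) -> list[str]:
    """Return the Wikidata ids found in the entity's "instance of" (P31) statements."""
    ids = []
    for statement in entity.get("claims", {}).get("P31", []):
        snak = statement.get("mainsnak", {})
        # Statements of type "novalue" or "somevalue" carry no datavalue.
        if snak.get("snaktype") != "value":
            continue
        ids.append(snak["datavalue"]["value"]["id"])
    return ids


# Hand-trimmed sample in Wikidata's JSON layout; "Q-EXAMPLE" is a placeholder id.
sample_entity = {
    "id": "Q-EXAMPLE",
    "labels": {"en": {"language": "en", "value": "tumor necrosis factor"}},
    "claims": {
        "P31": [
            {
                "mainsnak": {
                    "snaktype": "value",
                    "property": "P31",
                    "datavalue": {
                        "type": "wikibase-entityid",
                        "value": {"entity-type": "item", "id": "Q8054"},
                    },
                }
            }
        ]
    },
}

print(instance_of_ids(sample_entity))  # ['Q8054']
```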
To make use of this information:
- Go to the Labels view
- Define a new label, for instance Proteins.
- Go to settings at the bottom left and edit the existing annotator (or create a new one with the type Entity Fishing and give it an appropriate name, say Proteins).
- Select the correct label in the section Mapped labels
- In the text field define a protein by clicking on the pen icon.
- Go to the test page, enter a test sentence containing a protein, select the annotator and press Annotate.
We have now defined a new annotator that will recognize countless protein names and their many synonyms and will link them to background information about the recognized entities.
This was achieved by accessing Wikidata.org and checking what is known about proteins there, in particular the fact that protein instances share the Wikidata property of being an “instance of” the Wikidata concept “Protein” (Q8054). The process required looking up this information in Wikidata and using it in a JSON expression written in a specific query language. Information on the query syntax options can be found here.
While the taxonomy of higher-level concepts in Wikidata is immensely rich, it is not always totally consistent.
For instance, suppose you want to extract all company names and start by checking the information on Boeing: you find that Boeing is an “instance of” (among others) the concepts “business” (Q4830453) and “enterprise” (Q6881511). But when you then verify the information on “Embraer”, another aircraft manufacturer, you find that for some reason it is not listed as “business” or “enterprise” but only as “public company” and “aerospace manufacturer”. This makes defining a comprehensive annotator for all companies, or even just all aircraft manufacturers, a little less straightforward than one would hope.
One has to live with the fact that Wikidata is a community effort and spend an extra moment to collect the appropriate information from Wikidata when defining a new concept.
Please note that the JSON expression above for the “proteins” contained a list of identifiers (here containing only the concept Q8054). This list can contain several identifiers.
Be prepared to spend this extra moment collecting the required concept identifiers from Wikidata.
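Why a longer identifier list helps can be shown with a small Python sketch of the underlying matching logic. This is not the component's query syntax: the function names are hypothetical, and “ID_PUBLIC_COMPANY” and “ID_AEROSPACE_MANUFACTURER” are placeholders standing in for identifiers that would have to be looked up in Wikidata. Only Q4830453 and Q6881511 are taken from the text above.

```python
def matches_types(instance_of: set[str], wanted: set[str]) -> bool:
    """True if the entity is an instance of at least one wanted type."""
    return bool(instance_of & wanted)


def select(entities: dict[str, set[str]], wanted: set[str]) -> list[str]:
    """Names of all entities matching the wanted types, in alphabetical order."""
    return sorted(
        name for name, types in entities.items() if matches_types(types, wanted)
    )


# "Instance of" values as described in the text; the ID_* strings are placeholders.
entities = {
    "Boeing": {"Q4830453", "Q6881511"},
    "Embraer": {"ID_PUBLIC_COMPANY", "ID_AEROSPACE_MANUFACTURER"},
}

narrow = {"Q4830453", "Q6881511"}  # business, enterprise
broad = narrow | {"ID_PUBLIC_COMPANY", "ID_AEROSPACE_MANUFACTURER"}

print(select(entities, narrow))  # ['Boeing']
print(select(entities, broad))   # ['Boeing', 'Embraer']
```

A filter built from the first two identifiers alone misses Embraer; adding the types under which Embraer is actually listed closes the gap.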