More on configuring Wikidata annotators

Users of the Kairntech software have access to a large set of concepts derived from Wikidata to annotate their content. As described in “How to use Wikidata to detect entities” and “Examples to configure specific Wikidata annotators”, users have access to millions of entities, frequently updated, in many languages and covering just about any imaginable topic. With a few clicks users can define an annotator using this broad collection of world knowledge, as described in the first of these texts. In many cases, however, users are not interested in all the concepts accessible via Wikidata; what is often needed instead is an annotator that returns only selected types of entities. The second text outlines a query language that makes it possible to fine-tune the extraction to only the types of entities needed for a specific task, say all diseases, all persons known to be actors or actresses, or all geographical locations.

But the story doesn’t end here. In this text we describe an approach that further extends the range of entity types that can be modelled with Wikidata-based annotators in Kairntech.

Wikidata is not only a large collection of concepts; its concepts are also organized in a hierarchy via “subclass of” and “instance of” relations between concepts and subconcepts. This structure can be used to define your annotator. Say you want to define an annotator that finds examples of the concept “jet airliner”, such as the Boeing 747. Searching on wikidata.org, we see that the Boeing 747 is a “subclass of” the concept “jet airliner”.

So with a simple statement, as explained in “Examples to configure specific Wikidata annotators”, we can specify an annotator that extracts not only the Boeing 747 but hundreds of jet airliners known in Wikidata.

{
  "statements": {
    "$elemMatch": {
      "propertyName": "subclass of",
      "value": {
        "$in": [
          "Q4120025"
        ]
      }
    }
  }
}

Defining an annotator like this (first create a label such as “airplanes” and then a producer of type EntityFishing as explained in How to use Wikidata to detect entities) you can extract occurrences of the Boeing 747 as planned.

But wait a minute, why is the Boeing 747 extracted but not the Airbus A220? Isn’t that also a “jet airliner”? Inspecting the entry for the “Airbus A220” in Wikidata, we see that this airplane is a “subclass of” “narrow-body twinjet airliner”, and only higher up in the Wikidata taxonomy do we find the “jet airliner” used in our definition above. The example shows that the taxonomy of Wikidata is complex and full of surprises, as can be expected from a community effort.

In addition to that, the query language that we have used above currently does not support transitive relations (A is a “subclass of” some B which in turn is a “subclass of” some class C …). So we need to find another way to extract not only the Boeing 747, but also the Airbus A220 and many other airplanes that are not direct daughters of “jet airliner”.
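
To make the difference between a direct and a transitive “subclass of” lookup concrete, here is a minimal Python sketch over a toy taxonomy. The three edges are a simplified, hypothetical stand-in for the real Wikidata hierarchy (in Wikidata there may be further levels between the Airbus A220 and “jet airliner”), and the function names are ours, not part of Kairntech or Wikidata.

```python
from collections import deque

# Toy "subclass of" edges: child -> list of direct parents.
# Simplified assumption for illustration, not a dump of Wikidata.
SUBCLASS_OF = {
    "Boeing 747": ["jet airliner"],
    "narrow-body twinjet airliner": ["jet airliner"],
    "Airbus A220": ["narrow-body twinjet airliner"],
}


def direct_subclasses(parent):
    """Concepts whose direct parent is `parent` (what the simple query sees)."""
    return {child for child, parents in SUBCLASS_OF.items() if parent in parents}


def all_subclasses(root):
    """Concepts reachable from `root` by following "subclass of" downwards."""
    found = set()
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for child in direct_subclasses(current):
            if child not in found:
                found.add(child)
                queue.append(child)
    return found


# The direct lookup misses the Airbus A220; the transitive one finds it.
print(sorted(direct_subclasses("jet airliner")))
print(sorted(all_subclasses("jet airliner")))
```

In this toy data the direct lookup returns the Boeing 747 and the intermediate class, while the transitive lookup also reaches the Airbus A220. This is the behaviour we need to reproduce by other means.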

So what now? How can we define such an annotator? Below we outline an approach in several steps that does exactly that.

We start by installing a little tool that lets us inspect the Wikidata taxonomy comfortably from the command line. The tool is called “wdtaxonomy”, is implemented in Node.js and can be installed with the single command “sudo npm install -g wikidata-taxonomy”, provided that you have Node.js installed on your machine.
Once you have wdtaxonomy installed, you can query it for the hierarchy of concepts underneath, for instance, “jet airliner” by typing “wdtaxonomy Q4120025”, where Q4120025 is the Wikidata ID of this concept.

A rich list of individual airplanes is returned and we see that many are not direct daughters of “jet airliner” but further down in the hierarchy, such as our “Airbus A220”.

So all that is left to do is to collect the Wikidata identifiers returned by wdtaxonomy and use them in a statement for a jet airliner annotator in the Kairntech software. Note: the command “wdtaxonomy Q4120025” returns almost 400 Wikidata concepts, so you may want to harvest them with a suitable script.

{
  "statements": {
    "$elemMatch": {
      "propertyName": "subclass of",
      "value": {
        "$in": [
          "Q4120025",
          "Q179",
          "Q8791",
          "Q57810792",
          "Q218990",
          "Q499066",
          "Q906937",
          "Q1344448",
          ...

(Note that the list in our case will be much longer.) Using the full expression will allow our extractor to extract all the jet airliners known to Wikidata.
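
The harvesting step can be scripted. The following Python sketch assumes that the output of wdtaxonomy is available as text and that each concept appears with its identifier in parentheses, as in “jet airliner (Q4120025)”; please check this against the output of your installed version. The function names, the sample text and its placeholder labels are illustrative only.

```python
import json
import re

# Assumption: every concept line contains its Wikidata ID in parentheses.
QID_IN_PARENS = re.compile(r"\((Q\d+)\)")


def extract_qids(taxonomy_text):
    """Return the Wikidata IDs found in the text, in order, without duplicates."""
    seen = {}
    for qid in QID_IN_PARENS.findall(taxonomy_text):
        seen.setdefault(qid, None)  # a dict keeps first-seen order
    return list(seen)


def build_statement(qids, property_name="subclass of"):
    """Build the query structure used in this article for a list of IDs."""
    return {
        "statements": {
            "$elemMatch": {
                "propertyName": property_name,
                "value": {"$in": list(qids)},
            }
        }
    }


# Hypothetical sample in the assumed format; the labels are placeholders.
sample = """root concept (Q4120025)
  first child (Q179)
    grandchild (Q8791)
  first child (Q179)
"""

query = build_statement(extract_qids(sample))
print(json.dumps(query, indent=2))
```

To apply this to real data, redirect the output of “wdtaxonomy Q4120025” into a file, read that file into a string and pass it to extract_qids. The printed JSON can then be pasted into the annotator definition.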

The approach outlined above makes it possible to define annotators that benefit from the wealth of information contained in the Wikidata dataset integrated in Kairntech. As always, this explanation can only be a start. The wdtaxonomy tool, for instance, offers many more interesting options (type “wdtaxonomy --help” to get an overview). Future versions of Kairntech may integrate some of the above into the software directly, but in any case the approach introduced here makes it possible to model countless concepts in detail, such as the hundreds of jet airliners, energy storage systems, integrated circuits, etc.

Examples to configure specific Wikidata annotators

How to use Wikidata to detect entities?