Introduction

Information extraction typically targets one of two situations:

  1. Extracting entities from an existing vocabulary, or
  2. Creating an extraction model from scratch when no vocabulary exists.

In practice, however, the situation is often a mixture of the two extremes: an incomplete business vocabulary exists and needs to be completed with additional relevant entities of the same type.

Example #1: People name extraction

Extracting the names of people working for a company is a common scenario, and it corresponds to the first case: a company will most probably have some kind of staff list, and any new hire will appear on it within a few days.

Recognizing person names in general, however, is clearly of the second type: there is no complete list of all the people in the world, let alone of name variants, fictional persons and the like.

In this case the following steps are required:

  • Train a model from scratch on annotated corpora in which person names have been tagged, so that the system can generalize from the learnt observations to yet unseen names,
  • Build an annotation pipeline combining the model with a Knowledge Base annotator to automatically detect, in an incoming text, the known as well as the unknown person names.

How do I proceed to automatically find new names?

Thanks to Wikidata, which includes a huge number of person names (a built-in feature of the Kairntech suite), a dedicated model for person name extraction can be set up quickly:

First, a training document set (corpus) is labelled with the person names that are known in Wikidata, as shown below:
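For readers curious about what this labelling step boils down to outside the point-and-click interface, here is a minimal Python sketch of gazetteer-based pre-annotation; it assumes a list of person names has already been exported from Wikidata (the two names and the sample sentence are placeholders for illustration only):

    import re

    # Assumption: a set of person names exported from Wikidata beforehand
    # (the two names below are placeholders used purely for illustration).
    known_persons = {"Marie Curie", "Alan Turing"}

    def pre_annotate(text, gazetteer):
        """Return (start, end, surface) spans for every gazetteer name found in the text."""
        spans = []
        for name in gazetteer:
            for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
                spans.append((match.start(), match.end(), name))
        return sorted(spans)

    print(pre_annotate("Marie Curie never met Alan Turing.", known_persons))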

Then, different algorithms can be evaluated in order to select the one with the best accuracy. In the example below, a model based on FLAIR was selected.

Then, an annotation pipeline is built that combines the FLAIR model with a Knowledge Base annotator (Wikidata in our example below) to automatically detect, in an incoming text, the known (red) as well as the unknown (yellow) person names.
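As an illustration of how such a combination could look in code, here is a minimal sketch using the open-source flair library; the pretrained "ner" model and the one-entry known_persons set merely stand in for the trained FLAIR model and the Wikidata Knowledge Base annotator of the pipeline described above (the span API follows recent flair releases):

    from flair.data import Sentence
    from flair.models import SequenceTagger

    tagger = SequenceTagger.load("ner")      # stand-in for the trained FLAIR model
    known_persons = {"Angela Merkel"}        # stand-in for the Wikidata-backed KB annotator

    def annotate_persons(text):
        """Detect person names and flag each one as 'known' (in the KB) or 'unknown'."""
        sentence = Sentence(text)
        tagger.predict(sentence)
        results = []
        for span in sentence.get_spans("ner"):
            if span.get_label("ner").value == "PER":
                status = "known" if span.text in known_persons else "unknown"
                results.append((span.text, status))
        return results

    print(annotate_persons("Angela Merkel spoke with Jane Doe."))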

The extracted unknown person names (in yellow) are natural candidates for enriching the Business Vocabulary and consequently contribute to the strategic holy grail of the Enterprise knowledge graph.

Example #2: Cell lines extraction

In life sciences, cell lines are populations of cells with known, constant properties that are of key importance in many experiments. Cell lines and their respective properties are listed in catalogs, and many scientific publications describe in detail the cell lines that were used in order to allow for verification and reproducibility. Here is a randomly selected sample from a scientific paper:

Cytotoxicity of recombinant Cec-B (rCec-B) was reported on normal human lung cell line (WI-38), and hepatocellular carcinoma cell line (HepG2).

Cell lines are sold and bought as products, and sources like the Cellosaurus list almost 100,000 human cell lines and many additional thousands from typical lab species like mice and rats. Yet, even with such a broad resource at hand, the task of identifying all mentioned cell lines in a body of documents amounts to more than a mere table lookup, since new entities may have been created since the last update, or variants of known entities may have been used in publications.

How do I proceed to automatically find new cell lines?

Using the entity extraction based on Wikidata (which includes the Cellosaurus thesaurus…), a dedicated model for cell line recognition can be set up quickly: first, a training document set is labelled with all known concepts from the Cellosaurus, including their respective links to the Wikidata vocabulary where such links exist.
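To make this labelling step concrete, here is a minimal sketch of how a Cellosaurus-derived gazetteer with optional Wikidata links could be applied to the sample sentence quoted above; the accession numbers and QIDs are deliberately left as placeholders rather than real identifiers:

    # Assumption: a Cellosaurus-derived gazetteer mapping each surface form to its
    # accession number and, where one exists, its Wikidata QID (identifiers are placeholders).
    cell_line_gazetteer = {
        "HepG2": {"accession": "CVCL_xxxx", "wikidata": "Qxxxxxxx"},
        "WI-38": {"accession": "CVCL_yyyy", "wikidata": None},  # no Wikidata mapping assumed
    }

    def label_document(text, gazetteer):
        """Attach gazetteer metadata to every cell-line mention found in the text."""
        annotations = []
        for surface, ids in gazetteer.items():
            start = text.find(surface)
            while start != -1:
                annotations.append({"start": start, "end": start + len(surface),
                                    "surface": surface, **ids})
                start = text.find(surface, start + 1)
        return annotations

    sample = ("Cytotoxicity of recombinant Cec-B (rCec-B) was reported on normal human "
              "lung cell line (WI-38), and hepatocellular carcinoma cell line (HepG2).")
    print(label_document(sample, cell_line_gazetteer))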

A business application can now extract cell lines from raw document content with a high degree of accuracy. Those cell lines that correspond to an entry in Wikidata will be accompanied by a link to the respective background information. Moreover, the hits are disambiguated and scored: in case the name of a cell line also carries another meaning, the system returns the match as a cell line only if the context supports that reading (thanks to the context provided by the corresponding Wikipedia pages).
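As a deliberately simplified illustration of context-based filtering (not the actual scoring used by the suite, which relies on the richer context of Wikidata/Wikipedia entries rather than a fixed cue list), the sketch below accepts a candidate mention only if cell-culture cue terms appear around it:

    CELL_LINE_CUES = ("cell line", "culture", "passage", "cytotoxicity", "carcinoma")

    def context_score(text, start, end, cues=CELL_LINE_CUES, window=60):
        """Score a candidate mention by the share of cue terms found in a window around it."""
        context = text[max(0, start - window):end + window].lower()
        return sum(1 for cue in cues if cue in context) / len(cues)

    def accept_as_cell_line(text, start, end, threshold=0.2):
        """Keep the match only if the surrounding context supports the cell-line reading."""
        return context_score(text, start, end) >= threshold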

Entity recognition is so much more than simply matching a string of text! Yet, within the Kairntech suite, not a single line of code had to be written in order to achieve this.

An annotated dataset with highlighted cell lines is thus created instantly, a perfect starting point for further training with deep learning models.

The Kairntech suite provides access to a range of state-of-the-art machine learning approaches, all available via point-and-click options; in the example below, a model was trained using the FLAIR deep learning library. This example makes use of transformer-based embeddings and leverages the results of pre-training on large volumes of PubMed data (“BioBERT”). Although highly sophisticated, this is merely an option available when defining the training run. A tiny experiment with only 428 entities from 157 documents already yields a model with an F-score of 92.2%.
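For reference, a comparable training run can be sketched with the open-source flair library; the snippet below assumes a corpus in CoNLL-style column files under data/ and uses the publicly available BioBERT checkpoint on Hugging Face (dmis-lab/biobert-base-cased-v1.1). Hyperparameters are illustrative only, and the Kairntech suite performs the equivalent configuration through its UI:

    from flair.datasets import ColumnCorpus
    from flair.embeddings import TransformerWordEmbeddings
    from flair.models import SequenceTagger
    from flair.trainers import ModelTrainer

    # Assumption: train/dev/test files in two-column CoNLL format (token, NER tag) under data/
    corpus = ColumnCorpus("data/", {0: "text", 1: "ner"})
    label_dictionary = corpus.make_label_dictionary(label_type="ner")

    # Transformer-based embeddings pre-trained on PubMed data (BioBERT)
    embeddings = TransformerWordEmbeddings("dmis-lab/biobert-base-cased-v1.1")

    tagger = SequenceTagger(hidden_size=256,
                            embeddings=embeddings,
                            tag_dictionary=label_dictionary,
                            tag_type="ner")

    trainer = ModelTrainer(tagger, corpus)
    trainer.train("models/cell-lines", max_epochs=10)   # illustrative hyperparameters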

Finally, an annotation pipeline is built (sketched after this list) that makes it possible to:

  • Run the model as well as the Cellosaurus-based annotator,
  • Check and consolidate the different outputs (remove duplicates, select the longest match…),
  • Link each entity to the Business Vocabulary (the thesaurus) when no mapping is available.
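Here is a minimal sketch of the consolidation and linking logic, assuming both annotators return spans as (start, end, surface, source) tuples; it illustrates the general idea rather than the pipeline's actual implementation:

    def consolidate(model_spans, kb_spans):
        """Merge model and gazetteer spans: drop exact duplicates and, when spans
        overlap, keep the longest one (spans are (start, end, surface, source) tuples)."""
        candidates = sorted(set(model_spans) | set(kb_spans),
                            key=lambda s: (s[0], -(s[1] - s[0])))
        merged = []
        for span in candidates:
            if merged and span[0] < merged[-1][1]:   # overlaps the previously kept span
                continue
            merged.append(span)
        return merged

    def link_to_vocabulary(spans, business_vocabulary):
        """Mark each span as 'known' if its surface form is already in the Business
        Vocabulary, otherwise flag it as a candidate for curation."""
        return [span + (("known",) if span[2] in business_vocabulary else ("candidate",))
                for span in spans]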

Thanks to this annotation pipeline, cell lines are automatically flagged as known (green) or unknown (orange), as can be seen in the document below:

Conclusion

Entity types such as “person names” or “cell lines” are just two among the countless types that the Kairntech off-the-shelf entity annotation engine recognizes.

The combination of a business vocabulary and learning models can be used to efficiently extract new entities from documents, update or enrich the existing business vocabulary, and thereby clearly contribute to the strategically important enterprise knowledge graph.

Naturally, the validation of new entity candidates and their insertion into the business vocabulary is an important step ("Data Curation") and can be more or less complex depending on the type of entities being managed.