Machine Learning approaches in NLP have been shown to solve a wide range of tasks after being trained from scratch on an appropriate training corpus. While this is impressive, it often does not match what real-world scenarios demand. Relevant prior knowledge frequently exists already: in information extraction, for instance, in the form of vocabularies that list many instances of the required entity types. Being able to use such resources easily can jump-start and speed up your Machine Learning project.

Example: detecting football club entities

Let’s imagine we want to annotate a corpus of sports documents with the names of the football clubs mentioned in them. Inspecting the corpus, we see that the documents cover events from different countries and different leagues. Yet we currently only have a list of the German Bundesliga clubs. Let’s see how to quickly proceed with just what we have.

Getting started…

Let’s create a lexicon in the project and import a vocabulary in one of several supported formats (txt, csv, skos, obo, …)…

then search and check the imported terms…

turn the vocabulary into a recognizer…

and finally apply it to the documents in the corpus.

Within moments, an annotated training corpus exists that at least knows about German Bundesliga clubs. Clubs from the German 2nd Bundesliga, the English Premier League or any other football league are unknown at this stage and won’t be annotated yet.
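
To make the "vocabulary to recognizer" step concrete, here is a minimal sketch in plain Python. It is not the tool's actual implementation: the function names `build_recognizer` and `annotate`, the label `FootballClub` and the three-club vocabulary excerpt are all assumptions made for illustration.

```python
import re

def build_recognizer(terms):
    """Compile a vocabulary into one regex; longer terms come first so they win over shorter ones."""
    ordered = sorted(set(terms), key=len, reverse=True)
    pattern = "|".join(re.escape(t) for t in ordered)
    # Lookarounds make sure we only match whole words, not substrings of longer words.
    return re.compile(r"(?<!\w)(?:" + pattern + r")(?!\w)")

def annotate(recognizer, text, label="FootballClub"):
    """Return (start, end, label, surface form) for every vocabulary hit in the text."""
    return [(m.start(), m.end(), label, m.group()) for m in recognizer.finditer(text)]

# Hypothetical excerpt of an imported vocabulary (e.g. one term per line in a txt file).
bundesliga = ["FC Bayern München", "Borussia Dortmund", "Werder Bremen"]
recognizer = build_recognizer(bundesliga)

doc = "Borussia Dortmund beat Werder Bremen, while Arsenal drew at home."
annotations = annotate(recognizer, doc)
for start, end, label, surface in annotations:
    print(start, end, label, surface)
```

Both Bundesliga clubs are found, while "Arsenal" stays unannotated because it is not in the vocabulary, which is exactly the gap described above.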

Ok, but just importing a list of strings and then finding these in text documents is not really rocket science.

Leveraging “In & Out Dataset” documents

But there is more, much more: we need to distinguish between corpus documents within the “dataset” (those that do have annotations) and documents outside.

This comes in handy here: in our scenario, where we only know about Bundesliga clubs at the moment, any clubs from other leagues or countries that are not yet labelled would count as “negative examples” for the learning process and hence confuse the algorithm.

But since training is by default only performed on the “in dataset” documents, this risk is much smaller here. Note that there will still be numerous “false negatives” wherever a document contains not only an (annotated) Bundesliga club but also a (so far missed) club from elsewhere.

So, after the first automatic annotation, a certain amount of manual curation of the “in dataset” documents is recommended. Then we can launch a training run (learning a new model) on the now curated documents.
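
The in/out split can be pictured with a small sketch. The document structure (a dict with `text` and `annotations` keys) and the helper `split_dataset` are hypothetical and only illustrate the idea that documents without any annotation are kept out of training.

```python
def split_dataset(corpus):
    """Separate annotated documents ("in dataset") from unannotated ones ("out of dataset")."""
    in_dataset = [d for d in corpus if d["annotations"]]
    out_dataset = [d for d in corpus if not d["annotations"]]
    return in_dataset, out_dataset

corpus = [
    {"text": "Werder Bremen won again.", "annotations": [(0, 13, "FootballClub")]},
    # Club not in the vocabulary: no annotation, so the document stays out of the dataset.
    {"text": "Arsenal drew at home.", "annotations": []},
    # In the dataset, but "Arsenal" is a false negative until someone curates it.
    {"text": "Werder Bremen lost to Arsenal.", "annotations": [(0, 13, "FootballClub")]},
]

train_docs, held_out = split_dataset(corpus)
print(len(train_docs), "in dataset,", len(held_out), "outside")
```

The third document shows why curation still matters: it is used for training although it contains a club that has not been labelled yet.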

Leveraging semantic similarity and word embeddings

In our scenario we can observe that the contexts in which a football club is mentioned are often very similar and that this similarity does not depend on the club being from our original Bundesliga list. In phrases like “supporters of X were devastated when their team was defeated”, or “In the semi finals however, X was able to beat Y 3:2”, or “The result left X needing six points from their remaining five games to claim a first title since 2010”, X most of the time refers to a club, no matter which league it plays in.

Therefore, a model trained on the part of the corpus that mentions Bundesliga clubs will to some extent carry its findings over to other relevant entities. Neural approaches excel at this task thanks to their capacity to retain long-distance information as well as vague semantic similarities captured in appropriate embeddings. A positive example mentioning “X’s new midfielder” will also influence cases where “X’s new striker” is mentioned, since “midfielder” and “striker” bear a certain semantic similarity that is encoded in their respective word embedding vectors.
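
As a rough illustration of what "semantic similarity encoded in embeddings" means, the sketch below computes cosine similarity between word vectors. The 4-dimensional vectors are made up for this example; real embeddings have hundreds of dimensions and are learned from large text collections.

```python
import math

def cosine(u, v):
    """Cosine similarity: close to 1.0 for vectors pointing the same way, close to 0.0 for unrelated ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Made-up toy vectors, purely for illustration (not taken from any trained model).
embeddings = {
    "midfielder": [0.9, 0.8, 0.1, 0.0],
    "striker":    [0.8, 0.9, 0.2, 0.0],
    "invoice":    [0.0, 0.1, 0.9, 0.8],
}

sim_positions = cosine(embeddings["midfielder"], embeddings["striker"])
sim_unrelated = cosine(embeddings["midfielder"], embeddings["invoice"])
print(f"midfielder vs striker: {sim_positions:.2f}")
print(f"midfielder vs invoice: {sim_unrelated:.2f}")
```

With vectors like these, the two player positions end up far more similar to each other than to an unrelated word, which is what lets a model treat “X’s new striker” much like “X’s new midfielder”.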

Overall Methodology: import – apply – curate – train – repeat

Bootstrapping a model for a new entity type (like football clubs) therefore amounts to executing the following steps:

  1. Import a vocabulary that at least partially represents your required entity type
  2. Annotate documents in your corpus with it 
  3. Inspect results (adding missed entities, removing false positives)
  4. Train a neural model on the corpus and apply it on the corpus again 
  5. Repeat steps 3 and 4
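
The steps above can be sketched as a small loop. This is a toy simulation under strong assumptions, not the actual workflow of any tool: the "model" is a trivial stand-in that only remembers which word follows an annotated club (a real setup would train a neural model), club names are single tokens, and the human reviewer is simulated by a hypothetical `gold` set.

```python
import re

TOKEN = re.compile(r"\w+")

def apply_vocabulary(docs, vocabulary):
    """Steps 1-2: annotate every token that is a known club name."""
    return [{tok for tok in TOKEN.findall(d) if tok in vocabulary} for d in docs]

def curate(annotations, gold):
    """Step 3: simulated human review; keep only what the reviewer confirms."""
    return [a & gold for a in annotations]

def train(docs, annotations):
    """Step 4a: toy stand-in for a neural model; remembers the word after each annotated club."""
    contexts = set()
    for doc, clubs in zip(docs, annotations):
        tokens = TOKEN.findall(doc)
        for tok, nxt in zip(tokens, tokens[1:]):
            if tok in clubs:
                contexts.add(nxt)
    return contexts

def predict(docs, contexts):
    """Step 4b: propose capitalised tokens that occur in a learned context."""
    proposals = []
    for doc in docs:
        tokens = TOKEN.findall(doc)
        proposals.append({tok for tok, nxt in zip(tokens, tokens[1:])
                          if tok[0].isupper() and nxt in contexts})
    return proposals

docs = [
    "Dortmund beat Bremen on Saturday",
    "Arsenal beat Chelsea on Sunday",
    "Yesterday beat expectations on paper",
]
vocabulary = {"Dortmund", "Bremen"}                   # partial vocabulary
gold = {"Dortmund", "Bremen", "Arsenal", "Chelsea"}   # hypothetical reviewer knowledge

annotations = apply_vocabulary(docs, vocabulary)
for _ in range(2):                                    # step 5: repeat steps 3 and 4
    annotations = curate(annotations, gold)
    model = train(docs, annotations)
    predicted = predict(docs, model)
    annotations = [a | p for a, p in zip(annotations, predicted)]
annotations = curate(annotations, gold)               # final review pass
print(annotations)
```

Even this crude stand-in finds "Arsenal" and "Chelsea" from context alone, and it also proposes "Yesterday", which the curation step removes again. That false positive is the reason step 3 is part of every iteration.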

Result: A model with great performance

In the above-mentioned case, importing the short list of initial football club names and then iterating through the steps above a few times resulted in a training corpus with some 1300 examples of football clubs in the documents and a corresponding neural model with 94% accuracy after roughly one hour of work.

Machine Learning approach: a good fit

Note that although the scenario described here is intentionally simple, it already contains a few characteristics that entity recognition attempts need to address in order to deliver appropriate quality:

  • The contexts in which football clubs are mentioned in text are already diverse enough to make them hard to capture with hard-coded, manually designed rules. Even sports commentators can be surprisingly creative. Using quantitative, neural approaches instead of painstakingly crafted, brittle manual rules accounts for this observation.
  • A strict string matching approach will quickly miss many relevant occurrences. Whereas the full name of a club may be “FC Bayern München”, phrases like “Bayern was lucky last weekend” or “…hadn’t scored a single goal against München in the last 18 months …” also need to be taken into account.
  • On the other hand, some disambiguation capability is required to distinguish cases such as “the new Newcastle goalie” from “who was born in Newcastle in 1991”, where in the second case the city, not the club, is mentioned.
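
The last two points can be reproduced with a few lines of plain string matching. The sketch below is illustrative only: the helper `find_mentions`, the two-entry vocabulary and the hand-written alias list are assumptions, and the example sentences are the ones quoted above.

```python
import re

def find_mentions(text, names):
    """Return the vocabulary entries that occur in the text as whole words."""
    return [n for n in names
            if re.search(r"(?<!\w)" + re.escape(n) + r"(?!\w)", text)]

# Hypothetical vocabulary with full club names only.
full_names = ["FC Bayern München", "Newcastle United"]
# Hypothetical short forms added by hand to improve recall.
aliases = full_names + ["Bayern", "München", "Newcastle"]

sentences = [
    "Bayern was lucky last weekend",
    "the new Newcastle goalie",
    "who was born in Newcastle in 1991",
]

strict_hits = [find_mentions(s, full_names) for s in sentences]
alias_hits = [find_mentions(s, aliases) for s in sentences]
print(strict_hits)
print(alias_hits)
```

Strict matching on full names finds nothing in any of the three sentences. Adding aliases recovers the first two mentions, but it also tags the city in the third sentence, since string matching cannot look at context. Closing that gap is the job of the trained model.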


While we have talked about the toy example of extracting football clubs above, it is clear that the approach discussed addresses a wide range of real-world challenges in content analysis efforts: often a partial vocabulary exists that contains some but not all of the required entities (or their variations), or a reasonably complete vocabulary exists but needs to be updated from time to time as new entities become relevant.