Document categorization and Entity Extraction: These NLP veterans are as relevant as ever

October 17, 2022

Reading time: 4 min

Written by

stefangeissler

The Importance of Traditional NLP Use Cases: Document categorization and entity extraction

NLP is booming. The pace of progress is impressive, and the field and what it has to offer are recognized throughout the industry. There has been significant progress on issues that were notoriously difficult to tackle a few years ago: question answering, natural language understanding and text generation are examples where state-of-the-art results today are on a totally different level than they were only a few years ago.

With all this excitement, one can easily overlook that some long-standing NLP use cases are as relevant as ever and that progress in these areas has continued to increase quality and usability. Document categorization and entity extraction are two such examples.

While Kairntech constantly broadens its reach to new NLP use cases, it is worth noting that these more traditional ones are also well represented and will continue to be core functionalities.

Document categorization – a solid NLP workhorse

Document categorization is the process of assigning one or several category labels to a document. Automatic approaches have a long tradition in the field, and relevant real-world use cases are abundant: Is this document relevant or not according to a certain criterion? Is this document an invoice, an offer or documentation? What are the categories that best describe this news item? Is this article of interest to that user, given their previous interests?

Sorting and organising text content: NLP can save a lot of work here. Photo: Roman Deckert on Wikimedia Commons (CC Attribution-Share Alike 4.0 International)

Document categorization is a well-studied NLP subfield that has been used in production settings for many years. Recent advances have continued to raise the bar in terms of quality but, above all, have made document categorization much easier for the user to adapt, tune, evaluate and deploy. At Kairntech, users can either import an already categorized collection of documents and start training a model right away, or they can create their own corpus. The workload for this latter step is reduced as much as possible by the adoption of Active Learning, which significantly reduces the number of training examples that a user needs to provide. Document categorization was among the first NLP approaches that the Kairntech platform offered and has been in production use at Boehringer-Ingelheim for more than three years. A quick introduction to the Kairntech flavor of document categorization can be found here.
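To give an idea of why Active Learning can reduce the labeling workload, here is a minimal sketch in Python. It is not the Kairntech implementation: the tiny Naive Bayes classifier, the least-confidence selection strategy and all function names are assumptions chosen for illustration. The idea is that the model asks the user to label the document it is least sure about, rather than a random one.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Deliberately simple: lowercase and split on whitespace.
    return text.lower().split()


def train_naive_bayes(labeled):
    """Train a multinomial Naive Bayes model from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in labeled:
        label_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return word_counts, label_counts, vocab


def predict_proba(model, text):
    """Return a dict mapping each label to its estimated probability."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    log_scores = {}
    for label in label_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_docs)
        for tok in tokenize(text):
            if tok in vocab:  # words never seen in training are ignored
                # Add-one smoothing avoids zero probabilities.
                score += math.log(
                    (word_counts[label][tok] + 1) / (total_words + len(vocab))
                )
        log_scores[label] = score
    # Normalize in a numerically stable way.
    max_score = max(log_scores.values())
    exp = {label: math.exp(s - max_score) for label, s in log_scores.items()}
    norm = sum(exp.values())
    return {label: v / norm for label, v in exp.items()}


def most_uncertain(model, unlabeled):
    """Least-confidence sampling: pick the text whose best label is weakest."""
    return min(unlabeled, key=lambda t: max(predict_proba(model, t).values()))


# Hypothetical toy data: two labeled documents, three unlabeled ones.
labeled = [
    ("invoice total amount due payment", "invoice"),
    ("offer discount price proposal", "offer"),
]
unlabeled = [
    "payment amount due invoice",
    "price proposal and amount due",
    "discount offer proposal",
]

model = train_naive_bayes(labeled)
print(most_uncertain(model, unlabeled))
```

With this toy data, the second unlabeled document mixes vocabulary from both categories, so it is the one the sketch selects for manual labeling. In a real loop the user's answer would be added to the labeled set and the model retrained before the next query.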

Entity Extraction – who, where, when and what

Which people, places and companies are mentioned in a press article? Which products, organisms or chemical substances in a scientific publication? Entity extraction addresses these questions and, just like document categorization, it has been an area of intense research for decades. As with document categorization, it has seen impressive progress in recent years, as approaches such as deep learning and word embeddings have contributed to higher and higher state-of-the-art results. The Kairntech platform offers a ready-to-use implementation of this essential use case: users can again either import annotated training data or create it manually, and then build a machine learning model from a range of available options.
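As an illustration of what annotated training data for entity extraction can look like, the following Python sketch converts character-offset annotations into token-level BIO tags (B = beginning of an entity, I = inside, O = outside), a common input format for sequence labeling models. The function name, the span format and the example sentence are assumptions made for this sketch and do not describe the Kairntech data format.

```python
def spans_to_bio(text, spans):
    """Convert character-offset annotations to token-level BIO tags.

    spans: list of (start, end, label) tuples, with end exclusive.
    Tokens are whitespace-separated. A token is tagged only if it lies
    fully inside a span, so punctuation glued to a word (e.g. "London.")
    would need a real tokenizer to be handled correctly.
    """
    # Locate every whitespace-separated token together with its offsets.
    tokens = []
    pos = 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        tokens.append((word, start, end))
        pos = end

    tagged = []
    for word, start, end in tokens:
        tag = "O"
        for span_start, span_end, label in spans:
            if start >= span_start and end <= span_end:
                # The first token of a span gets B-, later ones get I-.
                prefix = "B-" if start == span_start else "I-"
                tag = prefix + label
                break
        tagged.append((word, tag))
    return tagged


# Hypothetical example sentence with two annotated entities.
text = "Ada Lovelace worked in London"
spans = [(0, 12, "PERSON"), (23, 29, "LOCATION")]

for token, tag in spans_to_bio(text, spans):
    print(token, tag)
```

Here "Ada" is tagged B-PERSON, "Lovelace" I-PERSON and "London" B-LOCATION, while the remaining tokens are tagged O.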

Finding entities in text: a key NLP task. Photo: Niabot on Wikimedia Commons (CC Attribution-Share Alike 3.0 Unported)

An introductory overview of entity extraction in Kairntech can be found here and here.

Kairntech: a focus on real-world, production-ready NLP use cases.

As in-person conferences pick up again after two years of corona-induced interruptions, we have enjoyed attending meetings on NLP in industry contexts once more. What is noteworthy is that, besides inspiring presentations on experiments with new use cases made possible by recent advances in NLP, there remains a solid proportion of talks that emphasize the importance of use cases such as document categorization and entity extraction: talks on search results that need to be filtered down to only those items that are relevant for a given question, on documents that benefit from being enriched by identifying the key entities in them and linking those entities to background information, and on many other business-critical processes.

Kairntech, while constantly adding new NLP use cases to its stack, remains committed to serving these solid, production-ready use cases, which address a broad range of real-world scenarios.