Introduction
Natural language processing (NLP) techniques have made tremendous progress over the last few years, thanks notably to neural networks, which have surpassed traditional rule-based approaches. Examples include machine translation (DeepL, Google Translate…) and speech-to-text (Siri…).
Subcategories of NLP such as information extraction, text classification, automatic summarization and question answering are also highly impacted.
For information extraction, this means it is now possible to capture, with high accuracy, the value contained in a piece of text: a paragraph in a contract, a sentiment in a feedback survey or key information from a medical report.
Although this is true from a technical perspective, actually creating value from unstructured text data takes three distinct steps, which are explored further in this article.
Design augmented applications
Augmented applications are business applications powered by artificial intelligence. As an end user you won’t necessarily notice it, but there is a lot of AI behind suggesting new Facebook contacts or simply typing a Google search.
Defining an NLP application or use case is not straightforward, because it is not always clear what the technology can achieve.
A sensible first approach is to identify strategic, often overlapping objectives, which could include:
- Increase operational information processing efficiency
- Better satisfy customers or partners
- Gain competitive advantage by gathering crucial information faster
- Use information intelligence as a strategic advantage, typically by building your own enterprise knowledge graph
Then one should define the use case concisely. This really is the key challenge: technology should only be the enabler for solving a specific business issue. Based on discussions with many stakeholders, here is a list of the most frequent NLP use cases, which are most of the time used in combination (hence the need for data pipelines):
- Automate data processes, when you already have all the applications and tools in place.
- Create categories from or within documents (e.g. detect the topic of an incoming support email, or split a customer verbatim into segments corresponding to the topics addressed).
- Detect entities within a text element: a location, a person, a product, a value, a particular paragraph in a contract or a sentiment (a minimal sketch of entity detection and question answering follows this list). Such values can also be derived from the context (e.g. lawyer fees may be derived by detecting an amount linked to a particular law article).
- Create context for a detected text element or category, for instance by linking the corresponding law article and relevant jurisdiction cases automatically.
- Enrich data by finding relations between text elements and external knowledge graphs. This can work both ways: as an information provider I may want to enrich Wikidata with short biographies of people in my database that are not yet present in Wikidata, or, more importantly, the other way around: add missing people to my own database.
- Detect missing information in documents, for example to detect missing clauses when analyzing employee contracts in a due diligence process.
- Get notified when relevant new information becomes available in public, constantly updated sources, for instance by detecting new and unknown entities in research papers.
- Obtain answers to questions that tend to come in two flavors:
  - Highly complex questions, answered with purpose-made queries against knowledge graphs.
  - More basic questions, answered directly by deep learning models trained on Q&A datasets such as SQuAD, without needing any database.
- Establish relations between detected entities and information contained in knowledge graphs. This is often referred to as the holy grail of big data, because these relations tend to be hidden.
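To make the entity detection and question-answering use cases above more concrete, here is a minimal sketch using the open-source Hugging Face transformers library with its default pretrained pipelines; the example sentence is invented, and the choice of models and the accuracy targets would of course depend on the actual use case.

```python
# Minimal sketch: named entity recognition and extractive question answering
# with off-the-shelf pretrained models (Hugging Face transformers).
from transformers import pipeline

text = ("The service agreement between Acme GmbH and Globex was signed "
        "in Berlin on 12 March 2021 for a total fee of 50,000 euros.")

# Detect entities; which types are found (organizations, locations, ...) depends on the model used.
ner = pipeline("ner", aggregation_strategy="simple")
for entity in ner(text):
    print(entity["entity_group"], "->", entity["word"])

# Answer a basic question directly from the text, SQuAD-style, without any database.
qa = pipeline("question-answering")
answer = qa(question="Where was the agreement signed?", context=text)
print(answer["answer"], round(answer["score"], 3))
```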
After selecting the use case(s), the final exercise consists of attaching a value to the obtained information. This can be many millions of euros for discovering a unique start-up or a technology early, a 30% cost reduction for a yearly financial audit, or an annual fee for an application that analyses 50,000 customer verbatims per year for a consumer brand. It is also important to know the frequency of usage: many use cases are niche-based and only used by a few people.
Build data and annotation pipelines
Behind the front end of NLP-enhanced applications, pipelines of data processes extract and transform data using various tools and algorithms. These pipelines are built by a multidisciplinary team of data scientists, business analysts and IT experts, and involve a number of steps:
- The creation of a reliable training dataset from a sample text corpus. Generally speaking, obtaining value from text data is very time-consuming, since text elements need to be annotated by qualified human beings, typically business analysts who are often in short supply. It is not uncommon that 70% of an NLP project’s time is spent on annotation, hence the challenge to automate and speed up annotation. However, one should also bear in mind that there is always a point where investing extra effort in additional annotations becomes unprofitable.
- The creation of the NLP model itself: the more false positives and false negatives are addressed, the more reliable the corresponding model becomes. Different machine learning and deep learning frameworks exist to create such models. The good news is that the large majority of these frameworks are open source; the less good news is that they change and improve all the time.
- The extracted information may then be combined with other data sources such as enterprise or public knowledge graphs and internal or public databases. Each alignment requires a data process of its own; some are basic queries, others are much more complex. Knowledge graphs (a well-known example being Wikidata, the knowledge graph that underpins Wikipedia) are a case in point: linking information extracted from a text corpus with a knowledge graph, in which relations between entities prevail, is a particularly powerful potential source of value (see the sketch after this list).
- The different models and tasks, including pre-processing, classification, entity recognition, relation detection, reconciliation and post-processing, then need to be combined into a data pipeline that will power the augmented application.
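As an illustration of the reconciliation step mentioned above, here is a minimal sketch of how an extracted person name could be looked up in Wikidata through its public SPARQL endpoint. The query and the hard-coded name are purely illustrative; a real pipeline would handle ambiguity, batching, error handling and rate limits.

```python
# Minimal sketch: reconcile an extracted person name against Wikidata
# via the public SPARQL endpoint (illustrative query, no error handling).
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

def lookup_person(name: str):
    query = f"""
    SELECT ?person ?personLabel WHERE {{
      ?person wdt:P31 wd:Q5 ;            # instance of: human
              rdfs:label "{name}"@en .
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }}
    LIMIT 5
    """
    response = requests.get(
        SPARQL_ENDPOINT,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "nlp-pipeline-sketch/0.1"},
    )
    return response.json()["results"]["bindings"]

for match in lookup_person("Marie Curie"):
    print(match["person"]["value"], "-", match["personLabel"]["value"])
```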
The main challenges to address are reducing the setup time and cost of annotation pipelines, providing accurate results through state-of-the-art algorithms, and automating processes while at the same time allowing business domain experts to be fully autonomous with a fun-to-use and easy-to-manipulate user interface.
In short, the main expected features would be the following:
- Upload a so-called “corpus” of text documents in which you can easily search and navigate with different levels of granularity (a paragraph in a contract for instance) and different filter options.
- Create labels manually.
- Use labels to create annotations, which is as easy as highlighting text with a Stabilo highlighter. This can be done manually of course, but thanks to business lexicons and knowledge graphs annotations can also be created automatically (a minimal sketch of lexicon-based pre-annotation follows this list).
- Validating or rejecting annotations is the next crucial step to improve dataset coverage and quality. A suggestion engine based on active learning saves a tremendous amount of annotation time by suggesting pre-labelled text elements; feedback from the field shows throughput rising to between 200 and 300 annotations per hour. By automating tedious tasks, overall quality also tends to improve, because effort is concentrated on the most important segments to annotate, which implies presenting annotations smartly rather than simply in order of appearance.
- Experiment with models using different state-of-the-art NLP algorithms. Compare these models and keep improving a particular label with lower performance until you reach the best possible overall performance. Such an iterative process is very common in NLP projects.
- Test models on text that has not yet been annotated to check their performance. A model can also detect relations against a knowledge graph, linking a person’s name with the corresponding Wikipedia page for instance.
- Build pipelines chaining the different processes in order to create an automated, end-to-end extraction workflow.
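As a minimal sketch of the lexicon-based pre-annotation mentioned in the list above, the open-source spaCy library can turn a simple business lexicon into automatic annotation suggestions. The lexicon entries and the "CONTRACT_CLAUSE" label below are purely illustrative.

```python
# Minimal sketch: pre-annotate text from a business lexicon with spaCy's PhraseMatcher.
# The lexicon entries and the "CONTRACT_CLAUSE" label are illustrative examples.
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # tokenizer only, no pretrained model needed
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

lexicon = ["non-compete clause", "termination notice", "force majeure"]
matcher.add("CONTRACT_CLAUSE", [nlp.make_doc(term) for term in lexicon])

doc = nlp("The contract includes a non-compete clause and a 30-day termination notice.")
for match_id, start, end in matcher(doc):
    span = doc[start:end]
    print(f"suggested annotation: '{span.text}' -> {nlp.vocab.strings[match_id]}")
```

Suggestions produced this way can then be validated or rejected by the business analyst, which is much faster than annotating from scratch.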
Run applications and data pipelines
Once the application and the pipelines are built, they are exported as Docker files to an IT production environment. Regular maintenance of model quality is also required.
Here we expect the following features:
- Export the model associated with the training dataset
- Expose APIs for smooth integration with IT tools (a minimal sketch follows this list)
- Maintain the model quality over time or improve the performance of the models by enabling learning on larger datasets.
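As a minimal sketch of such an API, an exported model could be wrapped in a small web service, here using FastAPI. The endpoint name, request schema and the DummyPipeline class are assumptions standing in for a real exported pipeline, not a specific product's API.

```python
# Minimal sketch: expose an NLP extraction pipeline behind a REST API with FastAPI.
# DummyPipeline stands in for whatever exported model the pipeline produces.
from fastapi import FastAPI
from pydantic import BaseModel


class DummyPipeline:
    """Placeholder for an exported NLP pipeline (classification, NER, ...)."""

    def predict(self, text: str) -> list[dict]:
        # A real pipeline would return detected entities, categories or answers.
        return [{"label": "EXAMPLE", "text": text[:20]}]


app = FastAPI(title="NLP extraction service")
model = DummyPipeline()  # replace with the real exported pipeline


class ExtractionRequest(BaseModel):
    text: str


@app.post("/extract")
def extract(request: ExtractionRequest) -> dict:
    return {"predictions": model.predict(request.text)}
```

Run for instance with `uvicorn app:app`; such a service can then be shipped in the Docker export mentioned above and called from existing IT tools.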
Conclusion
The market for NLP is expected to grow by 32% year-on-year to reach 260 billion in 2026. Easy-to-use, integrated solutions that are accessible to non-technical users are key to achieving widespread adoption of NLP technologies. In the end, it comes down to the capacity to create augmented applications quickly and cost-effectively, to address the many valuable, often niche applications that are waiting out there to hit the market.