Why natural language processing (NLP) platforms save over 80% of your time
The impressive recent progress in Natural Language Processing (NLP) models, most of it coming from the open-source community, does not necessarily go hand in hand with delivering business value for domain experts. The ROI of an NLP platform is judged by its ability to automatically extract actionable information from documents.
This is the gap that so-called low- or no-code NLP AI platforms set out to fill between models and industrialized smart applications. The process can be summarized in five phases:
- Data preparation and import;
- Creation of a training dataset;
- Model creation, testing and comparison;
- Assembly of NLP pipelines;
- Industrialization.
This article estimates, for each of these five phases, the Return on Investment (ROI) of platforms such as the one proposed by Kairntech.
Data preparation and import, a prerequisite for NLP AI projects
Manually preparing a set of documents is often fairly straightforward.
Automating the ingestion of documents, however, requires out-of-the-box converters (PDF text, PDF image, audio, scientific articles, video-to-text such as DeepTranscript) as well as the ability to easily develop custom XML converters.
In addition, NLP models often operate at the segment level, in particular with long documents such as contracts. Getting the segments right has a big impact on overall quality. Configuring document segmentation therefore requires out-of-the-box segmenters as well as the ability to customize them.
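To make the segmentation point concrete, here is a minimal sketch of a rule-based clause segmenter for contracts. It assumes that clauses start with a numbered heading, which will not hold for every corpus; a platform would typically ship such segmenters out of the box and let you adapt them to your documents.

```python
import re

def segment_contract(text: str) -> list[str]:
    """Split a contract into clause-level segments.

    Naive rule-based sketch: it assumes clauses start with a numbered
    heading such as "1.", "2.1" or "Article 3". A production segmenter
    would be configurable per corpus.
    """
    pattern = re.compile(r"(?m)^(?=(?:\d+(?:\.\d+)*\.?|Article\s+\d+)\s)")
    return [s.strip() for s in pattern.split(text) if s.strip()]

sample = """1. Definitions
"Supplier" means the party providing the services.
2. Term
This agreement starts on the effective date.
2.1 Renewal
It renews automatically for one-year periods."""

for i, segment in enumerate(segment_contract(sample), 1):
    print(f"--- segment {i} ---")
    print(segment)
```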
At the core of any NLP AI project: the creation of a training dataset
Creating a high-quality training dataset ensures that models of superior quality and performance can be obtained. Getting there, however, takes time and domain expertise, and progress is often made through iterations.
NLP platforms can significantly reduce the time required to create a custom training dataset through strategies such as:
- Importing already annotated documents;
- Using off-the-shelf models to kickstart annotation;
- Using world knowledge (Wikidata) or business vocabularies to pre-annotate (sketched below);
- Exploiting active learning, essentially a suggestion engine that makes annotation much faster when surfaced in the user interface;
- Combining many small user-interface refinements that together have a large impact on efficiency.
This is especially true when the platform is collaborative, with flexible access management to unite all stakeholders (IT, domain experts, externally sourced annotators…).
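As an illustration of the pre-annotation idea, here is a minimal sketch of vocabulary-based pre-annotation. The terms, labels and sample text are purely illustrative and not tied to any specific platform API; in practice Wikidata or a business thesaurus would supply the vocabulary, and the suggestions would be surfaced in the annotation interface for experts to confirm or reject.

```python
import re

# Illustrative business vocabulary: term -> label.
vocabulary = {
    "acetylsalicylic acid": "DRUG",
    "aspirin": "DRUG",
    "headache": "SYMPTOM",
}

def pre_annotate(text: str, vocab: dict[str, str]) -> list[dict]:
    """Return candidate annotations (start, end, label) for known terms."""
    annotations = []
    for term, label in vocab.items():
        for match in re.finditer(rf"\b{re.escape(term)}\b", text, re.IGNORECASE):
            annotations.append(
                {"start": match.start(), "end": match.end(),
                 "text": match.group(), "label": label}
            )
    return sorted(annotations, key=lambda a: a["start"])

doc = "The patient took aspirin twice a day to relieve a persistent headache."
for annotation in pre_annotate(doc, vocabulary):
    print(annotation)
```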
Creating, testing and comparing NLP models used to be the exclusive domain of data scientists…
But wouldn’t it be much more efficient to put model creation in the hands of business users instead?
That’s definitely a key differentiator that will strongly impact ROI.
While data scientists will continue to play a role in fine-tuning parameters, which can also be made easily accessible from a user interface, a large chunk of the work (choosing among pre-loaded state-of-the-art algorithms, assessing model quality with metrics, pinpointing errors and omissions) can be handled very efficiently by business users, which helps prioritize annotation efforts and keeps domain experts in the driving seat.
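As a rough illustration of what such a platform automates behind the interface, here is a minimal train-and-evaluate sketch using scikit-learn as a stand-in, on a toy dataset. The per-label precision, recall and F1 scores are the kind of quality metrics that let domain experts spot weak classes and prioritize further annotation.

```python
# Toy sketch of the train/evaluate loop a platform runs behind its UI.
# The texts and labels are illustrative; a real project would reuse the
# annotated corpus built in the previous step.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

texts = [
    "Please find attached the invoice for March.",
    "The invoice total is 1,250 euros, due in 30 days.",
    "This agreement may be terminated with 60 days notice.",
    "Either party may terminate the contract for material breach.",
] * 10  # repeat the toy samples so the split has enough data
labels = ["invoice", "invoice", "contract", "contract"] * 10

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Precision/recall/F1 per label: the metrics a platform surfaces so that
# domain experts can see where the model is weak.
print(classification_report(y_test, model.predict(X_test)))
```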
It’s not just about NLP models, it’s all about NLP pipelines
Applying a model is not the only thing that counts. In fact, the challenge is to combine a variety of components and models into a single NLP pipeline, ideally through an easy-to-use interface.
These components can process text and annotations (reconciliation, consolidation…) or even transform the results into a required output format (tabular, reading grid, XML…).
A typical use case for NLP pipelines is to structure information in such a way that it can automatically update business vocabularies or knowledge graphs.
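The following sketch illustrates the pipeline idea with placeholder components: an entity extractor, a consolidation step and a tabular export. None of it reflects a real platform's processors; it only shows how independent components chain into a single pipeline whose output is ready for business use.

```python
# Placeholder pipeline: chain components, then export a tabular result.
import csv
import io

def extract_entities(doc: dict) -> dict:
    # Placeholder extractor: in practice this would call an NLP model.
    doc["entities"] = [{"text": "Kairntech", "label": "ORG"}]
    return doc

def consolidate(doc: dict) -> dict:
    # Placeholder consolidation step, e.g. deduplicating repeated entities.
    seen, unique = set(), []
    for entity in doc["entities"]:
        key = (entity["text"], entity["label"])
        if key not in seen:
            seen.add(key)
            unique.append(entity)
    doc["entities"] = unique
    return doc

def to_csv(docs: list[dict]) -> str:
    # Transform the annotations into the tabular format business users expect.
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["doc_id", "entity", "label"])
    for doc in docs:
        for entity in doc["entities"]:
            writer.writerow([doc["id"], entity["text"], entity["label"]])
    return out.getvalue()

pipeline = [extract_entities, consolidate]
docs = [{"id": "doc-1", "text": "Kairntech provides an NLP platform."}]
for step in pipeline:
    docs = [step(d) for d in docs]
print(to_csv(docs))
```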
And last but not least, models need to be industrialized
Having a production platform that scales horizontally and vertically, and that integrates into any existing environment via a comprehensive REST API, is a huge accelerator.
Moreover, once models are industrialized and applied to new documents, quality control and the continued improvement of model performance require a tight integration between the development and production environments.
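As an illustration of the REST integration mentioned above, here is a minimal client sketch. The endpoint URL, authentication scheme and payload shape are hypothetical placeholders rather than the actual Kairntech API; the point is that a single HTTP call is enough to embed a deployed pipeline in an existing application.

```python
# Hypothetical REST client for a deployed NLP pipeline: the URL, token and
# payload below are illustrative placeholders, not a documented API.
import requests

API_URL = "https://nlp.example.com/api/pipelines/contract-extraction/process"  # hypothetical
API_TOKEN = "YOUR_API_TOKEN"  # hypothetical credential

def process_document(text: str) -> dict:
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json={"text": text},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    result = process_document("This agreement is governed by French law.")
    print(result)
```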
Which leads us to a couple of bold statements on the time saved by automated and embeddable NLP pipeline software.
Based on our estimates, we come to the following conclusions:
- 50% less time spent on data preparation
- 80% less time spent on dataset creation
- 80% less time spent on model development
- 80% less time spent on pipeline implementation
- 90% less time spent on industrialization
For more information please contact us at info@kairntech.com