Combining state of the art NLP and AI approaches for document analysis.

Posted by Stefan Geißler on September 27, 2019

Introduction

Many key processes in today's business world involve analyzing and processing textual information. For these processes it is good news that a wide variety of serious NLP components are available today, many of them in the public domain. We'll sketch a scenario for a sample analysis task and show how it can be addressed with a combination of such components. As the software itself becomes more and more of a commodity, the emphasis shifts towards the data required to train the models and the expertise needed to make reasonable use of the available tools. Employing, optimizing and integrating approaches to address these scenarios is at the core of the Kairntech value proposition: meeting the growing need to build and deploy new AI-based products or services from unstructured content (text, audio...) by leveraging Machine Learning and Knowledge Graph technologies.

In this text we give a quick overview to illustrate our approach.

Sample task: Analyzing clinical trials

As a sample scenario for the remainder of this text, we select clinical trials, i.e. texts that document a key step in drug development in the pharmaceutical industry. Clinical trials are a highly regulated type of document: they must adhere to strict standards regarding which information they record and in which format. They are also available to the public, which makes them a well-suited example here. Note, however, that the approach we sketch applies in principle to other types of documents as well, such as business documents (invoices, tenders, contracts), legal documents (laws, opinions) or technical and scientific publications.

In what follows we'll use a sample corpus of several thousand clinical trials from https://clinicaltrialsapi.cancer.gov/.


The clinical trials from NCI Clinical Trials are available as JSON data with a rich set of metadata.

We'll make use of the extensive metadata that comes with these documents and store it in a graph database (we use ArangoDB here). We'll then further enrich the content using a state-of-the-art entity extraction approach, Entity-Fishing, which disambiguates entities and links them with background knowledge from Wikidata. Note that while the choice of ArangoDB as the database engine and Wikidata as the knowledge source facilitates deployment because both are publicly available, the Kairntech approach also extends to proprietary setups, for instance company-specific data sources instead of Wikidata.
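As a rough illustration of the storage step, the sketch below splits one trial record into vertex and edge documents of the kind a graph database such as ArangoDB stores, where edges point to vertices via `_from` and `_to` handles of the form collection/key. The field names (`nct_id`, `principal_investigator`, `lead_org`), the collection names, the helper functions and the sample record are all assumptions made for illustration; they are not the schema actually used here.

```python
import json
import re


def make_key(text):
    """Derive a conservative document key: lowercase letters, digits, underscores."""
    return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")


def trial_to_graph(trial):
    """Split one trial record into vertex and edge documents.

    Assumed (hypothetical) input fields: nct_id, principal_investigator, lead_org.
    """
    trial_key = make_key(trial["nct_id"])
    trial_handle = "trials/" + trial_key
    vertices = [{"collection": "trials", "_key": trial_key, "nct_id": trial["nct_id"]}]
    edges = []
    for field, collection, relation in [
        ("principal_investigator", "persons", "led_by"),
        ("lead_org", "organizations", "sponsored_by"),
    ]:
        value = trial.get(field)
        if not value:
            continue  # heterogeneous records: a missing field is simply skipped
        key = make_key(value)
        vertices.append({"collection": collection, "_key": key, "name": value})
        edges.append({"_from": trial_handle, "_to": collection + "/" + key, "type": relation})
    return vertices, edges


# Made-up record with placeholder values, not real trial data.
sample = json.loads(
    '{"nct_id": "NCT00000000", "principal_investigator": "Jane Doe", '
    '"lead_org": "Example Cancer Center"}'
)
vertices, edges = trial_to_graph(sample)
for edge in edges:
    print(edge["_from"], "-[" + edge["type"] + "]->", edge["_to"])
```

The resulting dictionaries could then be written to the database with whatever client library the deployment uses; that step is left out here because it depends on the concrete setup.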

Why is full-text search not enough?

Having our sample data available in digital form and indexed in a full-text search index already covers a wide variety of search requirements, such as "what studies do we have that mention 'Novartis'?" or "show me documents that mention 'Merck' as well as 'Diabetes'!" That kind of access is well understood and powerful. Yet a wide range of equally relevant search requirements is beyond the reach of this approach and demands more elaborate methods:

  • "Which pharmaceutical company is most active on 'Malaria'?" To address that question, a full-text index does not suffice. We need instead the capability to recognize an expression as being of the type pharmaceutical company, and then to collect and rank these expressions.
  • "Which pharmaceutical company is most active on cancer?" Here, in addition to the requirement above, we need to be able to match all occurrences of "leukemia" and the many other expressions referring to specific types of cancer, i.e. we need to include taxonomic knowledge.
  • "What is the most recent study on 'SARS'?" Here our approach must not only recognize and normalize a wide variety of date formats ("Oct 21, 2015", "2015-10-21", "October 21st, 2015", …) but also detect the precise meaning of a date in the document: a specific date could be the publication date of the study, the start of patient recruitment, the date when one specific incident happened or when a regulation concerning the study design was enacted, and there are many more such possibilities.
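The normalization half of the last requirement can be sketched in a few lines. The example below maps the three date variants quoted above to one canonical value using only the Python standard library; the function name `normalize_date` and the short list of formats are assumptions for illustration, and a real corpus would need many more patterns. It does not address the harder question of which role a date plays in the document.

```python
import re
from datetime import date, datetime

# Formats taken from the examples in the text; month names are parsed
# according to the current locale (English is assumed here).
KNOWN_FORMATS = ("%Y-%m-%d", "%b %d, %Y", "%B %d, %Y")


def normalize_date(raw):
    """Return a datetime.date for a supported date string, else None."""
    # Drop ordinal suffixes such as "21st" or "3rd" before parsing.
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)\b", r"\1", raw.strip())
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue  # try the next known format
    return None


variants = ["Oct 21, 2015", "2015-10-21", "October 21st, 2015"]
normalized = [normalize_date(v) for v in variants]
print(normalized)
```

Once all variants share one representation, sorting trials by date to find the most recent one becomes a simple comparison.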

Having imported the sample data into our data store immediately allows us to address some of the search demands above. We'll start by making use of the metadata that comes with the data and then extend it with additional metadata obtained via entity extraction.


Making use of the metadata, the document collection can be queried, for instance, for the principal investigators who have been involved in the largest number of trials.
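A query of this kind boils down to grouping and counting. The sketch below shows the idea on a handful of made-up records held in memory; in the actual setup the same aggregation would be expressed in the query language of the data store. The field name `principal_investigator` and the function `top_investigators` are assumptions for illustration.

```python
from collections import Counter

# Made-up records; "principal_investigator" is an assumed field name.
trials = [
    {"nct_id": "T1", "principal_investigator": "Investigator A"},
    {"nct_id": "T2", "principal_investigator": "Investigator B"},
    {"nct_id": "T3", "principal_investigator": "Investigator A"},
    {"nct_id": "T4"},  # no investigator recorded
]


def top_investigators(records, n=10):
    """Rank investigators by the number of trials they appear in."""
    counts = Counter(
        r["principal_investigator"]
        for r in records
        if r.get("principal_investigator")
    )
    return counts.most_common(n)


ranking = top_investigators(trials)
print(ranking)
```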

While queries such as the one above can of course be addressed with structured data in a relational database, there are a number of reasons for the recent trend to model knowledge in graph databases instead. Graph databases are "schemaless", meaning that it is much easier to store, manipulate and navigate heterogeneous data than in relational databases. In addition, exploiting links between different items (e.g. between a study, its principal investigator, the study site, etc.) often involves executing costly joins in the relational case, whereas in the graph database case it can be accomplished using graph traversal methods that can be much more efficient. [For a more detailed discussion of this topic see this link]
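To make the notion of graph traversal concrete, here is a minimal in-memory sketch: starting from one investigator, it follows edges breadth-first to find the study sites two hops away. The vertex handles and edges are made up, and the code only illustrates the traversal idea; it says nothing about how a graph database implements or optimizes it.

```python
from collections import deque

# Made-up edges (source, target); vertex handles are illustrative only.
EDGES = [
    ("persons/investigator_a", "trials/t1"),
    ("persons/investigator_a", "trials/t2"),
    ("trials/t1", "sites/site_x"),
    ("trials/t2", "sites/site_y"),
    ("trials/t2", "sites/site_x"),
]


def build_adjacency(edges):
    """Index edges by source vertex so neighbors can be looked up directly."""
    adjacency = {}
    for source, target in edges:
        adjacency.setdefault(source, []).append(target)
    return adjacency


def reachable(adjacency, start, max_depth):
    """Breadth-first traversal: (vertex, depth) pairs within max_depth hops."""
    seen = {start}
    found = []
    queue = deque([(start, 0)])
    while queue:
        vertex, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the requested depth
        for neighbor in adjacency.get(vertex, []):
            if neighbor not in seen:
                seen.add(neighbor)
                found.append((neighbor, depth + 1))
                queue.append((neighbor, depth + 1))
    return found


adjacency = build_adjacency(EDGES)
hits = reachable(adjacency, "persons/investigator_a", 2)
sites = sorted(vertex for vertex, depth in hits if depth == 2)
print(sites)
```

Each hop is a direct lookup of a vertex's neighbors, which is the operation a relational schema would express as a join between tables.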

However, besides appreciating the benefits of storing data in a graph database instead of a conventional relational database, we have so far only built upon information that already comes with our data in clean and explicit form. The trials we use as sample data are formatted as JSON documents with a rich set of fields, listing not only the principal investigator, the study sponsor and the launch date, but also detailed information on the study design, the participating sites and their respective addresses, the studied diseases and many other types of information. Any information we are interested in that is not represented as explicit metadata in our documents is, for the time being, beyond the approach above. Moreover, in most cases documents will not be organized so neatly with clean and rich explicit metadata; in many cases only the full text is available. Below we will therefore extend our approach to use information from the unstructured parts of the documents.

Extracting, disambiguating, normalizing and linking information

Besides all the explicit metadata fields, our trial corpus contains two fields that hold full text: the "brief summary" and the "detailed description". Whatever relevant information they contain first needs to be processed with appropriate algorithms that perform the sequence of steps named in the heading above:

  • Extraction: The entity must be detected within the unstructured text.
  • Disambiguation: The string in the text may have different meanings (like "Saturn" the planet and "Saturn" the god from Roman mythology). Disambiguation identifies the proper meaning in these cases for all the concepts from Wikidata, without requiring disambiguation constraints to be defined manually.
  • Normalization: The string in the text may also represent a variant of a concept for which a commonly agreed-upon preferred term exists. Normalization ensures that these preferred terms are used in the analysis results.
  • Linking: Finally, linking establishes a connection between the concept in the text and background information for this concept. For a person, that may be this person's picture or date and place of birth; for a chemical substance, it might be that substance's molecular weight, molecular formula or trade name.

The approach described here uses machine-learning-based software (entity-fishing) for these steps, building upon the Wikidata knowledge set.
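The sketch below illustrates what consuming the output of such an extraction service might look like. It builds a JSON query for a text and filters a response down to the entities that were actually linked to a Wikidata identifier. The query structure and the response field names (`entities`, `rawName`, `offsetStart`, `offsetEnd`, `wikidataId`, `confidence_score`) are assumptions about the service's JSON format and should be checked against its documentation; the response shown is a hand-written stand-in with a placeholder identifier, not real output, and the HTTP call itself is omitted.

```python
import json


def build_query(text, lang="en"):
    """JSON query for a disambiguation request (structure is an assumption)."""
    return json.dumps({"text": text, "language": {"lang": lang}})


def linked_entities(response, min_confidence=0.5):
    """Keep entities that were linked to a Wikidata ID with enough confidence."""
    results = []
    for entity in response.get("entities", []):
        wikidata_id = entity.get("wikidataId")
        if not wikidata_id:
            continue  # detected in the text but not linked to the knowledge base
        if entity.get("confidence_score", 0.0) < min_confidence:
            continue  # linked, but too uncertain to store
        results.append({
            "surface": entity["rawName"],
            "start": entity["offsetStart"],
            "end": entity["offsetEnd"],
            "wikidata_id": wikidata_id,
        })
    return results


# Hand-written stand-in for a service response; "Q-PLACEHOLDER" is not a real ID.
mock_response = {
    "entities": [
        {"rawName": "leptin", "offsetStart": 19, "offsetEnd": 25,
         "wikidataId": "Q-PLACEHOLDER", "confidence_score": 0.9},
        {"rawName": "study", "offsetStart": 5, "offsetEnd": 10,
         "confidence_score": 0.2},
    ]
}
entities = linked_entities(mock_response)
print(entities)
```

Keeping the character offsets alongside the identifier makes it possible to highlight the mention in the original text later on.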

With a few lines of code we extend our data store such that it holds not only the documents with their original metadata but also the new entities extracted from the full-text fields. We can now extend our search scenarios to cases such as "show me the most popular substances in trials for which the principal investigator was XYZ", where the substance metadata has been created by analyzing the summary or the description.
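A query like the one quoted above can be sketched as follows, assuming each stored trial carries a list of extracted entities next to its original metadata. The records, the field names (`principal_investigator`, `entities`, `id`, `type`) and the identifiers are made up for illustration.

```python
from collections import Counter

# Made-up documents: original metadata plus an "entities" list added by extraction.
trials = [
    {"principal_investigator": "XYZ",
     "entities": [{"id": "E1", "type": "substance"},
                  {"id": "E9", "type": "disease"}]},
    {"principal_investigator": "XYZ",
     "entities": [{"id": "E1", "type": "substance"},
                  {"id": "E2", "type": "substance"}]},
    {"principal_investigator": "ABC",
     "entities": [{"id": "E2", "type": "substance"}]},
]


def popular_substances(records, investigator):
    """Count, per substance, the trials of one investigator that mention it."""
    counts = Counter()
    for record in records:
        if record.get("principal_investigator") != investigator:
            continue
        # Count each entity once per trial, keyed by its normalized identifier.
        ids = {e["id"] for e in record.get("entities", []) if e["type"] == "substance"}
        counts.update(ids)
    return counts.most_common()


substances = popular_substances(trials, "XYZ")
print(substances)
```

Counting by normalized identifier rather than by surface string is what lets spelling variants of the same substance fall into one bucket.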


Analyzing the full-text fields (summary and description) in the trials of a particular researcher, we get an idea of this person's research focus.

Note that the identified entities are not just strings with a type. As outlined above, they are normalized, disambiguated and linked entities by virtue of their connection with the information stored in the source knowledge repository (Wikidata in this case). As a result, the entities bear a host of background information, as exemplified by this instance of "Leptin", for which the extraction process is able to add a lot of additional information:


The recognized entities, by virtue of being linked back into the reference knowledge source (Wikidata in this example), carry a lot of additional background knowledge.

The setup above is appropriate where the entities to extract from the text fields are publicly known and, in this case more specifically, listed in Wikidata. Yet while Wikidata is an enormous source of information, industry projects will more often than not need to manage information that is not listed in such public sources: lists of employees with their contact information and profiles, domain-specific vocabularies or specific meanings of common entities cannot always be expected to be part of public sources. To complement the approach outlined so far, a tool is therefore required that supports and facilitates the generation of training data sets for these scenarios.

Our answer at Kairntech is the Kairntech Sherpa platform. It addresses precisely this situation: in many industrial contexts that demand specific annotation components, the required training data sets are simply not available, even in today's alleged "Big Data" era. Instead, projects are often hindered either by insufficient training data or by the prohibitively large effort of creating it. That effort is often exacerbated by the fact that the environments and formats available for creating such data sets require an extensive technical background, which the domain experts one would like to involve in data generation may not have.

The Kairntech Sherpa platform embeds this corpus creation task in an environment that facilitates collaboration and quick adoption, also by non-technical users. Moreover, it significantly reduces annotation effort by implementing Active Learning: the environment makes informed choices about which sample to present to the users next in order to maximize the training benefit. In essence, the next sample that the user annotates is the one where the current model is least certain (leaving aside the many examples where the decision can already be made with high confidence). It has been shown that Active Learning can reduce annotation times by a significant margin (cf. this link), and in commercial projects each day of annotation effort saved can be an essential criterion.
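The selection principle described above can be illustrated with least-confidence sampling, one common Active Learning strategy. This is a generic sketch under that assumption and not a description of how the Sherpa platform is implemented; the function name and the probabilities are made up.

```python
def least_confident(predictions, batch_size=1):
    """Pick the unlabeled samples whose top class probability is lowest.

    predictions maps a sample id to the model's class probabilities.
    """
    by_confidence = sorted(predictions.items(), key=lambda item: max(item[1]))
    return [sample_id for sample_id, _ in by_confidence[:batch_size]]


# Made-up model outputs for three unlabeled text segments.
predictions = {
    "segment-1": [0.98, 0.02],  # confident: little to learn from annotating it
    "segment-2": [0.55, 0.45],  # close to the decision boundary
    "segment-3": [0.80, 0.20],
}
next_batch = least_confident(predictions, batch_size=2)
print(next_batch)
```

After the user annotates the selected samples, the model is retrained and the remaining samples are scored again, so each annotation round targets the cases the model currently finds hardest.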

Conclusion

Processing textual data in serious industrial contexts today can build upon dramatic advances in the algorithmic foundations as well as in publicly available components. We have listed a number of them, such as large public knowledge sources (e.g. Wikidata), data storage environments (such as ArangoDB or Cayley), NLP and ML libraries (such as Keras, DeLFT and spaCy) as well as relevant document corpora (such as https://clinicaltrialsapi.cancer.gov or https://free.law). This situation underlines the often-voiced observation that, with such an abundance of powerful software in the public domain, the real challenge (besides the ability to use, integrate, optimize and maintain the software) is procuring the training data needed to apply the software stack to a relevant business problem. We have presented the Kairntech Sherpa platform as a text annotation environment that addresses this essential demand.