Out-of-the-box extraction of ~100 million concepts
...in several languages, ready to be used in large-scale document analysis scenarios.
An important point of access to the information in a document or in a collection of documents is the set of entities contained in the text: Which person names or places occur? Which substances, diseases, materials or organizations are mentioned?
If that is what is needed in your document analysis workflow, the Kairntech Named-Entity Extraction API is the solution. Features of this API are:
- Vast collection of ~100 million entities in six European languages (more available on request)
- Regularly updated to reflect the evolution of world knowledge over time, with updates approximately once every three months
- Based on Wikidata, which contains, among other things, dozens of domain-specific subthesauri (MeSH, DrugBank, GeoNames, ...)
- Extracted entities are scored, normalized, typed, disambiguated and linked:
  - Scored: Each concept comes with a confidence score, so users can filter according to their needs (e.g. “retain only the top 5 concepts from each document”).
  - Normalized: Different variants of a concept are mapped onto a canonical, preferred term.
  - Typed: Extracted concepts are placed into broader concept subtrees, so users can filter by type (e.g. “retain only the disease names”).
  - Disambiguated: For ambiguous concepts, the proper meaning in the specific context is determined. For example, the term “NHL” may refer to “Non-Hodgkin Lymphoma” in a medical context and to the “National Hockey League” in a sports context.
  - Linked: Extracted concepts are associated with a URL (pointing into Wikidata) and an explanatory abstract introducing the concept. Where a concept is part of one of the subthesauri, the respective identifier is returned, too.
- Rich REST API available (cf. https://sherpa-sandbox.kairntech.com/swagger-ui/); sample Python code outlining the use of the API is also available
- Transparent pricing model
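To illustrate how scored and typed results such as those described in the list above might be filtered on the client side, here is a minimal Python sketch. It does not call the API: the `Entity` record, its field names, the type labels and the score values are all assumptions made for illustration and do not reflect the actual response format, which is documented in the Swagger UI linked above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """Hypothetical, simplified view of one extracted concept (not the real schema)."""
    surface: str         # text as it appears in the document
    preferred_term: str  # canonical term after normalization
    type: str            # broad concept type (labels here are illustrative)
    score: float         # confidence score (values here are made up)


def top_k(entities, k):
    """Return the k entities with the highest confidence score."""
    return sorted(entities, key=lambda e: e.score, reverse=True)[:k]


def of_type(entities, wanted):
    """Keep only the entities of the requested type, in original order."""
    return [e for e in entities if e.type == wanted]


# Illustrative annotations for a single medical document.
entities = [
    Entity("NHL", "Non-Hodgkin lymphoma", "disease", 0.91),
    Entity("lymphoma", "Lymphoma", "disease", 0.84),
    Entity("Boston", "Boston", "location", 0.40),
    Entity("chemo", "Chemotherapy", "treatment", 0.77),
]

# "Retain only the top 2 concepts" and "retain only the disease names".
print([e.preferred_term for e in top_k(entities, 2)])
print([e.preferred_term for e in of_type(entities, "disease")])
```

Keeping the filtering logic separate from the extraction call means that thresholds and type filters can be adjusted per use case without re-running the extraction.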
The service implemented by the Kairntech Named-Entity Extraction API directly responds to many requirements of document analysis processes: Documents can be enriched with their most relevant concepts.
The concepts can be organized by subtypes and topics. The URL returned by the API can be used as a unique handle, allowing the enriched content to be integrated with other third-party processes. The explanatory abstracts are a valuable reading aid for users unfamiliar with a given concept. The service is regularly updated by Kairntech so that new concepts are continually taken into account, reflecting changes in the world around us.
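The idea of using the concept URL as a unique handle can be sketched in a few lines of Python. The example below assumes that each annotation has already been reduced to a (document id, surface form, concept URL) triple; this tuple layout, the document ids and the function name `index_by_concept` are hypothetical and chosen for illustration only. Because differently worded mentions resolve to the same URL, the URL can serve as a join key across documents and systems.

```python
from collections import defaultdict


def index_by_concept(annotations):
    """Map each concept URL to the sorted list of documents that mention it."""
    index = defaultdict(set)
    for doc_id, _surface, url in annotations:
        index[url].add(doc_id)
    return {url: sorted(docs) for url, docs in index.items()}


# Illustrative annotations: different surface forms share one Wikidata URL.
annotations = [
    ("doc-1", "Berlin", "https://www.wikidata.org/wiki/Q64"),
    ("doc-2", "the German capital", "https://www.wikidata.org/wiki/Q64"),
    ("doc-2", "Paris", "https://www.wikidata.org/wiki/Q90"),
    ("doc-3", "Douglas Adams", "https://www.wikidata.org/wiki/Q42"),
]

for url, docs in index_by_concept(annotations).items():
    print(url, docs)
```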
The Kairntech Named-Entity Extraction API is based on open-source code written by Kairntech’s Chief Machine Learning expert Patrice Lopez (cf. https://github.com/kermitt2/entity-fishing), wrapped into the Kairntech API, prepared for large-scale use and integrated into an update scheme.
The Kairntech Named-Entity Extraction API gives document analysis processes access to the knowledge from almost 100 million concepts. Sometimes, however, even that is not enough. In scenarios where entities need to be processed that are, for one reason or another, not part of this set (for example, company-specific or domain-specific entities that are not found in any public vocabulary), the API can be combined with the Kairntech Machine Learning Platform Sherpa, where users can teach the system to take custom entities into account using powerful Deep Learning algorithms in an easy-to-use web interface.
Olivier Deguernel (Sealk.co): “Our processes require the enrichment of document content with a wide range of named-entities. We have analysed the available APIs on the market and have decided to integrate the Kairntech Named-Entity Extraction API into our offering. The clear API, the superior quality and the wealth of information returned by the API made it a valuable completion of our processes.”
For more information on the Kairntech Named-Entity Extraction API, sample code illustrating the recommended use of the API, technical specifications and available pricing models, contact us at email@example.com.
Appendix: More sample Annotations (extraction threshold set to a low value to return also more peripheral concepts for illustrative purposes):