Frequently Asked Questions.
It is a supervised task that consists of training a model on a dataset to automatically classify documents into predefined categories.
Documents can be assigned to either a single category or multiple categories:
- Single category: only one category can be assigned to each document
- Multi-category: one or more categories can be assigned to each document
In some cases it may make sense to introduce explicit negative examples by defining an “other” category.
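The difference between the two assignment modes can be sketched as follows. This is a minimal illustration, not Kairntech's implementation: the score values, the category names and the 0.5 threshold are all assumptions made for the example.

```python
def assign_single(scores):
    """Single category: keep only the label with the highest score."""
    return max(scores, key=scores.get)


def assign_multi(scores, threshold=0.5):
    """Multi-category: keep every label whose score reaches the threshold."""
    return sorted(label for label, s in scores.items() if s >= threshold)


# Hypothetical per-category scores, as a trained classifier might produce them.
scores = {"finance": 0.81, "sports": 0.07, "politics": 0.64, "other": 0.02}

single = assign_single(scores)
multi = assign_multi(scores)
print(single)
print(multi)
```

With an explicit “other” category in the label set, a document that fits none of the real categories still has a label to fall back on in single-category mode.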
Entity detection is known as Named Entity Recognition (NER).
NER is an NLP task that attempts to identify entities in natural language text and assign them to their proper types, such as person names, locations, company names, dates, times and measurement expressions, but also more domain- or scenario-specific types such as diseases, proteins, product names, etc.
NER is a well-studied and often-used step in the analysis of textual content: NER alone often makes it possible to assess essential information about the Who, Where, When and What around a given text.
An entity may be made up of sub-entities. For example, an entity of type Person generally has two sub-entities: a first name and a last name.
The NLP task to address NER is known as Sequence Labelling.
Kairntech implements NER in different forms from which users can choose according to their specific needs.
Entity Linking is the NLP task of recognizing named entities and disambiguating them against a Knowledge Base.
Many specific entity types are well-described in the public domain and there is often no need to go any further: the list of countries of the world, of elements in the periodic table, of moons of Jupiter and many other such types is more or less known and stable.
Kairntech gives users access to an NER component that “knows” about more than 90 million such entities on almost all imaginable topics and in many languages, and which is constantly updated by Kairntech. We benefit here from the knowledge in the Wikidata project and turn this data into a running, directly usable component at regular intervals.
Users can focus their attention on specific subparts of this huge dataset by defining specific filters (“show me only the organisms!”).
The results of this Entity Linking component are entities that are:
- Typed: the entity “knows” whether it is a location or a ship or a geological age
- Scored: the entity is associated with a numerical score that reflects how important that entity is in the context of the document
- Disambiguated: Where a string (say “cancer”) can have more than one meaning, the component decides which meaning is the appropriate one (the animal or the disease?)
- Normalized: Where a given entity is known with different names (say “NIDDM” which is a synonym of “Diabetes Mellitus Type 2”) the component makes sure to map the various synonyms to the preferred names, making the NER results much cleaner.
- Linked: In many cases a named entity has important information associated with it that is not directly part of the text where this entity may be mentioned: A city has a geographical location, a person has a birthdate and a nationality, a protein has a sequence of amino acids. By linking the entity to its known background information, the Kairntech component massively enriches the processed content with important knowledge.
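The five properties listed above can be pictured as fields of a single record. The sketch below is purely illustrative: the class name, the field names and the `KB:0001` identifier are hypothetical and do not reflect the actual output format of the Kairntech component.

```python
from dataclasses import dataclass, field


@dataclass
class LinkedEntity:
    surface_form: str      # the string as it appears in the text
    entity_type: str       # Typed
    score: float           # Scored: importance in the document
    kb_id: str             # Disambiguated: one identifier per meaning
    preferred_name: str    # Normalized: synonyms map to one name
    properties: dict = field(default_factory=dict)  # Linked: background facts


# Two mentions of the same disease under different names (placeholder values).
mentions = [
    LinkedEntity("NIDDM", "disease", 0.92, "KB:0001", "Diabetes Mellitus Type 2"),
    LinkedEntity("Diabetes Mellitus Type 2", "disease", 0.88, "KB:0001",
                 "Diabetes Mellitus Type 2"),
]

# After normalization both mentions collapse to a single preferred name.
unique_names = {m.preferred_name for m in mentions}
print(unique_names)
```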
To learn more on Entity Linking using Wikidata you can read this article here or here.
Summarization is the NLP task of producing a shorter version of one or several documents that preserves most of the input’s meaning.
There are two different flavors of summarization:
- Extractive Summarization
- Abstractive Summarization
Extractive Summarization is the simpler and quicker method, yet it often already provides the desired result: a summary that makes it possible to assess the relevance of a content item. Here the most informative sentences of a document are identified and concatenated.
When doing abstractive summarization, the algorithm generates a new text based on the underlying meaning of the original document.
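The extractive flavor is simple enough to sketch in a few lines. The example below scores each sentence by the average frequency of its words in the whole text and keeps the best ones in their original order. Word frequency is only one possible scoring heuristic, chosen here as an assumption for illustration. It applies no stop-word filtering, and it is not the method used by any particular product.

```python
import re
from collections import Counter


def extractive_summary(text, n_sentences=1):
    """Return the n highest-scoring sentences, in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z]+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z]+", sentence.lower())
        # Average word frequency, so long sentences are not favored by length alone.
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    return " ".join(s for s in sentences if s in top)


text = ("Solar power is growing. "
        "Solar power costs keep falling as solar panels improve. "
        "The weather was nice.")
summary = extractive_summary(text, n_sentences=1)
print(summary)
```

Abstractive summarization cannot be reduced to a selection step like this, since it generates new sentences rather than reusing existing ones.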
To learn more on Text Summarization, you can read this article.
This is a structured way of organizing knowledge. In general, a knowledge base lists “objects” belonging to “classes”; these objects can have “properties” and “links” between them.
What is extracting in connection with a Knowledge Base?
Extracting means identifying a named entity in a text and uniquely resolving it (removing ambiguity) in the Knowledge Base. For example, there is a city named “Tripoli” in both Libya and Lebanon. If the article is about Lebanon, “Tripoli” will be resolved to the city in Lebanon.
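The “Tripoli” case can be illustrated with a toy disambiguation step that picks the candidate whose context words overlap most with the article. The candidate identifiers and context word sets below are invented for the example, and real systems use far richer signals than simple word overlap.

```python
import re

# Hypothetical mini knowledge base: two candidates for the same surface form.
CANDIDATES = {
    "Tripoli": [
        {"id": "tripoli-libya",
         "context": {"libya", "libyan", "mediterranean", "capital"}},
        {"id": "tripoli-lebanon",
         "context": {"lebanon", "lebanese", "beirut"}},
    ]
}


def disambiguate(mention, text):
    """Pick the candidate sharing the most context words with the text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    best = max(CANDIDATES[mention], key=lambda c: len(c["context"] & tokens))
    return best["id"]


article = ("The Lebanese government announced new port investments "
           "in Tripoli, north of Beirut.")
result = disambiguate("Tripoli", article)
print(result)
```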
They give the machine a first, general level of analysis and understanding of a text.
It is possible to create word embeddings in fields where the sentence structures and vocabulary are very specific (biomedical field for example).
Using word embeddings generally improves the quality of a model significantly.
The corpus consists of raw documents without annotation.
The dataset consists of annotated documents or segments.
This is an ordered list of “business” keywords about a given domain: A vocabulary on “finance” may contain for instance “loans”, “rates”, “interests” …
It is a mathematical representation of a language built from neural networks on a very large number of examples, i.e. on a very large number of documents (hundreds of thousands or tens of millions of documents).
Embeddings capture implicit knowledge (for instance that the word “Paris” is somewhat similar to “Madrid” because they both stand for large European capitals) without anyone having to write that explicitly into a lexicon.
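The notion of “similar” words can be made concrete with cosine similarity between vectors. The three-dimensional vectors below are invented for illustration only: real embeddings are learned from large corpora and typically have hundreds of dimensions.

```python
import math

# Toy vectors, made up for this example (not taken from any trained model).
vectors = {
    "paris": [0.90, 0.80, 0.10],
    "madrid": [0.85, 0.75, 0.20],
    "banana": [0.10, 0.20, 0.90],
}


def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


sim_capitals = cosine(vectors["paris"], vectors["madrid"])
sim_fruit = cosine(vectors["paris"], vectors["banana"])
print(sim_capitals > sim_fruit)
```

In this toy setup the two capitals point in nearly the same direction while “banana” does not, which is the kind of relation a model can exploit without a hand-written lexicon.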
For example: if one annotation covers “Mr. John Smith” and another covers only “John Smith”, leaving out “Mr.”, the two annotations are inconsistent and the machine will learn less well. It is very important to follow the same guidelines when annotating.
A segment must be annotated completely.
No annotation should be missing or it will be considered as a counter example to the missing annotation!
It is a named entity or a category suggested by the machine that the user can validate, correct or reject in order to enrich and improve the training dataset and thus improve the quality of the final model.
The user can also check whether one or more named entities or categories should be added to the respective segment or document. The aim is to help the user save time when creating a training dataset or a machine learning model.
What is a counter example?
The machine can suggest an entity or a category incorrectly. If the user rejects this suggestion, the machine will consider the validated segment as a counter example to that entity or category.
The machine learns from both examples and counter examples.
It is a workflow with:
- A format conversion component
- One or several models and annotators (based on lexicons, knowledge bases…)
- A consolidation processor
- A formatter to generate a customized output (content enrichment, XLS export, database feed…)
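The four stages listed above can be sketched as plain functions chained together. Everything in this example is a simplified stand-in: the function names, the tiny lexicon and the rule-based “model” are hypothetical and only show how the stages hand data to each other.

```python
def convert(raw):
    """Format conversion: here, just normalize whitespace around raw text."""
    return raw.strip()


def lexicon_annotator(text):
    """Annotator based on a (hypothetical) finance lexicon."""
    lexicon = {"loans": "finance", "rates": "finance"}
    return [{"text": w, "label": lab, "source": "lexicon"}
            for w, lab in lexicon.items() if w in text.lower()]


def model_annotator(text):
    """Stand-in for a trained model; a real one would predict from features."""
    if "rates" in text.lower():
        return [{"text": "rates", "label": "finance", "source": "model"}]
    return []


def consolidate(annotations):
    """Consolidation: keep one annotation per (text, label) pair."""
    merged = {}
    for ann in annotations:
        merged.setdefault((ann["text"], ann["label"]), ann)
    return list(merged.values())


def format_output(text, annotations):
    """Formatter: produce a simple enriched-content record."""
    return {"text": text,
            "annotations": [(a["text"], a["label"]) for a in annotations]}


def run_pipeline(raw):
    text = convert(raw)
    annotations = lexicon_annotator(text) + model_annotator(text)
    return format_output(text, consolidate(annotations))


result = run_pipeline("  Interest rates on loans rose.  ")
print(result)
```

The consolidation step matters because several annotators can find the same entity; without it the output would contain “rates” twice.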
It is the knowledge base built from Wikipedia.
It consists of dividing a document into “units”: sentence, paragraph…
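A naive segmenter illustrates the idea. The splitting rules below (blank lines for paragraphs, end-of-sentence punctuation for sentences) are assumptions made for this sketch, and a production segmenter would also handle abbreviations such as “Mr.”, which this one would wrongly split on.

```python
import re


def segment(document, unit="sentence"):
    """Split a document into sentence or paragraph segments."""
    if unit == "paragraph":
        parts = re.split(r"\n\s*\n", document)
    elif unit == "sentence":
        parts = re.split(r"(?<=[.!?])\s+", document)
    else:
        raise ValueError(f"unknown unit: {unit}")
    return [p.strip() for p in parts if p.strip()]


doc = ("NER finds entities. It is widely used.\n\n"
       "Segments can be annotated one by one.")
sentences = segment(doc)
paragraphs = segment(doc, unit="paragraph")
print(len(sentences), len(paragraphs))
```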
Why is it interesting to segment documents?
- To save time by annotating only segments and not all documents.
- To build a good model, it is not necessary to annotate all the documents in their entirety (Gold Standard training dataset).
Good to know:
- For NER Projects, the training can be done on segments or documents.
- For Categorization Projects, the training is done on documents.
It is a set of documents or segments that have been annotated.
What does “in” or “outside” the dataset mean?
- A segment or document that has not been annotated is “outside” the dataset.
Note: A segment whose annotations have been suggested and finally rejected by the user will be “in” the dataset. It will be considered as a counter example.
It is a chain of models in which the output of one becomes the input of the next. For example, a first model recognizes people’s names, and a second model analyzes that output to extract the first and last names.
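The person-name example can be sketched as two functions, where the second consumes the output of the first. Both “models” here are trivial stand-ins (a fixed list of known names and a split on the first space) chosen only to show the chaining, not how trained models work internally.

```python
# Stand-in for a trained NER model: a fixed list of known person names.
KNOWN_PEOPLE = ["John Smith", "Marie Curie"]


def person_model(text):
    """First model: return the person names found in the text."""
    return [name for name in KNOWN_PEOPLE if name in text]


def name_parts_model(person):
    """Second model: split a person name into first and last name."""
    first, _, last = person.partition(" ")
    return {"first_name": first, "last_name": last}


def run_chain(text):
    """Feed each output of the first model into the second model."""
    return [name_parts_model(person) for person in person_model(text)]


result = run_chain("Mr. John Smith met Marie Curie in Paris.")
print(result)
```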
Yes, we can import PDF scans by using an OCR converter.