FAQs - Kairntech Documentation

It is a supervised categorization task: a model is trained on a dataset so that it can automatically classify documents into predefined categories.

Documents can be assigned to either a single category or multiple categories:

Single category: only one category can be assigned to each document
Multi-category: one or more categories can be assigned to each document
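
The difference between the two modes can be sketched in a few lines. This is only an illustration: the category names, the scores and the 0.5 threshold are made-up assumptions, not values produced or used by Kairntech.

```python
def assign_single(scores):
    """Single category: keep only the best-scoring category."""
    return max(scores, key=scores.get)


def assign_multi(scores, threshold=0.5):
    """Multi-category: keep every category whose score reaches the threshold."""
    return sorted(c for c, s in scores.items() if s >= threshold)


# Hypothetical scores that a trained model might return for one document.
scores = {"finance": 0.81, "sports": 0.07, "politics": 0.64}

print(assign_single(scores))
print(assign_multi(scores))
```

With these example scores, the single-category mode keeps only "finance", while the multi-category mode keeps both "finance" and "politics".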

In some cases it may make sense to introduce explicit negative examples by defining an “other” category.

Entity detection is known as Named Entity Recognition (NER).

NER is an NLP task that attempts to identify entities in natural language text and assign them to their proper types, such as person names, locations, company names, dates, times and measurement expressions, but also more domain- or scenario-specific types such as diseases, proteins, product names, etc.

NER is a well-studied and often-used step in the analysis of textual content: NER alone often makes it possible to assess essential information about the Who, Where, When and What of a given text.

An entity may be made up of sub-entities. For example, an entity of type Person generally has two sub-entities: a first name and a last name.

The NLP task used to address NER is known as Sequence Labelling.
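
In sequence labelling, every token of the text receives a label. A minimal sketch, assuming the widely used BIO scheme (B = beginning of an entity, I = inside an entity, O = outside any entity); the source does not state which tagging scheme Kairntech uses, and the function name `to_bio` is hypothetical:

```python
def to_bio(tokens, entities):
    """Turn entity spans into one BIO tag per token.

    entities: list of (start_token, end_token_exclusive, entity_type).
    """
    tags = ["O"] * len(tokens)
    for start, end, etype in entities:
        tags[start] = f"B-{etype}"
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"
    return tags


tokens = ["John", "Smith", "visited", "Paris", "yesterday"]
entities = [(0, 2, "PERSON"), (3, 4, "LOCATION")]

for token, tag in zip(tokens, to_bio(tokens, entities)):
    print(f"{token}\t{tag}")
```

A sequence labelling model is trained to predict exactly this kind of tag sequence from the tokens alone.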

Kairntech implements NER in different forms from which users can choose according to their specific needs.

Entity Linking is the NLP task of recognizing named entities and disambiguating them against a Knowledge Base.

Many specific entity types are well-described in the public domain and there is often no need to go any further: the lists of countries of the world, of elements in the periodic table, of moons of Jupiter and of many other such types are more or less known and stable.

Kairntech gives users access to an NER component that “knows” about more than 90 million such entities on almost all imaginable topics and in many languages, and which is constantly updated by Kairntech. We benefit here from the knowledge in the Wikidata project and turn this data into a running, directly usable component at regular intervals.

Users can focus their attention on specific subparts of this huge dataset by defining specific filters (“show me only the organisms!”).

Entity Linking component using Wikidata as Knowledge Base.

The results of this Entity Linking component are entities that are:

Typed: the entity “knows” whether it is a location or a ship or a geological age
Scored: the entity is associated with a numerical score that reflects how important that entity is in the context of the document
Disambiguated: Where a string (say “cancer”) can have more than one meaning, the component decides which meaning is the appropriate one (the animal or the disease?)
Normalized: Where a given entity is known with different names (say “NIDDM” which is a synonym of “Diabetes Mellitus Type 2”) the component makes sure to map the various synonyms to the preferred names, making the NER results much cleaner.
Linked: In many cases a named entity has important information associated with it that is not directly part of the text where this entity is mentioned: a city has a geographical location, a person has a birthdate and a nationality, a protein has a sequence of amino acids. By linking the entity to its known background information, the Kairntech component massively enriches the processed content with important knowledge.
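
The five properties above can be pictured as the fields of one result record. The sketch below is purely illustrative: the class `LinkedEntity`, its field names, the synonym table and the score are assumptions made for this example and do not describe the actual output format of the Kairntech component.

```python
from dataclasses import dataclass, field

# Hypothetical synonym table; a real component would derive it from the knowledge base.
SYNONYMS = {
    "niddm": "Diabetes Mellitus Type 2",
    "type 2 diabetes": "Diabetes Mellitus Type 2",
}


def normalize(mention):
    """Map a known synonym to its preferred name; leave unknown strings unchanged."""
    return SYNONYMS.get(mention.lower(), mention)


@dataclass
class LinkedEntity:
    mention: str          # the string as found in the text
    preferred_name: str   # Normalized
    entity_type: str      # Typed
    score: float          # Scored
    properties: dict = field(default_factory=dict)  # Linked background information


entity = LinkedEntity(
    mention="NIDDM",
    preferred_name=normalize("NIDDM"),
    entity_type="disease",
    score=0.87,  # made-up value
)
print(entity.preferred_name)
```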

To learn more on Entity Linking using Wikidata you can read this article here or here.

Summarization is the NLP task of producing a shorter version of one or several documents that preserves most of the input’s meaning.

There are two different flavors of summarization:

Extractive Summarization
Abstractive Summarization

Extractive Summarization is the simpler and quicker method, yet it often already provides the desired result: a summary that makes it possible to assess the relevance of a content item. Here the most informative sentences of a document are identified and concatenated.
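
The idea of identifying and concatenating the most informative sentences can be sketched as follows. This is a deliberately naive illustration that scores sentences by average word frequency; it is an assumption chosen for brevity, not the algorithm used by Kairntech, and the function name `extractive_summary` is hypothetical.

```python
import re
from collections import Counter


def extractive_summary(text, max_sentences=2):
    """Keep the highest-scoring sentences, in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole text; very short words are ignored.
    freq = Counter(w for w in re.findall(r"[a-z']+", text.lower()) if len(w) > 3)

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)


text = (
    "Entity linking maps entity mentions to a knowledge base. "
    "The weather was nice. "
    "Entity linking resolves ambiguous entity mentions."
)
print(extractive_summary(text))
```

In this example the off-topic sentence about the weather receives the lowest score and is dropped from the summary.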

When doing abstractive summarization, the algorithm generates a new text based on the underlying meaning of the original document.

To learn more on Text Summarization, you can read this article.

A knowledge base is a structured way of organizing knowledge. In general, a knowledge base lists “objects” belonging to “classes”; these objects can have “properties” and “links” between them.

What does extraction mean in connection with a Knowledge Base?

It means extracting a named entity from a text and uniquely identifying it (resolving ambiguity) in the Knowledge Base. For example, there is a city named “Tripoli” in Libya and another in Lebanon. If the article is about Lebanon, “Tripoli” will be resolved to the city in Lebanon.
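
The “Tripoli” example can be sketched as a comparison between the words of the article and the context words known for each candidate. The candidate list, the identifiers and the context words below are invented for this illustration; a real Entity Linking component relies on the Knowledge Base and on far richer signals.

```python
import re

# Hypothetical mini knowledge base: two candidates for the same surface string.
CANDIDATES = {
    "Tripoli": [
        {"id": "tripoli-libya", "context": {"libya", "libyan", "africa"}},
        {"id": "tripoli-lebanon", "context": {"lebanon", "lebanese", "beirut"}},
    ]
}


def disambiguate(mention, text):
    """Pick the candidate whose context words overlap most with the text."""
    words = set(re.findall(r"\w+", text.lower()))
    return max(CANDIDATES[mention], key=lambda c: len(c["context"] & words))


article = "The article describes the economy of Lebanon and the port of Tripoli."
print(disambiguate("Tripoli", article)["id"])
```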

*Typical stages in a Machine Learning Project*

They are used to have a first level of analysis and understanding of a text by the machine in a general way.

It is possible to create word embeddings in fields where the sentence structures and vocabulary are very specific (biomedical field for example).

Using word embeddings generally increases the quality of a model significantly.

The corpus consists of raw documents without annotation.

The dataset consists of annotated documents or segments.

This is an ordered list of “business” keywords about a given domain: a vocabulary on “finance” may, for instance, contain “loans”, “rates”, “interests” …

It is a mathematical representation of a language built from neural networks on a very large number of examples, i.e. on a very large number of documents (hundreds of thousands or tens of millions of documents).

Embeddings capture implicit knowledge (for instance that the word “Paris” is somewhat similar to “Madrid” because they both stand for large European capitals) without anyone having to write that explicitly into a lexicon.

For example: if a first annotation covers “Mr. John Smith” and a second one covers only “John Smith”, omitting the “Mr.”, these two annotations are inconsistent. The machine will learn less well. It is very important to follow the same guidelines when annotating.
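
“Similar” has a concrete meaning here: each word is represented by a vector, and two words are similar when their vectors point in a similar direction. A minimal sketch using cosine similarity; the three-dimensional vectors are toy values invented for this example, whereas real embeddings are learned from data and have hundreds of dimensions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, made up for illustration only.
vectors = {
    "paris": [0.90, 0.80, 0.10],
    "madrid": [0.85, 0.75, 0.20],
    "banana": [0.10, 0.00, 0.90],
}

print(cosine(vectors["paris"], vectors["madrid"]))
print(cosine(vectors["paris"], vectors["banana"]))
```

With these toy values, “paris” comes out much closer to “madrid” than to “banana”.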

A segment must be annotated completely.

No annotation should be missing; otherwise the unannotated text will be treated as a counter example for the missing annotation!

It is a named entity or a category suggested by the machine that the user can validate, correct or reject in order to enrich and improve the training dataset and thus improve the quality of the final model.

Suggestion UI to validate or reject suggestions

The user can also check whether one or more named entities or categories should be added to the respective segment or document. The aim is to help the user save time when creating a training dataset or a machine learning model.

What is a counter example?

The machine may suggest a wrong entity or category. If the user rejects this suggestion, the machine will consider the validated segment as a counter example for that entity or category.

The machine learns from both examples and counter examples.

It is a workflow with:

A format conversion component
One or several models and annotators (based on lexicons, knowledge bases…)
A consolidation processor
A formatter to generate a customized output (content enrichment, XLS export, database feed…)
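
The four stages listed above can be pictured as a chain of small functions. This is a conceptual sketch only: all function names, the lexicon and the label are hypothetical, and the code does not use or describe the Kairntech API.

```python
def convert(raw_bytes):
    """Format conversion: here simply decode bytes to text."""
    return raw_bytes.decode("utf-8")


def lexicon_annotator(text, lexicon, label):
    """Annotator based on a lexicon: mark every occurrence of each term."""
    found = []
    lowered = text.lower()
    for term in lexicon:
        start = lowered.find(term.lower())
        while start != -1:
            found.append({"start": start, "end": start + len(term), "label": label})
            start = lowered.find(term.lower(), start + 1)
    return found


def consolidate(annotations):
    """Consolidation: drop exact duplicates and sort by position."""
    unique = {(a["start"], a["end"], a["label"]) for a in annotations}
    return [{"start": s, "end": e, "label": l} for s, e, l in sorted(unique)]


def format_output(text, annotations):
    """Formatter: produce rows that could feed an export."""
    return [(text[a["start"]:a["end"]], a["label"]) for a in annotations]


def run_pipeline(raw_bytes):
    text = convert(raw_bytes)
    # Two annotators whose results partly overlap.
    annotations = lexicon_annotator(text, ["loans", "rates"], "FINANCE")
    annotations += lexicon_annotator(text, ["rates"], "FINANCE")
    return format_output(text, consolidate(annotations))


print(run_pipeline(b"Loans and rates rose."))
```

The consolidation step matters as soon as several annotators run on the same text: here both annotators find “rates”, and the duplicate is removed before formatting.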

Example of how to build an NLP pipeline in Kairntech platform

It is the knowledge base built from Wikipedia.

It consists of dividing a document into “chunks”: sentence, paragraph…
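
A naive sketch of such a segmentation, assuming that paragraphs are separated by blank lines and that sentences end with “.”, “!” or “?”. Both rules are simplifications chosen for this example (the sentence rule would, for instance, wrongly split after “Mr.”); the function names are hypothetical.

```python
import re


def split_paragraphs(text):
    """Split on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(text):
    """Split after sentence-final punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


document = "First sentence. Second sentence.\n\nNew paragraph here."

for paragraph in split_paragraphs(document):
    print(split_sentences(paragraph))
```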

Why is it interesting to segment documents?

For question-answering scenarios (RAG).
To save time by annotating only segments and not all documents.
To build a good model, it is not necessary to annotate all the documents in their entirety (Gold Standard training dataset)

Good to know:

For NER Projects, the training can be done on segments or on documents.
For Categorization Projects, the training is done on documents.

It is a set of documents or segments that have been annotated.

What does “in” or “outside” the dataset mean?

A segment or document that has not been annotated is “outside” the dataset.

Note: A segment whose annotations have been suggested and finally rejected by the user will be “in” the dataset. It will be considered as a counter example.

It is a chain of models in which the output of one model becomes the input of the next. For example, a first model recognizes people’s names, and a second model analyzes that output to extract the first and last name.
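
The example of the two chained models can be sketched as two functions, where the second one only ever sees what the first one produced. Both functions are simple rule-based stand-ins for trained models, written for this illustration only.

```python
import re


def find_people(text):
    """Stand-in for a first model: two consecutive capitalized words."""
    return re.findall(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", text)


def split_name(full_name):
    """Stand-in for a second model: split a detected name into its parts."""
    first, last = full_name.split(" ", 1)
    return {"first_name": first, "last_name": last}


def run_chain(text):
    """The output of the first step is the input of the second."""
    return [split_name(person) for person in find_people(text)]


print(run_chain("Yesterday, John Smith met Mary Jones in town."))
```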

Retrieval-Augmented Generation (RAG) is a technique that combines the creative power of Large Language Models (LLMs) with the precision of your own enterprise data. Instead of relying solely on the LLM’s internal knowledge, RAG first retrieves relevant snippets from your documents and then provides them to the model as context to generate accurate and grounded answers. So you can benefit from the power of LLMs on your own content, which the LLM has of course never seen during training (and, for confidentiality reasons, should not see).

Using RAG you can interact with your data in natural language and receive direct, natural-language answers to your questions.

Kairntech RAG projects can often be set up in a matter of minutes by importing your data, accepting the defaults and starting to ask questions.

The Kairntech Chat interface allows users to interact with their documents in a conversational way. By leveraging RAG, the chatbot can answer complex questions, summarize long reports, and provide citations for every claim it makes, ensuring that users can always verify the source of the information. By keeping the context of earlier questions and answers, the Kairntech Chat allows users to refine their questions and ask follow-up questions, all on their own content.
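
The “retrieve first, then generate” principle can be sketched without any LLM call. In this illustration retrieval is reduced to counting shared words, and the result is the prompt that would be handed to a model; the snippets, the function names and the prompt wording are all assumptions made for this example.

```python
import re


def tokenize(text):
    return set(re.findall(r"\w+", text.lower()))


def retrieve(question, snippets, k=2):
    """Return the k snippets sharing the most words with the question."""
    q = tokenize(question)
    ranked = sorted(snippets, key=lambda s: len(q & tokenize(s)), reverse=True)
    return ranked[:k]


def build_prompt(question, context):
    """Assemble the text that would be sent to the LLM."""
    numbered = "\n".join(f"[{i}] {s}" for i, s in enumerate(context, 1))
    return (
        "Answer using only the context below and cite snippet numbers.\n\n"
        f"Context:\n{numbered}\n\nQuestion: {question}"
    )


# Invented example content standing in for your own documents.
snippets = [
    "The warranty period for the X200 pump is 24 months.",
    "Our office is closed on public holidays.",
    "The X200 pump must be serviced every 6 months.",
]
question = "What is the warranty period of the X200 pump?"

print(build_prompt(question, retrieve(question, snippets)))
```

Numbering the snippets in the prompt is what makes it possible for the generated answer to cite its sources.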

Yes, Kairntech supports Single Sign-On (SSO) using standard protocols. This allows for seamless and secure user authentication by integrating with your existing identity providers such as Active Directory. With Single Sign-On, users who are already authenticated in the corporate intranet do not need to log on again when using Kairntech. Instead, authentication relies on the person’s existing credentials, e.g. from Active Directory. Group membership and access to specific Kairntech projects can also be derived from the person’s permissions in Active Directory.

Kairntech is designed for cloud-native environments and can be deployed on Kubernetes (K8s) for maximum scalability and reliability. This allows organizations to orchestrate their NLP workloads, manage containerized deployments, and ensure high availability across distributed environments.

Kairntech provides a dedicated SharePoint connector that allows you to easily index and search through your organization’s document libraries. Once connected, your SharePoint files (including PDFs, Word documents, and emails) can be used as a knowledge source for the RAG Chatbot, making internal knowledge instantly accessible. The Kairntech SharePoint connector can be set up so that it continuously updates the content in the Kairntech project whenever the content in the defined SharePoint folders changes.

Hybrid Search is an advanced search strategy that combines traditional keyword-based search (BM25) with modern vector-based semantic search. This ensures that you find exactly what you’re looking for, whether you’re using specific terminology or searching for broader concepts and meanings.
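
One common way to combine a keyword ranking with a semantic ranking is reciprocal rank fusion (RRF). The sketch below shows that technique as an illustration of the general idea; the source does not say how Kairntech merges the two result lists, and the document identifiers and the constant k=60 are assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists: each document gains 1 / (k + rank) per list."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical result lists for the same query.
keyword_ranking = ["doc_a", "doc_b", "doc_c"]   # e.g. from BM25
semantic_ranking = ["doc_b", "doc_d", "doc_a"]  # e.g. from vector search

print(reciprocal_rank_fusion([keyword_ranking, semantic_ranking]))
```

Documents found by both searches (here doc_a and doc_b) end up ahead of documents found by only one of them.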

At Kairntech, data privacy and security are paramount. We offer flexible deployment options, including on-premise or private cloud installations, ensuring that your sensitive data never leaves your infrastructure. Furthermore, our platform is designed to work with local or private LLM instances to maintain complete control over your information. Whether you plan to work with LLMs from OpenAI, Mistral or Anthropic, either on their respective platforms or on private cloud installations of these LLMs such as in Azure, or whether you intend to use open source LLMs like Llama or DeepSeek, you can benefit from these and many more using Kairntech.