

Text classification and categorization

August 31, 2022


Written by

vincent.nibart

Classify unstructured document collections

Assigning documents to categories is a key step in bringing structure to large document collections. Users can narrow a search result down to a particularly relevant topic, or forward subsets of documents to colleagues with the corresponding interests. Kairntech has offered text classification since the beginning.

However, document categorization requires a training corpus of already categorized documents, from which a model is trained in a process called supervised categorization.
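To make the idea of supervised categorization concrete, here is a minimal sketch in plain Python: a tiny Naive Bayes classifier trained on four labelled headlines. The function names, the sample texts and the two categories are invented for illustration. This is not how the Kairntech software implements categorization, only a small instance of learning from already categorized documents.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Deliberately naive: lowercase and split on whitespace.
    return text.lower().split()


def train(labelled_docs):
    """Count words per category from (text, category) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, category in labelled_docs:
        doc_counts[category] += 1
        word_counts[category].update(tokenize(text))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocabulary


def classify(text, model):
    """Return the category with the highest Naive Bayes log-score."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category in doc_counts:
        # Log prior: share of training documents in this category.
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for word in tokenize(text):
            if word in vocabulary:  # words never seen in training are ignored
                # Add-one (Laplace) smoothing avoids log(0).
                count = word_counts[category][word] + 1
                score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical training corpus: texts that someone has already categorized.
training_corpus = [
    ("central bank raises interest rates", "economy"),
    ("stock markets fall on inflation fears", "economy"),
    ("striker scores twice in cup final", "sport"),
    ("team wins league title after late goal", "sport"),
]
model = train(training_corpus)
predicted = classify("bank cuts rates as inflation slows", model)
print(predicted)
```

The point to note is the dependency: without the labelled pairs in `training_corpus` there is nothing to train on, which is exactly the gap that clustering is meant to fill.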

But what if there is no such training corpus, no such model and the user still needs to perform a segmentation of a larger set of documents into topic categories? Let’s see how Kairntech now supports this use case with a process called clustering.

In what follows we will use the example of a random set of 1000 press articles. We do not have a suitable categorization model and we don’t know what is inside our press corpus. So we start by asking the Kairntech software to suggest topics that are implicitly part of the corpus.

Applying this approach to the documents in our project yields, after a while, a list of topics that the software has identified by analysing the documents’ content.

Identifying topics without prior manual annotation of training data

In our example, the system offers one cluster of documents that appear to be reports about East African Islamist terrorism, another of texts on refugees in the Mediterranean Sea, another on conflicts related to drug trafficking in Latin America, and so on. We did not define these topics manually; they were identified and suggested by the software.
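As a rough illustration of how documents can be grouped without any labels, the following sketch clusters four invented headlines by word overlap. It uses a simple greedy scheme (join the most similar existing cluster if the cosine similarity reaches a threshold, otherwise start a new one). The threshold, the function names and the sample texts are assumptions made for this example; the Kairntech software relies on a far richer document representation.

```python
import math
from collections import Counter


def vectorize(text):
    # Bag-of-words vector: term -> count.
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster(docs, threshold=0.2):
    """Greedy clustering: returns a list of clusters, each a list of doc indices."""
    clusters = []  # each entry: {"centroid": Counter, "members": [indices]}
    for index, text in enumerate(docs):
        vector = vectorize(text)
        # Most similar existing cluster, or None if there is none yet.
        best = max(
            clusters, key=lambda c: cosine(vector, c["centroid"]), default=None
        )
        if best is not None and cosine(vector, best["centroid"]) >= threshold:
            best["members"].append(index)
            best["centroid"].update(vector)  # summed counts act as the centroid
        else:
            clusters.append({"centroid": Counter(vector), "members": [index]})
    return [c["members"] for c in clusters]


# Invented sample headlines; no categories are given to the algorithm.
docs = [
    "coast guard rescues refugees from boat in mediterranean",
    "police seize cocaine shipment from drug cartel",
    "refugees cross mediterranean in overcrowded boat",
    "drug cartel violence escalates after cocaine seizure",
]
groups = cluster(docs)
print(groups)
```

On this toy input the two refugee headlines end up in one group and the two drug trafficking headlines in another, although nobody told the code that these topics exist.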

Using this new structure, the user can now home in on the topic of interest by selecting the corresponding cluster and proceeding to a list of only the documents related to that topic, here the ethnic conflicts in South-East Asia.

No need to reinvent the wheel: use pretrained models

Kairntech uses a BERTopic model to map document content to a numerical representation, from which the clusters are computed. More recently, Large Language Models (LLMs) have emerged as an alternative approach.
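Once clusters exist, each one needs a readable description, typically a short list of characteristic terms. The sketch below shows a simplified class-based TF-IDF weighting, in the spirit of the one described in the BERTopic documentation: a term scores highly for a cluster when it is frequent inside that cluster but rare across all clusters. The function name `top_terms`, the sample clusters and the exact scoring details are assumptions for illustration and do not reproduce the library's implementation.

```python
import math
from collections import Counter


def top_terms(clusters, n=3):
    """clusters: dict mapping cluster id -> list of document strings.

    Returns a dict mapping cluster id -> the n highest-scoring terms.
    """
    # Treat every cluster as one long document and count its terms.
    class_counts = {
        cid: Counter(" ".join(docs).lower().split())
        for cid, docs in clusters.items()
    }
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)
    # Average number of words per cluster.
    avg_words = sum(total_counts.values()) / len(class_counts)

    result = {}
    for cid, counts in class_counts.items():
        # Frequency inside the cluster, damped by frequency across all clusters.
        scores = {
            term: tf * math.log(1 + avg_words / total_counts[term])
            for term, tf in counts.items()
        }
        # Sort by descending score; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        result[cid] = ranked[:n]
    return result


# Invented clusters, as a clustering step might have produced them.
clusters = {
    0: [
        "refugees rescued from boat in mediterranean",
        "refugees cross mediterranean in overcrowded boat",
    ],
    1: [
        "police seize cocaine from drug cartel",
        "drug cartel violence escalates in border town",
    ],
}
terms = top_terms(clusters)
print(terms)
```

Words that occur in both clusters, such as "in" or "from", are pushed down the ranking, which is why the resulting term lists tend to read like topic descriptions.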

Once a first run has been completed, the user can access, inspect and edit the applied settings as usual in the “Experiments” section of the software.

The initial assignment of documents to clusters can also serve as a jump-start for creating your own document category scheme.

Clusters can be renamed by clicking on the pen icon, for instance to replace the longer list of terms describing our South-East Asian cluster above with a clearer label like “South-East Asian ethnic conflicts”.

Documents or groups of documents can also be edited jointly: here the user is about to perform an action on all 22 documents that have been placed in the cluster on East African Islamist violence.

Use automatic results as a springboard to manual refinement

The Kairntech clustering functionality helps to inspect the implicit internal structure of a larger document collection. It also helps to align with established topics and interests.

While manually analysing a larger document set may be prohibitively cumbersome, here users can let the clustering approach suggest a first attempt. They can then refine this first result by renaming clusters and adding metadata (like the desired category) to whole subsets of documents, and ultimately, once satisfied, generate a final document categorization model.
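The refinement loop can be pictured as a simple data transformation: cluster assignments plus human-chosen labels become a labelled training corpus. The sketch below is a hypothetical illustration of that step; the data structures, identifiers and sample texts are invented and do not correspond to the Kairntech API.

```python
def build_training_corpus(docs, assignments, labels):
    """Pair each document with the human-readable label of its cluster.

    docs:        dict mapping document id -> text
    assignments: dict mapping document id -> cluster id (from clustering)
    labels:      dict mapping cluster id -> category name chosen by the user
    Documents whose cluster has not been given a label are skipped.
    """
    corpus = []
    for doc_id, cluster_id in assignments.items():
        label = labels.get(cluster_id)
        if label is not None:
            corpus.append((docs[doc_id], label))
    return corpus


# Invented example data.
docs = {
    "a1": "clashes between ethnic groups reported in border province",
    "a2": "ceasefire talks with ethnic militias resume",
    "a3": "miscellaneous short notice",
}
assignments = {"a1": 7, "a2": 7, "a3": 12}

# The user renamed cluster 7 and left cluster 12 unlabelled.
labels = {7: "South-East Asian ethnic conflicts"}

training_corpus = build_training_corpus(docs, assignments, labels)
print(training_corpus)
```

The resulting list of (text, category) pairs has the shape that a supervised categorization model needs as input, which closes the loop from unlabelled collection to trained classifier.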

The description above leaves out a number of details that the chosen approach offers. Don’t hesitate to contact us at support@kairntech.com if you have questions or comments about the new functionality.
