Bringing structure to unstructured document collections

Assigning documents to categories is a key step in bringing structure to larger document collections: users can narrow their study of a search result to a particularly relevant topic, forward subsets of documents to colleagues with the respective interests, and so on. Kairntech has offered document categorization since the beginning. However, document categorization requires a training corpus of already categorized documents from which a model can be trained, in a process called supervised categorization.

But what if there is no such training corpus, no such model and the user still needs to perform a segmentation of a larger set of documents into topic categories? Let’s see how Kairntech now supports this use case with a process called clustering.

In what follows we will use the example of a random set of 1000 press articles. We do not have a suitable categorization model and we don’t know what is inside our press corpus. So we start by asking the Kairntech software to suggest topics that are implicitly part of the corpus.

Applying this approach to the documents in our project yields, after a while, a list of topics that it has identified by analysing the documents’ content.

Identifying topics without prior manual annotation of training data

In the example above, the system comes back with one cluster of documents that are apparently reports about East African Islamist terrorism, another of texts on refugees in the Mediterranean Sea, another on conflicts related to drug trafficking in Latin America, and so on. We did not define these topics manually; they were identified and suggested by the software.

Using this new structure, the user can now home in on the topic of their interest by selecting the proper cluster and proceed to a list of only the documents related to that specific topic, here the ethnic conflicts in South-East Asia.

No need to reinvent the wheel: Embedding large pretrained transformer models

Kairntech uses the BERTopic model as a default here to map document content to a quantitative representation on which the clusters are computed. Once a first run has been completed, the user can, as usual, access, inspect and edit the applied settings in the “Experiments” section of the software.
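To make the idea of "map content to a quantitative representation, then compute clusters" more tangible, here is a minimal sketch in plain Python. It is not BERTopic and not the Kairntech implementation: it swaps the transformer embeddings for simple word-count vectors and uses a small k-means-style loop with cosine similarity. The sample texts, the stopword list and all function names are assumptions made up for illustration.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a real linguistic resource).
STOPWORDS = {"the", "in", "from", "near", "a", "of", "on", "and"}


def embed(text):
    """Map a text to a sparse word-count vector (stand-in for a transformer embedding)."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster(vectors, k, iterations=10):
    """Small k-means-style clustering that compares vectors by cosine similarity."""
    # Deterministic seeding: start with the first vector, then repeatedly add
    # the vector that is least similar to the centroids chosen so far.
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(min(vectors, key=lambda v: max(cosine(v, c) for c in centroids)))
    labels = []
    for _ in range(iterations):
        # Assign every document to its most similar centroid.
        new_labels = [max(range(k), key=lambda i: cosine(v, centroids[i])) for v in vectors]
        if new_labels == labels:
            break  # assignments are stable, so we are done
        labels = new_labels
        # Recompute each centroid as the sum of its members' vectors.
        centroids = [
            sum((v for v, lab in zip(vectors, labels) if lab == i), Counter())
            for i in range(k)
        ]
    return labels


def top_terms(vectors, labels, cluster_id, n=3):
    """Describe a cluster by its most frequent terms (ties broken alphabetically)."""
    total = sum((v for v, lab in zip(vectors, labels) if lab == cluster_id), Counter())
    return [t for t, _ in sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))[:n]]


# Made-up sample texts, loosely modelled on the topics mentioned above.
docs = [
    "Refugees rescued from boat in the Mediterranean Sea",
    "Coast guard rescued refugees near the Mediterranean coast",
    "Police seized drug shipment from trafficking cartel",
    "Cartel leaders arrested in drug trafficking raid",
]
vectors = [embed(d) for d in docs]
labels = cluster(vectors, k=2)
for cluster_id in sorted(set(labels)):
    print(cluster_id, top_terms(vectors, labels, cluster_id))
```

Note that no topic names are passed in anywhere: the grouping and the descriptive terms both come out of the document content alone, which is the essential property of clustering as opposed to supervised categorization.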

The initial assignment of documents to clusters can also serve as a jump-start for creating your own document category scheme.

  • Clusters can be renamed by clicking on the pen icon, for instance replacing the longer list of terms describing our South-East Asian cluster above with a clearer label like “South-East Asian ethnic conflicts”.
  • Documents or groups of documents can also be edited jointly: here the user is about to perform an action on all 22 documents that have been placed into the cluster on East African Islamist violence.
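Conceptually, the two operations just described amount to changing a cluster's label and applying one change to every document in it. The following sketch shows that idea with a hypothetical in-memory data model; the class names, method names and the "category" metadata key are assumptions for illustration and do not describe the Kairntech API.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    title: str
    metadata: dict = field(default_factory=dict)


@dataclass
class Cluster:
    label: str
    documents: list = field(default_factory=list)

    def rename(self, new_label):
        """Replace the automatically generated label with a clearer one."""
        self.label = new_label

    def set_metadata(self, key, value):
        """Apply the same metadata value to all documents; return how many were updated."""
        for doc in self.documents:
            doc.metadata[key] = value
        return len(self.documents)


# 22 placeholder documents, mirroring the size of the cluster mentioned above.
cluster = Cluster(
    label="east-african islamist violence",
    documents=[Document(title=f"Article {i}") for i in range(1, 23)],
)
cluster.rename("East African Islamist violence")
changed = cluster.set_metadata("category", "Terrorism")
print(cluster.label, changed)
```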

Use automatic results as a springboard for manual refinement

The new Kairntech clustering functionality can be very helpful in use cases where the implicit internal structure of a larger document collection needs to be inspected and where a new structure may need to be applied, corresponding to established topics and interests in your project or your team. While manually analysing a larger document set may be prohibitively cumbersome, here users can let the clustering approach suggest a first attempt, then refine that first result by renaming clusters and adding metadata (like the desired category) to whole subsets of documents, and ultimately, once satisfied, generate a final document categorization model.
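The last step, going from refined clusters to a categorization model, can be illustrated with a small sketch. It assumes that each document now carries the category assigned during refinement and trains a very simple nearest-centroid classifier on word counts. This is a stand-in chosen for brevity, not the model type Kairntech generates; the sample texts, category names and function names are made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def embed(text):
    """Map a text to a sparse word-count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def train(labelled_docs):
    """Build one centroid per category by summing the vectors of its documents."""
    centroids = defaultdict(Counter)
    for text, category in labelled_docs:
        centroids[category].update(embed(text))
    return dict(centroids)


def categorize(model, text):
    """Return the category whose centroid is most similar to the new text."""
    vec = embed(text)
    return max(model, key=lambda category: cosine(vec, model[category]))


# Documents with the categories they received during cluster refinement (made-up data).
labelled_docs = [
    ("Refugees rescued from boat in the Mediterranean Sea", "Mediterranean refugees"),
    ("Coast guard rescued refugees near the Mediterranean coast", "Mediterranean refugees"),
    ("Police seized drug shipment from trafficking cartel", "Drug trafficking"),
    ("Cartel leaders arrested in drug trafficking raid", "Drug trafficking"),
]
model = train(labelled_docs)
prediction = categorize(model, "Navy ship picks up refugees at sea")
print(prediction)
```

The point of the sketch is the hand-over: the clusters supply the labelled training corpus that was missing at the outset, so that new, unseen documents can afterwards be categorized in the usual supervised way.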

The description above leaves out a number of details that the chosen approach offers. Don’t hesitate to contact us if you have questions or comments about the new functionality.