Automatic Summarization

Many users today need to skim large numbers of documents (news items, websites, publications, documentation) and decide quickly whether a given item is worth reading, or even worth clicking on.

Creators and providers of content often try to make that decision easier by choosing a suitable headline and teaser sentence.

But often even that is not enough for people to digest, quickly enough, the content they feel they need to keep an eye on.

This is where Automatic Summarization comes in: Kairntech now offers the possibility to let the software generate such a summary for a single text or a whole collection of texts.

The consequences of “publish or perish”: Lots of papers to read

Take, for instance, the task of keeping up to date with new findings in your specific technical or scientific area of expertise.

Depending on the subject, there may be dozens or hundreds of new publications each week that may be of interest to you.

Fortunately, there is the well-established practice of accompanying a publication with an abstract that allows the reader to assess what the paper is about and what its main claims are.

Only if the abstract appears relevant will the reader proceed to the full article. That is fine as long as the reader only needs to assess a handful of articles.

But this can become cumbersome and time-consuming when longer lists of dozens or hundreds of texts need to be assessed, for instance in a search result or in the proceedings of a large conference, or when the content needs to fit onto devices with smaller displays such as smartphones. Even an abstract may already consist of 3-4 detailed paragraphs.

Summarization comes in two flavors: Extractive and Abstractive

Kairntech therefore offers the user access to two different flavors of summarization:

  • Extractive Summarization
  • Abstractive Summarization

Extractive Summarization is the simpler and quicker method, yet it often already provides the desired result: a summary that allows the reader to assess the relevance of a content item. Here the most informative sentences of a document are identified and concatenated.
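The idea of picking and concatenating the most informative sentences can be illustrated with a minimal sketch. It assumes a simple word-frequency score as the measure of "informative"; this is only one common heuristic and not necessarily the method used in the Kairntech software. The function names, the small stopword list and the num_sentences parameter are hypothetical and chosen for illustration only.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not a complete list).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "on", "in", "to",
             "is", "was", "its", "it", "for", "with", "that"}


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(text):
    """Lowercased words of the text with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def extractive_summary(text, num_sentences=2):
    """Return the num_sentences highest-scoring sentences, in document order."""
    sentences = split_sentences(text)
    freq = Counter(content_words(text))

    def score(sentence):
        # Average document-level frequency of the sentence's content words.
        tokens = content_words(sentence)
        if not tokens:
            return 0.0
        return sum(freq[t] for t in tokens) / len(tokens)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])  # restore original order
    return " ".join(sentences[i] for i in chosen)
```

Because the selected sentences are copied verbatim and scored independently of each other, nothing in such an approach guarantees that a sentence introducing a name or term is kept alongside a later sentence that refers back to it.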

The example below shows an extractive summary of a news item on events in African politics. We can clearly recognize the main topic of the text. However, concatenating sentences from different parts of a document often does not produce a coherent new text.

In the sample extractive summary above (shaded in grey) we see, for instance, that the expression “the bloc” in the second sentence refers to an entity (the “East African Community”) which was introduced in the original text. The sentence that introduces it is missing from the summary, so the reader is left wondering what “the bloc” refers to.

When doing abstractive summarization, the algorithm generates a new text based on the underlying meaning of the original document.

Again, we provide a sample summary for this approach: a complete, new piece of text that captures the main ideas of the original.

Abstractive summarization is a lot more computationally intensive and may require special care in tuning the parameter settings. As in other areas, the Kairntech software gives users access to a variety of powerful approaches from the public domain to choose from.

The behavior of both algorithms is governed by numerous parameters. Kairntech tries to set them to reasonable default values, but it is still advisable for users to visit the options and check whether tuning them for their specific needs yields further improvements.

In the example above the user has chosen to run the abstractive summarization with the pretrained and prepackaged distilbert model and a minimum length of 15% of the original text length.
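To make the minimum length setting concrete, here is a small worked sketch of how a 15% threshold could translate into an absolute length. The text above does not say in which unit the length is measured, so counting whitespace-separated words is an assumption, and the function name and the min_percent parameter are hypothetical rather than actual Kairntech option names.

```python
def min_summary_length(text, min_percent=15):
    """Minimum summary length in words for a given percentage of the original.

    Assumes length is counted in whitespace-separated words and rounds up,
    so that even very short texts get a minimum of at least one word.
    """
    n_words = len(text.split())
    # Integer ceiling of n_words * min_percent / 100.
    return max(1, (n_words * min_percent + 99) // 100)
```

For a 400-word article this yields a minimum of 60 words, while a 10-word text rounds up to 2 words.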

One platform – many use cases

Combined with the other key functionalities of the Kairntech software – entity recognition, document categorization, thesaurus-based indexing and others – summarization is another important ingredient for a broad general-purpose content analysis NLP platform.