

Semantic fingerprints for IPTC classification

June 13, 2024


Written by

geertmeulenbelt

Information providers need to add value to the content they produce. But tagging and classifying content to enrich internal knowledge bases is tedious, time-consuming and error-prone. Yet these knowledge bases are the cornerstone of value-added services. The patent-pending semantic fingerprints for IPTC classification from Kairntech provide a smart solution for multilingual text classification.

Leverage Wikidata to classify multilingual text

A new piece of content, for instance an article submitted by a journalist, raises several challenges when it comes to enriching it with metadata:

The results (recognition of entities, text classification, summarization…) should be highly accurate
Different languages should be managed seamlessly
The processing time should be fast

Labelling texts with named entities requires a certain amount of effort on the part of journalists, so it makes sense to create a training dataset and train models to automate this task. To achieve very high quality, these models are specific to each language, and internal taxonomies can also be combined to make the results richer and more accurate.

For text classification (and in particular the highly complex IPTC taxonomy for news agencies), a really interesting solution is to leverage Wikidata knowledge using semantic fingerprint technology:

Annotate a document using Wikidata
Create a semantic fingerprint using the metadata of all Wikidata terms that have been extracted from the article (their QID and any other information deduced using the Wikidata knowledge base)
Train a model on these semantic fingerprints (and not on the actual content of the article).
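
The three steps above can be illustrated with a minimal sketch. This is not the Kairntech implementation: the BROADER table is a hand-written stand-in for a Wikidata lookup, the QIDs and their labels are illustrative, and the nearest-centroid classifier is just one simple choice of model. The point is only that the model sees QID-based features and never the words of the article.

```python
from collections import Counter, defaultdict
import math

# Assumption: toy stand-in for Wikidata, mapping a QID to broader QIDs.
# A real system would query the Wikidata knowledge base instead.
BROADER = {
    "Q476028": ["Q2736"],  # football club -> association football
    "Q2736": ["Q349"],     # association football -> sport
    "Q7278": ["Q7163"],    # political party -> politics
    "Q40231": ["Q7163"],   # election -> politics
}

def fingerprint(qids):
    """Bag of features: each extracted QID plus every broader QID reachable from it."""
    features = Counter()
    for qid in qids:
        stack, seen = [qid], set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            features[current] += 1
            stack.extend(BROADER.get(current, []))
    return features

def cosine(a, b):
    """Cosine similarity between two sparse feature vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def train(examples):
    """Sum the fingerprints of each class to obtain one centroid per label."""
    centroids = defaultdict(Counter)
    for qids, label in examples:
        centroids[label].update(fingerprint(qids))
    return dict(centroids)

def classify(centroids, qids):
    """Return the label whose centroid is closest to the document fingerprint."""
    fp = fingerprint(qids)
    return max(centroids, key=lambda label: cosine(fp, centroids[label]))

# Hypothetical training set: (QIDs annotated in an article, topic label).
TRAINING = [
    (["Q476028", "Q2736"], "sport"),
    (["Q2736"], "sport"),
    (["Q7278", "Q40231"], "politics"),
    (["Q40231"], "politics"),
]

model = train(TRAINING)
prediction = classify(model, ["Q476028"])
print(prediction)
```

Expanding each QID with its broader concepts is what lets an article mentioning only a football club land close to articles that mention football itself, even though the two share no extracted entity.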

A single classification model for multiple languages

Semantic fingerprints provide the unique capability to design a model that is language agnostic.
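
To see why such a model can be language agnostic, consider the following sketch. The surface-form dictionary is a hypothetical stand-in for a real Wikidata entity linker, and the QIDs are illustrative. Sentences in three languages resolve to the same set of QIDs, so a classifier trained on those QIDs receives identical input whatever the source language.

```python
# Assumption: tiny surface-form dictionary standing in for a real
# multilingual Wikidata entity linker.
SURFACE_TO_QID = {
    "election": "Q40231",         # English
    "élection": "Q40231",         # French
    "wahl": "Q40231",             # German
    "political party": "Q7278",   # English
    "parti politique": "Q7278",   # French
    "partei": "Q7278",            # German
}

def link_entities(text):
    """Return the sorted list of QIDs whose surface form occurs in the text."""
    lowered = text.lower()
    return sorted({qid for form, qid in SURFACE_TO_QID.items() if form in lowered})

english = link_entities("The political party won the election.")
french = link_entities("Le parti politique a remporté l'élection.")
german = link_entities("Die Partei gewann die Wahl.")

# All three sentences yield the same language-independent representation.
print(english, french, german)
```

Only the entity linking step depends on the language; everything downstream of the QIDs can be shared.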

All you need is a processing server with the following rich features to create a satisfactory end-user experience:

A rich REST API to embed AI models contained in complex pipelines within existing business applications
A horizontally and vertically scalable server that can be installed on-premises within a Kubernetes environment
Capacity to maintain the quality of models over time by implementing feedback loops

Deployed with a scalable production server

After an extensive period of development and validation by business users, the semantic fingerprint technology is now used on a daily basis by hundreds of journalists all over the world at a leading provider of information services. The Kairntech engine is running in the background of the internal authoring tool.