Information providers need to add value to the content they produce. But tagging and classifying content to enrich internal knowledge bases is tedious, time-consuming and error-prone, even though these knowledge bases are the cornerstone of value-added services. Kairntech's patent-pending semantic fingerprints for IPTC classification provide a smart solution for multilingual text classification.
Leverage Wikidata to classify multilingual text
A new piece of content, for instance an article submitted by a journalist, raises several challenges when it comes to enriching it with metadata:
- The results (recognition of entities, text classification, summarization…) should be highly accurate
- Different languages should be managed seamlessly
- Processing should be fast
Labelling texts with named entities requires a certain amount of effort on the part of journalists, so it makes sense to create a training dataset and train models to automate this task. To achieve very high quality, these models are specific to each language, and internal taxonomies can also be combined to improve richness and quality.
For text classification (and in particular the highly complex IPTC taxonomy for news agencies), a promising solution is to leverage Wikidata knowledge using semantic fingerprint technology:
- Annotate a document using Wikidata
- Create a semantic fingerprint using the metadata of all Wikidata terms that have been extracted from the article (their QID and any other information deduced using the Wikidata knowledge base)
- Train a model on these semantic fingerprints (and not on the actual content of the article).
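The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not Kairntech's implementation: annotation (step 1) is assumed to have happened upstream, the `RELATED` table is a hypothetical, simplified stand-in for information deduced from Wikidata, the fingerprint is a plain bag of QIDs, the classifier is a nearest-centroid model, and the labels are placeholders rather than real IPTC codes.

```python
import math
from collections import Counter, defaultdict

# Hypothetical, simplified slice of Wikidata: QID -> broader QIDs.
# QIDs are shown for illustration; verify them against Wikidata before reuse.
RELATED = {
    "Q2736": ["Q349"],     # association football -> sport
    "Q847": ["Q349"],      # tennis -> sport
    "Q7397": ["Q21198"],   # software -> computer science
    "Q11660": ["Q21198"],  # artificial intelligence -> computer science
}

def fingerprint(qids):
    """Step 2: bag of extracted QIDs plus the QIDs deduced from them."""
    fp = Counter(qids)
    for qid in qids:
        fp.update(RELATED.get(qid, []))
    return fp

def cosine(a, b):
    """Cosine similarity between two sparse QID count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def train(examples):
    """Step 3: build one centroid (summed fingerprint) per label."""
    centroids = defaultdict(Counter)
    for qids, label in examples:
        centroids[label].update(fingerprint(qids))
    return dict(centroids)

def classify(centroids, qids):
    """Return the label whose centroid is closest to the document fingerprint."""
    fp = fingerprint(qids)
    return max(centroids, key=lambda label: cosine(centroids[label], fp))

# Step 1 output (assumed): the QIDs an annotator found in each article.
training = [
    (["Q2736"], "sport"),
    (["Q7397", "Q11660"], "technology"),
]
model = train(training)

# Tennis never appears in the training data, but its deduced parent does.
predicted = classify(model, ["Q847"])
print(predicted)
```

The model never sees the article text, only QIDs. In this toy setup the enrichment step is what lets an article about tennis land in the same class as one about football.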
A single classification model for multiple languages
Because Wikidata identifiers (QIDs) are the same whatever the language of the source text, semantic fingerprints make it possible to design a single model that is language agnostic.
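A small sketch can make the language-agnostic property concrete. The `LABELS` gazetteer and the `annotate` function below are hypothetical stand-ins for a real multilingual Wikidata annotator; the point is only that texts in different languages reduce to the same set of QIDs, so a model trained on QIDs does not need to know the source language.

```python
# Hypothetical multilingual gazetteer: language -> surface form -> QID.
# QIDs are shown for illustration; verify them against Wikidata before reuse.
LABELS = {
    "en": {"football": "Q2736", "software": "Q7397"},
    "fr": {"football": "Q2736", "logiciel": "Q7397"},
    "de": {"fußball": "Q2736", "software": "Q7397"},
}

def annotate(text, lang):
    """Toy annotator: look up each lowercased word in the gazetteer."""
    vocabulary = LABELS[lang]
    words = (word.strip(".,;:!?").lower() for word in text.split())
    return frozenset(vocabulary[word] for word in words if word in vocabulary)

fp_en = annotate("The football club bought new software.", "en")
fp_fr = annotate("Le club de football a acheté un nouveau logiciel.", "fr")
fp_de = annotate("Der Fußball Verein kaufte neue Software.", "de")

# The three fingerprints are identical, so one model can serve all languages.
print(fp_en == fp_fr == fp_de)
```

Only the annotation step depends on the language. Everything downstream of it works on identifiers.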
All you need is a processing server with the following features to create a satisfactory end-user experience:
- A rich REST API to embed AI models contained in complex pipelines within existing business applications
- A horizontally and vertically scalable server that can be installed on-premises within a Kubernetes environment
- Capacity to maintain the quality of models over time by implementing feedback loops
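As an illustration of the feedback loop mentioned in the last point, here is a minimal sketch. Everything in it is an assumption made for the example: the `FeedbackLoop` class, its fields and the retraining threshold are hypothetical and do not describe the Kairntech server. The idea is simply that corrections made by users are collected and, once enough have accumulated, folded back into the training set.

```python
from dataclasses import dataclass, field

@dataclass
class FeedbackLoop:
    """Hypothetical collector of user corrections that triggers retraining."""
    retrain_threshold: int = 3
    pending: list = field(default_factory=list)
    training_set: list = field(default_factory=list)
    retrain_count: int = 0

    def record(self, qids, predicted, corrected):
        """Store a reviewed document; return True if a retrain was triggered."""
        if predicted != corrected:
            # Only disagreements carry new information for the model.
            self.pending.append((qids, corrected))
        if len(self.pending) >= self.retrain_threshold:
            self.training_set.extend(self.pending)
            self.pending.clear()
            self.retrain_count += 1  # a real system would retrain the model here
            return True
        return False

loop = FeedbackLoop(retrain_threshold=2)
first = loop.record(["Q2736"], predicted="technology", corrected="sport")
second = loop.record(["Q7397"], predicted="technology", corrected="technology")
third = loop.record(["Q11660"], predicted="sport", corrected="technology")
print(first, second, third, len(loop.training_set))
```

Batching corrections rather than retraining on each one is a design choice of this sketch; it keeps the cost of retraining predictable.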
Deployed with a scalable production server
After an extensive period of development and validation by business users, the semantic fingerprint technology is now used daily by hundreds of journalists all over the world at a leading provider of information services. The Kairntech engine runs in the background of the internal authoring tool.
See also: Kairntech | Text classification