EvalLLM challenge on few-shot learning: LLMs versus machine learning

Kairntech accelerates dataset creation with few-shot learning

Kairntech recently took part in the EvalLLM 2024 challenge on few-shot learning. The contest was organised by the French Ministry of Defense via the Direction Générale de l’Armement (DGA).

The aim of this challenge was to automatically identify Named Entities in French news and blog articles. Naturally, the targeted entities are all relevant to homeland security: Person, Function, Organization, Military unit, Group, Location, Site, Resource, Equipment, Event, Time and Id.

Development had to take place in a few-shot learning context, i.e. with few resources for learning. Only an annotation guide and 5 annotated documents were provided.

For the evaluation contest, each team had to submit, within 3 days, the respective results of a maximum of 3 different systems on a corpus of 24 manually annotated documents.

6 teams took part in the challenge, including both private companies and research organizations.

Few-shot learning approaches: Machine Learning versus LLMs

Kairntech implemented three approaches:

  • prompting GPT-4o,
  • prompting Mixtral-8x22B,
  • and a supervised Bi-LSTM-CRF model.

In the two LLM approaches, processing is carried out at the paragraph level, with a prompt containing the entire reworked annotation guide, i.e. around 6,000 words. In particular, we removed the guide’s internal references and inserted a section containing all the examples.
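Such paragraph-level prompting can be sketched as follows. This is an illustrative assembly only: the guide excerpt, example section and output format are placeholders, not the actual challenge material or Kairntech’s exact prompt.

```python
# Hedged sketch of paragraph-level LLM prompting for NER.
# The guide, examples and output format are illustrative placeholders,
# not the actual EvalLLM annotation guide or Kairntech's prompt.

ENTITY_TYPES = [
    "Person", "Function", "Organization", "Military unit", "Group",
    "Location", "Site", "Resource", "Equipment", "Event", "Time", "Id",
]

def build_prompt(guide: str, examples: str, paragraph: str) -> str:
    """Assemble one prompt per paragraph: reworked guide + examples + text."""
    return (
        "You are a named-entity annotator for French news articles.\n"
        f"Entity types: {', '.join(ENTITY_TYPES)}.\n\n"
        f"Annotation guide:\n{guide}\n\n"
        f"Annotated examples:\n{examples}\n\n"
        "List the entities found in the paragraph below as JSON objects "
        "with 'text' and 'type' fields.\n\n"
        f"Paragraph:\n{paragraph}"
    )
```

One prompt is built per paragraph, so the roughly 6,000-word guide is re-sent with every call; the trade-off is prompt size against keeping the full annotation instructions in context.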

The Bi-LSTM-CRF approach had the originality of handling nested annotations using two cascading systems:

  • a first to annotate a complex entity,
  • and a second to detect nested entities.
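A minimal sketch of that two-stage cascade is shown below. Toy keyword matchers stand in for the two trained Bi-LSTM-CRF taggers; only the control flow (annotate a complex span, then re-run a second detector inside it) mirrors the approach, and the example phrase and labels are invented.

```python
# Toy sketch of the two-stage cascade for nested entities.
# Keyword matchers stand in for the two trained Bi-LSTM-CRF models;
# the example phrase and labels are invented for illustration.

def detect_complex(text: str) -> list[dict]:
    """Stage 1 stand-in: find top-level (complex) entity spans."""
    spans = []
    phrase = "3rd Infantry Regiment"  # illustrative only
    start = text.find(phrase)
    if start != -1:
        spans.append({"start": start, "end": start + len(phrase),
                      "label": "Military unit"})
    return spans

def detect_nested(text: str) -> list[dict]:
    """Stage 2 stand-in: find entities nested inside a complex span."""
    spans = []
    word = "Infantry"  # illustrative only
    start = text.find(word)
    if start != -1:
        spans.append({"start": start, "end": start + len(word),
                      "label": "Group"})
    return spans

def cascade(text: str) -> list[dict]:
    """Run stage 1, then stage 2 inside each detected span,
    shifting nested offsets back into the original text."""
    annotations = []
    for outer in detect_complex(text):
        annotations.append(outer)
        inner_text = text[outer["start"]:outer["end"]]
        for inner in detect_nested(inner_text):
            annotations.append({
                "start": outer["start"] + inner["start"],
                "end": outer["start"] + inner["end"],
                "label": inner["label"],
            })
    return annotations
```

The key design point is that the second model only ever sees text inside a span produced by the first, so nesting depth is bounded by the number of cascaded systems.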

The other participants used smaller models: either transformer-type models, mostly supervised through data augmentation with LLMs, or fine-tuned versions of LLMs.

Results: GPT-4o wins but is not necessarily the best solution…

GPT-4o achieves superior performance (see table below, best scores in bold) in most categories. However, the Bi-LSTM-CRF-based system outperforms GPT-4o in ‘micro’ F1 and ‘micro’ recall. The ‘micro’ metrics weight each category by its number of examples, and the classic Bi-LSTM-CRF system learns particularly well on the categories with more examples.
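The macro/micro distinction can be illustrated with a small sketch: macro averaging gives every category equal weight, while micro averaging pools the counts over all categories, so frequent categories dominate. The per-category counts below are invented for illustration, not challenge data.

```python
# Sketch of macro vs micro F1 over per-category counts.
# The (tp, fp, fn) counts are invented, not EvalLLM data.

def f1(tp: int, fp: int, fn: int) -> float:
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# "Person" is far more frequent than "Id" in this toy setting.
counts = {"Person": (90, 10, 10), "Id": (1, 4, 4)}

# Macro: average the per-category F1 scores equally.
macro_f1 = sum(f1(*c) for c in counts.values()) / len(counts)

# Micro: pool the counts first, then compute one F1.
tp = sum(c[0] for c in counts.values())
fp = sum(c[1] for c in counts.values())
fn = sum(c[2] for c in counts.values())
micro_f1 = f1(tp, fp, fn)
```

Here the frequent category is handled well, so the micro F1 exceeds the macro F1, which is exactly why a system that learns best on well-represented categories can win on ‘micro’ metrics while losing on ‘macro’ ones.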

The Mixtral-8x22B approach lags behind on every metric.

| Run | F1 macro | F1 micro | Precision macro | Precision micro | Recall macro | Recall micro |
|---|---|---|---|---|---|---|
| GPT-4o | **57.00** | 54.98 | **64.35** | **62.86** | **51.68** | 48.86 |
| Mixtral-8x22B | 38.06 | 37.97 | 43.16 | 47.38 | 35.89 | 31.68 |
| Bi-LSTM-CRF | 45.14 | **55.75** | 51.26 | 60.84 | 40.98 | **51.45** |
Kairntech results at EvalLLM challenge – May 2024

Kairntech came second in the ranking of the six participants based on the F1 macro of all the approaches.

The French public research institute CEA-LIST took first place with a macro F1 of 59.72, compared with 57.00 for Kairntech’s GPT-4o-based system.

The winning approach is a combination of several GLiNER models trained on an LLM-augmented corpus.

Conclusion

Kairntech’s participation in the EvalLLM 2024 challenge, finishing second behind a major French public research institute, demonstrates our teams’ ability to design, build and run effective systems, with or without LLMs, to solve highly complex tasks even with few examples.

The results show that LLMs are well suited to few-shot learning. However, they are not necessarily the best solution: smaller supervised models trained or fine-tuned for a task can be more effective.

Kairntech presented its participation at the event held in Toulouse on July 8th 2024.

The article about Kairntech’s participation is available here (in French).

We would like to thank the DGA for organising this challenge on the key subject of LLMs.