EvalLLM challenge on few-shot learning: LLMs versus machine learning

Kairntech accelerates dataset creation with few-shot learning

Kairntech recently took part in the EvalLLM 2024 challenge on few-shot learning. The contest was organised by the French Ministry of Defense via the Direction Générale de l’Armement (DGA).

The aim of this challenge was to automatically identify Named Entities in French news and blog articles. Naturally, the targeted entities are all relevant to homeland security: Person, Function, Organization, Military unit, Group, Location, Site, Resource, Equipment, Event, Time and Id.

Development had to take place in a few-shot learning context, i.e. with few resources for learning. Only an annotation guide and 5 annotated documents were provided.

For the evaluation contest, each team had to submit, within 3 days, the respective results of a maximum of 3 different systems on a corpus of 24 manually annotated documents.

6 teams took part in the challenge, including both private companies and research organizations.

Few-shot learning approaches: Machine Learning versus LLMs

Kairntech implemented three approaches:

  • prompting GPT-4o,
  • prompting Mixtral-8x22B,
  • and a supervised Bi-LSTM-CRF model.

In the two LLM approaches, processing is carried out at the paragraph level, with a prompt containing the entire reworked annotation guide, i.e. around 6,000 words. In particular, we removed the guide’s internal references and inserted a section containing all the examples.
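Such paragraph-level prompting can be sketched as follows. This is an illustrative assembly only: the guide excerpt, example section and output format are placeholders, not the actual challenge material or Kairntech’s exact prompt.

```python
# Hedged sketch of paragraph-level LLM prompting for NER.
# The guide, examples and output format are illustrative placeholders,
# not the actual EvalLLM annotation guide or Kairntech's prompt.

ENTITY_TYPES = [
    "Person", "Function", "Organization", "Military unit", "Group",
    "Location", "Site", "Resource", "Equipment", "Event", "Time", "Id",
]

def build_prompt(guide: str, examples: str, paragraph: str) -> str:
    """Assemble one prompt per paragraph: reworked guide + examples + text."""
    return (
        "You are a named-entity annotator for French news articles.\n"
        f"Entity types: {', '.join(ENTITY_TYPES)}.\n\n"
        f"Annotation guide:\n{guide}\n\n"
        f"Annotated examples:\n{examples}\n\n"
        "List the entities found in the paragraph below as JSON objects "
        "with 'text' and 'type' fields.\n\n"
        f"Paragraph:\n{paragraph}"
    )
```

One prompt is built per paragraph, so the roughly 6,000-word guide is re-sent with every call; the trade-off is prompt size against keeping the full annotation instructions in context.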

The Bi-LSTM-CRF approach had the originality of handling nested annotations using two cascading systems:

  • a first to annotate a complex entity,
  • and a second to detect nested entities.
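A minimal sketch of that two-stage cascade is shown below. Toy keyword matchers stand in for the two trained Bi-LSTM-CRF taggers; only the control flow (annotate a complex span, then re-run a second detector inside it) mirrors the approach, and the example phrase and labels are invented.

```python
# Toy sketch of the two-stage cascade for nested entities.
# Keyword matchers stand in for the two trained Bi-LSTM-CRF models;
# the example phrase and labels are invented for illustration.

def detect_complex(text: str) -> list[dict]:
    """Stage 1 stand-in: find top-level (complex) entity spans."""
    spans = []
    phrase = "3rd Infantry Regiment"  # illustrative only
    start = text.find(phrase)
    if start != -1:
        spans.append({"start": start, "end": start + len(phrase),
                      "label": "Military unit"})
    return spans

def detect_nested(text: str) -> list[dict]:
    """Stage 2 stand-in: find entities nested inside a complex span."""
    spans = []
    word = "Infantry"  # illustrative only
    start = text.find(word)
    if start != -1:
        spans.append({"start": start, "end": start + len(word),
                      "label": "Group"})
    return spans

def cascade(text: str) -> list[dict]:
    """Run stage 1, then stage 2 inside each detected span,
    shifting nested offsets back into the original text."""
    annotations = []
    for outer in detect_complex(text):
        annotations.append(outer)
        inner_text = text[outer["start"]:outer["end"]]
        for inner in detect_nested(inner_text):
            annotations.append({
                "start": outer["start"] + inner["start"],
                "end": outer["start"] + inner["end"],
                "label": inner["label"],
            })
    return annotations
```

The key design point is that the second model only ever sees text inside a span produced by the first, so nesting depth is bounded by the number of cascaded systems.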

The other participants used smaller models: either transformer-type models, mostly supervised through data augmentation with LLMs, or fine-tuned versions of LLMs.

Results: GPT-4o wins but is not necessarily the best solution…

GPT-4o achieves superior performance (see table below, best scores in bold) in most categories. However, the Bi-LSTM-CRF-based system outperforms GPT-4o in ‘micro’ F1 and ‘micro’ recall. The ‘micro’ metrics weight each category by its number of examples, and the classic Bi-LSTM-CRF system learns particularly well on the categories with more examples.
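The macro/micro distinction can be illustrated with a small sketch: macro averaging gives every category equal weight, while micro averaging pools the counts over all categories, so frequent categories dominate. The per-category counts below are invented for illustration, not challenge data.

```python
# Sketch of macro vs micro F1 over per-category counts.
# The (tp, fp, fn) counts are invented, not EvalLLM data.

def f1(tp: int, fp: int, fn: int) -> float:
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# "Person" is far more frequent than "Id" in this toy setting.
counts = {"Person": (90, 10, 10), "Id": (1, 4, 4)}

# Macro: average the per-category F1 scores equally.
macro_f1 = sum(f1(*c) for c in counts.values()) / len(counts)

# Micro: pool the counts first, then compute one F1.
tp = sum(c[0] for c in counts.values())
fp = sum(c[1] for c in counts.values())
fn = sum(c[2] for c in counts.values())
micro_f1 = f1(tp, fp, fn)
```

Here the frequent category is handled well, so the micro F1 exceeds the macro F1, which is exactly why a system that learns best on well-represented categories can win on ‘micro’ metrics while losing on ‘macro’ ones.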

The Mixtral-8x22B approach lags behind on every metric.

| Run | F1 macro | F1 micro | Precision macro | Precision micro | Recall macro | Recall micro |
|---|---|---|---|---|---|---|
| GPT-4o | **57.00** | 54.98 | **64.35** | **62.86** | **51.68** | 48.86 |
| Mixtral-8x22B | 38.06 | 37.97 | 43.16 | 47.38 | 35.89 | 31.68 |
| Bi-LSTM-CRF | 45.14 | **55.75** | 51.26 | 60.84 | 40.98 | **51.45** |
Kairntech results at EvalLLM challenge – May 2024

Kairntech came second in the ranking of the six participants based on the F1 macro of all the approaches.

The French public research institute CEA-LIST took first place with a macro F1 of 59.72, compared with 57.00 for Kairntech’s GPT-4o-based system.

The winning approach is a combination of several GLiNER models trained on an LLM-augmented corpus.

Conclusion

Kairntech’s participation in the EvalLLM 2024 challenge, finishing second behind a major French public research institute, demonstrates our teams’ ability to design, build and run effective systems, with or without LLMs, to solve highly complex tasks even with few examples.

The results show that LLMs are well suited to few-shot learning. However, they are not necessarily the best solution: smaller supervised models trained or fine-tuned for a task can be more effective.

Kairntech presented its participation at the event held in Toulouse on July 8th 2024.

The article about Kairntech’s participation is available here (in French).

We would like to thank the DGA for organising this challenge on the key subject of LLMs.