How to define a train/test set?

The Kairntech platform allows you to create a train/test set in two different ways:

  • by automatically calculating an “on the fly” distribution
  • by automatically assigning “train” and “test” metadata to each document or segment

“On the fly” automated distribution

You should apply this method:

  • Systematically at the start of a new project
  • As long as you have fewer than 50 annotations per label
  • As long as the quality of the model obtained is less than 65%
  • When you want to test a model on only part of the labels in the dataset

How to proceed?

When you create a model experiment:

  • Click on “Show advanced parameters” in the Engine parameters
  • You may select only the labels you want to train your model on
  • You can change the size of the test set if you wish, but the default value of 0.2 is fine
  • You can activate the “Shuffle” parameter if you think the temporal order of your annotations could bias the representativeness of your training corpus (for example, towards the end of an annotation campaign you may have annotated only one label, or only added counter-examples with the Suggester). If in doubt, we recommend activating the Shuffle parameter; the sketch after this list illustrates its effect together with the test-set size.
  • You must ensure that the “train_on” and “test_on” parameters are inactive/absent (see the section below)
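
To make the test-set size and Shuffle parameters concrete, here is a minimal sketch of the same kind of split done outside the platform with scikit-learn. The variable annotated_docs is a placeholder; this only illustrates the concept and is not the platform’s internal code:

    # Illustration only: the platform computes this split for you.
    from sklearn.model_selection import train_test_split

    annotated_docs = [f"doc_{i}" for i in range(100)]  # placeholder corpus

    train_docs, test_docs = train_test_split(
        annotated_docs,
        test_size=0.2,    # the default test-set size mentioned above
        shuffle=True,     # counterpart of the "Shuffle" parameter
        random_state=42,  # fixed seed so the example is reproducible
    )
    print(len(train_docs), len(test_docs))  # -> 80 20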

Automated assignment of “train” and “test” metadata

To apply this method, you must first split your dataset as explained below in order to generate “train” and “test” metadata for each document (classification dataset) or segment (token classification or NER dataset). The sketch below illustrates what this assignment amounts to.
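
This is only an illustration, assuming documents are plain Python dicts with a metadata field (a hypothetical structure, including the “split” key name): on the platform, the metadata is generated for you when you divide the dataset.

    import random

    docs = [{"id": i, "metadata": {}} for i in range(100)]

    rng = random.Random(42)        # fixed seed so the split is reproducible
    rng.shuffle(docs)
    cutoff = int(len(docs) * 0.8)  # 80% train / 20% test, matching the 0.2 default

    # Tag each document so later experiments reuse exactly the same split
    for position, doc in enumerate(docs):
        doc["metadata"]["split"] = "train" if position < cutoff else "test"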

You should apply this method:

  • When you have already obtained good quality (above 65%) and you want to check whether small modifications (adding a few examples to the dataset directly or via the Suggester) keep improving the model
  • When you have already achieved good quality (above 65%) with one algorithm and you want to test another and compare the two under exactly the same conditions
  • When each label contains a comparable number of annotations (the counts of the least- and most-populated labels differ by at most a factor of two)

When you create a new model experiment:

  • You may select only the labels you want to train your model on (see assumptions above)
  • The “train_on” and “test_on” parameters must be entered in the training options (a hypothetical sketch of these options follows this list)
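
As a purely hypothetical sketch: the parameter names “train_on” and “test_on” come from this article, but the exact shape of the training options depends on your Kairntech version, so take the structure below as an illustration only.

    # Hypothetical shape of the training options; the names "train_on" and
    # "test_on" come from this article, everything else is an assumption.
    training_options = {
        "train_on": "train",  # train on documents/segments tagged "train"
        "test_on": "test",    # evaluate on those tagged "test"
    }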

Final note:

  • When there is a significant imbalance in the number of annotations per label (a ratio greater than 2 between the most and least populated labels), it is recommended to create several projects: as many projects as there are groups of homogeneous labels, i.e. labels whose annotation counts differ by at most a factor of two. The sketch below shows a quick way to check this ratio.
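
A quick way to check this ratio on your own label counts, sketched in Python (the annotations list is a made-up example):

    from collections import Counter

    # Made-up annotation labels for the example
    annotations = ["PERSON"] * 120 + ["ORG"] * 90 + ["DATE"] * 50

    counts = Counter(annotations)
    least, most = min(counts.values()), max(counts.values())
    if most / least > 2:
        print(f"Imbalanced labels (ratio {most / least:.1f}): consider one "
              "project per group of labels with comparable counts.")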