{"id":4598,"date":"2023-01-02T17:04:18","date_gmt":"2023-01-02T16:04:18","guid":{"rendered":"https:\/\/kairntech.com\/doc\/?p=4598"},"modified":"2025-07-31T14:58:31","modified_gmt":"2025-07-31T12:58:31","slug":"how-to-define-a-train-test-set","status":"publish","type":"post","link":"https:\/\/kairntech.com\/doc\/how-to-define-a-train-test-set\/","title":{"rendered":"How to define a train\/test set?"},"content":{"rendered":"\n<p>The Kairntech platform allows you to create a train\/test set in <strong>two different ways<\/strong>:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>by automatically calculating an &#8220;on the fly&#8221; distribution<\/li>\n\n\n\n<li>by automatically assigning &#8220;train&#8221; and &#8220;test&#8221; metadata to each document or segment<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">&#8220;On the fly&#8221; automated distribution <\/h2>\n\n\n\n<p>You should apply this method:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Always at the <strong>start of a new project<\/strong><\/li>\n\n\n\n<li>As long as you have <strong>fewer than 50 annotations per label<\/strong><\/li>\n\n\n\n<li>As long as the quality of the resulting model is <strong>less than 65%<\/strong><\/li>\n\n\n\n<li>When you want to test a model on only <strong>a subset of the labels<\/strong> in the dataset.<\/li>\n<\/ul>\n\n\n\n<p>How to proceed? 
<\/p>\n\n\n\n<p>When you create a model experiment: <\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Click &#8220;Show advanced parameters&#8221; in the Engine parameters<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"468\" src=\"https:\/\/kairntech.com\/doc\/wp-content\/uploads\/sites\/2\/2023\/01\/image-112-1024x468.png\" alt=\"\" class=\"wp-image-5338\" srcset=\"https:\/\/kairntech.com\/doc\/wp-content\/uploads\/sites\/2\/2023\/01\/image-112-1024x468.png 1024w, https:\/\/kairntech.com\/doc\/wp-content\/uploads\/sites\/2\/2023\/01\/image-112-300x137.png 300w, https:\/\/kairntech.com\/doc\/wp-content\/uploads\/sites\/2\/2023\/01\/image-112-768x351.png 768w, https:\/\/kairntech.com\/doc\/wp-content\/uploads\/sites\/2\/2023\/01\/image-112-1536x702.png 1536w, https:\/\/kairntech.com\/doc\/wp-content\/uploads\/sites\/2\/2023\/01\/image-112.png 1920w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<ul class=\"wp-block-list\">\n<li>You may select only the labels you want to train your model on<\/li>\n\n\n\n<li>You can change the size of the test set if you wish, but the default value of <strong>0.2 is fine<\/strong><\/li>\n\n\n\n<li>You can activate the \u2018<strong>Shuffle<\/strong>\u2019 parameter if you think that the order in which your annotations were added could prevent the split from representing your training corpus well (for example, if at the end of an annotation campaign you annotated only one label or only added counter-examples with the Suggester). 
If in doubt, we recommend <strong>activating the Shuffle parameter<\/strong>.<\/li>\n\n\n\n<li>You must ensure that the &#8220;train-on&#8221; and &#8220;test-on&#8221; parameters are <strong>inactive\/absent<\/strong> (see article below).<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"774\" height=\"649\" src=\"https:\/\/kairntech.com\/doc\/wp-content\/uploads\/sites\/2\/2023\/01\/image-113.png\" alt=\"\" class=\"wp-image-5339\" srcset=\"https:\/\/kairntech.com\/doc\/wp-content\/uploads\/sites\/2\/2023\/01\/image-113.png 774w, https:\/\/kairntech.com\/doc\/wp-content\/uploads\/sites\/2\/2023\/01\/image-113-300x252.png 300w, https:\/\/kairntech.com\/doc\/wp-content\/uploads\/sites\/2\/2023\/01\/image-113-768x644.png 768w\" sizes=\"auto, (max-width: 774px) 100vw, 774px\" \/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Automated assignment of &#8220;train&#8221; and &#8220;test&#8221; metadata<\/h2>\n\n\n\n<p>To apply this method, you must first divide your dataset as explained below in order to generate train and test metadata for each document (classification dataset) or segment (token classification or NER dataset).<\/p>\n\n\n\n<figure class=\"wp-block-embed is-type-wp-embed is-provider-kairntech-documentation wp-block-embed-kairntech-documentation\"><div class=\"wp-block-embed__wrapper\">\n<blockquote class=\"wp-embedded-content\" data-secret=\"HWwL58ITwN\"><a href=\"https:\/\/kairntech.com\/doc\/how-to-generate-train-and-test-metadata\/\">How to split a dataset (train, test)?<\/a><\/blockquote><iframe loading=\"lazy\" class=\"wp-embedded-content\" sandbox=\"allow-scripts\" security=\"restricted\" style=\"position: absolute; visibility: hidden;\" title=\"&#8220;How to split a dataset (train, test)?&#8221; &#8212; Kairntech Documentation\" src=\"https:\/\/kairntech.com\/doc\/how-to-generate-train-and-test-metadata\/embed\/#?secret=sAqiikENOz#?secret=HWwL58ITwN\" data-secret=\"HWwL58ITwN\" 
width=\"600\" height=\"338\" frameborder=\"0\" marginwidth=\"0\" marginheight=\"0\" scrolling=\"no\"><\/iframe>\n<\/div><\/figure>\n\n\n\n<p>You should apply this method:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>When you have already reached good quality (above 65%) and want to check whether small modifications (adding a few examples to the dataset directly or using the Suggester) continue to improve the quality of the model<\/li>\n\n\n\n<li>When you have already reached good quality (above 65%) with one algorithm and want to test another and compare them under exactly the same conditions<\/li>\n\n\n\n<li>When each label has a comparable number of annotations (the label with the most annotations has at most twice as many as the label with the fewest)<\/li>\n<\/ul>\n\n\n\n<p>When you create a new model experiment: <\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>You may select only the labels you want to train your model on (see assumptions above) <\/li>\n\n\n\n<li><strong>The &#8220;train_on&#8221; and &#8220;test_on&#8221; parameters must be set in the training options<\/strong><\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"582\" height=\"673\" src=\"https:\/\/kairntech.com\/doc\/wp-content\/uploads\/sites\/2\/2023\/01\/image-114.png\" alt=\"\" class=\"wp-image-5340\" srcset=\"https:\/\/kairntech.com\/doc\/wp-content\/uploads\/sites\/2\/2023\/01\/image-114.png 582w, https:\/\/kairntech.com\/doc\/wp-content\/uploads\/sites\/2\/2023\/01\/image-114-259x300.png 259w\" sizes=\"auto, (max-width: 582px) 100vw, 582px\" \/><\/figure>\n\n\n\n<p>Final note:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>When the number of annotations differs significantly between labels (ratio greater than 2), we recommend creating several projects or, more precisely, as many projects as there are groups of homogeneous labels (where the number of annotations between 
the label containing the fewest and the one containing the most differs by at most a factor of two).<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>The Kairntech platform allows you to create a train\/test set in two different ways: &#8220;On the fly&#8221; automated distribution You should apply this method: How to do? When you create a model experiment: Automated assignment of &#8220;train&#8221; and &#8220;test&#8221; metadata To apply, this method, you must first divide your dataset as explained below in order [&hellip;]<\/p>\n","protected":false},"author":6,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[14],"tags":[],"class_list":["post-4598","post","type-post","status-publish","format-standard","hentry","category-advanced-topics"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v26.8 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>How to define a train\/test set? - Kairntech Documentation<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/kairntech.com\/doc\/how-to-define-a-train-test-set\/\" \/>\n<meta property=\"og:locale\" content=\"en_GB\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"How to define a train\/test set? - Kairntech Documentation\" \/>\n<meta property=\"og:description\" content=\"The Kairntech platform allows you to create a train\/test set in two different ways: &#8220;On the fly&#8221; automated distribution You should apply this method: How to do? 
When you create a model experiment: Automated assignment of &#8220;train&#8221; and &#8220;test&#8221; metadata To apply, this method, you must first divide your dataset as explained below in order [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/kairntech.com\/doc\/how-to-define-a-train-test-set\/\" \/>\n<meta property=\"og:site_name\" content=\"Kairntech Documentation\" \/>\n<meta property=\"article:published_time\" content=\"2023-01-02T16:04:18+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-07-31T12:58:31+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/kairntech.com\/doc\/wp-content\/uploads\/sites\/2\/2023\/01\/image-112.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1920\" \/>\n\t<meta property=\"og:image:height\" content=\"878\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"vincent.nibart\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"vincent.nibart\" \/>\n\t<meta name=\"twitter:label2\" content=\"Estimated reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"3 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/kairntech.com\/doc\/how-to-define-a-train-test-set\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/kairntech.com\/doc\/how-to-define-a-train-test-set\/\"},\"author\":{\"name\":\"vincent.nibart\",\"@id\":\"https:\/\/kairntech.com\/doc\/#\/schema\/person\/e2b5ed8a33aa3f4a90dca6f0a0c5f0de\"},\"headline\":\"How to define a train\/test 
set?\",\"datePublished\":\"2023-01-02T16:04:18+00:00\",\"dateModified\":\"2025-07-31T12:58:31+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/kairntech.com\/doc\/how-to-define-a-train-test-set\/\"},\"wordCount\":464,\"image\":{\"@id\":\"https:\/\/kairntech.com\/doc\/how-to-define-a-train-test-set\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/kairntech.com\/doc\/wp-content\/uploads\/sites\/2\/2023\/01\/image-112-1024x468.png\",\"articleSection\":[\"Advanced Topics\"],\"inLanguage\":\"en-GB\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/kairntech.com\/doc\/how-to-define-a-train-test-set\/\",\"url\":\"https:\/\/kairntech.com\/doc\/how-to-define-a-train-test-set\/\",\"name\":\"How to define a train\/test set? - Kairntech Documentation\",\"isPartOf\":{\"@id\":\"https:\/\/kairntech.com\/doc\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/kairntech.com\/doc\/how-to-define-a-train-test-set\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/kairntech.com\/doc\/how-to-define-a-train-test-set\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/kairntech.com\/doc\/wp-content\/uploads\/sites\/2\/2023\/01\/image-112-1024x468.png\",\"datePublished\":\"2023-01-02T16:04:18+00:00\",\"dateModified\":\"2025-07-31T12:58:31+00:00\",\"author\":{\"@id\":\"https:\/\/kairntech.com\/doc\/#\/schema\/person\/e2b5ed8a33aa3f4a90dca6f0a0c5f0de\"},\"breadcrumb\":{\"@id\":\"https:\/\/kairntech.com\/doc\/how-to-define-a-train-test-set\/#breadcrumb\"},\"inLanguage\":\"en-GB\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/kairntech.com\/doc\/how-to-define-a-train-test-set\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-GB\",\"@id\":\"https:\/\/kairntech.com\/doc\/how-to-define-a-train-test-set\/#primaryimage\",\"url\":\"https:\/\/kairntech.com\/doc\/wp-content\/uploads\/sites\/2\/2023\/01\/image-112.png\",\"contentUrl\":\"https:\/\/kairntech.com\/doc\/wp-content\/uploads\/sites\/2\/2023\/01\/image-112.png\",\"width\":1920,\"height\":878},{\"@type\":\"BreadcrumbList\"
,\"@id\":\"https:\/\/kairntech.com\/doc\/how-to-define-a-train-test-set\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/kairntech.com\/doc\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"How to define a train\/test set?\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/kairntech.com\/doc\/#website\",\"url\":\"https:\/\/kairntech.com\/doc\/\",\"name\":\"Kairntech Documentation\",\"description\":\"All the information you need to use Kairntech Software, methodology,  user and installation guides.\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/kairntech.com\/doc\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-GB\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/kairntech.com\/doc\/#\/schema\/person\/e2b5ed8a33aa3f4a90dca6f0a0c5f0de\",\"name\":\"vincent.nibart\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-GB\",\"@id\":\"https:\/\/kairntech.com\/doc\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/8c6c4f0e2ce82e7f30989e62388adbfe6071cdc185ead6e4bff5281aa3255ae2?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/8c6c4f0e2ce82e7f30989e62388adbfe6071cdc185ead6e4bff5281aa3255ae2?s=96&d=mm&r=g\",\"caption\":\"vincent.nibart\"},\"url\":\"https:\/\/kairntech.com\/doc\/author\/vincent-nibart\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"How to define a train\/test set? - Kairntech Documentation","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/kairntech.com\/doc\/how-to-define-a-train-test-set\/","og_locale":"en_GB","og_type":"article","og_title":"How to define a train\/test set? 
- Kairntech Documentation","og_description":"The Kairntech platform allows you to create a train\/test set in two different ways: &#8220;On the fly&#8221; automated distribution You should apply this method: How to do? When you create a model experiment: Automated assignment of &#8220;train&#8221; and &#8220;test&#8221; metadata To apply, this method, you must first divide your dataset as explained below in order [&hellip;]","og_url":"https:\/\/kairntech.com\/doc\/how-to-define-a-train-test-set\/","og_site_name":"Kairntech Documentation","article_published_time":"2023-01-02T16:04:18+00:00","article_modified_time":"2025-07-31T12:58:31+00:00","og_image":[{"width":1920,"height":878,"url":"https:\/\/kairntech.com\/doc\/wp-content\/uploads\/sites\/2\/2023\/01\/image-112.png","type":"image\/png"}],"author":"vincent.nibart","twitter_card":"summary_large_image","twitter_misc":{"Written by":"vincent.nibart","Estimated reading time":"3 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/kairntech.com\/doc\/how-to-define-a-train-test-set\/#article","isPartOf":{"@id":"https:\/\/kairntech.com\/doc\/how-to-define-a-train-test-set\/"},"author":{"name":"vincent.nibart","@id":"https:\/\/kairntech.com\/doc\/#\/schema\/person\/e2b5ed8a33aa3f4a90dca6f0a0c5f0de"},"headline":"How to define a train\/test set?","datePublished":"2023-01-02T16:04:18+00:00","dateModified":"2025-07-31T12:58:31+00:00","mainEntityOfPage":{"@id":"https:\/\/kairntech.com\/doc\/how-to-define-a-train-test-set\/"},"wordCount":464,"image":{"@id":"https:\/\/kairntech.com\/doc\/how-to-define-a-train-test-set\/#primaryimage"},"thumbnailUrl":"https:\/\/kairntech.com\/doc\/wp-content\/uploads\/sites\/2\/2023\/01\/image-112-1024x468.png","articleSection":["Advanced Topics"],"inLanguage":"en-GB"},{"@type":"WebPage","@id":"https:\/\/kairntech.com\/doc\/how-to-define-a-train-test-set\/","url":"https:\/\/kairntech.com\/doc\/how-to-define-a-train-test-set\/","name":"How to define 
a train\/test set? - Kairntech Documentation","isPartOf":{"@id":"https:\/\/kairntech.com\/doc\/#website"},"primaryImageOfPage":{"@id":"https:\/\/kairntech.com\/doc\/how-to-define-a-train-test-set\/#primaryimage"},"image":{"@id":"https:\/\/kairntech.com\/doc\/how-to-define-a-train-test-set\/#primaryimage"},"thumbnailUrl":"https:\/\/kairntech.com\/doc\/wp-content\/uploads\/sites\/2\/2023\/01\/image-112-1024x468.png","datePublished":"2023-01-02T16:04:18+00:00","dateModified":"2025-07-31T12:58:31+00:00","author":{"@id":"https:\/\/kairntech.com\/doc\/#\/schema\/person\/e2b5ed8a33aa3f4a90dca6f0a0c5f0de"},"breadcrumb":{"@id":"https:\/\/kairntech.com\/doc\/how-to-define-a-train-test-set\/#breadcrumb"},"inLanguage":"en-GB","potentialAction":[{"@type":"ReadAction","target":["https:\/\/kairntech.com\/doc\/how-to-define-a-train-test-set\/"]}]},{"@type":"ImageObject","inLanguage":"en-GB","@id":"https:\/\/kairntech.com\/doc\/how-to-define-a-train-test-set\/#primaryimage","url":"https:\/\/kairntech.com\/doc\/wp-content\/uploads\/sites\/2\/2023\/01\/image-112.png","contentUrl":"https:\/\/kairntech.com\/doc\/wp-content\/uploads\/sites\/2\/2023\/01\/image-112.png","width":1920,"height":878},{"@type":"BreadcrumbList","@id":"https:\/\/kairntech.com\/doc\/how-to-define-a-train-test-set\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/kairntech.com\/doc\/"},{"@type":"ListItem","position":2,"name":"How to define a train\/test set?"}]},{"@type":"WebSite","@id":"https:\/\/kairntech.com\/doc\/#website","url":"https:\/\/kairntech.com\/doc\/","name":"Kairntech Documentation","description":"All the information you need to use Kairntech Software, methodology,  user and installation 
guides.","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/kairntech.com\/doc\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-GB"},{"@type":"Person","@id":"https:\/\/kairntech.com\/doc\/#\/schema\/person\/e2b5ed8a33aa3f4a90dca6f0a0c5f0de","name":"vincent.nibart","image":{"@type":"ImageObject","inLanguage":"en-GB","@id":"https:\/\/kairntech.com\/doc\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/8c6c4f0e2ce82e7f30989e62388adbfe6071cdc185ead6e4bff5281aa3255ae2?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/8c6c4f0e2ce82e7f30989e62388adbfe6071cdc185ead6e4bff5281aa3255ae2?s=96&d=mm&r=g","caption":"vincent.nibart"},"url":"https:\/\/kairntech.com\/doc\/author\/vincent-nibart\/"}]}},"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/kairntech.com\/doc\/wp-json\/wp\/v2\/posts\/4598","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/kairntech.com\/doc\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/kairntech.com\/doc\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/kairntech.com\/doc\/wp-json\/wp\/v2\/users\/6"}],"replies":[{"embeddable":true,"href":"https:\/\/kairntech.com\/doc\/wp-json\/wp\/v2\/comments?post=4598"}],"version-history":[{"count":5,"href":"https:\/\/kairntech.com\/doc\/wp-json\/wp\/v2\/posts\/4598\/revisions"}],"predecessor-version":[{"id":5341,"href":"https:\/\/kairntech.com\/doc\/wp-json\/wp\/v2\/posts\/4598\/revisions\/5341"}],"wp:attachment":[{"href":"https:\/\/kairntech.com\/doc\/wp-json\/wp\/v2\/media?parent=4598"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/kairntech.com\/doc\/wp-json\/wp\/v2\/categories?post=4598"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/kairntech.com\/doc\/wp-json\/wp\/v2\/tags?post=4598
"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}