Methodology Guide

Foreword

This guide covers two project types: text classification and entity detection.

First of all, you need to define the type of project you want to do: text classification, entity detection, or both.

If you need both, treat each project separately; you can then create a pipeline in one project that uses the model from another project.

Although you can create a multilingual project, it is recommended to create one project per language if possible to obtain the best results. In this case, you will need to define the language of your documents.


Text classification

Step 1: Initiate your project

  • Create your project (type=Text classification)
  • Upload your documents
  • Go to the Documents view
  • Inspect your documents to see what they look like, and how they are different

Step 2: Define your labels

  • Go to the Labels view
  • Create your labels
    • One label = one category
    • Ask yourself whether a document should belong to one or more categories
    • It may help to create an extra label to get better quality results, even if that label is not relevant for your use case (for instance, a catch-all category)
  • Write annotation guidelines for each label (recommended)

Step 3: Pre-annotate documents (optional)

  • Why use automatic pre-annotation?
    • Pre-annotation using an off-the-shelf model or NLP pipeline can save you time when creating the dataset (see the sketch after this list)
  • Pre-annotate documents
    • At least some of the labels (categories) of the existing model/NLP pipeline should perfectly match the labels you want to create
    • Pre-annotate a small number of documents (say 50) to start with, because you will have to review all annotations to create a high-quality dataset
  • Select “Labelled” in the filter “Status” to access the dataset
  • Please note:
    • The dataset consists of all labelled documents
    • You can delete useless labels (categories) with their annotations in the Labels view
    • If you create new labels, review all the annotated documents to complete any missing annotations (only for multi-category projects)
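
As an illustration of the pre-annotation idea above, here is a minimal sketch (outside the platform) of how an off-the-shelf zero-shot classifier could propose categories for documents. The model name, the candidate labels, and the 0.5 threshold are assumptions for illustration only, not the platform's actual pre-annotation mechanism.

    from transformers import pipeline  # Hugging Face transformers (assumed available)

    # Off-the-shelf zero-shot classifier; model and labels are illustrative.
    classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

    documents = [
        "The board approved the merger on Tuesday.",
        "The patient was prescribed 20 mg of atorvastatin.",
    ]
    candidate_labels = ["finance", "healthcare", "sports"]  # hypothetical categories

    for doc in documents:
        result = classifier(doc, candidate_labels)
        # Only keep proposals confident enough to be worth a human review.
        if result["scores"][0] >= 0.5:
            print(f"{doc!r} -> {result['labels'][0]} ({result['scores'][0]:.2f})")

Every pre-annotation produced this way still has to be reviewed by hand, as noted above.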

Step 4: Annotate text

  • Annotate text at the document level
    • Single or multiple categories
    • At least 10 or 15 annotations per label (category), following the annotation guidelines
  • Continue annotating even after the first appearance of the blue pop-up announcing that suggestions have been computed

Step 5: Use the suggestion engine

  • Why use the suggestion engine?
    • To speed up the dataset creation
    • To quickly assess the machine’s ability to learn
  • Go to the Suggestions view
    • Accept or correct the suggested categories, then validate the document. It will be added to the dataset with its category (or categories).
  • Manage suggestions
    • Sort them according to their confidence score (see the sketch after this list)
    • Filter the list on the label (category) you want to work on
  • Please note:
    • The suggestion engine is updated after a few validations
    • The suggestion engine is based on a machine learning algorithm with a fast training time (but which will not necessarily provide the best results)
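
To make the "sort by confidence" and "filter by label" ideas concrete, here is a minimal sketch over a hypothetical export of suggestions; the data structure below is an assumption, not the tool's actual format.

    # Hypothetical list of suggestions exported from the Suggestions view.
    suggestions = [
        {"doc_id": 1, "label": "finance", "confidence": 0.92},
        {"doc_id": 2, "label": "sports", "confidence": 0.41},
        {"doc_id": 3, "label": "finance", "confidence": 0.77},
    ]

    # Work on one label at a time, highest confidence first: these are the
    # quickest to confirm and grow the dataset fastest.
    finance = [s for s in suggestions if s["label"] == "finance"]
    for s in sorted(finance, key=lambda s: s["confidence"], reverse=True):
        print(s["doc_id"], s["label"], s["confidence"])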

Step 6: Review the dataset

  • Why review the dataset?
    • Dataset quality is essential to create the best possible model
  • Go to the Labels view
    • Make sure the annotations are distributed as evenly as possible over the labels (see the sketch after this list)
  • Go to the Documents view
  • Select “Labelled” in the filter “Status” to access your dataset
  • The dataset must be as accurate as possible: no false or missing categories and no inconsistencies between categories
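
A quick way to see whether annotations are evenly distributed is simply to count documents per label. The sketch below assumes a hypothetical export format (one record per labelled document).

    from collections import Counter

    # Hypothetical export of the dataset.
    dataset = [
        {"doc_id": 1, "labels": ["finance"]},
        {"doc_id": 2, "labels": ["sports"]},
        {"doc_id": 3, "labels": ["finance", "healthcare"]},
    ]

    counts = Counter(label for doc in dataset for label in doc["labels"])
    for label, n in counts.most_common():
        print(f"{label}: {n} documents")
    # Labels with very few documents are the first candidates for more annotation.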

Step 7: Split the dataset

  • Why split the dataset?
    • To make sure the same training and test sets are used to compare different model experiments (see the sketch after this list)
  • Go to the Model experiments view
  • Split dataset by generating train/test metadata on the dataset
  • Note:
    • If you added new annotations to the dataset, the split will be automatically updated when launching a new experiment
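
The reason for freezing a split is comparability: every experiment must be scored on the same held-out documents. Below is a minimal sketch of the idea using scikit-learn; the platform handles this for you via train/test metadata, so the variable names and the 80/20 ratio are assumptions for illustration.

    from sklearn.model_selection import train_test_split

    doc_ids = list(range(100))
    labels = ["finance" if i % 2 == 0 else "sports" for i in doc_ids]  # toy labels

    train_ids, test_ids = train_test_split(
        doc_ids,
        test_size=0.2,      # assumed 80/20 split
        stratify=labels,    # keep label proportions similar in both sets
        random_state=42,    # fixed seed: the split stays the same across experiments
    )
    print(len(train_ids), "training docs /", len(test_ids), "test docs")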

Step 8: Create first models

  • Go to the Model Experiments view
  • Launch the 4 predefined experiments
  • Check the global quality (f-measure) of each experiment and identify the best model (see the sketch after this list)
  • Note:
    • If quality is below 60%, enrich and improve your dataset by iteration (see the next steps below)
    • Do not create new experiments to test different algorithms while the f-measure is below 60%; it is useless at this stage
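
The global quality figure is an f-measure, i.e. the harmonic mean of precision and recall. A small sketch with scikit-learn, using made-up gold labels and predictions, shows what is being computed:

    from sklearn.metrics import precision_score, recall_score, f1_score

    y_true = ["finance", "sports", "finance", "healthcare", "sports"]   # gold labels
    y_pred = ["finance", "finance", "finance", "healthcare", "sports"]  # model output

    print("precision:", precision_score(y_true, y_pred, average="macro", zero_division=0))
    print("recall:   ", recall_score(y_true, y_pred, average="macro", zero_division=0))
    print("f-measure:", f1_score(y_true, y_pred, average="macro", zero_division=0))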

Step 9: Iterate steps 4-5-6 above to achieve 60% accuracy

  • In the Model experiments view
    • Identify the labels with low quality in the quality report
  • Enrich dataset on these labels either:
    • with new manually annotated segments (see above: 4 – Annotate text)
    • or using the Suggestions view (see above: 5 – Use the suggestion engine)
  • In the Model experiments view
    • Run the experiment again and see if the accuracy of the model has improved for each label
  • Iterate… until achieving at least a 60% accuracy per label

Step 10: Annotate the dataset automatically

  • Why annotate the dataset?
    • To test the dataset & model quality
    • To detect possible discrepancies
    • It is only useful if model accuracy is above 60%
  • Go to the Documents view
  • Run an automatic annotation of the dataset with the model

Step 11: Identify discrepancies

  • Go to the Documents view
  • Open the filter “Agreement: automatic-other”
  • Select “Disagreement” (the sketch after this list shows the same comparison outside the tool)
  • Check the origin of each annotation with the letter or the tooltip on the chips
  • If the model turns out to be right, correct the dataset manually.
  • When you have finished with your corrections, remove the automatic annotations from the model.
  • Re-train the model. You will improve the model’s precision.
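
Outside the tool, the “Disagreement” filter corresponds to documents where the automatic categories differ from the manual ones. The sketch below reproduces that comparison on two hypothetical exports; the dictionaries are assumptions for illustration.

    # Hypothetical exports: manual (dataset) categories vs. automatic (model) ones.
    manual = {1: {"finance"}, 2: {"sports"}, 3: {"finance", "healthcare"}}
    automatic = {1: {"finance"}, 2: {"finance"}, 3: {"finance"}}

    for doc_id, gold in manual.items():
        predicted = automatic.get(doc_id, set())
        if gold != predicted:
            print(f"doc {doc_id}: manual={gold} vs automatic={predicted}")
    # Each disagreement is either a dataset error to correct or a model error to learn from.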

Step 12: Create the final model

  • Why a final model?
    • You may want to compare different algorithms in terms of accuracy
    • Probably neither the suggestion model nor the 4 pre-packaged experiments will create the model that suits you best. In this case, it is necessary to run experiments to find the final model that fits your needs.
  • Go to the Model experiments view
  • Create new experiments to test different algorithms and compare the quality (f-measure) of the generated models
  • Note:
    • Your goal is to achieve an accuracy between 80% and 95% (f-measure)
    • Don’t expect to achieve 100% accuracy… but you might in some simple cases
    • Processing speed might be as important as accuracy, in which case you might not select the best model in terms of quality (see the sketch after this list)
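
As a sketch of what "comparing algorithms" means, and of why speed can matter as much as quality, here is a toy comparison of two scikit-learn classifiers on the same fixed test set. The algorithms, features, and data are illustrative, not the platform's actual experiment types.

    import time
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import LinearSVC
    from sklearn.metrics import f1_score
    from sklearn.pipeline import make_pipeline

    train_texts = ["quarterly earnings rose", "the striker scored twice",
                   "shares fell sharply", "the match ended in a draw"]
    train_labels = ["finance", "sports", "finance", "sports"]
    test_texts = ["profits increased", "the goalkeeper saved a penalty"]
    test_labels = ["finance", "sports"]

    for name, clf in [("logreg", LogisticRegression(max_iter=1000)), ("svm", LinearSVC())]:
        model = make_pipeline(TfidfVectorizer(), clf)
        model.fit(train_texts, train_labels)
        start = time.perf_counter()
        pred = model.predict(test_texts)
        elapsed = time.perf_counter() - start
        print(name, "f-measure:", f1_score(test_labels, pred, average="macro"), f"({elapsed:.4f}s)")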

Entity detection

Step 1: Initiate your project

  • Create your project (type=Entity detection)
  • Upload your documents.
    • If your documents are short (a few sentences), no need to create segments.
    • Otherwise, use the default segmentation engine to start with.
  • Inspect your documents & segments
    • Go to the Documents view and read several documents to see what they look like, how they differ
    • Go to the Segments view and check whether the segmentation is good and appropriate. A different, better segmentation may be necessary; you will be able to use another off-the-shelf segmenter or to build a custom segmentation pipeline (upcoming release).
(Screenshots: Documents view and Segments view)

Step 2: Define your labels

  • What is a label?
    • A label describes a concept (or an entity type)
    • Creating a label means that you will be able to annotate text with that label (create positive and negative examples of the concept).
  • Go to the Labels view
  • Create labels
    • It may help to create new labels to get better quality results even if these labels are not relevant for the use case
  • Write annotation guidelines for each label (recommended)

Step 3: Pre-annotate documents (optional)

  • Why use automatic pre-annotation?
    • Pre-annotation using an off-the-shelf model or NLP pipeline can save time in creating a dataset (see the sketch after this list)
  • Pre-annotate your documents
    • At least some of the labels of the model/NLP pipeline used to pre-annotate should perfectly match the labels you want to create in the project
    • Pre-annotate a small number of documents (say 20-50) to start with, because all annotations need to be reviewed to create and optimize the dataset
  • Please note:
    • The dataset consists of all annotated (labelled) segments
    • You can delete useless labels and related annotations in the Labels view
    • If you create new labels, you will have to annotate these labels in all previously annotated segments
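
As an illustration of pre-annotation with an off-the-shelf NER model, here is a minimal sketch using spaCy. The model name, the entity types kept, and the mapping to project labels are assumptions; only labels that genuinely match your own should be used.

    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumed off-the-shelf English model

    # Hypothetical mapping from the model's entity types to the project's labels.
    label_mapping = {"ORG": "company", "PERSON": "person"}

    segment = "Tim Cook announced that Apple will open a new office in Austin."
    for ent in nlp(segment).ents:
        if ent.label_ in label_mapping:
            print(ent.text, ent.start_char, ent.end_char, "->", label_mapping[ent.label_])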

Step 4: Annotate text

  • Go to the Segments view
  • Annotate text
    • At least 10 or 15 annotations per label, following the annotation guidelines
    • Continue annotating even after the first appearance of the blue pop-up announcing that suggestions have been computed
  • If context is lacking, increase the context or click on the title to open the full document (and possibly reconsider the segmentation of documents)
  • Notes:
    • Make every effort to avoid false, missing, or inconsistent annotations within annotated segments
    • It is better to have few annotations without errors and inconsistencies than many annotations with possible errors and inconsistencies
    • Segments must be annotated consistently with each other
    • If an entity is not annotated when it should be, it will be considered a counter-example, which will confuse the algorithm and consequently lower its quality
    • The dataset consists of all annotated (labelled) segments

Step 5: Use the suggestion engine

  • Why use the suggestion engine?
    • To speed up dataset creation
    • To quickly assess the machine’s ability to learn
  • Go to the Suggestions view
    • Accept/correct/reject the suggested annotations then validate the segment
    • Each validated segment will be added to the dataset with its annotations
  • Manage suggestions
    • Sort suggestions according to their confidence level score
      • Use “high confidence” score to assess the machine’s ability to learn
      • Use the “margin sampling” or “low confidence” score to handle the segments where the machine has the most difficulty (see the sketch after this list)
    • Filter the suggestions on the labels you want to work on
    • If the context of the segment is insufficient to validate a suggestion, increase the context or click on the title to access the document (possibly reconsider the segmentation of documents)
    • If you reject all the suggestions and finally validate the segment, it will be added to the dataset, and thus be considered a counter example. This can be very effective for adding counter examples to a dataset to improve the accuracy of the final model.
  • Note:
    • The suggestion engine is updated after a few validations
    • The suggestion engine is based on a machine learning algorithm with a fast training time (but which will not necessarily provide the best results)
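
To clarify what the "margin sampling" score captures, here is a minimal sketch: the smaller the gap between the model's two best hypotheses, the less sure the machine is, so those segments are the most informative to review first. The probability values below are made up for illustration.

    def margin(probabilities):
        # Difference between the two highest class probabilities.
        best, second = sorted(probabilities, reverse=True)[:2]
        return best - second

    # Hypothetical per-segment probabilities from the suggestion model.
    segments = {
        "seg-1": [0.95, 0.03, 0.02],   # confident: large margin
        "seg-2": [0.40, 0.38, 0.22],   # uncertain: tiny margin, review first
    }
    for seg_id, probs in sorted(segments.items(), key=lambda kv: margin(kv[1])):
        print(seg_id, "margin =", round(margin(probs), 2))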

Step 6: Review the Dataset

  • Why review the dataset?
    • Dataset quality is essential to create the best possible model
  • Go to the Labels view
    • Make sure the annotations are evenly distributed over the labels… as much as possible
  • Go to the Segments view
    • Filter the segments on Status=“Labelled”
    • You see the dataset which consists of all annotated (labelled) segments
  • In the Segments view
    • Select the label you want to check in the filter “Label name”
    • You will be able to detect possible false annotations on this label
  • In the Segments view
    • Apply the “exclusive” mode on the label
    • You will then be able to detect possible missing annotations on this label
  • Note:
    • The dataset must be as accurate as possible: without false or missing annotations, and without inconsistencies (see the consistency-check sketch after this list)
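
One way to hunt for missing annotations is to check whether a surface form that is annotated somewhere appears unannotated elsewhere. The sketch below does this on a hypothetical export format and only flags candidates for human review.

    # Hypothetical export: one record per segment in the dataset.
    dataset = [
        {"text": "Contact Acme Corp for details.", "entities": [("Acme Corp", "company")]},
        {"text": "Acme Corp shipped the parts late.", "entities": []},
    ]

    annotated_forms = {form for seg in dataset for form, _ in seg["entities"]}
    for seg in dataset:
        present = {form for form, _ in seg["entities"]}
        for form in annotated_forms - present:
            if form in seg["text"]:
                print(f"Possible missing annotation of '{form}' in: {seg['text']}")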

Step 7: Split dataset

  • Why splitting a dataset?
    • To make sure to use the same training and test sets to compare different model experiments
  • Go to the Model experiments view
    • Split dataset by generating train/test metadata on the dataset
  • Note
    • If you add new annotations to the dataset, the split will be automatically updated when launching a new experiment

Step 8: Create first models

  • Go to the Model experiments view
    • Launch the two predefined model experiments
    • Check the global quality (f-measure) of each experiment and identify the best model (see the sketch after this list)
  • Notes:
    • If quality is below 60%, enrich and improve the dataset by iteration (see the next steps below)
    • Do not create new experiments to test different algorithms while the f-measure is below 60%; it is useless at this stage
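
For entity detection, quality is usually measured at the entity level rather than the document level. The sketch below uses the seqeval library on made-up BIO tag sequences to illustrate the kind of f-measure involved; the platform's exact metric may differ.

    from seqeval.metrics import classification_report, f1_score

    # Made-up gold and predicted BIO tag sequences for one segment.
    y_true = [["B-company", "I-company", "O", "B-person", "O"]]
    y_pred = [["B-company", "I-company", "O", "O", "O"]]

    print("entity-level f-measure:", f1_score(y_true, y_pred))
    print(classification_report(y_true, y_pred))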

Step 9: Iterate steps 4-5-6 above to achieve 60% accuracy

  • In the Model experiments view
    • Identify the labels with low quality in the quality report
  • Enrich dataset on these labels either:
    • with new manually annotated segments (see above 4 – Annotate text)
    • or using the Suggestions view (see above 5 – Use the suggestion engine)
  • In the Model experiments view
    • Run the experiment again and see if the accuracy of the model has improved for each label
  • Iterate… until achieving a 60% accuracy per label

Step 10: Annotate the dataset automatically

  • Why annotate the dataset?
    • To test the model & dataset quality
    • To detect possible discrepancies
  • Go to the Documents view
  • Run an automatic annotation of the dataset with the model
  • Note
    • This is only useful if model accuracy is above 60%

Step 11: Identify discrepancies

  • Go to the Segments view
  • Select “Disagreement” in the filter “Agreement: automatic-other”
  • Look at each segment to understand the reason why there are discrepancies:
    • If you have made a mistake, delete the annotation
    • If there is a pattern with no or very few examples in the dataset, use the similarity search on the segment and enrich the dataset.
    • If the quality of the text is bad (especially with PDF files converted into text), the solution could be to improve the quality of the converter.
    • There is no real explanation
  • When you have finished with your corrections, remove the automatic annotations from the model.

Step 12: Create the final model

  • Why a final model?
    • You may want to compare different algorithms in terms of accuracy
    • Probably neither the suggestion model nor the 2 pre-packaged experiments will create the model that suits you best. In this case, it is necessary to do experiments to find the final model that fits your needs.
  • Go to the Model experiments view
    • Create new experiments to test different algorithms
    • Compare the quality (f-measure) of the generated models
  • Note:
    • The goal is to achieve an accuracy between 80% and 95% (f-measure)
    • Don’t expect to achieve 100% accuracy… but you might achieve this in some simple cases
    • Processing speed could be as important as accuracy, in which case you might not select the best model in terms of quality