Methodology and Best Practices for Document Analysis Projects
Many information-handling tasks require finding, extracting and analysing specific information in documents: a scientist may need to study the diseases and genes mentioned in scientific publications, a lawyer may need to assess fees and dates in a batch of contracts, and a marketing expert may need to inspect client feedback on products and services.
The Kairntech software supports these and many related tasks – and all this in a no-coding easy-to-use environment.
Often the required information can be processed with off-the-shelf analysis components. Kairntech offers tools for this: annotating documents with person names, places and dates, or with many millions of concepts from almost any topic, in many languages and always up to date.
Then there are cases where you need to apply your own vocabularies: the list of products, competitors, substances or places that are specific to your project. Kairntech allows you to do this as well.
Finally, in many cases neither an off-the-shelf component nor an existing vocabulary is available, and the analysis has to be built specifically for your project. Kairntech gives you access to powerful machine learning methods that automate the task for you without you having to learn how to program.
In this document we will outline all these scenarios and guide you through how to use Kairntech to get the job done. Still have questions? Don’t hesitate to reach out to us at firstname.lastname@example.org. We will support you with more hints or schedule a quick online session to get you on your way.
What are the typical stages in a project?
How to create a project?
A “Project” is the place that contains the documents, annotations, models, etc. for a given task. Among other things, you can define new projects, give your colleagues access to your projects, perform experiments, call the models of a project via the API to annotate new content, and download or archive a project.
Create a new project by clicking on the + button at the bottom right of the Home page:
- Choose a name for your project. Recommendation: Don’t call your project “Test1” or something similarly mysterious, or you will have a hard time remembering what your project was all about two weeks later!
- Choose the project language (= document language)
- Define the NLP task type: Ask yourself the question: Which NLP task can solve my problem?
- Categorization (See Appendix 2)
- Named Entity Recognition (NER)? (See Appendix 3)
- Extraction of terms / business vocabulary? Select Named Entity Recognition
- Annotate with world knowledge (Wikidata or others)? (See Appendix 4) Select Named Entity Recognition
- Summarization? Select Named Entity Recognition
- A succession of the above actions? (See Appendix 5)
- Extraction of relationships? Coming soon…
How to upload documents?
- Retrieve a certain number of documents related to your problem: this is your “Corpus”
- Size matters: 50 documents minimum. There is no clear upper limit, but it may not make much sense to go beyond 10000 documents for a given project. Contact us if you need to work on larger corpora.
- The corpus can be monolingual or possibly multilingual
- Check the format of the documents
- Word, PDF, HTML, Txt documents can be imported directly
- If you have metadata associated with your documents (author, creation date, source, keywords, …), start by converting your documents into the Kairntech JSON format (accessible here for NER Project and here for Categorization Project).
- If you have pre-annotated documents (documents already containing the entities and/or categories that you plan to work on), also use the Kairntech JSON format
- If your documents are in XML format, contact us: we can develop a specific document converter
- If you have image-only (scanned) PDFs, convert them to text with an OCR tool (Tesseract, ABBYY…) before uploading the content to the platform, or contact us.
- If it is a Named Entity Recognition (NER) task: what segmentation do you want to apply to the documents? (See Appendix 1)
Important note: if you want to import 100 or more documents, you have to upload them as zip files, with at most 100 documents per zip file.
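The batching rule above can be automated before upload. The following is a minimal sketch in Python, assuming the documents sit in a local folder and that a flat archive layout (file names only, no sub-folders) is acceptable; the function name, the archive prefix and the flat layout are illustrative choices, not requirements stated by the platform.

```python
import zipfile
from pathlib import Path


def zip_in_batches(files, out_dir, batch_size=100, prefix="corpus"):
    """Pack `files` into zip archives holding at most `batch_size` files each.

    Returns the list of archive paths that were written.
    """
    files = sorted(Path(f) for f in files)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    archives = []
    for start in range(0, len(files), batch_size):
        batch = files[start:start + batch_size]
        archive = out_dir / f"{prefix}_{start // batch_size + 1:03d}.zip"
        with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
            for path in batch:
                # Assumption: store only the file name (flat layout).
                # Files with identical names from different folders would clash.
                zf.write(path, arcname=path.name)
        archives.append(archive)
    return archives
```

For example, calling zip_in_batches(Path("my_docs").glob("*.pdf"), "to_upload") on a folder of 250 PDFs would write three archives containing 100, 100 and 50 files.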
How to inspect documents?
- Take the time to inspect them: Go to the Documents view
- Read several documents to see what they look like, how they are different.
How to define labels (entity types or categories)?
Kairntech allows you to annotate your documents with entities or categories. An entity can be thought of as a piece of text of a specific type: person names or locations in press articles, or substances or organisms in scientific texts. By contrast, document categories are properties of the whole document such as whether a press article belongs to sport, culture or economy or whether a client feedback is rather positive or negative.
You can define your own entity types or categories in the Labels menu, following these principles:
- Categorization Project
- One label = one category
- Ask yourself whether a document should belong to one or more categories.
- NER Project
- One label = one concept (or one entity type)
- Define your labels so that you are able to annotate the text with them (create positive and negative examples for them)
- The square box next to the label name shows the color associated with the label. This color will be reused in other parts of the project. Click on the box to select a new color. It makes sense to select different colors for different labels to be able to distinguish them more quickly.
- The sub-entities (e.g. the First Name and the Last Name for a Person entity) will be addressed in a further step
- If needed, create further labels to complement the first ones and obtain a better-quality result in the end
How to annotate text manually?
Unless you have imported documents with annotations in JSON format, your documents will not contain any annotations at the beginning (annotations being the information you want the model to learn), so they have to be added by hand now. You will spend a fair amount of your project effort here, and the software helps you do this as quickly and effortlessly as possible.
- Categorization Project
- Since a document category is a property of the complete document, these annotations happen at document level, not segment level
- Single or multi-categories
- NER Project
- You can choose to label text in segments or in documents
- Select the tag, highlight the text with the mouse and the annotation is created when releasing the mouse. You can configure the way the text selection is extended.
- A segment must be consistently annotated! Don’t proceed to the next segment or document until all your entities are added. If an entity is not annotated when it should be, it will be considered as a counter example and confuse the algorithm / lower the ultimate quality!
- The Segments view may lack context to be able to annotate: Imagine n different types of amounts of money: A total price, the VAT, the cancellation fees, … If the segment is too narrow to decide which kind of money amount you are looking at, go to the Documents view
- Warning: in the Documents view, it is impossible to make an inter-segment annotation!
How to speed up dataset creation?
- Start annotating manually, with at least 5 to 10 annotations per label. The system observes what is happening and, after a while, trains a first model on your annotations so far in order to compute “suggestions”. You may want to continue annotating even after the blue pop-up first announces that suggestions have been computed.
- After a decent amount of annotations (at least 10 for the simplest scenario), go to the Suggestions view (See Appendix 6)
- Accept/reject/correct the suggested annotations (green check, red cross…) then validate the segment (or document). It will be added to the dataset (see Appendix 7) with its annotation and be used in subsequent training runs.
- Sort suggestions according to their confidence level score.
- The suggestion engine is updated after a few validations.
- If the context of the segment is insufficient to validate a suggestion, increase the context or click on the title to access the document.
- You can filter (on the left) the list of suggestions on the tags you want to work on in particular!
While browsing the suggestions and accepting or rejecting them, you will normally be able to proceed much more quickly (generating more good examples and counter examples) than if you were to continue manually.
How to review a dataset?
After having spent some time annotating data, you may be interested to see how much progress you have made, how many annotations you have added and how they are distributed over the set of labels.
- See Appendix 7 for dataset vs corpus
- Make sure that the annotations are distributed as evenly as possible across the labels. If, after your annotation effort, some labels have received only very few examples, go back to your documents and add samples for these labels in particular.
- Then go to the Documents or Segments view and filter the corpus by checking “in dataset = yes”.
- Filter the dataset on a specific label to do a more targeted review of just this label. This is a good way to see if any annotations are incorrect!
- Filter the dataset on a label in “exclusive” mode. This is a good way to see if annotations are missing!
The dataset must be as accurate as possible: No incorrect annotations, no inconsistencies, no missing annotations! (See Appendix 8)
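If you export your annotations, the distribution check described above can also be done with a few lines of Python. This is a minimal sketch under assumptions: the record layout (one dict per annotation with a "label" key), the label names and the threshold of 10 examples are all hypothetical and only serve to illustrate the idea.

```python
from collections import Counter


def label_counts(annotations, labels):
    """Count annotations per label; labels with no annotation get 0."""
    counts = Counter(a["label"] for a in annotations)
    return {label: counts.get(label, 0) for label in labels}


def underrepresented(counts, min_examples=10):
    """Return labels with fewer than `min_examples` annotations, weakest first."""
    weak = [label for label, n in counts.items() if n < min_examples]
    return sorted(weak, key=lambda label: (counts[label], label))


# Hypothetical export: one dict per annotation.
annotations = (
    [{"label": "Person", "text": "John Smith"}] * 25
    + [{"label": "Location", "text": "Paris"}] * 12
    + [{"label": "Date", "text": "3 May"}] * 4
)
counts = label_counts(annotations, ["Person", "Location", "Date", "Amount"])
print(counts)                    # per-label totals
print(underrepresented(counts))  # labels that need more examples
```

Passing the full list of project labels matters: a label that has never been annotated would otherwise not show up in the counts at all.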
How to experiment with models?
Once you decide you have enough annotations, it’s time to run an experiment: launch a machine learning algorithm to learn the concepts you have added and construct a model that you can use to automate this task in the future. Kairntech allows you to run sophisticated training runs, including with powerful deep learning methods, all without any coding.
- Run experiments to test different algorithms. Kairntech offers you predefined experiments with different algorithms including CRF-Suite, Spacy, Flair, Delft (Bi-LSTM), SkLearn, Bertopic… where the parameters are set to reasonable default values. Start with these or inspect and edit the parameters (may require deeper insights into the various parameters).
- Parameterize learning:
- Ideally generate train/test metadata on the dataset to have the same train/test set and thus be able to compare different experiments to each other. Contact us if needed.
- Otherwise use the parameter “Shuffle: true”. This will perform a new random split at each training run. Changes in training success may then come from properties of these random splits, especially when the number of documents is small.
- Set up the algorithm (see Appendix 9)
- Select “Embeddings” when appropriate to benefit from the potential of pre-computed information on semantic properties of words built into the system. Kairntech gives the user access to powerful precomputed embedding databases like BERT transformers and flair embeddings just with a mouse-click. Note that adding embeddings will in many cases increase the quality but also result in longer computation time for the training process.
- Evaluate the overall quality and the quality on each label… compare the algorithms between them.
- Improve the dataset on the labels with low quality (or find another solution!) by repeating the previous phases. Then iterate, until a sufficient quality is obtained. This point can be very important: In the table above experiments are listed with a global quality (say “75%”) computed over all available labels. If you ask yourself how to improve that, click on the “Quality” box to get the colored detail list in the lower part of the page. Here you may discover that some labels are already pretty good, while others still have a suboptimal performance. Often a lot of progress can be made by focusing your next annotation efforts precisely on these still weak labels.
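Two ideas from the list above, a train/test split that stays the same across experiments and a quality figure per label, can be illustrated outside the platform. The sketch below is written under assumptions: documents are identified by string ids, annotations are (doc_id, start, end, label) tuples, and quality is measured as exact-match precision, recall and F1. The platform may compute its quality figures differently.

```python
import hashlib
from collections import defaultdict


def stable_split(doc_ids, test_fraction=0.2):
    """Assign each document to train or test from a hash of its id.

    The assignment depends only on the id, so it is identical on every run
    and does not change when other documents are added.
    """
    train, test = [], []
    for doc_id in doc_ids:
        digest = hashlib.sha256(doc_id.encode("utf-8")).hexdigest()
        bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
        (test if bucket < test_fraction else train).append(doc_id)
    return train, test


def per_label_scores(gold, predicted):
    """Precision, recall and F1 per label.

    `gold` and `predicted` are sets of (doc_id, start, end, label) tuples;
    a prediction counts as correct only on an exact match.
    """
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for ann in predicted:
        (tp if ann in gold else fp)[ann[3]] += 1
    for ann in gold - predicted:
        fn[ann[3]] += 1
    scores = {}
    for label in set(tp) | set(fp) | set(fn):
        p = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        r = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        scores[label] = {"precision": p, "recall": r, "f1": f1}
    return scores
```

With a stable split, a difference between two experiments reflects the algorithm or the dataset rather than the luck of a random split, and the per-label scores show which labels deserve the next round of annotation.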
How to test a model?
Once the experiment has finished (which may take anywhere from a few moments to many hours or more, depending on corpus size and computing power), you will want to inspect what the resulting model does on a new piece of sample text.
An easy way to test a document already in the platform:
- Go to the Documents view and select a document
- Click on the “Show in test view” menu
- Test the model on the selected document in the Test view. You can apply any existing component that is available or that you have created (model, NLP pipeline, Gazetteer, Summarizer…)
- Click on the “Back” button of your browser to return to the Documents view
How to build a Gazetteer?
A gazetteer is a component that annotates text using lexicon resources.
Go to the Gazetteer menu.
- Import a lexicon in CSV, XLS, SKOS…
- Check the list of imported terms (no editing possible)
- Create and configure a Gazetteer
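To make the idea of lexicon-based annotation concrete, here is a minimal sketch of what a gazetteer does conceptually. It is not the platform's implementation: the function names, the case-insensitive matching and the longest-match preference are assumptions chosen for illustration.

```python
import re


def build_gazetteer(lexicon):
    """Compile a case-insensitive matcher from {term: label} entries.

    Longer terms come first in the alternation so that "New York City"
    wins over "New York" when both are in the lexicon.
    """
    if not lexicon:
        raise ValueError("lexicon must contain at least one term")
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    lookup = {t.lower(): label for t, label in lexicon.items()}
    return pattern, lookup


def annotate(text, gazetteer):
    """Return (start, end, matched_text, label) for every lexicon hit."""
    pattern, lookup = gazetteer
    return [
        (m.start(), m.end(), m.group(0), lookup[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


# Hypothetical lexicon and sample sentence.
lexicon = {"aspirin": "Substance", "New York": "Place", "New York City": "Place"}
hits = annotate("Aspirin sales rose in New York City.", build_gazetteer(lexicon))
```

The word-boundary matching used here only works for terms that start and end with a letter or digit; a term such as "C++" would need different handling.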
How to build NLP pipelines?
Kairntech allows you to build NLP pipelines so as to combine a document converter, different custom-made or off-the-shelf models, processors to manipulate text or annotations and finally a formatter to provide the desired output format.
Go to the Annotations Plan menu.
Build your pipeline with custom models you have created in your projects, off-the-shelf models, technical components….
You can now validate your NLP pipeline in the Test view as we’ve done with a model.
How to automatically analyse new documents?
Kairntech allows you to automatically annotate new documents with models or NLP pipelines. The generated annotations are attributed to the model name, which distinguishes them from annotations added manually by someone with access to the platform.
Go to the Dashboard menu | Corpus. A menu at the top right lets you annotate a corpus of documents.
That was the purpose of the whole exercise right from the start, right? To automate the extraction of information from documents. Now that you have annotated your documents accordingly to create a training corpus and trained a model you have two choices:
- Either you automatically annotate documents in the Kairntech software as outlined above: select a project whose documents you want to annotate, select the model to use and then, after the annotation, export the annotated corpus to your machine for further use.
- Or you use the Kairntech REST API to integrate the annotation process into your application. This may require coding (programming) and a deeper knowledge of your specific environment. The REST API is documented here and we have described a sample client in Python making use of this API here.
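As a rough idea of what such an integration involves, the sketch below builds an HTTP request that sends a piece of text to an annotation service. Everything specific in it is a placeholder: the URL layout, the JSON payload and the bearer-token header are hypothetical and are not taken from the Kairntech API documentation, which remains the reference for the real endpoints and authentication.

```python
import json
import urllib.request


def build_annotation_request(base_url, project, annotator, text, token):
    """Build (but do not send) a POST request asking a server to annotate `text`.

    HYPOTHETICAL: the URL layout, header and payload below are placeholders,
    not the documented API. Replace them with the values from the API docs.
    """
    url = f"{base_url.rstrip('/')}/projects/{project}/annotators/{annotator}/annotate"
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


def send(request, timeout=30):
    """Send the request and decode a JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping request construction separate from sending makes it possible to inspect or log the request before anything goes over the network.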
Appendix 1: Segmentation
- What is document segmentation?
- Dividing a document into “units”: sentence, paragraph…
- This might be customized: contact us
- Why is it interesting to segment documents?
- To save time by annotating only segments and not all documents.
- To build a good model, it is not necessary to annotate all the documents in their entirety (Gold Standard training dataset).
- Good to know
- For NER Projects, the training phase is done on segments not on documents
- For Categorization Projects, the training is done on documents.
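The notion of dividing a document into units can be sketched as follows. This is a deliberately naive illustration, assuming paragraphs are separated by blank lines and sentences end with ".", "!" or "?" followed by a capital letter; it is not the segmenter used by the platform.

```python
import re


def split_paragraphs(text):
    """Split on blank lines; return non-empty, stripped paragraphs."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph):
    """Naive sentence splitter.

    Breaks after ., ! or ? when followed by whitespace and an uppercase
    letter. Abbreviations such as "Mr. Smith" are split incorrectly;
    real segmenters handle such cases.
    """
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", paragraph)
    return [s.strip() for s in parts if s.strip()]


# Hypothetical two-paragraph document.
document = "The fee is 500 EUR. It is due on 1 May.\n\nCancellation costs 50 EUR."
paragraphs = split_paragraphs(document)
sentences = [s for p in paragraphs for s in split_sentences(p)]
```

The choice of unit is a trade-off: sentence segments are quick to annotate but may lack context, while paragraph segments carry more context at the cost of more text to review per segment.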
Appendix 2: Categorization projects
- What it consists of
- Supervised categorization consists of training a model on a dataset to automatically classify documents in predefined categories
- Single category versus multiple categories
- Single category: only one category can be assigned to each document
- Multi-category: one or more categories can be assigned to each document
- “Other” category
- In some cases it may make sense to introduce explicit negative examples by defining an “Other” category
Appendix 3: NER Projects
- NER = Named Entity Recognition
- What is a named entity?
- A named entity is a character string to which we can assign a label that identifies the type of information. For example “John Smith” is a named entity of type “Person”.
- What is a sub-entity?
- An entity can be made up of sub-entities. For example, an entity of type Person generally has two sub-entities: a first name and a last name
- What is an extraction?
- This is the ability to read a text and automatically identify the entities in it
- What is a business vocabulary?
- This is an ordered list of “business” keywords about a given domain: A vocabulary on “finance” may contain for instance “loans”, “rates”, “interests”,…
Appendix 4: Wikidata
- What is a Knowledge Base?
- This is a way of organizing knowledge in a structured way. In general, a knowledge base lists “objects” belonging to “classes”, these objects can have “properties” and “links” between them
- What is Wikidata?
- It is a free, collaboratively edited knowledge base operated by the Wikimedia Foundation. It provides structured data for Wikipedia and the other Wikimedia projects.
- What is extracting in connection with a Knowledge Base?
- Extract a named entity from a text and uniquely identify it (resolve ambiguity) in the Knowledge Base. For example, a city named “Tripoli” exists in both Libya and Lebanon. If the article is about Lebanon, “Tripoli” will be resolved to the city in Lebanon.
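The Tripoli example can be turned into a toy disambiguation routine. This sketch simply picks the candidate whose keywords overlap most with the words around the mention; the candidate ids and keyword sets are made up for illustration, and real entity linking against Wikidata uses far richer signals.

```python
def disambiguate(mention, context, candidates):
    """Pick the candidate whose keywords overlap most with the context words.

    Returns the chosen candidate id, or None when no candidate shares a
    word with the context.
    """
    words = {w.strip(".,;:!?\"'()").lower() for w in context.split()}
    best_id, best_score = None, 0
    for cand in candidates.get(mention, []):
        score = len(words & cand["keywords"])
        if score > best_score:
            best_id, best_score = cand["id"], score
    return best_id


# Made-up mini knowledge base; ids and keywords are illustrative only.
CANDIDATES = {
    "Tripoli": [
        {"id": "tripoli-libya", "keywords": {"libya", "libyan", "benghazi"}},
        {"id": "tripoli-lebanon", "keywords": {"lebanon", "lebanese", "beirut"}},
    ]
}

choice = disambiguate(
    "Tripoli",
    "Tripoli is the second-largest city in Lebanon, north of Beirut.",
    CANDIDATES,
)
```

Returning None when nothing in the context supports any candidate is a deliberate choice here: an unresolved mention is usually preferable to a wrong link.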
Appendix 5: NLP pipeline
- What is an NLP pipeline? It is a workflow with:
- A format conversion component
- One or several models and annotators (based on lexicons, knowledge bases…)
- A consolidation processor
- A formatter to generate a customized output (content enrichment, XLS export, database feed…)
- What is a cascade of models?
- It is a chain of models in which the output of one becomes the input of the next. For example, a first model recognizes people’s names, and a second model analyzes that output to extract the first and last name.
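The cascade described above can be sketched with two toy stages. Both stages are stand-ins written for illustration (a regular expression instead of a trained model, and a simple string split instead of a second model); only the chaining of one stage's output into the next reflects the idea in the text.

```python
import re

# Toy first stage: a title followed by two capitalized words is a Person.
PERSON = re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\. [A-Z][a-z]+ [A-Z][a-z]+\b")


def find_persons(text):
    """First stage: return (start, end, surface) for each Person mention."""
    return [(m.start(), m.end(), m.group(0)) for m in PERSON.finditer(text)]


def split_name(person):
    """Second stage: split one Person mention into sub-entities."""
    _, _, surface = person
    _title, first, last = surface.split(" ")
    return {"person": surface, "first_name": first, "last_name": last}


def cascade(text):
    """Feed the output of the first stage into the second stage."""
    return [split_name(p) for p in find_persons(text)]


result = cascade("Yesterday Dr. Jane Doe met Mr. John Smith.")
```

Because the second stage only ever sees what the first stage found, errors in the first stage propagate: a person name that is missed upstream can never get its sub-entities extracted downstream.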
Appendix 6: Suggestions
- What is a suggestion?
- It is a named entity or a category suggested by the machine that the user can validate, correct or reject in order to enrich and improve the training dataset and thus improve the quality of the final model. The user can also verify that one or more named entities or categories can be added to the respective segment or document. The aim is to assist the user to save time in creating a training dataset or a machine learning model.
- What is a counter example?
- The machine can suggest an entity or a category incorrectly. If the user rejects this suggestion, the machine will consider the validated segment as a counter example to that entity or category. The machine learns from both examples and counter examples.
Appendix 7: Dataset
- What is a dataset?
- It is a set of documents or segments that have been annotated.
- What is the difference between a corpus and a dataset?
- The corpus consists of raw documents without annotation
- The dataset consists of annotated documents or segments
- What does “in” or “outside” the dataset mean?
- A segment or document that has not been annotated is “outside” the dataset.
- A segment whose annotations have been suggested and finally rejected by the user will be in the dataset. It will be considered as a counter example.
Appendix 8: Inconsistencies
- What are inconsistent annotations?
- For example: if one annotation covers “Mr. John Smith” and another covers only “John Smith”, omitting “Mr.”, the two annotations are inconsistent and the machine will learn less well. It is very important to follow the same guidelines throughout when annotating.
- What are missing annotations?
- A segment must be annotated completely. No annotation should be missing or it will be considered as a counter example to the missing annotation!
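Both problems can be hunted down semi-automatically on an exported dataset. The sketch below assumes a hypothetical layout in which each segment is a dict with its text and a list of (start, end, label) annotations. It flags every occurrence of a string that was annotated somewhere in the dataset but is not annotated with exactly that span in the current segment, which surfaces both missing annotations and boundary inconsistencies such as the “Mr.” case.

```python
import re


def find_suspicious(segments):
    """Flag occurrences of already-annotated strings whose exact span is
    not annotated in that segment.

    `segments` is a list of {"text": str, "annotations": [(start, end, label)]}.
    Returns (segment_index, start, end, label) for each suspicious occurrence.
    """
    # Every surface string that was annotated somewhere, with its label.
    known = {}
    for seg in segments:
        for start, end, label in seg["annotations"]:
            known[seg["text"][start:end]] = label

    suspicious = []
    for i, seg in enumerate(segments):
        covered = {(s, e) for s, e, _ in seg["annotations"]}
        for surface, label in known.items():
            pattern = r"\b" + re.escape(surface) + r"\b"
            for m in re.finditer(pattern, seg["text"]):
                if (m.start(), m.end()) not in covered:
                    suspicious.append((i, m.start(), m.end(), label))
    return suspicious


# Hypothetical dataset: inconsistent boundaries in the first two segments,
# a missing annotation in the third.
segments = [
    {"text": "Mr. John Smith signed the contract.", "annotations": [(0, 14, "Person")]},
    {"text": "John Smith paid the fee.", "annotations": [(0, 10, "Person")]},
    {"text": "Later John Smith left.", "annotations": []},
]
report = find_suspicious(segments)
```

The output is a list of places to review by hand, not a list of confirmed errors: a string can legitimately be an entity in one context and not in another.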
Appendix 9: Language models and word embeddings
- What is a language model or word embeddings?
- It is a mathematical representation of a language built from neural networks on a very large number of examples, i.e. on a very large number of documents (hundreds of thousands or tens of millions of documents).
- Embeddings capture implicit knowledge (for instance that the word “Paris” is somewhat similar to “Madrid” because they both stand for large European capitals) without anyone having to write that explicitly into a lexicon.
- What are the word embeddings used for?
- They are used to have a first level of analysis and understanding of a text by the machine in a general way.
- It is possible to create word embeddings in fields where the sentence structures and vocabulary are very specific (biomedical field for example).
- Using word embeddings generally increases the quality of a model significantly
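The similarity between “Paris” and “Madrid” mentioned above is usually measured as the cosine of the angle between the two word vectors. The sketch below uses made-up three-dimensional vectors purely for illustration; real embeddings have hundreds of dimensions and are learned from large corpora.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Made-up vectors, for illustration only.
EMBEDDINGS = {
    "Paris": [0.9, 0.8, 0.1],
    "Madrid": [0.8, 0.9, 0.2],
    "banana": [0.1, 0.2, 0.9],
}

capitals = cosine(EMBEDDINGS["Paris"], EMBEDDINGS["Madrid"])
unrelated = cosine(EMBEDDINGS["Paris"], EMBEDDINGS["banana"])
print(capitals > unrelated)
```

With these toy values the two capitals score close to 1 while the unrelated word scores much lower, which is the kind of signal a model can exploit without any hand-written lexicon.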