Named Entity Recognition (NER) refers to the extraction of sequences of words from within a document. The technique is most often used to extract names of people, organisations or places, which are the most typical named entities. However, the term "named entity recognition" does not fully capture what can usefully be extracted: the fragment of interest, made up of one or more consecutive words, may just as well be a verb or an adjective, which are not strictly speaking named entities, or a date, a number or an amount.
The use case below illustrates this.
Example: Extraction of financial fees from court decisions
The objective here is to extract, from a court decision, the amount that the party condemned or dismissed by the decision must pay to the opposing party for the so-called irrecoverable costs ("frais irrépétibles"), or more precisely lawyers' fees.
In the project below (in French), it is the "irrecoverable costs" label (the second-to-last label in the list below) that is of interest:
The aim here is to calculate an average amount over a given set of documents (a corpus) for a particular court. What is interesting is that the French term for these costs, "irrépétibles", rarely appears in the text: the costs are most often revealed indirectly by the presence nearby, before or after the amount, of a reference to an article of law, article 700 of the Code de procédure civile (the French Code of Civil Procedure).
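As a rough illustration of this proximity idea, the sketch below (hypothetical code, not the actual pipeline described here) collects candidate amounts appearing within a fixed character window around a mention of article 700:

```python
import re

def amounts_near_article_700(text, window=200):
    """Collect candidate amounts found within `window` characters of a
    mention of "article 700" (a naive proximity heuristic)."""
    amounts = []
    for ref in re.finditer(r"article\s+700", text, re.IGNORECASE):
        start = max(0, ref.start() - window)
        vicinity = text[start:ref.end() + window]
        # Drop the legal reference itself so its "700" is not matched.
        vicinity = vicinity.replace(ref.group(), "")
        # French amounts: spaces or dots as thousand separators, comma decimals.
        for m in re.finditer(r"\b\d{1,3}(?:[ .]\d{3})*(?:,\d{2})?\b", vicinity):
            amounts.append(m.group())
    return amounts

text = "... condamne X à payer la somme de 1 500 euros sur le fondement de l'article 700 ..."
print(amounts_near_article_700(text))  # → ['1 500']
```

Such a heuristic would only serve as a baseline or a sanity check; the point of the annotation project is precisely to let a trained model handle the many phrasings a regular expression cannot anticipate.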
How to proceed?
The annotator must first define the need and the method. In this case, the aim is to extract each individual amount awarded by the judge.
If party A has to pay 1,000 euros to party B and 800 euros to party C, two amounts have to be extracted. Sometimes the text states that A must pay B and C the sum of X euros "each"; in that case the individual sums cannot be separated, but this is rare and has no major impact on a process that analyses many decisions.
Care should be taken to annotate each individual amount only once. Drafting conventions may indeed lead to the same amount being repeated in several sentences. In the example below, the choice is to annotate only from the words "FOR THESE REASONS" ("PAR CES MOTIFS"), which introduce the operative part of the decision.
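This convention can be enforced mechanically before annotation or extraction, as in the minimal sketch below, which assumes the French marker "PAR CES MOTIFS" ("FOR THESE REASONS"):

```python
def operative_part(decision_text, marker="PAR CES MOTIFS"):
    """Keep only the text from the marker onwards, so amounts repeated
    earlier in the drafting are not annotated twice."""
    idx = decision_text.find(marker)
    return decision_text[idx:] if idx != -1 else decision_text

doc = "Attendu que X demande 1000 euros ... PAR CES MOTIFS, condamne Y a payer 1000 euros"
print(operative_part(doc))
```

If the marker is absent (some decisions are drafted differently), the sketch falls back to the whole text rather than discarding the document.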
With or without a currency unit?
In the same example, note that the amount has been highlighted without its currency unit, the euro. This choice stems from the absence of any convention for writing currencies in court decisions. Not only may the euro be designated either by its € symbol or by its ISO code EUR, but the vagaries of encoding can also replace it with special characters. The more variants there are, the harder the model is to train and the more complicated the post-processing.
We have therefore decided not to extract the currency unit. This is possible in this use case because the irrecoverable costs in French decisions have necessarily been denominated in euros since 2001 (and in francs before that). In another use case, where amounts could be denominated in different currencies, the unit could be included in the annotation. Post-processing would then have to split the amount from the currency unit before any arithmetic can be performed.
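A minimal sketch of such a post-processing split, assuming French number formatting (spaces or dots as thousand separators, comma decimals) and a few common currency variants (€, EUR, "euros"):

```python
import re

# Currency variants tolerated by this sketch; real data may need more.
CURRENCY = r"(?:€|EUR|euros?)"

def split_amount(span):
    """Split an annotated amount span into (float value, unit or None)."""
    m = re.match(rf"\s*([\d\s.,\u00a0]+?)\s*({CURRENCY})?\s*$", span, re.IGNORECASE)
    if not m:
        return None
    raw, unit = m.groups()
    # Strip thousand separators (spaces, NBSP, dots), comma becomes decimal point.
    number = float(raw.replace("\u00a0", "").replace(" ", "")
                      .replace(".", "").replace(",", "."))
    return number, unit

print(split_amount("1 500,50 €"))  # → (1500.5, '€')
print(split_amount("800"))         # → (800.0, None)
```

Once the value is numeric, averages and other statistics over the corpus become straightforward.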
Introduction of counterexamples in the dataset
The annotator should try to introduce counter-examples into the dataset. The example below shows two suggestions that should be rejected (the user double-clicks on the small cross at the top left).
The first extract is rejected because the sentence shows that the amount is merely claimed by a party, and will not necessarily be granted by the judge; the second is rejected because the amount is of a different nature.
Choosing the best algorithm
It then remains to determine which algorithm gives the best result for extracting these irrecoverable costs. Here we compared the F-measure of a CRFsuite-based model with models trained on different neural-network frameworks such as DeLFT, Flair and spaCy:
We can see that here it is the model based on the DeLFT framework that provides the best quality.
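For reference, entity-level F-measure with exact span matching can be computed as follows. This is a generic sketch, not the evaluation code of any of the frameworks above, and the label names and offsets are hypothetical:

```python
def f_measure(gold, predicted):
    """Entity-level F1 with exact matching: a prediction counts as a true
    positive only if its (start, end, label) triple matches a gold one."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {(10, 14, "IRRECOVERABLE_COSTS"), (40, 43, "DAMAGES")}
pred = {(10, 14, "IRRECOVERABLE_COSTS"), (60, 64, "DAMAGES")}
print(f_measure(gold, pred))  # tp=1, precision=0.5, recall=0.5 → 0.5
```

Exact matching is strict; some evaluations also report partial-overlap scores, which would rank models slightly differently.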
Other financial amounts to be extracted?
An amount carries no semantics by itself. A decision may mention many other amounts that have nothing to do with irrecoverable costs. For example, rent or damages are amounts that the algorithm must not confuse with irrecoverable costs. Each wrongly generated annotation lowers the precision.
In this variant of the use case, the aim is to extract not only the irrecoverable costs awarded but also the damages awarded:
The text of the decision also includes the amounts claimed by the parties as irrecoverable costs. If these are to be extracted, in particular to compare the amount claimed with the amount obtained, then the claims of the two parties must be distinguished and annotated with a separate label, in a clearly distinguishable colour, to avoid confusion, as in the following decision:
If other amounts are sought, the manual annotation burden increases, but the quality of each label improves, as the learning engine has fewer opportunities to hesitate.
Training on a subset of labels
When setting up an experiment to test the quality of an algorithm, it is possible to restrict its scope to a subset of labels. A single experiment covering all the labels of type "amount" will yield better individual quality for each label than four separate experiments each limited to one of the four labels in our example. In general, it is preferable to select together all labels that may compete for the same text segment:
Associate a financial amount with a named entity
Finally, the user may want to assign the amount to a party, in our case the party that the court ordered to pay the legal fees. This requires the annotation of another label (shown in fuchsia in the case below):
The relationship between the amount and the party can then be created simply by establishing a co-occurrence between the named entities (proximity), or by annotating relationships between named entities. But that will be the subject of a future article.
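A minimal sketch of the co-occurrence (proximity) approach, assuming hypothetical PARTY and AMOUNT labels with character offsets; it simply attaches each amount to the nearest party in the text:

```python
def link_amount_to_party(entities):
    """Naive proximity heuristic: link each AMOUNT entity to the PARTY
    entity closest to it by character offset."""
    parties = [e for e in entities if e["label"] == "PARTY"]
    if not parties:
        return []
    links = []
    for amount in (e for e in entities if e["label"] == "AMOUNT"):
        nearest = min(parties, key=lambda p: abs(p["start"] - amount["start"]))
        links.append((amount["text"], nearest["text"]))
    return links

entities = [
    {"label": "PARTY", "start": 12, "text": "Company A"},
    {"label": "AMOUNT", "start": 30, "text": "1 500"},
    {"label": "PARTY", "start": 80, "text": "Mr B"},
]
print(link_amount_to_party(entities))  # → [('1 500', 'Company A')]
```

Proximity is fragile when several parties appear in the same sentence, which is exactly where explicitly annotated relationships become worthwhile.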