Kairntech RAG for forensics

Data analytics on Enron emails

Kairntech RAG (Retrieval Augmented Generation) for forensics is one of the hottest use cases of Large Language Models. It makes it possible to analyse internal information while benefitting from the power of LLMs. And all that without having to share the data with third parties or to run costly retraining or fine-tuning campaigns.

Kairntech has implemented RAG as part of its solution. Here, we briefly outline a use case around the analysis of large numbers of email documents. We use the Enron Mail Corpus: a large collection of emails that were made public after the 2001 Enron fraud case.

Kairntech RAG for forensics – document upload

Documents can be uploaded into Kairntech in a variety of formats such as PDF, HTML, DOC and many others. In this case we want to make sure that we keep the rich metadata that comes with email documents: who sent what to whom, when, with which subject, and so on. For that we translate the email documents into the Kairntech JSON format:

{
    "metadata": {
      "Source": "maildir/allen-p/all_documents/100.",
      "Date": "Mon, 9 Oct 2000 07:16:00 -0700 (PDT)",
      "From": "phillip.allen@enron.com",
      "To": ["keith.holst@enron.com"],
      "Subject": "Consolidated positions: Issues"
    },
    "text": " … Below is the issues & to do list as we go forward with documenting the requirements for consolidated physical/financial positions and transport trade capture. What we need to focus on is the first bullet in Allan's list …”
}

Besides a “text” element, documents also contain a “metadata” field where we can store the relevant information about this specific email. Having the metadata alongside the text will become important for forensic analytics.
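As an illustration, the sketch below shows how one Enron maildir message could be turned into this JSON structure using only the Python standard library. It is a minimal example, not Kairntech's actual importer.

import json
from email import policy
from email.parser import BytesParser

def email_to_kairntech_json(path: str) -> dict:
    """Parse a raw Enron maildir file and keep its forensic metadata."""
    with open(path, "rb") as f:
        msg = BytesParser(policy=policy.default).parse(f)

    # "To" can list several recipients, so we store it as a list.
    to_field = msg.get("To", "")
    recipients = [addr.strip() for addr in to_field.split(",") if addr.strip()]

    # The plain-text body becomes the "text" element used for semantic search.
    body = msg.get_body(preferencelist=("plain",))
    text = body.get_content() if body else ""

    return {
        "metadata": {
            "Source": path,
            "Date": msg.get("Date", ""),
            "From": msg.get("From", ""),
            "To": recipients,
            "Subject": msg.get("Subject", ""),
        },
        "text": text,
    }

if __name__ == "__main__":
    doc = email_to_kairntech_json("maildir/allen-p/all_documents/100.")
    print(json.dumps(doc, indent=2))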

After document import, it is time to analyse the collection by asking questions in natural language.

Talk to documents in natural language

RAG performs a semantic analysis of the question and compares it to the results of the prior semantic analysis of the imported documents. It then selects the documents that contain information relevant to answering the question. This subset of retrieved matches is then summarized by an LLM to produce the final answer.
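As a minimal sketch of this retrieve-then-summarize loop, the snippet below uses the open-source sentence-transformers library. The model name is only an example and ask_llm() is a placeholder for whichever LLM is configured; this is not Kairntech's internal implementation.

import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def retrieve(question: str, texts: list[str], k: int = 5) -> list[str]:
    """Return the k texts semantically closest to the question."""
    # In practice the document vectors are computed once, at import time.
    doc_vecs = embedder.encode(texts, normalize_embeddings=True)
    q_vec = embedder.encode([question], normalize_embeddings=True)
    scores = (doc_vecs @ q_vec.T).ravel()  # cosine similarity (normalized vectors)
    top_k = np.argsort(scores)[::-1][:k]
    return [texts[i] for i in top_k]

def ask_llm(prompt: str) -> str:
    # Stub: in the real pipeline this calls the chosen LLM (see below).
    raise NotImplementedError

def answer(question: str, texts: list[str]) -> str:
    context = "\n---\n".join(retrieve(question, texts))
    return ask_llm(f"Using only this context:\n{context}\n\nQuestion: {question}")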

A RAG answer often condenses information that a human expert would need hours, if not days, to compile using only traditional text-based search methods.

Embedding models

Users have access to a number of options to fine-tune the retrieval and answer generation process: for instance, different LLMs differ in quality, runtime behavior and price per query. In the setup below, a user can choose which LLM to employ: GPT-3.5, the more recent, more powerful and more costly GPT-4, or the less pricey but competitive Dolphin model based on Mixtral.
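Purely as an illustration of what such a choice could look like as configuration (the identifiers and trade-off labels below are placeholders, not Kairntech settings):

LLM_OPTIONS = {
    "gpt-3.5-turbo":   {"quality": "good",        "cost_per_query": "low"},
    "gpt-4":           {"quality": "highest",     "cost_per_query": "high"},
    "dolphin-mixtral": {"quality": "competitive", "cost_per_query": "lowest"},
}

def configure_llm(choice: str) -> dict:
    """Select the LLM that the answer-generation step will call."""
    return LLM_OPTIONS[choice]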

Text and metadata are a winning combination

Since we have saved the email metadata, we can use it to narrow the document set for our next question to exactly the documents we need. Here: what were the topics of the mails that Ann Schmidt sent to Karen Denne?
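A metadata pre-filter could look like the sketch below (an assumed helper, not the Kairntech API; the addresses follow the Enron corpus convention firstname.lastname@enron.com and are shown for illustration only):

def filter_by_metadata(docs: list[dict], sender: str, recipient: str) -> list[dict]:
    """Keep only mails whose From/To metadata match the given people."""
    return [
        d for d in docs
        if sender in d["metadata"]["From"]
        and any(recipient in to for to in d["metadata"]["To"])
    ]

# subset = filter_by_metadata(imported_docs,
#                             "ann.schmidt@enron.com", "karen.denne@enron.com")
# The question about topics is then answered against this subset only.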

Knowing where to look

Since we have the metadata with each document, we can also combine it with the mail content. This allows for questions that require studying both the content and the metadata in order to be properly answered by the LLM. In the question below, for instance, the LLM “understands” that “mails sent by Michelle Cash” requires looking for documents that contain Michelle Cash in the “From:” field.
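One simple way to achieve this, shown below as an assumption about the mechanism rather than a description of Kairntech internals, is to prepend the metadata to each text before indexing, so that “mails sent by Michelle Cash” can match the “From:” field as well as the body:

def text_with_metadata(doc: dict) -> str:
    """Merge the metadata header and the body into one indexable text."""
    m = doc["metadata"]
    header = (
        f"From: {m['From']}\n"
        f"To: {', '.join(m['To'])}\n"
        f"Date: {m['Date']}\n"
        f"Subject: {m['Subject']}"
    )
    return header + "\n\n" + doc["text"]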

Check here to see how to make RAG take text plus metadata into account.