Use case: Forensic Data Analysis

RAG (Retrieval Augmented Generation) is one of the hottest use cases of Large Language Models (LLMs such as GPT) in a business context: It lets you analyse your own company-internal information while benefitting from the power of LLMs. And all that without having to share the data with third parties or run costly retraining or finetuning campaigns.

Kairntech has implemented RAG as part of its software. Here, we briefly outline a use case around the analysis of large numbers of email documents. We use the Enron Mail Corpus: a large collection of emails that were made public in the wake of the 2001 Enron fraud case.

Document upload

Documents can be uploaded into Kairntech in a variety of formats such as PDF, HTML, DOC and many others. In this case we want to make sure that we keep the rich metadata that comes with email documents: who sent what to whom, when, with which subject etc. For that we translate the email documents into the Kairntech JSON document format:

{
    "metadata": {
      "Source": "maildir/allen-p/all_documents/100.",
      "Date": "Mon, 9 Oct 2000 07:16:00 -0700 (PDT)",
      "From": "phillip.allen@enron.com",
      "To": ["keith.holst@enron.com"],
      "Subject": "Consolidated positions: Issues"
    },
    "text": " … Below is the issues & to do list as we go forward with documenting the requirements for consolidated physical/financial positions and transport trade capture. What we need to focus on is the first bullet in Allan's list …”
}

Besides a “text” element, documents here also contain a “metadata” field where we can store the relevant information around this specific email. Having the metadata alongside the text will become important for some of the questions below.
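As a sketch of what such a conversion could look like: the Enron corpus ships as raw maildir messages, which Python's standard email module can parse. The function below is illustrative only; the field mapping simply mirrors the format shown above and is not Kairntech's actual importer.

import json
from email import policy
from email.parser import BytesParser
from pathlib import Path

def mail_to_kairntech_json(path: Path) -> dict:
    # Parse one raw maildir message and map its headers onto the
    # metadata/text structure shown above.
    with path.open("rb") as fh:
        msg = BytesParser(policy=policy.default).parse(fh)
    body = msg.get_body(preferencelist=("plain",))
    return {
        "metadata": {
            "Source": str(path),
            "Date": msg["Date"],
            "From": msg["From"],
            # "To" can name several recipients; keep them as a list
            "To": [a.strip() for a in (msg["To"] or "").split(",") if a.strip()],
            "Subject": msg["Subject"],
        },
        "text": body.get_content() if body else "",
    }

print(json.dumps(mail_to_kairntech_json(Path("maildir/allen-p/all_documents/100.")), indent=2))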

Having imported these documents into a Kairntech RAG project, we can start to analyse them by asking natural language questions.

RAG – Talk to your documents in natural language

RAG performs a semantic analysis of the question and compares it to the results of the prior semantic analysis of the imported documents. It then selects the documents that contain information relevant for answering the question. The subset of retrieved matches is then summarized by an LLM into a final answer.
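To make the flow concrete, here is a minimal sketch of these three steps. It shows the general retrieve-then-summarize mechanics, not Kairntech's implementation; embed and llm_complete are placeholders for whatever embedding model and LLM endpoint you plug in.

import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    # Placeholder: plug in any sentence-embedding model here.
    raise NotImplementedError

def llm_complete(prompt: str) -> str:
    # Placeholder: plug in any LLM completion endpoint here.
    raise NotImplementedError

def rag_answer(question: str, docs: list[str], k: int = 5) -> str:
    # 1. Semantic analysis of the documents (in practice computed
    #    once, at import time) and of the question
    doc_vecs = embed(docs)
    doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    q = embed([question])[0]
    q /= np.linalg.norm(q)
    # 2. Select the k documents most relevant to the question
    top = np.argsort(doc_vecs @ q)[::-1][:k]
    # 3. Let the LLM condense the retrieved matches into one answer
    context = "\n---\n".join(docs[i] for i in top)
    return llm_complete(
        f"Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )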

A RAG answer often returns condensed information that a human expert would need hours, if not days, to arrive at using only traditional text-based search and retrieval methods.

Access to different LLMs, different embedding models 

Users get access to a number of options to fine-tune the retrieval and answer generation process: for instance, different LLMs differ in quality, runtime behavior and price per query. In the setup below a user can choose which LLM to employ: GPT-3.5, the more recent, more powerful but more costly GPT-4, or the less pricey yet competitive Dolphin model based on Mixtral.
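Conceptually this is just a per-project setting. The sketch below illustrates the trade-off; the model identifiers and cost labels are assumptions for illustration, not Kairntech configuration values.

# Illustrative only: identifiers and labels are assumptions; the
# chosen key would be passed to the LLM completion endpoint
# (llm_complete in the sketch above).
LLM_OPTIONS = {
    "gpt-3.5-turbo":   {"quality": "good",        "relative_cost": "medium"},
    "gpt-4":           {"quality": "best",        "relative_cost": "high"},
    "dolphin-mixtral": {"quality": "competitive", "relative_cost": "low"},
}

selected_llm = "dolphin-mixtral"   # e.g. optimize for price per query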

Text is good – Text plus Metadata is better!

Since we have saved the email metadata, we can use it to narrow the document set for our next question down to just the documents we need. Here: what were the topics in the mails that Ann Schmidt sent to Karen Denne?
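A minimal sketch of such a pre-filter over the JSON documents shown earlier; it assumes sender and recipient can be matched against the From/To strings (Enron addresses follow the firstname.lastname@enron.com pattern), and the remaining subset is then handed to the retrieval step.

def filter_mails(docs: list[dict], sender: str, recipient: str) -> list[dict]:
    # Keep only mails whose From/To metadata match, before any
    # semantic retrieval happens.
    return [
        d for d in docs
        if sender.lower() in d["metadata"]["From"].lower()
        and any(recipient.lower() in to.lower() for to in d["metadata"]["To"])
    ]

# `all_docs` stands for the imported documents in the format above
subset = filter_mails(all_docs, "ann.schmidt", "karen.denne")
# ... then run the RAG question over `subset` only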

RAG: Knowing where to look for which information

Since we have the metadata with each document, we can also combine it with the mail content. This allows for questions that require studying both the content and the metadata in order to be properly answered by the LLM. For instance in the question below, the LLM “understands” that “mails sent by Michelle Cash” requires looking for documents that contain Michelle Cash in the “From:” field.
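One way to realize this (a sketch, not necessarily how Kairntech does it) is to let the LLM first translate the question into a structured metadata filter plus a content query, reusing the llm_complete placeholder from the earlier sketch:

import json

PLAN_PROMPT = """Extract metadata constraints from the user question.
Return JSON with keys "from", "to", "subject" (null when unconstrained)
and "content_query" holding the remaining semantic part.

Question: {question}
JSON:"""

def plan_query(question: str) -> dict:
    # A question like "What issues came up in mails sent by Michelle
    # Cash?" would come back roughly as {"from": "Michelle Cash",
    # "to": null, "subject": null, "content_query": "issues"}: the
    # "from" value filters the "From:" metadata, while content_query
    # drives the semantic retrieval over the mail text.
    return json.loads(llm_complete(PLAN_PROMPT.format(question=question)))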

Check here to see how to make RAG take text plus metadata into account.