
RAG - A Simple Introduction


Retrieval Augmented Generation

A Simple Introduction

Abhinav Kimothi
Table of Contents
01. What is RAG?
02. How does RAG help?
03. What are some popular RAG use cases?
04. RAG Architecture
    i) Indexing Pipeline
        a) Data Loading
        b) Document Splitting
        c) Embedding
        d) Vector Stores
    ii) RAG Pipeline
        a) Retrieval
        b) Augmentation and Generation
05. Evaluation
06. RAG vs Finetuning
07. Evolving RAG LLMOps Stack
08. Multimodal RAG
09. Progression of RAG Systems
    i) Naive RAG
    ii) Advanced RAG
    iii) Multimodal RAG
10. Acknowledgements
11. Resources


What is RAG?
Retrieval Augmented Generation
30th November 2022 will be remembered as a watershed moment in artificial intelligence. OpenAI released ChatGPT and the world was mesmerised. Interest in previously obscure terms like Generative AI and Large Language Models (LLMs) grew relentlessly over the following 12 months.
[Figure: Google Trends - Interest Over Time (Nov'22 to Nov'23) for "Generative AI" and "Large Language Models"]

The Curse Of The LLMs


As usage exploded, so did the expectations. Many users started using ChatGPT as
a source of information, like an alternative to Google. As a result, they also started
encountering prominent weaknesses of the system. Concerns around copyright,
privacy, security, ability to do mathematical calculations etc. aside, people
realised that there are two major limitations of Large Language Models.

A Knowledge Cut-off Date

Training an LLM is an expensive and time-consuming process. LLMs are trained on massive amounts of data. The data that LLMs are trained on is therefore historical (or dated). e.g. The latest GPT-4 model by OpenAI has knowledge only till April 2023; information about any event that happened after that date is not available to the model.

Hallucinations

Often, it was observed that LLMs provided responses that were factually incorrect. Despite being factually incorrect, the LLM responses "sounded" extremely confident and legitimate. This characteristic of "lying with confidence" proved to be one of the biggest criticisms of ChatGPT and LLM techniques in general.

Users look to LLMs for knowledge and wisdom, yet LLMs are sophisticated predictors of what word comes next.


The Hunger For More


While the weaknesses of LLMs were being discussed, a parallel discourse around
providing context to the models started. In essence, it meant creating a ChatGPT
on proprietary data.

The Challenge
Make LLMs respond with up-to-date information
Make LLMs not respond with factually inaccurate information
Make LLMs aware of proprietary information
Provide LLMs with information not in their memory

Providing Context
While model re-training, fine-tuning and reinforcement learning are options that solve the aforementioned challenges, these approaches are time-consuming and costly. In the majority of use cases, these costs are prohibitive.

In May 2020, researchers in their paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" explored models which combine pre-trained parametric and non-parametric memory for language generation.


So, What is RAG?

In 2023, RAG has become one of the most used techniques in the domain of Large Language Models.

[Figure: A naive RAG workflow. The user's {Prompt} goes to a Retriever, which searches and fetches relevant context from proprietary and non-proprietary information sources; the {Prompt + Context} is then passed to the LLM]

R - Retrieve: Look up the external source to retrieve the relevant information
A - Augment: Add the retrieved information to the user prompt
G - Generate: Use the LLM to generate a response to the user prompt with the context

What is RAG?

User enters a prompt/query
Retriever searches and fetches information relevant to the prompt (e.g. from the internet or an internal data warehouse)
Retrieved relevant information is augmented to the prompt as context
LLM is asked to generate a response to the prompt using the context (augmented information)
User receives the response

A Naive RAG workflow


How does RAG help?


Unlimited Knowledge
The Retriever of an RAG system can have access to external sources of information. Therefore,
the LLM is not limited to its internal knowledge. The external sources can be proprietary
documents and data or even the internet.

Without RAG
An LLM has knowledge only of the data it has been trained on. This is also called Parametric Memory (information stored in the model parameters).

With RAG
The Retriever searches and fetches information that the LLM has not necessarily been trained on, from web pages, APIs & dynamic DBs, document repos, databases and other sources. This adds to the LLM memory and is passed as context in the prompts. It is also called Non-Parametric Memory (information available outside the model parameters).
Expandable to all sources
Easier to update/maintain
Much cheaper than retraining/fine-tuning
The effort lies in creation of the knowledge base

Confidence in Responses
With the context (extra information that is retrieved) made available to the LLM,
the confidence in LLM responses is increased.

Context Awareness
Added information assists LLMs in generating responses that are accurate and contextually appropriate.

Source Citation
Access to sources of information improves the transparency of the LLM responses.

Reduced Hallucinations
RAG enabled LLM systems are observed to be less prone to hallucinations than the ones without RAG.


RAG Use Cases


The development of the RAG technique is rooted in use cases that were limited by the inherent weaknesses of LLMs. As of today, some commercial applications of RAG include -
Document Question Answering Systems
By providing an LLM access to proprietary enterprise documents, the responses are limited to what is provided within them. A retriever can search for the most relevant documents and provide the information to the LLM. Check out this blog for an example.

Conversational agents
LLMs can be customised to product/service manuals, domain
knowledge, guidelines, etc. using RAG. The agent can also route users to
more specialised agents depending on their query. SearchUnify has an
LLM+RAG powered conversational agent for their users.

Real-time Event Commentary
Imagine an event like a sports match or a news event. A retriever can connect to real-time updates/data via APIs and pass this information to the LLM to create a virtual commentator. These can further be augmented with Text To Speech models. IBM leveraged the technology for commentary during the 2023 US Open.

Content Generation
The widest use of LLMs has probably been in content generation. Using
RAG, the generation can be personalised to readers, incorporate real-
time trends and be contextually appropriate. Yarnit is an AI based
content marketing platform that uses RAG for multiple tasks.

Personalised Recommendation
Recommendation engines have been a game changer in the digital economy. LLMs are capable of powering the next evolution in content recommendations. Check out Aman's blog on the utility of LLMs in recommendation systems.

Virtual Assistants
Virtual personal assistants like Siri, Alexa and others plan to use LLMs to enhance the experience. Coupled with more context on user behaviour, these assistants can become highly personalised.


RAG Architecture
Let’s revisit the five high level steps of an RAG enabled system

[Figure: RAG system architecture. The user's {Prompt} goes to an orchestrator, which searches relevant information from the knowledge sources, augments the prompt with the relevant context, and sends the {Prompt + Context} to an LLM endpoint; the generated response is returned to the user]

User writes a prompt or a query that is passed to an orchestrator

Orchestrator sends a search query to the retriever

Retriever fetches the relevant information from the knowledge sources and sends it back

Orchestrator augments the prompt with the context and sends to the LLM

LLM responds with the generated text which is displayed to the user via the orchestrator

Two pipelines are important in setting up the RAG system: the first sets up the knowledge sources for efficient search and retrieval, and the second executes the five generation steps at run time.

Indexing Pipeline
Data for the knowledge base is ingested from the source and indexed. This involves steps like splitting, creation of embeddings and storage of data.
RAG Pipeline
This involves the actual RAG process which takes the user query at
run time and retrieves the relevant data from the index, then passes
that to the model


Indexing Pipeline
The indexing pipeline sets up the knowledge source for the RAG system. It is
generally considered an offline process. However, information can also be
fetched in real time. It involves four primary steps.

Loading
This step involves extracting information from different knowledge sources and loading it into documents.

Splitting
This step involves splitting documents into smaller manageable chunks. Smaller chunks are easier to search and to use in LLM context windows.

Embedding
This step involves converting text documents into numerical vectors. ML models are mathematical models and therefore require numerical data.

Storing
This step involves storing the embeddings/vectors. Vectors are typically stored in Vector Databases which are best suited for searching.

Offline indexing pipelines are typically used when a knowledge base with a large amount of data is being built for repeated usage, e.g. a number of enterprise documents, manuals etc.
In cases where only a fixed, small amount of one-time data is required, e.g. a 300 word blog, there is no need for storing the data. The blog text can either be directly passed in the LLM context window or a temporary vector index can be created.

[Figure: With a short, fixed context, no search is needed; the text can be passed directly to the LLM, or a temporary index can be created on the fly for the Retriever to fetch from]

Loading Data
As we've been discussing, the utility of RAG is to access data from all sorts of sources. These sources can be -
Websites & HTML pages
Documents like word, pdf etc.
Code in python, java etc.
Data in json, csv etc.
APIs
File Directories
Databases
And many more
The first step is to extract the information present in these source locations.

This is a good time to introduce two popular frameworks that are being used to
develop LLM powered applications.

LangChain
Use cases: Good for applications that need enhanced AI capabilities, like language understanding tasks and more sophisticated text generation
Features: Stands out for its versatility and adaptability in building robust applications with LLMs
Agents: Makes creating agents using large language models simple through its agents API

LlamaIndex
Use cases: Good for tasks that require text search and retrieval, like information retrieval or content discovery
Features: Excels in data indexing and language model enhancement
Connectors: Provides connectors to access data from databases, external APIs, or other datasets

Both frameworks are rapidly evolving and adding new capabilities every week.
It’s not an either/or situation and you can use both together (or neither).


Example : Loading a YouTube Video Transcript using LangChain Loaders

Let's begin by sourcing the transcript from this video -
Let’s begin by sourcing the transcript from this video -
“DALL·E 2 Explained” by OpenAI
(https://www.youtube.com/watch?v=qTgPSKKjfVg)

Below is the code using YoutubeLoader from langchain.document_loaders

LangChain Document Loader : YoutubeLoader
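The code itself appears as a screenshot in the book; a minimal sketch of what it likely looks like is below, using the video URL from above (add_video_info pulls extra metadata and requires the pytube package).

from langchain.document_loaders import YoutubeLoader

# Build a loader for the video and fetch the transcript as a Document
loader = YoutubeLoader.from_youtube_url(
    "https://www.youtube.com/watch?v=qTgPSKKjfVg",
    add_video_info=True,  # also pull title, author, view count etc. into metadata
)
docs = loader.load()
print(docs)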

Loader object
[Document(page_content="Have you ever seen a polar bear
playing bass? Or a robot painted like a Picasso? Didn’t think so.
DALL-E 2 is ....
....
....
.....umans\nand clever systems can work together to make new
things – amplifying our creative potential.", metadata={'source':
'qTgPSKKjfVg', 'title': 'DALL·E 2 Explained', 'description': 'Unknown',
'view_count': 853564, 'thumbnail_url':
'https://i.ytimg.com/vi/qTgPSKKjfVg/hq720.jpg', 'publish_date':
'2022-04-06 00:00:00', 'length': 167, 'author': 'OpenAI'})]

The Document object contains the page_content, which is the transcript extracted from the YouTube video, as well as the metadata describing the video.


Example : Loading a Webpage Text using LlamaIndex Reader

This is a blog published on Medium -
What is a fine-tuned LLM?
(https://medium.com/mlearning-ai/what-is-a-fine-tuned-llm-67bf0b5df081)

Below is the code using SimpleWebPageReader from llama_hub

LlamaIndex LlamaHub Web Page Reader
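Again, the book shows this code as an image; a rough sketch follows, assuming the loader is pulled from LlamaHub via download_loader and using the blog URL from above.

from llama_index import download_loader

# Download the SimpleWebPageReader loader from LlamaHub
SimpleWebPageReader = download_loader("SimpleWebPageReader")

loader = SimpleWebPageReader()
documents = loader.load_data(
    urls=["https://medium.com/mlearning-ai/what-is-a-fine-tuned-llm-67bf0b5df081"]
)
print(documents)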

Loader object
[Document(id_='17761da4-6a3a-4ce5-8590-c65ee446788f',
embedding=None, metadata={}, excluded_embed_metadata_keys=[],
excluded_llm_metadata_keys=[], relationships={},
hash='6471b3ffe4d3abb1aba2ca99d1d0448e2c3cbd157ddca256fab9fa363e0
9ed85', text='<!doctype html><html lang="en"><head><title data-
rh="true">What is a fine-tuned LLM?. Fine-tuning large language models…
| by Abhinav Kimothi |

</body></html>', start_char_idx=None, end_char_idx=None,
text_template='{metadata_str}\n\n{content}', metadata_template='{key}:
{value}', metadata_seperator='\n')]

The LlamaIndex Document object contains more attributes than a LangChain Document. Apart from text and metadata, it also has an id, templates and other customizations available.


Both LangChain and LlamaIndex offer loader integrations with more than a
hundred data sources and the list keeps on growing

LangChain Document Loaders: LangChain provides integrations with a variety of sources.
LlamaHub Data Loaders: LlamaIndex provides data loaders via LlamaHub.

These Document Loaders are particularly helpful in quickly making connections and accessing information. For specific sources, custom loaders can also be developed.

It is worthwhile exploring documentation for both

LlamaIndex: https://docs.llamaindex.ai/en/stable/

LangChain: https://python.langchain.com/docs/get_started/introduction

Loading documents from a list of sources may turn out to be a complicated process. Make sure to plan for all the sources and loaders in advance.

More often than not, transformations/clean-ups to the loaded data will be required, like removing duplicate content, HTML parsing, etc. LangChain also provides a variety of document transformers.


Document Splitting
Once the data is loaded, the next step in the indexing pipeline is splitting the documents into manageable chunks. Why is splitting documents necessary? There are two reasons -

Ease of Search
Large chunks of data are harder to search over. Splitting data into smaller chunks therefore helps in better indexation.

Context Window Size
LLMs allow only a finite number of tokens in prompts and completions. The context therefore cannot be larger than what the context window permits.

Chunking Strategies
While splitting documents into chunks might sound like a simple concept, there are certain best practices that researchers have discovered. There are a few considerations that may influence the overall chunking strategy.

Nature of Content
Consider whether you are working with lengthy documents, such as articles or books, or shorter content like tweets or instant messages. The nature of the content influences the choice of model and, consequently, the appropriate chunking strategy.

Embedding Model being Used

We will discuss embeddings in detail in the next section, but the choice of embedding model also dictates the chunking strategy. Some models perform better with chunks of a specific length.

Expected Length and Complexity of User Queries


Determine whether the content will be short and specific or long and complex.
This factor will influence the approach to chunking the content, ensuring a closer
correlation between the embedded query and the embedded chunks

Application Specific Requirements


The application use case, such as semantic search, question answering,
summarization, or other purposes will also determine how text should be
chunked. If the results need to be input into another language model with a token
limit, it is crucial to factor this into your decision-making process.


Chunking Methods
Depending on the aforementioned considerations, a number of text splitters are
available. At a broad level, text splitters operate in the following manner:

Divide the text into compact, semantically meaningful units, often sentences.
Merge these smaller units into larger chunks until a specific size is achieved,
measured by a length function.
Upon reaching the predetermined size, treat that chunk as an independent
segment of text. Thereafter, start creating a new text chunk with some degree
of overlap to maintain contextual continuity between chunks.

Two areas to focus on, therefore, are -

How is the text split? How is the chunk size measured?

A very common approach is to pre-determine the size of the text chunks. Additionally, we can specify the overlap between chunks (remember, overlap is preferred to maintain contextual continuity between chunks). This approach is simple and cheap and is, therefore, widely used. Let's look at some examples -


Split by Character
In this approach, the text is split based on a character and the chunk size is
measured by the number of characters.

Example text : alice_in_wonderland.txt (the book in .txt format)
using LangChain's CharacterTextSplitter
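The splitter code is shown as a screenshot in the book; a minimal sketch is below. The separator and the chunk_size/chunk_overlap values are assumptions, chosen to roughly match the chunk lengths reported further down.

from langchain.text_splitter import CharacterTextSplitter

with open("alice_in_wonderland.txt") as f:
    alice_text = f.read()

text_splitter = CharacterTextSplitter(
    separator="\n\n",     # split on blank lines
    chunk_size=2000,      # chunk size measured in characters (assumed value)
    chunk_overlap=400,    # overlap to maintain contextual continuity (assumed value)
    length_function=len,
)
texts = text_splitter.split_text(alice_text)

print("Total Number of Chunks Created =>", len(texts))
print("Length of the First Chunk is =>", len(texts[0]), "characters")
print("Length of the Last Chunk is =>", len(texts[-1]), "characters")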

texts[0]
“TITLE: Alice's Adventures in Wonderland\nAUTHOR: Lewis Carroll\n\n\n CHAPTER I \n( Down the
Rabbit-Hole )\n\n Alice was beginning to get very tired of sitting by her sister\non the bank, and of
having nothing to do: once or twice she had\npeeped into the book her sister was reading, but it
had no\npictures or conversations in it, `and what is the use of a book,'\nthought Alice `without
pictures or conversation?'\n\n So she was considering in her own mind (as well as she could,\nfor
the hot day made her feel very sleepy and stupid), whether\nthe pleasure of making a daisy-chain
would be worth the trouble\nof getting up and picking the daisies, when suddenly a White\nRabbit
with pink eyes ran close by her.\n\n There was nothing so VERY remarkable in that; nor did
Alice\nthink it so VERY much out of the way to hear the Rabbit say to\nitself, `Oh dear! Oh dear! I
shall be late!' (when she thought\nit over afterwards, it occurred to her that she ought to
have\nwondered at this, but at the time it all seemed quite natural);\nbut when the Rabbit actually
TOOK A WATCH OUT OF ITS WAISTCOAT-\nPOCKET, and looked at it, and then hurried on, Alice
started to\nher feet, for it flashed across her mind that she had never\nbefore seen a rabbit with
either a waistcoat-pocket, or a watch to\ntake out of it, and burning with curiosity, she ran across
the\nfield after it, and fortunately was just in time to see it pop\ndown a large rabbit-hole under the
hedge.\n\n In another moment down went Alice after it, never once\nconsidering how in the world
she was to get out again.\n\n The rabbit-hole went straight on like a tunnel for some way,\nand
then dipped suddenly down, so suddenly that Alice had not a\nmoment to think about stopping
herself before she found herself\nfalling down a very deep well."
(Note the overlap between the end of texts[0] and the start of texts[1])
texts[1]
"In another moment down went Alice after it, never once\nconsidering how in the world she was to
get out again.\n\n The rabbit-hole went straight on like a tunnel for some way,\nand then dipped
suddenly down, so suddenly that Alice had not a\nmoment to think about stopping herself before
she found herself\nfalling down a very deep well.\n\n Either the well was very deep, or she fell very
slowly, for she\nhad plenty of time as she went down to look about her and to\nwonder what was
going to happen next. First, she tried to look\ndown and make out what she was coming to, but it
was too dark to\nsee anything; then she looked at the sides of the well, and\nnoticed that they were
filled with cupboards and book-shelves;\nhere and there she saw maps and pictures hung upon
pegs. She\ntook down a jar from one of the shelves as she passed; it was\nlabelled `ORANGE
MARMALADE', but to her great disappointment it\nwas empty: she did not like to drop the jar for
fear of killing\nsomebody, so managed to put it into one of the cupboards as she\nfell past it.”


Let’s find out how many chunks were created

Total Number of Chunks Created => 93

Length of the First Chunk is => 1777 characters

Length of the Last Chunk is => 816 characters

Recursive Split by Character


A subtle variation to splitting by character is Recursive Split. The only difference
is that instead of a single character used for splitting, this technique uses a list of
characters and tries to split hierarchically till the chunk sizes are small enough.
This technique is generally recommended for generic text.

Example text : AK_BusyPersonIntroLLM.txt


(Transcript of a YouTube video by Andrej Karpathy titled [1hr Talk] Intro to Large Language
Models - https://www.youtube.com/watch?v=zjkBMFhNj_g&t=9s )

using LangChain’s RecursiveCharacterTextSplitter


This is a generic text that is not formatted. Let’s compare the two strategies.
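The comparison code itself is shown as an image in the book; the sketch below captures the idea. The chunk sizes are assumed, not taken from the original.

from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter

with open("AK_BusyPersonIntroLLM.txt") as f:
    transcript = f.read()

# CharacterTextSplitter only splits on a single separator ("\n\n" by default)
char_splitter = CharacterTextSplitter(chunk_size=2000, chunk_overlap=0)
char_chunks = char_splitter.split_text(transcript)

# RecursiveCharacterTextSplitter falls back through ["\n\n", "\n", " ", ""]
recursive_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=0)
recursive_chunks = recursive_splitter.split_text(transcript)

print(len(char_chunks), len(recursive_chunks))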

with CharacterTextSplitter

Total Number of Chunks Created => 1


Length of the First Chunk is => 64383 characters
Length of the Last Chunk is => 64383 characters

The text splitter fails to split the text into chunks since there are no '\n\n' characters present in the raw transcript.


with RecursiveCharacterTextSplitter

Total Number of Chunks Created => 40


Length of the First Chunk is => 1998 characters
Length of the Last Chunk is => 1967 characters

The recursive text splitter performs well in dealing with generic text.

Split by Tokens
For those well versed with Large Language Models, tokens are not a new concept. All LLMs have a token limit in their respective context windows which we cannot exceed. It is therefore a good idea to count the tokens while creating chunks. All LLMs also have their own tokenizers.

Tiktoken Tokenizer

The tiktoken tokenizer was created by OpenAI for its family of models. Using this strategy, the split still happens based on the character. However, the length of the chunk is determined by the number of tokens.


Example text : AK_BusyPersonIntroLLM.txt


(Transcript of a YouTube video by Andrej Karpathy titled [1hr Talk] Intro to Large Language
Models - https://www.youtube.com/watch?v=zjkBMFhNj_g&t=9s )

using LangChain’s TokenTextSplitter
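A sketch of the token-based splitter, reusing the `transcript` string loaded earlier; the chunk size in tokens is an illustrative assumption, since the original value is not visible in the extracted text.

from langchain.text_splitter import TokenTextSplitter

# chunk size and overlap are measured in tokens (tiktoken encoding)
token_splitter = TokenTextSplitter(chunk_size=1000, chunk_overlap=0)
token_chunks = token_splitter.split_text(transcript)

print("Total Number of Chunks Created =>", len(token_chunks))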

Total Number of Chunks Created => 14


Total Number of Tokens in the document => 12865 tokens
Length of the First Chunk is => 1014 tokens
Length of the Last Chunk is => 1014 tokens

Tokenizers are helpful in creating chunks that sit well in the context window of an LLM.

Hugging Face Tokenizer

Hugging Face has become the go-to platform for anyone building apps using LLMs
or even other models. All models available via Hugging Face are also accompanied
by their tokenizers.


Example text : AK_BusyPersonIntroLLM.txt


(Transcript of a YouTube video by Andrej Karpathy titled [1hr Talk] Intro to Large Language
Models - https://www.youtube.com/watch?v=zjkBMFhNj_g&t=9s )

using Transformers and LangChain’s RecursiveCharacterTextSplitter

Example tokenizer : GPT2TokenizerFast
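The book's code is again an image; a plausible sketch using the GPT2 tokenizer with the recursive splitter follows. The chunk_size here is in tokens and is an assumed value; the output below shows no overlap, so chunk_overlap is set to 0.

from transformers import GPT2TokenizerFast
from langchain.text_splitter import RecursiveCharacterTextSplitter

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

hf_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer,
    chunk_size=100,    # measured in GPT-2 tokens (assumed value)
    chunk_overlap=0,   # no overlap, as in the output below
)
texts = hf_splitter.split_text(transcript)
print(texts[0])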

texts[0]
“hi everyone so recently I gave a 30-minute talk on large language
models just kind of like an intro talk um unfortunately that talk
was not recorded but a lot of people came to me after the talk and
they told me that uh they really liked the talk so I would just I
thought I would just re-record it and basically put it up on
YouTube so here we go the busy person's intro to large language
models director Scott okay so let's begin first of all what is a large
language model
No Overlap as specified
texts[1]
really well a large language model is just two files right um there
be two files in this hypothetical directory so for example work with
the specific example of the Llama 270b model this is a large
language model released by meta Ai and this is basically the Llama
series of language models the second iteration of it and this is the
70 billion parameter model of uh of this series so there's multiple
models uh belonging to the Lama 2 Series uh 7 billion um 13 billion
34 billion and 70 billion is the the

Do take a look at the Hugging Face documentation on tokenizers:
https://huggingface.co/docs/transformers/tokenizer_summary


Other Tokenizers
Other libraries like spaCy, NLTK and SentenceTransformers also provide splitters.

Specialized Chunking
Chunking often aims to keep text with common context together. With this in mind, we might want to specifically honour the structure of the document itself, for example HTML, Markdown, LaTeX or even code.

Example HTML : "Context is Key: The Significance of RAG in Language Models"
(A blog on Medium - https://medium.com/p/29a7e8610843)

using LangChain's HTMLHeaderTextSplitter & RecursiveCharacterTextSplitter
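A sketch of how the two splitters can be combined; the headers to split on and the chunk sizes are assumptions, as the original screenshot is not reproduced in the text.

from langchain.text_splitter import HTMLHeaderTextSplitter, RecursiveCharacterTextSplitter

url = "https://medium.com/p/29a7e8610843"

# First split on HTML header tags, keeping the header text as chunk metadata
headers_to_split_on = [("h1", "Header 1"), ("h2", "Header 2"), ("h3", "Header 3")]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
header_splits = html_splitter.split_text_from_url(url)

# Then constrain chunk size with a recursive character split
chunk_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
splits = chunk_splitter.split_documents(header_splits)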

All the LangChain text splitters are listed in the LangChain documentation.


Things to Keep in Mind

Ensure data quality by preprocessing it before determining the optimal chunk size. Examples include removing HTML tags or eliminating specific elements that contribute noise, particularly when data is sourced from the web.

Consider factors such as the nature of the content (e.g., short messages or lengthy documents), the characteristics of the embedding model, and capabilities like token limits in choosing chunk sizes. Aim for a balance between preserving context and maintaining accuracy.

Test different chunk sizes. Create embeddings for the chosen chunk sizes and
store them in your index or indices. Run a series of queries to evaluate quality
and compare the performance of different chunk sizes.


Embeddings
All Machine Learning/AI models work with numerical data. Before any operation can be performed, all text/image/audio/video data has to be transformed into a numerical representation. Embeddings are vector representations of data that capture meaningful relationships between entities. As a general definition, embeddings are data that has been transformed into n-dimensional matrices for use in deep learning computations. A word embedding is a vector representation of words.

Dog  -> [5,7,1,....] vector representation for 'Dog'
Bark -> [6,7,2,....] vector representation for 'Bark'
Fly  -> [1,1,8,....] vector representation for 'Fly'

The process of embedding transforms data (like text) into vectors. It compresses the input information, resulting in an embedding space specific to the training data.

While we keep our discussion around embeddings limited to RAG applications and how to create embeddings for our data, a great resource to find out more about embeddings is this book by Vicky Boykis [What are Embeddings].

The good news for anyone building RAG Applications is that embeddings once
created can also generalize to other tasks and domains through transfer learning
— the ability to switch contexts — which is one of the reasons embeddings have
exploded in popularity across machine learning applications


Popular Embedding Models

Word2Vec: Google's Word2Vec is one of the most popular pre-trained word embeddings. The official paper - https://arxiv.org/pdf/1301.3781.pdf

GloVe: The 'Global Vectors' model is so termed because it captures statistics directly at a global level. The official paper - https://nlp.stanford.edu/pubs/glove.pdf

fastText: From Facebook's AI research, fastText builds embeddings composed of characters instead of words. The official paper - https://arxiv.org/pdf/1607.04606.pdf

ELMo: Embeddings from Language Models are learnt from the internal state of a bidirectional LSTM. The official paper - https://arxiv.org/pdf/1802.05365.pdf

BERT: Bidirectional Encoder Representations from Transformers is a transformer based approach. The official paper - https://arxiv.org/pdf/1810.04805.pdf

ada v2 by OpenAI - used by the GPT series of models

textembedding-gecko by Google - available via Google's Vertex AI

Other open source embeddings - check out the MTEB leaderboard on Hugging Face


How to Choose Embeddings?

Ever since the release of ChatGPT and the advent of the aptly described LLM Wars, there has also been a mad rush in developing embedding models. There are many evolving standards for evaluating LLMs and embeddings alike.
When building RAG powered LLM apps, there is no right answer to "Which embeddings model to use?". However, you may notice particular embeddings working better for specific use cases (like summarization, text generation, classification etc.)

OpenAI used to recommend different embeddings models for different use cases. However, now they recommend ada v2 for all tasks.

The MTEB Leaderboard at Hugging Face evaluates almost all available embedding models across seven use cases - Classification, Clustering, Pair Classification, Reranking, Retrieval, Semantic Textual Similarity (STS) and Summarization.

Another important consideration is cost. With OpenAI models you can incur
significant costs if you are working with a lot of documents. The cost of open
source models will depend on the implementation.


Creating Embeddings
Once you’ve chosen your embedding model, there are several ways of creating
the embeddings. Sometimes, our friends, LlamaIndex and LangChain come in
pretty handy to convert documents (split into chunks) into vector embeddings.
Other times you can use the service from a provider directly or get the
embeddings from HuggingFace

Example : OpenAI text-embedding-ada-002, using the Embedding.create() function from the openai library

You'll need an OpenAI API key to create these embeddings. You can get one here - https://platform.openai.com/api-keys
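The call itself is shown as a screenshot in the book; a minimal sketch using the openai Python library (v0.x style, current at the time of writing) is below. The input text is illustrative.

import openai

openai.api_key = "sk-..."  # your OpenAI API key

response = openai.Embedding.create(
    model="text-embedding-ada-002",
    input="hi everyone so recently I gave a 30-minute talk on large language models...",
)

embedding = response.data[0].embedding   # the embedding vector for the input text
print(len(embedding))                    # text-embedding-ada-002 returns 1536 dimensions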


Example Response

response.data[0].embedding will give the created embeddings that can be stored for retrieval.

Cost

In this example, 1014 tokens will cost about $.0001. Recall that for this youtube
transcript we got 14 chunks. So creating the embeddings for the entire transcript
will cost about 0.14 cents. This may seem low, but when you scale up to
thousands of documents being updated frequently, the cost can become a
concern.


Example : msmarco-bert-base-dot-v5
using HuggingFaceEmbeddings from langchain.embeddings

Example : embed-english-light-v3.0
using CohereEmbeddings from langchain.embeddings
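Sketches of both, assuming the model names from the headings above and an illustrative query; the Cohere example needs a Cohere API key.

from langchain.embeddings import HuggingFaceEmbeddings, CohereEmbeddings

# Hugging Face sentence-transformers model (runs locally)
hf_embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/msmarco-bert-base-dot-v5")
query_vector = hf_embeddings.embed_query("What did Andrej say about the LLM operating system?")

# Cohere hosted embedding model
cohere_embeddings = CohereEmbeddings(model="embed-english-light-v3.0", cohere_api_key="...")
doc_vectors = cohere_embeddings.embed_documents(["first chunk of text", "second chunk of text"])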

All the available embedding classes are listed in the LangChain documentation.


Storing
We are at the last step of creating the indexing pipeline. We have loaded and split
the data, and created the embeddings. Now, for us to be able to use the
information repeatedly, we need to store it so that it can be accessed on demand.
For this we use a special kind of database called the Vector Database.

What is a Vector Database?

For those familiar with databases, indexing is a data structure technique that allows users to quickly retrieve data from a database. Vector databases specialise in indexing and storing embeddings for fast retrieval and similarity search.

A stripped-down variant of a Vector Database is a Vector Index like FAISS (Facebook AI Similarity Search). It is this vector indexing that improves the search and retrieval of vector embeddings. Vector Databases augment the indexing with typical database features like data management, metadata storage, scalability, integrations, security etc.

In short, Vector Databases provide -
Scalable embedding storage
Precise similarity search
Fast search algorithms

Popular Vector Databases

FAISS: Facebook AI Similarity Search is a vector index released as a library in 2017 for large scale similarity search.
Pinecone: one of the most popular managed Vector DBs.
Weaviate: an open source vector database that stores both objects and vectors.
Chromadb: also an open source vector database.
With the growth in demand for vector storage, it can be anticipated that all major
database players will add the vector indexing capabilities to their offerings.


How to choose a Vector Database?

All vector databases offer the same basic capabilities. Your choice should be influenced by how the nuances of your use case match the value proposition of the database.

A few things to consider -

Balance search accuracy and query speed based on application needs. Prioritize accuracy for precision applications or speed for real-time systems.

Weigh increased flexibility against potential performance impacts. More customization can add overhead and slow systems down.

Evaluate data durability and integrity requirements against the need for fast query performance. Additional persistence safeguards can reduce speed.

Assess tradeoffs between local storage speed and access, and cloud storage benefits like security, redundancy and scalability.

Determine if tight integration control via direct libraries is required or if ease-of-use abstractions like APIs better suit your use case.

Compare advanced algorithm optimizations, query features, and indexing against how much complexity your use case actually necessitates versus the need for simplicity.

Cost considerations - while you may incur a regular cost with a fully managed solution, a self hosted one might prove costlier if not managed well.

[Figure: popular vector DBs arranged from user-friendly options for PoCs to higher-performance and more customizable options]

There are many more Vector DBs. For a comprehensive understanding of the pros and cons of each, this blog is highly recommended.


Storing Embeddings in Vector DBs


To store the embeddings, LangChain and LlamaIndex can be used for quick
prototyping. The more nuanced implementation will depend on the choice of the
DB, use case, volume etc.

Example : FAISS from langchain.vectorstores


In this example, we complete our indexing pipeline for one document.

1. Loading our text file using TextLoader,


2. Splitting the text into chunks using RecursiveCharacterTextSplitter,
3. Creating embeddings using OpenAIEmbeddings
4. Storing the embeddings into FAISS vector index

You’ll have to address the following dependencies.


1. Install openai, tiktoken and faiss-cpu or faiss-gpu
2. Get an OpenAI API key
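The book presents this pipeline as a screenshot; a minimal end-to-end sketch is below. Chunk sizes are assumptions, and the file name follows the transcript used earlier.

import os
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

os.environ["OPENAI_API_KEY"] = "sk-..."   # your OpenAI API key

# 1. Load the transcript into a Document
documents = TextLoader("AK_BusyPersonIntroLLM.txt").load()

# 2. Split the text into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(documents)

# 3 & 4. Create embeddings and store them in a FAISS vector index
db = FAISS.from_documents(chunks, OpenAIEmbeddings())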

Now that our knowledge base is ready, let's quickly see it in action. Let's perform a search on the FAISS index we've just created.


Similarity search
In the YouTube video, for which we have indexed the transcript, Andrej Karpathy
talks about the idea of LLM as an operating system. Let’s perform a search on this.

Query : What did Andrej say about LLM operating system?
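A sketch of the search call against the FAISS index `db` created above (k is an assumed value):

query = "What did Andrej say about LLM operating system?"
docs = db.similarity_search(query, k=2)   # return the 2 most similar chunks
print(docs[0].page_content)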

We can see here that out of the entire text, we have been able to retrieve the
specific chunk talking about the LLM OS. We’ll look at it in detail again in the RAG
pipeline


Example : Chroma from langchain.vectorstores


1. Loading our text file using TextLoader,
2. Splitting the text into chunks using RecursiveCharacterTextSplitter,
3. Creating embeddings using all-MiniLM-L6-v2
4. Storing the embeddings into Chromadb
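A comparable sketch for Chroma with a local sentence-transformers embedding model; the persist directory is an assumption.

from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

documents = TextLoader("AK_BusyPersonIntroLLM.txt").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(documents)

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
db = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_db")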

All the LangChain vector store integrations are listed in the LangChain documentation.

Indexing Pipeline Recap


We covered the indexing pipeline in its entirety. A quick recap -

Loading
A variety of data loaders from LangChain and LlamaIndex can be leveraged to load data from all sorts of sources. Loading documents from a list of sources may turn out to be a complicated process; make sure to plan for all the sources and loaders in advance. More often than not, transformations/clean-ups to the loaded data will be required.

Splitting
Documents need to be split for ease of search and because of the limitations of LLM context windows. Chunking strategies depend on the use case, nature of content, embeddings, and query length & complexity. Chunking methods determine how the text is split and how the chunks are measured.

Embedding
Embeddings are vector representations of data that capture meaningful relationships between entities. Some embeddings work better for some use cases.

Storing
Vector databases specialise in indexing and storing embeddings for fast retrieval and similarity search. Different vector databases present different benefits and can be used in accordance with the use case.


RAG Pipeline
Now that the knowledge base has been created in the indexing pipeline, the main generation or RAG pipeline has to be set up to receive the input and generate the output.

Let’s revisit our architecture diagram.

[Figure: RAG system architecture revisited. The knowledge sources are now backed by the vector store created via the indexing pipeline: the user's prompt goes to the orchestrator, relevant context is searched and retrieved, the prompt + context is sent to the LLM endpoint, and the generated response is returned]

Generation Steps
User writes a prompt or a query that is passed to an orchestrator

Orchestrator sends a search query to the retriever

Retriever fetches the relevant information from the knowledge sources and returns it

Orchestrator augments the prompt with the context and sends to the LLM

LLM responds with the generated text which is displayed to the user via the orchestrator

The knowledge sources highlighted above have been set up using the indexing
pipeline. These sources can be served using “on-the-fly” indexing also


RAG Pipeline Steps


The three main steps in a RAG pipeline are

Search & Retrieval
This step involves searching for the context from the source (e.g. a vector DB).

Augmentation
This step involves adding the context to the prompt depending on the use case.

Generation
This step involves generating the final response from the large language model.

An important consideration is how knowledge is stored and accessed. This has a bearing on the search & retrieval step.

Persistent Vector DBs (set up via the indexing pipeline)
When a large volume of data is stored in vector databases, retrieval and search need to be quick. The relevance and accuracy of the search can be tested.

Temporary Vector Index (created on the fly)
When data is temporarily stored in vector indices for one-time use, the accuracy and relevance of the search need to be ascertained.

Small Data
Generally, when a small amount of data is retrieved from pre-determined external sources, the augmentation of the data becomes more critical.


Retrieval
Perhaps, the most critical step in the entire RAG value chain is searching and
retrieving the relevant pieces of information (known as documents). When the
user enters a query or a prompt, it is this system (Retriever) that is responsible
for accurately fetching the correct snippet of information that is used in
responding to the user query.

Retrievers accept a Query as input and return a list of Documents as output.

Popular Retrieval Methods


Similarity Search
The similarity search functionality of vector databases forms the backbone of a Retriever. Similarity is measured by the distance between the embedding vectors of the input and the documents.

Maximum Marginal Relevance


MMR addresses redundancy in retrieval. MMR considers the
relevance of each document only in terms of how much new
information it brings given the previous results. MMR tries to reduce
the redundancy of results while at the same time maintaining query
relevance of results for already ranked documents/phrases

Multi-query Retrieval
Multi-query Retrieval automates prompt tuning using a language
model to generate diverse queries for a user input, retrieving
relevant documents from each query and combining them to
overcome limitations and obtain a more comprehensive set of
results. This approach aims to enhance retrieval performance by
considering multiple perspectives on the same query.


Retrieval Methods
Contextual compression
Sometimes, relevant info is hidden in long documents with a lot of
extra stuff. Contextual Compression helps with this by squeezing
down the documents to only the important parts that match your
search.

Multi Vector Retrieval

Sometimes it makes sense to store more than one vector per document, e.g. a chapter, its summary and a few quotes. The retrieval becomes more efficient because it can match against all the different types of information that have been embedded.

Parent Document Retrieval


In breaking down documents for retrieval, there's a dilemma. Small
pieces capture meaning better in embeddings, but if they're too
short, context is lost. The Parent Document Retrieval finds a middle
ground by storing small chunks. During retrieval, it fetches these bits,
then gets the larger documents they came from using their parent
IDs

Self Query
A self-querying retriever is a system that can ask itself questions.
When you give it a question in normal language, it uses a special
process to turn that question into a structured query. Then, it uses
this structured query to search through its stored information. This
way, it doesn't just compare your question with the documents; it
also looks for specific details in the documents based on your
question, making the search more efficient and accurate.


Retrieval Methods
Time-weighted Retrieval
This method supplements the semantic similarity search with a time decay. It gives more weightage to documents that are fresher or more frequently used than to ones that are older.

Ensemble Techniques
As the term suggests, multiple retrieval methods can be used in
conjunction with each other. There are many ways of implementing
ensemble techniques and use cases will define the structure of the
retriever

[Figure: Top advanced retrieval strategies. Source: LangChain State of AI 2023]


Example : Similarity Search using LangChain


1. Loading our text file using TextLoader,
2. Splitting the text into chunks using RecursiveCharacterTextSplitter,
3. Creating embeddings using all-MiniLM-L6-v2
4. Storing the embeddings into Chromadb
5. Retrieving chunks using similarity_search
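A compact sketch; it assumes the Chroma index `db` built as in the storing example above (steps 1-4) and only shows the retrieval step.

# 5. Retrieve the most similar chunks for a query
docs = db.similarity_search("What did Andrej say about LLM operating system?", k=3)
for doc in docs:
    print(doc.page_content[:200])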


Example : Similarity Vector Search


1. Loading our text file using TextLoader,
2. Splitting the text into chunks using RecursiveCharacterTextSplitter,
3. Creating embeddings using all-MiniLM-L6-v2
4. Storing the embeddings into Chromadb
5. Converting input query into a vector embedding
6. Retrieving chunks using similarity_search_by_vector

Similarity Vector Search differs from Similarity Search in that the query is first converted from regular text into a vector embedding, and the search is then performed with that vector.
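A sketch assuming the same Chroma index `db` and `embeddings` object from the previous examples:

# 5 & 6. Embed the query ourselves, then search by vector
query_vector = embeddings.embed_query("What did Andrej say about LLM operating system?")
docs = db.similarity_search_by_vector(query_vector, k=3)
print(docs[0].page_content)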


Example : Maximum Marginal Relevance


1. Loading our text file using TextLoader,
2. Splitting the text into chunks using RecursiveCharacterTextSplitter,
3. Creating embeddings using OpenAI Embeddings
4. Storing the embeddings into Qdrant
5. Retrieving and ranking chunks using max_marginal_relevance_search

fetch_k = number of documents in the initial retrieval
k = final number of reranked documents to output
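A sketch with an in-memory Qdrant collection; the collection name and the k/fetch_k values are assumptions, and `chunks` is the list of split documents from earlier.

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Qdrant

# 4. Store the embeddings in an in-memory Qdrant collection
qdrant = Qdrant.from_documents(
    chunks, OpenAIEmbeddings(), location=":memory:", collection_name="intro_llm"
)

# 5. Retrieve and rerank with Maximum Marginal Relevance
docs = qdrant.max_marginal_relevance_search(
    "What did Andrej say about LLM operating system?",
    k=3,         # final number of reranked documents to output
    fetch_k=10,  # number of documents in the initial retrieval
)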


Example : Multi-query Retrieval


1. Loading our text file using TextLoader,
2. Splitting the text into chunks using RecursiveCharacterTextSplitter,
3. Creating embeddings using OpenAI Embeddings
4. Storing the embeddings into Qdrant
5. Set the LLM as ChatOpenAI (gpt 3.5)
6. Set up logging to see the query variations generated by the LLM
7. use MultiQueryRetriever & get_relevant_documents functions
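A sketch of the multi-query retriever over the Qdrant store from the previous example; the logger name is the one LangChain uses for this retriever, and gpt-3.5-turbo stands in for "gpt 3.5".

import logging
from langchain.chat_models import ChatOpenAI
from langchain.retrievers.multi_query import MultiQueryRetriever

# 6. Log the query variations generated by the LLM
logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

# 5. LLM used to rephrase the user query
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

# 7. Build the retriever and fetch documents for all query variations
retriever = MultiQueryRetriever.from_llm(retriever=qdrant.as_retriever(), llm=llm)
docs = retriever.get_relevant_documents("What did Andrej say about LLM operating system?")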


Example : Contextual compression


1. Loading our text file using TextLoader,
2. Splitting the text into chunks using RecursiveCharacterTextSplitter,
3. Creating embeddings using OpenAI Embeddings
4. Set up retriever as FAISS
5. Set the LLM as ChatOpenAI (gpt 3.5)
6. Use LLMChainExtractor as the compressor
7. use ContextualCompressionRetriever & get_relevant_documents functions
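A sketch following the listed steps, reusing the FAISS index `db` from the storing example as the base retriever:

from langchain.chat_models import ChatOpenAI
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# 5 & 6. LLM-based compressor that extracts only the relevant parts of each document
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
compressor = LLMChainExtractor.from_llm(llm)

# 7. Wrap the base retriever with the compressor
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=db.as_retriever(),
)
compressed_docs = compression_retriever.get_relevant_documents(
    "What did Andrej say about LLM operating system?"
)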


Augmentation & Generation


Post-retrieval, the next set of steps include merging the user query and the
retrieved context (Augmentation) and passing this merged prompt as an
instruction to an LLM (Generation)

User Query (Question): Who won the 2023 ICC Cricket World Cup?

Retrieved Context: The 2023 Cricket World Cup, concluded on 19 November 2023, with Australia winning the tournament.

System Instruction: Answer the question based only on the following context

Context Augmented Prompt:
"Answer the question based only on the following context :
The 2023 Cricket World Cup, concluded on 19 November 2023, with Australia winning the tournament.
Question - Who won the 2023 ICC Cricket World Cup?"

Augmentation with an Illustrative Example

The context augmented prompt is passed to the LLM, which produces a contextual response:
"Australia won the 2023 ICC Cricket World Cup."

Generation with an Illustrative Example
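A sketch of the same augmentation and generation flow in LangChain; the prompt wording mirrors the example above, and the model name is an assumption.

from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_template(
    "Answer the question based only on the following context:\n{context}\n\nQuestion - {question}"
)

context = (
    "The 2023 Cricket World Cup, concluded on 19 November 2023, "
    "with Australia winning the tournament."
)
question = "Who won the 2023 ICC Cricket World Cup?"

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
response = llm(prompt.format_messages(context=context, question=question))
print(response.content)   # e.g. "Australia won the 2023 ICC Cricket World Cup."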


Evaluation
Building a PoC RAG pipeline is not overly complex. LangChain and LlamaIndex have made it quite simple. Developing highly impressive Large Language Model (LLM) applications is achievable through brief training and verification on a limited set of examples. However, to enhance robustness, thorough testing on a dataset that accurately mirrors the production distribution is imperative.

RAG is a great tool to address hallucinations in LLMs, but even RAG systems can suffer from hallucinations.

This can be because -


The retriever fails to retrieve relevant context or retrieves irrelevant context
The LLM, despite being provided the context, does not consider it
The LLM instead of answering the query picks irrelevant information from the
context

Two processes, therefore, to focus on from an evaluation perspective -

Search & Retrieval
How good is the retrieval of the context from the Vector Database?
Is it relevant to the query?
How much noise (irrelevant information) is present?

Generation
How good is the generated response?
Is the response grounded in the provided context?
Is the response relevant to the query?


Ragas (RAG Assessment)


Jithin James and Shahul ES from Exploding Gradients, in 2023, developed the
Ragas framework to address these questions.

https://github.com/explodinggradients/ragas

Evaluation Data
To evaluate RAG pipelines, the following four data points are recommended:

A set of Queries or Prompts for evaluation
Retrieved Context for each prompt
Corresponding Response or Answer from the LLM
Ground Truth or known correct response

Evaluation Metrics
Evaluating Generation
Faithfulness Is the Response faithful to the Retrieved Context?

Answer Relevance Is the Response relevant to the Prompt?

Retrieval Evaluation
Context Relevance Is the Retrieved Context relevant to the Prompt?

Context Recall Is the Retrieved Context aligned to the Ground Truth?

Context Precision Is the Retrieved Context ordered correctly?

Overall Evaluation
Answer Semantic Similarity: Is the Response semantically similar to the Ground Truth?
Answer Correctness: Is the Response semantically and factually similar to the Ground Truth?
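A sketch of how these data points and metrics come together with Ragas; the column names follow the Ragas conventions at the time of writing, and the single example row is illustrative.

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

eval_data = Dataset.from_dict({
    "question": ["Who won the 2023 ODI Cricket World Cup and when?"],
    "answer": ["Australia won on 19 November 2023."],
    "contexts": [[
        "The 2023 Cricket World Cup, concluded on 19 November 2023, "
        "with Australia winning the tournament."
    ]],
    "ground_truths": [["Australia won the world cup on 19 November, 2023."]],
})

results = evaluate(eval_data, metrics=[faithfulness, answer_relevancy, context_precision, context_recall])
print(results)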


Evaluation Metrics
Faithfulness
Faithfulness is the measure of the extent to which the response is
factually grounded in the retrieved context
Problem addressed : The LLM, despite being provided the context, does
not consider it
or
Is the response grounded in the provided context?

Evaluated Process : Generation


Any measure of retrieval accuracy is out of scope

Score Range : (0,1) Higher score is better

Methodology
Faithfulness identifies the number of “claims” made in the response and
calculates the proportion of those “claims” present in the context.

Faithfulness = (Number of generated claims present in the context) / (Total number of claims made in the generated response)

Illustrative Example

Query : Who won the 2023 ODI Cricket World Cup and when?
Context : The 2023 ODI Cricket World Cup concluded on 19 November 2023,
with Australia winning the tournament.

Response 1 : High Faithfulness
[Australia] won on [19 November 2023]

Response 2 : Low Faithfulness
[Australia] won on [15 October 2023]


Evaluation Metrics
Answer Relevance
Answer Relevance is the measure of the extent to which the response is
relevant to the query or the prompt
Problem addressed :The LLM instead of answering the query responds
with irrelevant information
or
Is the response relevant to the query?

Evaluated Process : Generation


Any measure of retrieval accuracy is out of scope

Score Range : (0,1) Higher score is better

Methodology
For this metric, a response is generated for the initial query or prompt
To compute the score, the LLM is then prompted to generate questions
for the generated response several times. The mean cosine similarity
between these questions and the original one is then calculated. The
concept is that if the answer correctly addresses the initial question, the
LLM should generate questions from it that match the original question.

Answer Relevance = Avg( Sc(Initial Query, LLM generated Query[i]) ), where Sc is the cosine similarity

Illustrative Example

Query : Who won the 2023 ODI Cricket World Cup and when?

Response 1 : High Answer Relevance
India won on 19 November 2023

Response 2 : Low Answer Relevance
Cricket world cup is held once every four years

Note
Answer Relevance is not a measure of truthfulness but only of relevance. The
response may or may not be factually accurate but may be relevant.


Evaluation Metrics
Context Relevance
Context Relevance is the measure of the extent to which the retrieved
context is relevant to the query or the prompt
Problem addressed :The retriever fails to retrieve relevant context
or
Is the retrieved context relevant to the query?

Evaluated Process : Retrieval


Indifferent to the final generated response

Score Range : (0,1) Higher score is better

Methodology
The retrieved context should contain information only relevant to the
query or the prompt. For context relevance, a metric ‘S’ is estimated. ‘S’
is the number of sentences in the retrieved context that are relevant for
responding to the query or the prompt.

Context Relevance = S / (Total number of sentences in the retrieved context), where S = number of relevant sentences from the context

Illustrative Example
Query : Who won the 2023 ODI Cricket World Cup and when?

Context 1 : High Context Relevance
The 2023 Cricket World Cup, concluded on 19 November 2023, with Australia winning the tournament. The tournament took place in ten different stadiums, in ten cities across the country. The final took place between India and Australia at Narendra Modi Stadium.

Context 2 : Low Context Relevance
The 2023 Cricket World Cup was the 13th edition of the Cricket World Cup. It was the first Cricket World Cup which India hosted solely. The tournament took place in ten different stadiums. In the first semi-final India beat New Zealand, and in the second semi-final Australia beat South Africa.


Evaluation Metrics
Ground Truth
Ground truth is information that is known to be real or true. In RAG, or
Generative AI domain in general, Ground Truth is a prepared set of Prompt-
Response examples. It is akin to labelled data in Supervised Learning parlance.
Calculation of certain metrics necessitates the availability of Ground Truth data

Context Recall
Context recall measures the extent to which the retrieved context aligns
with the “provided” answer or Ground Truth
Problem addressed :The retriever fails to retrieve accurate context
or
Is the retrieved context good enough to provide the response?

Evaluated Process : Retrieval


Indifferent to the final generated response

Score Range : (0,1) Higher score is better

Methodology
To estimate context recall from the ground truth answer, each sentence
in the ground truth answer is analyzed to determine whether it can be
attributed to the retrieved context or not. Ideally, all sentences in the
ground truth answer should be attributable to the retrieved context.

Context Recall = (Number of Ground Truth sentences attributable to the retrieved context) / (Total number of sentences in the Ground Truth)

Illustrative Example
Query : Who won the 2023 ODI Cricket World Cup and when?
Ground Truth : Australia won the world cup on 19 November, 2023.

Context 1 : High Context Recall
The 2023 Cricket World Cup, concluded on 19 November 2023, with Australia winning the tournament.

Context 2 : Low Context Recall
The 2023 Cricket World Cup was the 13th edition of the Cricket World Cup. It was the first Cricket World Cup which India hosted solely.


Evaluation Metrics
Context Precision
Context Precision is a metric that evaluates whether all of the ground-
truth relevant items present in the contexts are ranked higher or not.
Problem addressed : The retriever fails to rank the retrieved context correctly
or
Is the higher ranked retrieved context better to provide the response?

Evaluated Process : Retrieval


Indifferent to the final generated response

Score Range : (0,1) Higher score is better

Methodology
Context Precision is a metric that evaluates whether all of the ground-truth relevant items present in the retrieved context documents are ranked higher or not. Ideally all the relevant chunks must appear at the top.

Context Precision = Sum(Precision@k) / (Total number of relevant documents in the top results)

Precision@k = True Positives@k / (True Positives@k + False Positives@k)

Precision @ k
Precision@k is a metric used in information retrieval and recommendation
systems to evaluate the accuracy of the top k items retrieved or recommended.
It measures the proportion of relevant items among the top k items.
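A minimal sketch of this calculation over an ordered list of relevance labels for the retrieved chunks (1 = relevant, 0 = not relevant). It assumes, as frameworks like Ragas do, that Precision@k is accumulated at the positions that hold relevant chunks.

# Illustrative Context Precision over ranked relevance labels.
# The labels would come from an LLM judge or human annotation in practice.
def context_precision(relevance: list[int]) -> float:
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    score, hits = 0.0, 0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            score += hits / k        # Precision@k = true positives@k / k
    return score / total_relevant

print(context_precision([1, 1, 0]))  # relevant chunks ranked at the top -> 1.0
print(context_precision([0, 1, 1]))  # relevant chunks ranked lower -> ~0.58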



Evaluation 53

Evaluation Metrics
Answer semantic similarity
Answer semantic similarity evaluates whether the generated response is
similar to the “provided” response or Ground Truth.
Problem addressed : The generated response is incorrect
or
Does the pipeline generate the right response?
Evaluated Process : Retrieval & Generation
Score Range : (0,1) Higher score is better
Methodology
Answer semantic similarity score is calculated by measuring the
semantic similarity between the generated response and the ground
truth response.

Answer Semantic Similarity = Similarity(Generated Response, Ground Truth Response)
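One way to compute this score is shown in the sketch below, which assumes the sentence-transformers library and the all-MiniLM-L6-v2 model; any embedding model or API could be substituted.

# Illustrative answer semantic similarity using cosine similarity of embeddings.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def answer_similarity(generated: str, ground_truth: str) -> float:
    embeddings = model.encode([generated, ground_truth])
    return float(util.cos_sim(embeddings[0], embeddings[1]))

print(answer_similarity(
    "Australia won the 2023 Cricket World Cup on 19 November 2023.",
    "Australia won the world cup on 19 November, 2023.",
))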

Answer Correctness
Answer correctness evaluates whether the generated response is
semantically and factually similar to the “provided” response or Ground
Truth.
Problem addressed : The generated response is incorrect
or
Does the pipeline generate the right response?
Evaluated Process : Retrieval & Generation
Score Range : (0,1) Higher score is better
Methodology
Answer correctness score is calculated by measuring the semantic and
the factual similarity between the generated response and the ground
truth response.
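A simplified sketch of how such a score could be composed is given below. The claim sets, the F1-based factual component and the 0.75/0.25 weighting are illustrative assumptions; evaluation frameworks typically use an LLM to extract and compare claims.

# Simplified Answer Correctness: a weighted blend of factual overlap and
# semantic similarity. The weighting and claim extraction are assumptions
# made for illustration only.
def factual_f1(generated_claims: set, ground_truth_claims: set) -> float:
    tp = len(generated_claims & ground_truth_claims)
    if tp == 0:
        return 0.0
    precision = tp / len(generated_claims)
    recall = tp / len(ground_truth_claims)
    return 2 * precision * recall / (precision + recall)

def answer_correctness(generated_claims, ground_truth_claims, semantic_sim, w=0.75):
    return w * factual_f1(generated_claims, ground_truth_claims) + (1 - w) * semantic_sim

print(answer_correctness({"australia won", "on 19 november 2023"},
                         {"australia won", "on 19 november 2023"}, 0.95))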



Evaluation 54

Synthetic Test Data Generation


Generating hundreds of QA (Question-Context-Answer) samples from documents
manually can be a time-consuming and labor-intensive task. Moreover, questions
created by humans may face challenges in achieving the necessary level of
complexity for a comprehensive evaluation, potentially affecting the overall
quality of the assessment.

Synthetic Data Generation uses Large Language Models to generate a variety of


Questions/Prompts and Responses/Answers from the Documents (Context). It
can greatly reduce developer time.

[Figure: Synthetic Data Generation Pipeline. Documents → Seed Question Generator → Question Evolver (multi-context, conditional and reasoning questions) → Evaluation Dataset]

Synthetic Data Generated Using Ragas

Ragas Documentation
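A minimal sketch of the seed-question step is shown below, assuming a placeholder llm() call; libraries such as Ragas wrap this pattern and additionally "evolve" the questions into multi-context, conditional and reasoning variants.

# Sketch of synthetic QA generation. `llm` is a placeholder for any
# chat-completion call (OpenAI, Anthropic, a local model, etc.).
def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

SEED_PROMPT = (
    "Given the following document chunk, write one question that can be "
    "answered only from the chunk, followed by the answer.\n\nChunk:\n{chunk}"
)

def generate_qa_pairs(chunks: list) -> list:
    dataset = []
    for chunk in chunks:
        qa = llm(SEED_PROMPT.format(chunk=chunk))
        dataset.append({"context": chunk, "qa": qa})
    return dataset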



Evaluation 55

The RAG Triad (TruLens)


The RAG triad is a framework proposed by TruLens to evaluate hallucinations
along each edge of the RAG architecture.

[Figure: The RAG Triad, connecting Query/Prompt, Context and Answer/Response. Answer Relevance: Is the Response relevant to the Prompt? Context Relevance: Is the Retrieved Context relevant to the Prompt? Groundedness: Is the Response faithful to the Retrieved Context?]

Context Relevance:
Verify quality by ensuring each context chunk is relevant to the input query

Groundedness:
Verify groundedness by breaking down the response into individual claims.
Independently search for evidence supporting each claim in the retrieved
context.

Answer Relevance:
Ensure the response effectively addresses the original question.
Verify by evaluating the relevance of the final response to user input.

Trulens Documentation
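A sketch of how the three checks could be wired up, assuming a placeholder llm_judge() that returns a score between 0 and 1; TruLens provides such feedback functions out of the box.

# Sketch of the RAG triad checks phrased as questions to an LLM judge.
# llm_judge() is a placeholder, not the TruLens API.
def llm_judge(question: str) -> float:
    raise NotImplementedError("plug in a feedback/evaluation model here")

def rag_triad(query: str, context: str, response: str) -> dict:
    return {
        "context_relevance": llm_judge(
            f"Is this context relevant to the prompt?\nPrompt: {query}\nContext: {context}"),
        "groundedness": llm_judge(
            f"Is every claim in this response supported by the context?\nContext: {context}\nResponse: {response}"),
        "answer_relevance": llm_judge(
            f"Does this response address the prompt?\nPrompt: {query}\nResponse: {response}"),
    }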



RAG vs Finetuning 56

RAG vs Finetuning vs Both


Supervised Finetuning (SFT) has fast become a popular method to customise and
adapt foundation models for specific objectives. There has been a growing debate
in the applied AI community around the application of fine-tuning or RAG to
accomplish tasks.

RAG & SFT should be considered complementary, rather than competing, techniques.

RAG enhances the non-parametric memory of a foundation model without changing its parameters. SFT changes the parameters of a foundation model and therefore impacts its parametric memory.

If the requirement dictates changes to the parametric memory and an increase in the non-parametric memory, then RAG and SFT can be used in conjunction.

RAG Features
Connect to dynamic external data sources
Reduce hallucinations
Increase transparency (in terms of source of information)
Works well only with very large foundation models
Does not impact the style, tone, vocabulary of the foundation model

SFT Features
Change the style, vocabulary, tone of the foundation model
Can reduce model size
Useful for deep domain expertise
May not address the problem of hallucinations
No improvement in transparency (as black box as foundation models)



RAG vs Finetuning 57

Important Use Case Considerations


Do you require usage of dynamic external data? If yes, RAG is preferred over SFT.
Do you require changing the writing style, tonality or vocabulary of the model? If yes, SFT is preferred over RAG.

[Figure: 2x2 decision matrix with "External/dynamic knowledge" on one axis and "Change in model (style, tone, vocab, etc.)" on the other. High knowledge need, low model change: RAG preferred over SFT. Low knowledge need, high model change: SFT preferred over RAG. High on both: SFT + RAG (hybrid approach).]

RAG should be implemented (with or without SFT) if the use case requires
Access to an external data source, especially, if the data is dynamic

Resolving Hallucinations

Transparency in terms of the source of information

For SFT, you’ll need to have access to labelled training data



RAG vs Finetuning 58

Other Considerations
Latency
RAG pipelines require an additional step of searching and retrieving context
which introduces an inherent latency in the system

Scalability
RAG pipelines are modular and therefore can be scaled relatively easily when
compared to SFT. SFT will require retraining the model with each additional data
source

Cost
Both RAG and SFT warrant upfront investment. Training cost for SFT can vary
depending on the technique and the choice of foundation model. Setting up the
knowledge base and integration can be costly for RAG

Expertise
Creating RAG pipelines has become moderately simple with frameworks like
LangChain and LlamaIndex. Fine-tuning on the other hand requires deep
understanding of the techniques and creation of training data



LLMOps Stack 59

Evolving RAG LLMOps Stack


The production ecosystem for RAG and LLM applications is still evolving. Early
tooling and design patterns have emerged.

[Figure: Evolving RAG LLMOps stack. 1 Data Preparation, 2 Embeddings, 3 Vector Storage, 4 Foundation LLM, 5 SFT Model, 6 Prompt Engineering, 7 Evaluation, 8 Application/Orchestration, 9 Deployment & Inference, 10 App Hosting, 11 Monitoring]

Data Layer
The foundation of RAG applications is the data layer. This involves -
Data preparation - Sourcing, Cleaning, Loading & Chunking
Creation of Embeddings
Storing the embeddings in a vector store
We’ve seen this process in the creation of the indexing pipeline

[Figure: Popular data layer vendors for Data Preparation, Embeddings and Vector Storage (Non Exhaustive)]



LLMOps Stack 60

Model Layer
2023 can be considered a year of LLM wars. Almost every other week in the
second half of the year a new model was released. Like there is no RAG without
data, there is no RAG without an LLM. There are four broad categories of LLMs
that can be a part of a RAG application

1. A Proprietary Foundation Model - Developed and maintained by providers (like OpenAI, Anthropic, Google) and is generally available via an API
2. Open Source Foundation Model - Available in public domain (like Falcon,
Llama, Mistral) and has to be hosted and maintained by you.
3. A Supervised Fine-Tuned Proprietary Model - Providers enable fine-tuning of
their proprietary models with your data. The fine-tuned models are still
hosted and maintained by the providers and are available via an API
4. A Supervised Fine-Tuned Open Source Model - All Open Source models can
be fine-tuned by you on your data using full fine-tuning or PEFT methods.

There are a lot of vendors that have enabled access to open source models and
also facilitate easy fine tuning of these models

Popular proprietary and open source LLMs (Non Exhaustive):
Proprietary Models: GPT-3.5/GPT-4, Claude
Open Source Models: Llama 2 by Meta, Mistral & Mixtral, Falcon, Phi-2 by Microsoft

Popular vendors providing access to LLMs (Non Exhaustive):
Proprietary Models: vendors offering the GPT series and Claude, Jurassic & Titan
Open Source Models: AWS SageMaker JumpStart and similar platforms

Note : For Open Source models it is important to check the license type. Some
open source models are not available for commercial use



LLMOps Stack 61

Prompt Layer
Prompt Engineering is more than writing questions in natural language. There are
several prompting techniques and developers need to create prompts tailored
to the use cases. This process often involves experimentation: the developer
creates a prompt, observes the results and then iterates on the prompts to
improve the effectiveness of the app. This requires tracking and collaboration

Popular prompt engineering platforms (Non Exhaustive)

Evaluation
It is easy to build a RAG pipeline, but getting it ready for production requires robust evaluation of its performance. Several frameworks and tools have emerged for checking hallucinations, relevance and accuracy.

Ragas and other popular RAG evaluation frameworks and tools (Non Exhaustive)

App Orchestration
A RAG application involves the interaction of multiple tools and services. To run
the RAG pipeline, a solid orchestration framework is required that invokes these
different processes.

Popular App orchestration frameworks (Non Exhaustive)



LLMOps Stack 62

Deployment Layer
Deployment of the RAG application can be done on any of the available cloud
providers and platforms. Some important factors to consider while deployment
are also -
Security and Governance
Logging
Inference costs and latency

Popular cloud providers and LLMOps platforms (Non Exhaustive)

Application Layer
The application finally needs to be hosted for the intended users or systems to
interact with it. You can create your own application layer or use the available
platforms.

Popular app hosting platforms (Non Exhaustive)

Monitoring
The deployed application needs to be continuously monitored for accuracy and
relevance as well as cost and latency.

Popular monitoring platforms (Non Exhaustive)

Other Considerations
LLM Cache - To reduce costs by saving responses for popular queries
LLM Guardrails - To add additional layer of scrutiny on generations



Multimodal RAG 63

Multimodal RAG
Up until now, most AI models have been limited to a single modality (a single type
of data like text or images or video). Recently, there has been significant progress
in AI models being able to handle multiple modalities (majorly text and images).
With the emergence of these Large Multimodal Models (LMMs) a multimodal RAG
system becomes possible.

“Generate any type of output from any type of input, providing any type of context”

The high-level features of multimodal RAG are -

1. Ability to query/prompt in one or more modalities, like sending both text and image as input.
2. Ability to search and retrieve not only text but also images, tables,
audio files related to the query
3. Ability to generate text, image, video etc. irrespective of the
mode(s) in which the input is provided.

Approaches

1. Using MultiModal Embeddings
2. Using LMMs Only

Large Multimodal Models (examples): Flamingo, BLIP, KOSMOS-1, Macaw-LLM, GPT-4, Gemini, LLaVA, LAVIN, LLaMA-Adapter, FUYU



Multimodal RAG 64

Multimodal RAG Approaches


Using MultiModal Embeddings
Multimodal embeddings (like CLIP) are used to embed images and text
User Query is used to retrieve context which can be image and/or text
The image and/or text context is passed to an LMM with the prompt.
The LMM generates the final response based on the prompt

[Figure: Multimodal RAG using multimodal embeddings. Indexing pipeline: Data Loading → Multimodal Embedding → Vector Store. RAG pipeline: Query/Prompt → retrieval from the Vector Store → Retrieved Context + Prompt → LMM → Multimodal Response]

CLIP : Contrastive Language-Image Pre-training
OpenAI's CLIP (Contrastive Language-Image Pre-training) maps data of different modalities, both images and text, into the same shared semantic embedding space. This allows CLIP to "understand" the relationship between texts and images, enabling powerful multimodal applications.

[Figure: CLIP training. A Text Encoder produces text embeddings that pass through a language projection matrix; an Image Encoder produces image embeddings that pass through a vision projection matrix; a similarity score is computed between the resulting multimodal embeddings of text and image.]

CLIP is an example of training multimodal embeddings
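A sketch of retrieval over a shared text-image space is shown below. It assumes the sentence-transformers wrapper around CLIP and a couple of illustrative local image files; any multimodal embedding model or API could be used instead.

# Index images and retrieve them with a text query via CLIP embeddings.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

clip = SentenceTransformer("clip-ViT-B-32")

image_paths = ["chart.png", "diagram.png"]          # illustrative file names
image_embeddings = clip.encode([Image.open(p) for p in image_paths])

# Embed the text query in the same space and rank images by similarity.
query_embedding = clip.encode("a bar chart of quarterly revenue")
scores = util.cos_sim(query_embedding, image_embeddings)[0]
best = int(scores.argmax())
print(f"Most similar image: {image_paths[best]} (score {float(scores[best]):.2f})")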



Multimodal RAG 65

Using LMMs to produce text summaries from images


Indexing
An LMM is used to generate captions for images in the data
The image captions and text summaries are stored as text embeddings in a
vector database
A mapping is maintained from the image captions to the image files
Generation
User enters a query (with text and image)
Captions for the query image are generated using an LMM and embeddings are generated
Text summaries and image captions are searched. Images are retrieved based
on the relevant image captions.
Retrieved text summaries, captions and images are passed to the LMM with
the prompt. The LMM generates a multimodal response

[Figure: Multimodal RAG using image captions. Indexing pipeline: Data Loading → LMM → image captions and text summaries → text embeddings → Vector Store, with a mapping to the stored images. RAG pipeline: Query/Prompt → retrieved text & images → LMM → Multimodal Response]


Progression of RAG Systems 66

Progression of RAG Systems


Ever since its introduction in mid-2020, RAG approaches have followed a progression aimed at redressing the hallucination problem in LLMs.

Naive RAG
At its most basic, Retrieval Augmented Generation can be summarized in three
steps -
1. Indexing of the documents
2. Retrieval of the context with respect to an input query
3. Generation of the response using the input query and retrieved context

[Figure: Naive RAG. Documents → Indexing → Retrieval; User Query → Prompt (augmented with retrieved context) → LLM → Response]
This basic RAG approach can also be termed “Naive RAG”
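In code, the whole Naive RAG loop fits in a few lines; the sketch below assumes placeholder vector_store.search() and llm() interfaces.

# A bare-bones Naive RAG loop: retrieve top-k chunks, stuff them into the
# prompt, generate. `vector_store` and `llm` are placeholders for whichever
# vector database and model the pipeline uses.
def naive_rag(query: str, vector_store, llm, k: int = 3) -> str:
    chunks = vector_store.search(query, top_k=k)           # retrieval
    context = "\n\n".join(chunks)                           # augmentation
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
    return llm(prompt)                                      # generation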

Challenges in Naive RAG


Retrieval Quality: Low precision leading to hallucinations/mid-air drops; low recall resulting in missing relevant information; outdated information.

Augmentation: Redundancy and repetition when multiple retrieved documents have similar information; context length challenges.

Generation Quality: Generations are not grounded in the context; potential of toxicity and bias in the response; excessive dependence on augmented context.



Progression of RAG Systems 67

Advanced RAG
To address the inefficiencies of the Naive RAG approach, Advanced RAG
approaches implement strategies focussed on three processes -

[Figure: Advanced RAG pipeline. Documents are indexed after pre-retrieval optimisation; the user query passes through retrieval strategies; the retrieved context is post-processed before being added to the prompt and passed to the LLM to generate the response.]

Pre-Retrieval: Chunk Optimisation, Metadata Integration, Indexing Structure, Alignment

Retrieval: Fine-tuned Embeddings, Dynamic Embeddings, Adapters, Iterative Retrieval, Hybrid Search, HyDE, Query Rewriting, Sub Queries, Query Routing

Post Retrieval: Information Compression, Re-ranking

* Indicative, non-exhaustive list



Progression of RAG Systems 68

Advanced RAG Concepts


Pre-retrieval/Retrieval Stage
Chunk Optimization
When managing external documents, it's important to break them into the right-
sized chunks for accurate results. The choice of how to do this depends on
factors like content type, user queries, and application needs. No one-size-fits-all
strategy exists, so flexibility is crucial. Current research explores techniques like
sliding windows and "small2big" methods

Metadata Integration
Information like dates, purpose, chapter summaries, etc. can be embedded into
chunks. This improves the retriever efficiency by not only searching the
documents but also by assessing the similarity to the metadata.

Indexing Structure
Introduction of graph structures can greatly enhance retrieval by leveraging
nodes and their relationships. Multi-index paths can be created aimed at
increasing efficiency.

Alignment
Understanding complex data, like tables, can be tricky for RAG. One way to
improve the indexing is by using counterfactual training, where we create
hypothetical (what-if) questions. This increases the alignment and reduces
disparity between documents.

Query Rewriting
To bring better alignment between the user query and the documents, several rewriting approaches exist. LLMs are sometimes used to create pseudo documents from the query for better matching with existing documents. Sometimes, LLMs perform abstract reasoning over the query. Multi-querying is employed to solve complex user queries.

Hybrid Search Exploration


The RAG system employs different types of searches like keyword, semantic and
vector search, depending upon the user query and the type of data available.



Progression of RAG Systems 69

Sub Queries
Sub querying involves breaking down a complex query into sub-questions for each relevant data source, then gathering all the intermediate responses and synthesizing a final response.

Query Routing
A query router identifies a downstream task and decides the subsequent action
that the RAG system should take. During retrieval, the query router also identifies
the most appropriate data source for resolving the query.

Iterative Retrieval
Documents are collected repeatedly based on the query and the generated
response to create a more comprehensive knowledge base.

Recursive Retrieval
Recursive retrieval also iteratively retrieves documents. However, it also refines
the search queries depending on the results obtained from the previous retrieval.
It is like a continuous learning process.

Adaptive Retrieval
Adaptive retrieval enhances the RAG framework by empowering Language Models (LLMs) to proactively identify the most suitable moments and content for retrieval. This refinement aims to improve the efficiency and relevance of the information obtained, allowing the models to dynamically choose when and what to retrieve, leading to more precise and effective results.

Hypothetical Document Embeddings (HyDE)


Using the Language Model (LLM), HyDE forms a hypothetical document (answer)
in response to a query, embeds it, and then retrieves real documents similar to
this hypothetical one. Instead of relying on embedding similarity based on the
query, it emphasizes the similarity between embeddings of different answers.
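A sketch of HyDE with placeholder llm, embedder and vector_store interfaces:

# HyDE: embed a hypothetical answer instead of the raw query.
def hyde_retrieve(query: str, llm, embedder, vector_store, k: int = 5):
    # Step 1: ask the LLM for a hypothetical passage that answers the query.
    hypothetical_answer = llm(f"Write a short passage that answers: {query}")
    # Step 2: embed the hypothetical answer, not the query itself.
    vector = embedder.encode(hypothetical_answer)
    # Step 3: retrieve real documents whose embeddings are close to it.
    return vector_store.search_by_vector(vector, top_k=k)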

Fine-tuned Embeddings
This process involves tailoring embedding models to improve retrieval accuracy,
particularly in specialized domains dealing with uncommon or evolving terms. The
fine-tuning process utilizes training data generated with language models where
questions grounded in document chunks are generated.



Progression of RAG Systems 70

Post Retrieval Stage

Information Compression
While the retriever is proficient in extracting relevant information from extensive
knowledge bases, managing the vast amount of information within retrieval
documents poses a challenge. The retrieved information is compressed to extract
the most relevant points before passing it to the LLM.

Reranking
The re-ranking model plays a crucial role in optimizing the document set retrieved
by the retriever. The main idea is to rearrange document records to prioritize the
most relevant ones at the top, effectively managing the total number of
documents. This not only resolves challenges related to context window
expansion during retrieval but also improves efficiency and responsiveness.
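A sketch of cross-encoder re-ranking is given below, assuming the sentence-transformers library and its ms-marco cross-encoder checkpoint; hosted reranking APIs can be swapped in.

# Re-rank retrieved documents with a cross-encoder and keep the top few.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, documents: list, top_n: int = 3) -> list:
    scores = reranker.predict([(query, doc) for doc in documents])
    ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]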



Progression of RAG Systems 71

Modular RAG
The SOTA in Retrieval Augmented Generation is a modular approach which allows
components like search, memory, and reranking modules to be configured

[Figure: Modular RAG. Configurable modules such as Routing, Search, Predict, Retrieve, Rewrite, Read, Rerank, Demonstrate, Fusion and Memory can be composed around the Naive and Advanced RAG patterns.]
Naive RAG is essentially a Retrieve -> Read approach which focusses on retrieving information and comprehending it.
Advanced RAG adds Rewrite and Rerank components to the Retrieve -> Read approach to improve relevance and groundedness.
Modular RAG takes everything a notch ahead by providing flexibility and adding modules like Search, Routing, etc.

Naive, Advanced & Modular RAGs are not exclusive approaches but a
progression. Naive RAG is a special case of Advanced which, in turn, is a special
case of Modular RAG



Progression of RAG Systems 72

Some RAG Modules


Search
The search module is aimed at performing search on different data sources. It is
customised to different data sources and aimed at increasing the source data for
better response generation

Memory
This module leverages the parametric memory capabilities of the Language Model
(LLM) to guide retrieval. The module may use a retrieval-enhanced generator to
create an unbounded memory pool iteratively, combining the "original question"
and "dual question." By employing a retrieval-enhanced generative model that
improves itself using its own outputs, the text becomes more aligned with the
data distribution during the reasoning process.

Fusion
RAG-Fusion improves traditional search systems by overcoming their limitations
through a multi-query approach. It expands user queries into multiple diverse
perspectives using a Language Model (LLM). This strategy goes beyond capturing
explicit information and delves into uncovering deeper, transformative
knowledge. The fusion process involves conducting parallel vector searches for
both the original and expanded queries, intelligently re-ranking to optimize
results, and pairing the best outcomes with new queries.
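The re-ranking step in RAG-Fusion is commonly implemented with reciprocal rank fusion; a minimal sketch, assuming each search returns a ranked list of document ids:

# Reciprocal Rank Fusion: merge the ranked results of the original query and
# its LLM-generated variants into one list.
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists: list, k: int = 60) -> list:
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Results for the original query and two LLM-expanded variants of it.
print(reciprocal_rank_fusion([["d1", "d2", "d3"], ["d2", "d4"], ["d2", "d1"]]))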

Extra Generation
Rather than directly fetching information from a data source, this module
employs the Language Model (LLM) to generate the required context. The content
produced by the LLM is more likely to contain pertinent information, addressing
issues related to repetition and irrelevant details in the retrieved content.

Task Adaptable Module


This module makes RAG adaptable to various downstream tasks allowing the
development of task-specific end-to-end retrievers with minimal examples,
demonstrating flexibility in handling different tasks.



Acknowledgements 73

Acknowledgements
Retrieval Augmented Generation continues to be a pivotal approach for any
Generative AI led application and it is only going to grow. There are several
individuals and organisations that have provided learning resources and made
understanding RAG fun.

I’d like to thank -


My team at Yarnit.app for taking a bet on RAG and helping me explore and
execute RAG pipelines for content generation
Andrew Ng and the good folks at deeplearning.ai for their short courses
allowing everyone access to generative AI
OpenAI and HuggingFace for all that they do
Harrison Chase and all the folks at LangChain for not only building the
framework but also making it easy to execute
Jerry Liu and others at LlamaIndex for their perspectives and tutorials on RAG
TruEra for demystifying observability and the tech stack for LLMOps
PineCone for their amazing documentation and the learning center
The team at Exploding Gradients for creating Ragas and explaining RAG
evaluation in detail
TruLens for their triad of RAG evaluations
Aman Chadha for his curation of all things AI, ML and Data Science
Above all, to my colleagues and friends, who endeavour to learn, discover and
apply technology everyday in their effort to make the world a better place.

With lots of love,

Abhinav

If you liked what you read, download the free ebook: Detailed Notes from the Generative AI with Large Language Models course by Deeplearning.ai and AWS.

I talk about :
#AI #MachineLearning #DataScience #GenerativeAI #Analytics #LLMs #Technology #RAG #EthicalAI

let's connect... please



Resources 74

Resources
Official Documentation
Python documentation and learning center links for the major frameworks, plus the Ragas documentation

Thought Leaders and Influencers
Aman Chadha's Blog, Lillian Weng's Log, Leonie Monigatti's Blogroll, Chip Huyen's Blogs

Research Papers
Retrieval-Augmented Generation for Large Language Models: A Survey (Gao, et al, 2023)
Retrieval-Augmented Multimodal Language Modeling (Yasunaga, et al, 2023)
KG-Augmented Language Models for Knowledge-Grounded Dialogue (Kang, et al, 2023)

Learning Resources and Tutorials
Short 1-hour Courses, Python Cookbook, Tutorials & Webinars



Epilogue 75

Hello!
I’m Abhinav...
A data science and AI professional with over 15
years in the industry. Passionate about AI
advancements, I constantly explore emerging
technologies to push the boundaries and create
positive impacts in the world. Let’s build the future,
together!

Please share your feedback on these notes with me

LinkedIn Github Medium Insta email X Linktree Gumroad

Talk to me: Book a meeting
Checkout Yarnit: a 5-in-1 Generative AI powered Content Marketing Application (www.yarnit.app)
Magic Newsletter

$$ Contribute $$
Subscribe
Follow on LinkedIn

Keep Calm & Build AI. Abhinav Kimothi
