RAG - A Simple Introduction
Retrieval Augmented Generation
Abhinav Kimothi
What is RAG?
Retrieval Augmented Generation
30th November, 2022 will be remembered as a watershed moment in artificial
intelligence. OpenAI released ChatGPT and the world was mesmerised. Interest in
previously obscure terms like Generative AI and Large Language Models (LLMs)
was unstoppable over the following 12 months.
[Figure: Google Trends - Interest Over Time for "Generative AI" and "Large Language Models" (Nov '22 to Nov '23)]
The Challenge
Make LLMs respond with up-to-date information
Make LLMs not respond with factually inaccurate
information
Make LLMs aware of proprietary information
Providing LLMs with information not in their memory
Providing Context
While model re-training/fine-tuning/reinforcement learning are options that solve
the aforementioned challenges, these approaches are time-consuming and
costly. In the majority of use cases, these costs are prohibitive.
[Diagram: RAG - the user's {Prompt} goes to a Retriever, which searches an external source and fetches the relevant information; the retrieved Context is added and the LLM receives {Prompt + Context}]
Parametric Memory
An LLM has knowledge only of the data it has been trained on. This is also called Parametric Memory (information stored in the model parameters).

Non-Parametric Memory
The Retriever searches and fetches information, from sources such as document repositories, databases and other external sources, that the LLM has not necessarily been trained on. This adds to the LLM's memory and is passed as context in the prompts. It is also called Non-Parametric Memory (information available outside the model parameters).
Expandable to all sources
Easier to update/maintain
Much cheaper than retraining/fine-tuning
The effort lies in creation of the knowledge base
Confidence in Responses
With the context (extra information that is retrieved) made available to the LLM,
the confidence in LLM responses is increased.
Conversational agents
LLMs can be customised to product/service manuals, domain
knowledge, guidelines, etc. using RAG. The agent can also route users to
more specialised agents depending on their query. SearchUnify has an
LLM+RAG powered conversational agent for their users.
Content Generation
The widest use of LLMs has probably been in content generation. Using
RAG, the generation can be personalised to readers, incorporate real-
time trends and be contextually appropriate. Yarnit is an AI based
content marketing platform that uses RAG for multiple tasks.
Personalised Recommendation
Recommendation engines have been a game changer in the digital
economy. LLMs are capable of powering the next evolution in content
recommendations. Check out Aman’s blog on the utility of LLMs in
recommendation systems.
Virtual Assistants
Virtual personal assistants like Siri, Alexa and others plan to use
LLMs to enhance the experience. Coupled with more context on user
behaviour, these assistants can become highly personalised.
RAG Architecture
Let’s revisit the five high level steps of a RAG-enabled system
[Diagram: the RAG system - the user's prompt triggers a search for relevant information in the knowledge sources; the relevant context is combined with the prompt, and the {Prompt + Context} is sent to the LLM endpoint, which returns the generated response]
User writes a prompt or a query that is passed to an orchestrator
Retriever fetches the relevant information from the knowledge sources and sends it back
Orchestrator augments the prompt with the context and sends it to the LLM
LLM responds with the generated text, which is displayed to the user via the orchestrator
Two pipelines become important in setting up the RAG system: the first sets up
the knowledge sources for efficient search and retrieval, and the second covers
the generation steps performed at run time.
Indexing Pipeline
Data for the knowledge base is ingested from the source and indexed. This
involves steps like splitting, creation of embeddings and storage of data.
RAG Pipeline
This involves the actual RAG process, which takes the user query at
run time, retrieves the relevant data from the index and then passes
it to the model.
Indexing Pipeline
The indexing pipeline sets up the knowledge source for the RAG system. It is
generally considered an offline process. However, information can also be
fetched in real time. It involves four primary steps.
Loading: This step involves extracting information from different knowledge sources and loading it into documents.
Splitting: This step involves splitting documents into smaller, manageable chunks. Smaller chunks are easier to search and to use in LLM context windows.
Embedding: This step involves converting text documents into numerical vectors. ML models are mathematical models and therefore require numerical data.
Storing: This step involves storing the embeddings. Vectors are typically stored in Vector Databases, which are best suited for searching.
[Diagram: when the context is fixed, no search is needed - the context is fetched directly and passed with the prompt to the LLM for a response]
Loading Data
As we’ve been discussing, the utility of RAG lies in accessing data from all sorts of
sources. These sources can be -
Websites & HTML pages
Documents like Word, PDF, etc.
Code in Python, Java, etc.
Data in JSON, CSV, etc.
APIs
File Directories
Databases
And many more
The first step is to extract the information present in these source locations.
This is a good time to introduce two popular frameworks that are being used to
develop LLM powered applications: LangChain and LlamaIndex.
LangChain
Use cases: Good for applications that need enhanced AI capabilities, like language understanding tasks and more sophisticated text generation
Features: Stands out for its versatility and adaptability in building robust applications with LLMs
Agents: Makes creating agents using large language models simple through their agents API

LlamaIndex
Use cases: Good for tasks that require text search and retrieval, like information retrieval or content discovery
Features: Excels in data indexing and language model enhancement
Connectors: Provides connectors to access data from databases, external APIs, or other datasets
Both frameworks are rapidly evolving and adding new capabilities every week.
It’s not an either/or situation and you can use both together (or neither).
Loader object (LangChain)
[Document(page_content="Have you ever seen a polar bear
playing bass? Or a robot painted like a Picasso? Didn’t think so.
DALL-E 2 is ....
....
....
.....umans\nand clever systems can work together to make new
things – amplifying our creative potential.", metadata={'source':
'qTgPSKKjfVg', 'title': 'DALL·E 2 Explained', 'description': 'Unknown',
'view_count': 853564, 'thumbnail_url':
'https://i.ytimg.com/vi/qTgPSKKjfVg/hq720.jpg', 'publish_date':
'2022-04-06 00:00:00', 'length': 167, 'author': 'OpenAI'})]
The Document object contains the page_content, which is the transcript extracted
from the YouTube video, as well as the metadata (source, title, description, view count, etc.)
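A minimal sketch of how such a Document can be loaded (assuming LangChain's langchain_community package and the pytube dependency for video metadata; the video URL corresponds to the ID shown in the metadata above):

from langchain_community.document_loaders import YoutubeLoader

# Load the transcript of the video as a LangChain Document,
# with add_video_info populating metadata like title, view_count and publish_date
loader = YoutubeLoader.from_youtube_url(
    "https://www.youtube.com/watch?v=qTgPSKKjfVg",
    add_video_info=True,
)
docs = loader.load()
print(docs[0].page_content[:200])
print(docs[0].metadata)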
Loader object (LlamaIndex)
[Document(id_='17761da4-6a3a-4ce5-8590-c65ee446788f',
embedding=None, metadata={}, excluded_embed_metadata_keys=[],
excluded_llm_metadata_keys=[], relationships={},
hash='6471b3ffe4d3abb1aba2ca99d1d0448e2c3cbd157ddca256fab9fa363e0
9ed85', text='<!doctype html><html lang="en"><head><title data-
rh="true">What is a fine-tuned LLM?. Fine-tuning large language models…
| by Abhinav Kimothi |
…
</body></html>', start_char_idx=None, end_char_idx=None,
text_template='{metadata_str}\n\n{content}', metadata_template='{key}:
{value}', metadata_seperator='\n')]
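A minimal sketch of producing a similar Document with LlamaIndex (assuming the llama-index-readers-web package; the URL is a placeholder for the Medium article shown above):

from llama_index.readers.web import SimpleWebPageReader

# Read the raw HTML of a web page into LlamaIndex Document objects
reader = SimpleWebPageReader(html_to_text=False)
docs = reader.load_data(urls=["https://medium.com/..."])  # placeholder URL
print(docs[0].text[:200])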
Both LangChain and LlamaIndex offer loader integrations with more than a
hundred data sources and the list keeps on growing
LlamaIndex: https://docs.llamaindex.ai/en/stable/
LangChain: https://python.langchain.com/docs/get_started/introduction
Document Splitting
Once the data is loaded, the next step in the indexing pipeline is splitting the
documents into manageable chunks. Why is splitting of documents necessary? There are
two reasons: smaller chunks are easier to search accurately, and they fit within the limited context windows of LLMs.
Chunking Strategies
While splitting documents into chunks might sound like a simple concept, there are
certain best practices that researchers have discovered. There are a few
considerations that may influence the overall chunking strategy.
Nature of Content
Consider whether you are working with lengthy documents, such as articles or
books, or shorter content like tweets or instant messages. The nature of the content
influences the choice of model for your goal and, consequently, the appropriate
chunking strategy.
Chunking Methods
Depending on the aforementioned considerations, a number of text splitters are
available. At a broad level, text splitters operate in the following manner:
Divide the text into compact, semantically meaningful units, often sentences.
Merge these smaller units into larger chunks until a specific size is achieved,
measured by a length function.
Upon reaching the predetermined size, treat that chunk as an independent
segment of text. Thereafter, start creating a new text chunk with some degree
of overlap to maintain contextual continuity between chunks.
A very common approach is where we pre-determine the size of the text chunks.
This approach is simple and cheap and is, therefore, widely used. Let’s look at
some examples -
Split by Character
In this approach, the text is split based on a character and the chunk size is
measured by the number of characters.
texts[0]
“TITLE: Alice's Adventures in Wonderland\nAUTHOR: Lewis Carroll\n\n\n CHAPTER I \n( Down the
Rabbit-Hole )\n\n Alice was beginning to get very tired of sitting by her sister\non the bank, and of
having nothing to do: once or twice she had\npeeped into the book her sister was reading, but it
had no\npictures or conversations in it, `and what is the use of a book,'\nthought Alice `without
pictures or conversation?'\n\n So she was considering in her own mind (as well as she could,\nfor
the hot day made her feel very sleepy and stupid), whether\nthe pleasure of making a daisy-chain
would be worth the trouble\nof getting up and picking the daisies, when suddenly a White\nRabbit
with pink eyes ran close by her.\n\n There was nothing so VERY remarkable in that; nor did
Alice\nthink it so VERY much out of the way to hear the Rabbit say to\nitself, `Oh dear! Oh dear! I
shall be late!' (when she thought\nit over afterwards, it occurred to her that she ought to
have\nwondered at this, but at the time it all seemed quite natural);\nbut when the Rabbit actually
TOOK A WATCH OUT OF ITS WAISTCOAT-\nPOCKET, and looked at it, and then hurried on, Alice
started to\nher feet, for it flashed across her mind that she had never\nbefore seen a rabbit with
either a waistcoat-pocket, or a watch to\ntake out of it, and burning with curiosity, she ran across
the\nfield after it, and fortunately was just in time to see it pop\ndown a large rabbit-hole under the
hedge.\n\n In another moment down went Alice after it, never once\nconsidering how in the world
she was to get out again.\n\n The rabbit-hole went straight on like a tunnel for some way,\nand
then dipped suddenly down, so suddenly that Alice had not a\nmoment to think about stopping
herself before she found herself\nfalling down a very deep well."
Overlap - note that texts[1] below begins with the same text that ends texts[0] above.
texts[1]
"In another moment down went Alice after it, never once\nconsidering how in the world she was to
get out again.\n\n The rabbit-hole went straight on like a tunnel for some way,\nand then dipped
suddenly down, so suddenly that Alice had not a\nmoment to think about stopping herself before
she found herself\nfalling down a very deep well.\n\n Either the well was very deep, or she fell very
slowly, for she\nhad plenty of time as she went down to look about her and to\nwonder what was
going to happen next. First, she tried to look\ndown and make out what she was coming to, but it
was too dark to\nsee anything; then she looked at the sides of the
well, and\nnoticed that they were
filled with cupboards and book-shelves;\nhere and there she saw maps and pictures hung upon
pegs. She\ntook down a jar from one of the shelves as she passed; it was\nlabelled `ORANGE
MARMALADE', but to her great disappointment it\nwas empty: she did not like to drop the jar for
fear of killing\nsomebody, so managed to put it into one of the cupboards as she\nfell past it.”
These chunks can be produced with CharacterTextSplitter or with RecursiveCharacterTextSplitter (see the sketch below).
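A minimal sketch of the two splitters (assuming LangChain's text splitters and a local copy of the book; chunk size and overlap are illustrative):

from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter

with open("alice_in_wonderland.txt") as f:
    alice = f.read()

# Split on a single separator; chunk size is measured in characters
char_splitter = CharacterTextSplitter(
    separator="\n\n", chunk_size=1500, chunk_overlap=300, length_function=len
)
texts = char_splitter.split_text(alice)

# Recursively fall back through separators ("\n\n", "\n", " ", "")
# so chunks stay close to the target size even for long paragraphs
recursive_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=300)
texts_recursive = recursive_splitter.split_text(alice)

print(texts[0])  # Chunk 1
print(texts[1])  # Chunk 2, overlapping with the end of Chunk 1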
Split by Tokens
For those well versed with Large Language Models, tokens are not a new concept.
All LLMs have a token limit on their respective context windows which we cannot
exceed. It is therefore a good idea to count tokens while creating chunks. All
LLMs also have their own tokenizers.
Tiktoken Tokenizer
The tiktoken tokenizer was created by OpenAI for its family of models. Using
this strategy, the split still happens based on a character; however, the length
of the chunk is determined by the number of tokens.
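A minimal sketch of tiktoken-based splitting via LangChain (encoding name and chunk size are illustrative; `transcript` is assumed to be the loaded text):

from langchain.text_splitter import CharacterTextSplitter

# Chunk length is measured in tiktoken tokens rather than characters
splitter = CharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # the encoding used by recent OpenAI models
    chunk_size=500,
    chunk_overlap=0,
)
texts = splitter.split_text(transcript)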
Hugging Face has become the go-to platform for anyone building apps using LLMs
or even other models. All models available via Hugging Face are also accompanied
by their tokenizers.
texts[0]
“hi everyone so recently I gave a 30-minute talk on large language
models just kind of like an intro talk um unfortunately that talk
was not recorded but a lot of people came to me after the talk and
they told me that uh they
really liked the talk so I would just I
thought I would just re-record it and basically put it up on
YouTube so here we go the busy person's intro to large language
models director Scott okay so let's begin first of all what is a large
language model
No Overlap as specified
texts[1]
really well a large language model is just two files right um there
be two files in this hypothetical directory so for example work with
the specific example of the Llama 270b model this is a large
language model released by meta Ai and this is basically the Llama
series of language models the second iteration of it and this is the
70 billion parameter model of uh of this series so there's multiple
models uh belonging to the Lama 2 Series uh 7 billion um 13 billion
34 billion and 70 billion is the the
https://huggingface.co/docs/transformers/tokenizer_summary
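A minimal sketch of counting chunk length with a Hugging Face tokenizer (assuming the transformers package; GPT-2's tokenizer is used purely as an example):

from transformers import AutoTokenizer
from langchain.text_splitter import CharacterTextSplitter

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Chunk length is measured by the Hugging Face tokenizer's token count
splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer, chunk_size=500, chunk_overlap=0
)
texts = splitter.split_text(transcript)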
Other Tokenizers
Other libraries like spaCy, NLTK and SentenceTransformers also provide splitters.
Specialized Chunking
Chunking often aims to keep text with common context together. With this in
mind, we might want to specifically honour the structure of the document itself,
for example HTML, Markdown, LaTeX or even code.
Example : https://medium.com/p/29a7e8610843
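A minimal sketch of structure-aware splitting for Markdown and Python code with LangChain (header names, chunk sizes and the input variables are illustrative):

from langchain.text_splitter import (
    Language,
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

# Split a Markdown document on its headers so each chunk keeps its section context
md_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2")]
)
md_chunks = md_splitter.split_text(markdown_text)

# Split Python code along function/class boundaries instead of arbitrary characters
code_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=1000, chunk_overlap=100
)
code_chunks = code_splitter.split_text(python_source)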
Test different chunk sizes. Create embeddings for the chosen chunk sizes and
store them in your index or indices. Run a series of queries to evaluate quality
and compare the performance of different chunk sizes.
Embeddings
All Machine Learning/AI models work with numerical data. Before any operation
can be performed, text, image, audio and video data has to be
transformed into a numerical representation. Embeddings are vector
representations of data that capture meaningful relationships between entities.
As a general definition, embeddings are data that has been transformed into n-
dimensional matrices for use in deep learning computations. A word embedding is
a vector representation of a word.
The process of embedding transforms data (like text) into vectors and compresses
the input information, resulting in an embedding space specific to the training
data.
The good news for anyone building RAG applications is that embeddings, once
created, can also generalize to other tasks and domains through transfer learning
(the ability to switch contexts), which is one of the reasons embeddings have
exploded in popularity across machine learning applications.
Some popular embedding models -

Word2Vec: one of the earliest widely used word embedding models. The official paper -
https://arxiv.org/pdf/1301.3781.pdf

ELMo: Embeddings from Language Models, learnt from the internal
states of a bidirectional LSTM. The official paper -
https://arxiv.org/pdf/1802.05365.pdf

ada v2 by OpenAI: used by the GPT series of models

textembedding-gecko by Google
Another important consideration is cost. With OpenAI models you can incur
significant costs if you are working with a lot of documents. The cost of open
source models will depend on the implementation.
Creating Embeddings
Once you’ve chosen your embedding model, there are several ways of creating
the embeddings. Sometimes our friends LlamaIndex and LangChain come in
pretty handy to convert documents (split into chunks) into vector embeddings.
Other times you can use the service from a provider directly or get the
embeddings from Hugging Face.
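A minimal sketch of creating embeddings for the split chunks with LangChain's OpenAI wrapper (assuming an OPENAI_API_KEY in the environment and that `docs` holds the chunked documents from the splitting step):

from langchain.embeddings import OpenAIEmbeddings

embedding_model = OpenAIEmbeddings(model="text-embedding-ada-002")

# Embed every chunk of the transcript in one call
chunk_texts = [doc.page_content for doc in docs]
vectors = embedding_model.embed_documents(chunk_texts)

# A query is embedded the same way at retrieval time
query_vector = embedding_model.embed_query("What is a large language model?")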
Example Response: the API returns the embedding vectors along with the token usage (1014 tokens in this example).
Cost
In this example, 1014 tokens will cost about $0.0001. Recall that for this YouTube
transcript we got 14 chunks, so creating the embeddings for the entire transcript
will cost about 0.14 cents. This may seem low, but when you scale up to
thousands of documents being updated frequently, the cost can become a
concern.
Example : msmarco-bert-base-dot-v5
using HuggingFaceEmbeddings from langchain.embeddings
Example : embed-english-light-v3.0
using CohereEmbeddings from langchain.embeddings
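Minimal sketches of the two examples above using the LangChain wrappers (assuming the sentence-transformers package for the first and a Cohere API key for the second; `chunk_texts` is the list of chunk strings from earlier):

from langchain.embeddings import CohereEmbeddings, HuggingFaceEmbeddings

# msmarco-bert-base-dot-v5 via sentence-transformers (runs locally)
hf_embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/msmarco-bert-base-dot-v5"
)
hf_vectors = hf_embeddings.embed_documents(chunk_texts)

# embed-english-light-v3.0 via the Cohere API
cohere_embeddings = CohereEmbeddings(
    model="embed-english-light-v3.0", cohere_api_key="YOUR_API_KEY"
)
cohere_vectors = cohere_embeddings.embed_documents(chunk_texts)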
Storing
We are at the last step of the indexing pipeline. We have loaded and split
the data, and created the embeddings. Now, for us to be able to use the
information repeatedly, we need to store it so that it can be accessed on demand.
For this, we use a special kind of database called a Vector Database.
A stripped-down variant of a Vector Database is a Vector Index like FAISS (Facebook
AI Similarity Search). It is this vector indexing that improves the search and
retrieval of vector embeddings. Vector Databases augment the indexing with
typical database features like data management, metadata storage, scalability,
integrations, security, etc.
Evaluate data durability and integrity requirements vs the need for fast query
performance. Additional persistence safeguards can reduce speed.
Assess tradeoffs between local storage speed and access vs cloud storage
benefits like security, redundancy and scalability.
Cost considerations - while you may incur a regular cost with a fully managed
solution, a self-hosted one might prove costlier if not managed well
There are many more Vector DBs. For a comprehensive understanding of the pros
and cons of each, this blog is highly recommended
Now that our knowledge base is ready, let's quickly see it in action by performing a
search on the FAISS index we've just created.
Similarity search
In the YouTube video for which we have indexed the transcript, Andrej Karpathy
talks about the idea of an LLM as an operating system. Let's perform a search on this.
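A minimal sketch of building and querying the FAISS index with LangChain (assuming the faiss-cpu package and the `docs` and `embedding_model` objects from the earlier steps):

from langchain.vectorstores import FAISS

# Build the index from the chunked documents and the chosen embedding model
db = FAISS.from_documents(docs, embedding_model)
db.save_local("faiss_index")  # persist the index for later use

# Retrieve the chunks most similar to the query
results = db.similarity_search("LLM as an operating system", k=2)
print(results[0].page_content)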
We can see here that, out of the entire text, we have been able to retrieve the
specific chunk talking about the LLM OS. We'll look at this in detail again in the RAG
pipeline.
The full list of LangChain Vector DB integrations is available in the LangChain documentation.
RAG Pipeline
Now that the knowledge base has been created in the indexing pipeline, the main
generation or RAG pipeline has to be set up to receive the input and
generate the output.
[Diagram: the RAG system, revisited - the user's prompt triggers a search for relevant information in the knowledge sources; the relevant context is combined with the prompt, and the {Prompt + Context} is sent to the LLM endpoint, which returns the generated response]
Generation Steps
User writes a prompt or a query that is passed to an orchestrator
Retriever fetches the relevant information from the knowledge sources and returns it
Orchestrator augments the prompt with the context and sends it to the LLM
LLM responds with the generated text, which is displayed to the user via the orchestrator
The knowledge sources highlighted above have been set up using the indexing
pipeline. These sources can also be served using "on-the-fly" indexing.
Retrieval
Perhaps, the most critical step in the entire RAG value chain is searching and
retrieving the relevant pieces of information (known as documents). When the
user enters a query or a prompt, it is this system (Retriever) that is responsible
for accurately fetching the correct snippet of information that is used in
responding to the user query.
Retrieval Methods

Multi-query Retrieval
Multi-query Retrieval automates prompt tuning using a language
model to generate diverse queries for a user input, retrieving
relevant documents from each query and combining them to
overcome limitations and obtain a more comprehensive set of
results. This approach aims to enhance retrieval performance by
considering multiple perspectives on the same query.
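A minimal sketch of multi-query retrieval with LangChain (assuming the FAISS index `db` built earlier and an OpenAI chat model as the query-generating LLM):

from langchain.chat_models import ChatOpenAI
from langchain.retrievers.multi_query import MultiQueryRetriever

llm = ChatOpenAI(temperature=0)

# The LLM rewrites the user query into several variants;
# the results of all the variant searches are de-duplicated and merged
retriever = MultiQueryRetriever.from_llm(retriever=db.as_retriever(), llm=llm)
retrieved_docs = retriever.get_relevant_documents(
    "What does Andrej Karpathy mean by an LLM operating system?"
)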
Contextual compression
Sometimes, relevant info is hidden in long documents with a lot of
extra stuff. Contextual Compression helps with this by squeezing
down the documents to only the important parts that match your
search.
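A minimal sketch of contextual compression with LangChain (assuming the same `db` and `llm` objects as above):

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# The LLM extracts only the parts of each retrieved document
# that are relevant to the query before they are passed on
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=db.as_retriever()
)
compressed_docs = compression_retriever.get_relevant_documents(
    "LLM as an operating system"
)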
Self Query
A self-querying retriever is a system that can ask itself questions.
When you give it a question in normal language, it uses a special
process to turn that question into a structured query. Then, it uses
this structured query to search through its stored information. This
way, it doesn't just compare your question with the documents; it
also looks for specific details in the documents based on your
question, making the search more efficient and accurate.
Time-weighted Retrieval
This method supplements the semantic similarity search with a time
decay. It gives more weightage to documents that are fresher
or more recently used than to older ones.
Ensemble Techniques
As the term suggests, multiple retrieval methods can be used in
conjunction with each other. There are many ways of implementing
ensemble techniques and use cases will define the structure of the
retriever
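A minimal sketch of an ensemble that combines keyword (BM25) and vector retrieval with LangChain (assuming the rank_bm25 package, the chunked `docs` and the FAISS index `db`):

from langchain.retrievers import BM25Retriever, EnsembleRetriever

# Keyword-based retriever over the same chunks
bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 4

# Weighted combination of keyword and semantic retrieval
ensemble = EnsembleRetriever(
    retrievers=[bm25_retriever, db.as_retriever(search_kwargs={"k": 4})],
    weights=[0.5, 0.5],
)
results = ensemble.get_relevant_documents("LLM as an operating system")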
Similarity vector search differs from a plain similarity search in that the query
is also converted from regular text into a vector embedding before being compared with the stored vectors.
Evaluation
Building a PoC RAG pipeline is not overly complex; LangChain and LlamaIndex
have made it quite simple. Developing highly impressive Large Language Model
(LLM) applications is achievable through brief training and verification on a
limited set of examples. However, to enhance robustness, thorough testing on
a dataset that accurately mirrors the production distribution is imperative.
https://github.com/explodinggradients/ragas
Evaluation Data
To evaluate RAG pipelines, the following four data points are recommended (a minimal evaluation sketch using them follows the list):
Question / Prompt - the user input
Retrieved Context - the information fetched by the retriever
Response / Answer - the text generated by the pipeline
Ground Truth - the known correct response
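A minimal sketch of scoring one record with Ragas (assuming the ragas and datasets packages and an OpenAI key for the judge LLM; column names follow recent Ragas versions and the record reuses the illustrative cricket example from the metric pages below):

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# One evaluation record containing the four data points listed above
data = {
    "question": ["Who won the 2023 ODI Cricket World Cup and when?"],
    "contexts": [[
        "The 2023 ODI Cricket World Cup concluded on 19 November 2023, "
        "with Australia winning the tournament."
    ]],
    "answer": ["Australia won the 2023 ODI Cricket World Cup on 19 November 2023."],
    "ground_truth": ["Australia won the world cup on 19 November, 2023."],
}

result = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)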
Evaluation Metrics
Evaluating Generation
Faithfulness: Is the Response faithful to the Retrieved Context?

Retrieval Evaluation
Context Relevance: Is the Retrieved Context relevant to the Prompt?

Overall Evaluation
Answer Semantic Similarity: Is the Response semantically similar to the Ground Truth?
Answer Correctness: Is the Response semantically and factually similar to the Ground Truth?
Faithfulness
Faithfulness is the measure of the extent to which the response is
factually grounded in the retrieved context
Problem addressed : The LLM, despite being provided the context, does
not consider it
or
Is the response grounded in the provided context?
Methodology
Faithfulness identifies the number of “claims” made in the response and
calculates the proportion of those “claims” present in the context.
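Expressed as a ratio (following the methodology above):
Faithfulness = (Number of claims in the response that can be inferred from the retrieved context) / (Total number of claims in the response)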
Illustrative Example
Query : Who won the 2023 ODI Cricket World Cup and when?
Context : The 2023 ODI Cricket World Cup concluded on 19 November 2023,
with Australia winning the tournament.
Answer Relevance
Answer Relevance is the measure of the extent to which the response is
relevant to the query or the prompt
Problem addressed :The LLM instead of answering the query responds
with irrelevant information
or
Is the response relevant to the query?
Methodology
For this metric, a response is generated for the initial query or prompt
To compute the score, the LLM is then prompted to generate questions
for the generated response several times. The mean cosine similarity
between these questions and the original one is then calculated. The
concept is that if the answer correctly addresses the initial question, the
LLM should generate questions from it that match the original question.
Answer Relevance = Avg( Sc(Original Query, LLM-generated Query[i]) ), where Sc is the cosine similarity between the two queries.
Illustrative Example
Query : Who won the 2023 ODI Cricket World Cup and when?
Response 1 : India won on 19 November 2023
Response 2 : Cricket world cup is held once every four years
Note
Answer Relevance is not a measure of truthfulness but only of relevance. The
response may or may not be factually accurate but may be relevant.
Context Relevance
Context Relevance is the measure of the extent to which the retrieved
context is relevant to the query or the prompt
Problem addressed :The retriever fails to retrieve relevant context
or
Is the retrieved context relevant to the query?
Methodology
The retrieved context should contain information only relevant to the
query or the prompt. For context relevance, a metric ‘S’ is estimated. ‘S’
is the number of sentences in the retrieved context that are relevant for
responding to the query or the prompt.
Context Relevance = S / (Total number of sentences in the retrieved context), where S is the number of sentences in the retrieved context that are relevant for responding to the query or the prompt.
Illustrative Example
Query : Who won the 2023 ODI Cricket World Cup and when?
Ground Truth
Ground truth is information that is known to be real or true. In RAG, or
Generative AI domain in general, Ground Truth is a prepared set of Prompt-
Response examples. It is akin to labelled data in Supervised Learning parlance.
Calculation of certain metrics necessitates the availability of Ground Truth data
Context Recall
Context recall measures the extent to which the retrieved context aligns
with the “provided” answer or Ground Truth
Problem addressed :The retriever fails to retrieve accurate context
or
Is the retrieved context good enough to provide the response?
Methodology
To estimate context recall from the ground truth answer, each sentence
in the ground truth answer is analyzed to determine whether it can be
attributed to the retrieved context or not. Ideally, all sentences in the
ground truth answer should be attributable to the retrieved context.
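Expressed as a ratio (following the methodology above):
Context Recall = (Number of sentences in the ground truth answer that can be attributed to the retrieved context) / (Total number of sentences in the ground truth answer)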
Illustrative Example
Query : Who won the 2023 ODI Cricket World Cup and when?
Ground Truth : Australia won the world cup on 19 November, 2023.
Context Precision
Context Precision is a metric that evaluates whether all of the ground-
truth relevant items present in the contexts are ranked higher or not.
Problem addressed : The retriever fails to rank the retrieved context correctly
or
Is the higher ranked retrieved context better to provide the response?
Methodology
Context Precision evaluates whether all of the ground-truth relevant items
present in the retrieved context documents are ranked higher or not.
Ideally, all the relevant chunks must appear at the top.
Precision@k = True Positives@k / (True Positives@k + False Positives@k)
Precision @ k
Precision@k is a metric used in information retrieval and recommendation
systems to evaluate the accuracy of the top k items retrieved or recommended.
It measures the proportion of relevant items among the top k items.
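For example, if 3 of the top 5 retrieved chunks are relevant, Precision@5 = 3 / (3 + 2) = 0.6.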
Answer semantic similarity
Answer semantic similarity evaluates whether the generated response is
similar to the “provided” response or Ground Truth.
Problem addressed : The generated response is incorrect
or
Does the pipeline generate the right response?
Evaluated Process : Retrieval & Generation
Score Range : (0,1) Higher score is better
Methodology
Answer semantic similarity score is calculated by measuring the
semantic similarity between the generated response and the ground
truth response.
Answer Correctness
Answer correctness evaluates whether the generated response is
semantically and factually similar to the “provided” response or Ground
Truth.
Problem addressed : The generated response is incorrect
or
Does the pipeline generate the right response?
Evaluated Process : Retrieval & Generation
Score Range : (0,1) Higher score is better
Methodology
Answer correctness score is calculated by measuring the semantic and
the factual similarity between the generated response and the ground
truth response.
[Diagram: Ragas synthetic test data generation - a question generator creates seed questions from documents, and a question evolver turns them into reasoning, conditional and multi-context questions to build the evaluation dataset (source: Ragas Documentation)]
[Diagram: the RAG triad - the query, the retrieved context and the answer/response, evaluated on Context Relevance, Groundedness (is the Response faithful to the Retrieved Context?) and Answer Relevance]
Context Relevance:
Verify quality by ensuring each context chunk is relevant to the input query
Groundedness:
Verify groundedness by breaking down the response into individual claims.
Independently search for evidence supporting each claim in the retrieved
context.
Answer Relevance:
Ensure the response effectively addresses the original question.
Verify by evaluating the relevance of the final response to user input.
Trulens Documentation
[Diagram: choosing between RAG and supervised fine-tuning (SFT) - depending on the use case, RAG may be preferred over SFT, SFT may be preferred over RAG, or a hybrid SFT + RAG approach may work best. Noted limitations include working well only with very large foundation models and not addressing the problem of hallucinations]
RAG should be implemented (with or without SFT) if the use case requires
Access to an external data source, especially if the data is dynamic
Resolving hallucinations
Other Considerations
Latency
RAG pipelines require an additional step of searching and retrieving context
which introduces an inherent latency in the system
Scalability
RAG pipelines are modular and therefore can be scaled relatively easily when
compared to SFT. SFT will require retraining the model with each additional data
source
Cost
Both RAG and SFT warrant upfront investment. Training cost for SFT can vary
depending on the technique and the choice of foundation model. Setting up the
knowledge base and integration can be costly for RAG
Expertise
Creating RAG pipelines has become moderately simple with frameworks like
LangChain and LlamaIndex. Fine-tuning on the other hand requires deep
understanding of the techniques and creation of training data
[Diagram: the RAG tech stack - data layer, model layer, prompt layer, evaluation, application/orchestration, deployment, app hosting and monitoring]
Data Layer
The foundation of RAG applications is the data layer. This involves -
Data preparation - Sourcing, Cleaning, Loading & Chunking
Creation of Embeddings
Storing the embeddings in a vector store
We’ve seen this process in the creation of the indexing pipeline
Model Layer
2023 can be considered a year of LLM wars. Almost every other week in the
second half of the year a new model was released. Like there is no RAG without
data, there is no RAG without an LLM. There are four broad categories of LLMs
that can be a part of a RAG application
There are a lot of vendors that have enabled access to open source models (for
example, Falcon and Phi-2 by Microsoft) and also facilitate easy fine-tuning of
these models.
Note : For Open Source models it is important to check the license type. Some
open source models are not available for commercial use
Prompt Layer
Prompt Engineering is more than writing questions in natural language. There are
several prompting techniques and developers need to create prompts tailored
to the use cases. This process often involves experimentation: the developer
creates a prompt, observes the results and then iterates on the prompts to
improve the effectiveness of the app. This requires tracking and collaboration
Evaluation
It is easy to build a RAG pipeline but to get it ready for production involves
robust evaluation of the performance of the pipeline. For checking
hallucinations, relevance and accuracy there are several frameworks and tools
that have come up.
Ragas is one of several popular RAG evaluation frameworks and tools (non-exhaustive).
App Orchestration
A RAG application involves the interaction of multiple tools and services. To run
the RAG pipeline, a solid orchestration framework is required that invokes these
different processes.
Deployment Layer
Deployment of the RAG application can be done on any of the available cloud
providers and platforms. Some important factors to consider during deployment
are -
Security and Governance
Logging
Inference costs and latency
Application Layer
The application finally needs to be hosted for the intended users or systems to
interact with it. You can create your own application layer or use the available
platforms.
Monitoring
The deployed application needs to be continuously monitored for both accuracy and
relevance as well as cost and latency.
Other Considerations
LLM Cache - To reduce costs by saving responses for popular queries
LLM Guardrails - To add an additional layer of scrutiny on generations
Multimodal RAG
Up until now, most AI models have been limited to a single modality (a single type
of data like text or images or video). Recently, there has been significant progress
in AI models being able to handle multiple modalities (mainly text and images).
With the emergence of these Large Multimodal Models (LMMs) a multimodal RAG
system becomes possible.
Approaches

[Diagram 1: at query time, the query/prompt and the retrieved multimodal context are passed to an LMM, which generates a multimodal response]

[Diagram 2: in the indexing pipeline, an LMM generates image captions and text summaries from the loaded data; the resulting text is converted into embeddings and stored in the vector store along with the images]
Naive RAG
At its most basic, Retrieval Augmented Generation can be summarized in three
steps -
1. Indexing of the documents
2. Retrieval of the context with respect to an input query
3. Generation of the response using the input query and retrieved context
[Diagram: Naive RAG - documents are indexed; the user query retrieves the relevant context from the index, and the prompt plus retrieved context is passed to the LLM, which generates the response]
Advanced RAG
To address the inefficiencies of the Naive RAG approach, Advanced RAG
approaches implement strategies focussed on three processes -
Indexing: Chunk Optimisation, Metadata Integration, Indexing Structure, Alignment
Retrieval: Query Rewriting, Sub Queries, Query Routing, Iterative/Recursive/Adaptive Retrieval, Fine-tuned Embeddings (covered below)
Post-Retrieval: Information Compression, Re-ranking
Metadata Integration
Information like dates, purpose, chapter summaries, etc. can be embedded into
chunks. This improves the retriever efficiency by not only searching the
documents but also by assessing the similarity to the metadata.
Indexing Structure
Introduction of graph structures can greatly enhance retrieval by leveraging
nodes and their relationships. Multi-index paths can be created aimed at
increasing efficiency.
Alignment
Understanding complex data, like tables, can be tricky for RAG. One way to
improve the indexing is by using counterfactual training, where we create
hypothetical (what-if) questions. This increases the alignment and reduces
disparity between documents.
Query Rewriting
To bring better alignment between the user query and documents, several
rewriting approaches exist. LLMs are sometimes used to create pseudo
documents from the query for better matching with existing documents.
Sometimes, LLMs perform abstract reasoning. Multi-querying is employed to
solve complex user queries.
Sub Queries
Sub querying involves breaking down a complex query into sub questions for
each relevant data source, then gathering all the intermediate responses and
synthesizing a final response.
Query Routing
A query router identifies a downstream task and decides the subsequent action
that the RAG system should take. During retrieval, the query router also identifies
the most appropriate data source for resolving the query.
Iterative Retrieval
Documents are collected repeatedly based on the query and the generated
response to create a more comprehensive knowledge base.
Recursive Retrieval
Recursive retrieval also iteratively retrieves documents. However, it also refines
the search queries depending on the results obtained from the previous retrieval.
It is like a continuous learning process.
Adaptive Retrieval
Enhance the RAG framework by empowering Language Models (LLMs) to
proactively identify the most suitable moments and content for retrieval. This
refinement aims to improve the efficiency and relevance of the information
obtained, allowing the models to dynamically choose when and what to retrieve,
leading to more precise and effective results
Fine-tuned Embeddings
This process involves tailoring embedding models to improve retrieval accuracy,
particularly in specialized domains dealing with uncommon or evolving terms. The
fine-tuning process utilizes training data generated with language models where
questions grounded in document chunks are generated.
Information Compression
While the retriever is proficient in extracting relevant information from extensive
knowledge bases, managing the vast amount of information within retrieval
documents poses a challenge. The retrieved information is compressed to extract
the most relevant points before passing it to the LLM.
Reranking
The re-ranking model plays a crucial role in optimizing the document set retrieved
by the retriever. The main idea is to rearrange document records to prioritize the
most relevant ones at the top, effectively managing the total number of
documents. This not only resolves challenges related to context window
expansion during retrieval but also improves efficiency and responsiveness.
Modular RAG
The SOTA in Retrieval Augmented Generation is a modular approach which allows
components like search, memory and reranking modules to be configured.

Modules and patterns include Search, Routing, Predict, Retrieve, Read, Demonstrate, Fusion and Memory.
Naive RAG is essentially a Retrieve -> Read approach which focusses on retrieving
information and comprehending it.
Advanced RAG adds to the Retrieve -> Read approach by introducing
Rewrite and Rerank components to improve relevance and groundedness.
Modular RAG takes everything a notch ahead by providing flexibility and adding
modules like Search, Routing, etc.
Naive, Advanced & Modular RAGs are not exclusive approaches but a
progression. Naive RAG is a special case of Advanced which, in turn, is a special
case of Modular RAG
Memory
This module leverages the parametric memory capabilities of the Language Model
(LLM) to guide retrieval. The module may use a retrieval-enhanced generator to
create an unbounded memory pool iteratively, combining the "original question"
and "dual question." By employing a retrieval-enhanced generative model that
improves itself using its own outputs, the text becomes more aligned with the
data distribution during the reasoning process.
Fusion
RAG-Fusion improves traditional search systems by overcoming their limitations
through a multi-query approach. It expands user queries into multiple diverse
perspectives using a Language Model (LLM). This strategy goes beyond capturing
explicit information and delves into uncovering deeper, transformative
knowledge. The fusion process involves conducting parallel vector searches for
both the original and expanded queries, intelligently re-ranking to optimize
results, and pairing the best outcomes with new queries.
Extra Generation
Rather than directly fetching information from a data source, this module
employs the Language Model (LLM) to generate the required context. The content
produced by the LLM is more likely to contain pertinent information, addressing
issues related to repetition and irrelevant details in the retrieved content.
Acknowledgements
Retrieval Augmented Generation continues to be a pivotal approach for any
Generative AI led application and it is only going to grow. There are several
individuals and organisations that have provided learning resources and made
understanding RAG fun.
Hello!
I’m Abhinav...
A data science and AI professional with over 15
years in the industry. Passionate about AI
advancements, I constantly explore emerging
technologies to push the boundaries and create
positive impacts in the world. Let’s build the future,
together!