RAG - A Simple Introduction
Retrieval Augmented Generation
Abhinav Kimothi
What is RAG?
Retrieval Augmented Generation
30th November, 2022 will be remembered as a watershed moment in artificial
intelligence. OpenAI released ChatGPT and the world was mesmerised. Interest in
previously obscure terms like Generative AI and Large Language Models (LLMs)
was unstoppable over the following 12 months.
[Figure: Google Trends - Interest Over Time for "Generative AI" and "Large Language Models" (Nov '22 to Nov '23)]
The Challenge
Make LLMs respond with up-to-date information
Make LLMs not respond with factually inaccurate
information
Make LLMs aware of proprietary information
Providing LLMs with information not in their memory
Providing Context
While model re-training/fine-tuning/reinforcement learning are options that solve
the aforementioned challenges, these approaches are time-consuming and
costly. In the majority of use cases, these costs are prohibitive.
[Diagram: RAG - the user's {Prompt} goes to a Retriever, which searches an external source and fetches the relevant information; the retrieved Context is added and the LLM receives {Prompt + Context}]
Parametric Memory
An LLM has knowledge only of the data it has been trained on. This is also called Parametric Memory (information stored in the model parameters).

Non-Parametric Memory
The Retriever searches and fetches information, from sources such as document repositories, databases and other external sources, that the LLM has not necessarily been trained on. This adds to the LLM's memory and is passed as context in the prompts. It is also called Non-Parametric Memory (information available outside the model parameters).
Expandable to all sources
Easier to update/maintain
Much cheaper than retraining/fine-tuning
The effort lies in creation of the knowledge base
Confidence in Responses
With the context (extra information that is retrieved) made available to the LLM,
the confidence in LLM responses is increased.
Conversational agents
LLMs can be customised to product/service manuals, domain
knowledge, guidelines, etc. using RAG. The agent can also route users to
more specialised agents depending on their query. SearchUnify has an
LLM+RAG powered conversational agent for their users.
Content Generation
The widest use of LLMs has probably been in content generation. Using
RAG, the generation can be personalised to readers, incorporate real-
time trends and be contextually appropriate. Yarnit is an AI based
content marketing platform that uses RAG for multiple tasks.
Personalised Recommendation
Recommendation engines have been a game changer in the digital
economy. LLMs are capable of powering the next evolution in content
recommendations. Check out Aman’s blog on the utility of LLMs in
recommendation systems.
Virtual Assistants
Virtual personal assistants like Siri, Alexa and others plan to use
LLMs to enhance the experience. Coupled with more context on user
behaviour, these assistants can become highly personalised.
RAG Architecture
Let’s revisit the five high level steps of a RAG-enabled system
[Diagram: the RAG system - the user's prompt triggers a search for relevant information in the knowledge sources; the relevant context is combined with the prompt, and the {Prompt + Context} is sent to the LLM endpoint, which returns the generated response]
User writes a prompt or a query that is passed to an orchestrator
Retriever fetches the relevant information from the knowledge sources and sends it back
Orchestrator augments the prompt with the context and sends it to the LLM
LLM responds with the generated text, which is displayed to the user via the orchestrator
Two pipelines become important in setting up the RAG system: the first sets up
the knowledge sources for efficient search and retrieval, and the second covers
the generation steps performed at run time.
Indexing Pipeline
Data for the knowledge base is ingested from the source and indexed. This
involves steps like splitting, creation of embeddings and storage of data.
RAG Pipeline
This involves the actual RAG process, which takes the user query at
run time, retrieves the relevant data from the index and then passes
it to the model.
Indexing Pipeline
The indexing pipeline sets up the knowledge source for the RAG system. It is
generally considered an offline process. However, information can also be
fetched in real time. It involves four primary steps.
Loading: This step involves extracting information from different knowledge sources and loading it into documents.
Splitting: This step involves splitting documents into smaller, manageable chunks. Smaller chunks are easier to search and to use in LLM context windows.
Embedding: This step involves converting text documents into numerical vectors. ML models are mathematical models and therefore require numerical data.
Storing: This step involves storing the embeddings. Vectors are typically stored in Vector Databases, which are best suited for searching.
[Diagram: when the context is fixed, no search is needed - the context is fetched directly and passed with the prompt to the LLM for a response]
Loading Data
As we’ve been discussing, the utility of RAG lies in accessing data from all sorts of
sources. These sources can be -
Websites & HTML pages
Documents like Word, PDF, etc.
Code in Python, Java, etc.
Data in JSON, CSV, etc.
APIs
File Directories
Databases
And many more
The first step is to extract the information present in these source locations.
This is a good time to introduce two popular frameworks that are being used to
develop LLM powered applications: LangChain and LlamaIndex.
LangChain
Use cases: Good for applications that need enhanced AI capabilities, like language understanding tasks and more sophisticated text generation
Features: Stands out for its versatility and adaptability in building robust applications with LLMs
Agents: Makes creating agents using large language models simple through their agents API

LlamaIndex
Use cases: Good for tasks that require text search and retrieval, like information retrieval or content discovery
Features: Excels in data indexing and language model enhancement
Connectors: Provides connectors to access data from databases, external APIs, or other datasets
Both frameworks are rapidly evolving and adding new capabilities every week.
It’s not an either/or situation and you can use both together (or neither).
Loader object (LangChain)
[Document(page_content="Have you ever seen a polar bear
playing bass? Or a robot painted like a Picasso? Didn’t think so.
DALL-E 2 is ....
....
....
.....umans\nand clever systems can work together to make new
things – amplifying our creative potential.", metadata={'source':
'qTgPSKKjfVg', 'title': 'DALL·E 2 Explained', 'description': 'Unknown',
'view_count': 853564, 'thumbnail_url':
'https://i.ytimg.com/vi/qTgPSKKjfVg/hq720.jpg', 'publish_date':
'2022-04-06 00:00:00', 'length': 167, 'author': 'OpenAI'})]
The Document object contains the page_content, which is the transcript extracted
from the YouTube video, as well as the metadata (source, title, description, view count, etc.)
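A minimal sketch of how such a Document can be loaded (assuming LangChain's langchain_community package and the pytube dependency for video metadata; the video URL corresponds to the ID shown in the metadata above):

from langchain_community.document_loaders import YoutubeLoader

# Load the transcript of the video as a LangChain Document,
# with add_video_info populating metadata like title, view_count and publish_date
loader = YoutubeLoader.from_youtube_url(
    "https://www.youtube.com/watch?v=qTgPSKKjfVg",
    add_video_info=True,
)
docs = loader.load()
print(docs[0].page_content[:200])
print(docs[0].metadata)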
Loader object (LlamaIndex)
[Document(id_='17761da4-6a3a-4ce5-8590-c65ee446788f',
embedding=None, metadata={}, excluded_embed_metadata_keys=[],
excluded_llm_metadata_keys=[], relationships={},
hash='6471b3ffe4d3abb1aba2ca99d1d0448e2c3cbd157ddca256fab9fa363e0
9ed85', text='<!doctype html><html lang="en"><head><title data-
rh="true">What is a fine-tuned LLM?. Fine-tuning large language models…
| by Abhinav Kimothi |
…
</body></html>', start_char_idx=None, end_char_idx=None,
text_template='{metadata_str}\n\n{content}', metadata_template='{key}:
{value}', metadata_seperator='\n')]
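A minimal sketch of producing a similar Document with LlamaIndex (assuming the llama-index-readers-web package; the URL is a placeholder for the Medium article shown above):

from llama_index.readers.web import SimpleWebPageReader

# Read the raw HTML of a web page into LlamaIndex Document objects
reader = SimpleWebPageReader(html_to_text=False)
docs = reader.load_data(urls=["https://medium.com/..."])  # placeholder URL
print(docs[0].text[:200])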
Both LangChain and LlamaIndex offer loader integrations with more than a
hundred data sources and the list keeps on growing
LlamaIndex: https://docs.llamaindex.ai/en/stable/
LangChain: https://python.langchain.com/docs/get_started/introduction
Document Splitting
Once the data is loaded, the next step in the indexing pipeline is splitting the
documents into manageable chunks. Why is splitting of documents necessary? There are
two reasons: smaller chunks are easier to search accurately, and they fit within the limited context windows of LLMs.
Chunking Strategies
While splitting documents into chunks might sound like a simple concept, there are
certain best practices that researchers have discovered. There are a few
considerations that may influence the overall chunking strategy.
Nature of Content
Consider whether you are working with lengthy documents, such as articles or
books, or shorter content like tweets or instant messages. The nature of the content
influences the choice of model for your goal and, consequently, the appropriate
chunking strategy.
Chunking Methods
Depending on the aforementioned considerations, a number of text splitters are
available. At a broad level, text splitters operate in the following manner:
Divide the text into compact, semantically meaningful units, often sentences.
Merge these smaller units into larger chunks until a specific size is achieved,
measured by a length function.
Upon reaching the predetermined size, treat that chunk as an independent
segment of text. Thereafter, start creating a new text chunk with some degree
of overlap to maintain contextual continuity between chunks.
A very common approach is where we pre-determine the size of the text chunks.
This approach is simple and cheap and is, therefore, widely used. Let’s look at
some examples -
Split by Character
In this approach, the text is split based on a character and the chunk size is
measured by the number of characters.
texts[0]
“TITLE: Alice's Adventures in Wonderland\nAUTHOR: Lewis Carroll\n\n\n CHAPTER I \n( Down the
Rabbit-Hole )\n\n Alice was beginning to get very tired of sitting by her sister\non the bank, and of
having nothing to do: once or twice she had\npeeped into the book her sister was reading, but it
had no\npictures or conversations in it, `and what is the use of a book,'\nthought Alice `without
pictures or conversation?'\n\n So she was considering in her own mind (as well as she could,\nfor
the hot day made her feel very sleepy and stupid), whether\nthe pleasure of making a daisy-chain
would be worth the trouble\nof getting up and picking the daisies, when suddenly a White\nRabbit
with pink eyes ran close by her.\n\n There was nothing so VERY remarkable in that; nor did
Alice\nthink it so VERY much out of the way to hear the Rabbit say to\nitself, `Oh dear! Oh dear! I
shall be late!' (when she thought\nit over afterwards, it occurred to her that she ought to
have\nwondered at this, but at the time it all seemed quite natural);\nbut when the Rabbit actually
TOOK A WATCH OUT OF ITS WAISTCOAT-\nPOCKET, and looked at it, and then hurried on, Alice
started to\nher feet, for it flashed across her mind that she had never\nbefore seen a rabbit with
either a waistcoat-pocket, or a watch to\ntake out of it, and burning with curiosity, she ran across
the\nfield after it, and fortunately was just in time to see it pop\ndown a large rabbit-hole under the
hedge.\n\n In another moment down went Alice after it, never once\nconsidering how in the world
she was to get out again.\n\n The rabbit-hole went straight on like a tunnel for some way,\nand
then dipped suddenly down, so suddenly that Alice had not a\nmoment to think about stopping
herself before she found herself\nfalling down a very deep well."
Overlap - note that texts[1] below begins with the same text that ends texts[0] above.
texts[1]
"In another moment down went Alice after it, never once\nconsidering how in the world she was to
get out again.\n\n The rabbit-hole went straight on like a tunnel for some way,\nand then dipped
suddenly down, so suddenly that Alice had not a\nmoment to think about stopping herself before
she found herself\nfalling down a very deep well.\n\n Either the well was very deep, or she fell very
slowly, for she\nhad plenty of time as she went down to look about her and to\nwonder what was
going to happen next. First, she tried to look\ndown and make out what she was coming to, but it
was too dark to\nsee anything; then she looked at the sides of the
well, and\nnoticed that they were
filled with cupboards and book-shelves;\nhere and there she saw maps and pictures hung upon
pegs. She\ntook down a jar from one of the shelves as she passed; it was\nlabelled `ORANGE
MARMALADE', but to her great disappointment it\nwas empty: she did not like to drop the jar for
fear of killing\nsomebody, so managed to put it into one of the cupboards as she\nfell past it.”
These chunks can be produced with CharacterTextSplitter or with RecursiveCharacterTextSplitter (see the sketch below).
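A minimal sketch of the two splitters (assuming LangChain's text splitters and a local copy of the book; chunk size and overlap are illustrative):

from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter

with open("alice_in_wonderland.txt") as f:
    alice = f.read()

# Split on a single separator; chunk size is measured in characters
char_splitter = CharacterTextSplitter(
    separator="\n\n", chunk_size=1500, chunk_overlap=300, length_function=len
)
texts = char_splitter.split_text(alice)

# Recursively fall back through separators ("\n\n", "\n", " ", "")
# so chunks stay close to the target size even for long paragraphs
recursive_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=300)
texts_recursive = recursive_splitter.split_text(alice)

print(texts[0])  # Chunk 1
print(texts[1])  # Chunk 2, overlapping with the end of Chunk 1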
Split by Tokens
For those well versed with Large Language Models, tokens are not a new concept.
All LLMs have a token limit on their respective context windows which we cannot
exceed. It is therefore a good idea to count tokens while creating chunks. All
LLMs also have their own tokenizers.
Tiktoken Tokenizer
The tiktoken tokenizer was created by OpenAI for its family of models. Using
this strategy, the split still happens based on a character; however, the length
of the chunk is determined by the number of tokens.
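A minimal sketch of tiktoken-based splitting via LangChain (encoding name and chunk size are illustrative; `transcript` is assumed to be the loaded text):

from langchain.text_splitter import CharacterTextSplitter

# Chunk length is measured in tiktoken tokens rather than characters
splitter = CharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # the encoding used by recent OpenAI models
    chunk_size=500,
    chunk_overlap=0,
)
texts = splitter.split_text(transcript)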
Hugging Face has become the go-to platform for anyone building apps using LLMs
or even other models. All models available via Hugging Face are also accompanied
by their tokenizers.
texts[0]
“hi everyone so recently I gave a 30-minute talk on large language
models just kind of like an intro talk um unfortunately that talk
was not recorded but a lot of people came to me after the talk and
they told me that uh they
really liked the talk so I would just I
thought I would just re-record it and basically put it up on
YouTube so here we go the busy person's intro to large language
models director Scott okay so let's begin first of all what is a large
language model
No Overlap as specified
texts[1]
really well a large language model is just two files right um there
be two files in this hypothetical directory so for example work with
the specific example of the Llama 270b model this is a large
language model released by meta Ai and this is basically the Llama
series of language models the second iteration of it and this is the
70 billion parameter model of uh of this series so there's multiple
models uh belonging to the Lama 2 Series uh 7 billion um 13 billion
34 billion and 70 billion is the the
https://huggingface.co/docs/transformers/tokenizer_summary
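A minimal sketch of counting chunk length with a Hugging Face tokenizer (assuming the transformers package; GPT-2's tokenizer is used purely as an example):

from transformers import AutoTokenizer
from langchain.text_splitter import CharacterTextSplitter

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Chunk length is measured by the Hugging Face tokenizer's token count
splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer, chunk_size=500, chunk_overlap=0
)
texts = splitter.split_text(transcript)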
Other Tokenizers
Other libraries like spaCy, NLTK and SentenceTransformers also provide splitters.
Specialized Chunking
Chunking often aims to keep text with common context together. With this in
mind, we might want to specifically honour the structure of the document itself,
for example HTML, Markdown, LaTeX or even code.
Example : https://medium.com/p/29a7e8610843
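A minimal sketch of structure-aware splitting for Markdown and Python code with LangChain (header names, chunk sizes and the input variables are illustrative):

from langchain.text_splitter import (
    Language,
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

# Split a Markdown document on its headers so each chunk keeps its section context
md_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2")]
)
md_chunks = md_splitter.split_text(markdown_text)

# Split Python code along function/class boundaries instead of arbitrary characters
code_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=1000, chunk_overlap=100
)
code_chunks = code_splitter.split_text(python_source)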
Test different chunk sizes. Create embeddings for the chosen chunk sizes and
store them in your index or indices. Run a series of queries to evaluate quality
and compare the performance of different chunk sizes.
Embeddings
All Machine Learning/AI models work with numerical data. Before any operation
can be performed, text, image, audio and video data has to be
transformed into a numerical representation. Embeddings are vector
representations of data that capture meaningful relationships between entities.
As a general definition, embeddings are data that has been transformed into n-
dimensional matrices for use in deep learning computations. A word embedding is
a vector representation of a word.
The process of embedding transforms data (like text) into vectors and compresses
the input information, resulting in an embedding space specific to the training
data.
The good news for anyone building RAG applications is that embeddings, once
created, can also generalize to other tasks and domains through transfer learning
(the ability to switch contexts), which is one of the reasons embeddings have
exploded in popularity across machine learning applications.
Some popular embedding models -

Word2Vec: one of the earliest widely used word embedding models. The official paper -
https://arxiv.org/pdf/1301.3781.pdf

ELMo: Embeddings from Language Models, learnt from the internal
states of a bidirectional LSTM. The official paper -
https://arxiv.org/pdf/1802.05365.pdf

ada v2 by OpenAI: used by the GPT series of models

textembedding-gecko by Google
Another important consideration is cost. With OpenAI models you can incur
significant costs if you are working with a lot of documents. The cost of open
source models will depend on the implementation.
Creating Embeddings
Once you’ve chosen your embedding model, there are several ways of creating
the embeddings. Sometimes our friends LlamaIndex and LangChain come in
pretty handy to convert documents (split into chunks) into vector embeddings.
Other times you can use the service from a provider directly or get the
embeddings from Hugging Face.
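A minimal sketch of creating embeddings for the split chunks with LangChain's OpenAI wrapper (assuming an OPENAI_API_KEY in the environment and that `docs` holds the chunked documents from the splitting step):

from langchain.embeddings import OpenAIEmbeddings

embedding_model = OpenAIEmbeddings(model="text-embedding-ada-002")

# Embed every chunk of the transcript in one call
chunk_texts = [doc.page_content for doc in docs]
vectors = embedding_model.embed_documents(chunk_texts)

# A query is embedded the same way at retrieval time
query_vector = embedding_model.embed_query("What is a large language model?")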
Example Response: the API returns the embedding vectors along with the token usage (1014 tokens in this example).
Cost
In this example, 1014 tokens will cost about $0.0001. Recall that for this YouTube
transcript we got 14 chunks, so creating the embeddings for the entire transcript
will cost about 0.14 cents. This may seem low, but when you scale up to
thousands of documents being updated frequently, the cost can become a
concern.
Example : msmarco-bert-base-dot-v5
using HuggingFaceEmbeddings from langchain.embeddings
Example : embed-english-light-v3.0
using CohereEmbeddings from langchain.embeddings
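Minimal sketches of the two examples above using the LangChain wrappers (assuming the sentence-transformers package for the first and a Cohere API key for the second; `chunk_texts` is the list of chunk strings from earlier):

from langchain.embeddings import CohereEmbeddings, HuggingFaceEmbeddings

# msmarco-bert-base-dot-v5 via sentence-transformers (runs locally)
hf_embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/msmarco-bert-base-dot-v5"
)
hf_vectors = hf_embeddings.embed_documents(chunk_texts)

# embed-english-light-v3.0 via the Cohere API
cohere_embeddings = CohereEmbeddings(
    model="embed-english-light-v3.0", cohere_api_key="YOUR_API_KEY"
)
cohere_vectors = cohere_embeddings.embed_documents(chunk_texts)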
Storing
We are at the last step of the indexing pipeline. We have loaded and split
the data, and created the embeddings. Now, for us to be able to use the
information repeatedly, we need to store it so that it can be accessed on demand.
For this, we use a special kind of database called a Vector Database.
A stripped-down variant of a Vector Database is a Vector Index like FAISS (Facebook
AI Similarity Search). It is this vector indexing that improves the search and
retrieval of vector embeddings. Vector Databases augment the indexing with
typical database features like data management, metadata storage, scalability,
integrations, security, etc.
Evaluate data durability and integrity requirements vs the need for fast query
performance. Additional persistence safeguards can reduce speed.
Assess tradeoffs between local storage speed and access vs cloud storage
benefits like security, redundancy and scalability.
Cost considerations - while you may incur a regular cost with a fully managed
solution, a self-hosted one might prove costlier if not managed well
There are many more Vector DBs. For a comprehensive understanding of the pros
and cons of each, this blog is highly recommended
Now that our knowledge base is ready, let's quickly see it in action by performing a
search on the FAISS index we've just created.
Similarity search
In the YouTube video for which we have indexed the transcript, Andrej Karpathy
talks about the idea of an LLM as an operating system. Let's perform a search on this.
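A minimal sketch of building and querying the FAISS index with LangChain (assuming the faiss-cpu package and the `docs` and `embedding_model` objects from the earlier steps):

from langchain.vectorstores import FAISS

# Build the index from the chunked documents and the chosen embedding model
db = FAISS.from_documents(docs, embedding_model)
db.save_local("faiss_index")  # persist the index for later use

# Retrieve the chunks most similar to the query
results = db.similarity_search("LLM as an operating system", k=2)
print(results[0].page_content)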
We can see here that, out of the entire text, we have been able to retrieve the
specific chunk talking about the LLM OS. We'll look at this in detail again in the RAG
pipeline.
The full list of LangChain Vector DB integrations is available in the LangChain documentation.
RAG Pipeline
Now that the knowledge base has been created in the indexing pipeline, the main
generation or RAG pipeline has to be set up to receive the input and
generate the output.
[Diagram: the RAG system, revisited - the user's prompt triggers a search for relevant information in the knowledge sources; the relevant context is combined with the prompt, and the {Prompt + Context} is sent to the LLM endpoint, which returns the generated response]
Generation Steps
User writes a prompt or a query that is passed to an orchestrator
Retriever fetches the relevant information from the knowledge sources and returns it
Orchestrator augments the prompt with the context and sends it to the LLM
LLM responds with the generated text, which is displayed to the user via the orchestrator
The knowledge sources highlighted above have been set up using the indexing
pipeline. These sources can also be served using "on-the-fly" indexing.
Retrieval
Perhaps, the most critical step in the entire RAG value chain is searching and
retrieving the relevant pieces of information (known as documents). When the
user enters a query or a prompt, it is this system (Retriever) that is responsible
for accurately fetching the correct snippet of information that is used in
responding to the user query.
Retrieval Methods

Multi-query Retrieval
Multi-query Retrieval automates prompt tuning using a language
model to generate diverse queries for a user input, retrieving
relevant documents from each query and combining them to
overcome limitations and obtain a more comprehensive set of
results. This approach aims to enhance retrieval performance by
considering multiple perspectives on the same query.
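A minimal sketch of multi-query retrieval with LangChain (assuming the FAISS index `db` built earlier and an OpenAI chat model as the query-generating LLM):

from langchain.chat_models import ChatOpenAI
from langchain.retrievers.multi_query import MultiQueryRetriever

llm = ChatOpenAI(temperature=0)

# The LLM rewrites the user query into several variants;
# the results of all the variant searches are de-duplicated and merged
retriever = MultiQueryRetriever.from_llm(retriever=db.as_retriever(), llm=llm)
retrieved_docs = retriever.get_relevant_documents(
    "What does Andrej Karpathy mean by an LLM operating system?"
)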
Contextual compression
Sometimes, relevant info is hidden in long documents with a lot of
extra stuff. Contextual Compression helps with this by squeezing
down the documents to only the important parts that match your
search.
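A minimal sketch of contextual compression with LangChain (assuming the same `db` and `llm` objects as above):

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# The LLM extracts only the parts of each retrieved document
# that are relevant to the query before they are passed on
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=db.as_retriever()
)
compressed_docs = compression_retriever.get_relevant_documents(
    "LLM as an operating system"
)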
Self Query
A self-querying retriever is a system that can ask itself questions.
When you give it a question in normal language, it uses a special
process to turn that question into a structured query. Then, it uses
this structured query to search through its stored information. This
way, it doesn't just compare your question with the documents; it
also looks for specific details in the documents based on your
question, making the search more efficient and accurate.
Time-weighted Retrieval
This method supplements the semantic similarity search with a time
decay. It gives more weightage to documents that are fresher
or more recently used than to older ones.
Ensemble Techniques
As the term suggests, multiple retrieval methods can be used in
conjunction with each other. There are many ways of implementing
ensemble techniques and use cases will define the structure of the
retriever
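A minimal sketch of an ensemble that combines keyword (BM25) and vector retrieval with LangChain (assuming the rank_bm25 package, the chunked `docs` and the FAISS index `db`):

from langchain.retrievers import BM25Retriever, EnsembleRetriever

# Keyword-based retriever over the same chunks
bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 4

# Weighted combination of keyword and semantic retrieval
ensemble = EnsembleRetriever(
    retrievers=[bm25_retriever, db.as_retriever(search_kwargs={"k": 4})],
    weights=[0.5, 0.5],
)
results = ensemble.get_relevant_documents("LLM as an operating system")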
Similarity vector search differs from a plain similarity search in that the query
is also converted from regular text into a vector embedding before being compared with the stored vectors.
Evaluation
Building a PoC RAG pipeline is not overly complex; LangChain and LlamaIndex
have made it quite simple. Developing highly impressive Large Language Model
(LLM) applications is achievable through brief training and verification on a
limited set of examples. However, to enhance robustness, thorough testing on
a dataset that accurately mirrors the production distribution is imperative.
https://github.com/explodinggradients/ragas
Evaluation Data
To evaluate RAG pipelines, the following four data points are recommended (a minimal evaluation sketch using them follows the list):
Question / Prompt - the user input
Retrieved Context - the information fetched by the retriever
Response / Answer - the text generated by the pipeline
Ground Truth - the known correct response
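A minimal sketch of scoring one record with Ragas (assuming the ragas and datasets packages and an OpenAI key for the judge LLM; column names follow recent Ragas versions and the record reuses the illustrative cricket example from the metric pages below):

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# One evaluation record containing the four data points listed above
data = {
    "question": ["Who won the 2023 ODI Cricket World Cup and when?"],
    "contexts": [[
        "The 2023 ODI Cricket World Cup concluded on 19 November 2023, "
        "with Australia winning the tournament."
    ]],
    "answer": ["Australia won the 2023 ODI Cricket World Cup on 19 November 2023."],
    "ground_truth": ["Australia won the world cup on 19 November, 2023."],
}

result = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)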
Evaluation Metrics
Evaluating Generation
Faithfulness: Is the Response faithful to the Retrieved Context?

Retrieval Evaluation
Context Relevance: Is the Retrieved Context relevant to the Prompt?

Overall Evaluation
Answer Semantic Similarity: Is the Response semantically similar to the Ground Truth?
Answer Correctness: Is the Response semantically and factually similar to the Ground Truth?
Faithfulness
Faithfulness is the measure of the extent to which the response is
factually grounded in the retrieved context
Problem addressed : The LLM, despite being provided the context, does
not consider it
or
Is the response grounded in the provided context?
Methodology
Faithfulness identifies the number of “claims” made in the response and
calculates the proportion of those “claims” present in the context.
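Expressed as a ratio (following the methodology above):
Faithfulness = (Number of claims in the response that can be inferred from the retrieved context) / (Total number of claims in the response)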
Illustrative Example
Query : Who won the 2023 ODI Cricket World Cup and when?
Context : The 2023 ODI Cricket World Cup concluded on 19 November 2023,
with Australia winning the tournament.
Answer Relevance
Answer Relevance is the measure of the extent to which the response is
relevant to the query or the prompt
Problem addressed :The LLM instead of answering the query responds
with irrelevant information
or
Is the response relevant to the query?
Methodology
For this metric, a response is generated for the initial query or prompt
To compute the score, the LLM is then prompted to generate questions
for the generated response several times. The mean cosine similarity
between these questions and the original one is then calculated. The
concept is that if the answer correctly addresses the initial question, the
LLM should generate questions from it that match the original question.
Answer Relevance = Avg( Sc(Original Query, LLM-generated Query[i]) ), where Sc is the cosine similarity between the two queries.
Illustrative Example
Query : Who won the 2023 ODI Cricket World Cup and when?
Response 1 : India won on 19 November 2023
Response 2 : Cricket world cup is held once every four years
Note
Answer Relevance is not a measure of truthfulness but only of relevance. The
response may or may not be factually accurate but may be relevant.
Context Relevance
Context Relevance is the measure of the extent to which the retrieved
context is relevant to the query or the prompt
Problem addressed :The retriever fails to retrieve relevant context
or
Is the retrieved context relevant to the query?
Methodology
The retrieved context should contain information only relevant to the
query or the prompt. For context relevance, a metric ‘S’ is estimated. ‘S’
is the number of sentences in the retrieved context that are relevant for
responding to the query or the prompt.
Context Relevance = S / (Total number of sentences in the retrieved context), where S is the number of sentences in the retrieved context that are relevant for responding to the query or the prompt.
Illustrative Example
Query : Who won the 2023 ODI Cricket World Cup and when?
Ground Truth
Ground truth is information that is known to be real or true. In RAG, or
Generative AI domain in general, Ground Truth is a prepared set of Prompt-
Response examples. It is akin to labelled data in Supervised Learning parlance.
Calculation of certain metrics necessitates the availability of Ground Truth data
Context Recall
Context recall measures the extent to which the retrieved context aligns
with the “provided” answer or Ground Truth
Problem addressed :The retriever fails to retrieve accurate context
or
Is the retrieved context good enough to provide the response?
Methodology
To estimate context recall from the ground truth answer, each sentence
in the ground truth answer is analyzed to determine whether it can be
attributed to the retrieved context or not. Ideally, all sentences in the
ground truth answer should be attributable to the retrieved context.
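Expressed as a ratio (following the methodology above):
Context Recall = (Number of sentences in the ground truth answer that can be attributed to the retrieved context) / (Total number of sentences in the ground truth answer)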
Illustrative Example
Query : Who won the 2023 ODI Cricket World Cup and when?
Ground Truth : Australia won the world cup on 19 November, 2023.
Context Precision
Context Precision is a metric that evaluates whether all of the ground-
truth relevant items present in the contexts are ranked higher or not.
Problem addressed : The retriever fails to rank the retrieved context correctly
or
Is the higher ranked retrieved context better to provide the response?
Methodology
Context Precision evaluates whether all of the ground-truth relevant items
present in the retrieved context documents are ranked higher or not.
Ideally, all the relevant chunks must appear at the top.
Precision@k = True Positives@k / (True Positives@k + False Positives@k)
Precision @ k
Precision@k is a metric used in information retrieval and recommendation
systems to evaluate the accuracy of the top k items retrieved or recommended.
It measures the proportion of relevant items among the top k items.
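For example, if 3 of the top 5 retrieved chunks are relevant, Precision@5 = 3 / (3 + 2) = 0.6.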
Answer semantic similarity
Answer semantic similarity evaluates whether the generated response is
similar to the “provided” response or Ground Truth.
Problem addressed : The generated response is incorrect
or
Does the pipeline generate the right response?
Evaluated Process : Retrieval & Generation
Score Range : (0,1) Higher score is better
Methodology
Answer semantic similarity score is calculated by measuring the
semantic similarity between the generated response and the ground
truth response.
Answer Correctness
Answer correctness evaluates whether the generated response is
semantically and factually similar to the “provided” response or Ground
Truth.
Problem addressed : The generated response is incorrect
or
Does the pipeline generate the right response?
Evaluated Process : Retrieval & Generation
Score Range : (0,1) Higher score is better
Methodology
Answer correctness score is calculated by measuring the semantic and
the factual similarity between the generated response and the ground
truth response.
[Diagram: Ragas synthetic test data generation - a question generator creates seed questions from documents, and a question evolver turns them into reasoning, conditional and multi-context questions to build the evaluation dataset (source: Ragas Documentation)]
[Diagram: the RAG triad - the query, the retrieved context and the answer/response, evaluated on Context Relevance, Groundedness (is the Response faithful to the Retrieved Context?) and Answer Relevance]
Context Relevance:
Verify quality by ensuring each context chunk is relevant to the input query
Groundedness:
Verify groundedness by breaking down the response into individual claims.
Independently search for evidence supporting each claim in the retrieved
context.
Answer Relevance:
Ensure the response effectively addresses the original question.
Verify by evaluating the relevance of the final response to user input.
Trulens Documentation
[Diagram: choosing between RAG and supervised fine-tuning (SFT) - depending on the use case, RAG may be preferred over SFT, SFT may be preferred over RAG, or a hybrid SFT + RAG approach may work best. Noted limitations include working well only with very large foundation models and not addressing the problem of hallucinations]
RAG should be implemented (with or without SFT) if the use case requires
Access to an external data source, especially if the data is dynamic
Resolving hallucinations
Other Considerations
Latency
RAG pipelines require an additional step of searching and retrieving context
which introduces an inherent latency in the system
Scalability
RAG pipelines are modular and therefore can be scaled relatively easily when
compared to SFT. SFT will require retraining the model with each additional data
source
Cost
Both RAG and SFT warrant upfront investment. Training cost for SFT can vary
depending on the technique and the choice of foundation model. Setting up the
knowledge base and integration can be costly for RAG
Expertise
Creating RAG pipelines has become moderately simple with frameworks like
LangChain and LlamaIndex. Fine-tuning on the other hand requires deep
understanding of the techniques and creation of training data
[Diagram: the RAG tech stack - data layer, model layer, prompt layer, evaluation, application/orchestration, deployment, app hosting and monitoring]
Data Layer
The foundation of RAG applications is the data layer. This involves -
Data preparation - Sourcing, Cleaning, Loading & Chunking
Creation of Embeddings
Storing the embeddings in a vector store
We’ve seen this process in the creation of the indexing pipeline
Model Layer
2023 can be considered a year of LLM wars. Almost every other week in the
second half of the year a new model was released. Like there is no RAG without
data, there is no RAG without an LLM. There are four broad categories of LLMs
that can be a part of a RAG application
There are a lot of vendors that have enabled access to open source models (for
example, Falcon and Phi-2 by Microsoft) and also facilitate easy fine-tuning of
these models.
Note : For Open Source models it is important to check the license type. Some
open source models are not available for commercial use
Prompt Layer
Prompt Engineering is more than writing questions in natural language. There are
several prompting techniques and developers need to create prompts tailored
to the use cases. This process often involves experimentation: the developer
creates a prompt, observes the results and then iterates on the prompts to
improve the effectiveness of the app. This requires tracking and collaboration
Evaluation
It is easy to build a RAG pipeline but to get it ready for production involves
robust evaluation of the performance of the pipeline. For checking
hallucinations, relevance and accuracy there are several frameworks and tools
that have come up.
Ragas is one of several popular RAG evaluation frameworks and tools (non-exhaustive).
App Orchestration
A RAG application involves the interaction of multiple tools and services. To run
the RAG pipeline, a solid orchestration framework is required that invokes these
different processes.
Deployment Layer
Deployment of the RAG application can be done on any of the available cloud
providers and platforms. Some important factors to consider during deployment
are -
Security and Governance
Logging
Inference costs and latency
Application Layer
The application finally needs to be hosted for the intended users or systems to
interact with it. You can create your own application layer or use the available
platforms.
Monitoring
The deployed application needs to be continuously monitored for both accuracy and
relevance as well as cost and latency.
Other Considerations
LLM Cache - To reduce costs by saving responses for popular queries
LLM Guardrails - To add an additional layer of scrutiny on generations
Multimodal RAG
Up until now, most AI models have been limited to a single modality (a single type
of data like text or images or video). Recently, there has been significant progress
in AI models being able to handle multiple modalities (mainly text and images).
With the emergence of these Large Multimodal Models (LMMs) a multimodal RAG
system becomes possible.
Approaches

[Diagram 1: at query time, the query/prompt and the retrieved multimodal context are passed to an LMM, which generates a multimodal response]

[Diagram 2: in the indexing pipeline, an LMM generates image captions and text summaries from the loaded data; the resulting text is converted into embeddings and stored in the vector store along with the images]
Naive RAG
At its most basic, Retrieval Augmented Generation can be summarized in three
steps -
1. Indexing of the documents
2. Retrieval of the context with respect to an input query
3. Generation of the response using the input query and retrieved context
[Diagram: Naive RAG - documents are indexed; the user query retrieves the relevant context from the index, and the prompt plus retrieved context is passed to the LLM, which generates the response]
Advanced RAG
To address the inefficiencies of the Naive RAG approach, Advanced RAG
approaches implement strategies focussed on three processes -
Indexing: Chunk Optimisation, Metadata Integration, Indexing Structure, Alignment
Retrieval: Query Rewriting, Sub Queries, Query Routing, Iterative/Recursive/Adaptive Retrieval, Fine-tuned Embeddings (covered below)
Post-Retrieval: Information Compression, Re-ranking
Metadata Integration
Information like dates, purpose, chapter summaries, etc. can be embedded into
chunks. This improves the retriever efficiency by not only searching the
documents but also by assessing the similarity to the metadata.
Indexing Structure
Introduction of graph structures can greatly enhance retrieval by leveraging
nodes and their relationships. Multi-index paths can be created aimed at
increasing efficiency.
Alignment
Understanding complex data, like tables, can be tricky for RAG. One way to
improve the indexing is by using counterfactual training, where we create
hypothetical (what-if) questions. This increases the alignment and reduces
disparity between documents.
Query Rewriting
To bring better alignment between the user query and documents, several
rewriting approaches exist. LLMs are sometimes used to create pseudo
documents from the query for better matching with existing documents.
Sometimes, LLMs perform abstract reasoning. Multi-querying is employed to
solve complex user queries.
Sub Queries
Sub querying involves breaking down a complex query into sub questions for
each relevant data source, then gathering all the intermediate responses and
synthesizing a final response.
Query Routing
A query router identifies a downstream task and decides the subsequent action
that the RAG system should take. During retrieval, the query router also identifies
the most appropriate data source for resolving the query.
Iterative Retrieval
Documents are collected repeatedly based on the query and the generated
response to create a more comprehensive knowledge base.
Recursive Retrieval
Recursive retrieval also iteratively retrieves documents. However, it also refines
the search queries depending on the results obtained from the previous retrieval.
It is like a continuous learning process.
Adaptive Retrieval
Enhance the RAG framework by empowering Language Models (LLMs) to
proactively identify the most suitable moments and content for retrieval. This
refinement aims to improve the efficiency and relevance of the information
obtained, allowing the models to dynamically choose when and what to retrieve,
leading to more precise and effective results
Fine-tuned Embeddings
This process involves tailoring embedding models to improve retrieval accuracy,
particularly in specialized domains dealing with uncommon or evolving terms. The
fine-tuning process utilizes training data generated with language models where
questions grounded in document chunks are generated.
Information Compression
While the retriever is proficient in extracting relevant information from extensive
knowledge bases, managing the vast amount of information within retrieval
documents poses a challenge. The retrieved information is compressed to extract
the most relevant points before passing it to the LLM.
Reranking
The re-ranking model plays a crucial role in optimizing the document set retrieved
by the retriever. The main idea is to rearrange document records to prioritize the
most relevant ones at the top, effectively managing the total number of
documents. This not only resolves challenges related to context window
expansion during retrieval but also improves efficiency and responsiveness.
Modular RAG
The SOTA in Retrieval Augmented Generation is a modular approach which allows
components like search, memory and reranking modules to be configured.

Modules and patterns include Search, Routing, Predict, Retrieve, Read, Demonstrate, Fusion and Memory.
Naive RAG is essentially a Retrieve -> Read approach which focusses on retrieving
information and comprehending it.
Advanced RAG adds to the Retrieve -> Read approach by introducing
Rewrite and Rerank components to improve relevance and groundedness.
Modular RAG takes everything a notch ahead by providing flexibility and adding
modules like Search, Routing, etc.
Naive, Advanced & Modular RAGs are not exclusive approaches but a
progression. Naive RAG is a special case of Advanced which, in turn, is a special
case of Modular RAG
Memory
This module leverages the parametric memory capabilities of the Language Model
(LLM) to guide retrieval. The module may use a retrieval-enhanced generator to
create an unbounded memory pool iteratively, combining the "original question"
and "dual question." By employing a retrieval-enhanced generative model that
improves itself using its own outputs, the text becomes more aligned with the
data distribution during the reasoning process.
Fusion
RAG-Fusion improves traditional search systems by overcoming their limitations
through a multi-query approach. It expands user queries into multiple diverse
perspectives using a Language Model (LLM). This strategy goes beyond capturing
explicit information and delves into uncovering deeper, transformative
knowledge. The fusion process involves conducting parallel vector searches for
both the original and expanded queries, intelligently re-ranking to optimize
results, and pairing the best outcomes with new queries.
Extra Generation
Rather than directly fetching information from a data source, this module
employs the Language Model (LLM) to generate the required context. The content
produced by the LLM is more likely to contain pertinent information, addressing
issues related to repetition and irrelevant details in the retrieved content.
Acknowledgements
Retrieval Augmented Generation continues to be a pivotal approach for any
Generative AI led application and it is only going to grow. There are several
individuals and organisations that have provided learning resources and made
understanding RAG fun.
Hello!
I’m Abhinav...
A data science and AI professional with over 15
years in the industry. Passionate about AI
advancements, I constantly explore emerging
technologies to push the boundaries and create
positive impacts in the world. Let’s build the future,
together!