
The Definitive Guide to Testing LLM Applications
Table of Contents

0.0 Introduction
1.0 Testing techniques across the app development cycle
2.0 Design Phase
    Use Case: Self-corrective code generation
    Use Case: Self-corrective RAG
3.0 Pre-Production
    Prerequisites: Build a dataset & Define evaluation criteria
    Advanced: Pairwise Evaluation
    Advanced: Few-shot feedback for LLM-as-Judge evaluators
    Regression testing
    Use Case: Testing agents
    Implementation: Integrating into your CI flow
4.0 Post-Production
    Prerequisites: Set up tracing & Collect feedback in production
    Use Case: Evaluating a RAG application in production
    Bootstrapping
5.0 Conclusion
6.0 Glossary


INTRODUCTION

Why testing matters for LLM applications


While Large Language Models (LLMs) promise to solve previously unthinkable
tasks, they have also introduced new challenges for successful deployment.
First, LLMs are non-deterministic, generating a distribution of possible
outcomes from a single input. This can lead to inconsistent or hallucinated
outputs. Additionally, LLMs ingest arbitrary text, forcing developers to grapple
with a broad domain of possible user inputs. And, finally, LLMs output natural
language, often requiring new metrics to assess style or accuracy.

With these challenges in mind, we find that companies must develop new
testing approaches to guard against harmful / misleading responses or even
embarrassing brand moments caused by their LLM app.

Follow along with LangSmith

Throughout this guide, we'll refer to LangSmith, our platform for testing and monitoring LLM applications. If you'd like to follow along on your own account, sign up for free at smith.langchain.com.



At LangChain, we think of testing as the way to measure your LLM application's performance. The goal of testing is to help you iterate faster on your LLM app, enabling you to make quick decisions amidst a vast sea of choices – including what models, prompts, tools, or retrieval strategies to use.

We’ve crafted this short guide to help you add rigor to your LLM app testing.
In this guide, you’ll learn various testing techniques across phases of the LLM
app development lifecycle — from application design, to pre-production, to
post-production. We also give advice on evaluating specific use cases such as
Retrieval Augmented Generation (RAG) applications and agents.


Before LangSmith, we didn’t have a systematic way to improve
the quality of our LLM applications. By integrating LangSmith
into our application framework, we now have a cohesive
approach to benchmark prompts and models for 200+ of our
applications. This supports our data-driven culture at Grab and
allows us to drive continuous refinement of our LLM-powered
solutions.

Padarn Wilson
Head of Engineering, ML Platform at Grab



TESTING TECHNIQUES ACROSS THE APP DEVELOPMENT CYCLE

Tests can be applied at three different stages in your application development cycle:

Design: Tests can be added directly to your application. These are typically simple assertions fast enough to be executed at runtime. Using an orchestrator system can help execute these tests and feed failures back to the LLM. The goal is to catch and self-correct errors within your app itself before they affect users.

Pre-Production: Tests can be run before deploying your application into production. These typically cover key scenarios your application must pass, based on data you've collected. The goal is to catch and fix any regressions before the app is released to real users.

Post-Production: Tests can be run on your application in production. These help monitor and catch errors or undesirable behavior affecting real users. The goal is to identify issues and feed them back into the design or pre-production phases, creating a continuous cycle of "design, test, deploy, monitor, fix, and redesign".

[Diagram: Design Phase (error handling in your app; add nodes for error handling and self-correction) → Pre-Production Phase (test app before deploying; test app for regressions) → deploy app → Production Phase (monitor app in production; test app inputs / outputs). Regressions found in production feed back into the design phase.]

Together, these testing phases form a flywheel for continuous improvement of your LLM system, helping you identify and fix production issues to prevent regressions in new versions of your app. Now, let's dive into techniques and tips for each phase.

DESIGN PHASE

When designing your application, you may want to consider adding checks within your application logic to prevent unwanted outputs. Applications with built-in error handling can leverage the ability of LLMs to self-correct by executing tests within the application and feeding errors back to the LLM.

Below we discuss a few design patterns that can be used for self-correction. In each case discussed below, we use LangGraph as the framework to orchestrate error handling. To learn more about LangGraph, check out our documentation and tutorials.

Use Case: Self-corrective code generation


In traditional systems, incorrect or incomplete code often arises from
misunderstandings or lack of context, such as incorrect import statements or
syntax mistakes. To identify and correct these errors, you would typically
manually review code, analyze error messages, correct the issues, and re-run
tests until the code functions correctly.

The Definitive Testing guide by langchain 9


DESIGN PHASE

Encoding these checks into a system like LangGraph and having it make fixes
can greatly speed up this process of code generation. Error handling should
ideally focus on simple assertions to minimize cost and latency. In this
example focused on code generation, our question-answer system,
ChatLangChain, sometimes hallucinates import statements — which degrades
the user experience.

To address this, we can design a simple LangGraph control flow that tests the
imports and, if they fail, passes back any errors to the LLM for correction. This
simple assertion improved performance considerably on our evaluation set.

Tip: Always start with the simplest possible tests (e.g., assertions that can be hard-coded) because they are cheap and fast to run!
[Diagram: Question → Generation (Node) produces a Pydantic object containing a preamble, imports, and code → Import Check (Node); on failure, the error is fed back to Generation → Code Execution Check (Node); on failure, the error is fed back to Generation; otherwise → Answer.]

Use Case: Self-corrective RAG

Sometimes, simple assertions are not a sufficient test. For example, RAG systems can suffer from hallucinations and low-quality retrieval (e.g., if a user's question falls outside the domain of the indexed documents). In both cases, the tests need to reason over text and cannot be easily captured as simple assertions, so we can use an LLM to perform the testing.

Tip: Use of LLM-as-Judge evaluators for error handling should be done with care to minimize latency and cost. See the "Pre-Production" section for more on LLM-as-Judge evaluators.

As an example, here is a self-corrective RAG application with several stages of error handling that uses an LLM to grade retrieval relevance, answer hallucination, and answer usefulness. The application control flow is laid out explicitly using LangGraph:

- Each node is assigned a specific task.
- The control flow performs assertions in logical order, starting with retrieval.
- If the retrieved documents are relevant, it proceeds with answer generation.
- It checks whether the answer contains hallucinations relative to the retrieved documents.
- Finally, it checks whether the answer addresses the question.


[Diagram: A question is first routed. Questions related to the index go to Retrieve Documents → Grade Documents; if any document is irrelevant, or the question is unrelated to the index, the flow falls back to Web Search. Otherwise it proceeds to Generate Answer, then checks "Hallucinations?" and "Answers question?" before returning the answer; failed checks loop back.]
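A minimal sketch of the grade-and-route step in this flow is shown below. The state shape, node names, and the stubbed grade_document helper are illustrative assumptions; in practice the grader would be an LLM-as-Judge call (see the Pre-Production section).

```python
from typing import TypedDict


class RAGState(TypedDict):
    question: str
    documents: list[str]


def grade_document(question: str, document: str) -> bool:
    # Stub for an LLM-as-judge relevance check over one retrieved document.
    return True


def route_after_grading(state: RAGState) -> str:
    relevant = [d for d in state["documents"] if grade_document(state["question"], d)]
    if len(relevant) < len(state["documents"]):
        return "web_search"  # any irrelevant doc: fall back to web search
    return "generate"        # all docs relevant: proceed to answer generation


# Wired into a LangGraph graph as a conditional edge, e.g.:
# builder.add_conditional_edges("grade_documents", route_after_grading)
```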

Using LangGraph, this control flow is highly reliable. We've even run it using strictly local LLMs (e.g., an 8B-parameter model).

To see step-by-step how to implement corrective RAG, check out these resources:
- Building reliable, fully-local RAG agents with Llama3
- Conceptual guide on self-corrective generation in RAG

PRE-PRODUCTION
The goal of pre-production testing is to measure the performance of your
application, ensuring continuous improvement and catching any regressions
on scenarios that you expect to pass. To begin testing, you’ll first need to
define a set of scenarios that you want to test. Below, we’ll walk through
prepping a dataset and evaluation criteria, and how to then use regression
testing to measure and improve LLM app performance over time.

Prerequisites: Build a dataset

Datasets are collections of examples that serve as inputs and (optionally)
expected outputs used to evaluate your LLM app.

Gathering good data is often the hardest part of testing, given the large
amount of time and context needed to pull together a quality dataset to
benchmark your system. However, you don’t need many data points to get
started.


Below, we outline common methods for gathering data (a minimal dataset-creation sketch follows the list):

- Manually curated examples: These are hand-chosen / written examples. We often see teams start small with their volume of data for a new project, collecting 10-50 quality examples by hand. You can add to these manual data points over time to cover edge cases that emerge during production usage.

- Application logs: These are logs of past interactions with your application. Adding in production logs from older app versions ensures your dataset is realistic and can cover commonly-recurring user questions as you iterate on your application.

- Synthetic data: These are artificially-generated examples that simulate various scenarios and / or edge cases. This approach can be useful for augmenting your dataset when you may not have enough real data to test on. You can generate new, plausible inputs by sampling from existing inputs, or you can paraphrase existing inputs to diversify your dataset without changing the underlying context.

If you're using LangSmith, it's easy to save debugging and production traces to datasets, or to go back and edit your dataset as you gather more information.
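A minimal sketch for seeding a small, manually curated dataset in LangSmith follows; the dataset name and the example questions and answers are illustrative assumptions.

```python
from langsmith import Client

client = Client()  # reads LANGCHAIN_API_KEY from the environment

dataset = client.create_dataset(
    dataset_name="support-bot-smoke-tests",
    description="Hand-written questions with expected answers.",
)

client.create_examples(
    inputs=[
        {"question": "How do I reset my password?"},
        {"question": "Which plans support SSO?"},
    ],
    outputs=[
        {"answer": "Use the 'Forgot password' link on the sign-in page."},
        {"answer": "SSO is available on the Enterprise plan."},
    ],
    dataset_id=dataset.id,
)
```

You can keep appending examples to the same dataset as new edge cases surface in production.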


Define evaluation criteria

After creating your dataset, you'll want to define evaluation metrics to assess your application's responses before shipping to production. This batch evaluation on a predetermined test suite is often referred to as Offline Evaluation.

For offline evaluation, you can also optionally label expected outputs (i.e.
ground truth references) for the data points you are testing on. This lets you
compare your application’s response with the ground truth references.

There are a few ways to score your LLM app performance (minimal evaluator sketches follow the list below):

- Heuristic evaluators: These allow you to define assertions and hard-coded rules on your outputs to score their quality. You can use reference-free heuristics (e.g. checking if output is valid JSON) or reference-based heuristics like accuracy. Reference-based evaluation compares an output to a predefined ground truth, whereas reference-free evaluation assesses qualitative characteristics without a ground truth. Custom heuristic evaluators are useful for code generation tasks like schema checking and unit testing with hard-coded evaluation logic; in contrast, evaluations on natural language cannot be hard-coded as rules, requiring human or LLM-as-Judge evaluators.

Tip: Start with simple (e.g. heuristic) evaluations as much as possible. Then, perform human review. Finally, use LLM-as-Judge to automate your human review. This order of operations lets you add depth and scale once your criteria are well-defined.


- Human evaluators: Using human feedback is a good starting point if you can't express your testing requirements as code and an LLM evaluator is not consistent enough. When looking at qualitative characteristics, humans can label app responses with scores. LangSmith speeds up the process of collecting and incorporating human feedback with annotation queues.

- LLM-as-Judge evaluators: These capture human grading rules into an LLM prompt, in which you use the LLM to judge whether the output is correct (e.g. relative to the reference answer) or whether it adheres to specific criteria (e.g. if it's reference-free and you'd like to check if the output contains offensive content). Answer correctness is a common LLM-as-Judge evaluator for offline evaluation, in which the reference is a correct answer supplied from the dataset output. As you iterate in pre-production, you'll want to audit the scores and tune the LLM-as-Judge to produce reliable scores.
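As a first sketch, here is a reference-free heuristic evaluator that checks whether the output is valid JSON. The run / example signature follows LangSmith's custom-evaluator convention, and the feedback key name is an illustrative assumption.

```python
import json


def valid_json_evaluator(run, example) -> dict:
    """Score 1 if the application's output parses as JSON, else 0."""
    output = run.outputs.get("output", "")
    try:
        json.loads(output)
        score = 1
    except (TypeError, ValueError):
        score = 0
    return {"key": "valid_json", "score": score}
```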

Tip: For LLM-as-Judge evaluators, you can:
- Minimize cognitive load by using binary (yes / no) or simple (1, 0) scores rather than a continuous or more complex (e.g., Likert scale) range of scores.
- Use straightforward prompts that can easily be replicated and understood by a human. For example, avoid asking an LLM to produce scores on a range of 0-10 with vague distinctions between scores.
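And here is a sketch of an answer-correctness LLM-as-Judge evaluator for offline evaluation. The model choice, prompt wording, and the field names on the run and example objects are illustrative assumptions.

```python
from pydantic import BaseModel, Field
from langchain.chat_models import init_chat_model


class CorrectnessGrade(BaseModel):
    correct: bool = Field(description="True if the answer matches the reference.")


# Binary score keeps the judging task simple, per the tip above.
judge = init_chat_model("openai:gpt-4o").with_structured_output(CorrectnessGrade)


def correctness_evaluator(run, example) -> dict:
    prompt = (
        "Grade the ANSWER against the REFERENCE answer. Respond with a boolean.\n"
        f"QUESTION: {example.inputs['question']}\n"
        f"REFERENCE: {example.outputs['answer']}\n"
        f"ANSWER: {run.outputs['output']}"
    )
    grade = judge.invoke(prompt)
    return {"key": "correctness", "score": int(grade.correct)}
```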


Advanced: Few-shot feedback for improving LLM-as-Judge evaluators

How can you trust the results of LLM-as-Judge evaluation? Typically, this requires another round of prompt engineering to ensure accurate performance. But by leveraging few-shot learning, "self-improving" evaluation is now possible in LangSmith with minimal setup.

Few-shot learning can improve accuracy and align outputs with human preferences by providing examples of correct behavior. This is especially useful when it's tough to explain in instructions how the LLM should behave, or if the output is expected to have a particular format. In LangSmith, self-improving evaluation looks like the following:

- The LLM evaluator provides feedback on generated outputs, assessing factors like correctness, relevance, or other criteria.
- You add human corrections to modify or correct the LLM evaluator's feedback in LangSmith. This is where human preferences and judgment are captured.
- These corrections are stored as few-shot examples in LangSmith, with an option to leave explanations for corrections.
- The few-shot examples are incorporated into the prompts of subsequent evaluation runs.

Over time, the LLM-as-Judge evaluator becomes increasingly aligned with human preferences. This self-improving approach eliminates the need for time-consuming prompt engineering, while improving the accuracy and relevance of LLM-as-Judge evaluations. To learn more, read this blog post.


Advanced: Pairwise evaluation

Evaluating LLM app outputs can be challenging in isolation, especially for tasks where human preference is hard to encode in a set of rules. Ranking outputs preferentially (for example, deciding which of two outputs is more informative, less vague, or safer, versus grading each one individually) can be less cognitively demanding for human or LLM-as-Judge evaluators.

This is where pairwise evaluation comes in handy. Pairwise evaluation compares two outputs simultaneously from different versions of an application to determine which version better meets evaluation criteria like creativity.

LangSmith natively supports running and visualizing pairwise comparisons of LLM app generations, highlighting preference for one generation over another based on guidelines set by the pairwise evaluator. Read more in this blog post.
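The judging logic behind a pairwise comparison might look like the following sketch; the model, the criteria wording, and the Preference schema are illustrative assumptions.

```python
from pydantic import BaseModel, Field
from langchain.chat_models import init_chat_model


class Preference(BaseModel):
    preferred: int = Field(description="1 if answer A is better, 2 if answer B is better.")


pairwise_judge = init_chat_model("openai:gpt-4o").with_structured_output(Preference)


def compare_outputs(question: str, answer_a: str, answer_b: str) -> int:
    # Ask for a relative preference rather than an absolute grade.
    prompt = (
        "Given the question, decide which answer is more informative, less vague, "
        "and safer. Return 1 for A or 2 for B.\n"
        f"QUESTION: {question}\n"
        f"ANSWER A: {answer_a}\n"
        f"ANSWER B: {answer_b}"
    )
    return pairwise_judge.invoke(prompt).preferred
```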


Regression testing
In traditional software, tests pass based on functional requirements, leading to
more stable behavior once validated. In contrast, AI models show variable
performance due to model drift (i.e. degradation due to changes in data
distribution or updates to the model) and sensitivity. As such, regression testing
for LLM applications is even more critical and should be done frequently.



Once you've defined a dataset and evaluator, you may want to:

(1) Measure your LLM application's performance across experiments to identify improved app versions to ship.

(2) Monitor app performance over time to prevent regressions.

LangSmith supports these needs with a built-in comparison view, as shown below.


For (1), you can select and compare experiments associated with a dataset.
Runs that regressed are auto-highlighted in red, while runs that improved are
in green — showing aggregate stats and allowing you to comb through
specific examples. This is useful for migrating models or prompts, which may
result in performance improvements or regression on specific examples.

For (2), you can set a baseline (e.g., the current deployed version of your app)
and compare it against prior versions to detect unexpected regressions. If a
regression occurs, you can isolate both the app version and the specific
examples that contain performance changes.
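Running an experiment against your dataset is the unit that these comparisons operate on. A minimal sketch follows; the target function, dataset name, evaluator, and experiment prefix are illustrative assumptions.

```python
from langsmith.evaluation import evaluate


def my_app(inputs: dict) -> dict:
    # Call the version of your application under test here.
    return {"output": "Use the 'Forgot password' link on the sign-in page."}


def exact_match(run, example) -> dict:
    # Simple reference-based heuristic; swap in LLM-as-Judge evaluators as needed.
    score = int(run.outputs.get("output") == example.outputs.get("answer"))
    return {"key": "exact_match", "score": score}


results = evaluate(
    my_app,
    data="support-bot-smoke-tests",   # the dataset built earlier
    evaluators=[exact_match],
    experiment_prefix="prompt-v2",    # label to find this experiment later
)
```

Each experiment can then be selected in the comparison view and diffed against a baseline.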


LangSmith has made it easier than ever to curate and maintain high-signal LLM testing suites. Using LangSmith for testing, we've seen:

- A 43% performance increase over production systems, bolstering executive confidence to invest millions in new opportunities
- A 15% reduction in engineering time needed for regression testing, by eliminating the "whack-a-mole" effect of prompt changes

Walker Ward
Staff Software Engineer Architect at Podium



Use Case: Testing agents

Agents show promise for autonomously performing tasks and automating workflows, but testing an agent can be challenging. Agents use an LLM to decide the control flow of the application, which means every agent run can look quite different. For example, different tools might be called, agents might get stuck in a loop, or the number of steps from start to finish can vary significantly.

Tip: For building agents, you can:
- Customize your agent: Build custom or domain-specific agents with a controllable agent orchestrator like LangGraph. This improves agent reliability over general-purpose architectures.
- Diversify LLMs: Test multiple tool-calling LLMs to optimize performance and manage costs effectively, leveraging different strengths for specific applications.

For testing agents, you can:
- Reduce noise: Implement repetitions (i.e. run the same test multiple times and aggregate the results) to reduce variability in tool selection and agent response.
- Isolate failures: Partition your dataset into different splits to identify specific subsets of data causing an issue. This also helps you save computational resources.

We recommend you test agents at three different levels of granularity:
- The agent's final response, to strictly focus on end-to-end performance
- Any single, important step of the agent, to drill into specific tool calls / decisions
- The trajectory of the agent, to examine the full reasoning trajectory

To learn more about testing agents, check out these tutorials or explore this notebook. You can also watch our workshop on building and testing reliable agents.


Testing an agent's final response

To assess the overall performance of an agent on a task, treat it as a black box and define success as whether or not it completes the task. Keep in mind that this method is hard to debug when failures occur.

Testing for the agent's final response involves:
- Inputs: User input and (optionally) predefined tools
- Output: Agent's final response
- Evaluator: LLM-as-Judge evaluators, which can assess task completion directly from a text response

Testing a single step of an agent

Testing an individual action (where the LLM call makes a decision) lets you zoom in to see where your application is failing. It can also be faster to run (with just one LLM call invoked). Keep in mind that testing a single step doesn't capture the scope of the full agent. Dataset creation may also be more difficult, as it's easier to generate a dataset for earlier steps in an agent trajectory, but harder to do so for later steps, which must account for prior agent actions and responses.

Testing a single step of an agent involves:
- Inputs: User input to a single step (e.g., user prompt, set of tools). Can also include previously completed steps
- Output: LLM response from that step (often contains tool calls indicating what action the agent should take next)
- Evaluator: Binary score for correct tool selection and heuristic assessment of the tool input's accuracy

Testing an agent's trajectory

Looking back on the steps an agent took (often referred to as the trajectory) lets you assess whether or not the trajectory lined up with expectations of the agent, e.g. the number of steps or sequence of steps taken.

Testing an agent's trajectory involves (a minimal evaluator sketch follows this comparison):
- Inputs: User input and (optionally) predefined tools
- Output: Expected sequence of tool calls (in other words, the "exact" trajectory), or a list of tool calls in any order
- Evaluator: Function over the steps taken. To test the outputs, you can look at an exact-match binary score (note: there may be multiple correct paths) or metrics that focus on the count of incorrect steps. To test inputs to the tools, LLM-as-Judge may be more fruitful; you'd need to evaluate the full agent trajectory against a reference trajectory, then compile it as a set of messages to pass into the LLM-as-Judge.

[Diagram: a user input goes to the agent, which selects among Tools #1-3. Testing agent output compares the answer to a reference answer; testing agent reasoning compares the selection of tools to a reference tool (single step) or a reference tool trajectory (all steps).]
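A minimal sketch of two trajectory evaluators, one exact-match and one order-insensitive, is shown below. The message format (a list of messages with a tool_calls field) and the dataset field names are illustrative assumptions.

```python
def extract_tool_calls(messages: list[dict]) -> list[str]:
    """Pull the ordered list of tool names out of an agent's message history."""
    return [
        call["name"]
        for message in messages
        for call in message.get("tool_calls", [])
    ]


def trajectory_exact_match(run, example) -> dict:
    # Strict check: the same tools were called in the same order.
    actual = extract_tool_calls(run.outputs["messages"])
    expected = example.outputs["expected_tools"]
    return {"key": "trajectory_exact_match", "score": int(actual == expected)}


def tools_used_any_order(run, example) -> dict:
    # Looser check: the right tools were used, in any order.
    actual = set(extract_tool_calls(run.outputs["messages"]))
    expected = set(example.outputs["expected_tools"])
    return {"key": "tools_used", "score": int(actual == expected)}
```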

Implementation: Integrating into your CI flow

Integrating your LLM app testing into your Continuous Integration (CI) flow can help you automate testing each time changes are made to the codebase. But there are a couple of challenges with this workflow:

- Tests on LLM apps tend to be flaky, as many tests also use an LLM to do the evaluation.
- Running an experiment on every PR can be costly, as calls to LLMs are expensive.

To adapt your CI workflow to LLM application testing, we recommend the following (a minimal caching sketch follows the list):

- Use a cache: Instead of making a request to the LLM every time, pull from a cache if the input to the LLM hasn't changed from what's stored in the cache.
- Isolate datasets for CI: Instead of triggering your full experiment on each commit push, use a subset of the dataset that tests the most critical examples. Reserve running the experiment over the full dataset for when you have more substantial changes.
- Plan for human assistance: Despite a desire for full automation, you'll likely need a workflow that allows a human to correct failing tests in order to avoid blocking merges on finicky LLM evaluators.
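The caching idea can be as simple as hashing the LLM input and reusing a stored response when nothing has changed. The cache location, the call_llm parameter, and the pytest test below are illustrative assumptions, not a specific caching library.

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".llm_cache")   # commit or restore this directory in CI
CACHE_DIR.mkdir(exist_ok=True)


def cached_llm_call(prompt: str, call_llm) -> str:
    key = hashlib.sha256(prompt.encode()).hexdigest()
    cache_file = CACHE_DIR / f"{key}.json"
    if cache_file.exists():
        # Input unchanged since the last run: reuse the stored response.
        return json.loads(cache_file.read_text())["response"]
    response = call_llm(prompt)
    cache_file.write_text(json.dumps({"response": response}))
    return response


def test_refund_policy_answer():
    # Run only the most critical examples on every PR; reserve the full
    # dataset for more substantial changes.
    answer = cached_llm_call(
        "What is the refund policy?",
        call_llm=lambda p: "Refunds are available within 30 days of purchase.",
    )
    assert "refund" in answer.lower()
```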

POST-PRODUCTION
Though essential, testing in the pre-production phase won’t catch everything.
Only after you’ve shipped to production can you get insights into how your
LLM application fares under real user scenarios. Beyond checking for spikes in
latency or errors, you’ll need to assess characteristics such as relevance,
accuracy, or toxicity. In post-production, a monitoring system can help detect
when your LLM app performance veers off course, allowing you to isolate
valuable failure cases.

Prerequisites: Set up tracing

If you haven’t yet, you’ll need to set up tracing to gain visibility into a
meaningful portion of your production traffic. LangSmith makes this easy to
do and offers insights into not only the user input and application response,
but also all interim steps the application took to arrive at the response. This
level of detail is helpful for writing specific step assertions or debugging
issues later on.

LangSmith will additionally provide helpful metrics out-of-the-box, such as:

- Trace volume
- Success / failure rate
- Latency & time to first token
- LLM calls per trace
- Token count & cost

Check out our quickstart guide to set up tracing in LangSmith within minutes.
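A minimal sketch of enabling tracing with the LangSmith SDK follows; the project name and the stubbed retrieve / generate steps are illustrative assumptions.

```python
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "<your-api-key>"
os.environ["LANGCHAIN_PROJECT"] = "support-bot-prod"

from langsmith import traceable


@traceable
def retrieve(question: str) -> list[str]:
    return ["<retrieved document>"]   # placeholder for your retriever


@traceable
def generate(question: str, docs: list[str]) -> str:
    return "<model answer>"           # placeholder for your LLM call


@traceable
def answer_question(question: str) -> str:
    # Interim steps show up as child runs inside the trace, which is what
    # makes step-level assertions and debugging possible later.
    docs = retrieve(question)
    return generate(question, docs)


answer_question("How do I reset my password?")
```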


[Chart: LLM call count per day over roughly a month, broken down by success, pending, and error.]
This can provide baseline insight into your LLM app performance, on top of
which you’ll track qualitative metrics (covered in the next section).


LangSmith has been instrumental in accelerating our AI
adoption and enhancing our ability to identify and resolve
issues that impact application reliability. With LangSmith, we’ve
created custom feedback loops, improving our AI application
accuracy by 40% and reducing deployment time by 50%.

Varadarajan Srinivasan
VP of Data Science, AI and ML Engineering at Acxiom


Collect feedback in production

Unlike in the pre-production phase, evaluators for post-production testing don't have grounded reference responses to compare against. Instead, your evaluators will score performance in real time as your application processes user inputs. This reference-free, real-time evaluation is often referred to as Online Evaluation.

There are at least two types of feedback you can collect in production to improve app performance (a minimal sketch for recording user feedback follows this list):

- Feedback from users: You can directly collect user feedback, which can be explicit or implicit. Adding a thumbs-up / thumbs-down button to your application is an easy way to record user satisfaction with the application's response. You can also ask users to provide an additional explanation of why or why not their expectations were met. In LangSmith, you can attach user feedback to any trace or intermediate run (i.e. span) of a trace, including annotating traces inline or reviewing runs together in an annotation queue.

- Feedback from LLM-as-Judge evaluators: These can be appended to projects in LangSmith, giving you the ability to define LLM-as-Judge evaluation prompts that operate directly on application inputs or outputs. LangSmith has a number of preexisting prompts for RAG as well as tagging (e.g., for toxicity).

[Screenshot: off-the-shelf online evaluator prompts in LangSmith]
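Recording explicit user feedback against a trace might look like the following sketch; the run_id comes from your tracing setup, and the feedback key name is an illustrative assumption.

```python
from langsmith import Client

client = Client()


def record_user_feedback(run_id: str, thumbs_up: bool, comment: str | None = None) -> None:
    # Attach a binary user score (and optional explanation) to the trace.
    client.create_feedback(
        run_id,
        key="user_score",
        score=1.0 if thumbs_up else 0.0,
        comment=comment,
    )
```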


Use Case: Evaluating a RAG application in production

Let's apply some of these concepts by adding online evaluation to a RAG application that handles common questions over a knowledge base. Typically, LLM-as-Judge evaluators are used for RAG to evaluate factual accuracy and consistency between texts.

To assess the RAG app's performance in real time, you'll likely want to test:
- whether the application is hallucinating responses,
- whether the response is relevant and properly addresses the user's question, and
- what types of questions the users are asking.

1 - Testing for hallucinations

Below is an example of an LLM-as-Judge online evaluator that takes as input both:
- facts, a variable representing the raw source of information from the retrieval step, and
- student answer, a variable representing the RAG application response.


This prompt checks whether the application's response is grounded in the retrieved documents, preventing the introduction of unsupported information. We can represent the result of the test as a boolean (hallucination or not). This also further motivates why keeping track of interim trace steps (here, the retrieved documents) is important, and not just the input and final response.

You are a teacher grading a quiz.



You will be given FACTS and a STUDENT ANSWER.

Here is the grade criteria to follow:

(1) Ensure the STUDENT ANSWER is grounded in the FACTS.

(2) Ensure the STUDENT ANSWER does not contain "hallucinated" information outside
the scope of the FACTS.


Score:

A score of 1 means that the student's answer meets all of the criteria. This is the
highest (best) score.

A score of 0 means that the student's answer does not meet all of the criteria.
This is the lowest possible score you can give.

Explain your reasoning in a step-by-step manner to ensure your reasoning and


conclusion are correct.

Avoid simply stating the correct answer at the outset.
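Wiring this prompt up to a structured 0 / 1 score might look like the following sketch; the model choice is an illustrative assumption, and the instructions string condenses the prompt shown above.

```python
from pydantic import BaseModel, Field
from langchain.chat_models import init_chat_model

GRADER_INSTRUCTIONS = (
    "You are a teacher grading a quiz. You will be given FACTS and a STUDENT ANSWER. "
    "Ensure the STUDENT ANSWER is grounded in the FACTS and contains no hallucinated "
    "information outside their scope. Explain your reasoning step by step, then score "
    "1 if all criteria are met and 0 otherwise."
)


class HallucinationGrade(BaseModel):
    reasoning: str = Field(description="Step-by-step explanation of the grade.")
    score: int = Field(description="1 if grounded in the FACTS, 0 otherwise.")


grader = init_chat_model("openai:gpt-4o-mini").with_structured_output(HallucinationGrade)


def grade_hallucination(facts: str, student_answer: str) -> int:
    prompt = f"{GRADER_INSTRUCTIONS}\n\nFACTS: {facts}\n\nSTUDENT ANSWER: {student_answer}"
    return grader.invoke(prompt).score
```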

To learn more about online evaluation, explore our educational videos:
- Online evaluation in our LangSmith Evaluation series
- Online evaluation with a focus on guardrails in our LangSmith Evaluation series
- Online evaluation with a focus on RAG in our LangSmith Evaluation series

Bootstrapping
After setting up tracing and online evaluators, you’ll start to catch errors in
your application in production. Ideally, you modify your application to fix these
errors. You can also fold these errors back into your test dataset used for
offline evaluation, in order to prevent the same mistakes in future releases of
your application.
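Folding flagged production traces back into an offline dataset might look like the following sketch. The project name, dataset name, and the filter expression (here, runs whose user_score feedback is 0) are illustrative assumptions that depend on how your feedback is keyed.

```python
from langsmith import Client

client = Client()

# Pull production runs that received negative user feedback.
runs = client.list_runs(
    project_name="support-bot-prod",
    filter='and(eq(feedback_key, "user_score"), eq(feedback_score, 0))',
)

for run in runs:
    # Add the failing inputs to the offline dataset; reference outputs can be
    # filled in later after human review.
    client.create_examples(
        inputs=[run.inputs],
        dataset_name="support-bot-smoke-tests",
    )
```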

[Diagram: Bootstrapping to improve future versions. Errors caught by production monitoring feed back into the design phase (error handling and self-correction) and into pre-production regression testing, which gate the next deployment.]

You can also do a phased release of your app to a smaller audience to gradually build up your dataset before doing a full cutover to the new version. Seeing a big jump in any of your monitoring charts in LangSmith should alert you to investigate further or do a rollback. This approach lets you spot tradeoffs between cost, latency, and quality.

CONCLUSION
As large language models evolve rapidly, robust testing is needed to improve the system built around the model and provide lasting value. In this guide, we discussed three layers of testing that you can consider: (1) error handling within the application itself, (2) pre-production testing, and (3) post-production monitoring.

Together, these three layers of testing create a virtuous data flywheel. Production monitoring lets you identify application errors, informing the design process and pre-production (regression) testing. During the design phase, in-app error handling using frameworks like LangGraph can fix some of these errors and enable self-correction. Pre-production testing ensures each app version you ship avoids regressions and, ideally, improves performance on your collected examples.



LangChain products have helped over a million developers integrate generative AI into their software, and it's time to also integrate this flywheel of testing. With this guide, we hope you've gained a framework for robust LLM application testing, so you can iterate faster and systematically navigate decisions in the ever-changing LLM space.



GLOSSARY OF TERMS
Agents: An agent is a system that uses an LLM to decide the control flow of
an application.

Agent trajectory: The series of steps an agent took to complete a given task.

Experiment: An experiment runs your application code on all example inputs in a dataset and evaluates the outputs against the criteria you've defined.

LLM-as-Judge Evaluation: LLM-as-judge evaluators use LLMs to score your application's performance. To use them, you typically encode the grading rules / criteria in the LLM prompt. They can be reference-free (e.g., check if system output contains offensive content or adheres to specific criteria), or they can compare task output to a reference (e.g., check if the output is factually accurate relative to the reference).

Offline evaluation: Offline evaluation is conducted prior to deployment of your LLM application. Usually you have a set of examples in the form of a dataset that you want to test your application on. Once you record the outputs of your application over all examples, you can evaluate performance based on tests you've created. We call this an experiment in LangSmith. These tests can be reference-free or rely on a grounded, true response to compare your application's response against.

Online evaluation: Online evaluation allows you to evaluate an application in
production. This type of evaluation scores performance in real time, as your
application handles user inputs. Notably, it does not rely on a grounded, true
response for comparison.

RAG (Retrieval Augmented Generation): A technique for AI applications that leverages external knowledge from a knowledge base, retrieving relevant documents or other sources of information to generate more informed and context-aware responses.

Repetitions: Repetitions involve running the same evaluation multiple times and aggregating the results in order to smooth out run-to-run variability in LLM applications. This also lets you examine the reproducibility of the AI application's performance.



LangChain is the platform over a million developers choose for building AI apps from prototype to production. Created by the LangChain team, LangSmith is a unified platform for debugging, testing, deploying, and monitoring your LLM application. Thousands of organizations rely on LangSmith — including Rakuten, Home Depot, Elastic, and Grab — to build, run, and manage their LLM applications. Founded in 2022, LangChain is headquartered in San Francisco with customers worldwide.

Request Demo
