The in-process intelligence layer for AI agents. Optimize cost, latency, quality, budget, compliance, and energy - inside the execution loop, not at the HTTP boundary.
cascadeflow works where external proxies can't: per-step model decisions based on agent state, per-tool-call budget gating, runtime stop/continue/escalate actions, and business KPI injection during agent loops. It accumulates insight from every model call, tool result, and quality score - the agent gets smarter the more it runs. Sub-5ms overhead. Works with LangChain, OpenAI Agents SDK, CrewAI, Google ADK, n8n, and Vercel AI SDK.
cascadeflow is a library and agent harness - an intelligent AI model cascading package that dynamically selects the optimal model for each query or tool call through speculative execution. It's based on research showing that 40-70% of queries don't require slow, expensive flagship models, and that domain-specific smaller models often outperform large general-purpose models on specialized tasks. For the remaining queries that need advanced reasoning, cascadeflow automatically escalates to flagship models.
Use Cases
Inside-the-Loop Control. Influence decisions at every agent step - model call, tool call, sub-agent handoff - where most cost, delay, and failure actually happen. External proxies only see request boundaries; cascadeflow sees decision boundaries.
Multi-Dimensional Optimization. Optimize across cost, latency, quality, budget, compliance/risk, and energy simultaneously - relevant to engineering, finance, security, operations, and sustainability stakeholders.
Business Logic Injection. Embed KPI weights and policy intent directly into agent behavior at runtime. Shift AI control from static prompt design to live business governance.
Runtime Enforcement. Directly steer outcomes with four actions (allow, switch_model, deny_tool, stop) based on current context and policy state. Closes the gap between analytics and execution.
Auditability & Transparency. Every runtime decision is traceable and attributable. Supports audit requirements, faster tuning cycles, and trust in regulated or high-stakes workflows.
Measurable Value. Prove impact with reproducible metrics on realistic agent workflows - better economics and latency while preserving quality thresholds.
Latency Advantage. Proxy-based optimization adds 40-60ms per call; in a 10-step agent loop, that is 400-600ms of avoidable overhead. cascadeflow runs in-process with sub-5ms overhead - critical for real-time UX, task throughput, and enterprise SLAs.
Framework & Provider Neutral. Works with LangChain, OpenAI Agents SDK, CrewAI, Google ADK, Vercel AI SDK, n8n, and custom frameworks. Unified API across OpenAI, Anthropic, Groq, Ollama, vLLM, Together, and more.
Self-Improving Agent Intelligence. Because cascadeflow runs inside the agent loop, it accumulates deep insight into every model call, tool result, quality score, and routing decision over time. This enables cascadeflow to learn which models perform best for which tasks, adapt routing strategies, and continuously improve cost-quality tradeoffs - without manual tuning. The agent gets smarter the more it runs.
Edge & Local-Hosted AI. Handle most queries with local models (vLLM, Ollama), automatically escalate complex queries to cloud providers only when needed.
ℹ️ Note: SLMs (under 10B parameters) are sufficiently powerful for 60-70% of agentic AI tasks. Research paper
How cascadeflow Works
cascadeflow uses speculative execution with quality validation:
Speculatively executes small, fast models first - optimistic execution ($0.15-0.30/1M tokens)
Validates quality of responses using configurable thresholds (completeness, confidence, correctness)
Dynamically escalates to larger models only when quality validation fails ($1.25-3.00/1M tokens)
Learns patterns to optimize future cascading decisions and domain-specific routing
Zero configuration. Works with YOUR existing models (17+ providers currently supported).
In practice, 60-70% of queries are handled by small, efficient models (an 8-20x cost difference) without requiring escalation.
Result: 40-85% cost reduction, 2-10x faster responses, zero quality loss.
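The speculate-validate-escalate loop above can be sketched in a few lines of plain Python. This is a toy illustration of the pattern, not cascadeflow's actual implementation; `call` and `validate` are hypothetical stand-ins for a provider call and a quality check:

```python
# Toy sketch of the speculative-cascade loop (NOT cascadeflow internals).
# `call` and `validate` are hypothetical stand-ins you would supply.

def cascade(query, draft_model, flagship_model, call, validate):
    """Try the cheap draft model first; escalate only if validation fails."""
    draft = call(draft_model, query)       # 1. speculative, cheap execution
    if validate(draft):                    # 2. quality gate (thresholds)
        return draft, draft_model          #    most queries stop here
    return call(flagship_model, query), flagship_model  # 3. escalate


# Canned responses so the sketch runs end to end without an API key
canned = {"draft": "Paris.", "flagship": "The capital of France is Paris."}

answer, used = cascade(
    "What's the capital of France?",
    "draft",
    "flagship",
    call=lambda model, q: canned[model],
    validate=lambda r: len(r.split()) >= 1,  # e.g. a minimum-length check
)
# The draft model's answer passes validation, so no escalation occurs.
```

Tightening `validate` (e.g. requiring more tokens, higher confidence, or semantic similarity) is what shifts traffic from the draft model to the flagship.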
```python
from cascadeflow import CascadeAgent, ModelConfig

# Define your cascade - try cheap model first, escalate if needed
agent = CascadeAgent(models=[
    ModelConfig(name="gpt-4o-mini", provider="openai", cost=0.000375),  # Draft model (~$0.375/1M tokens)
    ModelConfig(name="gpt-5", provider="openai", cost=0.00562),         # Verifier model (~$5.62/1M tokens)
])

# Run query - automatically routes to optimal model
result = await agent.run("What's the capital of France?")

print(f"Answer: {result.content}")
print(f"Model used: {result.model_used}")
print(f"Cost: ${result.total_cost:.6f}")
```
💡 Optional: Use ML-based Semantic Quality Validation
For advanced use cases, you can add ML-based semantic similarity checking to validate that responses align with queries.
```python
from cascadeflow.quality.semantic import SemanticQualityChecker

# Initialize semantic checker (downloads model on first use)
checker = SemanticQualityChecker(
    similarity_threshold=0.5,  # Minimum similarity score (0-1)
    toxicity_threshold=0.7,    # Maximum toxicity score (0-1)
)

# Validate query-response alignment
query = "Explain Python decorators"
response = "Decorators are a way to modify functions using @syntax..."
result = checker.validate(query, response, check_toxicity=True)

print(f"Similarity: {result.similarity:.2%}")
print(f"Passed: {result.passed}")
print(f"Toxic: {result.is_toxic}")
```
⚠️ GPT-5 Note: GPT-5 streaming requires organization verification; non-streaming works for all users. Verify here if needed (~15 min). Basic cascadeflow examples work without it - GPT-5 is only called when needed (typically 20-30% of requests).
```typescript
import { CascadeAgent, ModelConfig } from '@cascadeflow/core';

// Same API as Python!
const agent = new CascadeAgent({
  models: [
    { name: 'gpt-4o-mini', provider: 'openai', cost: 0.000375 },
    { name: 'gpt-4o', provider: 'openai', cost: 0.00625 },
  ],
});

const result = await agent.run('What is TypeScript?');
console.log(`Model: ${result.modelUsed}`);
console.log(`Cost: $${result.totalCost}`);
console.log(`Saved: ${result.savingsPercentage}%`);
```
For advanced quality validation, enable ML-based semantic similarity checking to ensure responses align with queries.
Step 1: Install the optional ML packages:
npm install @cascadeflow/ml @xenova/transformers
Step 2: Enable semantic validation in your cascade:
```typescript
import { CascadeAgent, SemanticQualityChecker } from '@cascadeflow/core';

const agent = new CascadeAgent({
  models: [
    { name: 'gpt-4o-mini', provider: 'openai', cost: 0.000375 },
    { name: 'gpt-4o', provider: 'openai', cost: 0.00625 },
  ],
  quality: {
    threshold: 0.40,              // Traditional confidence threshold
    requireMinimumTokens: 5,      // Minimum response length
    useSemanticValidation: true,  // Enable ML validation
    semanticThreshold: 0.5,       // 50% minimum similarity
  },
});

// Responses now validated for semantic alignment
const result = await agent.run('Explain TypeScript generics');
```
Step 3: Or use semantic validation directly:
```typescript
import { SemanticQualityChecker } from '@cascadeflow/core';

const checker = new SemanticQualityChecker();
if (await checker.isAvailable()) {
  const result = await checker.checkSimilarity(
    'What is TypeScript?',
    'TypeScript is a typed superset of JavaScript.'
  );
  console.log(`Similarity: ${(result.similarity * 100).toFixed(1)}%`);
  console.log(`Passed: ${result.passed}`);
}
```
TypeScript - Drop-in replacement for any LangChain chat model
```typescript
import { ChatOpenAI } from '@langchain/openai';
import { ChatAnthropic } from '@langchain/anthropic';
import { withCascade } from '@cascadeflow/langchain';

const cascade = withCascade({
  drafter: new ChatOpenAI({ model: 'gpt-4o-mini' }),           // $0.15/$0.60 per 1M tokens
  verifier: new ChatAnthropic({ model: 'claude-sonnet-4-5' }), // $3/$15 per 1M tokens
  qualityThreshold: 0.8,                                       // 80% queries use drafter
});

// Use like any LangChain chat model
const result = await cascade.invoke('Explain quantum computing');

// Optional: Enable LangSmith tracing (see https://smith.langchain.com)
// Set LANGSMITH_API_KEY, LANGSMITH_PROJECT, LANGSMITH_TRACING=true

// Or with LCEL chains
const chain = prompt.pipe(cascade).pipe(new StringOutputParser());
```
Python - Drop-in replacement for any LangChain chat model
```python
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic
from cascadeflow.integrations.langchain import CascadeFlow

cascade = CascadeFlow(
    drafter=ChatOpenAI(model="gpt-4o-mini"),            # $0.15/$0.60 per 1M tokens
    verifier=ChatAnthropic(model="claude-sonnet-4-5"),  # $3/$15 per 1M tokens
    quality_threshold=0.8,                              # 80% queries use drafter
)

# Use like any LangChain chat model
result = await cascade.ainvoke("Explain quantum computing")

# Optional: Enable LangSmith tracing (see https://smith.langchain.com)
# Set LANGSMITH_API_KEY, LANGSMITH_PROJECT, LANGSMITH_TRACING=true

# Or with LCEL chains
chain = prompt | cascade | StrOutputParser()
```
💡 Optional: Cost Tracking with Callbacks (Python)
Track costs, tokens, and cascade decisions with LangChain-compatible callbacks:
```python
from cascadeflow.integrations.langchain.langchain_callbacks import get_cascade_callback

# Track costs similar to get_openai_callback()
with get_cascade_callback() as cb:
    response = await cascade.ainvoke("What is Python?")

print(f"Total cost: ${cb.total_cost:.6f}")
print(f"Drafter cost: ${cb.drafter_cost:.6f}")
print(f"Verifier cost: ${cb.verifier_cost:.6f}")
print(f"Total tokens: {cb.total_tokens}")
print(f"Successful requests: {cb.successful_requests}")
```
Features:
🎯 Compatible with the get_openai_callback() pattern
💡 Optional: Model Discovery & Analysis Helpers (TypeScript)
For discovering optimal cascade pairs from your existing LangChain models, use the built-in discovery helpers:
```typescript
import {
  discoverCascadePairs,
  findBestCascadePair,
  analyzeModel,
  validateCascadePair,
} from '@cascadeflow/langchain';

// Your existing LangChain models (configured with YOUR API keys)
const myModels = [
  new ChatOpenAI({ model: 'gpt-3.5-turbo' }),
  new ChatOpenAI({ model: 'gpt-4o-mini' }),
  new ChatOpenAI({ model: 'gpt-4o' }),
  new ChatAnthropic({ model: 'claude-3-haiku' }),
  // ... any LangChain chat models
];

// Quick: Find best cascade pair
const best = findBestCascadePair(myModels);
console.log(`Best pair: ${best.analysis.drafterModel} → ${best.analysis.verifierModel}`);
console.log(`Estimated savings: ${best.estimatedSavings}%`);

// Use it immediately
const cascade = withCascade({
  drafter: best.drafter,
  verifier: best.verifier,
});

// Advanced: Discover all valid pairs
const pairs = discoverCascadePairs(myModels, {
  minSavings: 50,              // Only pairs with >=50% savings
  requireSameProvider: false,  // Allow cross-provider cascades
});

// Validate specific pair
const validation = validateCascadePair(drafter, verifier);
console.log(`Valid: ${validation.valid}`);
console.log(`Warnings: ${validation.warnings}`);
```
What you get:
🔍 Automatic discovery of optimal cascade pairs from YOUR models
💰 Estimated cost-savings calculations
⚠️ Validation warnings for misconfigured pairs
📊 Model tier analysis (drafter vs. verifier candidates)