[ rag-forge ]
v0.1.3 — Audit resilience

Catch RAG failures before your users do.

RAG-Forge audits any RAG pipeline against the RAG Maturity Model. Detect hallucinations, retrieval bypass, silent quality regressions, and cost drift before they ship — with a single CLI that works on your existing stack.

$ npm install -g @rag-forge/cli
rag-forge audit
$ rag-forge audit --golden-set eval/golden_set.json --judge claude
RAG-Forge Audit
===============
Samples: 19
Metrics: 4 (faithfulness, context_relevance, answer_relevance, hallucination)
Judge calls: 76 total
Judge model: claude-sonnet-4-20250514
Estimated cost: ~$1.25 USD
---
[ 1/19] [query redacted] faith=0.92 ctx=0.85 ans=0.91 hall=0.95 OK (8.2s)
[ 2/19] [query redacted] faith=0.88 ctx=0.79 ans=0.90 hall=0.93 OK (9.1s)
[ 3/19] [query redacted] faith=0.00 ctx=0.00 ans=0.78 hall=0.85 WARN 2 skipped (11.4s)
...
---
Audit complete in 9m 23s
Scored: 72 Skipped: 4
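The "Scored: 72 / Skipped: 4" line reflects skip-aware aggregation: judge calls that fail or are skipped are excluded from the average instead of dragging it down as zeros. A minimal sketch of that idea (the `aggregate` function and score shape are illustrative, not RAG-Forge internals):

```typescript
interface SampleScore {
  value: number | null; // judge score in [0, 1], or null if the call was skipped
  skipped: boolean;
}

// Skip-aware mean: only scored samples contribute to the average.
function aggregate(scores: SampleScore[]): { mean: number; scored: number; skipped: number } {
  const scored = scores.filter((s) => !s.skipped && s.value !== null);
  const skipped = scores.length - scored.length;
  const mean = scored.length
    ? scored.reduce((sum, s) => sum + (s.value as number), 0) / scored.length
    : 0;
  return { mean, scored: scored.length, skipped };
}
```

Without this, a single skipped call (scored as 0.00, like sample 3 above) would mask an otherwise healthy pipeline.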
RMM-3: Better Trust
v0.1.3 just shipped
MIT licensed
OIDC Trusted Publishers

The RAG quality crisis

RAG has become the dominant architecture for enterprise AI. Yet the ecosystem suffers from a critical gap between building RAG pipelines and knowing whether they actually work.

32%

of teams cite quality as the #1 GenAI deployment barrier

LangChain State of AI Agents 2026

RMM-0

is where most production RAG pipelines actually sit — naive vector search with no quality framework

RAG-Forge Maturity Model

Few

open-source frameworks score any pipeline against a maturity model with framework-agnostic CLI tooling — RAG-Forge is one of them

RAG-Forge

Everything you need to ship a production RAG pipeline

Pipeline Primitives

Five chunking strategies, dense + sparse + hybrid retrieval, contextual enrichment, and reranking. Bring your own embedding model.

$ create_chunker(ChunkConfig(strategy="semantic"))

Evaluation as a CI/CD Gate

RAGAS, DeepEval, and LLM-as-Judge baked in. Cost + time estimates before each run, skip-aware aggregation, configurable thresholds in rag-forge.config.ts.

$ rag-forge audit --golden-set qa.json --judge claude
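Thresholds live in rag-forge.config.ts. The exact schema is whatever `rag-forge init` generates; the sketch below only illustrates the idea of per-metric gates, and every field name in it is an assumption:

```typescript
// rag-forge.config.ts — illustrative shape only; field names are assumed,
// not the documented schema. Consult the generated config for the real one.
export default {
  judge: "claude",
  thresholds: {
    faithfulness: 0.85, // matches the RMM-3 gate: faithfulness > 85%
    context_relevance: 0.75,
    answer_relevance: 0.8,
    hallucination: 0.9,
  },
  // A failing threshold should produce a non-zero exit code,
  // which is what lets `rag-forge audit` act as a CI/CD gate.
  failOnThreshold: true,
};
```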

Built-in Observability

OpenTelemetry tracing on every pipeline stage. Drift detection, cost estimation, semantic caching.

$ rag-forge drift report --baseline baseline.json

Production Templates

Five battle-tested starting points. shadcn/ui model — you own every line of code.

$ rag-forge init enterprise

The RAG Maturity Model

Where does your pipeline stand? Score any RAG system from RMM-0 (naive) to RMM-5 (enterprise).

  1. RMM-0

    Naive

    Basic vector search works

    Gate: Vector retrieval returns results

  2. RMM-1

    Better Recall

    Hybrid search active, Recall@5 > 70%

    Gate: Dense + sparse + RRF fusion

  3. RMM-2

    Better Precision

    Reranker active, nDCG@10 +10%

    Gate: Cross-encoder reranking on top results

  4. RMM-3

    Better Trust

    ← Most pipelines stop here

    Guardrails, faithfulness > 85%, citations

    Gate: InputGuard + OutputGuard active

  5. RMM-4

    Better Workflow

    Caching, P95 < 4s, cost tracking

    Gate: Semantic cache + telemetry + cost meter

  6. RMM-5

    Enterprise

    Drift detection, CI/CD gates, adversarial tests

    Gate: All audit thresholds pass
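Because each level's gate builds on the ones below it, a pipeline sits at the highest level whose gate, and every gate beneath it, passes. A hedged sketch of that scoring rule (gate names here are shorthand for the model above, not RAG-Forge's actual identifiers):

```typescript
// Gates in RMM order; a pipeline's level is the longest unbroken
// prefix of passing gates.
const gates = [
  "vector_retrieval",     // RMM-0: vector retrieval returns results
  "hybrid_fusion",        // RMM-1: dense + sparse + RRF fusion
  "reranking",            // RMM-2: cross-encoder reranking on top results
  "guardrails",           // RMM-3: InputGuard + OutputGuard active
  "cache_and_telemetry",  // RMM-4: semantic cache + telemetry + cost meter
  "audit_thresholds",     // RMM-5: all audit thresholds pass
] as const;

function rmmLevel(passed: Set<string>): number {
  let level = -1; // -1: even the RMM-0 gate fails
  for (const gate of gates) {
    if (!passed.has(gate)) break; // a missing gate caps the level
    level++;
  }
  return level;
}
```

Note the prefix rule: a pipeline with reranking but no hybrid fusion still scores RMM-0, because skipping a level leaves its gate unmet.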

Get started in 60 seconds

# Install the CLI
npm install -g @rag-forge/cli

# Scaffold a project (use --directory to name the folder)
rag-forge init basic --directory my-rag-project
cd my-rag-project

# Drop your documents into a folder of your choice
mkdir docs
echo "RAG-Forge is a CLI for building and evaluating RAG pipelines." > docs/example.md

# Index your docs and run an audit
rag-forge index --source ./docs
rag-forge audit --golden-set eval/golden_set.json
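The audit step needs a golden set: queries paired with reference answers for the judge to score against. A minimal sketch of eval/golden_set.json (the field names are assumptions, not the documented schema; check the file scaffolded by `rag-forge init`):

```typescript
// Illustrative shape of eval/golden_set.json — field names are assumed.
const goldenSet = [
  {
    query: "What does RAG-Forge audit do?",
    reference:
      "It scores a RAG pipeline's answers for faithfulness, relevance, and hallucination.",
  },
  {
    query: "What command scaffolds a new project?",
    reference: "rag-forge init, with a template name such as basic or enterprise.",
  },
];
```

A handful of well-chosen entries is enough to start; the audit output above used 19.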

How RAG-Forge compares

Feature                                    rag-forge   langchain   llamaindex   ragas
Framework agnostic (audit any pipeline)    yes         no          partial      yes
Evaluation built in (CI/CD gate)           yes         partial     partial      yes
RAG Maturity Model scoring                 yes         no          no           no
OpenTelemetry native                       yes         partial     no           no
MCP server                                 yes         no          no           no
CLI scaffolding                            yes         no          partial      no
Code ownership (shadcn model)              yes         no          no           no
Drift detection                            yes         no          no           no

Comparison based on publicly available features as of April 2026.

Peer strengths worth knowing

  • RAGAS: Deeper metric research and a larger community. RAG-Forge's evaluator supports RAGAS as a backend — `rag-forge audit --evaluator ragas`.
  • LangChain & LlamaIndex: Far broader integration ecosystems if you're already invested in their framework. RAG-Forge complements them by sitting on top of any pipeline.
  • Giskard: Strong general-purpose ML testing story beyond RAG.

Pick the tool that matches your stage. RAG-Forge's wedge is the full lifecycle — scaffold → evaluate → score → ship — in one CLI, with the RAG Maturity Model as the objective function.

Start from a template

basic

Beginner

First RAG project, simple Q&A

$ rag-forge init basic

hybrid

Intermediate

Production-ready document Q&A with reranking

$ rag-forge init hybrid

agentic

Advanced

Multi-hop reasoning with query decomposition

$ rag-forge init agentic

enterprise

Advanced

Regulated industries with full security suite

$ rag-forge init enterprise

n8n

Intermediate

AI automation agency deployments

$rag-forge init n8n