Evaluating Retrieval in LangChain for Optimized AI Performance

Introduction

Retrieval is a critical component of many AI applications, particularly retrieval-augmented generation (RAG) systems, where relevant documents or data must be fetched efficiently to ground accurate, contextually appropriate responses. LangChain, a framework for building applications powered by language models, provides tools in its langchain.evaluation module to evaluate retrieval performance and confirm that retrievers return high-quality, relevant results. Covered under the /langchain/evaluation/evaluating-retrieval path, retrieval evaluation assesses metrics such as relevance, precision, recall, and context appropriateness for retrievers backed by vector stores or other data sources. This guide explains how to evaluate retrieval in LangChain, covering setup, core techniques, best practices, practical applications, and advanced configurations so developers can optimize retrieval systems for better AI performance.

To understand LangChain’s broader evaluation ecosystem, start with LangChain Evaluation Introduction.

What is Retrieval Evaluation in LangChain?

Retrieval evaluation in LangChain involves assessing the performance of retrievers—components responsible for fetching relevant documents or data from a data source (e.g., vector stores, databases)—based on metrics such as relevance, precision, recall, and ranking quality. The langchain.evaluation module provides evaluators that leverage LLMs, embedding-based similarity, or traditional metrics to measure how well retrievers identify and rank documents that match a given query. These evaluations are often integrated with LangSmith for dataset-driven testing and performance tracking, enabling developers to validate and optimize retrievers in RAG pipelines, semantic search, or question-answering systems. Retrieval evaluation ensures that the retrieved context is accurate and relevant, directly impacting the quality of downstream LLM outputs.
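
Before wiring up a full pipeline, a single query/document pair can be scored directly. The snippet below is a minimal sketch using the built-in criteria evaluator with the "relevance" criterion (the strings are illustrative, and an OpenAI API key is assumed to be configured):

from langchain.evaluation import load_evaluator, EvaluatorType
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

# Score a single retrieved document against a query (illustrative strings)
evaluator = load_evaluator(EvaluatorType.CRITERIA, criteria="relevance", llm=llm)
result = evaluator.evaluate_strings(
    prediction="The capital of France is Paris.",   # a retrieved document
    input="What is the capital of France?"          # the user query
)
print(result["score"], result.get("reasoning", ""))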

For related concepts, see LangChain Metrics Overview and Testing Pipelines.

Why Evaluate Retrieval?

Retrieval evaluation is essential for:

  • Relevance Assurance: Ensure retrievers fetch documents that accurately match user queries.
  • Performance Optimization: Identify and address issues in document ranking or context selection.
  • Downstream Impact: Improve the quality of LLM-generated responses by providing relevant context.
  • Scalability: Validate retriever performance across diverse queries and datasets.

Explore evaluation capabilities at the LangChain Evaluation Documentation.

Setting Up Retrieval Evaluation

To evaluate retrieval in LangChain, you need to install the required packages, configure a retriever (e.g., backed by a vector store), set up evaluators, and create test datasets. Below is a setup for evaluating a retriever in a RetrievalQA pipeline using LangSmith:

from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.documents import Document
from langchain.evaluation import load_evaluator, EvaluatorType
from langchain.chains import RetrievalQA
from langchain_core.prompts import PromptTemplate
from langsmith import Client
from langsmith.evaluation import evaluate
import os
import logging
import time
from typing import Dict, Any

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Set LangSmith environment variables
os.environ["LANGSMITH_API_KEY"] = ""
os.environ["LANGSMITH_PROJECT"] = "retrieval-evaluation"

# Initialize embeddings and language model
embedding_function = OpenAIEmbeddings(model="text-embedding-3-small")
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

# Create sample documents
documents = [
    Document(page_content="The capital of France is Paris.", metadata={"source": "geo"}),
    Document(page_content="The Eiffel Tower is in Paris.", metadata={"source": "landmark"}),
    Document(page_content="Florida is a state in the USA.", metadata={"source": "geo"})
]

# Initialize Chroma vector store
vector_store = Chroma.from_documents(
    documents,
    embedding_function,
    collection_name="langchain_example",
    persist_directory="./chroma_db"
)

# Set up retriever
retriever = vector_store.as_retriever(search_kwargs={"k": 2})

# Set up RetrievalQA pipeline (for context, though evaluation focuses on retriever)
# Note: this template is illustrative; RetrievalQA uses its default QA prompt unless one is passed via chain_type_kwargs
prompt = PromptTemplate.from_template("Answer: {question}")
qa_pipeline = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever
)

# Initialize LangSmith client
client = Client()

# Create or load a dataset
dataset_name = "retrieval_qa_dataset"
try:
    dataset = client.create_dataset(dataset_name=dataset_name)
except Exception:
    # Dataset already exists; reuse it
    dataset = client.get_dataset(dataset_name=dataset_name)

# Add examples to dataset
examples = [
    {
        "input": "What is the capital of France?",
        "expected_docs": ["The capital of France is Paris."],
        "output": "Paris"
    },
    {
        "input": "Where is the Eiffel Tower?",
        "expected_docs": ["The Eiffel Tower is in Paris."],
        "output": "Paris"
    },
    {
        "input": "Describe Paris landmarks.",
        "expected_docs": ["The Eiffel Tower is in Paris."],
        "output": ""
    }
]
for example in examples:
    client.create_example(
        inputs={"question": example["input"]},
        outputs={"answer": example["output"], "expected_docs": example["expected_docs"]},
        dataset_id=dataset.id
    )

# Define evaluators
def evaluate_retrieval_relevance(run, example) -> Dict[str, Any]:
    retrieved_docs = run.outputs.get("retrieved_docs", [])
    question = example.inputs.get("question", "")
    if not retrieved_docs:
        return {"key": "retrieval_relevance", "score": 0.0, "comment": "No documents retrieved."}
    # Evaluate top retrieved document
    prediction = retrieved_docs[0]["page_content"]
    evaluator = load_evaluator(EvaluatorType.CRITERIA, criteria="relevance", llm=llm)
    result = evaluator.evaluate_strings(
        prediction=prediction,
        input=question
    )
    return {
        "key": "retrieval_relevance",
        "score": result["score"],
        "comment": result.get("reasoning", "")
    }

def evaluate_retrieval_precision(run, example) -> Dict[str, Any]:
    retrieved_docs = run.outputs.get("retrieved_docs", [])
    expected_docs = example.outputs.get("expected_docs", [])
    if not retrieved_docs or not expected_docs:
        return {"key": "retrieval_precision", "score": 0.0, "comment": "No documents retrieved or expected."}
    # Calculate precision: proportion of retrieved docs that are expected
    retrieved_texts = [doc["page_content"] for doc in retrieved_docs]
    relevant_count = sum(1 for doc in retrieved_texts if doc in expected_docs)
    precision = relevant_count / len(retrieved_docs)
    return {
        "key": "retrieval_precision",
        "score": precision,
        "comment": f"{relevant_count}/{len(retrieved_docs)} retrieved documents are relevant."
    }

# Run retrieval evaluation
start_time = time.time()
results = evaluate(
    lambda inputs: {
        "result": qa_pipeline.invoke({"query": inputs["question"]})["result"],
        "retrieved_docs": [
            {"page_content": doc.page_content, "metadata": doc.metadata}
            for doc in retriever.invoke(inputs["question"])
        ]
    },
    data=dataset_name,
    evaluators=[evaluate_retrieval_relevance, evaluate_retrieval_precision],
    experiment_prefix="retrieval_test",
    metadata={"version": "1.0", "evaluated_at": "2025-05-15T16:08:00Z"}
)

# Log results
logger.info(f"Retrieval testing completed in {time.time() - start_time:.2f} seconds")
print(f"Test Results: {results}")
print("View detailed results in LangSmith dashboard under 'retrieval_test' experiment.")

This setup creates a RetrievalQA pipeline with a Chroma vector store, uploads a test dataset to LangSmith, and evaluates the retriever’s performance for relevance and precision. Results are logged in the LangSmith dashboard for analysis.

Installation

Install the core packages for LangChain, LangSmith, and evaluation:

pip install langchain langchain-chroma langchain-openai chromadb langsmith

For specific metrics, install additional dependencies:

  • NLP Metrics: pip install nltk rouge-score for BLEU/ROUGE scores.
  • Embedding Metrics: Included with langchain-openai.

Example:

pip install nltk rouge-score
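
Once installed, these libraries can score lexical overlap between a retrieved document and an expected document. The snippet below is a small sketch of the nltk and rouge-score APIs with illustrative strings:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

expected = "The capital of France is Paris."
retrieved = "The capital of France is Paris."

# BLEU over whitespace tokens (smoothing avoids zero scores on short texts)
bleu = sentence_bleu(
    [expected.split()], retrieved.split(),
    smoothing_function=SmoothingFunction().method1
)

# ROUGE-L F-measure between the expected and retrieved text
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(expected, retrieved)["rougeL"].fmeasure

print(f"BLEU: {bleu:.2f}, ROUGE-L: {rouge_l:.2f}")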

For detailed installation guidance, see LangSmith Documentation.

Configuration Options

Customize retrieval evaluation during setup:

  • Retriever Configuration:
    • Adjust retriever parameters (e.g., k for top-k documents) or use different vector stores and search strategies (see the sketch after this list).
    • Example:
    • retriever = vector_store.as_retriever(search_kwargs={"k": 3})
  • Dataset Configuration:
    • Include queries with expected documents and answers for precise evaluation.
    • Example:
    • client.create_example(
              inputs={"question": "What is the capital of France?"},
              outputs={"answer": "Paris", "expected_docs": ["The capital of France is Paris."]},
              dataset_id=dataset.id
          )
  • Evaluators:
    • Use built-in (CRITERIA, EMBEDDING_DISTANCE) or custom evaluators for retrieval-specific metrics.
    • Example:
    • evaluator = load_evaluator(EvaluatorType.CRITERIA, criteria="relevance", llm=llm)
  • LangSmith Integration:
    • Track experiments with metadata for reproducibility.
    • Example:
    • evaluate(..., metadata={"version": "1.0", "retriever": "Chroma"})
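
Beyond adjusting k, different search strategies can be configured on the same vector store and compared as separate experiments. A sketch of the configuration side, assuming the vector_store from the setup above:

# Maximal marginal relevance: balances relevance against diversity of results
mmr_retriever = vector_store.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 3, "fetch_k": 10}
)

# Similarity with a score threshold: only return sufficiently similar documents
threshold_retriever = vector_store.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"k": 3, "score_threshold": 0.5}
)

# Each configuration can then be evaluated as its own experiment, e.g.
# evaluate(..., experiment_prefix="retrieval_test_mmr", metadata={"retriever": "Chroma-MMR"})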

Core Techniques for Evaluating Retrieval

1. Relevance Evaluation

Assess whether retrieved documents are relevant to the input query.

  • Criteria Evaluator (Relevance):
    • Uses an LLM to score the relevance of retrieved documents.
    • Use Case: Validating document appropriateness in RAG pipelines.
    • Example:
    • def evaluate_retrieval_relevance(run, example):
              retrieved_docs = run.outputs.get("retrieved_docs", [])
              question = example.inputs.get("question", "")
              if not retrieved_docs:
                  return {"key": "retrieval_relevance", "score": 0.0, "comment": "No documents retrieved."}
              prediction = retrieved_docs[0]["page_content"]
              evaluator = load_evaluator(EvaluatorType.CRITERIA, criteria="relevance", llm=llm)
              result = evaluator.evaluate_strings(prediction=prediction, input=question)
              return {
                  "key": "retrieval_relevance",
                  "score": result["score"],
                  "comment": result.get("reasoning", "")
              }
  • Embedding Distance:
    • Measures semantic similarity between query and retrieved documents.
    • Use Case: Evaluating contextual alignment.
    • Example:
    • def evaluate_embedding_distance(run, example):
              retrieved_docs = run.outputs.get("retrieved_docs", [])
              question = example.inputs.get("question", "")
              if not retrieved_docs:
                  return {"key": "embedding_distance", "score": 1.0, "comment": "No documents retrieved."}
              prediction = retrieved_docs[0]["page_content"]
              evaluator = load_evaluator(EvaluatorType.EMBEDDING_DISTANCE, embeddings=embedding_function)
              result = evaluator.evaluate_strings(prediction=prediction, reference=question)
              return {
                  "key": "embedding_distance",
                  "score": result["score"],
                  "comment": "Cosine distance between query and top document."
              }

2. Precision and Recall Evaluation

Measure the proportion of relevant documents retrieved (precision) and the proportion of all relevant documents retrieved (recall).

  • Precision Evaluator:
    • Calculates the fraction of retrieved documents that are relevant.
    • Use Case: Ensuring high-quality document selection.
    • Example:
    • def evaluate_retrieval_precision(run, example):
              retrieved_docs = run.outputs.get("retrieved_docs", [])
              expected_docs = example.outputs.get("expected_docs", [])
              if not retrieved_docs or not expected_docs:
                  return {"key": "retrieval_precision", "score": 0.0, "comment": "No documents retrieved or expected."}
              retrieved_texts = [doc["page_content"] for doc in retrieved_docs]
              relevant_count = sum(1 for doc in retrieved_texts if doc in expected_docs)
              precision = relevant_count / len(retrieved_docs)
              return {
                  "key": "retrieval_precision",
                  "score": precision,
                  "comment": f"{relevant_count}/{len(retrieved_docs)} retrieved documents are relevant."
              }
  • Recall Evaluator:
    • Calculates the fraction of all relevant documents that were retrieved.
    • Use Case: Ensuring comprehensive document coverage.
    • Example:
    • def evaluate_retrieval_recall(run, example):
              retrieved_docs = run.outputs.get("retrieved_docs", [])
              expected_docs = example.outputs.get("expected_docs", [])
              if not expected_docs:
                  return {"key": "retrieval_recall", "score": 0.0, "comment": "No expected documents provided."}
              retrieved_texts = [doc["page_content"] for doc in retrieved_docs]
              relevant_count = sum(1 for doc in expected_docs if doc in retrieved_texts)
              recall = relevant_count / len(expected_docs)
              return {
                  "key": "retrieval_recall",
                  "score": recall,
                  "comment": f"{relevant_count}/{len(expected_docs)} expected documents retrieved."
              }
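
The precision and recall evaluators above can be combined into a single F1 score. The function below is a sketch that follows the same run/example structure:

def evaluate_retrieval_f1(run, example):
    retrieved_docs = run.outputs.get("retrieved_docs", [])
    expected_docs = example.outputs.get("expected_docs", [])
    if not retrieved_docs or not expected_docs:
        return {"key": "retrieval_f1", "score": 0.0, "comment": "No documents retrieved or expected."}
    retrieved_texts = [doc["page_content"] for doc in retrieved_docs]
    relevant_count = sum(1 for doc in retrieved_texts if doc in expected_docs)
    precision = relevant_count / len(retrieved_docs)
    recall = relevant_count / len(expected_docs)
    if precision + recall == 0:
        return {"key": "retrieval_f1", "score": 0.0, "comment": "No overlap between retrieved and expected documents."}
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "key": "retrieval_f1",
        "score": f1,
        "comment": f"Precision {precision:.2f}, recall {recall:.2f}."
    }

Passing this function alongside the precision and recall evaluators in the evaluators list of evaluate() reports all three scores per example.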

3. Ranking Quality Evaluation

Assess the order of retrieved documents to ensure the most relevant are ranked highest.

  • Custom Ranking Metric:
    • Evaluates whether the top-ranked document is the most relevant.
    • Use Case: Optimizing retriever ranking algorithms.
    • Example:
    • def evaluate_ranking_quality(run, example):
              retrieved_docs = run.outputs.get("retrieved_docs", [])
              expected_docs = example.outputs.get("expected_docs", [])
              if not retrieved_docs or not expected_docs:
                  return {"key": "ranking_quality", "score": 0.0, "comment": "No documents retrieved or expected."}
              top_doc = retrieved_docs[0]["page_content"]
              score = 1.0 if top_doc in expected_docs else 0.5
              return {
                  "key": "ranking_quality",
                  "score": score,
                  "comment": "Checks if the top-ranked document is expected."
              }
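
A rank-sensitive alternative is the reciprocal rank of the first expected document, which rewards retrievers that place relevant documents higher; averaged over a dataset this is the standard mean reciprocal rank (MRR). A sketch following the same pattern:

def evaluate_reciprocal_rank(run, example):
    retrieved_docs = run.outputs.get("retrieved_docs", [])
    expected_docs = example.outputs.get("expected_docs", [])
    if not retrieved_docs or not expected_docs:
        return {"key": "reciprocal_rank", "score": 0.0, "comment": "No documents retrieved or expected."}
    for rank, doc in enumerate(retrieved_docs, start=1):
        if doc["page_content"] in expected_docs:
            return {
                "key": "reciprocal_rank",
                "score": 1.0 / rank,
                "comment": f"First expected document found at rank {rank}."
            }
    return {"key": "reciprocal_rank", "score": 0.0, "comment": "No expected document was retrieved."}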

4. Custom Retrieval Metrics

Define custom metrics to evaluate retriever-specific behaviors.

  • Custom Evaluator (Context Specificity):
    • Assesses whether retrieved documents provide specific, detailed context.
    • Example:
    • class ContextSpecificityEvaluator(StringEvaluator):
              def __init__(self, llm):
                  self.llm = llm
      
              def _evaluate_strings(self, prediction: str, input: str = None, **kwargs) -> Dict[str, Any]:
                  evaluator = load_evaluator(
                      EvaluatorType.CRITERIA,
                      criteria={"specificity": "Does the document provide detailed, specific context for the query?"},
                      llm=self.llm
                  )
                  result = evaluator.evaluate_strings(prediction=prediction, input=input)
                  return {
                      "key": "context_specificity",
                      "score": result["score"],
                      "reasoning": result.get("reasoning", "")
                  }
      
          def evaluate_context_specificity(run, example):
              retrieved_docs = run.outputs.get("retrieved_docs", [])
              question = example.inputs.get("question", "")
              if not retrieved_docs:
                  return {"key": "context_specificity", "score": 0.0, "comment": "No documents retrieved."}
              prediction = retrieved_docs[0]["page_content"]
              evaluator = ContextSpecificityEvaluator(llm=llm)
              result = evaluator.evaluate_strings(prediction=prediction, input=question)
              return result

5. Human-in-the-Loop Validation

Supplement automated retrieval evaluation with human feedback for subjective or ambiguous cases.

  • LangSmith HITL:
    • Use LangSmith to collect human feedback on retrieved document relevance.
    • Example: In LangSmith UI, reviewers score documents for “relevance” (0-1) with comments like “Document mentions Eiffel Tower but not other landmarks.”
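
Human judgments can also be attached to traced runs programmatically. Assuming you have a run ID from a LangSmith experiment, the sketch below records reviewer feedback via the LangSmith client (the run ID and comment are placeholders):

from langsmith import Client

ls_client = Client()

# Attach a reviewer's judgment to a traced run (run_id and comment are placeholders)
ls_client.create_feedback(
    run_id="<run-id-from-langsmith>",
    key="human_relevance",
    score=0.5,
    comment="Document mentions the Eiffel Tower but not other landmarks."
)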

Comprehensive Example

Here’s a complete system evaluating a LangChain retrieval pipeline with automated and custom metrics using LangSmith. Both Chroma and MongoDB Atlas vector stores are configured; the retriever in this example uses the Chroma store, with Atlas shown as an alternative backend:

from langchain_chroma import Chroma
from langchain_mongodb import MongoDBAtlasVectorSearch
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.documents import Document
from langchain.evaluation import load_evaluator, EvaluatorType, StringEvaluator
from langchain.chains import RetrievalQA
from langchain_core.prompts import PromptTemplate
from langsmith import Client
from langsmith.evaluation import evaluate
from pymongo import MongoClient
import os
import logging
import time
from typing import Dict, Any

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Set LangSmith environment variables
os.environ["LANGSMITH_API_KEY"] = ""
os.environ["LANGSMITH_PROJECT"] = "retrieval-pipeline-evaluation"

# Initialize embeddings and language model
embedding_function = OpenAIEmbeddings(model="text-embedding-3-small")
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

# Create sample documents
documents = [
    Document(page_content="The capital of France is Paris.", metadata={"source": "geo"}),
    Document(page_content="The Eiffel Tower is in Paris.", metadata={"source": "landmark"}),
    Document(page_content="Florida is a state in the USA.", metadata={"source": "geo"})
]

# Initialize Chroma and MongoDB Atlas vector stores (the retriever below uses Chroma; Atlas is an alternative backend)
chroma_store = Chroma.from_documents(
    documents,
    embedding_function,
    collection_name="langchain_example",
    persist_directory="./chroma_db"
)
mongo_client = MongoClient("mongodb+srv://:@.mongodb.net/")  # fill in your Atlas credentials and cluster host
collection = mongo_client["langchain_db"]["example_collection"]
mongo_store = MongoDBAtlasVectorSearch.from_documents(
    documents,
    embedding_function,
    collection=collection,
    index_name="vector_index"
)

# Set up retriever and RetrievalQA pipeline
retriever = chroma_store.as_retriever(search_kwargs={"k": 2})
# Note: this template is illustrative; RetrievalQA uses its default QA prompt unless one is passed via chain_type_kwargs
prompt = PromptTemplate.from_template("Answer: {question}")
qa_pipeline = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever
)

# Initialize LangSmith client
ls_client = Client()

# Create or load dataset
dataset_name = "retrieval_pipeline_dataset"
try:
    dataset = ls_client.create_dataset(dataset_name=dataset_name)
except Exception:
    # Dataset already exists; reuse it
    dataset = ls_client.get_dataset(dataset_name=dataset_name)

# Add examples to dataset
examples = [
    {
        "input": "What is the capital of France?",
        "expected_docs": ["The capital of France is Paris."],
        "output": "Paris"
    },
    {
        "input": "Where is the Eiffel Tower?",
        "expected_docs": ["The Eiffel Tower is in Paris."],
        "output": "Paris"
    },
    {
        "input": "Describe Paris landmarks.",
        "expected_docs": ["The Eiffel Tower is in Paris."],
        "output": ""
    }
]
for example in examples:
    ls_client.create_example(
        inputs={"question": example["input"]},
        outputs={"answer": example["output"], "expected_docs": example["expected_docs"]},
        dataset_id=dataset.id
    )

# Define custom evaluator
class ContextSpecificityEvaluator(StringEvaluator):
    def __init__(self, llm):
        self.llm = llm

    def _evaluate_strings(self, prediction: str, input: str = None, **kwargs) -> Dict[str, Any]:
        evaluator = load_evaluator(
            EvaluatorType.CRITERIA,
            criteria={"specificity": "Does the document provide detailed, specific context for the query?"},
            llm=self.llm
        )
        result = evaluator.evaluate_strings(prediction=prediction, input=input)
        return {
            "key": "context_specificity",
            "score": result["score"],
            "reasoning": result.get("reasoning", "")
        }

# Define evaluation functions
def evaluate_retrieval_relevance(run, example) -> Dict[str, Any]:
    retrieved_docs = run.outputs.get("retrieved_docs", [])
    question = example.inputs.get("question", "")
    if not retrieved_docs:
        return {"key": "retrieval_relevance", "score": 0.0, "comment": "No documents retrieved."}
    prediction = retrieved_docs[0]["page_content"]
    evaluator = load_evaluator(EvaluatorType.CRITERIA, criteria="relevance", llm=llm)
    result = evaluator.evaluate_strings(prediction=prediction, input=question)
    return {
        "key": "retrieval_relevance",
        "score": result["score"],
        "comment": result.get("reasoning", "")
    }

def evaluate_retrieval_precision(run, example) -> Dict[str, Any]:
    retrieved_docs = run.outputs.get("retrieved_docs", [])
    expected_docs = example.outputs.get("expected_docs", [])
    if not retrieved_docs or not expected_docs:
        return {"key": "retrieval_precision", "score": 0.0, "comment": "No documents retrieved or expected."}
    retrieved_texts = [doc["page_content"] for doc in retrieved_docs]
    relevant_count = sum(1 for doc in retrieved_texts if doc in expected_docs)
    precision = relevant_count / len(retrieved_docs)
    return {
        "key": "retrieval_precision",
        "score": precision,
        "comment": f"{relevant_count}/{len(retrieved_docs)} retrieved documents are relevant."
    }

def evaluate_context_specificity(run, example) -> Dict[str, Any]:
    retrieved_docs = run.outputs.get("retrieved_docs", [])
    question = example.inputs.get("question", "")
    if not retrieved_docs:
        return {"key": "context_specificity", "score": 0.0, "comment": "No documents retrieved."}
    prediction = retrieved_docs[0]["page_content"]
    evaluator = ContextSpecificityEvaluator(llm=llm)
    result = evaluator.evaluate_strings(prediction=prediction, input=question)
    return result

# Run retrieval evaluation
start_time = time.time()
results = evaluate(
    lambda inputs: {
        "result": qa_pipeline.invoke({"query": inputs["question"]})["result"],
        "retrieved_docs": [
            {"page_content": doc.page_content, "metadata": doc.metadata}
            for doc in retriever.invoke(inputs["question"])
        ]
    },
    data=dataset_name,
    evaluators=[evaluate_retrieval_relevance, evaluate_retrieval_precision, evaluate_context_specificity],
    experiment_prefix="retrieval_pipeline_test",
    metadata={"version": "1.0", "evaluated_at": "2025-05-15T16:08:00Z"}
)

# Log results
logger.info(f"Retrieval pipeline testing completed in {time.time() - start_time:.2f} seconds")
print(f"Test Results: {results}")
print("View detailed results in LangSmith dashboard under 'retrieval_pipeline_test' experiment.")

Output:

Retrieval pipeline testing completed in 10.45 seconds
Test Results: 
View detailed results in LangSmith dashboard under 'retrieval_pipeline_test' experiment.

The evaluation tests the retriever for relevance, precision, and context specificity, logging results in LangSmith for detailed analysis via the dashboard.

Best Practices

  1. Define Clear Metrics: Use relevance and precision for retrieval quality, and custom metrics like specificity for nuanced evaluation.
  2. Curate Diverse Datasets: Include varied queries, factual and open-ended, to test retriever robustness.
  3. Combine Metrics: Evaluate both retrieval (e.g., precision) and downstream output (e.g., correctness) for holistic pipeline testing.
  4. Leverage LangSmith: Use LangSmith for dataset management, tracking, and visualization of evaluation results.
  5. Optimize Retriever Parameters: Test different k values or search strategies to improve performance (a parameter-sweep sketch follows this list).
  6. Supplement with HITL: Use human-in-the-loop feedback for subjective or ambiguous retrieval cases (see Human-in-the-Loop Evaluation).
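
For best practice 5, one straightforward approach is to run a separate LangSmith experiment per k value over the same dataset. A sketch assuming the vector_store, dataset_name, and evaluator functions defined earlier:

def make_target(r):
    # Bind the retriever so each experiment uses its own configuration
    def target(inputs):
        return {
            "retrieved_docs": [
                {"page_content": doc.page_content, "metadata": doc.metadata}
                for doc in r.invoke(inputs["question"])
            ]
        }
    return target

for k in (1, 2, 4):
    retriever_k = vector_store.as_retriever(search_kwargs={"k": k})
    evaluate(
        make_target(retriever_k),
        data=dataset_name,
        evaluators=[evaluate_retrieval_relevance, evaluate_retrieval_precision],
        experiment_prefix=f"retrieval_test_k{k}",
        metadata={"k": k}
    )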

Error Handling

  • Retriever Failures: Handle cases where no documents are retrieved by returning default scores (e.g., 0.0).
  • LLM Errors: Implement retries or fallback models for evaluation failures (a retry sketch follows this list).
  • Dataset Issues: Validate dataset format to ensure expected documents are correctly specified.
  • Resource Limits: Batch evaluations to manage API costs and rate limits.
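
For LLM errors in particular, a thin retry wrapper around an LLM-based evaluator keeps a single flaky API call from failing the whole experiment. The sketch below reuses evaluate_retrieval_relevance, logger, and time from the setup above; setting max_retries on the judge model (e.g., ChatOpenAI(..., max_retries=3)) helps as well:

def evaluate_relevance_with_retry(run, example):
    # Wrap the LLM-based relevance evaluator so one transient API error doesn't abort the experiment
    attempts = 3
    for attempt in range(1, attempts + 1):
        try:
            return evaluate_retrieval_relevance(run, example)
        except Exception as exc:
            logger.warning(f"Relevance evaluation failed (attempt {attempt}/{attempts}): {exc}")
            time.sleep(2 ** attempt)  # simple exponential backoff
    return {"key": "retrieval_relevance", "score": 0.0, "comment": "Evaluator failed after retries."}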

See Troubleshooting.

Limitations

  • LLM Bias: Judgment-based metrics like relevance may vary by model or prompt.
  • Dataset Dependency: Evaluation quality depends on well-curated datasets with accurate expected documents.
  • Cost: LLM-based evaluations can be expensive for large datasets.
  • Metric Applicability: Precision and recall require clear expected documents, limiting use for open-ended queries.

Recent Developments

  • 2025 Updates: LangSmith introduced advanced retrieval evaluation templates for precision and recall.
  • Community Feedback: X posts highlight retrieval evaluation for optimizing RAG systems in legal and medical domains.
  • LangSmith UI: Enhanced analytics for visualizing retrieval performance trends and document rankings.

Conclusion

Evaluating retrieval in LangChain is crucial for optimizing RAG pipelines and ensuring high-quality, relevant document selection. By leveraging automated metrics, custom evaluators, and LangSmith integration, developers can assess relevance, precision, and context specificity, enhancing pipeline performance. Start evaluating your LangChain retrievers to deliver robust, accurate AI applications tailored to your use case.

For official documentation, visit LangSmith Documentation.