Debugging with Evaluations in LangChain for Effective AI System Optimization
Introduction
Debugging AI-driven applications is essential for identifying and resolving issues that affect performance, accuracy, and user experience. LangChain, a framework for building applications powered by large language models (LLMs), provides evaluation tools in its langchain.evaluation module that support debugging through systematic performance assessment. Documented under the /langchain/evaluation/debugging-with-evals path, debugging with evaluations combines metrics such as correctness, relevance, and custom scores with detailed logging and visualization in LangSmith to pinpoint errors, diagnose root causes, and optimize components such as chains, agents, and retrievers. This guide explores debugging with evaluations in LangChain, covering setup, core techniques, practical applications, best practices, and error handling, so developers can improve their AI systems effectively.
To understand LangChain’s broader evaluation ecosystem, start with LangChain Evaluation Introduction.
What is Debugging with Evaluations in LangChain?
Debugging with evaluations in LangChain involves using the langchain.evaluation module to assess the performance of LangChain components (e.g., chains, agents, retrievers) and identify issues by analyzing metrics, logs, and feedback. Evaluations provide quantitative scores (e.g., correctness, latency) and qualitative insights (e.g., reasoning comments, human feedback) to diagnose problems such as incorrect outputs, irrelevant retrievals, or inefficient processing. LangSmith enhances debugging by logging results, enabling visualization of metrics, and supporting human-in-the-loop (HITL) feedback for subjective analysis. This approach is essential for optimizing AI systems, ensuring reliability, and improving user satisfaction.
For related concepts, see LangChain Metrics Overview and Logging Results.
Why Debug with Evaluations?
Debugging with evaluations is critical for:
- Error Identification: Pinpoint specific failures in accuracy, relevance, or performance.
- Root Cause Analysis: Diagnose issues in prompts, retrievers, or model behavior.
- Optimization: Refine components based on actionable insights from metrics and feedback.
- Reliability: Ensure consistent, high-quality outputs in production.
Explore evaluation capabilities at the LangChain Evaluation Documentation.
Setting Up Debugging with Evaluations
To debug with evaluations in LangChain, you need to install the required packages (for the example below: langchain, langchain-openai, langchain-chroma, chromadb, and langsmith), configure a pipeline, define evaluators, set up a test dataset, and use LangSmith for logging and visualization. Below is a setup for debugging a RetrievalQA pipeline with automated and manual evaluations:
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.documents import Document
from langchain.evaluation import load_evaluator, EvaluatorType, StringEvaluator
from langchain.chains import RetrievalQA
from langchain_core.prompts import PromptTemplate
from langsmith import Client
from langsmith.evaluation import evaluate
import os
import logging
import time
from typing import Dict, Any
# Set up logging
logging.basicConfig(level=logging.DEBUG) # Use DEBUG for detailed logging
logger = logging.getLogger(__name__)
# Set LangSmith environment variables
os.environ["LANGSMITH_API_KEY"] = ""
os.environ["LANGSMITH_PROJECT"] = "debugging-evaluations"
# Initialize embeddings and language model
embedding_function = OpenAIEmbeddings(model="text-embedding-3-small")
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
# Create sample documents
documents = [
    Document(page_content="The capital of France is Paris.", metadata={"source": "geo"}),
    Document(page_content="The Eiffel Tower is in Paris.", metadata={"source": "landmark"}),
    Document(page_content="Florida is a state in the USA.", metadata={"source": "geo"})
]
# Initialize Chroma vector store
vector_store = Chroma.from_documents(
    documents,
    embedding_function,
    collection_name="langchain_example",
    persist_directory="./chroma_db"
)
# Set up RetrievalQA pipeline
prompt = PromptTemplate.from_template(
    "Use the following context to answer the question.\n\nContext: {context}\n\nQuestion: {question}\n\nAnswer:"
)
qa_pipeline = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vector_store.as_retriever(search_kwargs={"k": 2}),
    chain_type_kwargs={"prompt": prompt},  # Pass the custom prompt to the stuff chain
    return_source_documents=True  # Return documents for debugging
)
# Initialize LangSmith client
client = Client()
# Create or load a dataset
dataset_name = "qa_debugging_dataset"
try:
    dataset = client.create_dataset(dataset_name=dataset_name)
except Exception:
    dataset = client.read_dataset(dataset_name=dataset_name)  # Reuse the dataset if it already exists
# Add examples to dataset
examples = [
    {
        "input": "What is the capital of France?",
        "output": "Paris",
        "expected_docs": ["The capital of France is Paris."]
    },
    {
        "input": "Where is the Eiffel Tower?",
        "output": "Paris",
        "expected_docs": ["The Eiffel Tower is in Paris."]
    },
    {
        "input": "What is the capital of Florida?",
        "output": "Tallahassee",
        "expected_docs": ["Florida is a state in the USA."]  # Incorrect document for debugging
    },
    {
        "input": "Describe Paris landmarks.",
        "output": ""
    }
]
for example in examples:
    client.create_example(
        inputs={"question": example["input"]},
        outputs={"answer": example["output"], "expected_docs": example.get("expected_docs", [])},
        dataset_id=dataset.id
    )
# Define custom evaluator for debugging
class CompletenessEvaluator(StringEvaluator):
    def __init__(self, llm):
        self.llm = llm

    def _evaluate_strings(self, prediction: str, input: str = None, **kwargs) -> Dict[str, Any]:
        evaluator = load_evaluator(
            EvaluatorType.CRITERIA,
            criteria={"completeness": "Does the response provide sufficient information to answer the query?"},
            llm=self.llm
        )
        result = evaluator.evaluate_strings(prediction=prediction, input=input)
        return {
            "key": "completeness",
            "score": result["score"],
            "comment": result.get("reasoning", "")
        }
# Define evaluation functions
def evaluate_correctness(run, example) -> Dict[str, Any]:
    prediction = run.outputs.get("result", "")
    reference = example.outputs.get("answer", "")
    question = example.inputs.get("question", "")
    if not reference:
        return {"key": "correctness", "score": None, "comment": "No reference provided."}
    evaluator = load_evaluator(EvaluatorType.QA, llm=llm)
    try:
        result = evaluator.evaluate_strings(
            prediction=prediction,
            reference=reference,
            input=question
        )
        return {"key": "correctness", "score": result["score"], "comment": result.get("reasoning", "")}
    except Exception as e:
        logger.error(f"Correctness evaluation failed: {e}")
        return {"key": "correctness", "score": 0.0, "comment": f"Evaluation error: {e}"}
def evaluate_retrieval_relevance(run, example) -> Dict[str, Any]:
    retrieved_docs = run.outputs.get("source_documents", [])
    question = example.inputs.get("question", "")
    if not retrieved_docs:
        return {"key": "retrieval_relevance", "score": 0.0, "comment": "No documents retrieved."}
    prediction = retrieved_docs[0].page_content
    evaluator = load_evaluator(EvaluatorType.CRITERIA, criteria="relevance", llm=llm)
    try:
        result = evaluator.evaluate_strings(prediction=prediction, input=question)
        return {
            "key": "retrieval_relevance",
            "score": result["score"],
            "comment": result.get("reasoning", "")
        }
    except Exception as e:
        logger.error(f"Retrieval relevance evaluation failed: {e}")
        return {"key": "retrieval_relevance", "score": 0.0, "comment": f"Evaluation error: {e}"}
def evaluate_completeness(run, example) -> Dict[str, Any]:
    prediction = run.outputs.get("result", "")
    question = example.inputs.get("question", "")
    evaluator = CompletenessEvaluator(llm=llm)
    try:
        result = evaluator.evaluate_strings(prediction=prediction, input=question)
        return result
    except Exception as e:
        logger.error(f"Completeness evaluation failed: {e}")
        return {"key": "completeness", "score": 0.0, "comment": f"Evaluation error: {e}"}
# Run evaluation and log results
start_time = time.time()
def run_pipeline(inputs: Dict[str, Any]) -> Dict[str, Any]:
    call_start = time.time()
    response = qa_pipeline.invoke({"query": inputs["question"]})  # Invoke the pipeline once and reuse the response
    return {
        "result": response["result"],
        "source_documents": response["source_documents"],
        "latency": time.time() - call_start  # Per-example latency in seconds
    }

results = evaluate(
    run_pipeline,
    data=dataset_name,
    evaluators=[evaluate_correctness, evaluate_retrieval_relevance, evaluate_completeness],
    experiment_prefix="qa_debugging_test",
    metadata={"version": "1.0", "evaluated_at": "2025-05-15T15:45:00Z"}
)
# Log results
logger.info(f"Evaluation completed in {time.time() - start_time:.2f} seconds")
print(f"Test Results: {results}")
print("Debug results in LangSmith dashboard under 'qa_debugging_test' experiment.")
print("Instructions: Navigate to LangSmith, select the 'debugging-evaluations' project, review low-scoring examples, and add HITL feedback for subjective metrics like completeness.")
Manual Evaluation Setup (LangSmith HITL)
1. Access LangSmith Dashboard:
- Log in to LangSmith with your API key.
- Navigate to the project (debugging-evaluations) and experiment (qa_debugging_test).
2. Review Outputs:
- Examine each example’s input, output, retrieved documents, and automated scores (correctness, retrieval relevance, completeness).
- Focus on low-scoring examples (e.g., correctness < 0.5) for debugging.
3. Annotate Manual Feedback:
- Add scores (0-1) for subjective metrics like “coherence” or “context appropriateness.”
- Include comments, e.g., “Retrieved document is irrelevant; response lacks factual accuracy.”
4. Save and Analyze:
- Save annotations to log manual feedback.
- Use LangSmith’s visualization tools (e.g., tables, charts) to compare automated and manual scores.
Output:
Evaluation completed in 8.45 seconds
Test Results:
Debug results in LangSmith dashboard under 'qa_debugging_test' experiment.
Instructions: Navigate to LangSmith, select the 'debugging-evaluations' project, review low-scoring examples, and add HITL feedback for subjective metrics like completeness.
This setup evaluates a RetrievalQA pipeline with automated metrics (correctness, retrieval relevance, completeness) and prepares for manual HITL feedback in LangSmith, logging all results for debugging.
Core Techniques for Debugging with Evaluations
1. Automated Metric Analysis
Use automated evaluators to identify errors through low scores and their accompanying reasoning comments; a short reproduction sketch follows the list below.
- Correctness Debugging:
- Detect factual errors by analyzing low correctness scores.
- Example: A low score (e.g., 0.0) for “What is the capital of Florida?” with comment “Prediction ‘Tallahassee’ does not match reference” indicates a retrieval or model error.
- Action: Check retrieved documents (e.g., “Florida is a state in the USA.” is irrelevant) and refine the retriever or prompt.
- Retrieval Relevance Debugging:
- Identify irrelevant document retrievals with low relevance scores.
- Example: A low retrieval relevance score (e.g., 0.3) with comment “Document does not address query” suggests poor vector store embeddings or query mismatch.
- Action: Adjust embedding model or retriever parameters (e.g., increase k or use MMR search).
- Completeness Debugging:
- Spot incomplete responses with low completeness scores.
- Example: A low completeness score (e.g., 0.4) for “Describe Paris landmarks” with comment “Response lacks sufficient detail” indicates prompt or context issues.
- Action: Modify prompt to request detailed responses or increase retriever k.
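To act on these signals, it often helps to reproduce a failing example outside the experiment run. The following is a minimal sketch that reuses the llm, qa_pipeline, and load_evaluator objects from the setup above to re-run the failing Florida query, print the QA evaluator's score and reasoning, and inspect the retrieved documents:
# Reproduce a low-scoring example locally to inspect the evaluator's reasoning
failing_query = "What is the capital of Florida?"
response = qa_pipeline.invoke({"query": failing_query})
qa_evaluator = load_evaluator(EvaluatorType.QA, llm=llm)
check = qa_evaluator.evaluate_strings(
    prediction=response["result"],
    reference="Tallahassee",
    input=failing_query
)
print(f"Correctness score: {check['score']}, Reasoning: {check.get('reasoning', '')}")
# Inspect retrieved documents to see whether retrieval, rather than generation, is the root cause
for doc in response["source_documents"]:
    print(f"Retrieved: {doc.page_content} (source: {doc.metadata.get('source')})")
If none of the retrieved documents mention Tallahassee, the fix belongs in the retriever or the indexed documents rather than the prompt.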
2. Detailed Logging
Enable verbose logging to capture evaluation details for debugging; see the wrapper sketch after this list.
- DEBUG-Level Logging:
- Log inputs, outputs, scores, and errors for each example.
- Example:
logger.debug(f"Input: {question}, Prediction: {prediction}, Score: {result['score']}, Comment: {result['comment']}")
- Action: Review logs to trace errors, e.g., “Retrieved document irrelevant” points to retriever issues.
- Error Logging:
- Capture exceptions during evaluation to diagnose failures.
- Example: logger.error(f"Evaluation failed: {e}") logs API or model errors.
- Action: Investigate logged errors (e.g., API timeout) and implement retries.
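As a sketch of this logging pattern (the wrapper below is illustrative, not a LangSmith API), an evaluator can be wrapped so every example's input, prediction, score, and comment are traced at DEBUG level and low scores are flagged as warnings:
def evaluate_correctness_with_logging(run, example) -> Dict[str, Any]:
    question = example.inputs.get("question", "")
    prediction = run.outputs.get("result", "")
    logger.debug(f"Input: {question!r}, Prediction: {prediction!r}")
    result = evaluate_correctness(run, example)  # Reuse the evaluator from the setup
    logger.debug(f"Score: {result['score']}, Comment: {result['comment']}")
    if result["score"] is not None and result["score"] < 0.5:
        logger.warning(f"Low correctness for {question!r}: {result['comment']}")
    return result
Passing this wrapper in the evaluators list instead of evaluate_correctness keeps the scores identical while making failures easy to trace in the logs.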
3. LangSmith Visualization
Use LangSmith’s dashboard to visualize evaluation results and pinpoint issues; a programmatic triage sketch follows the list below.
- Per-Example Tables:
- View inputs, outputs, scores, and comments to identify problematic examples.
- Example: A table showing low correctness for “What is the capital of Florida?” highlights a specific failure.
- Action: Inspect retrieved documents and response to diagnose the issue.
- Metric Distributions:
- Plot histograms of scores (e.g., correctness, relevance) to identify outliers.
- Example: A histogram showing most correctness scores >0.8 but one at 0.0 flags an anomaly.
- Action: Focus debugging on the outlier example.
- Experiment Comparison:
- Compare experiments (e.g., different retriever settings) to assess improvements.
- Example: A bar chart showing higher relevance scores for k=3 vs. k=2 suggests increasing k.
- Action: Update retriever configuration based on comparison.
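The dashboard is the primary place for these views, but results can also be pulled programmatically for quick triage. The sketch below assumes the experiment's project name has been copied from the evaluate() output (the name shown here is a hypothetical placeholder) and uses the langsmith client's list_runs and list_feedback methods to print low-scoring correctness runs:
experiment_project = "qa_debugging_test-<suffix>"  # Hypothetical; copy the real experiment name from the evaluate() output
for run in client.list_runs(project_name=experiment_project):
    for feedback in client.list_feedback(run_ids=[run.id]):
        if feedback.key == "correctness" and (feedback.score or 0) < 0.5:
            print(f"Low correctness run {run.id}: inputs={run.inputs}, comment={feedback.comment}")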
4. Human-in-the-Loop Debugging
Incorporate HITL feedback to diagnose subjective or complex issues; a feedback-logging sketch follows this list.
- Manual Annotations:
- Reviewers add scores and comments for subjective metrics (e.g., coherence).
- Example: A comment “Response is factually correct but overly brief” for “Describe Paris landmarks” indicates a prompt issue.
- Action: Revise prompt to request more detailed responses.
- Error Flagging:
- Reviewers flag examples with unexpected behavior for deeper analysis.
- Example: Flagging “What is the capital of Florida?” due to incorrect document retrieval.
- Action: Investigate vector store embeddings or document indexing.
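Manual feedback is usually added in the LangSmith UI, but it can also be recorded programmatically with the client's create_feedback method, as in the sketch below (the run ID is a placeholder to be copied from the dashboard or from list_runs):
client.create_feedback(
    run_id="<run-id-from-langsmith>",  # Placeholder; use a real run ID
    key="coherence",
    score=0.4,
    comment="Factually correct but overly brief; the prompt likely needs an instruction to elaborate."
)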
5. Iterative Debugging Workflow
Use evaluations to iteratively debug and optimize the pipeline, as sketched after this list.
- Identify Issues:
- Review low scores, comments, and logs to pinpoint errors (e.g., irrelevant documents).
- Diagnose Root Causes:
- Analyze inputs, outputs, and retrieved documents to identify causes (e.g., poor embeddings).
- Implement Fixes:
- Adjust pipeline components (e.g., retriever k, prompt, or model).
- Example: Increase k to 3 or add a prompt instruction like “Provide detailed answers.”
- Re-evaluate:
- Run evaluations again to verify improvements.
- Example: Re-run evaluate and check if correctness score for “What is the capital of Florida?” improves.
- Log and Visualize:
- Log results in LangSmith and visualize changes (e.g., trend graph showing improved relevance).
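Putting the loop together, one iteration might look like the sketch below, which reuses the objects from the setup, widens the retriever from k=2 to k=3, and re-runs the evaluation under a new experiment prefix so both experiments can be compared side by side in LangSmith:
# One debugging iteration: widen retrieval and re-evaluate under a new experiment prefix
qa_pipeline_v2 = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vector_store.as_retriever(search_kwargs={"k": 3}),  # k increased from 2 to 3
    chain_type_kwargs={"prompt": prompt},
    return_source_documents=True
)
def run_pipeline_v2(inputs: Dict[str, Any]) -> Dict[str, Any]:
    response = qa_pipeline_v2.invoke({"query": inputs["question"]})
    return {"result": response["result"], "source_documents": response["source_documents"]}
results_v2 = evaluate(
    run_pipeline_v2,
    data=dataset_name,
    evaluators=[evaluate_correctness, evaluate_retrieval_relevance, evaluate_completeness],
    experiment_prefix="qa_debugging_test_k3",  # Distinct prefix for side-by-side comparison
    metadata={"version": "1.1", "retriever_k": 3}
)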
Practical Applications
1. Question Answering:
- Debug RAG pipelines for factual errors or incomplete responses (RetrievalQA Chain).
- Example: Fix low correctness scores by improving retriever relevance.
2. Semantic Search:
- Diagnose irrelevant document retrievals (Evaluating Retrieval).
- Example: Adjust embedding model based on low retrieval relevance scores.
3. Conversational Agents:
- Debug tool usage or response coherence issues (Evaluate Agent Behavior).
- Example: Refine agent prompts for better tool selection.
4. Production Systems:
- Monitor and debug performance regressions in real-time.
- Example: Use latency metrics to optimize pipeline efficiency.
Try the Document Search Engine Tutorial.
Best Practices
1. Use Comprehensive Metrics:
- Evaluate correctness, relevance, completeness, and latency to cover all aspects.
2. Enable Verbose Logging:
- Set logging to DEBUG for detailed error tracing.
3. Leverage LangSmith Visualizations:
- Use tables, histograms, and trend graphs to identify and prioritize issues.
4. Combine Automated and Manual:
- Use automated metrics for scalability and HITL for subjective debugging (Automated vs. Manual Evaluation).
5. Test with Diverse Datasets:
- Include factual, open-ended, and edge-case inputs to uncover hidden issues.
6. Iterate Rapidly:
- Debug, fix, and re-evaluate in short cycles to optimize efficiently.
Error Handling
- Evaluation Failures:
- Catch exceptions in evaluators and log errors for diagnosis.
- Example: logger.error(f"Evaluation failed: {e}") in evaluate_correctness.
- Retriever Issues:
- Handle cases with no retrieved documents by assigning low scores.
- Example: if not retrieved_docs: return {"score": 0.0, "comment": "No documents retrieved."}.
- Dataset Errors:
- Validate dataset format to avoid parsing issues.
- Resource Limits:
- Batch evaluations to manage API costs and rate limits; a retry-with-backoff sketch follows this list.
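For transient failures such as API timeouts or rate limits, a simple retry helper (illustrative only, not part of LangChain) can wrap the LLM-graded call before falling back to an error score, as sketched below:
def evaluate_strings_with_retries(evaluator, max_attempts: int = 3, **kwargs) -> Dict[str, Any]:
    # Retry transient failures (e.g., timeouts or rate limits) with exponential backoff
    for attempt in range(1, max_attempts + 1):
        try:
            return evaluator.evaluate_strings(**kwargs)
        except Exception as e:
            logger.error(f"Evaluation attempt {attempt} failed: {e}")
            if attempt == max_attempts:
                return {"score": 0.0, "reasoning": f"Evaluation error after {max_attempts} attempts: {e}"}
            time.sleep(2 ** attempt)  # Back off 2, 4, 8... seconds between attempts
Inside evaluate_correctness, the direct evaluator.evaluate_strings(...) call could then be replaced with evaluate_strings_with_retries(evaluator, prediction=prediction, reference=reference, input=question).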
See Troubleshooting.
Limitations
- LLM Bias: Automated metrics may be inconsistent due to model variability.
- Subjectivity: Manual debugging relies on reviewer expertise, introducing variability.
- Cost: LLM-based evaluations and LangSmith logging can be expensive for large datasets.
- Dataset Dependency: Debugging effectiveness depends on dataset quality and diversity.
Recent Developments
- 2025 Updates: LangSmith introduced enhanced debugging tools, including real-time error highlighting and automated issue flagging.
- Community Feedback: X posts emphasize debugging RAG pipelines with evaluations, focusing on retrieval relevance improvements.
- LangSmith UI: Improved debugging dashboards with interactive error analysis and experiment comparison.
Conclusion
Debugging with evaluations in LangChain, powered by LangSmith, enables developers to systematically identify, diagnose, and resolve issues in AI systems. By combining automated metrics, detailed logging, HITL feedback, and visualizations, developers can optimize pipelines for accuracy, relevance, and efficiency. Start debugging your LangChain projects with evaluations to ensure reliable, high-quality performance in production.
For official documentation, visit LangSmith Documentation.