Logging Evaluation Results in LangChain for Comprehensive Performance Tracking
Introduction
Effective evaluation of AI-driven applications requires not only assessing performance but also systematically logging results to track progress, identify trends, and facilitate iterative improvements. LangChain, a framework for building applications powered by large language models (LLMs), provides tools in its langchain.evaluation module that, together with LangSmith, log evaluation results for components such as chains, agents, and retrievers. Covered here under the /langchain/evaluation/logging-results path, logging evaluation results lets developers store, analyze, and visualize metrics such as accuracy, relevance, latency, and custom scores for comprehensive performance tracking. This guide explains how to log evaluation results in LangChain, covering setup, core techniques, best practices, practical applications, and advanced configurations, so developers can maintain robust, data-driven insights into their AI systems.
To understand LangChain’s broader evaluation ecosystem, start with LangChain Evaluation Introduction.
What is Logging Evaluation Results in LangChain?
Logging evaluation results in LangChain involves capturing and storing the outcomes of performance assessments for LangChain components, such as chains, retrievers, or agents, typically using LangSmith. These results include quantitative metrics (e.g., accuracy, latency), qualitative scores (e.g., relevance, coherence), and associated metadata (e.g., timestamps, experiment details). LangSmith provides a centralized platform to log results, organize them into experiments, and visualize performance trends, enabling developers to track improvements, detect regressions, and share insights with teams. Logging can be automated via the langsmith package and supports both built-in and custom evaluators, making it ideal for iterative development and production monitoring.
For related concepts, see LangChain Metrics Overview and LangSmith Evaluation.
Why Log Evaluation Results?
Logging evaluation results is essential for:
- Performance Tracking: Monitor metrics over time to assess improvements or regressions.
- Debugging: Identify issues in pipelines by analyzing logged metrics and comments.
- Collaboration: Share results with teams for collective analysis and decision-making.
- Reproducibility: Maintain a record of experiments with metadata for consistent evaluation.
Explore LangSmith’s logging capabilities at the LangSmith Documentation.
Setting Up Logging Evaluation Results
To log evaluation results in LangChain, you need to install the required packages, configure LangSmith, set up a pipeline, define evaluators, and create test datasets. Below is a setup for logging results from a RetrievalQA pipeline using LangSmith:
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.documents import Document
from langchain.evaluation import load_evaluator, EvaluatorType
from langchain.chains import RetrievalQA
from langchain_core.prompts import PromptTemplate
from langsmith import Client
from langsmith.evaluation import evaluate
import os
import logging
import time
from typing import Dict, Any
# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
# Set LangSmith environment variables
os.environ["LANGSMITH_API_KEY"] = ""
os.environ["LANGSMITH_PROJECT"] = "evaluation-logging"
# Initialize embeddings and language model
embedding_function = OpenAIEmbeddings(model="text-embedding-3-small")
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
# Create sample documents
documents = [
    Document(page_content="The capital of France is Paris.", metadata={"source": "geo"}),
    Document(page_content="The Eiffel Tower is in Paris.", metadata={"source": "landmark"}),
    Document(page_content="Florida is a state in the USA.", metadata={"source": "geo"})
]
# Initialize Chroma vector store
vector_store = Chroma.from_documents(
    documents,
    embedding_function,
    collection_name="langchain_example",
    persist_directory="./chroma_db"
)
# Set up RetrievalQA pipeline
prompt = PromptTemplate.from_template("Answer: {question}")  # optional; not wired into the chain below
qa_pipeline = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vector_store.as_retriever(search_kwargs={"k": 2}),
    return_source_documents=True
)
# Initialize LangSmith client
client = Client()
# Create or load a dataset
dataset_name = "qa_logging_dataset"
try:
    dataset = client.create_dataset(dataset_name=dataset_name)
except Exception:
    # Dataset already exists; load it instead
    dataset = client.read_dataset(dataset_name=dataset_name)
# Add examples to dataset
examples = [
    {
        "input": "What is the capital of France?",
        "output": "Paris",
        "expected_docs": ["The capital of France is Paris."]
    },
    {
        "input": "Where is the Eiffel Tower?",
        "output": "Paris",
        "expected_docs": ["The Eiffel Tower is in Paris."]
    },
    {
        "input": "What is the capital of Florida?",
        "output": "",
        "expected_docs": ["Florida is a state in the USA."]
    }
]
for example in examples:
    client.create_example(
        inputs={"question": example["input"]},
        outputs={"answer": example["output"], "expected_docs": example["expected_docs"]},
        dataset_id=dataset.id
    )
# Define evaluators
def evaluate_correctness(run, example) -> Dict[str, Any]:
    """Score answer correctness against the reference using the built-in QA evaluator."""
    prediction = run.outputs.get("result", "")
    reference = example.outputs.get("answer", "")
    question = example.inputs.get("question", "")
    if not reference:
        return {"key": "correctness", "score": None, "comment": "No reference provided."}
    evaluator = load_evaluator(EvaluatorType.QA, llm=llm)
    result = evaluator.evaluate_strings(
        prediction=prediction,
        reference=reference,
        input=question
    )
    return {"key": "correctness", "score": result["score"], "comment": result.get("reasoning", "")}

def evaluate_relevance(run, example) -> Dict[str, Any]:
    """Score how relevant the answer is to the question using the criteria evaluator."""
    prediction = run.outputs.get("result", "")
    question = example.inputs.get("question", "")
    evaluator = load_evaluator(EvaluatorType.CRITERIA, criteria="relevance", llm=llm)
    result = evaluator.evaluate_strings(
        prediction=prediction,
        input=question
    )
    return {"key": "relevance", "score": result["score"], "comment": result.get("reasoning", "")}

def evaluate_latency(run, example) -> Dict[str, Any]:
    """Convert per-query latency (seconds) into a 0-1 score; 1.0 under one second."""
    latency = run.outputs.get("latency", 0.0)
    score = 1.0 if latency < 1.0 else max(0.0, 1.0 - (latency - 1.0) / 5.0)
    return {
        "key": "latency",
        "score": score,
        "comment": f"Response latency: {latency:.2f} seconds."
    }
# Run evaluation and log results
def run_pipeline(inputs: Dict[str, Any]) -> Dict[str, Any]:
    call_start = time.time()
    output = qa_pipeline.invoke({"query": inputs["question"]})
    return {
        "result": output["result"],
        "latency": time.time() - call_start  # per-query latency rather than time since script start
    }

start_time = time.time()
results = evaluate(
    run_pipeline,
    data=dataset_name,
    evaluators=[evaluate_correctness, evaluate_relevance, evaluate_latency],
    experiment_prefix="qa_logging_test",
    metadata={"version": "1.0", "evaluated_at": "2025-05-15T15:35:00Z"}
)
# Log results
logger.info(f"Evaluation completed in {time.time() - start_time:.2f} seconds")
print(f"Test Results: {results}")
print("View detailed results in LangSmith dashboard under 'qa_logging_test' experiment.")
This setup creates a RetrievalQA pipeline, uploads a test dataset to LangSmith, evaluates outputs for correctness, relevance, and latency, and logs results in the LangSmith dashboard for analysis.
Installation
Install the core packages for LangChain, LangSmith, and evaluation:
pip install langchain langchain-chroma langchain-openai chromadb langsmith
For specific metrics, install additional dependencies:
- NLP Metrics: pip install nltk rouge-score for BLEU/ROUGE scores.
- Embedding Metrics: Included with langchain-openai.
- Comprehensive Example Extras: pip install langchain-community langchain-mongodb pymongo duckduckgo-search psutil for the agent example later in this guide.
Example:
pip install nltk rouge-score
For detailed installation guidance, see LangSmith Documentation.
Configuration Options
Customize logging evaluation results during setup:
- Dataset Configuration:
- Create datasets with input-output pairs or open-ended inputs.
- Example:
client.create_example(
    inputs={"question": "Describe Paris landmarks."},
    outputs={},
    dataset_id=dataset.id
)
- Evaluators:
- Use built-in (QA, CRITERIA) or custom evaluators for specific metrics.
- Example:
evaluator = load_evaluator(EvaluatorType.CRITERIA, criteria="coherence", llm=llm)
- Experiment Metadata:
- Include version, timestamps, or pipeline details for traceability.
- Example:
evaluate(..., metadata={"version": "1.0", "pipeline": "RetrievalQA"})
- Logging Level:
- Adjust logging verbosity (e.g., INFO, DEBUG) for detailed output.
- Example:
logging.basicConfig(level=logging.DEBUG)
Core Techniques for Logging Evaluation Results
1. Automated Logging with LangSmith
Use LangSmith to automatically log evaluation results for each experiment.
- Experiment Logging:
- Log results under a unique experiment prefix with metadata.
- Example:
results = evaluate(
    lambda inputs: {"result": qa_pipeline.invoke({"query": inputs["question"]})["result"]},
    data=dataset_name,
    evaluators=[evaluate_correctness],
    experiment_prefix="qa_test",
    metadata={"version": "1.0"}
)
- Dashboard Visualization:
- Access logged results in the LangSmith UI to view scores, comments, and trends.
- Example: Navigate to qa_logging_test in LangSmith to analyze correctness and latency.
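Logged results can also be pulled back out of LangSmith with the client, which is useful for custom reports or quick trend checks outside the dashboard. The sketch below is a minimal example under a few assumptions: you substitute the experiment name generated by evaluate (your experiment_prefix plus a generated suffix, visible in the LangSmith UI), and the experiment's runs and feedback are retrieved with Client.list_runs and Client.list_feedback.
from collections import defaultdict
from langsmith import Client

client = Client()

# Placeholder: the experiment name shown in the LangSmith UI
# (evaluate() names experiments with your experiment_prefix plus a generated suffix).
experiment_name = "qa_logging_test-<suffix>"

# Fetch the top-level runs logged for that experiment
runs = list(client.list_runs(project_name=experiment_name, is_root=True))

# Aggregate the evaluator feedback attached to those runs, per metric key
scores = defaultdict(list)
for feedback in client.list_feedback(run_ids=[run.id for run in runs]):
    if feedback.score is not None:
        scores[feedback.key].append(feedback.score)

for key, values in scores.items():
    print(f"{key}: mean={sum(values) / len(values):.2f} across {len(values)} runs")
The same pattern can be used to compare two experiment names side by side before promoting a new pipeline version.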
2. Custom Metric Logging
Log custom metrics tailored to specific use cases.
- Custom Evaluator Logging:
- Log scores and comments for custom metrics like efficiency.
- Example:
class EfficiencyEvaluator(StringEvaluator):
    def _evaluate_strings(self, prediction: str, **kwargs) -> Dict[str, Any]:
        word_count = len(prediction.split())
        score = 1.0 if word_count < 50 else 0.8
        return {
            "key": "efficiency",
            "score": score,
            "reasoning": f"Response brevity: {word_count} words."
        }

def evaluate_efficiency(run, example):
    prediction = run.outputs.get("result", "")
    evaluator = EfficiencyEvaluator()
    return evaluator.evaluate_strings(prediction=prediction)
3. Latency and Resource Usage Logging
Log system-level performance metrics like latency and memory usage.
- Latency Logging:
- Capture and log response time for each query.
- Example (as shown in evaluate_latency above):
def evaluate_latency(run, example):
    latency = run.outputs.get("latency", 0.0)
    score = 1.0 if latency < 1.0 else max(0.0, 1.0 - (latency - 1.0) / 5.0)
    return {"key": "latency", "score": score, "comment": f"Response latency: {latency:.2f} seconds."}
- Memory Usage Logging:
- Log memory consumption during evaluation (the sketch after this example shows one way to capture memory_mb in the target function).
- Example:
def evaluate_memory_usage(run, example):
    memory_mb = run.outputs.get("memory_mb", 0.0)
    score = 1.0 if memory_mb < 500 else max(0.0, 1.0 - (memory_mb - 500) / 1000)
    return {
        "key": "memory_usage",
        "score": score,
        "comment": f"Memory usage: {memory_mb:.2f} MB."
    }
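The memory_mb value read above has to be produced by the evaluation target itself, since evaluators only see run.outputs. Below is a minimal sketch, assuming the psutil package and the qa_pipeline from the setup section; it samples the process's resident set size inside the target function so both latency and memory land in the run outputs.
import os
import time
import psutil

process = psutil.Process(os.getpid())

def run_with_resource_metrics(inputs):
    """Evaluation target that returns the answer plus latency and memory readings."""
    call_start = time.time()
    output = qa_pipeline.invoke({"query": inputs["question"]})
    return {
        "result": output["result"],
        "latency": time.time() - call_start,
        # Resident set size in MB at the time the query finishes
        "memory_mb": process.memory_info().rss / 1024 / 1024
    }
Passing run_with_resource_metrics as the target to evaluate() makes both readings available to the latency and memory evaluators above.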
4. Human-in-the-Loop Logging
Incorporate human feedback into logged results for subjective metrics.
- LangSmith HITL Logging:
- Log human annotations alongside automated metrics in LangSmith.
- Example: In LangSmith UI, reviewers add scores (0-1) for “coherence” with comments, logged under the experiment.
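Reviewer feedback does not have to be entered through the UI. The langsmith client's create_feedback method attaches a score and comment to a logged run, so annotations collected elsewhere (an annotation tool, a spreadsheet) can be logged alongside the automated metrics. The snippet below is a minimal sketch; the run ID shown is a placeholder for the run being reviewed (copied from the LangSmith UI or retrieved via Client.list_runs).
from langsmith import Client

client = Client()

# Placeholder: the ID of the logged run the reviewer assessed
run_id = "<run-id-of-the-evaluated-output>"

client.create_feedback(
    run_id,
    key="coherence",  # metric name shown next to the automated scores
    score=0.8,        # reviewer's 0-1 rating
    comment="Well structured answer, but the second paragraph drifts off-topic."
)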
5. Batch Logging for Scalability
Log results for large datasets efficiently using batch processing.
- Batch Evaluation:
- Process multiple examples in one evaluation run, using max_concurrency to control how many are evaluated in parallel.
- Example:
results = evaluate(
    lambda inputs: {"result": qa_pipeline.invoke({"query": inputs["question"]})["result"]},
    data=dataset_name,
    evaluators=[evaluate_correctness, evaluate_relevance],
    max_concurrency=10  # evaluate up to 10 examples concurrently
)
Comprehensive Example
Here’s a complete system evaluating a LangChain agent pipeline with logged results for multiple metrics, integrated with Chroma, MongoDB Atlas, and LangSmith:
from langchain_chroma import Chroma
from langchain_mongodb import MongoDBAtlasVectorSearch
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.documents import Document
from langchain.evaluation import load_evaluator, EvaluatorType, StringEvaluator
from langchain.agents import initialize_agent, AgentType
from langchain.tools import Tool
from langchain_community.tools import DuckDuckGoSearchRun
from langsmith import Client
from langsmith.evaluation import evaluate
from pymongo import MongoClient
import os
import logging
import time
import psutil
from typing import Dict, Any
# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
# Set LangSmith environment variables
os.environ["LANGSMITH_API_KEY"] = ""
os.environ["LANGSMITH_PROJECT"] = "agent-logging-results"
# Initialize embeddings and language model
embedding_function = OpenAIEmbeddings(model="text-embedding-3-small")
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
# Create sample documents
documents = [
    Document(page_content="The capital of France is Paris.", metadata={"source": "geo"}),
    Document(page_content="The Eiffel Tower is in Paris.", metadata={"source": "landmark"}),
    Document(page_content="Florida is a state in the USA.", metadata={"source": "geo"})
]
# Initialize Chroma and MongoDB Atlas vector stores
chroma_store = Chroma.from_documents(
    documents,
    embedding_function,
    collection_name="langchain_example",
    persist_directory="./chroma_db"
)
client = MongoClient("mongodb+srv://<username>:<password>@<cluster>.mongodb.net/")
collection = client["langchain_db"]["example_collection"]
mongo_store = MongoDBAtlasVectorSearch.from_documents(
    documents,
    embedding_function,
    collection=collection,
    index_name="vector_index"
)
# Set up search tool and agent pipeline
search = DuckDuckGoSearchRun()
tools = [
    Tool(
        name="Search",
        func=search.run,
        description="Useful for answering questions about recent events or general knowledge."
    )
]
agent = initialize_agent(
    tools=tools,
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)
# Initialize LangSmith client
ls_client = Client()
# Create or load dataset
dataset_name = "agent_logging_dataset"
try:
    dataset = ls_client.create_dataset(dataset_name=dataset_name)
except Exception:
    # Dataset already exists; load it instead
    dataset = ls_client.read_dataset(dataset_name=dataset_name)
# Add examples to dataset
examples = [
    {
        "input": "What is the capital of France?",
        "output": "Paris",
        "expected_docs": ["The capital of France is Paris."]
    },
    {
        "input": "Where is the Eiffel Tower?",
        "output": "Paris",
        "expected_docs": ["The Eiffel Tower is in Paris."]
    },
    {
        "input": "Describe Paris landmarks.",
        "output": ""
    }
]
for example in examples:
    ls_client.create_example(
        inputs={"question": example["input"]},
        outputs={"answer": example["output"], "expected_docs": example.get("expected_docs", [])},
        dataset_id=dataset.id
    )
# Define custom evaluator
class EfficiencyEvaluator(StringEvaluator):
    def _evaluate_strings(self, prediction: str, **kwargs) -> Dict[str, Any]:
        word_count = len(prediction.split())
        score = 1.0 if word_count < 50 else 0.8
        return {
            "key": "efficiency",
            "score": score,
            "reasoning": f"Response brevity: {word_count} words."
        }
# Define evaluation functions
def evaluate_correctness(run, example) -> Dict[str, Any]:
    prediction = run.outputs.get("result", "")
    reference = example.outputs.get("answer", "")
    question = example.inputs.get("question", "")
    if not reference:
        return {"key": "correctness", "score": None, "comment": "No reference provided."}
    evaluator = load_evaluator(EvaluatorType.QA, llm=llm)
    result = evaluator.evaluate_strings(
        prediction=prediction,
        reference=reference,
        input=question
    )
    return {"key": "correctness", "score": result["score"], "comment": result.get("reasoning", "")}

def evaluate_relevance(run, example) -> Dict[str, Any]:
    prediction = run.outputs.get("result", "")
    question = example.inputs.get("question", "")
    evaluator = load_evaluator(EvaluatorType.CRITERIA, criteria="relevance", llm=llm)
    result = evaluator.evaluate_strings(
        prediction=prediction,
        input=question
    )
    return {"key": "relevance", "score": result["score"], "comment": result.get("reasoning", "")}

def evaluate_latency(run, example) -> Dict[str, Any]:
    latency = run.outputs.get("latency", 0.0)
    score = 1.0 if latency < 1.0 else max(0.0, 1.0 - (latency - 1.0) / 5.0)
    return {
        "key": "latency",
        "score": score,
        "comment": f"Response latency: {latency:.2f} seconds."
    }
def evaluate_efficiency(run, example) -> Dict[str, Any]:
    prediction = run.outputs.get("result", "")
    evaluator = EfficiencyEvaluator()
    result = evaluator.evaluate_strings(prediction=prediction)
    # Map the evaluator's "reasoning" field to LangSmith's "comment" field
    return {"key": "efficiency", "score": result["score"], "comment": result.get("reasoning", "")}
# Run evaluation and log results
process = psutil.Process(os.getpid())

def run_agent(inputs: Dict[str, Any]) -> Dict[str, Any]:
    call_start = time.time()
    answer = agent.run(inputs["question"])
    return {
        "result": answer,
        "latency": time.time() - call_start,  # per-query latency rather than time since script start
        "memory_mb": process.memory_info().rss / 1024 / 1024
    }

start_time = time.time()
results = evaluate(
    run_agent,
    data=dataset_name,
    evaluators=[evaluate_correctness, evaluate_relevance, evaluate_latency, evaluate_efficiency],
    experiment_prefix="agent_logging_test",
    metadata={"version": "1.0", "evaluated_at": "2025-05-15T15:35:00Z"}
)
# Log results
logger.info(f"Evaluation completed in {time.time() - start_time:.2f} seconds")
print(f"Test Results: {results}")
print("View detailed results in LangSmith dashboard under 'agent_logging_test' experiment.")
Output:
Evaluation completed in 9.67 seconds
Test Results:
View detailed results in LangSmith dashboard under 'agent_logging_test' experiment.
The evaluation logs results for correctness, relevance, latency, and efficiency in LangSmith, accessible via the dashboard for analysis.
Best Practices
- Log Comprehensive Metrics: Include accuracy, latency, and custom metrics for holistic tracking.
- Use Descriptive Metadata: Add version, timestamp, and pipeline details for traceability.
- Leverage LangSmith UI: Analyze logged results through visualizations and trends.
- Combine with HITL: Log human feedback for subjective metrics (Human-in-the-Loop Evaluation).
- Automate Logging: Use LangSmith’s evaluate function for consistent, automated logging.
- Monitor Trends: Regularly review logged results to detect performance changes.
Error Handling
- API Errors: Handle LangSmith API failures with retries or fallback logging (see the retry sketch after this list).
- Dataset Issues: Validate dataset format to avoid parsing errors.
- Pipeline Failures: Log errors during evaluation to diagnose issues.
- Resource Limits: Batch evaluations to manage API costs and rate limits.
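As a concrete example of the first point, a small retry helper around LangSmith client calls keeps transient API failures from silently dropping logged examples or results. This is a minimal sketch with exponential backoff, assuming the client and dataset objects from the setup section; create_example is just the illustrative call.
import logging
import time

logger = logging.getLogger(__name__)

def with_retries(fn, *args, attempts=3, base_delay=1.0, **kwargs):
    """Call fn, retrying on failure with exponential backoff before giving up."""
    for attempt in range(1, attempts + 1):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            logger.warning("Attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt == attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))

# Example: log a dataset example to LangSmith with retries
with_retries(
    client.create_example,
    inputs={"question": "What is the capital of France?"},
    outputs={"answer": "Paris"},
    dataset_id=dataset.id
)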
See Troubleshooting.
Limitations
- Cost: Logging large datasets with LLM-based metrics can be expensive.
- Storage: Extensive logging requires sufficient storage in LangSmith.
- Complexity: Managing multiple experiments and metrics can be complex.
- LLM Bias: Judgment-based metrics may introduce variability.
Recent Developments
- 2025 Updates: LangSmith enhanced logging with real-time dashboards and bulk export features.
- Community Feedback: X posts highlight logging workflows for tracking RAG system performance in production.
- LangSmith UI: Improved experiment comparison and metric filtering.
Conclusion
Logging evaluation results in LangChain with LangSmith enables comprehensive performance tracking, facilitating debugging, optimization, and collaboration. By logging automated and custom metrics, developers can maintain robust insights into their AI systems, ensuring reliability and scalability. Start logging evaluation results to enhance your LangChain projects, leveraging LangSmith for data-driven improvements.
For official documentation, visit LangSmith Documentation.