Building a Sitemap Document Loader with LangChain: A Comprehensive Guide

A sitemap document loader in LangChain enables efficient ingestion of web content by parsing sitemap XML files, extracting URLs, and loading their associated documents. This is particularly useful for web scraping, content analysis, or building knowledge bases from websites. By integrating LangChain with OpenAI, you can create a system that loads sitemap content and processes it for conversational or analytical tasks.

Introduction to LangChain and Sitemap Document Loaders

A sitemap is an XML file that lists a website’s URLs so that search engines can index its content. A sitemap document loader fetches those URLs and loads each page as a document, which enables applications such as web content summarization or question-answering systems. LangChain supplies the document abstraction, chains, and integrations; OpenAI’s API, which serves models like gpt-3.5-turbo, generates the answers; and the requests and beautifulsoup4 libraries handle fetching and parsing pages.
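
To make the sitemap format concrete, here is a minimal sketch that pulls the <loc> entries out of a small inline sitemap using only Python’s standard library. The sample XML and the helper name extract_urls are illustrative assumptions; the loader built later in this guide uses requests and Beautiful Soup instead.

```python
import xml.etree.ElementTree as ET

# Sitemap elements live in this XML namespace (sitemaps.org protocol).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sample sitemap, inlined so the sketch runs without network access.
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""


def extract_urls(sitemap_xml: str) -> list[str]:
    """Return the text of every <loc> element that sits inside a <url> entry."""
    root = ET.fromstring(sitemap_xml.encode("utf-8"))
    locs = root.findall("sm:url/sm:loc", SITEMAP_NS)
    return [loc.text.strip() for loc in locs if loc.text]


print(extract_urls(SAMPLE_SITEMAP))
```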

This tutorial assumes basic knowledge of Python, web scraping, and APIs. References include LangChain’s getting started guide, OpenAI’s API documentation, Beautiful Soup documentation, and Python’s documentation.

Prerequisites for Building the Sitemap Document Loader

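Ensure you have Python and the following packages installed (lxml backs Beautiful Soup’s "xml" parser, and faiss-cpu backs the FAISS vector store):

pip install langchain langchain-community langchain-openai openai requests beautifulsoup4 lxml faiss-cpu flask python-dotenv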

Step 1: Setting Up the Development Environment

Configure your environment by importing libraries and setting API keys. Use a .env file for secure key management.

import os
import requests
from flask import Flask, request, jsonify
from dotenv import load_dotenv
from bs4 import BeautifulSoup
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.chains import RetrievalQA, ConversationChain
from langchain.prompts import PromptTemplate
from langchain.memory import ConversationBufferMemory
from langchain_community.vectorstores import FAISS
from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load environment variables
load_dotenv()

# Set OpenAI API key
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
if not OPENAI_API_KEY:
    raise ValueError("OPENAI_API_KEY not found.")

# Initialize Flask app
app = Flask(__name__)

Create a .env file in your project directory:

OPENAI_API_KEY=your-openai-api-key

Replace your-openai-api-key with your actual key. Environment variables enhance security, as explained in LangChain’s security and API keys guide.
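
If the application later needs more than one secret, the key check from the setup code can be factored into a small helper. The following is only a sketch: require_env and DEMO_API_KEY are hypothetical names introduced for illustration, and the demo sets a dummy value so it runs without a real key.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise ValueError(f"{name} not found. Add it to your .env file or shell environment.")
    return value


# Demo with a hypothetical variable name and a dummy value (not a real key).
os.environ["DEMO_API_KEY"] = "sk-demo"
print(require_env("DEMO_API_KEY"))
```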

Step 2: Implementing the Sitemap Document Loader

Create a custom sitemap document loader to fetch URLs from a sitemap and load their content as LangChain documents.

def load_sitemap_documents(sitemap_url, max_pages=10):
    """Load documents from a sitemap URL."""
    try:
        # Fetch sitemap XML
        response = requests.get(sitemap_url, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, "xml")

        # Extract URLs
        urls = [loc.text for loc in soup.find_all("loc")][:max_pages]
        documents = []

        # Load content from each URL
        for url in urls:
            try:
                page_response = requests.get(url, timeout=10)
                page_response.raise_for_status()
                page_soup = BeautifulSoup(page_response.content, "html.parser")

                # Extract text content (customize as needed)
                content = " ".join([p.text.strip() for p in page_soup.find_all("p")])
                if content:
                    documents.append(Document(
                        page_content=content,
                        metadata={"source": url}
                    ))
            except requests.RequestException as e:
                print(f"Error loading {url}: {str(e)}")
                continue

        return documents
    except requests.RequestException as e:
        print(f"Error fetching sitemap: {str(e)}")
        return []

# Index documents in FAISS
def index_documents(documents):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len
    )
    chunks = text_splitter.split_documents(documents)
    embeddings = OpenAIEmbeddings(
        model="text-embedding-ada-002",
        chunk_size=1000,
        max_retries=3
    )
    vectorstore = FAISS.from_documents(
        documents=chunks,
        embedding=embeddings,
        distance_strategy="COSINE",
        normalize_L2=True
    )
    return vectorstore

Key Parameters for load_sitemap_documents

  • sitemap_url: URL of the sitemap XML file (e.g., "https://example.com/sitemap.xml").
  • max_pages: Limits the number of pages to load (e.g., 10) to manage processing time.
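
One caveat when applying max_pages: larger sites often publish a sitemap index, whose <loc> entries point to further sitemaps rather than to pages. The sketch below, using only the standard library, tells the two cases apart before the page limit is applied. The function name classify_sitemap and the sample documents are assumptions made for illustration; the loader above does not perform this check.

```python
import xml.etree.ElementTree as ET

NS_URI = "http://www.sitemaps.org/schemas/sitemap/0.9"
NS = {"sm": NS_URI}

# Hypothetical sample documents, inlined so the sketch runs offline.
SAMPLE_INDEX = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-posts.xml</loc></sitemap>
</sitemapindex>"""

SAMPLE_URLSET = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc></url>
  <url><loc>https://example.com/b</loc></url>
  <url><loc>https://example.com/c</loc></url>
</urlset>"""


def classify_sitemap(xml_bytes: bytes, max_pages: int = 10) -> tuple[str, list[str]]:
    """Return ("index", child sitemap URLs) or ("urlset", at most max_pages page URLs)."""
    root = ET.fromstring(xml_bytes)
    # ElementTree reports namespaced tags as "{namespace}name".
    if root.tag == f"{{{NS_URI}}}sitemapindex":
        locs = root.findall("sm:sitemap/sm:loc", NS)
        return "index", [loc.text.strip() for loc in locs if loc.text]
    locs = root.findall("sm:url/sm:loc", NS)
    return "urlset", [loc.text.strip() for loc in locs if loc.text][:max_pages]


print(classify_sitemap(SAMPLE_INDEX))
print(classify_sitemap(SAMPLE_URLSET, max_pages=2))
```

When the result is "index", each child sitemap would need to be fetched and parsed in turn before any pages are loaded.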

Key Parameters for RecursiveCharacterTextSplitter

  • chunk_size: Maximum characters per chunk (e.g., 1000). Balances context and retrieval.
  • chunk_overlap: Overlapping characters (e.g., 200). Preserves context.
  • length_function: Measures text length (default: len).
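
To see how chunk_size and chunk_overlap interact, consider this simplified fixed-window chunker. It is an illustration only, not the algorithm RecursiveCharacterTextSplitter uses (that splitter also prefers to break on separators such as paragraph and line breaks); the function name sliding_window_chunks is hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# Small numbers make the overlap visible: each chunk repeats the last 2 characters
# of the previous one.
print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
```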

Key Parameters for OpenAIEmbeddings

  • model: Embedding model (e.g., text-embedding-ada-002). Determines vector quality.
  • chunk_size: Texts processed per API call (e.g., 1000). Balances speed and limits.
  • max_retries: Retry attempts for API failures (e.g., 3). Enhances reliability.

Key Parameters for FAISS.from_documents

  • documents: List of Document objects with web content.
  • embedding: Embedding model instance.
  • distance_strategy: Similarity metric (e.g., "COSINE"). Suits semantic search.
  • normalize_L2: If True, normalizes vectors for consistent scores.
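
The link between L2 normalization and cosine similarity can be checked with a few lines of plain Python. This sketch does not call FAISS; it only illustrates the underlying math, namely that for unit-length vectors the squared Euclidean distance equals 2 − 2·cos, so ranking by distance and ranking by cosine similarity agree. The helper names are illustrative.

```python
import math


def l2_normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        return list(vector)
    return [x / norm for x in vector]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def squared_l2_distance(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


a, b = [3.0, 4.0], [4.0, 3.0]
cos = cosine_similarity(a, b)
dist = squared_l2_distance(l2_normalize(a), l2_normalize(b))
print(cos, dist, 2 - 2 * cos)
```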

This loader fetches URLs from the sitemap, extracts text content using Beautiful Soup, and indexes it in FAISS for retrieval. For advanced loaders, see LangChain’s document loaders.

Step 3: Initializing the Language Model

Initialize the OpenAI LLM using ChatOpenAI for processing and responding to queries.

llm = ChatOpenAI(
    model_name="gpt-3.5-turbo",
    temperature=0.7,
    max_tokens=512,
    top_p=0.9,
    frequency_penalty=0.2,
    presence_penalty=0.1,
    n=1
)

Key Parameters for ChatOpenAI

  • model_name: OpenAI model (e.g., gpt-3.5-turbo, gpt-4). gpt-3.5-turbo is efficient; gpt-4 excels in reasoning. See OpenAI’s model documentation.
  • temperature (0.0–2.0): Controls randomness. At 0.7, balances creativity and coherence for conversational responses.
  • max_tokens: Maximum response length (e.g., 512). Adjust for detail vs. cost. See LangChain’s token limit handling.
  • top_p (0.0–1.0): Nucleus sampling. At 0.9, focuses on likely tokens.
  • frequency_penalty (–2.0–2.0): Discourages repetition. At 0.2, promotes variety.
  • presence_penalty (–2.0–2.0): Encourages new topics. At 0.1, mild novelty boost.
  • n: Number of responses (e.g., 1). Single response suits API interactions.
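
The effect of temperature and top_p can be sketched with the textbook formulation of temperature-scaled softmax and nucleus sampling. This is an illustration under assumptions: the logits are invented, the function names are hypothetical, and OpenAI’s actual sampler implementation is not public.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def nucleus_indices(probs: list[float], top_p: float) -> list[int]:
    """Return the smallest set of token indices whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break
    return kept


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
print(softmax_with_temperature(logits, 0.7))
print(softmax_with_temperature(logits, 1.5))
print(nucleus_indices([0.6, 0.3, 0.1], top_p=0.8))
```

Sampling then draws only from the kept indices, which is why a lower top_p trims unlikely tokens.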

Step 4: Implementing Conversational Memory

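Use ConversationBufferWindowMemory to maintain user-specific conversation context. ConversationBufferMemory has no k parameter and keeps the full history, whereas the window variant retains only the most recent k exchanges.

from langchain.memory import ConversationBufferWindowMemory

# One memory object per user, keyed by user_id
user_memories = {}

def get_user_memory(user_id):
    if user_id not in user_memories:
        user_memories[user_id] = ConversationBufferWindowMemory(
            memory_key="history",
            return_messages=False,  # plain-text history for the string prompts in this guide
            k=5
        )
    return user_memories[user_id]

Key Parameters for ConversationBufferWindowMemory

  • memory_key: History variable name (default: "history").
  • return_messages: If True, returns message objects, which suits chat prompt templates; if False, returns one formatted string, which suits the string prompts used here.
  • k: Limits stored interactions (e.g., 5). Balances context and performance.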
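
The effect of a k limit can be sketched without LangChain by using a bounded deque. The WindowMemory class below is a hypothetical stand-in written for illustration, not LangChain’s implementation.

```python
from collections import deque


class WindowMemory:
    """Keep only the most recent k (user, assistant) exchanges."""

    def __init__(self, k: int = 5):
        self.turns = deque(maxlen=k)  # older exchanges are dropped automatically

    def save(self, user_input: str, response: str) -> None:
        self.turns.append((user_input, response))

    def history(self) -> str:
        """Render the retained exchanges as plain text for a prompt."""
        return "\n".join(f"User: {u}\nAssistant: {a}" for u, a in self.turns)


memory = WindowMemory(k=2)
for i in range(4):
    memory.save(f"question {i}", f"answer {i}")
print(memory.history())
```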

For advanced memory, see LangChain’s memory integration guide.

Step 5: Building the RetrievalQA Chain

Create a RetrievalQA chain to retrieve relevant web content and generate responses.

# RetrievalQA fills the prompt with {context} (the retrieved text) and {question} (the user's query)
retrieval_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template="You are an assistant providing information based on web content from a sitemap. Use the provided context to answer the query accurately and concisely:\n\nContext: {context}\n\nQuery: {question}\n\nAnswer: ",
    validate_template=True
)

def get_qa_chain(vectorstore):
    return RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=vectorstore.as_retriever(
            search_type="similarity",
            search_kwargs={"k": 3, "fetch_k": 5}
        ),
        return_source_documents=True,
        verbose=True,
        # The custom prompt is forwarded to the "stuff" chain through chain_type_kwargs
        chain_type_kwargs={"prompt": retrieval_prompt},
        input_key="query",
        output_key="result"
    )

Key Parameters for RetrievalQA.from_chain_type

  • llm: The initialized LLM.
  • chain_type: Document processing method (e.g., "stuff"). Combines documents into one prompt.
  • retriever: Retrieval mechanism.
  • return_source_documents: If True, includes retrieved documents.
  • verbose: If True, logs execution.
  • chain_type_kwargs: Extra arguments for the document-combining chain, such as the custom prompt template.
  • input_key: Key of the chain's input dictionary (e.g., "query"). This is separate from the {question} prompt variable.
  • output_key: Output variable (e.g., "result").

Key Parameters for as_retriever

  • search_type: Retrieval method (e.g., "similarity").
  • search_kwargs: Settings, e.g., k (top results, 3), fetch_k (initial candidates, 5).
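
Conceptually, retrieval with k followed by the "stuff" strategy amounts to picking the highest-scoring chunks and concatenating them into a single prompt. The sketch below illustrates that idea in plain Python; the scores are invented, and top_k and stuff_prompt are hypothetical helpers rather than LangChain functions.

```python
def top_k(scored_chunks: list[tuple[str, float]], k: int = 3) -> list[str]:
    """Return the text of the k chunks with the highest similarity scores."""
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in ranked[:k]]


def stuff_prompt(chunks: list[str], question: str) -> str:
    """Concatenate all retrieved chunks into one context block, then append the query."""
    context = "\n\n".join(chunks)
    return f"Context: {context}\n\nQuery: {question}\n\nAnswer: "


# Made-up chunks with made-up similarity scores.
scored = [("Pricing page", 0.2), ("About us", 0.9), ("Blog post", 0.5), ("Services", 0.7)]
print(stuff_prompt(top_k(scored, k=3), "What does the company do?"))
```

Because every retrieved chunk lands in one prompt, k multiplied by the chunk size has to fit inside the model’s context window.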

See LangChain’s RetrievalQA chain guide.

Step 6: Building the Conversation Chain

Create a ConversationChain for general conversational queries and context maintenance.

conversation_prompt = PromptTemplate(
    input_variables=["history", "input"],
    template="You are a conversational assistant. Respond in a friendly, engaging tone, using the conversation history for context:\n\nHistory: {history}\n\nUser: {input}\n\nAssistant: ",
    validate_template=True
)

def get_conversation_chain(user_id):
    memory = get_user_memory(user_id)
    return ConversationChain(
        llm=llm,
        memory=memory,
        prompt=conversation_prompt,
        verbose=True,
        output_key="response"
    )

See LangChain’s introduction to chains.

Step 7: Implementing the Flask API for the Sitemap Loader

Expose the sitemap loader and query processing via a Flask API.

@app.route("/load_sitemap", methods=["POST"])
def load_sitemap():
    try:
        data = request.get_json()
        sitemap_url = data.get("sitemap_url")
        max_pages = data.get("max_pages", 10)

        if not sitemap_url:
            return jsonify({"error": "sitemap_url is required"}), 400

        documents = load_sitemap_documents(sitemap_url, max_pages)
        if not documents:
            return jsonify({"error": "No documents loaded from sitemap"}), 400

        vectorstore = index_documents(documents)
        global qa_chain
        qa_chain = get_qa_chain(vectorstore)

        return jsonify({"message": f"Loaded {len(documents)} documents from sitemap"})
    except Exception as e:
        return jsonify({"error": str(e)}), 500

@app.route("/query", methods=["POST"])
def query():
    try:
        data = request.get_json()
        user_id = data.get("user_id")
        query = data.get("query")

        if not user_id or not query:
            return jsonify({"error": "user_id and query are required"}), 400

        if 'qa_chain' not in globals():
            return jsonify({"error": "Sitemap not loaded. Please load a sitemap first."}), 400

        # Check if query is content-specific
        content_keywords = ["content", "information", "details", "page", "website"]
        is_content_query = any(keyword in query.lower() for keyword in content_keywords)

        if is_content_query:
            response = qa_chain({"query": query})
            answer = response["result"]
            sources = [doc.metadata["source"] for doc in response["source_documents"]]
            if sources:
                answer += f"\n\nSources: {', '.join(sources)}"
            memory = get_user_memory(user_id)
            memory.save_context({"input": query}, {"response": answer})
        else:
            conversation = get_conversation_chain(user_id)
            answer = conversation.predict(input=query)

        return jsonify({
            "response": answer,
            "user_id": user_id
        })
    except Exception as e:
        return jsonify({"error": str(e)}), 500

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)

Key Endpoints

  • /load_sitemap: Loads and indexes documents from a sitemap URL.
  • /query: Processes user queries, using RetrievalQA for content-specific queries and ConversationChain for general ones.
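
The routing decision inside /query can be isolated into a small function, which makes it easier to test. This sketch mirrors the keyword check from the endpoint; the function name is_content_query is an assumption, and the check is deliberately naive substring matching (so "homepage" also matches "page").

```python
CONTENT_KEYWORDS = ("content", "information", "details", "page", "website")


def is_content_query(query: str) -> bool:
    """Return True if the query mentions any content-related keyword."""
    lowered = query.lower()
    return any(keyword in lowered for keyword in CONTENT_KEYWORDS)


for sample in (
    "What content is available on the website?",
    "Summarize the main page content.",
    "Tell me about web development.",
):
    route = "RetrievalQA" if is_content_query(sample) else "ConversationChain"
    print(f"{sample!r} -> {route}")
```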

Step 8: Testing the Sitemap Document Loader

Test the API by loading a sitemap and querying its content.

import requests

def test_sitemap_loader(sitemap_url, max_pages=5):
    response = requests.post(
        "http://localhost:5000/load_sitemap",
        json={"sitemap_url": sitemap_url, "max_pages": max_pages},
        headers={"Content-Type": "application/json"}
    )
    print("Load Response:", response.json())

def test_query(user_id, query):
    response = requests.post(
        "http://localhost:5000/query",
        json={"user_id": user_id, "query": query},
        headers={"Content-Type": "application/json"}
    )
    print("Query Response:", response.json())

# Example sitemap (replace with a real one)
sitemap_url = "https://example.com/sitemap.xml"
test_sitemap_loader(sitemap_url, max_pages=5)
test_query("user123", "What content is available on the website?")
test_query("user123", "Summarize the main page content.")
test_query("user123", "Tell me about web development.")

Example Output (assuming a sample sitemap):

Load Response: {'message': 'Loaded 5 documents from sitemap'}
Query Response: {'response': 'The website contains pages about our services, blog posts on technology trends, and contact information.\n\nSources: https://example.com/about, https://example.com/services', 'user_id': 'user123'}
Query Response: {'response': 'The main page highlights the company’s mission to provide innovative solutions, with links to services and recent blog posts.\n\nSources: https://example.com/', 'user_id': 'user123'}
Query Response: {'response': 'Web development involves creating and maintaining websites, using technologies like HTML, CSS, and JavaScript. Would you like specific details about a framework or tool?', 'user_id': 'user123'}

The loader processes sitemap URLs, and the API handles content-specific and general queries. For patterns, see LangChain’s conversational flows.

Step 9: Customizing the Sitemap Document Loader

Enhance with custom prompts, additional tools, or advanced processing.

9.1 Custom Prompt Engineering

Modify the retrieval prompt for a specific tone. RetrievalQA supplies the user's query to the prompt as {question}, so keep that variable name.

retrieval_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template="You are a web content expert. Provide a concise, professional summary of the requested information based on the sitemap content:\n\nContext: {context}\n\nQuery: {question}\n\nAnswer: ",
    validate_template=True
)

See LangChain’s prompt templates guide.

9.2 Adding a Web Search Tool

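Integrate SerpAPI for supplementary data. SerpAPIWrapper needs the google-search-results package and a SERPAPI_API_KEY environment variable.

from langchain.agents import initialize_agent, Tool, AgentType
from langchain_community.utilities import SerpAPIWrapper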

search = SerpAPIWrapper()
tools = [
    Tool(
        name="WebSearch",
        func=search.run,
        description="Search the web for additional information or trends."
    )
]

agent = initialize_agent(
    tools=tools,
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True,
    max_iterations=3,
    early_stopping_method="force"
)

@app.route("/agent_query", methods=["POST"])
def agent_query():
    try:
        data = request.get_json()
        user_id = data.get("user_id")
        query = data.get("query")

        if not user_id or not query:
            return jsonify({"error": "user_id and query are required"}), 400

        memory = get_user_memory(user_id)
        history = memory.load_memory_variables({})["history"]
        response = agent.run(f"{query}\nHistory: {history}")

        memory.save_context({"input": query}, {"response": response})

        return jsonify({
            "response": response,
            "user_id": user_id
        })
    except Exception as e:
        return jsonify({"error": str(e)}), 500

Test with:

curl -X POST -H "Content-Type: application/json" -d '{"user_id": "user123", "query": "Latest web development trends in 2025"}' http://localhost:5000/agent_query

See LangChain’s agents guide.

9.3 Enhancing Content Extraction

Improve content extraction by targeting specific HTML elements.

def load_sitemap_documents(sitemap_url, max_pages=10, css_selector="article"):
    try:
        response = requests.get(sitemap_url, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, "xml")
        urls = [loc.text for loc in soup.find_all("loc")][:max_pages]
        documents = []

        for url in urls:
            try:
                page_response = requests.get(url, timeout=10)
                page_response.raise_for_status()
                page_soup = BeautifulSoup(page_response.content, "html.parser")

                # Extract content from specific CSS selector
                elements = page_soup.select(css_selector)
                content = " ".join([elem.text.strip() for elem in elements])
                if content:
                    documents.append(Document(
                        page_content=content,
                        metadata={"source": url}
                    ))
            except requests.RequestException as e:
                print(f"Error loading {url}: {str(e)}")
                continue

        return documents
    except requests.RequestException as e:
        print(f"Error fetching sitemap: {str(e)}")
        return []

Update the /load_sitemap endpoint to use the enhanced loader with a customizable css_selector.
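
One way to wire css_selector into the endpoint is to validate the JSON payload in a separate helper and pass the result to the loader. The following is a sketch under assumptions: parse_load_request and DEFAULT_SELECTOR are hypothetical names, and the validation rules are a suggestion rather than part of the original endpoint.

```python
DEFAULT_SELECTOR = "article"


def parse_load_request(data) -> dict:
    """Validate a /load_sitemap JSON payload and fill in defaults."""
    if not isinstance(data, dict) or not data.get("sitemap_url"):
        raise ValueError("sitemap_url is required")
    max_pages = data.get("max_pages", 10)
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(max_pages, bool) or not isinstance(max_pages, int) or max_pages < 1:
        raise ValueError("max_pages must be a positive integer")
    css_selector = data.get("css_selector") or DEFAULT_SELECTOR
    return {
        "sitemap_url": data["sitemap_url"],
        "max_pages": max_pages,
        "css_selector": css_selector,
    }


print(parse_load_request({"sitemap_url": "https://example.com/sitemap.xml", "css_selector": "main p"}))
```

Inside the Flask view, the returned dictionary could then be unpacked into the loader, for example load_sitemap_documents(**parse_load_request(request.get_json(silent=True))), with a ValueError mapped to a 400 response.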

Step 10: Deploying the Sitemap Document Loader

Deploy the Flask API to a cloud platform like Heroku for production use.

Heroku Deployment Steps:

  1. Install gunicorn:
pip install gunicorn
  2. Create requirements.txt (after installing gunicorn, so that it is included):
pip freeze > requirements.txt
  3. Create a Procfile (this assumes the Flask code lives in app.py):
web: gunicorn app:app
  4. Deploy:
heroku create
heroku config:set OPENAI_API_KEY=your-openai-api-key
git push heroku main

Test the deployed API:

curl -X POST -H "Content-Type: application/json" -d '{"sitemap_url": "https://example.com/sitemap.xml", "max_pages": 5}' https://your-app.herokuapp.com/load_sitemap
curl -X POST -H "Content-Type: application/json" -d '{"user_id": "user123", "query": "What content is available?"}' https://your-app.herokuapp.com/query

For deployment details, see Heroku’s Python guide or Flask’s deployment guide.

Step 11: Evaluating and Testing the Sitemap Loader

Evaluate responses using LangChain’s evaluation metrics.

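from langchain.evaluation import load_evaluator

# "labeled_criteria" grades a prediction against a reference answer for one criterion.
# llm is the ChatOpenAI instance from Step 3, used here as the grader.
evaluator = load_evaluator(
    "labeled_criteria",
    criteria="correctness",
    llm=llm
)
result = evaluator.evaluate_strings(
    prediction="The website offers services, blog posts, and contact information.",
    input="What content is available on the website?",
    reference="The website includes pages on services, a blog with technology posts, and a contact section."
)
print(result)

load_evaluator Parameters:

  • evaluator: Evaluator type (e.g., "labeled_criteria"). The "qa" evaluator has no criteria setting.
  • criteria: Criterion to grade against (e.g., "correctness" or "relevance"). Load one evaluator per criterion.
  • llm: Model that acts as the grader.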

Test with queries like:

  • “What content is on the main page?”
  • “Summarize the blog section.”
  • “Tell me about web development trends.”

Debug with LangSmith per LangChain’s LangSmith intro.

Advanced Features and Next Steps

For ideas on further enhancements, see LangChain’s startup examples or GitHub repos.

Conclusion

A sitemap document loader built with LangChain enables efficient web content ingestion for conversational AI. This guide, current as of May 15, 2025, covered setup, loader implementation, query processing, deployment, evaluation, and the key parameters at each step. Leverage LangChain’s document loaders, chains, and integrations to create robust content processing systems.

Explore agents, tools, or evaluation metrics. Debug with LangSmith. Happy coding!