Generating SQL Queries from Natural Language with LangChain and OpenAI: A Comprehensive Guide

Converting natural language queries into SQL statements is a powerful capability for making database interactions accessible to non-technical users. By leveraging LangChain and OpenAI, you can build a system that translates plain English into executable SQL queries.

Introduction to Natural Language to SQL and LangChain

Natural language to SQL systems enable users to query databases using everyday language, such as “Show me all employees hired last year.” These systems use LLMs to interpret intent and generate accurate SQL queries. LangChain simplifies this with tools for chains, prompt engineering, and SQL database integration. OpenAI’s API, powering models like gpt-3.5-turbo, drives the language understanding, while LangChain manages schema context and query generation.

This tutorial assumes basic Python and SQL knowledge, with references to LangChain’s getting started guide, OpenAI’s API documentation, and SQLite documentation.

Prerequisites for Building the SQL Generation System

Ensure you have:

  • A recent Python installation (3.9+) with pip
  • An OpenAI API key
  • The required packages (langchain-community provides the SQLDatabase utility used below):

pip install langchain langchain-openai langchain-community openai

Step 1: Setting Up the Development Environment

Configure your environment by importing libraries and setting the OpenAI API key. Create a sample SQLite database for testing.

import os
import sqlite3
from langchain_openai import ChatOpenAI
from langchain.chains import create_sql_query_chain
from langchain_community.utilities import SQLDatabase
from langchain.prompts import PromptTemplate

# Set your OpenAI API key
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"

# Create a sample SQLite database
conn = sqlite3.connect("company.db")
cursor = conn.cursor()

# Create sample tables
cursor.executescript("""
CREATE TABLE IF NOT EXISTS employees (
    id INTEGER PRIMARY KEY,
    name TEXT,
    department_id INTEGER,
    hire_date TEXT,
    salary REAL
);
CREATE TABLE IF NOT EXISTS departments (
    id INTEGER PRIMARY KEY,
    name TEXT
);
-- OR IGNORE keeps reruns idempotent (the tables above use IF NOT EXISTS,
-- so plain INSERTs would raise a UNIQUE constraint error on a second run)
INSERT OR IGNORE INTO departments (id, name) VALUES
(1, 'Engineering'), (2, 'HR'), (3, 'Sales');
INSERT OR IGNORE INTO employees (id, name, department_id, hire_date, salary) VALUES
(1, 'Alice Smith', 1, '2023-05-10', 75000),
(2, 'Bob Jones', 2, '2022-11-15', 65000),
(3, 'Carol White', 1, '2024-01-20', 80000),
(4, 'David Brown', 3, '2023-08-30', 70000);
""")
conn.commit()

# Connect LangChain to the database
db = SQLDatabase.from_uri("sqlite:///company.db")

Replace "your-openai-api-key" with your actual key; hardcoding it is for demonstration only, so in practice load it from an environment variable or a secrets manager, as explained in LangChain’s security and API keys guide. The sample database includes employees and departments tables, detailed in LangChain’s SQL database chains.
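Before wiring in the LLM, it can help to sanity-check the schema and data with plain sqlite3. A standalone sketch using an in-memory copy of the same tables (a subset of the sample data, no API key required):

```python
import sqlite3

# Standalone sanity check: an in-memory copy of the same schema,
# so the data can be verified before any LLM is involved.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE employees (
    id INTEGER PRIMARY KEY,
    name TEXT,
    department_id INTEGER,
    hire_date TEXT,
    salary REAL
);
INSERT INTO departments VALUES (1, 'Engineering'), (2, 'HR');
INSERT INTO employees VALUES
    (1, 'Alice Smith', 1, '2023-05-10', 75000),
    (2, 'Bob Jones', 2, '2022-11-15', 65000);
""")
rows = conn.execute(
    "SELECT name FROM employees WHERE hire_date LIKE '2023%'"
).fetchall()
print(rows)  # [('Alice Smith',)]
```

If this query returns the expected rows, any later problems are in the prompt or the model, not the data.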

Step 2: Initializing the Language Model

Initialize the OpenAI LLM using ChatOpenAI for natural language processing.

llm = ChatOpenAI(
    model_name="gpt-3.5-turbo",
    temperature=0.2,
    max_tokens=512,
    top_p=0.9,
    frequency_penalty=0.1,
    presence_penalty=0.1,
    n=1
)

Key Parameters for ChatOpenAI

  • model_name: OpenAI model (e.g., gpt-3.5-turbo, gpt-4). gpt-3.5-turbo is efficient; gpt-4 excels in complex reasoning. See OpenAI’s model documentation.
  • temperature (0.0–2.0): Controls randomness. A low value such as 0.2 keeps output focused and consistent, which suits SQL generation.
  • max_tokens: Maximum response length (e.g., 512). Sufficient for SQL queries; higher values increase costs. See LangChain’s token limit handling.
  • top_p (0.0–1.0): Nucleus sampling. At 0.9, focuses on high-probability tokens for coherence.
  • frequency_penalty (–2.0–2.0): Discourages repetition. At 0.1, mildly promotes variety.
  • presence_penalty (–2.0–2.0): Encourages new topics. At 0.1, slight novelty boost.
  • n: Number of responses (e.g., 1). Single response suits SQL generation.

For alternatives, see LangChain’s integrations.

Step 3: Creating the SQL Query Chain

Use LangChain’s create_sql_query_chain to generate SQL queries from natural language, incorporating database schema context.

sql_chain = create_sql_query_chain(
    llm=llm,
    db=db,
    prompt=None,
    k=5
)

Key Parameters for create_sql_query_chain

  • llm: The initialized LLM.
  • db: The connected database (e.g., SQLDatabase instance).
  • prompt: Optional custom prompt. If None, uses LangChain’s default SQL prompt.
  • k: Default cap on the number of results (e.g., 5). It fills the prompt’s top_k slot, so the generated query includes a LIMIT unless the question asks for more; this balances result size and token usage.

The chain automatically incorporates the database schema, enabling accurate query generation. For details, see LangChain’s SQL database chains.
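For intuition, the chain’s prompt roughly combines the schema, the row cap, and the question into a single instruction. An illustrative sketch with plain string formatting (not LangChain’s actual default template, which is more detailed):

```python
# Illustrative only: roughly how a SQL-generation prompt combines schema,
# row cap, and question. LangChain's real default prompt differs.
PROMPT = (
    "Given the schema below, write a SQLite query for the question. "
    "Limit results to at most {top_k} rows unless the question says "
    "otherwise.\n\nSchema:\n{table_info}\n\nQuestion: {question}\nSQL Query:"
)

table_info = (
    "CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, "
    "department_id INTEGER, hire_date TEXT, salary REAL)"
)
prompt_text = PROMPT.format(
    top_k=5,
    table_info=table_info,
    question="Show me all employees hired in 2023.",
)
print(prompt_text)
```

Everything the model knows about your tables arrives through this string, which is why the schema context and the k cap matter.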

Step 4: Generating SQL Queries

Test the chain by generating SQL queries from natural language inputs.

# Example queries
query = "Show me all employees hired in 2023."
sql_query = sql_chain.invoke({"question": query})
print("SQL Query:", sql_query)

# Execute the query to verify
cursor.execute(sql_query)
results = cursor.fetchall()
print("Results:", results)

Example Output:

SQL Query: SELECT * FROM employees WHERE hire_date LIKE '2023%';
Results: [(1, 'Alice Smith', 1, '2023-05-10', 75000.0), (4, 'David Brown', 3, '2023-08-30', 70000.0)]
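Depending on the model and LangChain version, the returned string can carry extra text, such as a leading "SQLQuery:" label or markdown fences, that would make cursor.execute fail. A small hypothetical cleanup helper (fence stripping can be added the same way):

```python
def extract_sql(text: str) -> str:
    """Hypothetical cleanup: strip whitespace and a leading
    'SQLQuery:' label that some models prepend to generated SQL."""
    sql = text.strip()
    prefix = "sqlquery:"
    if sql.lower().startswith(prefix):
        sql = sql[len(prefix):].strip()
    return sql

print(extract_sql("SQLQuery: SELECT * FROM employees;"))
# SELECT * FROM employees;
```

Running the chain output through a helper like this before execution makes the pipeline more robust to formatting drift.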

Try another query:

query = "List employees in the Engineering department with salary above 70000."
sql_query = sql_chain.invoke({"question": query})
print("SQL Query:", sql_query)
cursor.execute(sql_query)
print("Results:", cursor.fetchall())

Example Output:

SQL Query: SELECT e.* FROM employees e JOIN departments d ON e.department_id = d.id WHERE d.name = 'Engineering' AND e.salary > 70000;
Results: [(1, 'Alice Smith', 1, '2023-05-10', 75000.0), (3, 'Carol White', 1, '2024-01-20', 80000.0)]

The chain generates accurate SQL based on schema and query intent. For more examples, see LangChain’s SQL database chains.
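Because the generated SQL is executed directly against the database, a lightweight guardrail is prudent. A minimal, hypothetical check that lets only single SELECT statements through (deliberately conservative: it also rejects WITH ... SELECT and multi-statement inputs):

```python
import re

def is_read_only(sql: str) -> bool:
    """Hypothetical guardrail: accept only a single SELECT statement
    before executing LLM-generated SQL against the database."""
    stripped = sql.strip().rstrip(";")
    if ";" in stripped:  # reject multi-statement payloads
        return False
    return re.match(r"(?is)^\s*select\b", stripped) is not None

print(is_read_only("SELECT * FROM employees WHERE hire_date LIKE '2023%';"))  # True
print(is_read_only("DROP TABLE employees;"))                                  # False
print(is_read_only("SELECT 1; DROP TABLE employees;"))                        # False
```

For production use, prefer database-level protections as well, such as a read-only connection or a restricted database user, rather than string checks alone.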

Step 5: Customizing the SQL Generation System

Enhance the system with custom prompts, conversational memory, or tool integration.

5.1 Custom Prompt Engineering

Modify the prompt for precise SQL generation.

custom_prompt = PromptTemplate(
    input_variables=["input", "table_info", "top_k"],
    template="You are a SQL expert. Generate a valid SQL query for the given database schema and user question. Ensure the query is safe and executable.\n\nSchema: {table_info}\n\nQuestion: {input}\n\nTop K: {top_k}\n\nSQL Query: ",
    validate_template=True
)

sql_chain = create_sql_query_chain(
    llm=llm,
    db=db,
    prompt=custom_prompt,
    k=5
)

PromptTemplate Parameters:

  • input_variables: Variables (e.g., ["input", "table_info", "top_k"]).
  • template: Defines structure and instructions.
  • validate_template: If True, validates variables.

See LangChain’s prompt templates guide.

5.2 Adding Conversational Memory

Incorporate ConversationBufferMemory for context-aware query refinement.

from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferWindowMemory

# Windowed memory keeps only the last k exchanges; plain
# ConversationBufferMemory has no k parameter.
memory = ConversationBufferWindowMemory(
    memory_key="history",
    return_messages=True,
    k=3
)

# ConversationChain does not inject the database schema automatically,
# so describe it in the template.
conversation_prompt = PromptTemplate(
    input_variables=["history", "input"],
    template="You are a SQL assistant for a SQLite database with tables employees(id, name, department_id, hire_date, salary) and departments(id, name). Use the conversation history to refine and generate SQL queries.\n\nHistory: {history}\n\nQuestion: {input}\n\nSQL Query: ",
    validate_template=True
)

conversation_chain = ConversationChain(
    llm=llm,
    memory=memory,
    prompt=conversation_prompt,
    verbose=True,
    output_key="response"
)

# Example with context
response = conversation_chain.predict(input="Show employees hired in 2023.")
print(response)
response = conversation_chain.predict(input="Now filter for those in Engineering.")
print(response)

Memory Parameters:

  • memory_key: Name of the history variable injected into the prompt (default: "history").
  • return_messages: If True, returns message objects instead of a single formatted string.
  • k: Number of recent exchanges to keep (e.g., 3). Note that k belongs to ConversationBufferWindowMemory; plain ConversationBufferMemory stores the full history.

Example Output:

SELECT * FROM employees WHERE hire_date LIKE '2023%';
SELECT e.* FROM employees e JOIN departments d ON e.department_id = d.id WHERE e.hire_date LIKE '2023%' AND d.name = 'Engineering';

See LangChain’s memory integration guide.
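The windowing behavior that k provides can be pictured with a plain deque: once the window is full, the oldest exchange drops out and never reaches the prompt’s {history} slot.

```python
from collections import deque

# Illustration of a k-sized conversation window: only the most
# recent k turns survive, mirroring windowed memory with k=3.
k = 3
history = deque(maxlen=k)
for turn in [
    "Show employees hired in 2023.",
    "Now filter for those in Engineering.",
    "Sort by salary.",
    "Only show the top 2.",
]:
    history.append(turn)

print(list(history))  # the oldest turn has been dropped
```

Keeping the window small bounds token usage per request, at the cost of forgetting earlier turns.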

5.3 Tool Integration with Agents

Add tools like SerpAPI for supplementary data.

from langchain.agents import initialize_agent, Tool
from langchain_community.utilities import SerpAPIWrapper

# Requires a SerpAPI key in the SERPAPI_API_KEY environment variable.
search = SerpAPIWrapper()
tools = [
    Tool(
        name="Search",
        func=search.run,
        description="Fetch external data to inform SQL queries."
    ),
    Tool(
        name="SQLQuery",
        func=lambda q: sql_chain.invoke({"question": q}),
        description="Generate SQL queries from natural language."
    )
]

agent = initialize_agent(
    tools=tools,
    llm=llm,
    agent="zero-shot-react-description",
    verbose=True,
    max_iterations=3,
    early_stopping_method="force"
)

response = agent.run("Generate a SQL query for employees hired in 2023, considering recent hiring trends.")
print(response)

initialize_agent Parameters:

  • tools: List of tools.
  • llm: The LLM.
  • agent: Agent type.
  • verbose: If True, logs decisions.
  • max_iterations: Limits steps.
  • early_stopping_method: How the run ends when max_iterations is reached; "force" returns a fixed stop message instead of making another LLM call.

See LangChain’s agents guide.

Step 6: Deploying the SQL Generation System

Deploy as a Streamlit app for an interactive interface.

import os
import sqlite3

import streamlit as st
from langchain_openai import ChatOpenAI
from langchain.chains import create_sql_query_chain
from langchain_community.utilities import SQLDatabase

st.title("Natural Language to SQL Query Generator")
st.write("Ask questions in plain English to generate SQL queries!")

# The app runs as a standalone script, so recreate the LLM and both
# database connections here; ensure OPENAI_API_KEY is set in the
# environment (or Streamlit secrets) before launching.
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.2)
db = SQLDatabase.from_uri("sqlite:///company.db")
sql_chain = create_sql_query_chain(llm=llm, db=db)
conn = sqlite3.connect("company.db")
cursor = conn.cursor()

if "messages" not in st.session_state:
    st.session_state.messages = []

for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

if query := st.chat_input("Enter your question:"):
    st.session_state.messages.append({"role": "user", "content": query})
    with st.chat_message("user"):
        st.markdown(query)
    with st.chat_message("assistant"):
        with st.spinner("Generating SQL..."):
            sql_query = sql_chain.invoke({"question": query})
            st.markdown(f"**SQL Query:** `{sql_query}`")
            try:
                cursor.execute(sql_query)
                results = cursor.fetchall()
                st.markdown("**Results:**")
                st.write(results)
            except Exception as e:
                st.markdown(f"**Error:** {str(e)}")
            st.session_state.messages.append({"role": "assistant", "content": sql_query})

Save as app.py, install Streamlit (pip install streamlit), and run:

streamlit run app.py

Visit http://localhost:8501. Deploy to Streamlit Community Cloud by pushing to GitHub and configuring secrets. See LangChain’s Streamlit tutorial or Streamlit’s deployment guide.

Step 7: Evaluating and Testing the System

Evaluate generated queries using LangChain’s evaluation metrics.

from langchain.evaluation import load_evaluator

evaluator = load_evaluator(
    "string_distance",
    distance_metric="levenshtein"
)
result = evaluator.evaluate_strings(
    prediction="SELECT * FROM employees WHERE hire_date LIKE '2023%';",
    reference="SELECT * FROM employees WHERE strftime('%Y', hire_date) = '2023';"
)
print(result)

load_evaluator Parameters:

  • evaluator_type: Metric type (e.g., "string_distance").
  • distance_metric: Distance measure (e.g., "levenshtein").

Test with diverse queries (e.g., “Average salary by department”). Debug with LangSmith per LangChain’s LangSmith intro.
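For intuition, the "levenshtein" metric counts the single-character insertions, deletions, and substitutions needed to turn one string into the other. A compact dynamic-programming sketch (illustrative, not LangChain’s implementation, which normalizes the raw distance):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                 # deletion
                curr[j - 1] + 1,             # insertion
                prev[j - 1] + (ca != cb),    # substitution (0 if equal)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

Note that a small edit distance does not guarantee semantic equivalence of two SQL queries; for stronger checks, compare the result sets of the predicted and reference queries.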

Advanced Features and Next Steps

Build on this foundation by combining the techniques covered above (custom prompts, conversational memory, and agent tools), or by pointing SQLDatabase.from_uri at other SQLAlchemy-supported backends such as PostgreSQL or MySQL. See LangChain’s startup examples or GitHub repos.

Conclusion

Generating SQL from natural language with LangChain and OpenAI democratizes database access. This guide covered setup, query generation, customization, deployment, evaluation, and parameters. Leverage LangChain’s chains, prompts, and integrations to build intuitive SQL systems.

Explore agents, tools, or evaluation metrics. Debug with LangSmith. Happy coding!