LangChain Integration
What Problem Does This Solve?
LangChain is a popular framework for building AI agent applications, but agents often struggle with data questions because they:
- Don't have access to your organization's data catalog
- Hallucinate table names, schemas, and relationships
- Can't discover data ownership or documentation
- Have no context about data quality or lineage
The LangChain integration provides pre-built LangChain tools that give your agents direct access to DataHub metadata, enabling them to answer data questions accurately using real metadata from your organization.
What You Can Build
- Data Discovery Chatbot: "Show me all datasets about customers in the marketing domain"
- Schema Explorer: "What columns exist in the user_events table and what do they mean?"
- Lineage Tracer: "What dashboards use data from this raw table?"
- Documentation Assistant: "Find the business definition of 'monthly recurring revenue'"
- Compliance Helper: "List all tables with PII and their owners"
Overview
The optional LangChain add-on lets you connect DataHub's context tools directly to your LangChain agents.
Installation
pip install datahub-agent-context[langchain]
Prerequisites
- Python 3.10 or higher
- LangChain (`pip install langchain langchain-openai`)
- DataHub instance with access token (see the environment sketch below)
- OpenAI API key (or other LLM provider)
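The quick start below creates the client with `DataHubClient.from_env()`, which reads connection details from the environment. A minimal setup sketch, assuming the standard `DATAHUB_GMS_URL`/`DATAHUB_GMS_TOKEN` variables and the `OPENAI_API_KEY` variable read by `langchain-openai` (adjust the names if your deployment differs):

```python
import os

# Assumed variable names; check your DataHub deployment's configuration if from_env() cannot connect.
os.environ["DATAHUB_GMS_URL"] = "http://localhost:8080"   # DataHub GMS endpoint
os.environ["DATAHUB_GMS_TOKEN"] = "YOUR_DATAHUB_TOKEN"     # personal access token
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_KEY"           # used by ChatOpenAI
```

In practice you would export these in your shell or secrets manager rather than setting them in code.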
Quick Start
Basic Setup
Build AI agents with pre-built LangChain tools:
from datahub.sdk.main_client import DataHubClient
from datahub_agent_context.langchain_tools import build_langchain_tools
# Initialize DataHub client from environment (recommended)
client = DataHubClient.from_env()
# Or specify server and token explicitly:
# client = DataHubClient(server="http://localhost:8080", token="YOUR_TOKEN")
# Build all tools (read-only by default)
tools = build_langchain_tools(client, include_mutations=False)
# Or include mutation tools for tagging, descriptions, etc.
tools = build_langchain_tools(client, include_mutations=True)
Note: `include_mutations=False` provides read-only tools (search, get entities, lineage). Set it to `True` to enable tools that modify metadata (add tags, update descriptions, etc.).
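To see exactly which tools each mode provides, print their names after building them; a quick sketch (the exact tool list depends on the package version):

```python
read_only_tools = build_langchain_tools(client, include_mutations=False)
all_tools = build_langchain_tools(client, include_mutations=True)

# Each LangChain tool carries a name and description that the LLM sees.
print([tool.name for tool in read_only_tools])
print([tool.name for tool in all_tools])
```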
Complete Working Example
Here's a full example of a DataHub-powered LangChain agent:
from datahub.sdk.main_client import DataHubClient
from datahub_agent_context.langchain_tools import build_langchain_tools
from langchain_openai import ChatOpenAI
from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain.prompts import ChatPromptTemplate
# Initialize DataHub client
client = DataHubClient.from_env()
# Or: client = DataHubClient(server="http://localhost:8080", token="YOUR_TOKEN")
# Build DataHub tools (read-only)
tools = build_langchain_tools(client, include_mutations=False)
# Initialize LLM
llm = ChatOpenAI(
    model="gpt-4",
    temperature=0,
    openai_api_key="YOUR_OPENAI_KEY"
)

# Create agent prompt
prompt = ChatPromptTemplate.from_messages([
    ("system", """You are a helpful data catalog assistant with access to DataHub metadata.
Use the available tools to:
- Search for datasets, dashboards, and other data assets
- Get detailed entity information including schemas and descriptions
- Trace data lineage to understand data flow
- Find documentation and business glossary terms
Always provide URNs when referencing entities so users can find them in DataHub.
Be concise but thorough in your explanations."""),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])

# Create agent
agent = create_tool_calling_agent(llm, tools, prompt)
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    verbose=True,
    handle_parsing_errors=True
)

# Use the agent
def ask_datahub(question: str):
    """Ask a question about your data catalog."""
    result = agent_executor.invoke({"input": question})
    return result["output"]

# Example queries
if __name__ == "__main__":
    # Find datasets
    print(ask_datahub("Find all datasets about customers"))

    # Get schema information
    print(ask_datahub("What columns are in the user_events dataset?"))

    # Trace lineage
    print(ask_datahub("Show me the upstream sources for the revenue dashboard"))

    # Find documentation
    print(ask_datahub("What's the business definition of 'churn rate'?"))
Example Output
User: Find all datasets about customers
Agent: I found 3 datasets related to customers:
1. **customer_profiles** (urn:li:dataset:(urn:li:dataPlatform:snowflake,prod.analytics.customer_profiles,PROD))
- Description: Core customer profile data including demographics and preferences
- Platform: Snowflake
- Domain: Marketing
2. **customer_transactions** (urn:li:dataset:(urn:li:dataPlatform:postgres,transactions.customer_orders,PROD))
- Description: Historical customer purchase transactions
- Platform: PostgreSQL
- Domain: Finance
3. **customer_360** (urn:li:dataset:(urn:li:dataPlatform:bigquery,analytics.customer_360,PROD))
- Description: Unified customer view combining profile, transactions, and interactions
- Platform: BigQuery
- Domain: Analytics
You can view these in DataHub at: https://your-datahub.acryl.io
Advanced Usage
Using with Memory
Add conversation memory to your agent:
from langchain.memory import ConversationBufferMemory
memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True
)

agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    memory=memory,
    verbose=True
)
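For the agent to actually use earlier turns, the prompt also needs a placeholder that matches the memory key; a minimal sketch reusing the prompt structure from the quick start:

```python
from langchain.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful data catalog assistant with access to DataHub metadata."),
    ("placeholder", "{chat_history}"),  # filled in from memory_key="chat_history"
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])
```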
Filtering Tools
Include only specific tool categories:
from datahub_agent_context.langchain_tools import (
    build_search_tools,
    build_entity_tools,
    build_lineage_tools
)
# Only search and entity tools
search_tools = build_search_tools(client)
entity_tools = build_entity_tools(client)
tools = search_tools + entity_tools
Custom Tool Configuration
# Configure specific tools
tools = build_langchain_tools(
    client,
    include_mutations=False,
    max_search_results=20,  # Increase search result limit
    max_lineage_depth=5     # Deeper lineage traversal
)
More Examples
Complete examples are available in the datahub-project repo.
Troubleshooting
Tool Execution Errors
Problem: Agent tries to use tools but gets errors
Solutions:
- Verify DataHub connection: Confirm that `client.config` is set correctly (see the diagnostic sketch below)
- Check tool availability: Print `[tool.name for tool in tools]` to see the available tools
- Enable verbose mode: Set `verbose=True` in AgentExecutor to see tool calls
- Validate token permissions: Ensure the token has read access (and write access if using mutations)
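A minimal diagnostic sketch that combines these checks, assuming the `client`, `tools`, and `agent` objects from the quick start:

```python
from langchain.agents import AgentExecutor

# Confirm the client is pointed at the right server and token.
print(client.config)

# List the tool names the agent is allowed to call.
print([tool.name for tool in tools])

# Rebuild the executor with verbose output to see each tool call and its result.
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    verbose=True,
    handle_parsing_errors=True,
)
```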
Agent Not Using Tools
Problem: Agent responds without calling DataHub tools
Solutions:
- Improve system prompt: Explicitly instruct agent to use tools
- Use better examples: Add few-shot examples of tool usage to prompt
- Check model compatibility: Ensure the LLM supports tool/function calling (GPT-4, GPT-3.5-turbo, Claude 3+)
- Reduce temperature: Set `temperature=0` for more deterministic tool usage (see the sketch below)
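A sketch of a more tool-forcing configuration, combining an explicit instruction, a short few-shot example, and `temperature=0` (the wording of the system message is only an illustration):

```python
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

# Deterministic sampling makes tool selection more consistent.
llm = ChatOpenAI(model="gpt-4", temperature=0)

prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are a data catalog assistant. Never answer questions about datasets, "
     "schemas, or lineage from memory; always call a DataHub tool first.\n"
     "Example:\n"
     "User: What columns are in user_events?\n"
     "Assistant: (calls the schema tool, then summarizes the returned columns)"),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])
```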
Performance Issues
Problem: Agent responses are slow
Solutions:
- Limit search results: Reduce `max_search_results` in the tool configuration
- Use streaming: Enable streaming responses with `stream=True`
- Cache results: Implement caching for frequently accessed metadata (see the caching sketch below)
- Use faster models: Consider gpt-3.5-turbo for simpler queries
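As a simple form of caching, you can memoize the question-answering helper from the quick start; a sketch (in-process only, with no invalidation, so stale answers are possible):

```python
from functools import lru_cache

@lru_cache(maxsize=128)
def ask_datahub_cached(question: str) -> str:
    """Memoized variant of ask_datahub: repeated questions skip the agent entirely."""
    result = agent_executor.invoke({"input": question})
    return result["output"]

# The second identical call is answered from the cache without hitting the LLM or DataHub.
print(ask_datahub_cached("What columns are in the user_events dataset?"))
print(ask_datahub_cached("What columns are in the user_events dataset?"))
```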
Import Errors
Problem: ModuleNotFoundError for LangChain components
Solutions:
# Install all required dependencies
pip install datahub-agent-context[langchain]
pip install langchain langchain-openai
# Or install from scratch
pip install datahub-agent-context langchain langchain-openai langchain-community
Getting Help
- LangChain Docs: LangChain Agent Documentation
- DataHub Agent Context: Agent Context Kit Guide
- GitHub Issues: Report issues
- Community Slack: Join DataHub Slack