# Agent Context Kit
## What Problem Does This Solve?

AI agents that answer questions about data often face these challenges:
- Hallucinate metadata: Generate table or column names that don't exist
- Lack context: Can't discover related datasets, lineage, or business definitions
- Missing ownership info: Don't know who owns what data or how to contact them
- No quality signals: Can't distinguish certified datasets from deprecated ones
Agent Context Kit solves this by giving AI agents real-time access to your DataHub metadata catalog, enabling them to provide accurate, contextual answers about your data ecosystem.
## Example Use Cases
- Data Discovery: "Show me all datasets owned by the analytics team"
- Schema Exploration: "What tables have a customer_id column?"
- Lineage Tracing: "Trace lineage from raw data to this dashboard"
- Documentation Search: "Find the business definition of 'churn rate'"
- Compliance Queries: "List all PII fields and their owners"
## Overview
DataHub Agent Context provides a collection of tools and utilities for building AI agents that interact with DataHub metadata. This package contains MCP (Model Context Protocol) tools that enable AI agents to search, retrieve, and manipulate metadata in DataHub. These can be used directly to create an agent, or be included in an MCP server such as DataHub's open source MCP server.
## Quick Start Guide
- New to Agent Context? Start here with the basic example below
- Using LangChain? See the LangChain integration guide
- Using Snowflake Intelligence? See the Snowflake integration guide
## Installation

```bash
pip install datahub-agent-context
```
## Prerequisites
- DataHub instance (Cloud or self-hosted)
- Python 3.10 or higher
- DataHub personal access token (for authentication)
## Basic Usage

### Simple Example

These tools are designed to be called by an AI agent, with their responses passed directly to an LLM, so each tool returns a plain dict. They can also be used independently, as in the example below.
```python
from datahub.sdk.main_client import DataHubClient
from datahub_agent_context.context import DataHubContext
from datahub_agent_context.mcp_tools.search import search
from datahub_agent_context.mcp_tools.entities import get_entities

# Initialize DataHub client from environment (or specify server/token)
client = DataHubClient.from_env()
# Or: client = DataHubClient(server="http://localhost:8080", token="YOUR_TOKEN")

# Use DataHubContext to set up the client for tool calls
with DataHubContext(client):
    # Search for datasets
    results = search(
        query="user_data",
        filters={"entity_type": ["dataset"]},
        num_results=10,
    )
    print(f"Found {len(results['searchResults'])} datasets")
    for result in results["searchResults"]:
        print(f"- {result['entity']['name']} ({result['entity']['urn']})")

    # Get detailed entity information
    entity_urns = [result["entity"]["urn"] for result in results["searchResults"]]
    entities = get_entities(urns=entity_urns)
    print(f"\nDetailed info for {len(entities['entities'])} entities:")
    for entity in entities["entities"]:
        print(f"- {entity['urn']}: {entity.get('properties', {}).get('description', 'No description')}")
```
## Key Concepts (Glossary)
Before using Agent Context Kit, familiarize yourself with these DataHub concepts:
- Entity: A metadata object in DataHub (e.g., Dataset, Dashboard, Chart, User). Think of these as the "nouns" of your data ecosystem.
- URN (Uniform Resource Name): A unique identifier for an entity. Format: `urn:li:dataset:(urn:li:dataPlatform:mysql,mydb.users,PROD)`. This is like a primary key for metadata.
- MCP (Model Context Protocol): A standard protocol for connecting AI agents to external data sources. These tools implement MCP for DataHub.
- Client: The underlying client used internally to query DataHub. For LangChain users, this is handled automatically by the builder.
## Agent Platforms
| Platform | Status | Guide |
|---|---|---|
| Custom | Launched | See below |
| LangChain | Launched | LangChain Guide |
| Snowflake | Launched | Snowflake Guide |
| Google ADK | Coming Soon | - |
| CrewAI | Coming Soon | - |
| OpenAI | Coming Soon | - |
## Available Tools
### Search Tools

`search(client, query, filters, num_results)`

- Use when: Finding entities by keyword across DataHub
- Returns: List of matching entities with URNs, names, and descriptions
- Example: `search(client, "customer", {"entity_type": ["dataset"]}, 10)` to find datasets about customers
- Filters: Can filter by entity_type, platform, domain, tags, and more

`search_documents(client, query, semantic_query, num_results)`

- Use when: Searching for documentation, business glossaries, or knowledge base articles
- Returns: Document entities with titles and content
- Example: `search_documents(client, "*", "data retention policy", 5)` to find policy documents

`grep_documents(client, pattern, num_results)`

- Use when: Searching for specific patterns or exact phrases in documentation
- Returns: Documents containing the pattern with matched excerpts
- Example: `grep_documents(client, "PII.*encrypted", 10)` to find docs mentioning PII encryption
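The document tools compose with the same context-manager pattern as the Simple Example above. A minimal sketch; the import path for `search_documents` and `grep_documents` is an assumption that mirrors the `search` import shown earlier:

```python
from datahub.sdk.main_client import DataHubClient
from datahub_agent_context.context import DataHubContext
# Assumed import path, mirroring the `search` import in the Simple Example
from datahub_agent_context.mcp_tools.search import search_documents, grep_documents

client = DataHubClient.from_env()

with DataHubContext(client):
    # Keyword query "*" plus a semantic query, as in the example above
    docs = search_documents(query="*", semantic_query="data retention policy", num_results=5)

    # Regex-style pattern match for an exact phrase
    hits = grep_documents(pattern="PII.*encrypted", num_results=10)
```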
### Entity Tools

`get_entities(client, urns)`

- Use when: Retrieving detailed metadata for specific entities you already know the URNs for
- Returns: Full entity metadata including all aspects (schema, ownership, properties, etc.)
- Example: After search, use this to get complete details about the found entities

`list_schema_fields(client, urn, filters)`

- Use when: Exploring columns/fields in a dataset
- Returns: List of fields with names, types, descriptions, and tags
- Example: `list_schema_fields(client, dataset_urn, {"field_path": "customer_"})` to find customer-related columns
- Filters: Can filter by field name patterns, data types, or tags
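A typical flow chains the two entity tools: fetch full metadata for a known URN, then drill into its fields. A sketch, assuming `list_schema_fields` is importable alongside `get_entities`:

```python
from datahub.sdk.main_client import DataHubClient
from datahub_agent_context.context import DataHubContext
# Assumed import path; the Simple Example imports get_entities from this module
from datahub_agent_context.mcp_tools.entities import get_entities, list_schema_fields

client = DataHubClient.from_env()
dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:mysql,mydb.users,PROD)"

with DataHubContext(client):
    # Full metadata (schema, ownership, properties, ...) for a known URN
    entities = get_entities(urns=[dataset_urn])

    # Narrow to customer-related columns using a field_path filter
    fields = list_schema_fields(urn=dataset_urn, filters={"field_path": "customer_"})
```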
### Lineage Tools

`get_lineage(client, urn, direction, max_depth)`

- Use when: Understanding data flow and dependencies
- Returns: Upstream (sources) or downstream (consumers) entities
- Example: `get_lineage(client, dashboard_urn, "UPSTREAM", 3)` to trace data sources for a dashboard
- Direction: Use "UPSTREAM" for sources, "DOWNSTREAM" for consumers

`get_lineage_paths_between(client, source_urn, destination_urn)`

- Use when: Finding how data flows between two specific entities
- Returns: All paths connecting the entities with intermediate steps
- Example: Find how raw data flows to a specific dashboard
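Both lineage tools in one sketch, using the context-manager pattern from the Simple Example. The import path and the two URNs are illustrative assumptions:

```python
from datahub.sdk.main_client import DataHubClient
from datahub_agent_context.context import DataHubContext
# Assumed import path for the lineage tools
from datahub_agent_context.mcp_tools.lineage import get_lineage, get_lineage_paths_between

client = DataHubClient.from_env()
dashboard_urn = "urn:li:dashboard:(looker,sales_overview)"  # hypothetical URN
raw_urn = "urn:li:dataset:(urn:li:dataPlatform:kafka,events.raw,PROD)"  # hypothetical URN

with DataHubContext(client):
    # Trace the dashboard's sources, up to three hops upstream
    upstream = get_lineage(urn=dashboard_urn, direction="UPSTREAM", max_depth=3)

    # Find every path (with intermediate steps) from the raw table to the dashboard
    paths = get_lineage_paths_between(source_urn=raw_urn, destination_urn=dashboard_urn)
```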
### Query Tools

`get_dataset_queries(client, urn, column_name)`

- Use when: Finding SQL queries that use a dataset or specific column
- Returns: List of queries with SQL text and metadata
- Example: `get_dataset_queries(client, dataset_urn, "email")` to see how the email column is used
- Use cases: Understanding data usage patterns, finding query examples
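A sketch of the query tool in the same pattern (import path and URN are assumptions):

```python
from datahub.sdk.main_client import DataHubClient
from datahub_agent_context.context import DataHubContext
# Assumed import path for the query tool
from datahub_agent_context.mcp_tools.queries import get_dataset_queries

client = DataHubClient.from_env()
dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:snowflake,customers,PROD)"

with DataHubContext(client):
    # SQL queries that touch this dataset's "email" column
    queries = get_dataset_queries(urn=dataset_urn, column_name="email")
```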
### Mutation Tools

Note: These tools modify metadata. Use with caution in production environments. A short end-to-end sketch follows this list.

`add_tags(client, urn, tags)` / `remove_tags(client, urn, tags)`

- Use when: Categorizing or labeling entities
- Example: `add_tags(client, dataset_urn, ["PII", "Finance"])` to mark sensitive data

`update_description(client, urn, description)`

- Use when: Adding or updating documentation for entities
- Example: Agents can auto-generate and update descriptions

`set_domains(client, urn, domain_urns)` / `remove_domains(client, urn, domain_urns)`

- Use when: Organizing entities into business domains
- Example: Assign datasets to "Marketing" or "Finance" domains

`add_owners(client, urn, owners)` / `remove_owners(client, urn, owners)`

- Use when: Assigning data ownership and accountability
- Example: `add_owners(client, dataset_urn, [{"owner": user_urn, "type": "TECHNICAL_OWNER"}])`

`add_glossary_terms(client, urn, term_urns)` / `remove_glossary_terms(client, urn, term_urns)`

- Use when: Linking entities to business glossary definitions
- Example: Link a revenue column to the "Revenue" glossary term

`add_structured_properties(client, urn, properties)` / `remove_structured_properties(client, urn, properties)`

- Use when: Adding custom metadata fields to entities
- Example: Add "data_retention_days" or "compliance_tier" properties

`save_document(document_type, title, content, urn, topics, related_documents, related_assets)`

- Use when: Creating or updating standalone documents in DataHub's knowledge base (insights, decisions, FAQs, analysis, etc.)
- Document types: "Insight", "Decision", "FAQ", "Analysis", "Summary", "Recommendation", "Note", "Context"
- Parameters:
  - `document_type`: Type of document (required)
  - `title`: Document title (required)
  - `content`: Full document content in markdown format (required)
  - `urn`: URN of existing document to update (optional, creates new if not provided)
  - `topics`: List of topic tags for categorization (optional)
  - `related_documents`: URNs of related documents (optional)
  - `related_assets`: URNs of related data assets like datasets or dashboards (optional)
- Example: `save_document("Insight", "High Null Rate in Customer Emails", "## Finding\n\n23% of customer records have null email...", topics=["data-quality", "customer-data"], related_assets=["urn:li:dataset:(urn:li:dataPlatform:snowflake,customers,PROD)"])`
- Important: Always confirm with the user before saving. Documents are visible to all DataHub users.
- Note: For updating descriptions on data assets (datasets, dashboards), use `update_description` instead
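The sketch promised above: a small curation workflow combining several mutation tools. The import path is an assumption, and these calls write to your catalog, so try them against a test entity first:

```python
from datahub.sdk.main_client import DataHubClient
from datahub_agent_context.context import DataHubContext
# Assumed import path for the mutation tools
from datahub_agent_context.mcp_tools.mutations import add_tags, add_owners, update_description

client = DataHubClient.from_env()
dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:snowflake,customers,PROD)"
user_urn = "urn:li:corpuser:jdoe"  # hypothetical user

with DataHubContext(client):
    # Flag the dataset as sensitive financial data
    add_tags(urn=dataset_urn, tags=["PII", "Finance"])

    # Document what the table contains
    update_description(urn=dataset_urn, description="Customer master table; one row per customer.")

    # Record accountability, using the owner shape from the example above
    add_owners(urn=dataset_urn, owners=[{"owner": user_urn, "type": "TECHNICAL_OWNER"}])
```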
### User Tools

`get_me(client)`

- Use when: Getting information about the authenticated user
- Returns: User details including name, email, and roles
- Use cases: Personalization, permission checks, audit logging
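A minimal sketch (import path assumed):

```python
from datahub.sdk.main_client import DataHubClient
from datahub_agent_context.context import DataHubContext
# Assumed import path for the user tool
from datahub_agent_context.mcp_tools.users import get_me

client = DataHubClient.from_env()

with DataHubContext(client):
    me = get_me()  # name, email, and roles of the authenticated user
    print(me)
```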
## MCP Server

It is also possible to connect your agent or tool directly to the DataHub MCP Server using your chosen framework.
## Troubleshooting

### Authentication Errors

Problem: `Unauthorized` or 401 errors when calling tools

Solutions:

- Verify your DataHub token is valid: `datahub check metadata-service`
- Ensure the token has the required permissions (read access for search tools, write access for mutation tools)
- Check that the token hasn't expired
### Connection Errors

Problem: Connection refused or timeout errors

Solutions:

- Verify the DataHub server URL is correct and accessible
- Check network connectivity: `curl -I https://your-datahub-instance.com/api/gms/health`
- Ensure firewall rules allow outbound connections to DataHub
- For self-hosted DataHub, verify the service is running
### Empty or Unexpected Results

Problem: Search returns no results or missing expected entities

Solutions:

- Verify entities exist in the DataHub UI first
- Check that your search query isn't too restrictive
- Try removing filters to broaden the search
- Ensure entity types are spelled correctly (case-sensitive): `dataset`, not `Dataset`
- For schema fields, verify the dataset URN is correct
### Import Errors

Problem: `ModuleNotFoundError: No module named 'datahub_agent_context'`

Solutions:

- Ensure the package is installed: `pip install datahub-agent-context`
- If using LangChain: `pip install datahub-agent-context[langchain]`
- If using Snowflake: `pip install datahub-agent-context[snowflake]`
- Verify you're using the correct Python environment
### Rate Limiting

Problem: `429 Too Many Requests` errors

Solutions:

- Implement exponential backoff and retry logic (see the sketch below)
- Reduce the frequency of API calls
- For batch operations, use pagination instead of large single requests
- Contact your DataHub admin to adjust rate limits if needed
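One way to implement the backoff advice is a generic wrapper around any tool call. This is a sketch under the assumption that rate-limit failures surface as exceptions mentioning 429; adjust the exception check to whatever your client actually raises:

```python
import random
import time

def with_backoff(call, max_attempts=5, base_delay=1.0):
    """Retry `call` with exponential backoff plus jitter on rate-limit errors."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:  # narrow this to your client's rate-limit exception
            if "429" not in str(exc) or attempt == max_attempts - 1:
                raise
            # Sleep 1s, 2s, 4s, ... plus jitter, then retry
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))

# Usage: results = with_backoff(lambda: search(query="user_data", num_results=10))
```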
### Debugging Tips

Enable debug logging to see detailed API calls:

```python
import logging

logging.basicConfig(level=logging.DEBUG)

# Your agent code here
```

Check the DataHub server logs for more details on server-side errors.
## Getting Help
- Documentation: DataHub Docs
- Community Slack: Join DataHub Slack
- GitHub Issues: Report bugs
- Email Support: For DataHub Cloud customers, contact support@acryl.io