DataHub Semantic Search
This directory contains documentation for DataHub's semantic search capability, which enables natural language search across metadata entities using vector embeddings.
Note: This is developer documentation for the semantic search feature. For a working example, see the smoke test at smoke-test/tests/semantic/test_semantic_search.py.
Overview
Traditional keyword search requires exact term matches, limiting discoverability. Semantic search uses AI-generated embeddings to understand the meaning of queries and documents, returning relevant results even when exact keywords don't match.
Example:
- Query: "how to request data access permissions"
- Keyword search: ❌ No results (no exact match)
- Semantic search: ✅ Returns "Data Access Request Process" document
Architecture
┌─────────────────────────────────────────────────────────────────────────┐
│                         DataHub Semantic Search                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  ┌──────────────┐     ┌──────────────┐     ┌──────────────────────────┐ │
│  │  Ingestion   │     │     GMS      │────▶│        OpenSearch        │ │
│  │  Connector   │     │              │     │                          │ │
│  │              │     │ ┌──────────┐ │     │  ┌────────────────────┐  │ │
│  │ 1. Generate  │     │ │ Process  │ │     │  │ entityindex_v2     │  │ │
│  │    embeddings│     │ │ MCP +    │ │     │  │ (keyword search)   │  │ │
│  │              │     │ │ Write to │ │     │  └────────────────────┘  │ │
│  │ 2. Emit MCP  │────▶│ │ indices  │ │     │                          │ │
│  │    with      │     │ └──────────┘ │     │  ┌────────────────────┐  │ │
│  │    Semantic  │     │              │     │  │ entityindex_v2_    │  │ │
│  │    Embedding │     │              │     │  │ semantic           │  │ │
│  │    aspect    │     │              │     │  │ (vector search)    │  │ │
│  └──────────────┘     └──────────────┘     │  └────────────────────┘  │ │
│                                            └──────────────────────────┘ │
│                                                          │              │
│  ┌──────────────┐                                        │              │
│  │   GraphQL    │◀───────────────────────────────────────┘              │
│  │   Client     │  semanticSearchAcrossEntities()                       │
│  └──────────────┘                                                       │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
How It Works
1. Data Ingestion
Documents and other entities are ingested into DataHub using standard ingestion connectors. When semantic search is enabled, GMS performs a dual-write:
- Primary Index (entityindex_v2): Standard keyword-searchable index
- Semantic Index (entityindex_v2_semantic): Vector-enabled index for semantic search
Note: The dual-index approach is transitional. The plan is to eventually retire v2 indices and use _semantic indices exclusively for both keyword and semantic search. See Architecture for details.
2. Embedding Generation
Embeddings are generated at two points:
Document Embeddings (at ingestion time):
- Generated by the ingestion connector
- Emitted via MCP (Metadata Change Proposal) as a SemanticContent aspect
- GMS processes the MCP and writes embeddings to the semantic index
- Supports privacy-sensitive use cases where only embeddings (not source text) are shared
Query Embeddings (at search time):
- Generated by GMS using the configured embedding provider
- Used to find similar documents via k-NN search
3. Query Processing
When a user performs a semantic search:
- The query text is converted to an embedding vector using the same model that produced the document embeddings
- OpenSearch performs k-NN (k-nearest neighbors) vector similarity search
- Results are ranked by cosine similarity to the query embedding
- Top matches are returned through the GraphQL API
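The ranking step above can be illustrated with a minimal, self-contained sketch. It is not DataHub or OpenSearch code: the 3-dimensional vectors, the document titles used as keys, and the function names are made up for illustration, and the brute-force scan stands in for the k-NN index that OpenSearch maintains internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def knn_search(query_vec, doc_vecs, k):
    """Return the k (doc_id, score) pairs closest to query_vec (brute force)."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings; real models emit far more dimensions.
docs = {
    "Data Access Request Process": [0.9, 0.1, 0.0],
    "Quarterly Revenue Dashboard": [0.0, 0.2, 0.9],
    "Onboarding Checklist": [0.5, 0.5, 0.1],
}
# Stand-in for the embedding of "how to request data access permissions".
query = [0.8, 0.2, 0.0]

top = knn_search(query, docs, k=2)
for doc_id, score in top:
    print(f"{score:.3f}  {doc_id}")
```

Because the score depends only on the angle between vectors, the query and the documents must be embedded by the same model for the comparison to be meaningful.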
Quick Start
Prerequisites
- DataHub running with semantic search enabled
- AWS credentials (for Bedrock) or an API key (for Cohere/OpenAI)
1. Enable Semantic Search
Set in your environment (e.g., docker/profiles/empty2.env):
ELASTICSEARCH_SEMANTIC_SEARCH_ENABLED=true
ELASTICSEARCH_SEMANTIC_SEARCH_ENTITIES=document
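As a quick pre-flight check, a small script can confirm that both variables are visible to the process before you start DataHub. The helper below is a hypothetical sketch, not part of DataHub. It assumes that the flag is the literal string true and that multiple entity types would be comma-separated; the example above does not confirm either assumption.

```python
import os


def semantic_search_settings(env):
    """Read the two semantic search variables from a mapping such as os.environ.

    Assumptions (not confirmed by DataHub docs): the flag is enabled by the
    string "true", and multiple entity types are comma-separated.
    """
    flag = env.get("ELASTICSEARCH_SEMANTIC_SEARCH_ENABLED", "false")
    enabled = flag.strip().lower() == "true"
    raw_entities = env.get("ELASTICSEARCH_SEMANTIC_SEARCH_ENTITIES", "")
    entities = [name.strip() for name in raw_entities.split(",") if name.strip()]
    return enabled, entities


enabled, entities = semantic_search_settings(os.environ)
print(f"semantic search enabled: {enabled}, entities: {entities}")
```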
2. Run the Smoke Test
The best way to verify semantic search is working is to run the smoke test:
cd smoke-test
ENABLE_SEMANTIC_SEARCH_TESTS=true pytest tests/semantic/test_semantic_search.py -v
This test:
- Ingests sample documents via GraphQL
- Waits for indexing (20 seconds)
- Executes semantic search
- Verifies results
GraphQL API
Semantic Search Query
query SemanticSearch($input: SearchAcrossEntitiesInput!) {
  semanticSearchAcrossEntities(input: $input) {
    total
    searchResults {
      entity {
        urn
        type
        ... on Document {
          info {
            title
            contents {
              text
            }
          }
        }
      }
    }
  }
}
Variables:
{
  "input": {
    "query": "how to request data access",
    "types": ["DOCUMENT"],
    "start": 0,
    "count": 10
  }
}
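The query and variables above can be sent from a script. The sketch below uses only the Python standard library. The endpoint URL, the bearer-token header, and the function names are assumptions for illustration, so adjust them to match your deployment. The selection set is trimmed to urn and type to keep the example short.

```python
import json
import urllib.request

SEMANTIC_SEARCH_QUERY = """
query SemanticSearch($input: SearchAcrossEntitiesInput!) {
  semanticSearchAcrossEntities(input: $input) {
    total
    searchResults {
      entity {
        urn
        type
      }
    }
  }
}
"""


def build_request(graphql_url, token, query_text, count=10):
    """Build the HTTP POST carrying the GraphQL query and its variables."""
    payload = {
        "query": SEMANTIC_SEARCH_QUERY,
        "variables": {
            "input": {
                "query": query_text,
                "types": ["DOCUMENT"],
                "start": 0,
                "count": count,
            }
        },
    }
    headers = {"Content-Type": "application/json"}
    if token:
        # Assumption: the deployment accepts a bearer token.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        graphql_url,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def semantic_search(graphql_url, token, query_text, count=10):
    """Send the request and return the semanticSearchAcrossEntities payload."""
    request = build_request(graphql_url, token, query_text, count)
    with urllib.request.urlopen(request, timeout=30) as response:
        body = json.load(response)
    if body.get("errors"):
        raise RuntimeError(f"GraphQL errors: {body['errors']}")
    return body["data"]["semanticSearchAcrossEntities"]


# Example call (requires a running instance; URL and token are placeholders):
# result = semantic_search(
#     "http://localhost:8080/api/graphql", "<token>", "how to request data access"
# )
# print(result["total"])
```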
Documentation Index
| File | Description |
|---|---|
| README.md | This documentation - overview and quick start |
| ARCHITECTURE.md | Detailed architecture and design decisions |
| CONFIGURATION.md | Configuration options and embedding models |
Testing
For a working example of semantic search:
# Run the smoke test
cd smoke-test
ENABLE_SEMANTIC_SEARCH_TESTS=true pytest tests/semantic/test_semantic_search.py -v
Further Reading
- Architecture Details - Deep dive into the design
- Configuration Guide - Embedding models and settings