Ingestion Security Comparison
DataHub supports three ways to ingest metadata. They differ primarily in where credentials are stored and what network access is required.
Quick Comparison
| Credentials | Runs From | Network | Firewall/IP Allowlist Adjustments | |
|---|---|---|---|---|
| UI Ingestion | Encrypted in DataHub | DataHub's infrastructure | DataHub infrastructure connects to your data sources | Required for sources behind firewalls or with IP allowlists |
| CLI Ingestion | Local files/env vars | Wherever you execute it (personal machine, CI/CD, scheduler like Airflow/Cron, etc.) | CLI connects to your data sources, then sends metadata to DataHub | Depends on where CLI runs and if those machines already have connectivity |
| Remote Executor | Your infrastructure (AWS Secrets, K8s, etc.) | Deployed in your infrastructure (K8s, ECS, etc.) | Executor connects to your data sources, then sends metadata to DataHub (outbound only) | Depends on where executor runs and if those machines already have connectivity |
Network Flow Diagram
graph LR
UI[UI Ingestion<br/>runs in DataHub infrastructure]
CLI[CLI Ingestion<br/>runs where you execute it]
RE[Remote Executor<br/>runs in your infrastructure]
DH[DataHub]
SRC[Data Sources]
UI -.->|managed by DataHub| DH
DH -->|connects to| SRC
CLI -->|sends metadata to| DH
CLI -->|extracts from| SRC
RE -->|sends metadata to| DH
RE -->|extracts from| SRC
Where Credentials Live
UI Ingestion
Credentials are encrypted with an encryption key and stored in the DataHub database. DataHub manages the encryption and uses these credentials when connecting to your data sources.
Secure by default: GMS defaults to SECRET_SERVICE_CALLER_GUARD_MODE=ENFORCE, which blocks browser sessions and user PATs from reading plaintext secret values via the GraphQL API. Administrators can still manage secrets through the UI; only programmatic plaintext retrieval from human-facing sessions is denied unless the guard is relaxed. Scheduled UI ingestion resolves secrets via datahub-actions (OSS, system client credentials) or an embedded executor (DataHub Cloud).
CLI Ingestion
Credentials are stored in your infrastructure via recipe files. Best practice: Always use environment variables rather than hardcoding credentials in recipe files. You can also integrate with local secret managers.
Remote Executor
Recommended: Integrate with enterprise secret management in your infrastructure:
- AWS Secrets Manager (runtime or ECS deploy-time)
- Kubernetes Secrets mounted at
/mnt/secrets/ - External Secrets Operator
- HashiCorp Vault
- Environment variables on the executor container
Discouraged for strict security: DataHub UI Secrets. These store credentials in DataHub and are sent to the Remote Executor in plaintext over the DataHub API when a job runs (TLS in transit). Use local secret backends instead.
Network Patterns
UI Ingestion
DataHub's infrastructure connects directly to your data sources. This requires configuring your sources to allow DataHub access, which is source-dependent:
- Cloud sources (Snowflake, BigQuery, etc.): May require adding DataHub IPs to allowlists
- On-premise sources: May require VPN tunnels or firewall rules to allow DataHub to reach them
CLI Ingestion
The CLI runs wherever you execute it (personal machine, CI/CD, cloud instance, scheduler like Airflow/Cron). It first connects to your data sources to extract metadata, then sends that metadata to DataHub. Network requirements depend entirely on where the CLI runs and whether that machine already has connectivity to both your sources and DataHub.
Remote Executor
Deployed as software in your infrastructure (Kubernetes, ECS, etc.) with access to both your data sources and DataHub. Like CLI, it connects to sources first, then sends metadata to DataHub.
Key advantage: Only makes outbound connections. You don't need to open inbound firewall ports or configure VPN access for external systems—the executor software runs entirely within your network perimeter.
When to Use What
Most organizations use a mix of all three approaches based on their specific needs. Here are common patterns:
UI Ingestion - Best For:
- Cloud-hosted data sources: Snowflake, BigQuery, Redshift, Tableau, Looker, PowerBI, etc.
- When you want simplicity: Built-in scheduling, no infrastructure to manage, easiest to scale
- Getting started quickly: Minimal setup required
Advantages: Simplest for both scheduling and scale. DataHub handles all infrastructure and orchestration.
CLI Ingestion - Best For:
- Sources requiring local files: dbt projects, custom SQL queries, local transformations
- CI/CD pipelines: Integrate metadata ingestion into your build/deploy process
- Development and testing: Quick iteration and testing of ingestion recipes
- Custom orchestration: When you need fine-grained control over scheduling (Airflow, Cron, etc.)
Scheduling: Requires external scheduler (Airflow, Cron, Kubernetes CronJob, etc.)
Remote Executor - Best For:
- Databases inside your network: On-premise databases, internal data warehouses
- Strict security requirements: When credentials cannot leave your infrastructure — use local secret backends, not DataHub UI Secrets. GMS defaults to
SECRET_SERVICE_CALLER_GUARD_MODE=ENFORCE, which blocks browser and user-PAT access to secret values; embedded executors (DataHub Cloud) can still resolve UI secrets at job time. - Sources behind firewalls: When you cannot or prefer not to configure external access
- Enterprise secret management: When you need integration with existing secret management systems
Advantages: Runs in your infrastructure with your security controls, only requires outbound connectivity, integrates with your existing secret management.
Choosing the Right Approach
The choice often depends on:
- Source location: Cloud-hosted vs on-premise
- Security requirements: Where credentials can be stored
- Network topology: Firewall and connectivity constraints
- Operational preferences: Managed service vs self-hosted
- Scale: How many sources and how frequently you ingest
Note: These are guidelines, not strict rules. The best choice varies by organization and even by individual data source within an organization.