
Elasticsearch

Overview

Elasticsearch is a metadata ingestion source for DataHub.

The DataHub integration for Elasticsearch covers the metadata entities and operational objects relevant to this connector. Depending on module capabilities, it can also capture features such as profiling, platform instances, and stateful deletion detection.

Concept Mapping

A connector-specific concept mapping is still pending; the table below shows the generic mapping of source concepts to DataHub concepts.

| Source Concept | DataHub Concept | Notes |
|---|---|---|
| Platform/account/project scope | Platform Instance, Container | Organizes assets within the platform context. |
| Core technical asset (for example, a table/view/topic/file) | Dataset | Primary ingested technical asset. |
| Schema fields / columns | SchemaField | Included when schema extraction is supported. |
| Ownership and collaboration principals | CorpUser, CorpGroup | Emitted by modules that support ownership and identity metadata. |
| Dependencies and processing relationships | Lineage edges | Available when lineage extraction is supported and enabled. |

Module elasticsearch

Certified

Important Capabilities

| Capability | Notes |
|---|---|
| Detect Deleted Entities | Enabled by default via stateful ingestion. |
| Platform Instance | Enabled by default. |

Overview

The elasticsearch module ingests metadata from Elasticsearch into DataHub. It is intended for production ingestion workflows; module-specific capabilities are documented below.

This plugin extracts the following:

  • Metadata for indexes
  • Column types associated with each index field

Prerequisites

Before running ingestion, ensure network connectivity to the source, valid authentication credentials, and read permissions for metadata APIs required by this module.

Install the Plugin

```shell
pip install 'acryl-datahub[elasticsearch]'
```

Starter Recipe

Check out the following recipe to get started with ingestion! See below for full configuration options.

For general pointers on writing and running a recipe, see our main recipe guide.

```yaml
source:
  type: "elasticsearch"
  config:
    # Coordinates
    host: 'localhost:9200'

    # Credentials
    username: user # optional
    password: pass # optional

    # SSL support
    use_ssl: False
    verify_certs: False
    ca_certs: "./path/ca.cert"
    client_cert: "./path/client.cert"
    client_key: "./path/client.key"
    ssl_assert_hostname: False
    ssl_assert_fingerprint: "./path/cert.fingerprint"

    # Options
    url_prefix: "" # optional url_prefix
    env: "PROD"
    index_pattern:
      allow: [".*some_index_name_pattern*"]
      deny: [".*skip_index_name_pattern*"]
    ingest_index_templates: False
    index_template_pattern:
      allow: [".*some_index_template_name_pattern*"]

sink:
  # sink configs
```

Config Details

Note that a . is used to denote nested fields in the YAML recipe.

Each entry below lists a config field with its type, description, and default.
api_key
One of object, string, null
API Key authentication. Accepts either a list with id and api_key (UTF-8 representation), or a base64 encoded string of id and api_key combined by ':'.
Default: None
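As described above, api_key may be supplied as a base64 encoding of id and api_key joined by ':'. A minimal sketch of producing that string (the id and secret values here are placeholders, not real credentials):

```python
import base64


def encode_api_key(key_id: str, api_key: str) -> str:
    """Join the API key id and secret with ':' and base64-encode the result,
    which is the single-string form accepted by the api_key config field."""
    raw = f"{key_id}:{api_key}".encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


# Placeholder credentials for illustration only.
token = encode_api_key("my_key_id", "my_key_secret")
```

Decoding the resulting token recovers the original `id:api_key` pair, which is how the server interprets it.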
ca_certs
One of string, null
Path to a certificate authority (CA) certificate.
Default: None
client_cert
One of string, null
Path to the file containing the private key and the certificate, or cert only if using client_key.
Default: None
client_key
One of string, null
Path to the file containing the private key if using separate cert and key files.
Default: None
host
string
The Elasticsearch host URI.
Default: localhost:9200
ingest_index_templates
boolean
Ingests ES index templates if enabled.
Default: False
password
One of string(password), null
The password credential.
Default: None
platform_instance
One of string, null
The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details.
Default: None
ssl_assert_fingerprint
One of string, null
Verify the supplied certificate fingerprint if not None.
Default: None
ssl_assert_hostname
boolean
Use hostname verification if not False.
Default: False
url_prefix
string
Some enterprises run multiple Elasticsearch clusters behind a single endpoint; in that case, url_prefix routes requests to the appropriate cluster.
Default: "" (empty string)
use_ssl
boolean
Whether to use SSL for the connection or not.
Default: False
username
One of string, null
The username credential.
Default: None
verify_certs
boolean
Whether to verify SSL certificates.
Default: False
env
string
The environment that all assets produced by this connector belong to.
Default: PROD
collapse_urns
CollapseUrns
collapse_urns.urns_suffix_regex
array
List of regex patterns to remove from the end of the URN name. All indices that map to the same name after suffix removal are treated as the same dataset. The patterns are applied in order to each URN.
Use multiple patterns when the suffixes you want to remove come in different formats: for example, names ending in -YYYY-MM-DD as well as names ending in -epochtime require two regex patterns to cover all URNs.
collapse_urns.urns_suffix_regex.string
string
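The two-pattern case described above (date-stamped and epoch-stamped indices collapsing into one dataset) can be sketched as follows. The regexes and index names are illustrative, not the connector's defaults:

```python
import re

# Illustrative suffix patterns, applied in order to each name.
URN_SUFFIX_PATTERNS = [
    r"-\d{4}-\d{2}-\d{2}$",  # names ending in -YYYY-MM-DD
    r"-\d{10}$",             # names ending in a 10-digit epoch timestamp
]


def collapse_index_name(name: str) -> str:
    """Strip any matching suffix so time-partitioned indices share a name."""
    for pattern in URN_SUFFIX_PATTERNS:
        name = re.sub(pattern, "", name)
    return name


collapse_index_name("logs-2024-01-31")   # -> "logs"
collapse_index_name("logs-1706659200")   # -> "logs"
```

Both time-partitioned indices collapse to the same dataset name, which is the effect collapse_urns is meant to achieve.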
index_pattern
AllowDenyPattern
Allow/deny regex patterns for filtering indices.
index_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
index_template_pattern
AllowDenyPattern
Allow/deny regex patterns for filtering index templates.
index_template_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
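A sketch of how an allow/deny pattern with ignoreCase behaves: deny patterns are checked first, then allow patterns, with case-insensitive matching by default. This is illustrative logic, not the AllowDenyPattern implementation itself:

```python
import re


def is_allowed(name, allow, deny, ignore_case=True):
    """Return True if `name` matches an allow pattern and no deny pattern."""
    flags = re.IGNORECASE if ignore_case else 0
    if any(re.match(p, name, flags) for p in deny):
        return False
    return any(re.match(p, name, flags) for p in allow)


# With ignore_case=True, "Orders_2024" matches the lowercase allow pattern.
is_allowed("Orders_2024", allow=[r".*orders.*"], deny=[r".*tmp.*"])  # -> True
```

Setting ignore_case=False would make the same call return False, since the pattern is lowercase and the index name is not.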
profiling
ElasticProfiling
profiling.enabled
boolean
Whether to enable profiling for the Elasticsearch source.
Default: False
profiling.operation_config
OperationConfig
profiling.operation_config.lower_freq_profile_enabled
boolean
Whether to profile at a lower frequency. This does not do any scheduling; it only adds checks that decide when not to run profiling.
Default: False
profiling.operation_config.profile_date_of_month
One of integer, null
Number between 1 and 31 (both inclusive) for the date of the month. If not specified, this field has no effect.
Default: None
profiling.operation_config.profile_day_of_week
One of integer, null
Number between 0 and 6 (both inclusive) for the day of the week, where 0 is Monday and 6 is Sunday. If not specified, this field has no effect.
Default: None
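The day-of-week and date-of-month checks described above can be sketched like this. The function name and structure are assumptions for illustration; the connector's actual check may differ:

```python
import datetime


def should_profile_today(
    profile_day_of_week=None,    # 0=Monday .. 6=Sunday, or None
    profile_date_of_month=None,  # 1..31, or None
    today=None,
):
    """Return False when a lower-frequency constraint rules today out."""
    today = today or datetime.date.today()
    if profile_day_of_week is not None and today.weekday() != profile_day_of_week:
        return False
    if profile_date_of_month is not None and today.day != profile_date_of_month:
        return False
    return True


# Monday-only profiling: 2024-01-01 fell on a Monday, so this passes.
should_profile_today(profile_day_of_week=0, today=datetime.date(2024, 1, 1))  # -> True
```

When both fields are None, profiling is never suppressed by these checks, matching the defaults above.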
stateful_ingestion
One of StatefulIngestionConfig, null
Stateful Ingestion Config
Default: None
stateful_ingestion.enabled
boolean
Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or datahub_api is specified, otherwise False
Default: False

Capabilities

Use the Important Capabilities table above as the source of truth for supported features and whether additional configuration is required.

Limitations

Module behavior is constrained by source APIs, permissions, and metadata exposed by the platform. Refer to capability notes for unsupported or conditional features.

Troubleshooting

If ingestion fails, validate credentials, permissions, connectivity, and scope filters first. Then review ingestion logs for source-specific errors and adjust configuration accordingly.

Code Coordinates

  • Class Name: datahub.ingestion.source.elastic_search.ElasticsearchSource
  • Browse on GitHub
Questions?

If you've got any questions on configuring ingestion for Elasticsearch, feel free to ping us on our Slack.

💡 Contributing to this documentation

This page is auto-generated from the underlying source code. To make changes, please edit the relevant source files in the metadata-ingestion directory.

Tip: For quick typo fixes or documentation updates, you can click the ✏️ Edit icon directly in the GitHub UI to open a Pull Request. For larger changes and PR naming conventions, please refer to our Contributing Guide.