Hex
This connector ingests Hex assets into DataHub.
Concept Mapping
Hex Concept | DataHub Concept | Notes |
---|---|---|
"hex" | Data Platform | |
Workspace | Container | |
Project | Dashboard | Subtype Project |
Component | Dashboard | Subtype Component |
Collection | Tag | |
Other Hex concepts are not mapped to DataHub entities yet.
Limitations
Currently, the Hex API has some limitations that affect the completeness of the extracted metadata:
- Projects and Components Relationship: The API does not support fetching the many-to-many relationship between Projects and their Components.
- Metadata Access: There is no direct method to retrieve metadata for Collections, Status, or Categories. This information is only available indirectly through references within Projects and Components.
Please keep these limitations in mind when working with the Hex connector.
For Dataset - Hex Project lineage, the connector relies on the Hex query metadata feature. To extract lineage information, the setup must therefore include:
- A separate warehouse ingestor (e.g. BigQuery, Snowflake, Redshift, ...) with use_queries_v2 enabled in order to fetch Queries. This ingests the queries into DataHub as Query entities, and the ones triggered by Hex include the corresponding Hex query metadata.
- A DataHub server with version >= SaaS 0.3.10 or > OSS 1.0.0, so that the Query entities are properly indexed by source (Hex in this case) and can be fetched and processed by the Hex ingestor in order to emit the Dataset - Project lineage.
Please note:
- Lineage is only captured for scheduled executions of the Project.
- In cases where queries are handled by hextoolkit, Hex query metadata is not injected, which prevents capturing lineage.
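The lineage setup described above can be sketched as two separate recipes, shown here in one block and separated by `---`. This is a minimal sketch, not a definitive configuration: Snowflake is picked only as an example warehouse, its connection settings are omitted, and the `HEX_TOKEN` environment variable name and the `http://localhost:8080` server URL are placeholder assumptions.

```yaml
# Recipe 1 (warehouse): run this first so DataHub holds the Query entities.
# Snowflake is only an example; connection settings are intentionally omitted.
source:
  type: snowflake
  config:
    # ... account, credentials, and other connection settings go here ...
    use_queries_v2: true
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080   # placeholder DataHub server URL
---
# Recipe 2 (Hex): run afterwards to emit the Dataset - Project lineage.
source:
  type: hex
  config:
    workspace_name: acryl-partnership   # example workspace name from this page
    token: ${HEX_TOKEN}                 # assumed environment variable name
    include_lineage: true               # already the default
    lineage_start_time: "-7 days"       # widens the default 1-day window
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080       # placeholder DataHub server URL
```

Running the warehouse recipe before the Hex recipe matters because the Hex ingestor reads the Query entities back from DataHub rather than from the warehouse.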
Important Capabilities
Capability | Status | Notes |
---|---|---|
Asset Containers | ✅ | Enabled by default |
Descriptions | ✅ | Supported by default |
Detect Deleted Entities | ✅ | Optionally enabled via stateful_ingestion.remove_stale_metadata |
Extract Ownership | ✅ | Supported by default |
Platform Instance | ✅ | Enabled by default |
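As an illustration of the optional capabilities in the table, the fragment below sketches a recipe that turns on stale-entity removal and sets a platform instance. The `pipeline_name` value, the `platform_instance` value, and the `HEX_TOKEN` environment variable are hypothetical names chosen for this example; the sink section is omitted.

```yaml
pipeline_name: hex_ingestion            # hypothetical; stateful ingestion state is tracked per pipeline
source:
  type: hex
  config:
    workspace_name: acryl-partnership   # example workspace name from this page
    token: ${HEX_TOKEN}                 # assumed environment variable name
    platform_instance: analytics        # hypothetical instance name
    stateful_ingestion:
      enabled: true
      remove_stale_metadata: true       # soft-deletes entities missing from the current run
```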
Prerequisites
Workspace name
The workspace name is required to fetch data from Hex. You can find the workspace name in the URL of your Hex home page:
https://app.hex.tech/<workspace_name>
E.g., in https://app.hex.tech/acryl-partnership, acryl-partnership is the workspace name.
Authentication
To authenticate with Hex, you need to provide your Hex API Bearer token. You can obtain a token by following the instructions in the Hex documentation.
Either a PAT (Personal Access Token) or a Workspace Token can be used as the API Bearer token:
- (Recommended) With a Workspace Token, a read-only token is enough for ingestion.
- With a PAT, ingestion runs with that user's permissions.
CLI based Ingestion
Starter Recipe
Check out the following recipe to get started with ingestion! See below for full configuration options.
For general pointers on writing and running a recipe, see our main recipe guide.
source:
  type: hex
  config:
    workspace_name: # Hex workspace name. You can find this name in your Hex home page URL: https://app.hex.tech/<workspace_name>
    token: # Your PAT or Workspace token
sink:
  # sink configs
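A filled-in variant of the starter recipe might look like the sketch below. It assumes the example workspace from the Prerequisites section, reads the token from an environment variable named `HEX_TOKEN` (an assumed name, using the recipe's `${VAR}` substitution), and sends metadata to a DataHub REST endpoint at a placeholder URL.

```yaml
source:
  type: hex
  config:
    workspace_name: acryl-partnership   # taken from https://app.hex.tech/acryl-partnership
    token: ${HEX_TOKEN}                 # assumed environment variable holding a PAT or Workspace token
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080       # placeholder; point this at your DataHub server
```

Saved as, say, `hex_recipe.yaml` (a hypothetical file name), the recipe can be run with `datahub ingest -c hex_recipe.yaml`. Keeping the token in an environment variable avoids committing the secret alongside the recipe.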
Config Details
- Options
- Schema
Note that a . is used to denote nested fields in the YAML recipe.
Field | Description |
---|---|
token ✅ string(password) | Hex API token; either PAT or Workspace token - https://learn.hex.tech/docs/api/api-overview#authentication |
workspace_name ✅ string | Hex workspace name. You can find this name in your Hex home page URL: https://app.hex.tech/<workspace_name> |
base_url string | Hex API base URL. For most Hex users, this will be https://app.hex.tech/api/v1. Single-tenant app users should replace this with the URL they use to access Hex. Default: https://app.hex.tech/api/v1 |
categories_as_tags boolean | Emit Hex Category as tags. Default: True |
collections_as_tags boolean | Emit Hex Collections as tags. Default: True |
datahub_page_size integer | Number of items to fetch per DataHub API call. Default: 100 |
include_components boolean | Include Hex Components in the ingestion. Default: True |
include_lineage boolean | Include Hex lineage, being fetched from DataHub. See "Limitations" section in the docs for more details about the limitations of this feature. Default: True |
lineage_end_time string(date-time) | Latest date of lineage to consider. Default: Current time in UTC. You can specify absolute time like '2023-01-01' or relative time like '-1 day' or '-1d'. |
lineage_start_time string(date-time) | Earliest date of lineage to consider. Default: 1 day before lineage end time. You can specify absolute time like '2023-01-01' or relative time like '-7 days' or '-7d'. |
page_size integer | Number of items to fetch per Hex API call. Default: 100 |
patch_metadata boolean | Emit metadata as patch events. Default: False |
platform_instance string | The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://datahubproject.io/docs/platform-instances/ for more details. |
set_ownership_from_email boolean | Set ownership identity from owner/creator email. Default: True |
status_as_tag boolean | Emit Hex Status as tags. Default: True |
env string | The environment that all assets produced by this connector belong to. Default: PROD |
component_title_pattern AllowDenyPattern | Regex pattern for component titles to filter in ingestion. Default: {'allow': ['.*'], 'deny': [], 'ignoreCase': True} |
component_title_pattern.ignoreCase boolean | Whether to ignore case sensitivity during pattern matching. Default: True |
component_title_pattern.allow array | List of regex patterns to include in ingestion. Default: ['.*'] |
component_title_pattern.allow.string string | |
component_title_pattern.deny array | List of regex patterns to exclude from ingestion. Default: [] |
component_title_pattern.deny.string string | |
project_title_pattern AllowDenyPattern | Regex pattern for project titles to filter in ingestion. Default: {'allow': ['.*'], 'deny': [], 'ignoreCase': True} |
project_title_pattern.ignoreCase boolean | Whether to ignore case sensitivity during pattern matching. Default: True |
project_title_pattern.allow array | List of regex patterns to include in ingestion. Default: ['.*'] |
project_title_pattern.allow.string string | |
project_title_pattern.deny array | List of regex patterns to exclude from ingestion. Default: [] |
project_title_pattern.deny.string string | |
stateful_ingestion StatefulStaleMetadataRemovalConfig | Configuration for stateful ingestion and stale metadata removal. |
stateful_ingestion.enabled boolean | Whether or not to enable stateful ingestion. Default: True if a pipeline_name is set and either a datahub-rest sink or datahub_api is specified, otherwise False. |
stateful_ingestion.fail_safe_threshold number | Prevents a large number of soft deletes, and prevents the state from being committed, when accidental changes to the source configuration cause the relative change percentage in entities compared to the previous state to exceed 'fail_safe_threshold'. Default: 75.0 |
stateful_ingestion.remove_stale_metadata boolean | Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled. Default: True |
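To make the dot notation concrete, the fragment below sketches how fields such as project_title_pattern.allow are written as nested keys, combined with a few of the boolean options from the table. The regex values and the `HEX_TOKEN` environment variable name are hypothetical, chosen only for illustration; the sink section is omitted.

```yaml
source:
  type: hex
  config:
    workspace_name: acryl-partnership   # example workspace name from this page
    token: ${HEX_TOKEN}                 # assumed environment variable name
    # "project_title_pattern.allow" in the table becomes a nested key:
    project_title_pattern:
      allow:
        - "Finance.*"                   # hypothetical: only titles starting with "Finance"
      deny:
        - ".*scratch.*"                 # hypothetical: skip titles containing "scratch"
      ignoreCase: true                  # already the default
    include_components: false           # skip Hex Components entirely
    collections_as_tags: true           # already the default
    status_as_tag: false
    categories_as_tags: false
    page_size: 50                       # items per Hex API call (default 100)
```

Options left out of the recipe keep the defaults listed in the table.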
The JSONSchema for this configuration is inlined below.
{
"title": "HexSourceConfig",
"description": "Base configuration class for stateful ingestion for source configs to inherit from.",
"type": "object",
"properties": {
"env": {
"title": "Env",
"description": "The environment that all assets produced by this connector belong to",
"default": "PROD",
"type": "string"
},
"platform_instance": {
"title": "Platform Instance",
"description": "The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://datahubproject.io/docs/platform-instances/ for more details.",
"type": "string"
},
"stateful_ingestion": {
"title": "Stateful Ingestion",
"description": "Configuration for stateful ingestion and stale metadata removal.",
"allOf": [
{
"$ref": "#/definitions/StatefulStaleMetadataRemovalConfig"
}
]
},
"workspace_name": {
"title": "Workspace Name",
"description": "Hex workspace name. You can find this name in your Hex home page URL: https://app.hex.tech/<workspace_name>",
"type": "string"
},
"token": {
"title": "Token",
"description": "Hex API token; either PAT or Workspace token - https://learn.hex.tech/docs/api/api-overview#authentication",
"type": "string",
"writeOnly": true,
"format": "password"
},
"base_url": {
"title": "Base Url",
"description": "Hex API base URL. For most Hex users, this will be https://app.hex.tech/api/v1. Single-tenant app users should replace this with the URL they use to access Hex.",
"default": "https://app.hex.tech/api/v1",
"type": "string"
},
"include_components": {
"title": "Include Components",
"default": true,
"description": "Include Hex Components in the ingestion",
"type": "boolean"
},
"page_size": {
"title": "Page Size",
"description": "Number of items to fetch per Hex API call.",
"default": 100,
"type": "integer"
},
"patch_metadata": {
"title": "Patch Metadata",
"description": "Emit metadata as patch events",
"default": false,
"type": "boolean"
},
"collections_as_tags": {
"title": "Collections As Tags",
"description": "Emit Hex Collections as tags",
"default": true,
"type": "boolean"
},
"status_as_tag": {
"title": "Status As Tag",
"description": "Emit Hex Status as tags",
"default": true,
"type": "boolean"
},
"categories_as_tags": {
"title": "Categories As Tags",
"description": "Emit Hex Category as tags",
"default": true,
"type": "boolean"
},
"project_title_pattern": {
"title": "Project Title Pattern",
"description": "Regex pattern for project titles to filter in ingestion.",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"allOf": [
{
"$ref": "#/definitions/AllowDenyPattern"
}
]
},
"component_title_pattern": {
"title": "Component Title Pattern",
"description": "Regex pattern for component titles to filter in ingestion.",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"allOf": [
{
"$ref": "#/definitions/AllowDenyPattern"
}
]
},
"set_ownership_from_email": {
"title": "Set Ownership From Email",
"description": "Set ownership identity from owner/creator email",
"default": true,
"type": "boolean"
},
"include_lineage": {
"title": "Include Lineage",
"description": "Include Hex lineage, being fetched from DataHub. See \"Limitations\" section in the docs for more details about the limitations of this feature.",
"default": true,
"type": "boolean"
},
"lineage_start_time": {
"title": "Lineage Start Time",
"description": "Earliest date of lineage to consider. Default: 1 day before lineage end time. You can specify absolute time like '2023-01-01' or relative time like '-7 days' or '-7d'.",
"type": "string",
"format": "date-time"
},
"lineage_end_time": {
"title": "Lineage End Time",
"description": "Latest date of lineage to consider. Default: Current time in UTC. You can specify absolute time like '2023-01-01' or relative time like '-1 day' or '-1d'.",
"type": "string",
"format": "date-time"
},
"datahub_page_size": {
"title": "Datahub Page Size",
"description": "Number of items to fetch per DataHub API call.",
"default": 100,
"type": "integer"
}
},
"required": [
"workspace_name",
"token"
],
"additionalProperties": false,
"definitions": {
"DynamicTypedStateProviderConfig": {
"title": "DynamicTypedStateProviderConfig",
"type": "object",
"properties": {
"type": {
"title": "Type",
"description": "The type of the state provider to use. For DataHub use `datahub`",
"type": "string"
},
"config": {
"title": "Config",
"description": "The configuration required for initializing the state provider. Default: The datahub_api config if set at pipeline level. Otherwise, the default DatahubClientConfig. See the defaults (https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/graph/client.py#L19).",
"default": {},
"type": "object"
}
},
"required": [
"type"
],
"additionalProperties": false
},
"StatefulStaleMetadataRemovalConfig": {
"title": "StatefulStaleMetadataRemovalConfig",
"description": "Base specialized config for Stateful Ingestion with stale metadata removal capability.",
"type": "object",
"properties": {
"enabled": {
"title": "Enabled",
"description": "Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or `datahub_api` is specified, otherwise False",
"default": false,
"type": "boolean"
},
"remove_stale_metadata": {
"title": "Remove Stale Metadata",
"description": "Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.",
"default": true,
"type": "boolean"
},
"fail_safe_threshold": {
"title": "Fail Safe Threshold",
"description": "Prevents large amount of soft deletes & the state from committing from accidental changes to the source configuration if the relative change percent in entities compared to the previous state is above the 'fail_safe_threshold'.",
"default": 75.0,
"minimum": 0.0,
"maximum": 100.0,
"type": "number"
}
},
"additionalProperties": false
},
"AllowDenyPattern": {
"title": "AllowDenyPattern",
"description": "A class to store allow deny regexes",
"type": "object",
"properties": {
"allow": {
"title": "Allow",
"description": "List of regex patterns to include in ingestion",
"default": [
".*"
],
"type": "array",
"items": {
"type": "string"
}
},
"deny": {
"title": "Deny",
"description": "List of regex patterns to exclude from ingestion.",
"default": [],
"type": "array",
"items": {
"type": "string"
}
},
"ignoreCase": {
"title": "Ignorecase",
"description": "Whether to ignore case sensitivity during pattern matching.",
"default": true,
"type": "boolean"
}
},
"additionalProperties": false
}
}
}
Code Coordinates
- Class Name: datahub.ingestion.source.hex.hex.HexSource
- Browse on GitHub
Questions
If you've got any questions on configuring ingestion for Hex, feel free to ping us on our Slack.