Matillion DPC

Overview

Matillion Data Productivity Cloud (DPC) is a cloud-native data integration platform for building, orchestrating, and monitoring data pipelines. Learn more in the official Matillion documentation.

The DataHub integration for Matillion DPC ingests pipelines, streaming pipelines, projects, and environments as DataHub entities. It captures table- and column-level lineage via the Matillion OpenLineage API, pipeline execution history as operational metadata, and child pipeline dependency relationships for end-to-end orchestration visibility.

Concept Mapping

| Source Concept | DataHub Concept | Notes |
|---|---|---|
| Project | Container | Top-level grouping of pipelines within a Matillion account. |
| Environment | Container | Deployment environment within a project (e.g. Production, Staging). |
| Pipeline | DataFlow | An orchestration pipeline that transforms or moves data. |
| Pipeline Component / Step | DataJob | An individual step within a pipeline. |
| Streaming Pipeline | DataFlow | A CDC or streaming pipeline, emitted with `pipeline_type=streaming`. |
| Pipeline Execution | DataProcessInstance | A single run of a pipeline, including status and timing. |
| OpenLineage table reference | Dataset | An upstream or downstream dataset referenced via OpenLineage events. |
| Table/column lineage edge | Lineage edge | Extracted from OpenLineage events; column-level lineage via SQL parsing. |

Module matillion-dpc

Incubating

Important Capabilities

| Capability | Status | Notes |
|---|---|---|
| Column-level Lineage | ✅ | Enabled by default; can be disabled via the `parse_sql_for_lineage` configuration. |
| Detect Deleted Entities | ✅ | Enabled via stateful ingestion. |
| Platform Instance | ✅ | Enabled by default. |
| Table-Level Lineage | ✅ | Enabled by default via OpenLineage data from pipeline executions. |

Overview

The matillion-dpc module ingests metadata from Matillion Data Productivity Cloud (DPC) into DataHub. It extracts pipelines, streaming pipelines, projects, environments, execution history, and table and column-level lineage via the Matillion OpenLineage API.

Prerequisites

Obtain API Credentials

The connector uses OAuth2 client credentials and automatically handles token generation and refresh.

  1. Log into Matillion Data Productivity Cloud as a Super Admin
  2. Navigate to Profile & Account → API credentials
  3. Click Set an API Credential
  4. Provide a descriptive name (e.g., "DataHub Integration")
  5. Assign an Account Role with read permissions to required APIs
  6. Click Save and immediately copy the Client Secret (not shown again)
  7. Note the Client ID (remains visible)

For detailed instructions, see Matillion API Authentication.
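
If your deployment is reachable only through a VPC endpoint or another non-default address, the `custom_base_url` and `custom_oauth_token_url` options (documented under Config Details below) override the regional defaults. A minimal sketch, using placeholder internal URLs:

```yaml
source:
  type: matillion-dpc
  config:
    api_config:
      client_id: "${MATILLION_CLIENT_ID}"
      client_secret: "${MATILLION_CLIENT_SECRET}"
      region: "US1"
      # Placeholder endpoints -- substitute your own private URLs
      custom_base_url: "https://matillion.internal.example.com/api"
      custom_oauth_token_url: "https://matillion.internal.example.com/oauth/token"
```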

Required Permissions

The API credentials must have an Account Role with Read permissions to:

  • Projects (/v1/projects)
  • Environments (/v1/environments)
  • Pipelines (/v1/pipelines)
  • Schedules (/v1/schedules)
  • Lineage Events (/v1/lineage/events)
  • Pipeline Executions (/v1/pipeline-executions) - optional
  • Streaming Pipelines (/v1/streaming-pipelines) - optional

If using an account role other than Super Admin, grant project- and environment-level roles as needed.

See Matillion RBAC documentation for details.

Lineage and Dependencies

The connector automatically extracts:

  1. Table and Column-Level Lineage - From OpenLineage Events API (/v1/lineage/events) (docs)
  2. Operational Metadata - Pipeline execution history from Pipeline Executions API (/v1/pipeline-executions) emitted as DataProcessInstance entities (docs)
  3. Child Pipeline Dependencies - Automatically tracks when pipelines call other pipelines, creating step-to-step dependency relationships for comprehensive pipeline orchestration visibility
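
Execution-history ingestion (item 2 above) is governed by `max_executions_per_pipeline`; a small sketch showing how to cap it, or to disable execution ingestion entirely:

```yaml
source:
  type: matillion-dpc
  config:
    # api_config omitted; see the starter recipe below
    max_executions_per_pipeline: 5   # ingest only the 5 most recent runs per pipeline
    # max_executions_per_pipeline: 0 # disables execution ingestion entirely
```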

OpenLineage Namespace Mapping (Optional)

Optional: Map OpenLineage namespace URIs to DataHub platform instances for lineage connections. If not configured, the connector extracts the platform type from the URI (e.g., postgresql://... → postgres) and uses the default environment (PROD).

When to use: Configure this when you need lineage to connect to existing datasets with platform instances.

Example namespaces: postgresql://host:5432, snowflake://account.snowflakecomputing.com, bigquery://project

```yaml
namespace_to_platform_instance:
  "postgresql://prod-db.us-east-1.rds.amazonaws.com:5432":
    platform_instance: postgres_prod
    env: PROD
    database: analytics
    schema: public

  "snowflake://prod-account.snowflakecomputing.com":
    platform_instance: snowflake_prod
    env: PROD
    convert_urns_to_lowercase: true
```

Platform instances must match those used when ingesting the source data platforms.
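
For example, if your Postgres tables were ingested with `platform_instance: postgres_prod`, the mapping above produces dataset URNs that line up with them. A hypothetical sketch of the matching Postgres source recipe:

```yaml
source:
  type: postgres
  config:
    host_port: "prod-db.us-east-1.rds.amazonaws.com:5432"
    database: analytics
    platform_instance: postgres_prod  # must match the namespace mapping above
    env: PROD
```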

Install the Plugin

```shell
pip install 'acryl-datahub[matillion-dpc]'
```

Starter Recipe

Check out the following recipe to get started with ingestion! See below for full configuration options.

For general pointers on writing and running a recipe, see our main recipe guide.

```yaml
source:
  type: matillion-dpc
  config:
    api_config:
      client_id: "${MATILLION_CLIENT_ID}"
      client_secret: "${MATILLION_CLIENT_SECRET}"
      region: "EU1" # EU1 or US1

    env: "PROD"

    # Optional: Map OpenLineage namespaces to DataHub platform instances
    # Required if existing datasets use platform instances
    namespace_to_platform_instance:
      "postgresql://prod-db.us-east-1.rds.amazonaws.com:5432":
        platform_instance: postgres_prod
        env: PROD
        database: analytics
        schema: public

      "snowflake://prod-account.snowflakecomputing.com":
        platform_instance: snowflake_prod
        env: PROD
        convert_urns_to_lowercase: true

      "bigquery://my-gcp-project":
        platform_instance: bigquery_prod
        env: PROD

    include_streaming_pipelines: true
    include_unpublished_pipelines: true
    max_executions_per_pipeline: 10
    extract_projects_to_containers: true

    # Optional: Filter projects, environments, pipelines using regex patterns
    # project_patterns:
    #   allow: ["^prod-.*", "^staging-.*"]
    #   deny: [".*-deprecated$", ".*-archived$"]

    # environment_patterns:
    #   allow: ["^production$", "^staging$"]

    # pipeline_patterns:
    #   deny: ["^test_.*", ".*_backup$"]

    # streaming_pipeline_patterns:
    #   allow: ["^cdc_.*"]

    stateful_ingestion:
      enabled: true

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
```

Config Details

Note that a . is used to denote nested fields in the YAML recipe.

Field | Description
api_config 
MatillionAPIConfig
api_config.client_id 
string(password)
Matillion API Client ID for OAuth2 authentication.
api_config.client_secret 
string(password)
Matillion API Client Secret for OAuth2 authentication.
api_config.custom_base_url
One of string, null
Custom API base URL for VPC endpoints or on-premise installations.
Default: None
api_config.custom_oauth_token_url
One of string, null
Custom OAuth2 token endpoint URL for VPC endpoints or on-premise installations.
Default: None
api_config.region
Enum
One of: "EU1", "US1"
api_config.request_timeout_sec
integer
Request timeout in seconds
Default: 30
bucket_duration
Enum
One of: "DAY", "HOUR"
end_time
string(date-time)
Latest date of lineage/usage to consider. Default: Current time in UTC
extract_projects_to_containers
boolean
Whether to extract Matillion projects as DataHub containers. When enabled, pipelines are organized under project containers, providing hierarchical navigation.
Default: True
include_streaming_pipelines
boolean
Whether to ingest Matillion streaming pipelines (CDC pipelines). Streaming pipelines are emitted as separate DataFlows with pipeline_type='streaming'.
Default: True
include_unpublished_pipelines
boolean
Whether to discover and ingest unpublished pipelines from recent execution history. When enabled, the connector will discover pipelines that have been executed but not yet published. Disable this to only ingest published pipelines from the published-pipelines API.
Default: True
lineage_platform_mapping
One of string, null
Override platform name mappings from OpenLineage namespaces to DataHub platforms. Only needed for non-standard platforms. See documentation for list of pre-mapped platforms. Example: {"customdb": "postgres", "mywarehouse": "snowflake"}
Default: None
max_executions_per_pipeline
integer
Maximum number of recent pipeline executions to ingest per pipeline. Set to 0 to disable execution ingestion.
Default: 10
parse_sql_for_lineage
boolean
Whether to parse SQL from OpenLineage events to extract additional column-level lineage. Requires DataHub graph access. When enabled, SQL queries are parsed to infer lineage beyond what's explicitly provided in OpenLineage column mappings.
Default: True
platform_instance
One of string, null
The instance of the platform that all assets produced by this recipe belong to
Default: None
start_time
string(date-time)
Earliest date of lineage/usage to consider. Default: Last full day in UTC (or hour, depending on bucket_duration). You can also specify relative time with respect to end_time, such as '-7 days' or '-7d'.
Default: None
env
string
The environment that all assets produced by this ingestion source belong to
Default: PROD
environment_patterns
AllowDenyPattern
Allow/deny regex patterns for filtering environments.
environment_patterns.ignoreCase
One of boolean, null
Whether to ignore case during pattern matching.
Default: True
environment_patterns.allow
array
List of regex patterns to include in ingestion
Default: ['.*']
environment_patterns.allow.string
string
environment_patterns.deny
array
List of regex patterns to exclude from ingestion.
Default: []
environment_patterns.deny.string
string
namespace_to_platform_instance
One of NamespacePlatformMapping, null
Maps OpenLineage namespace prefixes to platform instance/environment using longest prefix matching. Unmapped namespaces extract platform from URI with defaults (env=PROD). Example: {"snowflake://prod-account": {"platform_instance": "snowflake_prod", "env": "PROD"}}
Default: None
namespace_to_platform_instance.key.platform_instance
One of string, null
DataHub platform instance to use for datasets from this namespace
Default: None
namespace_to_platform_instance.key.convert_urns_to_lowercase
boolean
Whether to convert dataset URNs to lowercase for this namespace.
Default: False
namespace_to_platform_instance.key.database
One of string, null
Default database name to prepend if dataset name doesn't include database context
Default: None
namespace_to_platform_instance.key.schema
One of string, null
Default schema name to prepend if dataset name doesn't include schema context
Default: None
namespace_to_platform_instance.key.env
string
Environment (PROD, DEV, etc.) to use for datasets from this namespace
Default: PROD
pipeline_patterns
AllowDenyPattern
Allow/deny regex patterns for filtering pipelines.
pipeline_patterns.ignoreCase
One of boolean, null
Whether to ignore case during pattern matching.
Default: True
pipeline_patterns.allow
array
List of regex patterns to include in ingestion
Default: ['.*']
pipeline_patterns.allow.string
string
pipeline_patterns.deny
array
List of regex patterns to exclude from ingestion.
Default: []
pipeline_patterns.deny.string
string
project_patterns
AllowDenyPattern
Allow/deny regex patterns for filtering projects.
project_patterns.ignoreCase
One of boolean, null
Whether to ignore case during pattern matching.
Default: True
project_patterns.allow
array
List of regex patterns to include in ingestion
Default: ['.*']
project_patterns.allow.string
string
project_patterns.deny
array
List of regex patterns to exclude from ingestion.
Default: []
project_patterns.deny.string
string
streaming_pipeline_patterns
AllowDenyPattern
Allow/deny regex patterns for filtering streaming pipelines.
streaming_pipeline_patterns.ignoreCase
One of boolean, null
Whether to ignore case during pattern matching.
Default: True
streaming_pipeline_patterns.allow
array
List of regex patterns to include in ingestion
Default: ['.*']
streaming_pipeline_patterns.allow.string
string
streaming_pipeline_patterns.deny
array
List of regex patterns to exclude from ingestion.
Default: []
streaming_pipeline_patterns.deny.string
string
stateful_ingestion
One of StatefulStaleMetadataRemovalConfig, null
Stateful ingestion configuration.
Default: None
stateful_ingestion.enabled
boolean
Whether or not to enable stateful ingestion. Default: True if a pipeline_name is set and either a datahub-rest sink or datahub_api is specified, otherwise False
Default: False
stateful_ingestion.fail_safe_threshold
number
Prevents a large number of accidental soft deletes: if the relative change in entities compared to the previous state exceeds the fail_safe_threshold percentage, the state is not committed, guarding against accidental changes to the source configuration.
Default: 75.0
stateful_ingestion.remove_stale_metadata
boolean
Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.
Default: True

Capabilities

OpenLineage Namespace Mapping

Optional configuration to map OpenLineage namespace URIs to DataHub platform information. Without it, the connector extracts the platform type from the URI and uses the default environment.

Fields:

  • platform_instance: Platform instance identifier (must match source ingestion)
  • database / schema: Defaults for incomplete dataset names from OpenLineage
    • 3-tier platforms (Snowflake, Postgres, Redshift): database.schema.table
    • 2-tier platforms (MySQL, Hive): schema.table
  • convert_urns_to_lowercase: Normalize URNs to lowercase (use true for Snowflake)
  • env: Environment tag (PROD, DEV, etc.)

Fallback behavior: Unmapped namespaces extract the platform type from the URI (e.g., postgresql://... → postgres) without a platform instance assignment.
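
As an illustration, assuming a hypothetical table `analytics.public.orders` reported under an unmapped `postgresql://` namespace, the fallback and mapped URNs would differ roughly as follows:

```yaml
# Unmapped namespace "postgresql://prod-db:5432" (hypothetical) resolves to:
#   urn:li:dataset:(urn:li:dataPlatform:postgres,analytics.public.orders,PROD)
# With platform_instance: postgres_prod mapped, the instance is prefixed:
#   urn:li:dataset:(urn:li:dataPlatform:postgres,postgres_prod.analytics.public.orders,PROD)
```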

SQL Parsing for Column-Level Lineage

Enable parse_sql_for_lineage: true to parse SQL queries from OpenLineage events for additional column-level lineage.

Requirements:

  • DataHub graph connection configured
  • Schema information in OpenLineage events
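
A minimal sketch, assuming the `datahub-rest` sink supplies the graph connection the parser needs:

```yaml
source:
  type: matillion-dpc
  config:
    api_config:
      client_id: "${MATILLION_CLIENT_ID}"
      client_secret: "${MATILLION_CLIENT_SECRET}"
      region: "EU1"
    parse_sql_for_lineage: true  # default; set to false to rely only on explicit OpenLineage column mappings

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
```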

Platform-Specific Handling

Snowflake: Use convert_urns_to_lowercase: true in namespace mapping

BigQuery: 3-tier naming (project.dataset.table). Set database: project-id, schema: dataset-name

MySQL / 2-tier: 2-tier naming (schema.table). Set schema only

Postgres / Redshift: 3-tier naming (database.schema.table). Set both database and schema
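
For instance, a BigQuery namespace mapping might look like the following (the project and dataset names are hypothetical):

```yaml
namespace_to_platform_instance:
  "bigquery://my-gcp-project":
    platform_instance: bigquery_prod
    env: PROD
    database: my-gcp-project   # the BigQuery project fills the database tier
    schema: analytics_dataset  # the BigQuery dataset fills the schema tier
```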

Filtering Options

The connector supports flexible regex-based filtering to control what metadata is ingested.

Project Filtering

```yaml
project_patterns:
  allow: ["^prod-.*", "^staging-.*"]
  deny: [".*-deprecated$"]
```

Environment Filtering

```yaml
environment_patterns:
  allow: ["^production$", "^staging$"]
  deny: ["^sandbox.*"]
```

Pipeline Filtering

```yaml
pipeline_patterns:
  allow: [".*"]
  deny: ["^test_.*", ".*_backup$"]
```

Streaming Pipeline Filtering

```yaml
streaming_pipeline_patterns:
  allow: ["^cdc_.*"]
  deny: [".*_test$"]
```

All patterns are case-insensitive by default and support full regex syntax. Deny patterns take precedence over allow patterns.

Child Pipeline Dependencies

The connector automatically detects and tracks when pipelines call other pipelines (via "Run Pipeline" components). This creates step-level dependency relationships in DataHub, showing:

  • Which pipeline steps trigger child pipelines
  • Complete execution lineage across pipeline orchestrations
  • Cross-pipeline data flow for comprehensive impact analysis

No configuration needed — this feature is automatic when execution history is ingested.

Published vs Unpublished Pipelines

The connector can discover pipelines from two sources:

  1. Published Pipelines — Pipelines explicitly published in Matillion DPC (fetched from /published-pipelines API)
  2. Unpublished Pipelines — Pipelines discovered from recent execution history (fetched from /pipeline-executions API)

By default, both types are ingested. To only ingest published pipelines:

```yaml
include_unpublished_pipelines: false
```

This is useful when:

  • You want to control what appears in DataHub via Matillion's publish workflow
  • You have many development/test pipelines that run but shouldn't be documented
  • You want to reduce ingestion time and API calls

Limitations

  • SQL parsing for column-level lineage requires a DataHub graph connection and schema information in OpenLineage events. Unsupported SQL dialects or complex queries are skipped with a warning.
  • Column-level lineage is only available when Matillion pipelines emit SQL via OpenLineage; transformations without SQL output will have coarse-grained lineage only.

Troubleshooting

Lineage Not Showing Up

  1. Verify namespace mapping matches source ingestion platform instances
  2. Check logs for `Processing OpenLineage event` messages
  3. Confirm dataset names in OpenLineage match actual tables

Column-Level Lineage Missing

Enable parse_sql_for_lineage: true (requires DataHub graph connection).

Execution History Not Appearing

  1. Adjust start_time to query further back in time if needed
  2. Verify API permissions for Pipeline Executions API
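
A sketch widening the lookback window with a relative `start_time` (relative-time syntax per Config Details above):

```yaml
source:
  type: matillion-dpc
  config:
    # ...
    start_time: "-30 days"  # look back 30 days instead of the default window
```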

Performance Issues

  1. Reduce time window by adjusting start_time (e.g., only last 7 days instead of 30)
  2. Use filtering patterns to reduce scope:
    • project_patterns to filter projects
    • environment_patterns to filter environments
    • pipeline_patterns to filter pipelines
    • streaming_pipeline_patterns to filter streaming pipelines
  3. Disable include_streaming_pipelines if not needed
  4. Increase api_config.request_timeout_sec if needed
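
Putting several of those levers together, a hedged tuning sketch (the project pattern is hypothetical):

```yaml
source:
  type: matillion-dpc
  config:
    api_config:
      client_id: "${MATILLION_CLIENT_ID}"
      client_secret: "${MATILLION_CLIENT_SECRET}"
      region: "EU1"
      request_timeout_sec: 60          # raise the timeout for slow endpoints
    start_time: "-7 days"              # narrow the lineage/execution window
    include_streaming_pipelines: false # skip CDC pipelines if not needed
    max_executions_per_pipeline: 3     # fewer runs per pipeline
    project_patterns:
      allow: ["^prod-.*"]              # limit ingestion to production projects
```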

Code Coordinates

  • Class Name: datahub.ingestion.source.matillion_dpc.matillion.MatillionSource
  • Browse on GitHub

Questions?

If you've got any questions on configuring ingestion for Matillion DPC, feel free to ping us on our Slack.

💡 Contributing to this documentation

This page is auto-generated from the underlying source code. To make changes, please edit the relevant source files in the metadata-ingestion directory.

Tip: For quick typo fixes or documentation updates, you can click the ✏️ Edit icon directly in the GitHub UI to open a Pull Request. For larger changes and PR naming conventions, please refer to our Contributing Guide.