Teradata
Overview
Teradata is a DataHub utility or metadata-focused integration. Learn more in the official Teradata documentation.
The DataHub integration for Teradata covers metadata entities and operational objects relevant to this connector. Depending on module capabilities, it can also capture features such as lineage, usage, profiling, ownership, tags, and stateful deletion detection.
Concept Mapping
While the specific concept mapping is still pending, this shows the generic concept mapping in DataHub.
| Source Concept | DataHub Concept | Notes |
|---|---|---|
| Platform/account/project scope | Platform Instance, Container | Organizes assets within the platform context. |
| Core technical asset (for example table/view/topic/file) | Dataset | Primary ingested technical asset. |
| Schema fields / columns | SchemaField | Included when schema extraction is supported. |
| Ownership and collaboration principals | CorpUser, CorpGroup | Emitted by modules that support ownership and identity metadata. |
| Dependencies and processing relationships | Lineage edges | Available when lineage extraction is supported and enabled. |
Module teradata
Important Capabilities
| Capability | Status | Notes |
|---|---|---|
| Asset Containers | ✅ | Enabled by default. Supported for types - Database. |
| Classification | ✅ | Optionally enabled via classification.enabled. |
| Column-level Lineage | ✅ | Optionally enabled via configuration. |
| Data Profiling | ✅ | Optionally enabled via configuration. |
| Dataset Usage | ✅ | Optionally enabled via configuration. |
| Descriptions | ✅ | Enabled by default. |
| Detect Deleted Entities | ✅ | Enabled by default when stateful ingestion is turned on. |
| Domains | ✅ | Enabled by default. |
| Extract Ownership | ✅ | Optionally enabled via configuration (extract_ownership). |
| Platform Instance | ✅ | Enabled by default. |
| Schema Metadata | ✅ | Enabled by default. |
| Table-Level Lineage | ✅ | Optionally enabled via configuration. |
| Test Connection | ✅ | Enabled by default. |
Overview
The teradata module ingests metadata from Teradata into DataHub. It is intended for production ingestion workflows and module-specific capabilities are documented below.
This plugin extracts the following:
- Metadata for databases, schemas, views, and tables
- Column types associated with each table
- Table, row, and column statistics via optional SQL profiling
Prerequisites
Create a user which has access to the database you want to ingest.
CREATE USER datahub FROM <database> AS PASSWORD = <password> PERM = 20000000;Create a user with the following privileges:
GRANT SELECT ON dbc.columns TO datahub;
GRANT SELECT ON dbc.databases TO datahub;
GRANT SELECT ON dbc.tables TO datahub;
GRANT SELECT ON DBC.All_RI_ChildrenV TO datahub;
GRANT SELECT ON DBC.ColumnsV TO datahub;
GRANT SELECT ON DBC.IndicesV TO datahub;
GRANT SELECT ON dbc.TableTextV TO datahub;
GRANT SELECT ON dbc.TablesV TO datahub;
GRANT SELECT ON dbc.dbqlogtbl TO datahub; -- if lineage or usage extraction is enabledIf you want to run profiling, you need to grant select permission on all the tables you want to profile.
For lineage/usage extraction: Enable query logging and set an appropriate query text size (default is 200 chars, may be insufficient).
To set for all users:
REPLACE QUERY LOGGING LIMIT SQLTEXT=2000 ON ALL;See more here about query logging: https://docs.teradata.com/r/Lake-Database-Reference/Database-Administration/Tracking-Query-Behavior-with-Database-Query-Logging-Operational-DBAs/SQL-Statements-to-Control-Logging/LIMIT-Logging-Options
Install the Plugin
pip install 'acryl-datahub[teradata]'
Starter Recipe
Check out the following recipe to get started with ingestion! See below for full configuration options.
For general pointers on writing and running a recipe, see our main recipe guide.
pipeline_name: my-teradata-ingestion-pipeline
source:
type: teradata
config:
host_port: "myteradatainstance.teradata.com:1025"
username: myuser
password: mypassword
#database_pattern:
# allow:
# - "my_database"
# ignoreCase: true
include_table_lineage: true
include_usage_statistics: true
stateful_ingestion:
enabled: true
# --- Performance options for large installations ---
# Skip column extraction for tables not altered in the last N days (recommended for
# scheduled pipelines — set once and never update the recipe).
#column_extraction_days_back: 3
# Alternative: skip column extraction for tables unchanged since an absolute timestamp.
# Set to the start time of the last successful run. Mutually exclusive with
# column_extraction_days_back.
#column_extraction_watermark: "2024-06-01T00:00:00Z"
# Use dbc.ColumnsV for view columns first (faster); fall back to HELP only
# when a column has an unknown type (e.g. derived expressions).
#use_dbc_columns_for_views: true
# Cap profiling to high-priority tables (uses standard GEProfilingConfig).
#profiling:
# enabled: true
# limit: 500
#profile_pattern:
# allow:
# - "important_db\\..*"
# Increase if lineage queries against DBC.QryLogV time out (default: 120000 ms).
#request_timeout_ms: 300000
sink:
Config Details
- Options
- Schema
Note that a . is used to denote nested fields in the YAML recipe.
| Field | Description |
|---|---|
host_port ✅ string | host URL |
bucket_duration Enum | One of: "DAY", "HOUR" |
column_extraction_days_back One of integer, null | Skip column extraction for tables/views not altered within the last N days. Computed at runtime as now() - N days, so the recipe never needs updating. A value of 3 for a daily schedule covers up to two missed runs with no gap risk. Mutually exclusive with column_extraction_watermark. Default: None |
column_extraction_watermark One of string(date-time), null | Skip column extraction for tables/views whose LastAlterTimeStamp is older than this timestamp. Set to the start time of the last successful ingestion run to enable incremental column extraction. Mutually exclusive with column_extraction_days_back. At 13k tables where ~200 change per day this can reduce ingestion from hours to minutes. Default: None |
connect_timeout_ms integer | Connection timeout in milliseconds when establishing Teradata connections. Default is 30000 (30 seconds). Default: 30000 |
convert_urns_to_lowercase boolean | Whether to convert dataset urns to lowercase. Default: False |
database One of string, null | database (catalog) Default: None |
default_db One of string, null | The default database to use for unqualified table names Default: None |
end_time string(date-time) | Latest date of lineage/usage to consider. Default: Current time in UTC |
extract_ownership boolean | Whether to extract ownership information for tables and views based on their creator. When enabled, the table/view creator from Teradata's system tables will be added as an owner with DATAOWNER type. Ownership is applied using OVERWRITE mode, meaning any existing ownership information (including manually added or modified owners from the UI) will be replaced. Use with caution. Default: False |
include_historical_lineage boolean | Whether to include historical lineage data from PDCRINFO.DBQLSqlTbl_Hst in addition to current DBC.QryLogV data. This provides access to historical query logs that may have been archived. The historical table existence is checked automatically and gracefully falls back to current data only if not available. Default: False |
include_queries boolean | Whether to generate query entities for SQL queries. Query entities provide metadata about individual SQL queries including execution timestamps, user information, and query text. Default: True |
include_table_lineage boolean | Whether to include table lineage in the ingestion. This requires to have the table lineage feature enabled. Default: False |
include_table_location_lineage boolean | If the source supports it, include table lineage to the underlying storage location. Default: True |
include_tables boolean | Whether tables should be ingested. Default: True |
include_usage_statistics boolean | Generate usage statistic. Default: False |
include_view_column_lineage boolean | Populates column-level lineage for view->view and table->view lineage using DataHub's sql parser. Requires include_view_lineage to be enabled. Default: True |
include_view_lineage boolean | Whether to include view lineage in the ingestion. This requires to have the view lineage feature enabled. Default: True |
include_views boolean | Whether views should be ingested. Default: True |
incremental_lineage boolean | When enabled, emits lineage as incremental to existing lineage already in DataHub. When disabled, re-states lineage on each run. Default: False |
max_workers integer | Maximum number of worker threads to use for parallel processing. Controls the level of concurrency for operations like view processing. Default: 10 |
options object | Any options specified here will be passed to SQLAlchemy.create_engine as kwargs. To set connection arguments in the URL, specify them under connect_args. |
password One of string(password), null | password Default: None |
platform_instance One of string, null | The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details. Default: None |
request_timeout_ms integer | Request timeout in milliseconds for Teradata query execution. Increase this when queries against large system tables (e.g., DBC.QryLogV) time out silently and fall back. Default is 120000 (2 minutes). Default: 120000 |
scheme string | database scheme Default: teradatasql |
sqlalchemy_uri One of string, null | URI of database to connect to. See https://docs.sqlalchemy.org/en/14/core/engines.html#database-urls. Takes precedence over other connection parameters. Default: None |
start_time string(date-time) | Earliest date of lineage/usage to consider. Default: Last full day in UTC (or hour, depending on bucket_duration). You can also specify relative time with respect to end_time such as '-7 days' Or '-7d'. Default: None |
use_dbc_columns_for_views boolean | When True, attempt to use dbc.ColumnsV for view column metadata (faster bulk fetch) and fall back to HELP statements only for views where any column has a null/unknown ColumnType (e.g., derived expression columns). Can cut HELP calls by 80-90%% for installations where most view columns have explicit types. Set to False (default) to always use HELP for views, which is the conservative but slower approach. Default: False |
use_file_backed_cache boolean | Whether to use a file backed cache for the view definitions. Default: True |
use_qvci boolean | Whether to use QVCI to get column information. This is faster but requires to have QVCI enabled. Default: False |
use_server_side_cursors boolean | Enable server-side cursors for large result sets using SQLAlchemy's stream_results. This reduces memory usage by streaming results from the database server. Automatically falls back to client-side batching if server-side cursors are not supported. Default: True |
username One of string, null | username Default: None |
env string | The environment that all assets produced by this connector belong to Default: PROD |
database_pattern AllowDenyPattern | A class to store allow deny regexes |
database_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
databases One of array, null | List of databases to ingest. If not specified, all databases will be ingested. Even if this is specified, databases will still be filtered by database_pattern. Default: None |
databases.string string | |
domain map(str,AllowDenyPattern) | A class to store allow deny regexes |
domain. key.allowarray | List of regex patterns to include in ingestion Default: ['.*'] |
domain. key.allow.stringstring | |
domain. key.ignoreCaseOne of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
domain. key.denyarray | List of regex patterns to exclude from ingestion. Default: [] |
domain. key.deny.stringstring | |
profile_pattern AllowDenyPattern | A class to store allow deny regexes |
profile_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
table_pattern AllowDenyPattern | A class to store allow deny regexes |
table_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
usage BaseUsageConfig | |
usage.bucket_duration Enum | One of: "DAY", "HOUR" |
usage.end_time string(date-time) | Latest date of lineage/usage to consider. Default: Current time in UTC |
usage.format_sql_queries boolean | Whether to format sql queries Default: False |
usage.include_operational_stats boolean | Whether to display operational stats. Default: True |
usage.include_read_operational_stats boolean | Whether to report read operational stats. Experimental. Default: False |
usage.include_top_n_queries boolean | Whether to ingest the top_n_queries. Default: True |
usage.start_time string(date-time) | Earliest date of lineage/usage to consider. Default: Last full day in UTC (or hour, depending on bucket_duration). You can also specify relative time with respect to end_time such as '-7 days' Or '-7d'. Default: None |
usage.top_n_queries integer | Number of top queries to save to each table. Default: 10 |
usage.user_email_pattern AllowDenyPattern | A class to store allow deny regexes |
usage.user_email_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
view_pattern AllowDenyPattern | A class to store allow deny regexes |
view_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
classification ClassificationConfig | |
classification.enabled boolean | Whether classification should be used to auto-detect glossary terms Default: False |
classification.info_type_to_term map(str,string) | |
classification.max_workers integer | Number of worker processes to use for classification. Set to 1 to disable. Default: 4 |
classification.sample_size integer | Number of sample values used for classification. Default: 100 |
classification.classifiers array | Classifiers to use to auto-detect glossary terms. If more than one classifier, infotype predictions from the classifier defined later in sequence take precedance. Default: [{'type': 'datahub', 'config': None}] |
classification.classifiers.DynamicTypedClassifierConfig DynamicTypedClassifierConfig | |
classification.classifiers.DynamicTypedClassifierConfig.type ❓ string | The type of the classifier to use. For DataHub, use datahub |
classification.classifiers.DynamicTypedClassifierConfig.config One of object, null | The configuration required for initializing the classifier. If not specified, uses defaults for classifer type. Default: None |
classification.column_pattern AllowDenyPattern | A class to store allow deny regexes |
classification.column_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
classification.table_pattern AllowDenyPattern | A class to store allow deny regexes |
classification.table_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
profiling GEProfilingConfig | |
profiling.catch_exceptions boolean | Default: True |
profiling.enabled boolean | Whether profiling should be done. Default: False |
profiling.field_sample_values_limit integer | Upper limit for number of sample values to collect for all columns. Default: 20 |
profiling.include_field_distinct_count boolean | Whether to profile for the number of distinct values for each column. Default: True |
profiling.include_field_distinct_value_frequencies boolean | Whether to profile for distinct value frequencies. Default: False |
profiling.include_field_histogram boolean | Whether to profile for the histogram for numeric fields. Default: False |
profiling.include_field_max_value boolean | Whether to profile for the max value of numeric columns. Default: True |
profiling.include_field_mean_value boolean | Whether to profile for the mean value of numeric columns. Default: True |
profiling.include_field_median_value boolean | Whether to profile for the median value of numeric columns. Default: True |
profiling.include_field_min_value boolean | Whether to profile for the min value of numeric columns. Default: True |
profiling.include_field_null_count boolean | Whether to profile for the number of nulls for each column. Default: True |
profiling.include_field_quantiles boolean | Whether to profile for the quantiles of numeric columns. Default: False |
profiling.include_field_sample_values boolean | Whether to profile for the sample values for all columns. Default: True |
profiling.include_field_stddev_value boolean | Whether to profile for the standard deviation of numeric columns. Default: True |
profiling.limit One of integer, null | Max number of documents to profile. By default, profiles all documents. Default: None |
profiling.max_number_of_fields_to_profile One of integer, null | A positive integer that specifies the maximum number of columns to profile for any table. None implies all columns. The cost of profiling goes up significantly as the number of columns to profile goes up. Default: None |
profiling.max_workers integer | Number of worker threads to use for profiling. Set to 1 to disable. Default: 20 |
profiling.method Enum | One of: "ge", "sqlalchemy" Default: ge |
profiling.offset One of integer, null | Offset in documents to profile. By default, uses no offset. Default: None |
profiling.partition_datetime One of string(date-time), null | If specified, profile only the partition which matches this datetime. If not specified, profile the latest partition. Only Bigquery supports this. Default: None |
profiling.partition_profiling_enabled boolean | Whether to profile partitioned tables. Only BigQuery and Aws Athena supports this. If enabled, latest partition data is used for profiling. Default: True |
profiling.profile_external_tables boolean | Whether to profile external tables. Only Snowflake and Redshift supports this. Default: False |
profiling.profile_if_updated_since_days One of number, null | Profile table only if it has been updated since these many number of days. If set to null, no constraint of last modified time for tables to profile. Supported only in snowflake and BigQuery. Default: None |
profiling.profile_nested_fields boolean | Whether to profile complex types like structs, arrays and maps. Default: False |
profiling.profile_table_level_only boolean | Whether to perform profiling at table-level only, or include column-level profiling as well. Default: False |
profiling.profile_table_row_count_estimate_only boolean | Use an approximate query for row count. This will be much faster but slightly less accurate. Only supported for Postgres and MySQL. Default: False |
profiling.profile_table_row_limit One of integer, null | Profile tables only if their row count is less than specified count. If set to null, no limit on the row count of tables to profile. Supported only in Snowflake, BigQuery. Supported for Oracle based on gathered stats. Default: 5000000 |
profiling.profile_table_size_limit One of integer, null | Profile tables only if their size is less than specified GBs. If set to null, no limit on the size of tables to profile. Supported only in Snowflake, BigQuery and Databricks. Supported for Oracle based on calculated size from gathered stats. Default: 5 |
profiling.query_combiner_enabled boolean | This feature is still experimental and can be disabled if it causes issues. Reduces the total number of queries issued and speeds up profiling by dynamically combining SQL queries where possible. Default: True |
profiling.report_dropped_profiles boolean | Whether to report datasets or dataset columns which were not profiled. Set to True for debugging purposes. Default: False |
profiling.sample_size integer | Number of rows to be sampled from table for column level profiling.Applicable only if use_sampling is set to True. Default: 10000 |
profiling.turn_off_expensive_profiling_metrics boolean | Whether to turn off expensive profiling or not. This turns off profiling for quantiles, distinct_value_frequencies, histogram & sample_values. This also limits maximum number of fields being profiled to 10. Default: False |
profiling.use_sampling boolean | Whether to profile column level stats on sample of table. Only BigQuery and Snowflake support this. If enabled, profiling is done on rows sampled from table. Sampling is not done for smaller tables. Default: True |
profiling.operation_config OperationConfig | |
profiling.operation_config.lower_freq_profile_enabled boolean | Whether to do profiling at lower freq or not. This does not do any scheduling just adds additional checks to when not to run profiling. Default: False |
profiling.operation_config.profile_date_of_month One of integer, null | Number between 1 to 31 for date of month (both inclusive). If not specified, defaults to Nothing and this field does not take affect. Default: None |
profiling.operation_config.profile_day_of_week One of integer, null | Number between 0 to 6 for day of week (both inclusive). 0 is Monday and 6 is Sunday. If not specified, defaults to Nothing and this field does not take affect. Default: None |
profiling.tags_to_ignore_sampling One of array, null | Fixed list of tags to ignore sampling. If not specified, tables will be sampled based on use_sampling. Default: None |
profiling.tags_to_ignore_sampling.string string | |
stateful_ingestion One of StatefulStaleMetadataRemovalConfig, null | Default: None |
stateful_ingestion.enabled boolean | Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or datahub_api is specified, otherwise False Default: False |
stateful_ingestion.fail_safe_threshold number | Prevents large amount of soft deletes & the state from committing from accidental changes to the source configuration if the relative change percent in entities compared to the previous state is above the 'fail_safe_threshold'. Default: 75.0 |
stateful_ingestion.remove_stale_metadata boolean | Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled. Default: True |
The JSONSchema for this configuration is inlined below.
{
"$defs": {
"AllowDenyPattern": {
"additionalProperties": false,
"description": "A class to store allow deny regexes",
"properties": {
"allow": {
"default": [
".*"
],
"description": "List of regex patterns to include in ingestion",
"items": {
"type": "string"
},
"title": "Allow",
"type": "array"
},
"deny": {
"default": [],
"description": "List of regex patterns to exclude from ingestion.",
"items": {
"type": "string"
},
"title": "Deny",
"type": "array"
},
"ignoreCase": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": true,
"description": "Whether to ignore case sensitivity during pattern matching.",
"title": "Ignorecase"
}
},
"title": "AllowDenyPattern",
"type": "object"
},
"BaseUsageConfig": {
"additionalProperties": false,
"properties": {
"bucket_duration": {
"$ref": "#/$defs/BucketDuration",
"default": "DAY",
"description": "Size of the time window to aggregate usage stats."
},
"end_time": {
"description": "Latest date of lineage/usage to consider. Default: Current time in UTC",
"format": "date-time",
"title": "End Time",
"type": "string"
},
"start_time": {
"default": null,
"description": "Earliest date of lineage/usage to consider. Default: Last full day in UTC (or hour, depending on `bucket_duration`). You can also specify relative time with respect to end_time such as '-7 days' Or '-7d'.",
"format": "date-time",
"title": "Start Time",
"type": "string"
},
"top_n_queries": {
"default": 10,
"description": "Number of top queries to save to each table.",
"exclusiveMinimum": 0,
"title": "Top N Queries",
"type": "integer"
},
"user_email_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "regex patterns for user emails to filter in usage."
},
"include_operational_stats": {
"default": true,
"description": "Whether to display operational stats.",
"title": "Include Operational Stats",
"type": "boolean"
},
"include_read_operational_stats": {
"default": false,
"description": "Whether to report read operational stats. Experimental.",
"title": "Include Read Operational Stats",
"type": "boolean"
},
"format_sql_queries": {
"default": false,
"description": "Whether to format sql queries",
"title": "Format Sql Queries",
"type": "boolean"
},
"include_top_n_queries": {
"default": true,
"description": "Whether to ingest the top_n_queries.",
"title": "Include Top N Queries",
"type": "boolean"
}
},
"title": "BaseUsageConfig",
"type": "object"
},
"BucketDuration": {
"enum": [
"DAY",
"HOUR"
],
"title": "BucketDuration",
"type": "string"
},
"ClassificationConfig": {
"additionalProperties": false,
"properties": {
"enabled": {
"default": false,
"description": "Whether classification should be used to auto-detect glossary terms",
"title": "Enabled",
"type": "boolean"
},
"sample_size": {
"default": 100,
"description": "Number of sample values used for classification.",
"title": "Sample Size",
"type": "integer"
},
"max_workers": {
"default": 4,
"description": "Number of worker processes to use for classification. Set to 1 to disable.",
"title": "Max Workers",
"type": "integer"
},
"table_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns to filter tables for classification. This is used in combination with other patterns in parent config. Specify regex to match the entire table name in `database.schema.table` format. e.g. to match all tables starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.*'"
},
"column_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns to filter columns for classification. This is used in combination with other patterns in parent config. Specify regex to match the column name in `database.schema.table.column` format."
},
"info_type_to_term": {
"additionalProperties": {
"type": "string"
},
"default": {},
"description": "Optional mapping to provide glossary term identifier for info type",
"title": "Info Type To Term",
"type": "object"
},
"classifiers": {
"default": [
{
"type": "datahub",
"config": null
}
],
"description": "Classifiers to use to auto-detect glossary terms. If more than one classifier, infotype predictions from the classifier defined later in sequence take precedance.",
"items": {
"$ref": "#/$defs/DynamicTypedClassifierConfig"
},
"title": "Classifiers",
"type": "array"
}
},
"title": "ClassificationConfig",
"type": "object"
},
"DynamicTypedClassifierConfig": {
"additionalProperties": false,
"properties": {
"type": {
"description": "The type of the classifier to use. For DataHub, use `datahub`",
"title": "Type",
"type": "string"
},
"config": {
"anyOf": [
{},
{
"type": "null"
}
],
"default": null,
"description": "The configuration required for initializing the classifier. If not specified, uses defaults for classifer type.",
"title": "Config"
}
},
"required": [
"type"
],
"title": "DynamicTypedClassifierConfig",
"type": "object"
},
"GEProfilingConfig": {
"additionalProperties": false,
"properties": {
"method": {
"default": "ge",
"description": "Profiling method to use. Options: `ge` (Great Expectations) or `sqlalchemy` (custom SQLAlchemy-based profiler). The SQLAlchemy profiler has no GE dependency and provides the same functionality.",
"enum": [
"ge",
"sqlalchemy"
],
"title": "Method",
"type": "string"
},
"enabled": {
"default": false,
"description": "Whether profiling should be done.",
"title": "Enabled",
"type": "boolean"
},
"operation_config": {
"$ref": "#/$defs/OperationConfig",
"description": "Experimental feature. To specify operation configs."
},
"limit": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Max number of documents to profile. By default, profiles all documents.",
"title": "Limit"
},
"offset": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Offset in documents to profile. By default, uses no offset.",
"title": "Offset"
},
"profile_table_level_only": {
"default": false,
"description": "Whether to perform profiling at table-level only, or include column-level profiling as well.",
"title": "Profile Table Level Only",
"type": "boolean"
},
"include_field_null_count": {
"default": true,
"description": "Whether to profile for the number of nulls for each column.",
"title": "Include Field Null Count",
"type": "boolean"
},
"include_field_distinct_count": {
"default": true,
"description": "Whether to profile for the number of distinct values for each column.",
"title": "Include Field Distinct Count",
"type": "boolean"
},
"include_field_min_value": {
"default": true,
"description": "Whether to profile for the min value of numeric columns.",
"title": "Include Field Min Value",
"type": "boolean"
},
"include_field_max_value": {
"default": true,
"description": "Whether to profile for the max value of numeric columns.",
"title": "Include Field Max Value",
"type": "boolean"
},
"include_field_mean_value": {
"default": true,
"description": "Whether to profile for the mean value of numeric columns.",
"title": "Include Field Mean Value",
"type": "boolean"
},
"include_field_median_value": {
"default": true,
"description": "Whether to profile for the median value of numeric columns.",
"title": "Include Field Median Value",
"type": "boolean"
},
"include_field_stddev_value": {
"default": true,
"description": "Whether to profile for the standard deviation of numeric columns.",
"title": "Include Field Stddev Value",
"type": "boolean"
},
"include_field_quantiles": {
"default": false,
"description": "Whether to profile for the quantiles of numeric columns.",
"title": "Include Field Quantiles",
"type": "boolean"
},
"include_field_distinct_value_frequencies": {
"default": false,
"description": "Whether to profile for distinct value frequencies.",
"title": "Include Field Distinct Value Frequencies",
"type": "boolean"
},
"include_field_histogram": {
"default": false,
"description": "Whether to profile for the histogram for numeric fields.",
"title": "Include Field Histogram",
"type": "boolean"
},
"include_field_sample_values": {
"default": true,
"description": "Whether to profile for the sample values for all columns.",
"title": "Include Field Sample Values",
"type": "boolean"
},
"max_workers": {
"default": 20,
"description": "Number of worker threads to use for profiling. Set to 1 to disable.",
"title": "Max Workers",
"type": "integer"
},
"report_dropped_profiles": {
"default": false,
"description": "Whether to report datasets or dataset columns which were not profiled. Set to `True` for debugging purposes.",
"title": "Report Dropped Profiles",
"type": "boolean"
},
"turn_off_expensive_profiling_metrics": {
"default": false,
"description": "Whether to turn off expensive profiling or not. This turns off profiling for quantiles, distinct_value_frequencies, histogram & sample_values. This also limits maximum number of fields being profiled to 10.",
"title": "Turn Off Expensive Profiling Metrics",
"type": "boolean"
},
"field_sample_values_limit": {
"default": 20,
"description": "Upper limit for number of sample values to collect for all columns.",
"title": "Field Sample Values Limit",
"type": "integer"
},
"max_number_of_fields_to_profile": {
"anyOf": [
{
"exclusiveMinimum": 0,
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "A positive integer that specifies the maximum number of columns to profile for any table. `None` implies all columns. The cost of profiling goes up significantly as the number of columns to profile goes up.",
"title": "Max Number Of Fields To Profile"
},
"profile_if_updated_since_days": {
"anyOf": [
{
"exclusiveMinimum": 0,
"type": "number"
},
{
"type": "null"
}
],
"default": null,
"description": "Profile table only if it has been updated since these many number of days. If set to `null`, no constraint of last modified time for tables to profile. Supported only in `snowflake` and `BigQuery`.",
"schema_extra": {
"supported_sources": [
"snowflake",
"bigquery"
]
},
"title": "Profile If Updated Since Days"
},
"profile_table_size_limit": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 5,
"description": "Profile tables only if their size is less than specified GBs. If set to `null`, no limit on the size of tables to profile. Supported only in `Snowflake`, `BigQuery` and `Databricks`. Supported for `Oracle` based on calculated size from gathered stats.",
"schema_extra": {
"supported_sources": [
"snowflake",
"bigquery",
"unity-catalog",
"oracle"
]
},
"title": "Profile Table Size Limit"
},
"profile_table_row_limit": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 5000000,
"description": "Profile tables only if their row count is less than specified count. If set to `null`, no limit on the row count of tables to profile. Supported only in `Snowflake`, `BigQuery`. Supported for `Oracle` based on gathered stats.",
"schema_extra": {
"supported_sources": [
"snowflake",
"bigquery",
"oracle"
]
},
"title": "Profile Table Row Limit"
},
"profile_table_row_count_estimate_only": {
"default": false,
"description": "Use an approximate query for row count. This will be much faster but slightly less accurate. Only supported for Postgres and MySQL. ",
"schema_extra": {
"supported_sources": [
"postgres",
"mysql"
]
},
"title": "Profile Table Row Count Estimate Only",
"type": "boolean"
},
"query_combiner_enabled": {
"default": true,
"description": "*This feature is still experimental and can be disabled if it causes issues.* Reduces the total number of queries issued and speeds up profiling by dynamically combining SQL queries where possible.",
"title": "Query Combiner Enabled",
"type": "boolean"
},
"catch_exceptions": {
"default": true,
"description": "",
"title": "Catch Exceptions",
"type": "boolean"
},
"partition_profiling_enabled": {
"default": true,
"description": "Whether to profile partitioned tables. Only BigQuery and Aws Athena supports this. If enabled, latest partition data is used for profiling.",
"schema_extra": {
"supported_sources": [
"athena",
"bigquery"
]
},
"title": "Partition Profiling Enabled",
"type": "boolean"
},
"partition_datetime": {
"anyOf": [
{
"format": "date-time",
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "If specified, profile only the partition which matches this datetime. If not specified, profile the latest partition. Only Bigquery supports this.",
"schema_extra": {
"supported_sources": [
"bigquery"
]
},
"title": "Partition Datetime"
},
"use_sampling": {
"default": true,
"description": "Whether to profile column level stats on sample of table. Only BigQuery and Snowflake support this. If enabled, profiling is done on rows sampled from table. Sampling is not done for smaller tables. ",
"schema_extra": {
"supported_sources": [
"bigquery",
"snowflake"
]
},
"title": "Use Sampling",
"type": "boolean"
},
"sample_size": {
"default": 10000,
"description": "Number of rows to be sampled from table for column level profiling.Applicable only if `use_sampling` is set to True.",
"schema_extra": {
"supported_sources": [
"bigquery",
"snowflake"
]
},
"title": "Sample Size",
"type": "integer"
},
"profile_external_tables": {
"default": false,
"description": "Whether to profile external tables. Only Snowflake and Redshift supports this.",
"schema_extra": {
"supported_sources": [
"redshift",
"snowflake"
]
},
"title": "Profile External Tables",
"type": "boolean"
},
"tags_to_ignore_sampling": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": null,
"description": "Fixed list of tags to ignore sampling. If not specified, tables will be sampled based on `use_sampling`.",
"title": "Tags To Ignore Sampling"
},
"profile_nested_fields": {
"default": false,
"description": "Whether to profile complex types like structs, arrays and maps. ",
"title": "Profile Nested Fields",
"type": "boolean"
}
},
"title": "GEProfilingConfig",
"type": "object"
},
"OperationConfig": {
"additionalProperties": false,
"properties": {
"lower_freq_profile_enabled": {
"default": false,
"description": "Whether to do profiling at lower freq or not. This does not do any scheduling just adds additional checks to when not to run profiling.",
"title": "Lower Freq Profile Enabled",
"type": "boolean"
},
"profile_day_of_week": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Number between 0 to 6 for day of week (both inclusive). 0 is Monday and 6 is Sunday. If not specified, defaults to Nothing and this field does not take affect.",
"title": "Profile Day Of Week"
},
"profile_date_of_month": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Number between 1 to 31 for date of month (both inclusive). If not specified, defaults to Nothing and this field does not take affect.",
"title": "Profile Date Of Month"
}
},
"title": "OperationConfig",
"type": "object"
},
"StatefulStaleMetadataRemovalConfig": {
"additionalProperties": false,
"description": "Base specialized config for Stateful Ingestion with stale metadata removal capability.",
"properties": {
"enabled": {
"default": false,
"description": "Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or `datahub_api` is specified, otherwise False",
"title": "Enabled",
"type": "boolean"
},
"remove_stale_metadata": {
"default": true,
"description": "Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.",
"title": "Remove Stale Metadata",
"type": "boolean"
},
"fail_safe_threshold": {
"default": 75.0,
"description": "Prevents large amount of soft deletes & the state from committing from accidental changes to the source configuration if the relative change percent in entities compared to the previous state is above the 'fail_safe_threshold'.",
"maximum": 100.0,
"minimum": 0.0,
"title": "Fail Safe Threshold",
"type": "number"
}
},
"title": "StatefulStaleMetadataRemovalConfig",
"type": "object"
}
},
"additionalProperties": false,
"properties": {
"bucket_duration": {
"$ref": "#/$defs/BucketDuration",
"default": "DAY",
"description": "Size of the time window to aggregate usage stats."
},
"end_time": {
"description": "Latest date of lineage/usage to consider. Default: Current time in UTC",
"format": "date-time",
"title": "End Time",
"type": "string"
},
"start_time": {
"default": null,
"description": "Earliest date of lineage/usage to consider. Default: Last full day in UTC (or hour, depending on `bucket_duration`). You can also specify relative time with respect to end_time such as '-7 days' Or '-7d'.",
"format": "date-time",
"title": "Start Time",
"type": "string"
},
"table_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns for tables to filter in ingestion. Specify regex to match the entire table name in database.schema.table format. e.g. to match all tables starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.*'"
},
"view_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns for views to filter in ingestion. Note: Defaults to table_pattern if not specified. Specify regex to match the entire view name in database.schema.view format. e.g. to match all views starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.*'"
},
"classification": {
"$ref": "#/$defs/ClassificationConfig",
"default": {
"enabled": false,
"sample_size": 100,
"max_workers": 4,
"table_pattern": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"column_pattern": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"info_type_to_term": {},
"classifiers": [
{
"config": null,
"type": "datahub"
}
]
},
"description": "For details, refer to [Classification](../../../../metadata-ingestion/docs/dev_guides/classification.md)."
},
"incremental_lineage": {
"default": false,
"description": "When enabled, emits lineage as incremental to existing lineage already in DataHub. When disabled, re-states lineage on each run.",
"title": "Incremental Lineage",
"type": "boolean"
},
"convert_urns_to_lowercase": {
"default": false,
"description": "Whether to convert dataset urns to lowercase.",
"title": "Convert Urns To Lowercase",
"type": "boolean"
},
"env": {
"default": "PROD",
"description": "The environment that all assets produced by this connector belong to",
"title": "Env",
"type": "string"
},
"platform_instance": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details.",
"title": "Platform Instance"
},
"stateful_ingestion": {
"anyOf": [
{
"$ref": "#/$defs/StatefulStaleMetadataRemovalConfig"
},
{
"type": "null"
}
],
"default": null
},
"options": {
"additionalProperties": true,
"description": "Any options specified here will be passed to [SQLAlchemy.create_engine](https://docs.sqlalchemy.org/en/14/core/engines.html#sqlalchemy.create_engine) as kwargs. To set connection arguments in the URL, specify them under `connect_args`.",
"title": "Options",
"type": "object"
},
"profile_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns to filter tables (or specific columns) for profiling during ingestion. Note that only tables allowed by the `table_pattern` will be considered."
},
"domain": {
"additionalProperties": {
"$ref": "#/$defs/AllowDenyPattern"
},
"default": {},
"description": "Attach domains to databases, schemas or tables during ingestion using regex patterns. Domain key can be a guid like *urn:li:domain:ec428203-ce86-4db3-985d-5a8ee6df32ba* or a string like \"Marketing\".) If you provide strings, then datahub will attempt to resolve this name to a guid, and will error out if this fails. There can be multiple domain keys specified.",
"title": "Domain",
"type": "object"
},
"include_views": {
"default": true,
"description": "Whether views should be ingested.",
"title": "Include Views",
"type": "boolean"
},
"include_tables": {
"default": true,
"description": "Whether tables should be ingested.",
"title": "Include Tables",
"type": "boolean"
},
"include_table_location_lineage": {
"default": true,
"description": "If the source supports it, include table lineage to the underlying storage location.",
"title": "Include Table Location Lineage",
"type": "boolean"
},
"include_view_lineage": {
"default": true,
"description": "Whether to include view lineage in the ingestion. This requires to have the view lineage feature enabled.",
"title": "Include View Lineage",
"type": "boolean"
},
"include_view_column_lineage": {
"default": true,
"description": "Populates column-level lineage for view->view and table->view lineage using DataHub's sql parser. Requires `include_view_lineage` to be enabled.",
"title": "Include View Column Lineage",
"type": "boolean"
},
"use_file_backed_cache": {
"default": true,
"description": "Whether to use a file backed cache for the view definitions.",
"title": "Use File Backed Cache",
"type": "boolean"
},
"profiling": {
"$ref": "#/$defs/GEProfilingConfig",
"default": {
"method": "ge",
"enabled": false,
"operation_config": {
"lower_freq_profile_enabled": false,
"profile_date_of_month": null,
"profile_day_of_week": null
},
"limit": null,
"offset": null,
"profile_table_level_only": false,
"include_field_null_count": true,
"include_field_distinct_count": true,
"include_field_min_value": true,
"include_field_max_value": true,
"include_field_mean_value": true,
"include_field_median_value": true,
"include_field_stddev_value": true,
"include_field_quantiles": false,
"include_field_distinct_value_frequencies": false,
"include_field_histogram": false,
"include_field_sample_values": true,
"max_workers": 20,
"report_dropped_profiles": false,
"turn_off_expensive_profiling_metrics": false,
"field_sample_values_limit": 20,
"max_number_of_fields_to_profile": null,
"profile_if_updated_since_days": null,
"profile_table_size_limit": 5,
"profile_table_row_limit": 5000000,
"profile_table_row_count_estimate_only": false,
"query_combiner_enabled": true,
"catch_exceptions": true,
"partition_profiling_enabled": true,
"partition_datetime": null,
"use_sampling": true,
"sample_size": 10000,
"profile_external_tables": false,
"tags_to_ignore_sampling": null,
"profile_nested_fields": false
}
},
"username": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "username",
"title": "Username"
},
"password": {
"anyOf": [
{
"format": "password",
"type": "string",
"writeOnly": true
},
{
"type": "null"
}
],
"default": null,
"description": "password",
"title": "Password"
},
"host_port": {
"description": "host URL",
"title": "Host Port",
"type": "string"
},
"database": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "database (catalog)",
"title": "Database"
},
"scheme": {
"default": "teradatasql",
"description": "database scheme",
"title": "Scheme",
"type": "string"
},
"sqlalchemy_uri": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "URI of database to connect to. See https://docs.sqlalchemy.org/en/14/core/engines.html#database-urls. Takes precedence over other connection parameters.",
"title": "Sqlalchemy Uri"
},
"database_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [
"All",
"Crashdumps",
"Default",
"DemoNow_Monitor",
"EXTUSER",
"External_AP",
"GLOBAL_FUNCTIONS",
"LockLogShredder",
"PUBLIC",
"SQLJ",
"SYSBAR",
"SYSJDBC",
"SYSLIB",
"SYSSPATIAL",
"SYSUDTLIB",
"SYSUIF",
"SysAdmin",
"Sys_Calendar",
"SystemFe",
"TDBCMgmt",
"TDMaps",
"TDPUSER",
"TDQCD",
"TDStats",
"TD_ANALYTICS_DB",
"TD_SERVER_DB",
"TD_SYSFNLIB",
"TD_SYSGPL",
"TD_SYSXML",
"TDaaS_BAR",
"TDaaS_DB",
"TDaaS_Maint",
"TDaaS_Monitor",
"TDaaS_Support",
"TDaaS_TDBCMgmt1",
"TDaaS_TDBCMgmt2",
"dbcmngr",
"mldb",
"system",
"tapidb",
"tdwm",
"val",
"dbc"
],
"ignoreCase": true
},
"description": "Regex patterns for databases to filter in ingestion."
},
"databases": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": null,
"description": "List of databases to ingest. If not specified, all databases will be ingested. Even if this is specified, databases will still be filtered by `database_pattern`.",
"title": "Databases"
},
"include_table_lineage": {
"default": false,
"description": "Whether to include table lineage in the ingestion. This requires to have the table lineage feature enabled.",
"title": "Include Table Lineage",
"type": "boolean"
},
"include_queries": {
"default": true,
"description": "Whether to generate query entities for SQL queries. Query entities provide metadata about individual SQL queries including execution timestamps, user information, and query text.",
"title": "Include Queries",
"type": "boolean"
},
"usage": {
"$ref": "#/$defs/BaseUsageConfig",
"default": {
"bucket_duration": "DAY",
"end_time": "2026-04-06T23:27:26.630402Z",
"start_time": "2026-04-05T00:00:00Z",
"queries_character_limit": 24000,
"top_n_queries": 10,
"user_email_pattern": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"include_operational_stats": true,
"include_read_operational_stats": false,
"format_sql_queries": false,
"include_top_n_queries": true
},
"description": "The usage config to use when generating usage statistics"
},
"default_db": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "The default database to use for unqualified table names",
"title": "Default Db"
},
"include_usage_statistics": {
"default": false,
"description": "Generate usage statistic.",
"title": "Include Usage Statistics",
"type": "boolean"
},
"use_qvci": {
"default": false,
"description": "Whether to use QVCI to get column information. This is faster but requires to have QVCI enabled.",
"title": "Use Qvci",
"type": "boolean"
},
"include_historical_lineage": {
"default": false,
"description": "Whether to include historical lineage data from PDCRINFO.DBQLSqlTbl_Hst in addition to current DBC.QryLogV data. This provides access to historical query logs that may have been archived. The historical table existence is checked automatically and gracefully falls back to current data only if not available.",
"title": "Include Historical Lineage",
"type": "boolean"
},
"use_server_side_cursors": {
"default": true,
"description": "Enable server-side cursors for large result sets using SQLAlchemy's stream_results. This reduces memory usage by streaming results from the database server. Automatically falls back to client-side batching if server-side cursors are not supported.",
"title": "Use Server Side Cursors",
"type": "boolean"
},
"max_workers": {
"default": 10,
"description": "Maximum number of worker threads to use for parallel processing. Controls the level of concurrency for operations like view processing.",
"title": "Max Workers",
"type": "integer"
},
"extract_ownership": {
"default": false,
"description": "Whether to extract ownership information for tables and views based on their creator. When enabled, the table/view creator from Teradata's system tables will be added as an owner with DATAOWNER type. Ownership is applied using OVERWRITE mode, meaning any existing ownership information (including manually added or modified owners from the UI) will be replaced. Use with caution.",
"title": "Extract Ownership",
"type": "boolean"
},
"column_extraction_watermark": {
"anyOf": [
{
"format": "date-time",
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Skip column extraction for tables/views whose LastAlterTimeStamp is older than this timestamp. Set to the start time of the last successful ingestion run to enable incremental column extraction. Mutually exclusive with column_extraction_days_back. At 13k tables where ~200 change per day this can reduce ingestion from hours to minutes.",
"title": "Column Extraction Watermark"
},
"column_extraction_days_back": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Skip column extraction for tables/views not altered within the last N days. Computed at runtime as now() - N days, so the recipe never needs updating. A value of 3 for a daily schedule covers up to two missed runs with no gap risk. Mutually exclusive with column_extraction_watermark.",
"title": "Column Extraction Days Back"
},
"use_dbc_columns_for_views": {
"default": false,
"description": "When True, attempt to use dbc.ColumnsV for view column metadata (faster bulk fetch) and fall back to HELP statements only for views where any column has a null/unknown ColumnType (e.g., derived expression columns). Can cut HELP calls by 80-90%% for installations where most view columns have explicit types. Set to False (default) to always use HELP for views, which is the conservative but slower approach.",
"title": "Use Dbc Columns For Views",
"type": "boolean"
},
"request_timeout_ms": {
"default": 120000,
"description": "Request timeout in milliseconds for Teradata query execution. Increase this when queries against large system tables (e.g., DBC.QryLogV) time out silently and fall back. Default is 120000 (2 minutes).",
"title": "Request Timeout Ms",
"type": "integer"
},
"connect_timeout_ms": {
"default": 30000,
"description": "Connection timeout in milliseconds when establishing Teradata connections. Default is 30000 (30 seconds).",
"title": "Connect Timeout Ms",
"type": "integer"
}
},
"required": [
"host_port"
],
"title": "TeradataConfig",
"type": "object"
}
Capabilities
Use the Important Capabilities table above as the source of truth for supported features and whether additional configuration is required.
Large-scale Deployment Tuning
For Teradata installations with thousands of tables the following options can significantly reduce ingestion time.
Incremental column extraction
The connector compares each table's LastAlterTimeStamp against a watermark and skips column extraction for tables that have not changed. Only altered tables and tables with no recorded alter timestamp are re-extracted. At 13 000 tables where ~200 change per day this typically reduces a multi-hour run to minutes.
Two mutually exclusive options control the watermark (setting both raises a validation error at startup):
column_extraction_days_back— recommended for scheduled pipelines. Set once and never update the recipe. A value of 3 covers up to two missed daily runs with no gap risk.column_extraction_days_back: 3column_extraction_watermark— for stateful pipelines that track the exact timestamp of the last successful run programmatically.column_extraction_watermark: "2024-06-01T00:00:00Z"
Faster view column fetching
By default the connector uses Teradata HELP statements for every view to ensure derived expression columns (e.g. col1 + col2) have correct types. Set use_dbc_columns_for_views: true to attempt a bulk dbc.ColumnsV fetch first and fall back to HELP only for views where any column has an unknown type. This can reduce HELP calls by 80–90 % on installations where most view columns have explicit types.
Profiling at scale
Profiling all tables in a large installation is impractical. Use profiling.limit (part of the standard GEProfilingConfig) to cap how many tables are profiled per run. You can also combine it with profile_pattern to restrict profiling to specific schemas or tables.
profiling:
enabled: true
limit: 500
profile_pattern:
allow:
- "high_priority_db\\..*"
Lineage query scope
When databases is not set the connector automatically scopes DBC.QryLogV queries to the databases discovered during metadata extraction, filtered by database_pattern. This avoids scanning the entire audit log. You can further restrict the scope with an explicit databases list.
SQL parse cache size
When usage statistics or lineage are enabled, every query row from DBC.QryLogV is parsed with sqlglot to extract table references. Identical query text in a session (e.g. a BI dashboard query that runs thousands of times per day) hits an LRU cache and avoids re-parsing. The default cache holds 1 000 entries, which is too small for production Teradata installations where hundreds of distinct queries each execute thousands of times.
Set the DATAHUB_SQL_PARSE_CACHE_SIZE environment variable before running the pipeline to increase the cache:
export DATAHUB_SQL_PARSE_CACHE_SIZE=50000
datahub ingest -c teradata_recipe.yml
Each cache entry holds a parsed query result in memory. 50 000 entries typically uses 200–500 MB of additional heap depending on query complexity. Start with 10 000 if memory is constrained and increase until cache hit rates stabilise (visible in the ingestion report under sql_parsing_cache_stats).
Connection timeouts
Use request_timeout_ms and connect_timeout_ms to tune the Teradata driver timeouts. Increase request_timeout_ms (default: 120 000 ms) if lineage queries against large DBC.QryLogV tables time out silently.
Limitations
use_dbc_columns_for_viewsfalls back toHELPfor any view that contains derived expression columns. Views with only explicit-type columns benefit most from this option.column_extraction_watermarkmust be managed manually — set it to the start time of the previous successful run. Usecolumn_extraction_days_backinstead if you want a self-maintaining schedule-relative window.column_extraction_watermarkandcolumn_extraction_days_backare mutually exclusive. Setting both raises a validation error at startup.- Profiling capped by
profiling.limitdoes not prioritise tables — they are profiled in the order they are returned bydbc.TablesV. Useprofile_patternto target specific schemas if order matters.
Troubleshooting
If ingestion fails, validate credentials, permissions, connectivity, and scope filters first. Then review ingestion logs for source-specific errors and adjust configuration accordingly.
If lineage queries fail silently and return no results, increase request_timeout_ms. The default 2-minute timeout can be insufficient for DBC.QryLogV on busy systems with large audit logs.
Code Coordinates
- Class Name:
datahub.ingestion.source.sql.teradata.TeradataSource - Browse on GitHub
If you've got any questions on configuring ingestion for Teradata, feel free to ping us on our Slack.
This page is auto-generated from the underlying source code. To make changes, please edit the relevant source files in the metadata-ingestion directory.
Tip: For quick typo fixes or documentation updates, you can click the ✏️ Edit icon directly in the GitHub UI to open a Pull Request. For larger changes and PR naming conventions, please refer to our Contributing Guide.