Configuration settings¶

To configure the DB extractor, you must create a configuration file. The file must be in YAML format.

:::tip Tip
You can set up extraction pipelines to use versioned extractor configuration files stored in the cloud.
:::

Using values from environment variables¶

The configuration file allows substitutions with environment variables. For example:

cognite:
  secret: ${COGNITE_CLIENT_SECRET}

will load the value from the COGNITE_CLIENT_SECRET environment variable into the cognite/secret parameter. You can also do string interpolation with environment variables, for example:

url: http://my-host.com/api/endpoint?secret=${MY_SECRET_TOKEN}

:::info Note
Implicit substitutions only work for unquoted value strings. For quoted strings, use the !env tag to activate environment substitution:

url: !env 'http://my-host.com/api/endpoint?secret=${MY_SECRET_TOKEN}'

:::

Using values from Azure Key Vault¶

The DB extractor also supports loading values from Azure Key Vault. To load a configuration value from Azure Key Vault, use the !keyvault tag followed by the name of the secret you want to load. For example, to load the value of the my-secret-name secret in Key Vault into a password parameter, configure your extractor like this:

password: !keyvault my-secret-name

To use Key Vault, you also need to include the azure-keyvault section in your configuration, with the following parameters:

Parameter	Description
`keyvault-name`	Name of Key Vault to load secrets from
`authentication-method`	How to authenticate to Azure. Either `default` or `client-secret`. For `default`, the extractor will look at the user running the extractor, and look for pre-configured Azure logins from tools like the Azure CLI. For `client-secret`, the extractor will authenticate with a configured client ID/secret pair.
`client-id`	Required for using the `client-secret` authentication method. The client ID to use when authenticating to Azure.
`secret`	Required for using the `client-secret` authentication method. The client secret to use when authenticating to Azure.
`tenant-id`	Required for using the `client-secret` authentication method. The tenant ID of the Key Vault in Azure.

Example:

azure-keyvault:
  keyvault-name: my-keyvault-name
  authentication-method: client-secret
  tenant-id: 6f3f324e-5bfc-4f12-9abe-22ac56e2e648
  client-id: 6b4cc73e-ee58-4b61-ba43-83c4ba639be6
  secret: 1234abcd
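
As a sketch of the `default` authentication method described above, the fragment below leaves out `client-id`, `secret`, and `tenant-id`, which the table lists as required only for `client-secret`. The vault name, secret name, and the `databases` entry are placeholders, and the example assumes the user running the extractor already has a usable Azure login, for example from the Azure CLI.

```yaml
azure-keyvault:
  keyvault-name: my-keyvault-name    # placeholder vault name
  authentication-method: default     # rely on a pre-configured Azure login

databases:
  - type: postgres
    name: postgres-db
    host: pg.company.com
    user: postgres
    password: !keyvault my-secret-name   # value loaded from the my-secret-name secret
```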

Base configuration object

Parameter	Type	Description
`version`	either string or integer	Configuration file version
`type`	either `local` or `remote`	Configuration file type. Either `local`, meaning the full config is loaded from this file, or `remote`, which means that only the `cognite` section is loaded from this file, and the rest is loaded from extraction pipelines. Default value is `local`.
`cognite`	object	The cognite section describes which CDF project the extractor will load data into and how to connect to the project.
`logger`	object	The optional `logger` section sets up logging to a console and files.
`metrics`	object	The `metrics` section describes where to send metrics on extractor performance for remote monitoring of the extractor. We recommend sending metrics to a Prometheus pushgateway, but you can also send metrics as time series in the CDF project.
`queries`	list	List of queries to execute
`databases`	list	List of databases to connect to
`extractor`	object	General extractor configuration
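
To illustrate the `remote` type, here is a minimal sketch of a local file that holds only the `cognite` section, with everything else expected to come from the extraction pipeline. The `version` value, project name, pipeline external ID, and environment variable names are placeholders chosen for this example, not values confirmed by the extractor.

```yaml
version: 1          # placeholder; use the version your extractor expects
type: remote        # only the cognite section is read from this file

cognite:
  project: my-project
  idp-authentication:
    client-id: ${COGNITE_CLIENT_ID}
    secret: ${COGNITE_CLIENT_SECRET}
    token-url: ${COGNITE_TOKEN_URL}
    scopes:
      - ${COGNITE_SCOPE}
  extraction-pipeline:
    external-id: my-db-extractor-pipeline   # pipeline that stores the rest of the config
```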

`cognite`¶

Global parameter.

The cognite section describes which CDF project the extractor will load data into and how to connect to the project.

Parameter	Type	Description
`project`	string	Insert the CDF project name.
`idp-authentication`	object	The `idp-authentication` section enables the extractor to authenticate to CDF using an external identity provider (IdP), such as Microsoft Entra ID (formerly Azure Active Directory).
`data-set`	object	Enter a data set the extractor should write data into
`extraction-pipeline`	object	Enter the extraction pipeline used for remote config and reporting statuses
`host`	string	Insert the base URL of the CDF project. Default value is `https://api.cognitedata.com`.
`timeout`	integer	Enter the timeout on requests to CDF, in seconds. Default value is `30`.
`external-id-prefix`	string	Prefix on external ID used when creating CDF resources
`connection`	object	Configure network connection details

`idp-authentication`¶

Part of cognite configuration.

The idp-authentication section enables the extractor to authenticate to CDF using an external identity provider (IdP), such as Microsoft Entra ID (formerly Azure Active Directory).

Parameter	Type	Description
`authority`	string	Insert the authority together with `tenant` to authenticate against Azure tenants. Default value is `https://login.microsoftonline.com/`.
`client-id`	string	Required. Enter the service principal client ID from the IdP.
`tenant`	string	Enter the Azure tenant.
`token-url`	string	Insert the URL to fetch tokens from.
`secret`	string	Enter the service principal client secret from the IdP.
`resource`	string	Resource parameter passed along with token requests.
`audience`	string	Audience parameter passed along with token requests.
`scopes`	list	Enter a list of scopes requested for the token
`min-ttl`	integer	Insert the minimum time in seconds a token will be valid. If the cached token expires in less than `min-ttl` seconds, it will be refreshed even if it is still valid. Default value is `30`.
`certificate`	object	Authenticate with a client certificate

`scopes`¶

Part of idp-authentication configuration.

Enter a list of scopes requested for the token

Each element of this list should be a string.

`certificate`¶

Part of idp-authentication configuration.

Authenticate with a client certificate

Parameter	Type	Description
`authority-url`	string	Authentication authority URL
`path`	string	Required. Enter the path to the .pem or .pfx certificate to be used for authentication
`password`	string	Enter the password for the key file, if it is encrypted.
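
Putting the `cognite` and `idp-authentication` parameters together, a client-secret setup against a Microsoft Entra ID tenant might look like the sketch below. The project name and environment variable names are placeholders, and the scope value is an assumption based on the default `host`; check which scope your IdP expects.

```yaml
cognite:
  project: my-project
  host: https://api.cognitedata.com
  idp-authentication:
    tenant: ${AZURE_TENANT_ID}        # combined with the default authority
    client-id: ${COGNITE_CLIENT_ID}
    secret: ${COGNITE_CLIENT_SECRET}
    scopes:
      - https://api.cognitedata.com/.default   # assumed scope for the default host
    min-ttl: 60                       # refresh tokens with less than 60 seconds left
```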

`data-set`¶

Part of cognite configuration.

Enter a data set the extractor should write data into

Parameter	Type	Description
`id`	integer	Resource internal id
`external-id`	string	Resource external id

`extraction-pipeline`¶

Part of cognite configuration.

Enter the extraction pipeline used for remote config and reporting statuses

Parameter	Type	Description
`id`	integer	Resource internal id
`external-id`	string	Resource external id

`connection`¶

Part of cognite configuration.

Configure network connection details

Parameter	Type	Description
`disable-gzip`	boolean	Whether or not to disable gzipping of JSON bodies.
`status-forcelist`	string	HTTP status codes to retry. Defaults to 429, 502, 503 and 504.
`max-retries`	integer	Max number of retries on a given HTTP request. Default value is `10`.
`max-retries-connect`	integer	Max number of retries on connection errors. Default value is `3`.
`max-retry-backoff`	integer	The retry strategy employs exponential backoff. This parameter sets the maximum backoff after any request failure. Default value is `30`.
`max-connection-pool-size`	integer	The maximum number of connections which will be kept in the SDK's connection pool. Default value is `50`.
`disable-ssl`	boolean	Whether or not to disable SSL verification.
`proxies`	object	Dictionary mapping from protocol to URL.

`proxies`¶

Part of connection configuration.

Dictionary mapping from protocol to URL.
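
A `connection` block that tightens the retry settings and routes traffic through a proxy could be sketched as follows. The proxy address is hypothetical, and using `http` and `https` as the protocol keys is an assumption, since the reference only says the mapping goes from protocol to URL.

```yaml
cognite:
  # project and idp-authentication omitted for brevity
  connection:
    max-retries: 5
    max-retries-connect: 3
    max-retry-backoff: 60
    max-connection-pool-size: 20
    proxies:
      http: http://proxy.example.com:8080    # hypothetical proxy
      https: http://proxy.example.com:8080
```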

`logger`¶

Global parameter.

The optional logger section sets up logging to a console and files.

Parameter	Type	Description
`console`	object	Include the console section to enable logging to standard output, such as a terminal window.
`file`	object	Include the file section to enable logging to a file. The files are rotated daily.
`metrics`	boolean	Enables metrics on the number of log messages recorded per logger and level. This requires `metrics` to be configured as well

`console`¶

Part of logger configuration.

Include the console section to enable logging to standard output, such as a terminal window.

Parameter	Type	Description
`level`	either `DEBUG`, `INFO`, `WARNING`, `ERROR` or `CRITICAL`	Select the verbosity level for console logging. Valid options, in decreasing verbosity levels, are `DEBUG`, `INFO`, `WARNING`, `ERROR`, and `CRITICAL`. Default value is `INFO`.

`file`¶

Part of logger configuration.

Include the file section to enable logging to a file. The files are rotated daily.

Parameter	Type	Description
`level`	either `DEBUG`, `INFO`, `WARNING`, `ERROR` or `CRITICAL`	Select the verbosity level for file logging. Valid options, in decreasing verbosity levels, are `DEBUG`, `INFO`, `WARNING`, `ERROR`, and `CRITICAL`. Default value is `INFO`.
`path`	string	Required. Insert the path to the log file.
`retention`	integer	Specify the number of days to keep logs for. Default value is `7`.
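
For example, a `logger` section that keeps console output at `INFO` while writing more detailed logs to a file might look like this. The log path and retention period are placeholder values.

```yaml
logger:
  console:
    level: INFO
  file:
    level: DEBUG
    path: logs/db-extractor.log   # placeholder path; files are rotated daily
    retention: 14                 # keep two weeks of log files
```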

`metrics`¶

Global parameter.

The metrics section describes where to send metrics on extractor performance for remote monitoring of the extractor. We recommend sending metrics to a Prometheus pushgateway, but you can also send metrics as time series in the CDF project.

Parameter	Type	Description
`push-gateways`	list	List of Prometheus pushgateway configurations
`cognite`	object	Push metrics to CDF time series. Requires CDF credentials to be configured
`server`	object	The extractor can also be configured to expose an HTTP server with Prometheus metrics for scraping

`push-gateways`¶

Part of metrics configuration.

List of Prometheus pushgateway configurations

The push-gateways section contains a list of metric destinations. Each element of this list configures one pushgateway.

Parameter	Type	Description
`host`	string	Enter the address of the host to push metrics to.
`job-name`	string	Enter the value of the `exported_job` label to associate metrics with. This separates several deployments on a single pushgateway, and should be unique.
`username`	string	Enter the username for the pushgateway credentials.
`password`	string	Enter the password for the pushgateway credentials.
`clear-after`	either null or integer	Enter the number of seconds to wait before clearing the pushgateway. When this parameter is present, the extractor will stall after the run is complete before deleting all metrics from the pushgateway. The recommended value is at least twice that of the scrape interval on the pushgateway. This is to ensure that the last metrics are gathered before the deletion. Default is disabled.
`push-interval`	integer	Enter the interval in seconds between each push. Default value is `30`.
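
A single pushgateway destination could be configured as in the sketch below. The host, job name, and environment variable names are hypothetical. The `clear-after` value assumes a 30-second scrape interval on the pushgateway, following the recommendation of at least twice the scrape interval.

```yaml
metrics:
  push-gateways:
    - host: https://pushgateway.example.com   # hypothetical address
      job-name: db-extractor-site-a           # should be unique per deployment
      username: ${PUSHGATEWAY_USER}
      password: ${PUSHGATEWAY_PASSWORD}
      push-interval: 30
      clear-after: 60                         # 2 x an assumed 30 s scrape interval
```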

`cognite`¶

Part of metrics configuration.

Push metrics to CDF time series. Requires CDF credentials to be configured

Parameter	Type	Description
`external-id-prefix`	string	Required. Prefix on external ID used when creating CDF time series to store metrics.
`asset-name`	string	Enter the name for a CDF asset that will have all the metrics time series attached to it.
`asset-external-id`	string	Enter the external ID for a CDF asset that will have all the metrics time series attached to it.
`push-interval`	integer	Enter the interval in seconds between each push to CDF. Default value is `30`.
`data-set`	object	Data set the metrics will be created under

`data-set`¶

Part of cognite configuration.

Data set the metrics will be created under

Parameter	Type	Description
`id`	integer	Resource internal id
`external-id`	string	Resource external id

`server`¶

Part of metrics configuration.

The extractor can also be configured to expose an HTTP server with Prometheus metrics for scraping

Parameter	Type	Description
`host`	string	Host to run the Prometheus server on. Default value is `0.0.0.0`.
`port`	integer	Local port to expose the Prometheus server on. Default value is `9000`.
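
The other two metric destinations can be combined in one `metrics` section. The following sketch pushes metrics to CDF time series and also exposes a local scrape endpoint. All names and external IDs are placeholders.

```yaml
metrics:
  cognite:
    external-id-prefix: db-extractor-metrics-   # prefix for the created time series
    asset-name: DB extractor metrics
    asset-external-id: db-extractor-metrics
    push-interval: 60
    data-set:
      external-id: extractor-monitoring         # placeholder data set
  server:
    host: 127.0.0.1    # listen on localhost only instead of the 0.0.0.0 default
    port: 9000
```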

`queries`¶

Global parameter.

List of queries to execute

Each element of this list should be a description of a SQL query against a database

Parameter	Type	Description
`database`	string	Required. Enter the name of the database to connect to. This must be one of the database names configured in the `databases` section.
`name`	string	Required. Enter a name of this query that will be used for logging and tagging metrics. The name must be unique for each query in the configuration file.
`query`	string	Required. SQL query to execute. Supports interpolation with `{incremental_field}` and `{start_at}`
`destination`	configuration for either RAW, Events, Assets, Time series, Sequence or Files	Required. The destination of the data in CDF. Examples: `{'destination': {'type': 'raw', 'database': 'my-database', 'table': 'my-table'}}` `{'destination': {'type': 'events'}}`
`primary-key`	string	Insert the format of the row key in CDF RAW. This parameter supports case-sensitive substitutions with values from the table columns. For example, if there's a column called index, setting `primary-key: row_{index}` will result in rows with keys `row_0`, `row_1`, etc. This is a required value if the destination is a `raw` type. Example: `row_{index}`
`incremental-field`	string	Insert the table column that holds the incremental field. Include to enable incremental loading, otherwise the extractor will default to a full run every time. To use incremental load, a state store is required
`freshness-field`	string	Which column to use for freshness metric. Must be specified along with freshness-field-timezone
`freshness-field-timezone`	string	Timezone to use for freshness metric
`initial-start`	either string, number or integer	Enter the value of `{start_at}` for the first run. Subsequent runs use the value stored in the state store. Required when `incremental-field` is set.
`schedule`	configuration for either Fixed interval or CRON expression	Enter the schedule for when this query should run. Make sure not to schedule runs too often, but leave some room for the previous execution to be done. Required when running in continuous mode, ignored otherwise. Examples: `{'schedule': {'type': 'interval', 'expression': '1h'}}` `{'schedule': {'type': 'cron', 'expression': '0 7-17 * * 1-5'}}`
`collection`	string	Specify the collection on which the query will be executed. This parameter is mandatory when connecting to `mongodb` databases.
`container`	string	Specify the container on which the query will be executed. This parameter is mandatory when connecting to `cosmosdb` databases.
`sheet`	string	Specify the sheet on which the query will be executed. This parameter is mandatory when connecting to `spreadsheet` files.
`skip_rows`	string	Specify the number of rows to be skipped when reading a spreadsheet. This parameter is optional when connecting to `spreadsheet` files.
`has_header`	string	Specify if the extractor should skip the file header while reading a spreadsheet. This parameter is optional when connecting to `spreadsheet` files.
`parameters`	string	Specify the parameters to be used when querying AWS DynamoDB. This parameter is mandatory when connecting to `dynamodb` databases.
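
As an illustration of how these parameters fit together, the sketch below defines one incremental query that writes to RAW every 15 minutes. The table and column names (`orders`, `order_id`) are hypothetical. How `{start_at}` should be quoted in the SQL depends on the column type and the database, and the state store that incremental loading requires is configured elsewhere and not shown here.

```yaml
queries:
  - name: incremental-orders          # unique name used in logs and metrics
    database: postgres-db             # must match a name in the databases section
    query: >
      SELECT * FROM orders
      WHERE {incremental_field} > {start_at}
      ORDER BY {incremental_field} ASC
    incremental-field: order_id
    initial-start: 0                  # used only on the first run
    primary-key: "order_{order_id}"   # row key in RAW, built from column values
    destination:
      type: raw
      database: my-database
      table: orders
    schedule:
      type: interval
      expression: 15m
```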

`destination`¶

Part of queries configuration.

The destination of the data in CDF.

Either one of the following options:

- RAW
- Events
- Assets
- Time series
- Sequence
- Files

Examples:

destination:
  type: raw
  database: my-database
  table: my-table

destination:
  type: events

`raw`¶

Part of destination configuration.

The raw destination writes data to the CDF staging area (RAW). The raw destination requires the primary-key parameter in the query configuration.

Parameter	Type	Description
`type`	always `raw`	Type of CDF destination, set to `raw` to write data to RAW.
`database`	string	Required. Enter the CDF RAW database to upload data into. This will be created if it doesn't exist.
`table`	string	Required. Enter the CDF RAW table to upload data into. This will be created if it doesn't exist.

`events`¶

Part of destination configuration.

The events destination inserts the resulting data as CDF events. The events destination is configured by setting the type parameter to events. No other parameters are required.

To ingest data into events, the query must produce a column named

* externalId

In addition, columns named

* startTime
* endTime
* description
* source
* type
* subType

may be included and will be mapped to corresponding fields in CDF events. Any other columns returned by the query will be mapped to key/value pairs in the metadata field for events.

Parameter	Type	Description
`type`	always `events`	Type of CDF destination, set to `events` to write data to events.
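
One way to produce the expected column names is to alias them in the SQL itself. The sketch below assumes a hypothetical `maintenance_log` table in a PostgreSQL database, where double-quoted aliases preserve the camelCase names. The unaliased `crew` column would end up in the event metadata.

```yaml
queries:
  - name: maintenance-events
    database: postgres-db
    query: >
      SELECT
        event_id   AS "externalId",
        started_at AS "startTime",
        ended_at   AS "endTime",
        summary    AS "description",
        crew
      FROM maintenance_log
    destination:
      type: events
```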

`assets`¶

Part of destination configuration.

The assets destination inserts the resulting data as CDF assets. The assets destination is configured by setting the type parameter to assets. No other parameters are required.

To ingest data into assets, the query must produce a column named

* name

In addition, columns named

* externalId
* parentExternalId
* description
* source

may be included and will be mapped to corresponding fields in CDF assets. Any other columns returned by the query will be mapped to key/value pairs in the metadata field for assets.

Parameter	Type	Description
`type`	always `assets`	Type of CDF destination, set to `assets` to write data to assets.

`time_series`¶

Part of destination configuration.

The time_series destination inserts the resulting data as data points in time series. The time series destination is configured by setting the type parameter to time_series. No other parameters are required.

To ingest data into a time series, the query must produce columns named

* externalId
* timestamp
* value

The extractor will insert data points into time series identified by the externalId column. If a time series does not exist, the extractor will create a minimal time series with only an external ID and the isString property inferred from the type of the first data point processed for that time series. All other time series attributes need to be added separately.

Parameter	Type	Description
`type`	always `time_series`	Type of CDF destination, set to `time_series` to write data to time series.
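
A query for this destination only needs to return the three required columns. The sketch below assumes a hypothetical `sensor_readings` table in a PostgreSQL database and aliases its columns to the required names.

```yaml
queries:
  - name: sensor-readings
    database: postgres-db
    query: >
      SELECT
        sensor_tag  AS "externalId",
        measured_at AS "timestamp",
        reading     AS "value"
      FROM sensor_readings
    destination:
      type: time_series
```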

`sequence`¶

Part of destination configuration.

The sequence destination writes data to a CDF sequence.

The column set of the query result will determine the columns of the sequence.

The result must include a column named row_number, which must include an integer indicating which row number in the sequence to ingest the row into.

Parameter	Type	Description
`type`	always `sequence`	Type of CDF destination, set to `sequence` to write data to a sequence.
`external-id`	string	Required. Configured sequence external ID
`value-types`	either `convert`, `drop` or `assert`	How types are converted into the expected types in CDF. Convert attempts to make a conversion, which may fail. Drop drops the row if there is a mismatch. Assert fails the query if the types do not match. Default value is `convert`.
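
For instance, a query feeding a sequence might alias an integer column to `row_number` and let the remaining columns define the sequence columns. The table, column names, and sequence external ID below are hypothetical.

```yaml
queries:
  - name: pump-curve
    database: postgres-db
    query: >
      SELECT point_index AS row_number, flow_rate, head
      FROM pump_curve_points
    destination:
      type: sequence
      external-id: pump-curve-p101   # placeholder sequence external ID
      value-types: drop              # skip rows whose types do not match
```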

`files`¶

Part of destination configuration.

The files destination inserts the resulting data as CDF files. The files destination is configured by setting the type parameter to files. No other parameters are required.

To ingest data into files, the query must produce columns named

* name
* externalId
* content

content will be treated as binary data and uploaded to CDF files as the content of the file.

In addition, columns named

* source
* mimeType
* directory
* sourceCreatedTime
* sourceModifiedTime
* asset_ids

may be included and will be mapped to corresponding fields in CDF files. Any other columns returned by the query will be mapped to key/value pairs in the metadata field for files.

Parameter	Type	Description
`type`	always `files`	Type of CDF destination, set to `files` to write data to CDF files.
`content-column`	string	Column used as file content. Default value is `content`.
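
If the binary data lives in a column that is not called `content`, `content-column` can point at it instead. The sketch below assumes a hypothetical `documents` table with a `pdf_blob` column in a PostgreSQL database.

```yaml
queries:
  - name: document-files
    database: postgres-db
    query: >
      SELECT
        file_name AS "name",
        file_key  AS "externalId",
        mime      AS "mimeType",
        pdf_blob
      FROM documents
    destination:
      type: files
      content-column: pdf_blob   # use this column as the file content
```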

`schedule`¶

Part of queries configuration.

Enter the schedule for when this query should run. Don't schedule runs too often; leave enough time for the previous execution to finish. Required when running in continuous mode, ignored otherwise.

Either one of the following options:

- Fixed interval
- CRON expression

Examples:

schedule:
  type: interval
  expression: 1h

schedule:
  type: cron
  expression: 0 7-17 * * 1-5

`fixed_interval`¶

Part of schedule configuration.

Parameter	Type	Description
`type`	always `interval`	Required. Type of time interval configuration. Use `interval` to configure a fixed interval.
`expression`	string	Required. Enter a time interval, with a unit. Available units are `s` (seconds), `m` (minutes), `h` (hours) and `d` (days). Examples: `45s` `15m` `2h`

`cron_expression`¶

Part of schedule configuration.

Parameter	Type	Description
`type`	always `cron`	Required. Type of time interval configuration. Use `cron` to configure a CRON schedule.
`expression`	string	Required. Enter a CRON expression. See crontab.guru for a guide on writing CRON expressions. Example: `*/15 8-16 * * *`
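
One YAML detail is worth noting when writing CRON schedules: an unquoted value cannot begin with `*`, because YAML reserves a leading `*` for aliases. An expression that starts with a wildcard therefore needs quotes, as in this sketch, which runs every 15 minutes during hours 8 through 16.

```yaml
schedule:
  type: cron
  # Quoted because a plain YAML scalar may not start with "*".
  expression: "*/15 8-16 * * *"
```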

`databases`¶

Global parameter.

List of databases to connect to

Each element of this list should be a configuration for a database the extractor will connect to

Either one of the following options:

- ODBC
- PostgreSQL
- Oracle DB
- Snowflake
- MongoDB
- Azure Cosmos DB
- Local spreadsheet files
- Amazon Dynamo DB
- Amazon Redshift
- Google BigQuery

Example:

databases:
- type: odbc
  name: my-odbc-database
  connection-string: DRIVER={Oracle 19.3};DBQ=localhost:1521/XE;UID=SYSTEM;PWD=oracle
- type: postgres
  name: postgres-db
  host: pg.company.com
  user: postgres
  password: secret123Pas$word

`odbc`¶

Part of databases configuration.

Open Database Connectivity (ODBC) is a generic protocol for querying databases. To connect to a database using ODBC, you must first download and install an ODBC driver for your database system on the machine running the extractor. Consult the documentation or contact the vendor of your database system to find its driver.

Example:

type: odbc
name: asset-database
connection-string: Driver={ODBC Driver 17 for SQL Server};Server=10.24.5.162;Database=assets;UID=extractorUser;PWD=myPassword;

Parameter	Type	Description
`type`	always `odbc`	Select the type of database connection. Set to `odbc` for ODBC databases.
`connection-string`	string	Required. Enter the ODBC connection string. This will differ between database vendors. Examples: `DRIVER={Oracle 19.3};DBQ=localhost:1521/XE;UID=SYSTEM;PWD=oracle` `DSN={MyDatabaseDsn}`
`response-encoding`	string	Override the encoding to expect on database responses if the driver does not adhere to the ODBC standard. Default is to follow the ODBC standard. Examples: `utf8` `iso-8859-1`
`query-encoding`	string	Override the encoding to use on database queries if the driver does not adhere to the ODBC standard. Default is to follow the ODBC standard. Examples: `utf8` `iso-8859-1`
`timeout`	integer	Enter the timeout in seconds for the ODBC connection and queries. The default value is no timeout. Some ODBC drivers don't accept either the `SQL_ATTR_CONNECTION_TIMEOUT` or the `SQL_ATTR_QUERY_TIMEOUT` option. The extractor will log an exception with the message `Could not set timeout on the ODBC driver - timeouts might not work properly`. Extractions will continue regardless but without timeouts. To avoid this log line, you can disable timeouts for the database causing these problems.
`batch-size`	integer	Enter the number of rows to fetch from the database at a time. You can decrease this number if the machine with the extractor runs out of memory. Note that this will increase the run time. Default value is `1000`.
`name`	string	Enter a name for the database that will be used throughout the `queries` section and for logging. The name must be unique for each database in the configuration file.
`timezone`	configuration for either local time zone, universal coordinated time or offset from UTC	Specify how the extractor should handle timestamps from the source when timezone data is absent. Either `local` for the local timezone on the machine the extractor is running on, `utc` for UTC, or a number for a numerical offset from UTC. Default value is `local`. Examples: `utc` `-8` `5.5`

`postgresql`¶

Part of databases configuration.

Example:

type: postgres
name: my-database
host: 10.42.39.12
user: extractor-user
password: mySecretPassword

Parameter	Type	Description
`type`	always `postgres`	Required. Type of database connection, set to `postgres` for PostgreSQL databases.
`host`	string	Required. Enter the hostname or address of the PostgreSQL database. Examples: `123.234.123.234` `postgres.my-domain.com` `localhost`
`user`	string	Required. Enter the username for the PostgreSQL database.
`password`	string	Required. Enter the password for the PostgreSQL database.
`database`	string	Enter the database name to use. The default is to use the user name.
`port`	integer	Enter the port to connect to. Default value is `5432`.
`timeout`	integer	Enter the timeout in seconds for the database connection and queries. The default value is no timeout.
`batch-size`	integer	Enter the number of rows to fetch from the database at a time. You can decrease this number if the machine with the extractor runs out of memory. Note that this will increase the run time. Default value is `1000`.
`name`	string	Enter a name for the database that will be used throughout the `queries` section and for logging. The name must be unique for each database in the configuration file.
`timezone`	configuration for either local time zone, universal coordinated time or offset from UTC	Specify how the extractor should handle timestamps from the source when timezone data is absent. Either `local` for the local timezone on the machine the extractor is running on, `utc` for UTC, or a number for a numerical offset from UTC. Default value is `local`. Examples: `utc` `-8` `5.5`
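
A fuller PostgreSQL entry that also sets the optional parameters might look like the sketch below. The host, database name, and environment variable name are placeholders. The smaller `batch-size` trades longer run time for lower memory use, as described above.

```yaml
databases:
  - type: postgres
    name: my-database
    host: postgres.my-domain.com
    port: 5432
    database: plant_data            # defaults to the user name if omitted
    user: extractor-user
    password: ${POSTGRES_PASSWORD}  # read from an environment variable
    timeout: 60                     # seconds, for connection and queries
    batch-size: 500
    timezone: utc                   # treat timestamps without timezone data as UTC
```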

`oracle_db`¶

Part of databases configuration.

The Cognite DB Extractor can connect directly to an Oracle Database version 12.1 or later.

Example:

type: oracle
name: my-database
host: 10.42.39.12
user: extractor-user
password: mySecretPassword

`snowflake`¶

Part of databases configuration.

Parameter	Type	Description
`type`	always `snowflake`	Type of database connection, set to `snowflake` for Snowflake data warehouses.
`user`	string	Required. User name for Snowflake
`password`	string	Required. Password for Snowflake
`account`	string	Required. Snowflake account ID
`organization`	string	Required. Snowflake organization name
`database`	string	Required. Snowflake database to use
`schema`	string	Required. Snowflake schema to use
`name`	string	Enter a name for the database that will be used throughout the `queries` section and for logging. The name must be unique for each database in the configuration file.
`timezone`	configuration for either local time zone, universal coordinated time or offset from UTC	Specify how the extractor should handle timestamps from the source when timezone data is absent. Either `local` for the local timezone on the machine the extractor is running on, `utc` for UTC, or a number for a numerical offset from UTC. Default value is `local`. Examples: `utc` `-8` `5.5`
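
A Snowflake entry using every required parameter could be sketched as follows. The account, organization, database, and schema values are placeholders, as is the environment variable name.

```yaml
databases:
  - type: snowflake
    name: my-snowflake
    user: extractor-user
    password: ${SNOWFLAKE_PASSWORD}
    account: my-account
    organization: my-organization
    database: MY_DATABASE
    schema: PUBLIC
    timezone: utc
```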

`mongodb`¶

Part of databases configuration.

Parameter	Type	Description
`type`	always `mongodb`	Type of database connection, set to `mongodb` for MongoDB databases.
`uri`	string	Required. Address and authentication data for the database as a Uniform Resource Identifier (URI). See the MongoDB documentation on connection strings for more about MongoDB URIs. Example: `mongodb://mymongo:port/?retryWrites=true&connectTimeoutMS=10000`
`database`	string	Required. Name of the related MongoDB database to use.
`name`	string	Enter a name for the database that will be used throughout the `queries` section and for logging. The name must be unique for each database in the configuration file.
`timezone`	configuration for either local time zone, universal coordinated time or offset from UTC	Specify how the extractor should handle timestamps from the source when timezone data is absent. Either `local` for the local timezone on the machine the extractor is running on, `utc` for UTC, or a number for a numerical offset from UTC. Default value is `local`. Examples: `utc` `-8` `5.5`

`azure_cosmos_db`¶

Part of databases configuration.

Parameter	Type	Description
`type`	always `cosmosdb`	Type of database connection, set to `cosmosdb` for Cosmos DB databases.
`host`	string	Required. Host address for the database Example: `https://my-cosmos-db.documents.azure.com`
`key`	string	Required. Azure Key used to connect to the Cosmos DB instance
`database`	string	Required. Database name to use
`name`	string	Enter a name for the database that will be used throughout the `queries` section and for logging. The name must be unique for each database in the configuration file.
`timezone`	configuration for either local time zone, universal coordinated time or offset from UTC	Specify how the extractor should handle timestamps from the source when timezone data is absent. Either `local` for the local timezone on the machine the extractor is running on, `utc` for UTC, or a number for a numerical offset from UTC. Default value is `local`. Examples: `utc` `-8` `5.5`

`local_spreadsheet_files`¶

Part of databases configuration.

The Cognite DB extractor can run against Excel spreadsheets and other files containing tabular data. The currently supported file types are:

* xlsx, xlsm and xlsb (modern Excel files)
* xls (legacy Excel files)
* odf, ods and odt (OpenDocument Format, used by e.g. LibreOffice and OpenOffice)
* csv (comma-separated values)

When using Excel or OpenDocument Format spreadsheets, you need to provide an additional sheet parameter in the associated query configuration.

Parameter	Type	Description
`type`	always `spreadsheet`	Type of connection, set to `spreadsheet` for local spreadsheet files.
`path`	string	Required. Path to a single spreadsheet file Examples: `/path/to/my/excel/file.xlsx` `./relative/path/file.csv` `C:\\Users\\Robert\\Documents\\spreadsheet.xls`
`name`	string	Enter a name for the database that will be used throughout the `queries` section and for logging. The name must be unique for each database in the configuration file.
`timezone`	configuration for either local time zone, universal coordinated time or offset from UTC	Specify how the extractor should handle timestamps from the source when timezone data is absent. Either `local` for the local timezone on the machine the extractor is running on, `utc` for UTC, or a number for a numerical offset from UTC. Default value is `local`. Examples: `utc` `-8` `5.5`

`amazon_dynamo_db`¶

Part of databases configuration.

Parameter	Type	Description
`type`	always `dynamodb`	Type of database connection, set to `dynamodb` for DynamoDB databases.
`aws-access-key-id`	string	Required. AWS authentication access key ID
`aws-secret-access-key`	string	Required. AWS authentication access key secret
`region-name`	string	Required. AWS region where your database is located. Example: `us-east-1`
`name`	string	Enter a name for the database that will be used throughout the `queries` section and for logging. The name must be unique for each database in the configuration file.
`timezone`	configuration for either local time zone, universal coordinated time or offset from UTC	Specify how the extractor should handle timestamps from the source when timezone data is absent. Either `local` for the local timezone on the machine the extractor is running on, `utc` for UTC, or a number for a numerical offset from UTC. Default value is `local`. Examples: `utc` `-8` `5.5`

`amazon_redshift`¶

Part of databases configuration.

Parameter	Type	Description
`type`	always `redshift`	Type of database connection, set to `redshift` for Redshift databases.
`aws-access-key-id`	string	Required. AWS authentication access key ID
`aws-secret-access-key`	string	Required. AWS authentication access key secret
`region-name`	string	Required. AWS region where your database is located. Example: `us-east-1`
`database`	string	Required. Redshift database
`secret-arn`	string	AWS Secret ARN
`cluster-identifier`	string	Name of the Redshift cluster to connect to. This parameter is required when connecting to a managed Redshift cluster.
`workgroup-name`	string	Name of the Redshift workgroup to connect to. This parameter is required when connecting to a Redshift Serverless database.
`name`	string	Enter a name for the database that will be used throughout the `queries` section and for logging. The name must be unique for each database in the configuration file.
`timezone`	configuration for either local time zone, universal coordinated time or offset from UTC	Specify how the extractor should handle timestamps from the source when timezone data is absent. Either `local` for the local timezone on the machine the extractor is running on, `utc` for UTC, or a number for a numerical offset from UTC. Default value is `local`. Examples: `utc` `-8` `5.5`
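
The sketch below shows how these parameters could fit together for a managed Redshift cluster. It assumes `databases` is a list of connection entries; the names `my-redshift`, `my_database`, `my-cluster`, and `my-workgroup`, as well as the environment variable names, are hypothetical.

```yaml
databases:
  - type: redshift
    # Hypothetical name; referenced from the queries section and in logs
    name: my-redshift
    # Credentials loaded from environment variables (variable names are assumptions)
    aws-access-key-id: ${AWS_ACCESS_KEY_ID}
    aws-secret-access-key: ${AWS_SECRET_ACCESS_KEY}
    region-name: us-east-1
    database: my_database
    # Managed cluster: identify the cluster by name
    cluster-identifier: my-cluster
    # Redshift Serverless: set workgroup-name instead of cluster-identifier
    # workgroup-name: my-workgroup
```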

`google_bigquery`¶

Part of databases configuration.

The Cognite DB Extractor can run queries against Google BigQuery using Google's SQL-like query syntax.

Because this connection builds on the Google SDK, you authenticate with Google's recommended authentication methods: set the `GOOGLE_APPLICATION_CREDENTIALS` environment variable to the path of your authentication key.

Parameter	Type	Description
`type`	always `bigquery`	Type of database connection, set to `bigquery` for Google BigQuery
`name`	string	Enter a name for the database that will be used throughout the `queries` section and for logging. The name must be unique for each database in the configuration file.
`timezone`	configuration for either local time zone, universal coordinated time or offset from UTC	Specify how the extractor should handle timestamps from the source when timezone data is absent. Either `local` for the local timezone on the machine the extractor is running on, `utc` for UTC, or a number for a numerical offset from UTC. Default value is `local`. Examples: `utc` `-8` `5.5`
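
A BigQuery entry needs very little configuration because authentication happens outside the file. The following is a minimal sketch, assuming `databases` is a list of connection entries; the name `my-bigquery` is a hypothetical placeholder.

```yaml
# GOOGLE_APPLICATION_CREDENTIALS must be set in the environment of the
# extractor process; it is not a parameter in this file.
databases:
  - type: bigquery
    # Hypothetical name; referenced from the queries section and in logs
    name: my-bigquery
    # Treat timestamps without timezone data as UTC
    timezone: utc
```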

`extractor`¶

Global parameter.

General extractor configuration

Parameter	Type	Description
`state-store`	object	Include the state store section to save extraction states between runs. Use this if data is loaded incrementally. We support multiple state stores, but you can only configure one at a time.
`upload-queue-size`	integer	Maximum size of upload queue. Upload to CDF will be triggered once this limit is reached. Default value is `100000`.
`parallelism`	integer	Maximum number of parallel queries. Default value is `4`.
`mode`	either `continuous` or `single`	Extractor mode. `continuous` runs the configured queries on the schedule configured for each query. `single` runs each query once.
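
For illustration, an `extractor` section that sets each of these parameters could look like the sketch below. The values for `upload-queue-size` and `parallelism` are the documented defaults; the state store path is a hypothetical example.

```yaml
extractor:
  # Run queries on their configured schedules instead of once each
  mode: continuous
  # Maximum number of queries to run in parallel (documented default)
  parallelism: 4
  # Trigger an upload to CDF when the queue reaches this size (documented default)
  upload-queue-size: 100000
  # Only one state store can be configured at a time
  state-store:
    local:
      path: ./states.json   # hypothetical path
```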

`state-store`¶

Part of extractor configuration.

Include the state store section to save extraction states between runs. Use this if data is loaded incrementally. We support multiple state stores, but you can only configure one at a time.

Parameter	Type	Description
`raw`	object	A RAW state store stores the extraction state in a table in CDF RAW.
`local`	object	A local state store stores the extraction state in a JSON file on the local machine.

`raw`¶

Part of state-store configuration.

A RAW state store stores the extraction state in a table in CDF RAW.

Parameter	Type	Description
`database`	string	Required. Enter the database name in CDF RAW.
`table`	string	Required. Enter the table name in CDF RAW.
`upload-interval`	integer	Enter the interval in seconds between each upload to CDF RAW. Default value is `30`.
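
A RAW state store could be configured along these lines. This is a sketch: the database and table names are hypothetical, and `upload-interval` is shown at its documented default.

```yaml
extractor:
  state-store:
    raw:
      # Hypothetical CDF RAW database and table names
      database: extractor-states
      table: db-extractor
      # Seconds between each upload of state to CDF RAW (documented default)
      upload-interval: 30
```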

`local`¶

Part of state-store configuration.

A local state store stores the extraction state in a JSON file on the local machine.

Parameter	Type	Description
`path`	string	Required. Enter the path to a JSON file.
`save-interval`	integer	Enter the interval in seconds between each save. Default value is `30`.
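
A local state store could be configured as in the following sketch. The file path is a hypothetical example, and `save-interval` is shown at its documented default.

```yaml
extractor:
  state-store:
    local:
      # Hypothetical path to the JSON file that holds the extraction state
      path: ./states.json
      # Seconds between each save to the file (documented default)
      save-interval: 30
```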
