Configure the OSDU extractor

To configure the OSDU extractor, you must create a configuration file in YAML format. The file is split into sections, each represented by a top-level entry in the YAML file.

:::caution Naming the configuration file

You must name the configuration file `config.yml`.

:::

You can use the sample configuration file included with the extractor as a starting point for your configuration settings. At a minimum, you must adjust these settings before you run the extractor:

```yaml showLineNumbers
connector:
  extract:
    raw-database: Enter the name of the target CDF RAW database.
    dataset-id: Enter the ID of the target CDF data set (16-digit integer).
    kinds: Check that this matches the list of OSDU schema IDs to extract.
```

:::tip Tip

You can set up extraction pipelines to use versioned extractor configuration files stored in the cloud.

:::

cognite

Include the cognite section to configure which CDF project the extractor will load data into and how to connect to the project. This section is mandatory and should always contain the project and authentication configuration.

| Parameter | Description |
| --- | --- |
| `host` | Insert the base URL of the CDF project. The default value is `https://api.cognitedata.com`. |
| `project` | Insert the CDF project name you want to ingest data into. |
| `timeout` | Specify the number of seconds to wait for a response to a request made to CDF. The default value is 30 seconds. |
| `idp-authentication` | Insert the credentials for authenticating to CDF using an external identity provider. |
| `client-id` | Enter the client ID from the IdP. |
| `scopes` | List the scopes. This is usually `[{host}/.default]`. |
| `secret` | Enter the client secret from the IdP. |
| `token-url` | Insert the URL to fetch authentication tokens from. |
| `extraction-pipeline` | Insert the external ID of an extraction pipeline in CDF. You should create the extraction pipeline before you configure this section. This parameter is optional. |
| `external-id` | Enter the external ID of the extraction pipeline in CDF. This parameter is optional. |
| `id` | Enter the ID of the extraction pipeline in CDF. This parameter is optional. |
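
As an illustration, a `cognite` section built from the parameters above might look like the following sketch. It assumes that `client-id`, `secret`, `scopes`, and `token-url` are nested under `idp-authentication`, and that `external-id` is nested under `extraction-pipeline`. The project name, the pipeline external ID, and the values in angle brackets are hypothetical placeholders, not values from the extractor.

```yaml
cognite:
  host: https://api.cognitedata.com    # documented default
  project: my-cdf-project              # hypothetical project name
  timeout: 30                          # seconds; documented default
  idp-authentication:                  # assumed parent of the four keys below
    client-id: "<client-id>"
    secret: "<client-secret>"
    scopes:
      - https://api.cognitedata.com/.default   # {host}/.default
    token-url: "<token-url>"
  extraction-pipeline:                 # optional
    external-id: osdu-extractor        # hypothetical external ID
```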

osdu-client

Include the osdu-client section to configure the connection to the OSDU platform.

authentication

| Parameter | Description |
| --- | --- |
| `api-url` | Insert the base URL of the OSDU API. |
| `client-id` | Enter the OSDU client ID. |
| `client-secret` | Enter the OSDU client secret. |
| `data-partition` | Enter the name of the OSDU data partition. |
| `scope` | Specify the OSDU scope. |
| `tenant-id` | Enter the Azure tenant ID. This parameter is optional. |
| `timeout` | Enter the maximum time in seconds to wait for a response to a request made to OSDU. The default value is 30 seconds. |
| `token-url` | Insert the URL to fetch authentication tokens from. |
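
A minimal sketch of the authentication settings, assuming that `authentication` is a subsection nested under the top-level `osdu-client` entry. The partition name and all values in angle brackets are hypothetical placeholders.

```yaml
osdu-client:
  authentication:                 # assumed to nest under osdu-client
    api-url: "<osdu-api-base-url>"
    client-id: "<osdu-client-id>"
    client-secret: "<osdu-client-secret>"
    data-partition: my-partition  # hypothetical partition name
    scope: "<osdu-scope>"
    tenant-id: "<azure-tenant-id>"   # optional
    timeout: 30                   # seconds; documented default
    token-url: "<token-url>"
```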

services

| Parameter | Description |
| --- | --- |
| `cursor-fetch-size` | Specify the number of search results to fetch in each request. |
| `dms-parallelism` | Insert the number of parallel threads hitting the DMS API. The default value is 6. |
| `generic-parallelism` | Insert the number of parallel threads hitting the generic API. The default value is 32. |
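
A short sketch of the service settings, assuming that `services` sits under the same `osdu-client` entry as `authentication`. The `cursor-fetch-size` value is only an illustration, since no default is documented here.

```yaml
osdu-client:
  services:                    # assumed to nest under osdu-client
    cursor-fetch-size: 1000    # illustrative value, not a documented default
    dms-parallelism: 6         # documented default
    generic-parallelism: 32    # documented default
```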

connector

Include the connector section to configure the general settings for the extractor.

| Parameter | Description |
| --- | --- |
| `sleep-time` | Enter the number of seconds to pause between polls for changes. |
| `total-parallelism` | Enter the maximum number of threads employed. The default value is 64. |
| `upload-queue-size` | Enter the number of items to accumulate in a queue between extraction and writing to CDF RAW and CDF Files. |
| `upload-queue-interval` | Enter the number of seconds between uploads of a queue when the queue size has not been reached. |
| `extract` | Settings specific to the extraction direction. |
| `raw-database` | Enter the database name in CDF RAW that stores the extracted OSDU records. |
| `statestore-table` | Enter the name of the auxiliary CDF RAW table that stores the state of the extractor. |
| `dataset-id` | Insert the data set ID for the extracted data files. |
| `kinds` | List the OSDU data types to extract. Each kind has the following settings: |
| `name-pattern` | Enter the name / schema ID of the OSDU kind. You can include multiple kinds in a single entry by entering a pattern with Unix shell-style wildcards (`*` and `?`). For example, `osdu:wks:master-data--Well:*` matches any version of `osdu:wks:master-data--Well`, while `osdu:wks:master-data--Well*` also matches `osdu:wks:master-data--Wellbore`. |
| `filter` | Enter a query following the Lucene syntax to filter which records to extract from OSDU, for example `"createTime:[2022-04-19T16 TO *]"` or `"data.Source:\"BLENDED\""`. This parameter is optional. |
| `dms-kind` | Use this parameter when the data for the kind is stored in a DDMS instead of generic OSDU files. Supported values are `wellbore_dms_well_log` and `wellbore_dms_trajectory`. This parameter is optional. |
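
The following sketch shows how these settings might fit together. It assumes that `kinds` is a list in which each entry carries its own `name-pattern`, `filter`, and `dms-kind`. The queue and sleep values, the database and table names, the data set ID, and the well log kind are illustrative placeholders rather than defaults.

```yaml
connector:
  sleep-time: 60                 # illustrative value
  total-parallelism: 64          # documented default
  upload-queue-size: 1000        # illustrative value
  upload-queue-interval: 30      # illustrative value
  extract:
    raw-database: osdu-records       # hypothetical RAW database name
    statestore-table: osdu-state     # hypothetical state table name
    dataset-id: 1234567890123456     # placeholder 16-digit data set ID
    kinds:
      # Any version of the Well kind, restricted by a Lucene filter
      - name-pattern: "osdu:wks:master-data--Well:*"
        filter: "data.Source:\"BLENDED\""
      # A kind whose data is stored in a DDMS (kind name is illustrative)
      - name-pattern: "osdu:wks:work-product-component--WellLog:*"
        dms-kind: wellbore_dms_well_log
```

Quoting each `name-pattern` keeps the colons and the `*` wildcard from being interpreted by the YAML parser.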

logger

Include the logger section to set up logging to a console and files.

| Parameter | Description |
| --- | --- |
| `console` | Enable logging to standard output, such as a terminal window. This parameter is optional. |
| `level` | Select the verbosity level for console logging. Valid options are `debug`, `info`, `warning`, and `error`. The default value is `info`. |
| `file` | Enable logging to a file. This parameter is optional. |
| `level` | Select the verbosity level for file logging. Valid options are `debug`, `info`, `warning`, and `error`. The default value is `info`. |
| `log_json` | Set to `true` to enable logging in JSON format. The default value is `false`. |
| `path` | Insert the file system path to the log file. |
| `retention` | Specify the maximum number of days to retain logs. The default value is 7. |
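
A minimal sketch of a `logger` section that logs to both the console and a file. It assumes that the first `level` belongs to `console` and that the second `level`, `path`, and `retention` belong to `file`. The log path is a hypothetical example. `log_json` is left out because its position in the hierarchy is not clear from the table.

```yaml
logger:
  console:
    level: info                     # documented default
  file:
    level: debug
    path: logs/osdu-extractor.log   # hypothetical path
    retention: 7                    # days; documented default
```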

metrics

Include the metrics section to send metrics about the extractor's performance for remote monitoring. This section is optional. We recommend sending metrics to a Prometheus pushgateway, but you can also send metrics as time series in the CDF project.

| Parameter | Description |
| --- | --- |
| `cognite` | Cognite metrics configurations. This parameter is optional. |
| `external_id_prefix` | Enter an external ID prefix to identify the CDF time series created for each metric. |
| `push-interval` | Enter the interval in seconds between each push. The default value is 30. |
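
A short sketch of sending metrics to CDF as time series, assuming that `external_id_prefix` and `push-interval` are nested under `cognite`. The prefix is a hypothetical example. Pushgateway settings are not shown because the table above does not list them.

```yaml
metrics:
  cognite:
    external_id_prefix: "osdu-extractor:"   # hypothetical prefix
    push-interval: 30                        # seconds; documented default
```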