Configure the OSDU extractor

To configure the OSDU extractor, you must create a configuration file in YAML format. The file is divided into sections, each represented by a top-level key.

:::caution Naming the configuration file

You must name the configuration file config.yml.

:::

You can use the sample configuration file included with the extractor as a starting point for your configuration settings. At a minimum, you must adjust these settings before you run the extractor:

```yaml showLineNumbers
connector:
  extract:
    raw-database: Enter the name of the target CDF RAW database.
    dataset-id: Enter the ID of the target CDF data set (16-digit integer).
    kinds: Check that this matches the list of OSDU schema IDs to extract.
```

:::tip Tip

You can set up extraction pipelines to use versioned extractor configuration files stored in the cloud.

:::

cognite

Include the cognite section to configure which CDF project the extractor will load data into and how to connect to the project. This section is mandatory and should always contain the project and authentication configuration.

| Parameter | Description |
| --- | --- |
| host | Insert the base URL of the CDF project. The default value is https://api.cognitedata.com. |
| project | Insert the CDF project name you want to ingest data into. |
| timeout | Specify the number of seconds to wait for a response to a request made to CDF. The default value is 30 seconds. |
| idp-authentication | Insert the credentials for authenticating to CDF using an external identity provider. |
| client-id | Enter the client ID from the IdP. |
| scopes | List the scopes. This is usually [{host}/.default]. |
| secret | Enter the client secret from the IdP. |
| token-url | Insert the URL to fetch authentication tokens from. |
| extraction-pipeline | Insert the external ID of an extraction pipeline in CDF. You should create the extraction pipeline before you configure this section. This parameter is optional. |
| external-id | Enter the external ID of the extraction pipeline in CDF. This parameter is optional. |
| id | Enter the ID of the extraction pipeline in CDF. This parameter is optional. |
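
As a rough sketch, the parameters above might be combined into a cognite section like the one below. The nesting of client-id, secret, scopes, and token-url under idp-authentication, and of external-id under extraction-pipeline, is inferred from the order of the rows in the table. The project name, credentials, token URL, and pipeline ID are hypothetical placeholders.

```yaml
cognite:
  # Base URL for CDF; this is the documented default value.
  host: https://api.cognitedata.com
  # Hypothetical project name.
  project: my-cdf-project
  # Seconds to wait for a response from CDF (documented default).
  timeout: 30
  idp-authentication:
    # Placeholder credentials; replace with the values issued by your IdP.
    client-id: 00000000-0000-0000-0000-000000000000
    secret: replace-with-client-secret
    # Placeholder token endpoint.
    token-url: https://idp.example.com/oauth2/token
    scopes:
      - https://api.cognitedata.com/.default
  extraction-pipeline:
    # Hypothetical external ID of an extraction pipeline that already exists in CDF.
    external-id: osdu-extractor-pipeline
```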

osdu-client

Include the osdu-client section to configure the connection to the OSDU platform.

authentication

| Parameter | Description |
| --- | --- |
| api-url | Insert the base URL of the OSDU API. |
| client-id | Enter the OSDU client ID. |
| client-secret | Enter the OSDU client secret. |
| data-partition | Enter the name of the OSDU data partition. |
| scope | Specify the OSDU scope. |
| tenant-id | Enter the Azure tenant ID. This parameter is optional. |
| timeout | Enter the maximum time in seconds to wait for a response to a request made to OSDU. The default value is 30 seconds. |
| token-url | Insert the URL to fetch authentication tokens from. |

services

| Parameter | Description |
| --- | --- |
| cursor-fetch-size | Specify the number of search results to fetch in each request. |
| dms-parallelism | Insert the number of parallel threads hitting the DMS API. The default value is 6. |
| generic-parallelism | Insert the number of parallel threads hitting the generic API. The default value is 32. |
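
A minimal sketch of a complete osdu-client section follows, assuming that authentication and services are nested keys under osdu-client, as the headings above suggest. The URLs, credentials, partition name, scope, and the cursor-fetch-size value are hypothetical placeholders. The timeout and parallelism values are the documented defaults.

```yaml
osdu-client:
  authentication:
    # Placeholder OSDU endpoint and credentials.
    api-url: https://osdu.example.com
    client-id: replace-with-osdu-client-id
    client-secret: replace-with-osdu-client-secret
    # Hypothetical data partition name.
    data-partition: my-partition
    scope: replace-with-osdu-scope
    # Optional; placeholder Azure tenant ID.
    tenant-id: 00000000-0000-0000-0000-000000000000
    # Seconds to wait for a response from OSDU (documented default).
    timeout: 30
    token-url: https://idp.example.com/oauth2/token
  services:
    # Hypothetical value; the table above does not state a default.
    cursor-fetch-size: 1000
    # Documented defaults.
    dms-parallelism: 6
    generic-parallelism: 32
```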

connector

Include the connector section to configure the general settings for the extractor.

| Parameter | Description |
| --- | --- |
| sleep-time | Enter the number of seconds to pause between polls for changes. |
| total-parallelism | Enter the maximum number of threads employed. The default value is 64. |
| upload-queue-size | Enter the number of items to accumulate in a queue between extraction and writing to CDF RAW and CDF Files. |
| upload-queue-interval | Enter the number of seconds between uploads of a queue when the size is not reached. |
| extract | Settings specific to the extraction direction. |
| raw-database | Enter the database name in CDF RAW that stores the extracted OSDU records. |
| statestore-table | Enter the name of the auxiliary CDF RAW table that stores the state of the extractor. |
| dataset-id | Insert the data set ID for the extracted data files. |
| kinds | List the OSDU data types to extract. Each kind has the following settings: |
| name-pattern | Enter the name (schema ID) of the OSDU kind. You can include multiple kinds in a single entry by entering a pattern with Unix shell-style wildcards (* and ?). For example, osdu:wks:master-data--Well*:* would match any version of osdu:wks:master-data--Well as well as osdu:wks:master-data--Wellbore. |
| filter | Enter a query following the Lucene syntax to filter which records to extract from OSDU, for example "createTime:[2022-04-19T16 TO *]" or "data.Source:\"BLENDED\"". This parameter is optional. |
| dms-kind | Use this parameter when the data for the kind is stored in a DDMS instead of generic OSDU files. Supported values are wellbore_dms_well_log and wellbore_dms_trajectory. This parameter is optional. |
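
The sketch below shows one way the connector settings above might fit together. It assumes that raw-database, statestore-table, dataset-id, and kinds nest under extract, and that kinds is a list with one entry per name pattern; both are inferred from the table rather than stated in it. The sleep time, queue settings, database and table names, data set ID, and the WellLog kind are hypothetical. The first name pattern and the filter are the examples given in the table.

```yaml
connector:
  # Hypothetical polling and queue settings.
  sleep-time: 300
  upload-queue-size: 10000
  upload-queue-interval: 60
  # Documented default.
  total-parallelism: 64
  extract:
    # Hypothetical RAW database, state table, and data set ID.
    raw-database: osdu
    statestore-table: osdu_extractor_state
    dataset-id: 1234567890123456
    kinds:
      # Wildcard pattern covering all versions of Well and Wellbore,
      # restricted by an optional Lucene filter.
      - name-pattern: "osdu:wks:master-data--Well*:*"
        filter: "createTime:[2022-04-19T16 TO *]"
      # Illustrative kind whose data is read from a DDMS.
      - name-pattern: "osdu:wks:work-product-component--WellLog:*"
        dms-kind: wellbore_dms_well_log
```

Quoting the name-pattern and filter values keeps characters such as `*`, `[`, and `:` from being interpreted as YAML syntax.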

logger

Include the logger section to set up logging to a console and files.

| Parameter | Description |
| --- | --- |
| console | Enable logging to standard output, such as a terminal window. This parameter is optional. |
| level | Select the verbosity level for console logging. Valid options are debug, info, warning, and error. The default value is info. |
| file | Enable logging to a file. This parameter is optional. |
| level | Select the verbosity level for file logging. Valid options are debug, info, warning, and error. The default value is info. |
| log_json | Set to true to enable logging in JSON format. The default value is false. |
| path | Insert the file system path to the log file. |
| retention | Specify the maximum number of days to retain logs. The default value is 7. |
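
For illustration, a logger section that writes informational messages to the console and more detailed messages to a file could look like the sketch below. It assumes that each level parameter nests under console or file, and that path and retention belong to file, as the row order suggests. The log path is a hypothetical example. The sketch leaves out log_json because the table does not make clear where it nests.

```yaml
logger:
  console:
    # Documented default verbosity.
    level: info
  file:
    # More verbose output for troubleshooting.
    level: debug
    # Hypothetical log file location.
    path: logs/osdu-extractor.log
    # Days to keep log files (documented default).
    retention: 7
```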

metrics

Include the metrics section to send extractor performance metrics for remote monitoring. This section is optional. We recommend sending metrics to a Prometheus pushgateway, but you can also send metrics as time series in the CDF project.

| Parameter | Description |
| --- | --- |
| cognite | Cognite metrics configurations. This parameter is optional. |
| external_id_prefix | Enter an external ID prefix to identify the CDF time series created for each metric. |
| push-interval | Enter the interval in seconds between each push. The default value is 30. |
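
A minimal sketch of sending metrics to CDF as time series is shown below. It assumes that external_id_prefix and push-interval nest under cognite, which is inferred from the row order. The prefix is a hypothetical value.

```yaml
metrics:
  cognite:
    # Hypothetical prefix for the external IDs of the metric time series.
    external_id_prefix: "osdu_extractor_"
    # Seconds between pushes (documented default).
    push-interval: 30
```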