Configure the File Extractor

To configure the File Extractor, you must create a configuration file. The file must be in YAML format.

The configuration file allows substitutions with environment variables:

config-parameter: ${CONFIG_VALUE}

:::info Note
Implicit substitutions only work for unquoted value strings. For quoted strings, use the `!env` tag to activate environment substitution:

config-parameter: !env 'PARAM=SYSTEM;CONFIG=${CONFIG_VALUE}'

:::

The configuration file also contains the global parameter `version`, which holds the version of the configuration schema used in the configuration file. This document describes version 3 of the configuration schema.

:::tip Tip
You can set up extraction pipelines to use versioned extractor configuration files stored in the cloud.
:::
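As a rough orientation, the sketch below shows how the top-level pieces described on this page might fit together in one file. The section names come from the headings that follow. The environment variable name and the path are hypothetical placeholders, and the authentication settings are left out here.

```yaml
# Version of the configuration schema described in this document
version: 3

# Optional logging section
logger:
  console:
    level: INFO

# CDF project to load data into (COGNITE_PROJECT is a placeholder variable name)
cognite:
  project: ${COGNITE_PROJECT}
  # Authentication settings (idp-authentication) go here

# File source to extract from (placeholder path)
files:
  type: local
  path: ./files
```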

Logger

The optional logger section sets up logging to a console and files.

Parameter	Description
`console`	Sets up console logger configuration. See the Console section.
`file`	Sets up file logger configuration. See the File section.

Console

Include the console section to enable logging to a standard output, such as a terminal window.

Parameter	Description
`level`	Select the verbosity level for console logging. Valid options, in order of decreasing verbosity, are `DEBUG`, `INFO`, `WARNING`, `ERROR`, and `CRITICAL`.

File

Include the file section to enable logging to a file. The files are rotated daily.

Parameter	Description
`level`	Select the verbosity level for file logging. Valid options, in order of decreasing verbosity, are `DEBUG`, `INFO`, `WARNING`, `ERROR`, and `CRITICAL`.
`path`	Insert the path to the log file.
`retention`	Specify the number of days to keep logs for. The default value is 7.
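A logger section that enables both outputs could look like the following sketch. The log path and the retention value are illustrative choices, not recommendations.

```yaml
logger:
  # Log INFO and above to standard output
  console:
    level: INFO
  # Log DEBUG and above to a daily rotated file
  file:
    level: DEBUG
    path: logs/file-extractor.log   # hypothetical path
    retention: 14                   # keep logs for 14 days instead of the default 7
```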

Cognite

The cognite section describes which CDF project the extractor will load data into and how to connect to the project.

Parameter	Description
`project`	Insert the CDF project name. This is a required value.
`host`	Insert the base URL of the CDF project. The default value is https://api.cognitedata.com.
`idp-authentication`	Insert the credentials for authenticating to CDF using an external identity provider. You must either enter an API key or use IdP authentication.
`data-set`	Insert an optional data set ID that will be used if you've set the extractor to create missing time series. This value must contain either `id` or `external-id`.

Identity provider (IdP) authentication

The idp-authentication section enables the extractor to authenticate to CDF using an external identity provider, such as Azure AD.

Parameter	Description
`client-id`	Enter the client ID from the IdP. This is a required value.
`secret`	Enter the client secret from the IdP. This is a required value.
`scopes`	List the scopes. This is a required value.
`resource`	Insert the resource to pass along with token requests. This is an optional field.
`token-url`	Insert the URL to fetch tokens from. You must enter either a token URL or an Azure tenant.
`tenant`	Enter the Azure tenant. You must enter either a token URL or an Azure tenant.
`min-ttl`	Insert the minimum time in seconds a token will be valid. If the cached token expires in less than `min-ttl` seconds, it will be refreshed. The default value is 30.
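The cognite section with IdP authentication might be written as in this sketch. It assumes an Azure tenant is used rather than a token URL. The environment variable names, the scope value, and the data set external ID are placeholders to adapt to your own IdP and project.

```yaml
cognite:
  project: ${COGNITE_PROJECT}
  host: https://api.cognitedata.com
  idp-authentication:
    client-id: ${COGNITE_CLIENT_ID}
    secret: ${COGNITE_CLIENT_SECRET}
    # Placeholder scope; use the scopes your IdP expects
    scopes:
      - https://api.cognitedata.com/.default
    # Either tenant or token-url must be set; this sketch uses tenant
    tenant: ${COGNITE_TENANT}
    min-ttl: 30
  # Optional; must contain either id or external-id
  data-set:
    external-id: my-data-set   # hypothetical external ID
```

Reading the secret from an environment variable keeps it out of the configuration file itself.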

Extractor

The optional extractor section contains tuning parameters.

Parameter	Description
`errors_threshold`	Enter the number of retries the extractor should attempt when a file extraction fails. The default value is 5.
`parallelism`	Insert the number of parallel queries to run. The default value is 4.
`state-store`	Set to `true` to configure a state store. By default, no state store is configured and incremental load is deactivated. See the State store section.
`schedule`	Set the schedule on which the file extraction should be executed. Use this parameter when the extractor is set to `continuous` mode. See the Schedule section.

Schedule

Use the schedule subsection to schedule runs when the extractor runs as a service.

Parameter	Description
`type`	Insert the schedule type. Valid options are `cron` and `interval`. `cron` uses regular cron expressions. `interval` expects an interval-based schedule.
`expression`	Enter the cron or interval expression to trigger the extraction. For example, `1h` repeats the extraction hourly, and `5m` repeats the extraction every 5 minutes.
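An extractor section with an hourly interval schedule could be sketched as follows. The nesting of `schedule` under `extractor` is assumed from the parameter tables above, and the commented cron variant uses a standard five-field cron expression as an illustration.

```yaml
extractor:
  errors_threshold: 5
  parallelism: 4
  schedule:
    # Run once per hour using an interval expression
    type: interval
    expression: 1h
    # Alternative: the same cadence as a cron expression (minute 0 of every hour)
    # type: cron
    # expression: "0 * * * *"
```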

State store

Use the state store subsection to save extraction states between runs. Use this if data is loaded incrementally. We support multiple state stores, but you can only configure one at a time.

Parameter	Description
`local`	Local state store configuration. See the Local section.
`raw`	RAW state store configuration. See the RAW section.

Local

Use the local section to store the extraction state in a JSON file on a local machine.

Parameter	Description
`path`	Insert the file path to a JSON file.
`save-interval`	Enter the interval in seconds between each save. The default value is 30 seconds.

RAW

Use the RAW section to store the extraction state in a table in the CDF staging area.

Parameter	Description
`database`	Enter the database name in the CDF staging area.
`table`	Enter the table name in the CDF staging area.
`upload-interval`	Enter the interval in seconds between each save. The default value is 30 seconds.
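The two state store variants might look like the sketch below. It assumes the `state-store` subsection is nested under `extractor`, as the Extractor parameter table suggests. Only one store is active at a time, so the RAW variant is shown commented out. The path, database, and table names are hypothetical.

```yaml
extractor:
  state-store:
    # Option 1: keep state in a JSON file on the local machine
    local:
      path: states/file-extractor-states.json
      save-interval: 30
    # Option 2: keep state in a table in the CDF staging area
    # raw:
    #   database: extractor_states
    #   table: file_extractor
    #   upload-interval: 30
```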

Files

The files section contains the configuration needed in order to connect to the file source. The schema for the file configuration depends on which file source you are connecting to. These are distinguished by the type parameter. Possible file source types include:

Azure Blob Storage
FTP / FTPS
Google Cloud Storage
Local files
Amazon S3
Samba / SMB
SFTP
SharePoint Online

Navigate to Integrate > Connect to source system > Cognite File Extractor in CDF to see all supported sources and the recommended approach.

This is the schema for Azure Blob Storage source:

Parameter	Description
`type`	Type of file source, set to `azure_blob_storage` for Azure Blob Storage files.
`connection_string`	Connection string needed to connect to Azure Blob Storage. This is a mandatory field.
`containers`	List of Azure blob containers. This is an optional field.
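A files section for Azure Blob Storage could be sketched like this. The environment variable name and the container names are placeholders.

```yaml
files:
  type: azure_blob_storage
  # Read the connection string from the environment rather than storing it in the file
  connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
  # Optional list of containers to read from (hypothetical names)
  containers:
    - documents
    - drawings
```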

This is the schema for FTP/FTPS source:

Parameter	Description
`type`	Type of file source, set to `ftp` for FTP or FTPS source.
`base-url`	Enter the base URL for the FTP server. This is a mandatory field.
`port`	Enter the port related to the FTP server. This is an optional field.
`client-login`	Enter the FTP username. This is a mandatory field.
`client-password`	Enter the FTP password. This is a mandatory field.
`main-folder`	Enter the root directory from which the extractor will start the extraction. This is an optional field.
`with-subfolders`	Flag that allows the extractor to traverse into sub-folders in order to retrieve the related files. Possible values are `true` or `false`. Default value is `false`. This is an optional field.
`use-ssl`	When set to `true`, it connects to the source using SSL (FTPS). Possible values are `true` or `false`. Default value is `false`. This is an optional field.
`certificate-file-path`	Enter the path to the certificate file. This is an optional field.
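For an FTPS source, the parameters above might be combined as in this sketch. The host name, port, folder, certificate path, and environment variable names are all hypothetical.

```yaml
files:
  type: ftp
  base-url: ftp.example.com
  port: 21
  client-login: ${FTP_USER}
  client-password: ${FTP_PASSWORD}
  main-folder: /exports
  # Also pick up files in folders below main-folder
  with-subfolders: true
  # Connect using SSL (FTPS) instead of plain FTP
  use-ssl: true
  certificate-file-path: /etc/ssl/certs/ftp-server.pem
```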

This is the schema for Google Cloud Storage source:

Parameter	Description
`type`	Type of file source, set to `gcp_cloud_storage` for Google Cloud Storage source.
`google-application-credentials`	Enter the Google Cloud Platform service account credentials (encoded in base64 format). This is a mandatory field.
`bucket`	Enter the name of the bucket where the files are located. This is a mandatory field.
`folders`	Enter the list of folders where the files are located. This is a mandatory field.
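A Google Cloud Storage source could be configured along these lines. The bucket and folder names are placeholders, and the credentials variable is assumed to hold the base64-encoded service account credentials.

```yaml
files:
  type: gcp_cloud_storage
  # Base64-encoded service account credentials, read from the environment
  google-application-credentials: ${GOOGLE_APPLICATION_CREDENTIALS_B64}
  bucket: my-bucket
  folders:
    - reports/2023
    - reports/2024
```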

This is the schema for local files source:

Parameter	Description
`type`	Type of file source, set to `local` for local files.
`path`	Enter the path (absolute or relative) where the local files are located. This is a mandatory parameter.
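The local files source needs only two parameters. A minimal sketch, with a hypothetical path:

```yaml
files:
  type: local
  # Absolute or relative path to the directory holding the files
  path: /data/files-to-upload
```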

This is the schema for Amazon S3 source:

Parameter	Description
`type`	Type of file source, set to `aws_s3` for Amazon S3 source.
`aws_access_key_id`	Enter the AWS Access Key ID. This is a mandatory parameter.
`aws_secret_access_key`	Enter the AWS Secret Access Key. This is a mandatory field.
`bucket`	Enter the name of the bucket where the files are located. This is a mandatory field.
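For Amazon S3, a sketch using environment variable substitution for the credentials might look as follows. The variable names and the bucket name are placeholders.

```yaml
files:
  type: aws_s3
  aws_access_key_id: ${AWS_ACCESS_KEY_ID}
  aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
  bucket: my-bucket
```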

This is the schema for Samba / SMB source:

Parameter	Description
`type`	Type of file source, set to `smb` for Samba source.
`server`	Enter the server address related to the Samba server. This is a mandatory field.
`share_path`	Enter the Samba server share path. This is a mandatory field.
`username`	Enter the Samba server username. This is a mandatory field.
`password`	Enter the Samba server password. This is a mandatory field.
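A Samba source might be sketched as shown here. The server address and share path are hypothetical values, and the exact format the extractor expects for `share_path` is an assumption.

```yaml
files:
  type: smb
  server: fileserver.example.com
  share_path: shared/documents   # hypothetical share path
  username: ${SMB_USER}
  password: ${SMB_PASSWORD}
```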

This is the schema for SFTP source:

Parameter	Description
`type`	Type of file source, set to `sftp` for SFTP source.
`base-url`	Enter the base URL for the SFTP server. This is a mandatory field.
`port`	Enter the port related to the SFTP server. This is an optional field.
`client-login`	Enter the SFTP username. This is a mandatory field.
`client-password`	Enter the SFTP password. This is a mandatory field.
`main-folder`	Enter the root directory from which the extractor will start the extraction. This is an optional field.
`with-subfolders`	Flag that allows the extractor to traverse into sub-folders in order to retrieve the related files. Possible values are `true` or `false`. Default value is `false`. This is an optional field.
`certificate-file-path`	Enter the path to the certificate file. This is an optional field.
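An SFTP source uses the same parameters as FTP, without `use-ssl`. In this sketch the host name, port, folder, and environment variable names are hypothetical.

```yaml
files:
  type: sftp
  base-url: sftp.example.com
  port: 22
  client-login: ${SFTP_USER}
  client-password: ${SFTP_PASSWORD}
  main-folder: /upload
  # Stay in main-folder only (this is also the default)
  with-subfolders: false
```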

This is the schema for SharePoint Online source:

Parameter	Description
`type`	Type of file source, set to `sharepoint_online` for SharePoint Online source.
`client-id`	Enter the App registration client ID. This is a mandatory field.
`client-secret`	Enter the App registration secret. This is a mandatory field.
`tenant-id`	Enter the Azure tenant related to the App registration. This is a mandatory field.
`base-url`	Enter the SharePoint Online base URL. This is a mandatory field.
`site`	Enter the SharePoint site where the document library is located. This is a mandatory field.
`document-library`	Enter the SharePoint document library where the files are located. This is a mandatory field.
`with-subfolders`	Flag that allows the extractor to traverse into sub-folders in order to retrieve the related files. Possible values are `true` or `false`. Default value is `false`. This is an optional field.
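Putting the parameters above together, a files section for this source might look like the sketch below. The base URL, site, and library names are hypothetical, as are the environment variable names.

```yaml
files:
  type: sharepoint_online
  client-id: ${SP_CLIENT_ID}
  client-secret: ${SP_CLIENT_SECRET}
  tenant-id: ${SP_TENANT_ID}
  base-url: https://example.sharepoint.com
  site: Engineering
  document-library: Documents
  # Also pick up files in folders below the library root
  with-subfolders: true
```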