
Configure the File Extractor

To configure the File Extractor, you must create a configuration file. The file must be in YAML format.

The configuration file allows substitutions with environment variables:

config-parameter: ${CONFIG_VALUE}

:::info Note
Implicit substitutions only work for unquoted value strings. For quoted strings, use the !env tag to activate environment substitution:

config-parameter: !env 'PARAM=SYSTEM;CONFIG=${CONFIG_VALUE}'

:::

The configuration file also contains the global parameter version, which holds the version of the configuration schema used in the configuration file. This document describes version 3 of the configuration schema.

:::tip Tip
You can set up extraction pipelines to use versioned extractor configuration files stored in the cloud.
:::
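As a rough sketch of how the pieces described on this page might fit together, the outline below places the global version parameter next to the four sections covered in the rest of this document. It assumes that each section is a top-level key named after its heading. The environment variable name and the path are placeholders, and the authentication settings are left out.

```yaml
# Hypothetical outline only, not a complete configuration.
version: 3

logger:
  console:
    level: INFO

cognite:
  project: ${COGNITE_PROJECT}   # placeholder environment variable
  # authentication settings omitted in this outline

extractor:
  parallelism: 4

files:
  type: local                   # one of the supported file source types
  path: /data/files             # placeholder path
```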

Logger

The optional logger section sets up logging to a console and files.

Parameter Description
console Sets up console logger configuration. See the Console section.
file Sets up file logger configuration. See the File section.

Console

Include the console section to enable logging to standard output, such as a terminal window.

Parameter Description
level Select the verbosity level for console logging. Valid options, in order of decreasing verbosity, are DEBUG, INFO, WARNING, ERROR, and CRITICAL.

File

Include the file section to enable logging to a file. The files are rotated daily.

Parameter Description
level Select the verbosity level for file logging. Valid options, in order of decreasing verbosity, are DEBUG, INFO, WARNING, ERROR, and CRITICAL.
path Insert the path to the log file.
retention Specify the number of days to keep logs for. The default value is 7.
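A logger section that enables both outputs could look roughly like the fragment below. The log file path is a made-up placeholder, and the levels are just one reasonable choice.

```yaml
logger:
  console:
    level: INFO                      # less verbose output in the terminal
  file:
    level: DEBUG                     # more detail in the rotated log files
    path: logs/file-extractor.log    # placeholder path
    retention: 7                     # days to keep logs (the documented default)
```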

Cognite

The cognite section describes which CDF project the extractor will load data into and how to connect to the project.

Parameter Description
project Insert the CDF project name. This is a required value.
host Insert the base URL of the CDF project. The default value is https://api.cognitedata.com.
idp-authentication Insert the credentials for authenticating to CDF using an external identity provider. You must enter either an API key or use IdP authentication.
data-set Insert an optional data set ID that will be used if you've set the extractor to create missing time series. This value must contain either id or external-id.

Identity provider (IdP) authentication

The idp-authentication section enables the extractor to authenticate to CDF using an external identity provider, such as Azure AD.

Parameter Description
client-id Enter the client ID from the IdP. This is a required value.
secret Enter the client secret from the IdP. This is a required value.
scopes List the scopes. This is a required value.
resource Insert the resource parameter to send along with token requests. This is an optional field.
token-url Insert the URL to fetch tokens from. You must enter either a token URL or an Azure tenant.
tenant Enter the Azure tenant. You must enter either a token URL or an Azure tenant.
min-ttl Insert the minimum time in seconds a token will be valid. If the cached token expires in less than min-ttl seconds, it will be refreshed. The default value is 30.
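The fragment below is a minimal sketch of a cognite section that authenticates through an Azure tenant. The environment variable names, the data set external ID, and the scope value are illustrative assumptions. Check the scope your identity provider expects before reusing it.

```yaml
cognite:
  project: ${COGNITE_PROJECT}
  host: https://api.cognitedata.com
  idp-authentication:
    client-id: ${COGNITE_CLIENT_ID}
    secret: ${COGNITE_CLIENT_SECRET}
    scopes:
      - https://api.cognitedata.com/.default   # assumed scope for the default host
    tenant: ${AZURE_TENANT_ID}                 # alternatively, set token-url
    min-ttl: 30
  data-set:
    external-id: my-data-set                   # placeholder; use id or external-id
```

Reading the secret from an environment variable keeps credentials out of the configuration file itself.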

Extractor

The optional extractor section contains tuning parameters.

Parameter Description
errors_threshold Enter the number of retries the extractor should attempt when a file extraction fails. The default value is 5.
parallelism Insert the number of parallel queries to run. The default value is 4.
state-store Include this subsection to configure a state store. By default, no state store is used and incremental load is deactivated. See the State store section.
schedule Set the interval at which the file extraction should be executed. Use this parameter when the extractor is set to continuous mode. See the Schedule section.

Schedule

Use the schedule subsection to schedule runs when the extractor runs as a service.

Parameter Description
type Insert the schedule type. Valid options are cron and interval.

  • cron uses regular cron expressions.
  • interval expects an interval-based schedule.

expression Enter the cron or interval expression to trigger the extraction. For example, 1h repeats the extraction hourly, and 5m repeats it every 5 minutes.
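To illustrate the tuning and schedule parameters, here is a hedged sketch of an extractor section that runs once per hour. It assumes that schedule is nested directly under extractor, as the parameter table suggests. The cron expression in the comment is only an example value.

```yaml
extractor:
  errors_threshold: 5      # retries per failed file extraction (documented default)
  parallelism: 4           # documented default
  schedule:
    type: interval
    expression: 1h         # run hourly
    # A cron-based alternative might look like this:
    # type: cron
    # expression: "0 2 * * *"   # every day at 02:00
```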

State store

Use the state store subsection to save extraction states between runs. Use this if data is loaded incrementally. We support multiple state stores, but you can only configure one at a time.

Parameter Description
local Local state store configuration. See the Local section.
raw RAW state store configuration. See the RAW section.

Local

Use the local section to store the extraction state in a JSON file on a local machine.

Parameter Description
path Insert the file path to a JSON file.
save-interval Enter the interval in seconds between each save. The default value is 30 seconds.
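A local state store could be configured along these lines. The sketch assumes that state-store sits under the extractor section, and the file name is a placeholder.

```yaml
extractor:
  state-store:
    local:
      path: states.json    # placeholder JSON file for the extraction state
      save-interval: 30    # seconds between saves (documented default)
```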

RAW

Use the RAW section to store the extraction state in a table in the CDF staging area.

Parameter Description
database Enter the database name in the CDF staging area.
table Enter the table name in the CDF staging area.
upload-interval Enter the interval in seconds between each save. The default value is 30 seconds.
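For comparison, a RAW state store might be sketched as follows. As before, the nesting under extractor is an assumption, and the database and table names are invented for illustration.

```yaml
extractor:
  state-store:
    raw:
      database: extractor-states     # placeholder database name
      table: file-extractor          # placeholder table name
      upload-interval: 30            # seconds between saves (documented default)
```

Since only one state store can be configured at a time, use either local or raw, not both.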

Files

The files section contains the configuration needed to connect to the file source. The schema for the file configuration depends on which file source you are connecting to. The sources are distinguished by the type parameter. Possible file source types include:

  • Azure Blob Storage
  • FTP / FTPS
  • Google Cloud Storage
  • Local files
  • Amazon S3
  • Samba / SMB
  • SFTP
  • SharePoint Online

Navigate to Integrate > Connect to source system > Cognite File Extractor in CDF to see all supported sources and the recommended approach.

This is the schema for the Azure Blob Storage source:

Parameter Description
type Type of file source. Set to azure_blob_storage for Azure Blob Storage files.
connection_string Connection string needed to connect to Azure Blob Storage. This is a mandatory field.
containers List of Azure Blob Storage containers. This is an optional field.
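A minimal sketch for an Azure Blob Storage source is shown below. It assumes that the source parameters sit directly under files. The environment variable and container names are placeholders.

```yaml
files:
  type: azure_blob_storage
  connection_string: ${AZURE_STORAGE_CONNECTION_STRING}   # placeholder variable
  containers:                                             # optional
    - documents
    - drawings
```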

This is the schema for the FTP/FTPS source:

Parameter Description
type Type of file source. Set to ftp for an FTP or FTPS source.
base-url Enter the base URL for the FTP server. This is a mandatory field.
port Enter the port of the FTP server. This is an optional field.
client-login Enter the FTP username. This is a mandatory field.
client-password Enter the FTP password. This is a mandatory field.
main-folder Enter the root directory where the extractor will start the extraction. This is an optional field.
with-subfolders Flag that allows the extractor to traverse into sub-folders to retrieve the related files. Possible values are true or false. The default value is false. This is an optional field.
use-ssl When set to true, the extractor connects to the source using SSL (FTPS). Possible values are true or false. The default value is false. This is an optional field.
certificate-file-path Enter the path to the certificate file. This is an optional field.
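An FTPS source could be described roughly as follows. The host name, folder, certificate path, and environment variable names are hypothetical. The exact form expected for base-url (with or without a scheme) is not stated here, so treat that value as an assumption.

```yaml
files:
  type: ftp
  base-url: ftp.example.com            # placeholder host
  port: 21
  client-login: ${FTP_USER}
  client-password: ${FTP_PASSWORD}
  main-folder: /outgoing               # placeholder start directory
  with-subfolders: true                # also traverse sub-folders
  use-ssl: true                        # connect with FTPS instead of plain FTP
  certificate-file-path: certs/ftp-server.pem   # placeholder path
```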

This is the schema for the Google Cloud Storage source:

Parameter Description
type Type of file source. Set to gcp_cloud_storage for a Google Cloud Storage source.
google-application-credentials Enter the Google Cloud Platform service account credentials (encoded in base64 format). This is a mandatory field.
bucket Enter the name of the bucket where the files are located. This is a mandatory field.
folders Enter the list of folders where the files are located. This is a mandatory field.
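For Google Cloud Storage, a configuration might look like the sketch below. The bucket, folder, and environment variable names are invented. The variable is assumed to hold the base64-encoded service account credentials.

```yaml
files:
  type: gcp_cloud_storage
  google-application-credentials: ${GCP_CREDENTIALS_BASE64}   # placeholder variable
  bucket: my-bucket                                           # placeholder bucket
  folders:
    - reports
    - drawings/2023
```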

This is the schema for the local files source:

Parameter Description
type Type of file source. Set to local for local files.
path Enter the path (absolute or relative) where the local files are located. This is a mandatory field.
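The local source is the simplest case. In this sketch the directory is a placeholder, and a relative path would work as well according to the table above.

```yaml
files:
  type: local
  path: /data/files    # placeholder directory containing the files to extract
```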

This is the schema for the Amazon S3 source:

Parameter Description
type Type of file source. Set to aws_s3 for an Amazon S3 source.
aws_access_key_id Enter the AWS Access Key ID. This is a mandatory field.
aws_secret_access_key Enter the AWS Secret Access Key. This is a mandatory field.
bucket Enter the name of the bucket where the files are located. This is a mandatory field.
This is the schema for the Samba / SMB source:

Parameter Description
type Type of file source. Set to smb for a Samba source.
server Enter the address of the Samba server. This is a mandatory field.
share_path Enter the Samba server share path. This is a mandatory field.
username Enter the Samba server username. This is a mandatory field.
password Enter the Samba server password. This is a mandatory field.
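A Samba source might be configured as in the following sketch. The server address and share path are hypothetical, and the exact format expected for share_path is an assumption.

```yaml
files:
  type: smb
  server: fileserver.example.com    # placeholder server address
  share_path: shared/documents      # placeholder share path
  username: ${SMB_USER}
  password: ${SMB_PASSWORD}
```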

This is the schema for the SFTP source:

Parameter Description
type Type of file source. Set to sftp for an SFTP source.
base-url Enter the base URL for the SFTP server. This is a mandatory field.
port Enter the port of the SFTP server. This is an optional field.
client-login Enter the SFTP username. This is a mandatory field.
client-password Enter the SFTP password. This is a mandatory field.
main-folder Enter the root directory where the extractor will start the extraction. This is an optional field.
with-subfolders Flag that allows the extractor to traverse into sub-folders to retrieve the related files. Possible values are true or false. The default value is false. This is an optional field.
certificate-file-path Enter the path to the certificate file. This is an optional field.
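The SFTP schema mirrors the FTP one without the use-ssl flag. The sketch below uses a hypothetical host, folder, and environment variable names.

```yaml
files:
  type: sftp
  base-url: sftp.example.com       # placeholder host
  port: 22
  client-login: ${SFTP_USER}
  client-password: ${SFTP_PASSWORD}
  main-folder: /upload             # placeholder start directory
  with-subfolders: false           # documented default
```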

This is the schema for the SharePoint Online source:

Parameter Description
type Type of file source. Set to sharepoint_online for a SharePoint Online source.
client-id Enter the App registration client ID. This is a mandatory field.
client-secret Enter the App registration secret. This is a mandatory field.
tenant-id Enter the Azure tenant related to the App registration. This is a mandatory field.
base-url Enter the SharePoint Online base URL. This is a mandatory field.
site Enter the SharePoint site where the document library is located. This is a mandatory field.
document-library Enter the SharePoint document library where the files are located. This is a mandatory field.
with-subfolders Flag that allows the extractor to traverse into sub-folders to retrieve the related files. Possible values are true or false. The default value is false. This is an optional field.
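Finally, a SharePoint Online source could be sketched as follows. The base URL, site, library, and environment variable names are all illustrative assumptions.

```yaml
files:
  type: sharepoint_online
  client-id: ${SHAREPOINT_CLIENT_ID}
  client-secret: ${SHAREPOINT_CLIENT_SECRET}
  tenant-id: ${AZURE_TENANT_ID}
  base-url: https://example.sharepoint.com   # placeholder base URL
  site: ExampleSite                          # placeholder site name
  document-library: Documents                # placeholder library name
  with-subfolders: true                      # also traverse sub-folders
```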