Configure the Documentum extractor

To configure the Documentum extractor, you must create a configuration file. The file must be in YAML format.

:::caution Naming the configuration file

You must name the configuration file config.yml.

:::

:::tip Tip

You can set up extraction pipelines to use versioned extractor configuration files stored in the cloud.

:::

The configuration file has a global parameter version, which holds the version of the configuration schema used in the configuration file. This document describes version 1 of the configuration schema.
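
As a minimal sketch, the global parameter sits at the top level of config.yml, outside any section. The placement shown here follows from the description of version as a global parameter:

```yaml
# Top of config.yml: schema version described in this document
version: 1
```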

You can use environment variable substitutions in the configuration file. A value wrapped in ${} is replaced with the value of the environment variable with that name. For example, ${COGNITE_PROJECT} is replaced with the value of the environment variable called COGNITE_PROJECT.

cognite:
  project: ${COGNITE_PROJECT}
  idp-authentication:
    tenant: ${COGNITE_TENANT_ID}
    client-id: ${COGNITE_CLIENT_ID}
    secret: ${COGNITE_CLIENT_SECRET}
    scopes:
      - ${COGNITE_SCOPE}

Logger

Include the logger section to set up logging to a console and to files.

| Parameter | Description |
| --- | --- |
| `console` | Enable logging to a standard output, such as a terminal window. See the Console section. |
| `file` | Enable logging to a file. See the File section. |

Console

Include the console subsection to log events to a standard output, such as a terminal window. This subsection is optional. If level has an invalid value, no logs are sent to the console.

| Parameter | Description |
| --- | --- |
| `level` | Select the verbosity level for console logging. Valid options, in decreasing levels of verbosity, are trace, debug, info, warning, error, fatal, off. The default value is info. |

File

Include the file subsection to log events to a file. This subsection is optional. If level has an invalid value, no logs are sent to the file.

| Parameter | Description |
| --- | --- |
| `level` | Select the verbosity level for file logging. Valid options, in decreasing levels of verbosity, are trace, debug, info, warning, error, fatal, off. |
| `path` | Insert the path to the log file. |
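
A logger section combining both subsections might look like the sketch below. The log file path is a hypothetical placeholder, and the levels are only one possible choice from the valid options listed above:

```yaml
logger:
  console:
    level: info          # documented default for console logging
  file:
    level: debug         # more verbose than the console, for troubleshooting
    path: logs/documentum-extractor.log   # hypothetical path; pick your own
```

Here the console stays at info while the file captures debug output, so detailed logs are available without cluttering the terminal.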

Cognite

Include the cognite section to configure which CDF project the extractor will load data into and how to connect to the project. This section is mandatory and should always contain the project and authentication configuration.

| Parameter | Description |
| --- | --- |
| `project` | Insert the name of the CDF project you want to ingest data into. This is a required value. |
| `idp-authentication` | Insert the credentials for authenticating to CDF using an external identity provider. You must either enter an API key or use IdP authentication. |
| `host` | Insert the base URL of the CDF project. The default value is <https://api.cognitedata.com>. |
| `external-id-prefix` | Enter the external ID prefix to identify the documents in CDF. Leave empty for no prefix. See also External IDs. |
| `source` | Enter the source of the external ID. The default value is documentum. |
| `data-set-id` | Specify the data set ID to assign to CDF Files. |
| `security-categories` | Insert a list of internal IDs for security categories added to CDF Files. |
| `extraction-pipeline` | Insert the external ID of an extraction pipeline in CDF. You should create the extraction pipeline before you configure this section. |

Identity provider (IdP) authentication

Include the idp-authentication subsection to enable the extractor to authenticate to CDF using an external identity provider, such as Azure AD.

| Parameter | Description |
| --- | --- |
| `client-id` | Enter the client ID from the IdP. This is a required value. |
| `secret` | Enter the client secret from the IdP. This is a required value. |
| `scopes` | List the scopes. This is a required value. |
| `tenant` | Enter the Azure tenant. This is a required value. |
| `authority` | Insert the base URL of the authority. The default value is <https://login.microsoftonline.com>. |
| `min-ttl` | Insert the minimum time in seconds a token will be valid. The cached token is refreshed if it expires in less than min-ttl seconds. The default value is 30. |
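
To show how the optional parameters fit alongside the required ones, here is a sketch of a fuller cognite section. The prefix, data set ID, and security category ID are hypothetical values, and the literal types (a plain integer for data-set-id, for instance) are assumptions based on the parameter descriptions:

```yaml
cognite:
  project: ${COGNITE_PROJECT}
  host: https://api.cognitedata.com      # documented default
  external-id-prefix: "documentum:"      # hypothetical prefix
  source: documentum                     # documented default
  data-set-id: 1234567890                # hypothetical data set ID
  security-categories:
    - 1111111111                         # hypothetical internal ID
  idp-authentication:
    tenant: ${COGNITE_TENANT_ID}
    client-id: ${COGNITE_CLIENT_ID}
    secret: ${COGNITE_CLIENT_SECRET}
    scopes:
      - ${COGNITE_SCOPE}
    authority: https://login.microsoftonline.com   # documented default
    min-ttl: 30                                    # documented default, in seconds
```

Keeping the secret in an environment variable rather than in the file avoids committing credentials when the configuration is versioned.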

Extractor

The extractor section configures how the extractor runs: temporary file handling, dry-run and deletion behavior, parallelism, and sync mode.

| Parameter | Description |
| --- | --- |
| `tmp-folder` | Insert a folder where the extractor places temporary files. The default value is data/files relative to the working directory. |
| `keep-files` | Set to true to keep temporary files after processing. The default value is false, which means temporary files are deleted. |
| `upload` | Set to false to run the extractor in dry-run mode where files are accessed and processed, but no changes are made in CDF. The default value is true. |
| `delete` | Set to true to delete files from CDF. Documents are deleted in two cases: (1) a document with the same source and external ID prefix as this extractor exists in CDF but is absent from the query result (requires sync-mode to be set to full); (2) a document has the configured soft-delete-key in its metadata with a value equal to one of the configured soft-delete-values, which marks it as voided. |
| `delete-threshold` | Insert a ratio between 0 and 1 that caps how much the extractor may delete in a single run. In full sync mode, the ratio is measured against the size of CDF Files. In quick sync mode, it is measured against the size of the current extraction. The default value is 1, indicating no threshold. |
| `threads` | Enter the number of documents to process in parallel. Note that this isn't the number of connections to CDF or Documentum. The default value is 10. |
| `sync-mode` | Set the synchronization mode. Options are full or quick. Full sync is typically faster for many files, while quick sync is typically faster for a smaller number of files. The default value is full. See Sync data modes. |
| `quick-sync-interval` | Enter the number of hours to go back for a quick sync. For instance, if you set this value to 24, only the documents changed during the last day are included. The default value is 24. |
| `dump-json-file` | Enter the name of a file to dump this extraction to. Setting this parameter activates a JSON dump. The JSON dump is only intended for debugging purposes and uses a lot of RAM. Don't use this parameter for extractions where you expect over ~50k documents. By default, no dump is written. |
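
As an illustration, the sketch below sets up a quick sync that looks back 48 hours and allows deletions, capped at 10% of the current extraction. The specific numbers are illustrative choices, not recommendations:

```yaml
extractor:
  tmp-folder: data/files     # documented default
  keep-files: false          # delete temporary files after processing
  upload: true               # set to false for a dry run
  delete: true
  delete-threshold: 0.1      # cap deletions at 10% of the current extraction
  threads: 10                # documents processed in parallel
  sync-mode: quick
  quick-sync-interval: 48    # include documents changed in the last 48 hours
```

Since deleting documents that are absent from the query requires full sync mode, only the soft-delete trigger would apply in this quick-sync configuration.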

Metrics

Include the metrics section to send metrics about the extractor performance for remote monitoring of the extractor. This section is optional.

Pushgateways

Include the push-gateways subsection to describe an array of Prometheus Pushgateway destinations to which the extractor will push metrics. This subsection is optional.

| Parameter | Description |
| --- | --- |
| `host` | Insert the absolute URL of the Pushgateway host. Example: http://localhost:9091. If you are using Cognite's Pushgateway, this is https://prometheus-push.cognite.ai/. The default value is null/empty. |
| `job-name` | Enter the value of the exported_job label you want to associate with metrics. |
| `username` | Enter the user name in the Pushgateway. The default value is null/empty. |
| `password` | Enter the password. The default value is null/empty. |
| `push-interval` | Enter the interval in seconds between each push. The default value is 30. |

If you configure this section, the extractor pushes metrics that you can display in a tool such as Grafana. To do so, create Grafana dashboards that use Prometheus as the data source.
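
Because push-gateways is an array, each destination is a list item. The sketch below assumes a single local Pushgateway; the job name and the environment variable names for the credentials are hypothetical:

```yaml
metrics:
  push-gateways:
    - host: http://localhost:9091
      job-name: documentum-extractor         # hypothetical exported_job label
      username: ${PUSHGATEWAY_USERNAME}      # hypothetical variable name
      password: ${PUSHGATEWAY_PASSWORD}      # hypothetical variable name
      push-interval: 30                      # documented default, in seconds
```

To push to more than one destination, add further list items with their own host and job-name.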

Documentum

Include the documentum section to configure the queries, sync mode, and access to Documentum.

If you're connecting via the Documentum Foundation Classes (DFC) Java SDK, you don't need to enter username, password, host, timeout, and retries since the extractor reads these values from dfc.properties.

| Parameter | Description |
| --- | --- |
| `mode` | Enter how the extractor connects to Documentum. This is either via the D2 REST API (recommended) or the DFC Java SDK. The default value is D2. |
| `query` | Enter a Documentum Query Language (DQL) query to execute on the Documentum server. This is a required value. |
| `metadata-properties` | Insert the fields in a document's metadata that contain important information for the extractor. See the Metadata properties section. This is a required value. |
| `username` | Enter the username for authenticating to D2. This is a required value for D2 extractions. |
| `password` | Enter the password for authenticating to D2. This is a required value for D2 extractions. |
| `host` | Insert the base URL of the D2 repository. This is a required value for D2 extractions. |
| `timeout` | Specify the timeout in seconds for HTTP requests on D2. The default value is 60. |
| `retries` | Specify the number of retries for failed requests before stopping the extractor. The default value is 5. |
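
For a D2 connection, the connection parameters and the query might be combined as in the sketch below. The host URL, the environment variable names, and the folder path in the DQL query are hypothetical placeholders:

```yaml
documentum:
  mode: D2                                   # documented default
  host: https://documentum.example.com       # hypothetical base URL
  username: ${DOCUMENTUM_USERNAME}           # hypothetical variable name
  password: ${DOCUMENTUM_PASSWORD}           # hypothetical variable name
  timeout: 60                                # documented default, in seconds
  retries: 5                                 # documented default
  # Hypothetical DQL query: all documents under /Engineering and its subfolders
  query: "SELECT * FROM dm_document WHERE FOLDER('/Engineering', DESCEND)"
  # metadata-properties is also required; it is omitted here for brevity
```

The query is quoted so that the parentheses, comma, and single quotes in the DQL are read as one YAML string.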

Metadata properties

Include the metadata-properties subsection to describe where the extractor will look for information in a document's metadata.

| Parameter | Description |
| --- | --- |
| `file-type-short` | Enter the metadata field that holds the shortened file type. This is the file name suffix, such as pdf. The default value is dos_extension. |
| `file-type-full` | Enter the metadata field that holds the full MIME type. This is the full file type, such as application/pdf. The default value is mime_type. |
| `soft-delete-key` | Include this parameter to turn on detection of soft-deletion. Enter the metadata field that indicates a deleted document. By default, no field is set. |
| `soft-delete-values` | Insert the values that trigger a deletion when one of them and soft-delete-key form a key-value pair in a file's metadata. Values are case-sensitive. The default value is an empty list. |
| `object-id` | This is the file ID for a document that tracks changes and generates external IDs in CDF. This value should be unique across the repository and stay the same when a document changes. The recommended value is i_chronicle_id. |
| `modify-date` | This is the time when a document was last changed. Use this parameter to track changes. |
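
Putting the parameters above together, a metadata-properties subsection could be sketched as follows. The nesting under documentum follows the parameter table in the Documentum section. The soft-delete field and values, and the r_modify_date attribute name, are assumptions for illustration and should be checked against your repository:

```yaml
documentum:
  metadata-properties:
    file-type-short: dos_extension     # documented default
    file-type-full: mime_type          # documented default
    object-id: i_chronicle_id          # recommended value
    modify-date: r_modify_date         # assumption: attribute holding the last change time
    soft-delete-key: status            # hypothetical metadata field
    soft-delete-values:                # case-sensitive; hypothetical values
      - Voided
      - Cancelled
```

With this configuration and delete set to true in the extractor section, a document whose status metadata equals Voided or Cancelled would be treated as voided and deleted from CDF.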