Spark Properties for Spark Worker daemon @ Unravel


Live Pipeline

Property: com.unraveldata.spark.live.pipeline.enabled
Definition: Specifies when jobs and stages are processed. Job/stage data is sent by the Unravel sensor.
- True: process each job/stage as soon as it completes execution.
- False: process after the application completes and its event log file has been processed.
Default: False

Property: com.unraveldata.spark.live.pipeline.maxStoredStages
Definition: Maximum number of jobs/stages stored in the DB. If an application has more jobs/stages than maxStoredStages, only the last maxStoredStages are stored.
This setting affects only the live pipeline; it is not considered when the event log file is processed after the application has completed its execution.
Default: 1000
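For example, to have Unravel process jobs and stages as soon as they complete, the live pipeline can be enabled in the worker configuration (the snippet below is illustrative; the maxStoredStages value is an arbitrary example):

```properties
# Illustrative: process jobs/stages as soon as they complete,
# keeping at most the last 2000 jobs/stages per application in the DB.
com.unraveldata.spark.live.pipeline.enabled=True
com.unraveldata.spark.live.pipeline.maxStoredStages=2000
```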

Event log processing

Property: com.unraveldata.spark.eventlog.location
Definition: All possible locations of the event log files. Multiple locations can be specified as a comma-separated list of values.
This property is used only when the Unravel sensor is not enabled. When the sensor is enabled, the event log path is taken from the application configuration at runtime.
Default: hdfs:///user/spark/applicationHistory/

Property: com.unraveldata.spark.eventlog.maxSize
Definition: Maximum size, in bytes, of an event log file that the Spark worker daemon will process. Event logs larger than maxSize are not processed.
Default: 1000000000 (~1 GB)

Property: com.unraveldata.spark.hadoopFsMulti.useFilteredFiles
Definition: Specifies how to search for the event log files.
- True: prefix search.
- False: prefix + suffix search.
Prefix + suffix search is faster because it avoids the listFiles() API, which may take a long time on large HDFS directories. This search requires that all possible suffixes of the event log files are known; the suffixes are specified by the com.unraveldata.spark.hadoopFsMulti.eventlog.suffixes property.
Default: False

Property: com.unraveldata.spark.hadoopFsMulti.eventlog.suffixes
Definition: Suffixes used for the prefix + suffix search of event logs when com.unraveldata.spark.hadoopFsMulti.useFilteredFiles=False.
NOTE: The empty suffix (,,) must be part of this value.
Default: _1,_1.lz4,_1.snappy,_1.inprogress,,.lz4,.snappy,.inprogress,_2,_2.lz4,_2.snappy,_2.inprogress

Property: com.unraveldata.spark.appLoading.maxAttempt
Definition: Maximum number of attempts for loading the event log file from HDFS/S3/ADL/WASB, etc.
Default: 3

Property: com.unraveldata.spark.appLoading.delayForRetry
Definition: Base delay between consecutive retries when loading event log files. The actual delay is not constant; it increases progressively as 2^attempt * delayForRetry.
Default: 2000 (2 s)
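The retry backoff above can be sketched as follows. This is a minimal illustration of the stated formula, assuming the attempt counter starts at 0 (the source does not specify the starting index):

```python
# Sketch of the exponential retry backoff used when loading event log files,
# assuming each delay is 2^attempt * delayForRetry (milliseconds) and that
# the attempt counter starts at 0 -- an assumption, not documented behavior.
DELAY_FOR_RETRY_MS = 2000  # com.unraveldata.spark.appLoading.delayForRetry
MAX_ATTEMPT = 3            # com.unraveldata.spark.appLoading.maxAttempt

def retry_delays_ms(max_attempt: int = MAX_ATTEMPT,
                    base_delay_ms: int = DELAY_FOR_RETRY_MS) -> list:
    """Return the delay before each retry, growing as 2^attempt * base delay."""
    return [(2 ** attempt) * base_delay_ms for attempt in range(max_attempt)]

print(retry_delays_ms())  # [2000, 4000, 8000]
```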

Executor log processing

Property: com.unraveldata.max.attempt.log.dir.size.in.bytes
Definition: Maximum size of the aggregated executor log that the Spark worker imports and processes for a successful application.
Default: 500000000 (~500 MB)

Property: com.unraveldata.max.failed.attempt.log.dir.size.in.bytes
Definition: Maximum size of the aggregated executor log that the Spark worker imports and processes for a failed application.
Default: 2000000000 (~2 GB)

Tagging

Property: com.unraveldata.tagging.enabled
Definition: Enables the tagging functionality.
Default: True

Property: com.unraveldata.tagging.script.enabled
Definition: Enables script-based tagging.
Default: False

Property: com.unraveldata.app.tagging.script.path
Definition: Path of the tagging script to use when com.unraveldata.tagging.script.enabled=True.
Default: /usr/local/unravel/etc/apptag.py

Property: com.unraveldata.app.tagging.script.method.name
Definition: Name of the method that is executed as part of the tagging script.
Default: generate_unravel_tags
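A tagging script might look like the sketch below. The exact interface Unravel passes to the script is not described here, so this is purely illustrative: it ASSUMES the method named by com.unraveldata.app.tagging.script.method.name receives a dict of application metadata (keys like "user" and "queue" are hypothetical) and returns a dict of tag names to values.

```python
# Hypothetical sketch of /usr/local/unravel/etc/apptag.py.
# ASSUMPTION: the method receives a dict of application metadata and
# returns a dict of tag name -> tag value. The real interface is
# defined by Unravel and may differ.

def generate_unravel_tags(app_metadata):
    """Derive tags from application metadata (illustrative logic only)."""
    tags = {}
    user = app_metadata.get("user", "")
    if user:
        # Hypothetical convention: usernames prefixed "de_" belong to data engineering.
        tags["team"] = "data-eng" if user.startswith("de_") else "other"
    if app_metadata.get("queue") == "production":
        tags["env"] = "prod"
    return tags
```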

Events Related

Property: com.unraveldata.spark.events.enableCaching
Definition: Enables the logic for executing caching events.
Default: False

Other Properties

Property: com.unraveldata.spark.appLoading.maxConcurrentApps
Definition: Maximum number of applications stored in the in-memory cache of the Spark worker's AppStateInfo object.
Default: 5

Property: com.unraveldata.spark.time.histogram
Definition: Specifies whether the timeline histogram is generated.
Note: Timeline histogram generation is memory intensive.
Default: False

S3 specific properties

Property: com.unraveldata.s3.profile.config.file.path
Definition: Path to the S3 profile file, e.g., /usr/local/unravel/etc/s3ro.properties.
Default: -

Property: com.unraveldata.spark.s3.profilesToBuckets
Definition: Comma-separated list of profile-to-bucket mappings in the format <s3_profile>:<s3_bucket>, e.g., com.unraveldata.spark.s3.profilesToBuckets=profile-prod:com.unraveldata.dev,profile-dev:com.unraveldata.dev
IMPORTANT: Ensure that the profiles defined in this property are actually present in the S3 properties file, and that each profile has a corresponding pair of credentials, aws_access_key and aws_secret_access_key. The old access_key/secretKey format is no longer supported.
Default: -
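Putting the two properties together, a configuration might look like this (profile names, bucket names, and the file path contents are hypothetical):

```properties
# Illustrative S3 configuration; profile and bucket names are placeholders.
# The profiles referenced here must exist in the profile file with
# aws_access_key / aws_secret_access_key credentials.
com.unraveldata.s3.profile.config.file.path=/usr/local/unravel/etc/s3ro.properties
com.unraveldata.spark.s3.profilesToBuckets=profile-prod:com.unraveldata.prod,profile-dev:com.unraveldata.dev
```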


EMR/HDInsight specific properties

Property: com.unraveldata.onprem
Definition: Specifies whether the deployment is on premises or in the cloud.
IMPORTANT: On EMR/HDInsight, set this property to False.
Default: True


The following properties are set with values obtained from Microsoft Azure. See Finding Unravel properties' values in Microsoft's Azure for details on locating the values.

Block storage specific properties (for HDInsight)

For a storage account named STORAGE_NAME, the corresponding property key is fs.azure.accountkey.STORAGE_NAME.blob.core.windows.net. A storage account has two access keys; both access keys are required.

Property: com.unraveldata.hdinsight.storage-account-name-1
Definition: Storage account name.
Default: retrieve from Microsoft Azure

Property: com.unraveldata.hdinsight.primary-access-key
Definition: Storage account access key 1.
Default: retrieve from Microsoft Azure

Property: com.unraveldata.hdinsight.storage-account-name-2
Definition: Storage account name.
Default: set to com.unraveldata.hdinsight.storage-account-name-1

Property: com.unraveldata.hdinsight.secondary-access-key
Definition: Storage account access key 2.
Default: retrieve from Microsoft Azure
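A filled-in block storage configuration might look like the following (the account name and key placeholders are hypothetical; the real values come from the Azure portal):

```properties
# Illustrative HDInsight block storage settings; all values are placeholders.
com.unraveldata.hdinsight.storage-account-name-1=mystorageaccount
com.unraveldata.hdinsight.primary-access-key=<access key 1 from the Azure portal>
com.unraveldata.hdinsight.storage-account-name-2=mystorageaccount
com.unraveldata.hdinsight.secondary-access-key=<access key 2 from the Azure portal>
```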

Data Lake (ADL) specific properties

Property: com.unraveldata.adl.accountFQDN
Definition: The Data Lake fully qualified domain name, e.g., mydatalake.azuredatalakestore.net.
Default: retrieve from Microsoft Azure

Property: com.unraveldata.adl.clientId
Definition: Also known as the application ID. An application registration has to be created in the Azure Active Directory.
Default: retrieve from Microsoft Azure

Property: com.unraveldata.adl.clientKey
Definition: Also known as the application access key. A key can be created after registering an application.
Default: retrieve from Microsoft Azure

Property: com.unraveldata.adl.accessTokenEndpoint
Definition: The OAuth 2.0 token endpoint. It is obtained from the application registration tab.
Default: retrieve from Microsoft Azure

Property: com.unraveldata.adl.clientRootPath
Definition: The path in the Data Lake store to which the target cluster has been given access. For instance, in our deployment the cluster "spk21utj02" has been given access to "/clusters/spk21utj02" on the Data Lake store.
Default: retrieve from Microsoft Azure
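A complete ADL configuration might look like the sketch below. All values are placeholders to be replaced with the ones retrieved from Azure; the token endpoint shown follows the common Azure AD OAuth 2.0 URL shape but should be copied from the application registration tab rather than constructed by hand:

```properties
# Illustrative ADL settings; every value is a placeholder.
com.unraveldata.adl.accountFQDN=mydatalake.azuredatalakestore.net
com.unraveldata.adl.clientId=<application ID from the app registration>
com.unraveldata.adl.clientKey=<application access key>
com.unraveldata.adl.accessTokenEndpoint=https://login.microsoftonline.com/<tenant-id>/oauth2/token
com.unraveldata.adl.clientRootPath=/clusters/<cluster-name>
```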