Jumbo configuration (jet.config)

The jet.config configuration file provides configuration for the Jumbo Jet distributed execution engine.

Contents

The ookii.jumbo.jet element

The <ookii.jumbo.jet> element provides configuration for Jumbo clients that need to connect to the Jumbo Jet distributed execution engine, for the Jumbo Jet JobServer and TaskServers, and for Jumbo Jet jobs.

<ookii.jumbo.jet>
</ookii.jumbo.jet>

Child elements

ElementMin occursMax occurs
<jobServer>01
<taskServer>01
<fileChannel>01
<tcpChannel>01
<mergeRecordReader>01

The ookii.jumbo.jet/​jobServer element

The <jobServer> element provides information for clients to access the Jumbo Jet execution engine, and configuration for the Jumbo Jet JobServer. For client applications, only the hostName and port attributes of this element are used.

<jobServer
    hostName=xs:string
    port=xs:int
    jetDfsPath=xs:string
    archiveDirectory=xs:string
    scheduler=xs:string
    maxTaskAttempts=xs:int
    maxTaskFailures=xs:int
    taskServerTimeout=xs:int
    taskServerSoftTimeout=xs:int
    dataInputSchedulingMode=SchedulingMode
    nonDataInputSchedulingMode=SchedulingMode
    schedulingThreshold=xs:float
    broadcastAddress=xs:string
    broadcastPort=xs:int
    listenIPv4AndIPv6=xs:boolean />

Attributes

AttributeUseDescription
hostNameoptional The host name of the Jumbo Jet JobServer. The default value is "localhost".
portoptional The port number for the Jumbo Jet JobServer's RPC service. The default value is "9500".
jetDfsPathoptional The path on the Jumbo DFS (or other file system configured in dfs.config) where files related to Jumbo Jet jobs are stored. The default value is "/JumboJet".
archiveDirectoryoptional The local directory where job configuration and statistics are archived for future retrieval via the Jumbo Jet web portal. If not specified, jobs are not archived and their information cannot be retrieved after they are removed from the current list of completed jobs.
scheduleroptional The assembly qualified type name of a type that implements Ookii.Jumbo.Jet.Scheduling.ITaskScheduler that will be used to schedule task execution on the Jumbo Jet cluster. The default value is "Ookii.Jumbo.Jet.Scheduling.DefaultScheduler, Ookii.Jumbo.Jet".
maxTaskAttemptsoptional The maximum number of times a single task may be re-executed before the job that contains the task is failed. The default value is "5".
maxTaskFailuresoptional The maximum number of failures across all tasks that a job may experience before the job is failed. The default value is "20".
taskServerTimeoutoptional The time in milliseconds after which a TaskServer is considered dead if it has not sent a heartbeat. The default value is "600000".
taskServerSoftTimeoutoptional The time in milliseconds after which a TaskServer will no longer be considered for new tasks during scheduling if it has not sent a heartbeat. The default value is "60000".
dataInputSchedulingModeoptional The scheduling mode to use for tasks that have data input (e.g. a DFS file). This setting may not be used by all schedulers (the default scheduler does use it). The default value is "MoreServers".
nonDataInputSchedulingModeoptional The scheduling mode to use for tasks that do not have data input (no input or channel input). This setting may not be used by all schedulers (the default scheduler does use it). The OptimalLocality value is not applicable to this setting. The default value is "MoreServers".
schedulingThresholdoptional The fraction of tasks (between 0 and 1) of a stage using a file output channel that must be finished before tasks of the receiving stage of the channel are eligible to be scheduled. This setting is likely to only have significant impact if the stage has a number of tasks close to or less than the cluster's capacity. The default value is "0.4".
broadcastAddressoptional The UDP broadcast address for task completion notification. This should be set to your network's UDP multicast address if task completion notification broadcasts are enabled. For example, if your network address is 192.168.1.x with a network mask of 255.255.255.0, your multicast address is 192.168.1.255. The default value is "255.255.255.255".
broadcastPortoptional The port number to use for UDP broadcast for task completion notification. If this value is set to 0, task completion notification broadcasts are disabled. The standard Jumbo port number to use for this is 9550. The default value is "0".
listenIPv4AndIPv6optional Indicates whether the JobServer's RPC service should listen on both IPv6 and IPv4 addresses. On Windows, it is required to explicitly listen on both addresses if both are supported; on Linux, listening on IPv6 will automatically listen on the corresponding IPv4 address, so attempting to manually bind to that address will fail. If this setting is not specified, it defaults to "true" on Windows and "false" on Unix (which is correct for Linux, but you may need to manually set it for other Unix variants like FreeBSD). If either IPv6 or IPv4 connectivity is not available on the system, this setting has no effect.

The ookii.jumbo.jet/​taskServer element

The <taskServer> element configures the Jumbo Jet TaskServers.

<taskServer
    taskDirectory=xs:string
    taskSlots=xs:int
    port=xs:int
    fileServerPort=xs:int
    fileServerMaxConnections=xs:int
    fileServerMaxIndexCacheSize=xs:int
    processCreationDelay=xs:int
    runTaskHostInAppDomain=xs:boolean
    logSystemStatus=xs:boolean
    progressInterval=xs:int
    heartbeatInterval=xs:int
    taskTimeout=xs:int
    immediateCompletedTaskNotification=xs:boolean
    listenIPv4AndIPv6=xs:boolean />

Attributes

AttributeUseDescription
taskDirectoryrequired The local directory where temporary and intermediate data files for tasks of running jobs are stored on a TaskServer.
taskSlotsoptional The maximum number of simultaneous tasks to execute per TaskServer. The default value is "2".
portoptional The port number for the TaskServer's RPC service (used for clients and the task umbilical protocol). The default value is "9501".
fileServerPortoptional The port number for the file server used for file channel shuffling operations. The default value is "9502".
fileServerMaxConnectionsoptional The maximum number of connections that the file server will accept. If a task attempts to connect to the file server while there are already the indicated maximum number of connections, the connection is refused so the task knows to connect to a different server first. This helps balance the load of the shuffling operation by preventing all tasks from reading data from the same TaskServer simultaneously. A reasonable guideline for this value is twice the number of task slots. The default value is "10".
fileServerMaxIndexCacheSizeoptional The maximum number of entries in the file channel server's index cache before entries get evicted by new entries. The default value is "25".
processCreationDelayoptional The delay, in milliseconds, to apply before creating a new TaskHost process. This setting was used to work around a bug with rapid process creation in older versions of Mono, and should be left at 0 unless you are experiencing problems. The default value is "0".
runTaskHostInAppDomainoptional Indicates whether to use an AppDomain rather than a process for running tasks. This is not recommended except for debugging purposes. Tasks will always execute in an AppDomain regardless of this setting if a debugger is attached to the TaskServer. The default value is "false".
logSystemStatusoptional Indicates whether tasks should periodically log processor and memory status information to the task log file. System status is logged based on the progress interval. The default value is "false".
progressIntervaloptional The interval in milliseconds at which tasks report progress to the TaskServer. The default value is "3000".
heartbeatIntervaloptional The interval in milliseconds at which the TaskServer sends heartbeats to the JobServer. The default value is "3000".
taskTimeoutoptional The time in milliseconds after which a task is failed and scheduled for re-execution if it has not reported progress. The default value is "600000".
immediateCompletedTaskNotificationoptional Indicates whether to send immediate out-of-band heartbeats to the JobServer when a task completes. If set to false, the JobServer is not notified of task completion until the next heartbeat interval. The default value is "true".
listenIPv4AndIPv6optional Indicates whether the TaskServer's RPC service and file channel server should listen on both IPv6 and IPv4 addresses. On Windows, it is required to explicitly listen on both addresses if both are supported; on Linux, listening on IPv6 will automatically listen on the corresponding IPv4 address, so attempting to manually bind to that address will fail. If this setting is not specified, it defaults to "true" on Windows and "false" on Unix (which is correct for Linux, but you may need to manually set it for other Unix variants like FreeBSD). If either IPv6 or IPv4 connectivity is not available on the system, this setting has no effect.

The ookii.jumbo.jet/​fileChannel element

The <fileChannel> element configures default settings for file channels between stages in a Jumbo Jet job.

<fileChannel
    readBufferSize=BinarySize
    writeBufferSize=BinarySize
    deleteIntermediateFiles=xs:boolean
    memoryStorageSize=BinarySize
    memoryStorageWaitTimeout=xs:int
    compressionType=CompressionType
    spillBufferSize=BinarySize
    spillBufferLimit=xs:float
    spillSortMinSpillsForCombineDuringMerge=xs:int
    enableChecksum=xs:boolean />

Attributes

AttributeUseDescription
readBufferSizeoptional The buffer size to use when reading intermediate files. The default value is "64KB".
writeBufferSizeoptional The buffer size to use when writing intermediate files. The default value is "64KB".
deleteIntermediateFilesoptional Indicates whether to delete intermediate files after they are no longer needed or when the job finishes or fails. Set this to false to preserve the files to debug a failing job. The default value is "true".
memoryStorageSizeoptional The maximum size of the in-memory storage to use for shuffled segments. Setting this to a high value can improve performance, but may lead to problems if the task itself uses a lot of memory. When using the MergeRecordReader on a file channel and the purgeMemoryBeforeFinalPass setting on the <mergeRecordReader element is set to true, you can use a high value even if the task uses a large amount of memory. Because this value is task dependent, it can be more useful to override it using the job settings. The default value is "100MB".
memoryStorageWaitTimeoutoptional The time in milliseconds to wait for memory storage to become available for a shuffled segment before falling back to disk storage. The default value is "60000".
compressionTypeoptional The type of compression to apply to the file channel's intermediate data files. Enabling file channel compression will reduce intermediate file size and network load, but may significantly increase the CPU load of the tasks. Only use this if your data is highly compressable, the network is slow, or disk space for intermediate files is low. The default value is "None".
spillBufferSizeoptional The size of the in-memory buffer in which to collect intermediate data produced by a task with a file output channel. The default value is "100MB".
spillBufferLimitoptional The threshold (between 0 and 1) at which a spill is triggered (the contents of the in-memory buffer holding intermediate output data are written to disk). The default value is "0.8".
spillSortMinSpillsForCombineDuringMergeoptional If using a SpillSort operation with a combiner for the channel, the minimum number of spills that must have occurred in order to re-apply the combiner during merging. The default value is "3".
enableChecksumoptional Indicates whether to compute and verify checksums for intermediate data. The default value is "true".

The ookii.jumbo.jet/​tcpChannel element

The <tcpChannel> element configures default settings for TCP channels between stages in a Jumbo Jet job.

<tcpChannel
    spillBufferSize=BinarySize
    spillBufferLimit=xs:float
    reuseConnections=xs:boolean />

Attributes

AttributeUseDescription
spillBufferSizeoptional The size of the in-memory buffer in which intermediate data is collected before sending it to the receiving stage's tasks. The default value is "20MB".
spillBufferLimitoptional The threshold (between 0 and 1) at which the contents of the intermediate data buffer are sent to the receiving stage's tasks. The default value is "0.6".
reuseConnectionsoptionalIndicates whether the TCP channel keeps the connections to the receiving stage's tasks open in between spills.The default value is "false".

The ookii.jumbo.jet/​mergeRecordReader element

The <mergeRecordReader> element configures default settings for the Ookii.Jumbo.Jet.MergeRecordReader<T> class, which is used when sorting using e.g. the JobBuilder.SpillSort method.

<mergeRecordReader
    maxFileInputs=xs:int
    memoryStorageTriggerLevel=xs:float
    mergeStreamReadBufferSize=BinarySize
    purgeMemoryBeforeFinalPass=xs:boolean />

Attributes

AttributeUseDescription
maxFileInputsoptional The maximum number of on-disk segments that may be merged in a single merge pass. The merge record reader will merge the input data in multiple passes until the remaining number of on-disk segments is below this value. During shuffling, a background disk merge is triggered if the number of on-disk segments exceeds twice this value. The default value is "100".
memoryStorageTriggerLeveloptional The threshold (between 0 and 1) of memory storage usage at which a background merge is started to merge all current in-memory segments to disk. The default value is "0.6".
mergeStreamReadBufferSizeoptional The buffer size to use per segment for reading on-disk segments during a merge. The default value is "1MB".
purgeMemoryBeforeFinalPassoptional Indicates whether to merge all in-memory segments to disk before the final pass (the final pass will use only on-disk segments). This is useful if the task that is consuming the merged records requires a lot of memory. The default value is "false".

The SchedulingMode type

Indicates the scheduling strategy to use by a task scheduler.

Enumeration values

ValueDescription
Default

Use the default strategy, which is MoreServers.

MoreServers

Favor TaskServers with a large amount of free task slots, spreading a job over as many nodes as possible.

FewerServers

Favor TaskServers with a small amount of free task slots, spreading the job over as few nodes as possible.

OptimalLocality

Do not schedule non-local tasks on a TaskServer even if there are no other tasks that could be assigned to that TaskServer.

The BinarySize type

A quantity expressed using a binary scale suffix such as B, KB, MB, GB, TB or PB. The B is optional. Also allows IEC suffixes (e.g. KiB, MiB). Examples of valid values include "5KB", "7.5M" and "9GiB". Suffixes are not case sensitive. Scale is based on powers of 2, so K = 1024, M = 1048576, G = 1073741824, and so forth.

The CompressionType type

The type of compression to use.

Enumeration values

ValueDescription
None

The data will not be compressed.

GZip

The data will be compressed using the gzip compression algorithm.