Jumbo job configuration

Job configuration XML files are used for each job to specify the stages, tasks, and channels for the job, including inputs and outputs and other assorted settings related to a job.

It is almost never necessary to manually create or modify a job configuration XML file, as it is preferable to use the JobBuilder or JobConfiguration classes.

Job configuration files are created by serializing the JobConfiguration class to XML, and read by deserializing.

Contents

The Job element

The <Job> element specifies the configuration of a job. This is the root element of a job configuration XML file.

<Job
    name=xs:string>
</Job>

Attributes

AttributeUseDescription
nameoptional A friendly name for the job.

Child elements

ElementMin occursMax occurs
<AssemblyFileNames>11
<Stages>11
<AdditionalProgressCounters>11
<SchedulerOptions>01
<JobSettings>01

The Job/​AssemblyFileNames element

The <AssemblyFileNames> element contains a collection of assembly file names that must be loaded for the job's tasks.

<AssemblyFileNames>
</AssemblyFileNames>

Child elements

ElementMin occursMax occurs
<string>0unbounded

The Job/​AssemblyFileNames/​string element

The <string> element for the <AssemblyFileNames> element specifies the file name (without path information) of an assembly that should be loaded for the tasks in this job.

<string>xs:string<string>

The Job/​Stages element

The <Stages> element contains a collection of all the stages (except child stages) in the job. The definition order of the stages does not matter, as they will be ordered by dependencies for scheduling.

<Stages>
</Stages>

Child elements

ElementMin occursMax occurs
<Stage>0unbounded

The Job/​Stages/​Stage element

The <Stage> element specifies a processing stage in a Jumbo Jet job.

<Stage
    id=StageId
    taskCount=xs:int>
</Stage>

Attributes

AttributeUseDescription
idrequired The identifier of this stage. For root stages, this must be unique within the job. For child stages, this value does not need to be unique.
taskCountrequired The number of tasks in this stage. For stages with data input this attribute is informational, since the number of splits defined by the data input determines the actual number of tasks.

Child elements

ElementMin occursMax occurs
<TaskType>11
<DataInputType>01
<DataOutputType>01
<ChildStage>01
<ChildStagePartitionerType>01
<StageSettings>01
<OutputChannel>01
<MultiInputRecordReaderType>01
<DependentStages>01

The Job/​Stages/​Stage/​TaskType element

The <TaskType> element specifies the assembly qualified type name of a type implementing the ITask<TInput, TOutput> interface that provides the data processing operation for this stage.

<TaskType>xs:string<TaskType>

The Job/​Stages/​Stage/​DataInputType element

The <DataInputType> element specifies the assembly qualified type name of a type implementing the IDataInput interface that determines the input for the task, such as a DFS file. This element is not used for stages that have an input channel or have no input.

<DataInputType>xs:string<DataInputType>

The Job/​Stages/​Stage/​DataOutputType element

The <DataOutputType> element specifies the assembly qualified type name of a type implementing the IDataOutput interface that determines the output for the task, such as a DFS file. This element is not used for stages that have an output channel or have no output.

<DataOutputType>xs:string<DataOutputType>

The Job/​Stages/​Stage/​ChildStage element

The <ChildStage> element specifies a child stage for the current stage; that is, a stage connected to the current stage using a pipeline channel.

See the <Stage> element.

The Job/​Stages/​Stage/​ChildStagePartitionerType element

The <ChildStagePartitionerType> element specifies the assembly qualified type name of a type implementing the IPartitioner<T> interface that is used to partition records for the child stage of this stage. This element is not used if the stage has no child stage or the taskCount attribute of the <ChildStage> element is 1, in which case no partitioning occurs. Note that partitioning may happen only once in a compound task.

<ChildStagePartitionerType>xs:string<ChildStagePartitionerType>

The Job/​Stages/​Stage/​StageSettings element

The <StageSettings> element specifies arbitrary settings that are available during task execution that apply only to the current stage.

<StageSettings>
</StageSettings>

Child elements

ElementMin occursMax occurs
<Setting>0unbounded

The Job/​Stages/​Stage/​StageSettings/​Setting element

The <Setting> element specifies a setting.

<Setting
    key=xs:string
    value=xs:string />

Attributes

AttributeUseDescription
keyrequired The key that can be used to retrieve the setting.
valuerequired The value of the setting.

The Job/​Stages/​Stage/​OutputChannel element

The <OutputChannel> element specifies a file or TCP output channel for the stage. This element is not used if the stage has data output or no output, or if the stage has a pipeline output channel (in which case the <ChildStage> element is used instead.

<OutputChannel
    type=ChannelType
    partitionsPerTask=xs:int
    disableDynamicPartitionAssignment=xs:boolean
    partitionAssignmentMethod=PartitionAssignmentMethod
    forceFileDownload=xs:boolean>
</OutputChannel>

Attributes

AttributeUseDescription
typerequired The type of the channel: file or TCP.
partitionsPerTaskrequired The number of partitions that each task in the receiving stage will process (setting this higher than 1 will create a stage with more partitions than tasks, allowing for the use of dynamic partition assignment for load balancing).
disableDynamicPartitionAssignmentrequired Indicates whether to disable dynamic partition assignment if the partitionsPerTask attribute is higher than 1. Use for debugging purposes.
partitionAssignmentMethodrequired Specifies the partition assignment method to use if the partitionsPerTask attribute is higher than 1.
forceFileDownloadrequired Indicates whether to force file download even for local files with a file channel. Use for debugging purposes.

Child elements

ElementMin occursMax occurs
<MultiInputRecordReaderType>11
<OutputStage>11
<PartitionerType>11

The Job/​Stages/​Stage/​OutputChannel/​MultiInputRecordReaderType element

The <MultiInputRecordReaderType> element specifies the assembly qualified name of a type that derives from MultiInputRecordReader<T> that is used as the multi-input record reader to combine input data from multiple tasks in the tasks of the receiving stage.

<MultiInputRecordReaderType>xs:string<MultiInputRecordReaderType>

The Job/​Stages/​Stage/​OutputChannel/​OutputStage element

The <OutputStage> element specifies the ID of the receiving stage of this channel. This may not be a compound stage ID.

<OutputStage>StageId<OutputStage>

The Job/​Stages/​Stage/​OutputChannel/​PartitionerType element

The <PartitionerType> element specifies the assembly qualified type name of a type that implements the IPartitioner<T> interface that is used to partition records across tasks in the receiving stage.

<PartitionerType>xs:string<PartitionerType>

The Job/​Stages/​Stage/​MultiInputRecordReaderType element

The <MultiInputRecordReaderType> element specifies the assembly qualified type name of a type that derives from the MultiInputRecordReader<T> class that is used as the stage multi-input record reader to combine input from multiple channels. This element is only used if the stage has more than one input channel.

<MultiInputRecordReaderType>xs:string<MultiInputRecordReaderType>

The Job/​Stages/​Stage/​DependentStages element

The <DependentStages> element specifies a collection of stage IDs of stages that may not be scheduled until all tasks in this stage have completed. This is used to specify stages that depend on the output of this stage but are not directly or indirectly connected to this stage via a channel. The specified stages may not be child stages (so the ID may not be a compound stage ID).

<DependentStages>
</DependentStages>

Child elements

ElementMin occursMax occurs
<string>0unbounded

The Job/​Stages/​Stage/​DependentStages/​string element

The <string> element for the <DependentStages> element specifies the ID of a stage that may not be scheduled until all the tasks in this stage have completed.

<string>StageId<string>

The Job/​AdditionalProgressCounters element

The <AdditionalProgressCounters> element contains a collection of additional progress counters for the job.

<AdditionalProgressCounters>
</AdditionalProgressCounters>

Child elements

ElementMin occursMax occurs
<AdditionalProgressCounter>0unbounded

The Job/​AdditionalProgressCounters/​AdditionalProgressCounter element

The <AdditionalProgressCounter> element specifies a friendly name for an additional progress counter used by the job.

<AdditionalProgressCounter>
</AdditionalProgressCounter>

Child elements

ElementMin occursMax occurs
<TypeName>11
<DisplayName>11

The Job/​AdditionalProgressCounters/​AdditionalProgressCounter/​TypeName element

The <TypeName> element specifies the assembly qualified type name of a type implementing the IHasAdditionalProgress interface that provides additional progress information about the tasks in this job.

<TypeName>xs:string<TypeName>

The Job/​AdditionalProgressCounters/​AdditionalProgressCounter/​DisplayName element

The <DisplayName> element specifies a friendly name for the additional progress counter used by the Jet web portal.

<DisplayName>xs:string<DisplayName>

The Job/​SchedulerOptions element

The <SchedulerOptions> element specifies custom scheduler settings for a job. Custom schedulers are not required to obey these settings (the default scheduler does).

<SchedulerOptions
    maximumDataDistance=xs:int
    dfsInputSchedulingMode=SchedulingMode
    nonInputSchedulingMode=SchedulingMode />

Attributes

AttributeUseDescription
maximumDataDistancerequired The maximum allowed distance between a task and its input data. The value 0 means the data must be local, 1 allows rack-local data, and 2 or higher allows any distance. This attribute only applies to stages with data input that is provided by a file system that supports locality (such as the Jumbo DFS).
dfsInputSchedulingModerequired The scheduling mode for tasks with data input.
nonInputSchedulingModerequired The scheduling mode for tasks without data input.

The Job/​JobSettings element

The <JobSettings> element specifies arbitrary settings available during task execution that apply to the job as a whole.

<JobSettings>
</JobSettings>

Child elements

ElementMin occursMax occurs
<Setting>0unbounded

The Job/​JobSettings/​Setting element

The <Setting> element specifies a setting.

<Setting
    key=xs:string
    value=xs:string />

Attributes

AttributeUseDescription
keyrequired The key that can be used to retrieve the setting.
valuerequired The value of the setting.

The SchedulingMode type

Indicates the scheduling strategy to use by a task scheduler.

Enumeration values

ValueDescription
Default

Use the strategy specified in the jet.config file.

MoreServers

Favor TaskServers with a large amount of free task slots, spreading a job over as many nodes as possible.

FewerServers

Favor TaskServers with a small amount of free task slots, spreading the job over as few nodes as possible.

OptimalLocality

Do not schedule non-local tasks on a TaskServer even if there are no other tasks that could be assigned to that TaskServer.

The ChannelType type

Indicates the type of a channel between stages.

Enumeration values

ValueDescription
File

Intermediate data is materialized in local files which are shuffled over the network.

Tcp

Data is transfered directly between tasks via TCP connection without materializing it. It must be possible to run all tasks in the receiving stage simultaneously, and task level fault tolerance will not be available for this job.

The PartitionAssignmentMethod type

Indicates how to assign partitions to tasks during initial partition assignment if the stage uses more than one partition per task.

Enumeration values

ValueDescription
Linear

Each task gets a linear sequence of partitions, e.g. task 1 gets partitions 1, 2 and 3, task 2 gets partitions 4, 5 and 6, and so forth.

Striped

The partitions are striped across the tasks, e.g. task 1 gets partitions 1, 3, and 5, and task 2 gets partitions 2, 4, and 6.

The StageId type

A string defining a non-compound stage ID. May not contain the characters '.', '_' or '-'.