
PFPGrowth Class

JobRunner for the Parallel FP-growth algorithm.
Inheritance Hierarchy

Namespace:  Ookii.Jumbo.Jet.Samples.FPGrowth
Assembly:  Ookii.Jumbo.Jet.Samples (in Ookii.Jumbo.Jet.Samples.dll) Version: 2.0.0
Syntax
public class PFPGrowth : JobBuilderJob

The PFPGrowth type exposes the following members.

Constructors
Access         Name       Description
Public method  PFPGrowth  Initializes a new instance of the PFPGrowth class.
Properties
Access              Name                  Description
Public property     AccumulatorTaskCount  Gets or sets the number of feature count accumulator tasks.
Public property     AggregateTaskCount    Gets or sets the aggregate task count.
Public property     BinaryOutput          Gets or sets a value indicating whether the output format is binary.
Public property     BlockSize             Gets or sets the block size of the job's output files. (Inherited from BaseJobRunner.)
Public property     ConfigOnly            Gets or sets a value indicating whether the job runner will only create and print the job configuration, instead of running the job. (Inherited from JobBuilderJob.)
Public property     DfsConfiguration      Gets or sets the configuration used to access the Distributed File System. (Inherited from Configurable.)
Protected property  FileSystemClient      Gets the DFS client. (Inherited from BaseJobRunner.)
Public property     FPGrowthTaskCount     Gets or sets the FP growth task count.
Public property     Groups                Gets or sets the number of groups.
Public property     InputPath             Gets or sets the input path.
Public property     IsInteractive         Gets or sets a value that indicates whether the job runner should wait for user input before starting the job and before exiting. (Inherited from BaseJobRunner.)
Protected property  JetClient             Gets the jet client. (Inherited from BaseJobRunner.)
Public property     JetConfiguration      Gets or sets the configuration used to access the Jet servers. (Inherited from Configurable.)
Public property     JobOrStageProperties  Gets or sets the property values that will override predefined values in the job configuration. (Inherited from BaseJobRunner.)
Public property     JobOrStageSettings    Gets or sets additional job or stage settings that will be defined in the job configuration. (Inherited from BaseJobRunner.)
Public property     MinSupport            Gets or sets the minimum support.
Public property     OutputPath            Gets or sets the output path.
Public property     OverwriteOutput       Gets or sets a value that indicates whether the output directory should be deleted, if it exists, before the job is executed. (Inherited from BaseJobRunner.)
Public property     PartitionsPerTask     Gets or sets a value indicating the number of partitions per task for the MineTransactions stage.
Public property     PatternCount          Gets or sets the pattern count.
Public property     ReplicationFactor     Gets or sets the replication factor of the job's output files. (Inherited from BaseJobRunner.)
Public property     TaskContext           Gets or sets the configuration for the task attempt. (Inherited from Configurable.)
Methods
Access                Name                           Description
Public static method  AccumulateFeatureCounts        Accumulates the feature counts.
Public static method  AggregatePatterns              Aggregates the patterns.
Protected method      ApplyJobPropertiesAndSettings  Adds the values of properties marked with the JobSettingAttribute to the JobSettings dictionary, applies properties set by the JobOrStageProperties property, and adds settings defined by the JobOrStageSettings property. (Inherited from BaseJobRunner.)
Protected method      BuildJob                       Constructs the job configuration using the specified job builder. (Overrides JobBuilderJob.BuildJob(JobBuilder).)
Protected method      CheckAndCreateOutputPath       If OverwriteOutput is true, deletes the output path and then re-creates it; otherwise, creates the output path if it does not exist and fails if it does. (Inherited from BaseJobRunner.)
Public static method  CountFeatures                  Counts the features.
Public method         Equals                         Determines whether the specified object is equal to the current object. (Inherited from Object.)
Protected method      Finalize                       Allows an object to try to free resources and perform other cleanup operations before it is reclaimed by garbage collection. (Inherited from Object.)
Public method         FinishJob                      Called after the job finishes. (Inherited from BaseJobRunner.)
Public static method  GenerateGroupTransactions      Generates the group transactions.
Public method         GetHashCode                    Serves as the default hash function. (Inherited from Object.)
Protected method      GetInputFileSystemEntry        Gets a JumboFileSystemEntry instance for the specified path, or throws an exception if the input doesn't exist. (Inherited from BaseJobRunner.)
Public method         GetType                        Gets the Type of the current instance. (Inherited from Object.)
Protected method      MemberwiseClone                Creates a shallow copy of the current Object. (Inherited from Object.)
Public static method  MineTransactions               Mines the transactions.
Public method         NotifyConfigurationChanged     Indicates the configuration has been changed. ApplyConfiguration(Object, DfsConfiguration, JetConfiguration, TaskContext) calls this method after setting the configuration. (Inherited from BaseJobRunner.)
Protected method      OnJobCreated                   Called when the job has been created on the job server, but before running it. (Inherited from JobBuilderJob.)
Protected method      PromptIfInteractive            Prompts the user to start or exit, if IsInteractive is true. (Inherited from BaseJobRunner.)
Public method         RunJob                         Starts the job. (Inherited from JobBuilderJob.)
Public method         ToString                       Returns a string that represents the current object. (Inherited from Object.)
Protected method      WriteOutput                    Writes the result of the operation to the DFS using this instance's settings for BlockSize and ReplicationFactor. (Inherited from JobBuilderJob.)
Remarks

This job is an implementation of the Parallel FP-Growth algorithm described in the paper "PFP: Parallel FP-Growth for Query Recommendation" by Li et al., 2008.

This algorithm calculates the top-K frequent patterns for each item in the database, considering only patterns that meet the specified minimum support.
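To make "top-K frequent patterns for each item" concrete, the following self-contained C# sketch takes a list of already-mined patterns, drops those below the minimum support, and keeps the K best-supported patterns for every item. The `Pattern` record, the `TopKSketch` class, and ranking by descending support are assumptions made for illustration; none of them are part of Jumbo or taken from this job's implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical type for illustration; not part of Ookii.Jumbo.
public sealed record Pattern(IReadOnlyList<string> Items, int Support);

public static class TopKSketch
{
    // For each item, keep the k patterns containing that item that have the
    // highest support, ignoring patterns whose support is below minSupport.
    public static Dictionary<string, List<Pattern>> TopKPerItem(
        IEnumerable<Pattern> patterns, int minSupport, int k)
    {
        return patterns
            .Where(p => p.Support >= minSupport)
            // Pair every pattern with each distinct item it contains.
            .SelectMany(p => p.Items.Distinct().Select(item => (item, pattern: p)))
            .GroupBy(x => x.item)
            .ToDictionary(
                g => g.Key,
                g => g.Select(x => x.pattern)
                      .OrderByDescending(p => p.Support)
                      .Take(k)
                      .ToList());
    }
}
```

An item that appears only in patterns below the minimum support has no entry in the resulting dictionary.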

The algorithm has three steps: first, it counts how often each item occurs in the input database, filters out the infrequent features, and divides the resulting feature list into groups. Next, it generates group-dependent transactions from the input and runs the FP-Growth algorithm on each group. Finally, the results from each group are aggregated to form the final result.
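The first two steps can be sketched in plain, single-machine C#. This is a sketch under stated assumptions, not the code of this job. The class and method names are hypothetical. Groups are assumed to be contiguous chunks of the frequency-ordered item list. The group-dependent transactions follow the prefix scheme described in the PFP paper. The per-group FP-Growth run and the final aggregation are omitted.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; not part of Ookii.Jumbo.
public static class PfpStepsSketch
{
    // Step 1: count in how many transactions each item occurs, drop items
    // below minSupport, and order the rest by descending count.
    // Ties are broken by ordinal name so the result is deterministic.
    public static List<string> FrequentItems(
        IEnumerable<string[]> transactions, int minSupport)
    {
        return transactions
            .SelectMany(t => t.Distinct())
            .GroupBy(item => item)
            .Select(g => (Item: g.Key, Count: g.Count()))
            .Where(x => x.Count >= minSupport)
            .OrderByDescending(x => x.Count)
            .ThenBy(x => x.Item, StringComparer.Ordinal)
            .Select(x => x.Item)
            .ToList();
    }

    // Maps each frequent item to its position in the frequency-ordered list.
    public static Dictionary<string, int> Ranks(IReadOnlyList<string> frequentItems)
    {
        var rank = new Dictionary<string, int>();
        for (int i = 0; i < frequentItems.Count; i++)
            rank[frequentItems[i]] = i;
        return rank;
    }

    // Step 2: emit the group-dependent transactions for one input transaction.
    // Assumption: an item's group is its rank divided by itemsPerGroup.
    // Scanning from the least frequent item, each group receives the prefix
    // that ends at the last item belonging to that group.
    public static IEnumerable<(int Group, string[] Items)> GroupTransactions(
        string[] transaction, IReadOnlyDictionary<string, int> rank, int itemsPerGroup)
    {
        string[] sorted = transaction
            .Distinct()
            .Where(item => rank.ContainsKey(item))   // drop infrequent items
            .OrderBy(item => rank[item])
            .ToArray();

        var emitted = new HashSet<int>();
        for (int i = sorted.Length - 1; i >= 0; i--)
        {
            int group = rank[sorted[i]] / itemsPerGroup;
            if (emitted.Add(group))
                yield return (group, sorted.Take(i + 1).ToArray());
        }
    }
}
```

For example, with the transactions `a b c`, `a b`, `a c`, `a`, and `d` and a minimum support of 2, `FrequentItems` returns `a`, `b`, `c`. With two items per group, the transaction `c b a` produces `a b c` for group 1 and `a b` for group 0.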

The number of groups should be carefully selected so that the number of items per group is not too large. Ideally, each group should have 5-10 items at most for a large database.
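As a rough aid for choosing a value, the group count can be derived from the number of frequent items and a target group size. The helper below is a hypothetical sketch, not part of Jumbo, and it assumes the frequent items are spread evenly across the groups.

```csharp
using System;

// Hypothetical helper for illustration; not part of Ookii.Jumbo.
public static class GroupCountSketch
{
    // Smallest number of groups such that no group holds more than
    // maxItemsPerGroup items, assuming an even split of the frequent items.
    public static int GroupsFor(int frequentItemCount, int maxItemsPerGroup)
    {
        if (frequentItemCount < 0)
            throw new ArgumentOutOfRangeException(nameof(frequentItemCount));
        if (maxItemsPerGroup <= 0)
            throw new ArgumentOutOfRangeException(nameof(maxItemsPerGroup));

        // Integer ceiling division, with at least one group.
        return Math.Max(1, (frequentItemCount + maxItemsPerGroup - 1) / maxItemsPerGroup);
    }
}
```

Under that assumption, 1000 frequent items with at most 10 items per group gives `GroupsFor(1000, 10) == 100`.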

The input for this job should be a plain text file (or files) where each line represents a transaction containing a space-delimited list of items.
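A minimal sketch of reading input in this format, treating each space-separated token on a line as one entry of the transaction. The `TransactionInputSketch` class is hypothetical and not part of Jumbo; skipping blank lines and collapsing repeated spaces are assumptions, not documented behavior of this job.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper for illustration; not part of Ookii.Jumbo.
public static class TransactionInputSketch
{
    // Reads one transaction per line; tokens are separated by spaces.
    public static IEnumerable<string[]> ReadTransactions(TextReader reader)
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            string[] tokens = line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
            if (tokens.Length > 0)   // assumption: blank lines are ignored
                yield return tokens;
        }
    }
}
```

For instance, passing `new StringReader("bread milk\nmilk eggs butter\n")` yields two transactions, the first with two entries and the second with three.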

This example demonstrates a more complicated Jumbo job, with several stages including more than one stage with file input. It uses scheduling dependencies, group aggregation, partition-based grouping using multiple partitions per task, dynamic partition assignment, and custom progress providers.

See Also