Decision Trees on MapReduce
CS246: Mining Massive Datasets
Jure Leskovec, Stanford University
http://cs246.stanford.edu

Decision Tree Learning
Given one attribute (e.g., lifespan), try to predict the lifespan of new people by means of some of the other available attributes.
Input attributes: d features/attributes x(1), x(2), ..., x(d). Each x(j) has domain Oj:
- Categorical: Oj = {red, blue}
- Numerical: Oj = (0, 10)
Y is the output variable with domain OY:
- Categorical: Classification
- Numerical: Regression
Data D: n examples (xi, yi) where xi is a d-dimensional feature vector and yi is the output variable.
Task: Given an input data vector x, predict y.
01/30/2020 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu

Decision Trees
A Decision Tree is a tree-structured plan of a set of attributes to test in order to predict the output.

Decision Trees (1)
Decision trees:
- Split the data at each internal node.
- Each leaf node makes a prediction (e.g., Y = 0.42).
To make a prediction, an example is dropped down the tree, traversing the edges according to the tests, until it hits a leaf node.

(Figure: a tree with binary tests on X(1) and X(2) at thresholds v1, ..., v5 partitions the (X1, X2) plane into axis-aligned rectangles; each rectangle is a leaf that predicts Y = + or Y = -.)

Imagine we are currently at some node G. Let DG be the data that reaches G.
There is a decision we have to make: Do we continue building the tree?
- If yes, which variable and which value do we use for a split? Then continue building the tree recursively.
- If not, how do we make a prediction? We need to build a predictor node.
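The three decisions above can be sketched as a recursive procedure. This is a hypothetical single-machine skeleton; `stop`, `find_best_split`, `make_predictor`, and `build_subtree` are illustrative names, not PLANET's actual code, and the split search here is deliberately naive.

```python
# Recursive tree construction: (1) which split, (2) when to stop, (3) how to predict.

def variance(ys):
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys) / len(ys)

def stop(data):                                   # decision 2: when to stop
    ys = [y for _, y in data]
    return len(data) <= 2 or len(set(ys)) == 1

def make_predictor(data):                         # decision 3: predict the mean y
    ys = [y for _, y in data]
    return sum(ys) / len(ys)

def find_best_split(data):
    # decision 1: naively try every (attribute, value) pair seen in the data
    best = None
    for attr in data[0][0]:
        for x, _ in data:
            v = x[attr]
            left = [y for xx, y in data if xx[attr] < v]
            right = [y for xx, y in data if xx[attr] >= v]
            if not left or not right:
                continue
            # prefer splits whose children have low total variance
            score = -(len(left) * variance(left) + len(right) * variance(right))
            if best is None or score > best[0]:
                best = (score, attr, v)
    return best[1], best[2]

def build_subtree(data):
    """data: list of (x, y) pairs; x is a dict of attribute values."""
    if stop(data):
        return {"leaf": make_predictor(data)}
    attr, v = find_best_split(data)
    return {"split": (attr, v),
            "left":  build_subtree([(x, y) for x, y in data if x[attr] < v]),
            "right": build_subtree([(x, y) for x, y in data if x[attr] >= v])}

def predict(tree, x):
    if "leaf" in tree:
        return tree["leaf"]
    attr, v = tree["split"]
    return predict(tree["left" if x[attr] < v else "right"], x)
```

Usage: `tree = build_subtree([({"x1": 1.0}, 0.0), ...])`, then `predict(tree, {"x1": 8.5})` routes the example to a leaf.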

3 steps in constructing a tree
BuildSubtree makes three decisions: (1) how to split, (2) when to stop, (3) how to predict. Each call to BuildSubtree requires at least a single pass over the data!

How to construct a tree? (1) How to split?
Pick the attribute & value that optimizes some criterion.
Regression: Purity. Find the split (X(j), v) that maximizes the reduction in the variance of y:
  Purity = Var(D) - (|DL|/|D|) Var(DL) - (|DR|/|D|) Var(DR)
Classification: Information Gain.
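The regression purity criterion can be checked on a toy example. A minimal sketch; `variance_reduction` is an illustrative name, not part of any library.

```python
# Score a candidate split by how much it reduces the size-weighted variance of y.

def variance(ys):
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys) / len(ys)

def variance_reduction(y_all, y_left, y_right):
    n = len(y_all)
    return (variance(y_all)
            - len(y_left) / n * variance(y_left)
            - len(y_right) / n * variance(y_right))

# A split that separates low y values from high ones scores higher
# than a split that mixes them:
y = [0.0, 0.1, 0.0, 1.0, 0.9, 1.0]
good = variance_reduction(y, y[:3], y[3:])     # lows vs. highs
bad = variance_reduction(y, y[::2], y[1::2])   # interleaved
assert good > bad
```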

Entropy
What is the smallest number of bits, on average, per symbol, needed to transmit a stream of symbols drawn from X's distribution? The entropy of X:
  H(X) = -Σj P(X = vj) log2 P(X = vj)
High Entropy: X is from a uniform (boring) distribution; a histogram of the frequency distribution of values of X is flat.
Low Entropy: X is from a varied (peaks/valleys) distribution; a histogram of the frequency distribution of values of X has many lows and one or two highs.

Why Information Gain? Entropy
Suppose I want to predict Y and I have input X:
  X = College Major, Y = Likes "Casablanca"

  X        Y
  Math     Yes
  History  No
  CS       Yes
  Math     No
  Math     No
  CS       Yes
  Math     Yes
  History  No

From this data we estimate:
  P(Y = Yes) = 0.5
  P(X = Math) = 0.5, P(X = History) = 0.25, P(X = CS) = 0.25
Note: H(Y) = 1.

Def: Specific Conditional Entropy
H(Y | X = v) = the entropy of Y among only those records in which X has value v.
Example (from the table above):
  H(Y | X = Math) = 1 (2 Yes, 2 No)
  H(Y | X = History) = 0, H(Y | X = CS) = 0 (pure)

Def: Conditional Entropy
H(Y | X) = the average specific conditional entropy of Y
= if you choose a record at random, the expected conditional entropy of Y, conditioned on that row's value of X
= the expected number of bits needed to transmit Y if both sides know the value of X
= Σj P(X = vj) H(Y | X = vj)

Example:
  vj       P(X = vj)   H(Y | X = vj)
  Math     0.5         1
  History  0.25        0
  CS       0.25        0
So: H(Y | X) = 0.5·1 + 0.25·0 + 0.25·0 = 0.5
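The worked example above can be verified numerically. A small sketch using base-2 logarithms; `entropy` and `conditional_entropy` are illustrative names.

```python
# Check H(Y) = 1 and H(Y | X) = 0.5 on the Major / Likes-Casablanca table.
from math import log2
from collections import Counter

def entropy(values):
    counts = Counter(values)
    n = len(values)
    return -sum(c / n * log2(c / n) for c in counts.values())

def conditional_entropy(xs, ys):
    # H(Y | X) = sum over v of P(X = v) * H(Y | X = v)
    n = len(xs)
    h = 0.0
    for v in set(xs):
        y_given_v = [y for x, y in zip(xs, ys) if x == v]
        h += len(y_given_v) / n * entropy(y_given_v)
    return h

majors = ["Math", "History", "CS", "Math", "Math", "CS", "Math", "History"]
likes  = ["Yes",  "No",      "Yes", "No",  "No",   "Yes", "Yes", "No"]

assert entropy(likes) == 1.0                      # H(Y) = 1
assert conditional_entropy(majors, likes) == 0.5  # H(Y | X) = 0.5
```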

Why Information Gain?
Def: Information Gain
  IG(Y | X) = H(Y) - H(Y | X)
= I must transmit Y. How many bits on average would it save me if both ends of the line knew X?
Example: H(Y) = 1 and H(Y | X) = 0.5, thus IG(Y | X) = 1 - 0.5 = 0.5.

What is Information Gain used for?
Suppose you are trying to predict whether someone is going to live past 80 years. From historical data you might find:
  IG(LongLife | HairColor) = 0.01
  IG(LongLife | Smoker) = 0.4
  IG(LongLife | Gender) = 0.25
  IG(LongLife | LastDigitOfSSN) = 0.00001
IG tells us how much information about Y is contained in X. So an attribute X that has high IG(Y | X) is a good split!

3 steps in constructing a tree
BuildSubtree: (1) how to split, (2) when to stop, (3) how to predict.

When to stop? (2)
Many different heuristic options. Two ideas:

(1) When the leaf is pure: the target variable does not vary too much, Var(yi) < ε.
(2) When the number of examples in the leaf is too small: for example, |D| ≤ 100.

How to predict? (3)
Many options. Regression:
- Predict the average yi of the examples in the leaf, or
- Build a linear regression model on the examples in the leaf.
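The two leaf predictors above can be sketched as follows (1-dimensional x for simplicity; `predict_mean` and `predict_linear` are illustrative names):

```python
# Leaf predictors for the examples (xi, yi) that reach a leaf.

def predict_mean(ys):
    """Option 1: predict the average yi in the leaf."""
    return sum(ys) / len(ys)

def predict_linear(xs, ys):
    """Option 2: least-squares fit y = a*x + b on the leaf's examples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return lambda x: a * x + b

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
assert predict_mean(ys) == 4.0                 # constant prediction
assert predict_linear(xs, ys)(4.0) == 8.0      # extrapolates the trend
```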

Building Decision Trees Using MapReduce
Problem: Given a large dataset with hundreds of attributes, build a decision tree!
General considerations:
- The tree is small (we can keep it in memory): shallow (~10 levels).
- The dataset is too large to keep in memory, and too big to scan over on a single machine.
MapReduce to the rescue!

Today's Lecture: PLANET
Parallel Learner for Assembling Numerous Ensemble Trees [Panda et al., VLDB '09]
A sequence of MapReduce jobs that builds a decision tree.
Setting:
- Hundreds of numerical (discrete & continuous, but not categorical) attributes
- Target variable is numerical: Regression
- Splits are binary: X(j) < v
- The decision tree is small enough for each Mapper to keep in memory
- The data is too large to keep in memory

PLANET Architecture
Master: keeps track of the model (the tree and the attribute metadata) and decides how to grow the tree.
MapReduce: given a set of split candidates, computes their quality over the input data and returns intermediate results to the Master.

PLANET: Building the Tree
The tree will be built in levels, one level at a time: each node n is split on some test X(j) < v into DL and DR.

Decision trees on MapReduce
Hard part: Computing the quality of a split.
1) The Master tells the Mappers which splits (n, X(j), v) to consider.
2) Each Mapper gets a subset of the data and computes partial statistics for a given split.
3) Reducers collect the partial statistics and output the final quality for a given split (n, X(j), v).
4) The Master makes the final decision where to split.

PLANET Overview
We build the tree level by level; one MapReduce step builds one level of the tree.
- Mapper: considers a number of candidate splits (node, attribute, value) on its subset of the data; for each split it stores partial statistics. Partial split-statistics are sent to the Reducers.
- Reducer: collects all partial statistics and determines the best split.
- Master: grows the tree for one level.

PLANET Overview
The Mapper loads the model and info about which attribute splits to consider. Each mapper sees a subset of the data D*. The Mapper drops each datapoint down the tree to find the appropriate leaf node L. For each leaf node L it keeps statistics about (1) the data reaching L and (2) the data in the left/right subtree under each candidate split S. The Reducer aggregates the statistics (1) and (2) and determines the best split for each tree node.

PLANET: Components
Master: monitors everything (runs multiple MapReduce jobs).
Three types of MapReduce jobs:
(1) MapReduce Initialization (run once, first): for each attribute, identify the values to be considered for splits.
(2) MapReduce FindBestSplit (run multiple times): a MapReduce job to find the best split (when there is too much data to fit in memory).
(3) MapReduce InMemoryBuild (run once, last): similar to BuildSubTree (but for small data); grows an entire sub-tree once the data fits in memory.
Model file: a file describing the state of the model.

PLANET: Master
The Master controls the entire process. It determines the state of the tree and grows it:
(1) Decides if nodes should be split.
(2) If there is little data entering a tree node, the Master runs an InMemoryBuild MapReduce job to grow the entire subtree below that node.
(3) For larger nodes, the Master launches MapReduce FindBestSplit to evaluate candidates for the best split. It also collects the results from FindBestSplit and chooses the best split for a node.
(4) Updates the model.

Initialization: Attribute metadata
The Initialization job identifies all the attribute values which need to be considered for splits, and generates attribute metadata to be loaded in memory by the other tasks.
Main question:

Which splits to even consider? A split is defined by a triple: (node n, attribute X(j), value v).
For small data we can sort the values along a particular feature and consider every possible split. But data values may not be uniformly populated, so many splits would not really make a difference:
  X(j): 1.2 1.3 1.4 1.6 2.1 7.2 8.1 9.8 10.1 10.2 10.3 10.4 11.5 11.7 12.8 12.9
Idea: Consider a limited number of splits such that the splits move about the same amount of data.
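The equi-depth idea above can be sketched in a few lines: sort the values and take every (n/B)-th one as a split candidate. `split_candidates` is an illustrative name, not PLANET's API.

```python
# Pick B-1 boundaries so each of the B buckets holds about the same
# number of values.

def split_candidates(values, num_buckets):
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // num_buckets] for i in range(1, num_buckets)]

# The skewed attribute values from the slide:
xj = [1.2, 1.3, 1.4, 1.6, 2.1, 7.2, 8.1, 9.8,
      10.1, 10.2, 10.3, 10.4, 11.5, 11.7, 12.8, 12.9]

# With B = 4 buckets, each candidate moves ~4 of the 16 points:
assert split_candidates(xj, 4) == [2.1, 10.1, 11.5]
```

Note how the candidates land in the dense regions of the data, rather than being spaced evenly over the value range.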

Initialization: Attribute metadata
Splits for numerical attributes:
Goal: an equal number of elements per bucket (B buckets total). Construct by first sorting and then taking B-1 equally-spaced splits:
  1 2 2 3 4 7 8 9 10 10 10 10 11 11 12 12 14 16 16 18 19 20 20 20
Faster construction: sample the data & take equally-spaced splits in the sample; this gives nearly equal buckets.

FindBestSplit
Goal: For a particular tree node n, find the attribute X(j) and value v that maximize Purity:
  D = the training data (xi, yi) reaching the node

  DL = the training data xi where xi(j) < v
  DR = the training data xi where xi(j) ≥ v

FindBestSplit
To compute Purity we need the variance of y in D, DL, DR.
Important observation: the variance can be computed from the sufficient statistics N, S = Σ yi, Q = Σ yi²:
  Var = Q/N - (S/N)²
Each Mapper m processes a subset Dm of the data and computes Nm, Sm, Qm for its own Dm. The Reducer combines the statistics (N = Σ Nm, S = Σ Sm, Q = Σ Qm), computes the global variance, and then the Purity.

FindBestSplit: Map
Mapper: initialized by loading the results of the Initialization task:
- the current model (to find which node each datapoint xi ends up in)
- the attribute metadata (all split points for each attribute)
It also loads the set of candidate splits: {(node, attribute, value)}. For each data record it runs the Map algorithm. For each node, it stores statistics of the data entering

the node, and at the end emits (NodeID: statistics) to all reducers. For each split it stores statistics and at the end emits (SplitID: statistics), where SplitID = (node n, attribute X(j), split value v).

FindBestSplit: Reducer
Reducer:
(1) Load all the (NodeID: statistics) pairs and aggregate the per-node statistics.
(2) For all the (SplitID: statistics) pairs, aggregate the statistics.
For each NodeID, output the best split found.

Overall system architecture
The Master gives the mappers: (1) the tree, (2) the set of nodes, (3) the set of candidate splits.

Nodes: F, G, H, I. Split candidates: (G, X(1), v(1)), (G, X(1), v(2)), (H, X(3), v(3)), (H, X(4), v(4)).
The Mappers output 2 types of key-value pairs:
  (NodeID: S, Q, N)
  (NodeID, Split: S, Q, N)
For every (NodeID, Split), the Reducer(s) compute the Purity and output the best split.

Overall system architecture
Example: we need to split nodes F, G, H, I.
FindBestSplit::Map (each mapper):
- Load the current model M.
- Drop every example xi down the tree. If it hits G or H, update the in-memory hash tables:
  - For each node: Tn: (Node) → {S, Q, N}
  - For each (Split, Node): Tn,j,s: (Node, Attribute, SplitValue) → {S, Q, N}
- Map::Finalize: output the key-value pairs from the above hash tables.
FindBestSplit::Reduce (each reducer):
- Collect T1: <Node, List{S, Q, N}> and T2: <(Node, Attr. Split), List{S, Q, N}>, and aggregate each list into a single {S, Q, N}.
- Compute the impurity for each node using T1, T2.
- Return the best split to the Master (which then decides on the globally best split).
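The Map/Reduce exchange above can be sketched in-process. This is a toy, single-machine sketch of the sufficient-statistics idea, not PLANET's actual API; `mapper`, `reducer`, and the shard layout are illustrative, and only the left-side statistics are emitted here (a real mapper also emits per-node totals so the reducer can derive the right side by subtraction).

```python
# Partial (N, S, Q) statistics per candidate split, combined by a reducer.
from collections import defaultdict

def mapper(shard, splits):
    """Emit (split_id, (N, S, Q)) for the left side of each candidate split."""
    out = []
    for node, attr, value in splits:
        left = [y for x, y in shard if x[attr] < value]
        out.append(((node, attr, value),
                    (len(left), sum(left), sum(y * y for y in left))))
    return out

def reducer(emitted):
    """Sum the partial (N, S, Q) per split_id across all mappers."""
    totals = defaultdict(lambda: [0, 0.0, 0.0])
    for split_id, (n, s, q) in emitted:
        t = totals[split_id]
        t[0] += n; t[1] += s; t[2] += q
    return dict(totals)

def variance_from_stats(n, s, q):
    """Var = Q/N - (S/N)^2, computed without seeing the raw data."""
    return q / n - (s / n) ** 2

# Two "machines", each holding a shard of node A's data:
shard1 = [({"x1": 1.0}, 0.0), ({"x1": 2.0}, 0.5)]
shard2 = [({"x1": 8.0}, 1.0), ({"x1": 9.0}, 1.5)]
splits = [("A", "x1", 5.0)]

emitted = mapper(shard1, splits) + mapper(shard2, splits)
n, s, q = reducer(emitted)[("A", "x1", 5.0)]
assert (n, s, q) == (2, 0.5, 0.25)   # only shard1's points fall left of 5.0
```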

Back to the Master
The Master collects the outputs from the FindBestSplit reducers and for each node decides the best split.
- If the data in DL/DR is small enough, later run a MapReduce InMemoryBuild job on the node.
- Else, run a MapReduce FindBestSplit job for both nodes.

Decision Trees: Conclusion
Decision trees are the single most popular data mining tool:
- Easy to understand
- Easy to implement

- Easy to use
- Computationally cheap
It's possible to get into trouble with overfitting. They do classification as well as regression!

Learning Ensembles
Learn multiple trees and combine their predictions; this gives better performance in practice.
Bagging: learns multiple trees over independent samples of the training data. For a dataset D of n data points:
- Create a dataset D' of n points by sampling from D with replacement: about 33% of the points in D' will be duplicates and about 66% will be unique.
- Predictions from each tree are averaged to compute the final model prediction.

Bagging Decision Trees
(Figure: many trees, each trained on its own sample of D; their predictions are averaged.)
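The bagging procedure above can be sketched as follows. A minimal sketch: `bag` and `bootstrap` are illustrative names, and the toy "tree" is just a stump that predicts the sample mean, standing in for any regression tree learner.

```python
# Bagging: train each tree on a bootstrap sample (drawn with replacement),
# then average the trees' predictions.
import random

def bootstrap(data, rng):
    return [rng.choice(data) for _ in range(len(data))]

def bag(data, num_trees, train_tree, seed=0):
    rng = random.Random(seed)
    trees = [train_tree(bootstrap(data, rng)) for _ in range(num_trees)]
    return lambda x: sum(t(x) for t in trees) / num_trees

# Toy "tree": predicts its sample's mean y everywhere.
def train_tree(sample):
    mean = sum(y for _, y in sample) / len(sample)
    return lambda x: mean

data = [(i, float(i % 2)) for i in range(100)]   # y alternates 0/1, mean 0.5
model = bag(data, num_trees=25, train_tree=train_tree)
# Each bootstrap mean is near 0.5, so the averaged prediction is too:
assert 0.3 < model(0) < 0.7
```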

Bagged Decision Trees
How to create random samples of D?
- Compute a hash of a training record's id and the tree id, and use the records that hash into a particular range to learn a tree.
- This way the same sample is used for all nodes in a tree.
Note: this is sampling D without replacement (but samples of D* should be created with replacement).

Improvement: Random Forests
Train a Bagged Decision Tree, but use a modified tree learning algorithm that selects, at each candidate split, a random subset of the features (if we have d features, a common choice is to consider about √d random features). This is called feature bagging.
Benefit: it breaks the correlation between the trees. If one feature is a very strong predictor, every tree would select it, causing the trees to be correlated.

SVM vs. DT
SVM:
- Classification; usually only 2 classes
- Real valued features (no categorical ones)
- Tens/hundreds of thousands of features; very sparse features
- Simple decision boundary
- No issues with overfitting
- Example applications: text classification, spam detection, computer vision
Decision trees:
- Classification & Regression; multiple (~10) classes
- Real valued and categorical features
- Few (hundreds) of features; usually dense features
- Complicated decision boundaries
- Overfitting! Early stopping
- Example applications: user profile classification, landing page bounce prediction

References
B. Panda, J. S. Herbach, S. Basu, and R. J. Bayardo. PLANET: Massively parallel learning of tree ensembles with MapReduce. In Proc. VLDB 2009.

J. Ye, J.-H. Chow, J. Chen, and Z. Zheng. Stochastic Gradient Boosted Distributed Decision Trees. In Proc. CIKM 2009.