
DECISION TREE

Figure 1: Decision Tree

    This blogpost introduces a machine learning model called the decision tree. After reading it, you will have a deeper understanding of the concepts behind the decision tree model, its terminology, its pros and cons, as well as its real-life applications, where it lends a hand in solving complex problems and thus boosts the quality of life for many.

What is a decision tree

    Imagine you're wandering through a forest, each path branching off into multiple directions, and you need to make a series of decisions to escape. Now, picture having a map that not only shows you all possible routes but also tells you which condition to check at each fork. A decision tree model, which applies various splitting criteria within its branches, assists the user in exactly this kind of decision making. Compared to regression models that apply mathematical formulas, such as logistic regression and linear regression, decision trees emphasize splitting nodes into multiple leaves in order to predict the outcome of a scenario based on its independent variables.

Concept of decision tree

A decision tree is a flowchart-like structure in which each internal node represents a "test" on an attribute, each branch represents the outcome of the test, and each leaf node represents a class label.
The decision tree model is divided into two types:

  • classification trees: the tree predicts categorical values
  • regression trees: the tree predicts numerical values

However, a decision tree can contain a mix of both categorical and numerical values, and within the same tree the numerical threshold can differ for the same attribute across different splits. The terms most commonly used when discussing decision trees are root, branch, and leaf.

  • root: the node at the very top of the tree
  • branch (internal node): a node with arrows both entering and leaving it
  • leaf: a node with arrows entering but none leaving

A decision tree keeps expanding until all the records within a leaf have the same output; at that point the leaf is considered a pure leaf, as it only contains records with the same output.

Scenario

Today you wish to take your friend out for a treat and you want to know if your friend loves Cool as Ice. However, the only information you have about your friend is his age and whether he loves popcorn or soda. With such limited information, can you still give your friend a wonderful hangout experience? Fear not, as the decision tree model will assist you.

While applying the decision tree model, there are several aspects to take into account, which are explained within this blogpost.

1) Gini Impurity

- Determines the order in which features are placed within the tree (the feature with the lowest total Gini impurity is used first)
- Gini impurity is calculated using the method demonstrated below.

1) Separate the dependent and independent variables.

Figure 2: Example of Data

For this scenario, the dependent variable is "Loves Cool as Ice" and the remaining variables are independent variables. However, the method of calculating the Gini impurity is different for categorical and numerical variables.

2) For categorical variables, separate the records based on True or False and count the number of records under each outcome of the independent variable.

Figure 3: Example of decision tree 

Within this example, among the people who like popcorn, only 1 likes Cool as Ice while the other 3 do not. On the other hand, among the people who dislike popcorn, 2 like Cool as Ice and only 1 dislikes it.
Figure 4: Example of decision tree 

As for this example, among the people who like soda, 3 like Cool as Ice while only 1 does not. On the other hand, among the people who dislike soda, none like Cool as Ice and all 3 dislike it.

3) The formula for calculating the Gini impurity of a leaf is shown below:

Gini Impurity for a Leaf = 1 - (probability of "Yes")^2 - (probability of "No")^2
Figure 5: Example of decision tree 

Gini impurity of the leaf for people who like popcorn
= 1 - (1/4)^2 - (3/4)^2
= 1 - 0.625
= 0.375

Gini impurity of the leaf for people who dislike popcorn
= 1 - (2/3)^2 - (1/3)^2
= 1 - 0.556
= 0.444

4) We calculate the total Gini impurity, which is the final Gini impurity value for that independent variable. The formula for calculating the total Gini impurity can be simplified as

Total Gini Impurity = weighted average of Gini Impurities of the Leaves

Figure 6: Example of decision tree 
                                      
The total Gini impurity for Popcorn can be calculated as
= (people who like popcorn / total people) * Gini impurity of the leaf for people who like popcorn + (people who dislike popcorn / total people) * Gini impurity of the leaf for people who dislike popcorn
= (4/7) * 0.375 + (3/7) * 0.444
= 0.405

Figure 7: Total Gini Impurity for Soda
By repeating the process, we find that the total Gini impurity of Soda is 0.214.
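To make the arithmetic above concrete, here is a minimal Python sketch (the helper names gini_leaf and weighted_total_gini are my own, not from any library) that reproduces the Popcorn and Soda totals:

```python
def gini_leaf(yes, no):
    """Gini impurity of a single leaf: 1 - p(yes)^2 - p(no)^2."""
    total = yes + no
    p_yes, p_no = yes / total, no / total
    return 1 - p_yes**2 - p_no**2

def weighted_total_gini(leaves):
    """Weighted average of leaf impurities; leaves is a list of (yes, no) counts."""
    grand_total = sum(yes + no for yes, no in leaves)
    return sum((yes + no) / grand_total * gini_leaf(yes, no) for yes, no in leaves)

# Popcorn: the "likes popcorn" leaf has 1 Yes / 3 No, the "dislikes" leaf 2 Yes / 1 No
print(round(weighted_total_gini([(1, 3), (2, 1)]), 3))  # 0.405
# Soda: the "likes soda" leaf has 3 Yes / 1 No, the "dislikes" leaf 0 Yes / 3 No
print(round(weighted_total_gini([(3, 1), (0, 3)]), 3))  # 0.214
```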


As for numerical variables like age, the method to calculate the total Gini impurity is demonstrated below.

1) Sort the numerical values in ascending order
Figure 8: Sorting the age in ascending order


2) Get the middle number between each pair of consecutive values

3) Use the middle number as a threshold to split the data. For example, the middle number between 7 and 12 is 9.5, so 9.5 is used as the threshold to split the leaves.

4) Using the same formula, calculate the Gini impurity of each leaf.
Figure 9: Gini Impurity for the age with threshold 9.5

5) Calculate the total Gini impurity of the variable with that threshold.
Figure 10: Calculating the total gini impurity

6) Repeat the process for every middle number
7) The middle number with the lowest total Gini impurity is chosen as the threshold for the split
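This threshold search can be sketched in a few lines of Python. The ages and labels below are made-up values for illustration, not the post's actual figures:

```python
# Candidate thresholds are midpoints between consecutive sorted ages;
# the threshold with the lowest total Gini impurity wins.
ages  = [7, 12, 18, 35, 38, 50, 83]                      # illustrative values
loves = [False, False, True, True, True, False, False]   # Loves Cool as Ice

def total_gini_at(threshold):
    left  = [y for a, y in zip(ages, loves) if a < threshold]
    right = [y for a, y in zip(ages, loves) if a >= threshold]

    def leaf_gini(group):
        if not group:
            return 0.0
        p = sum(group) / len(group)  # fraction of "Yes" in the leaf
        return 1 - p**2 - (1 - p)**2

    n = len(ages)
    return len(left) / n * leaf_gini(left) + len(right) / n * leaf_gini(right)

midpoints = [(a + b) / 2 for a, b in zip(ages, ages[1:])]
best = min(midpoints, key=total_gini_at)
print(best, round(total_gini_at(best), 3))
```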

The Gini impurity of an independent variable also relates to:

  • Entropy: a quantitative measure of the randomness of the information being processed.

A high value of entropy means the randomness in the system is high, which makes accurate predictions tough.

A low value of entropy means the randomness in the system is low, which makes accurate predictions easier.

  • Information gain: a measure of how much information a feature provides about a class.

Low entropy leads to high information gain, and high entropy leads to low information gain.


The relationship between Gini impurity, entropy, and information gain can be summarized as follows:

if the Gini impurity of a split increases, its entropy increases, and its information gain decreases.
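As a rough illustration of these quantities (the counts reuse the Soda split from earlier; the function names are my own), entropy and information gain for a binary split can be computed like this:

```python
import math

def entropy(yes, no):
    """Shannon entropy of a node with the given class counts."""
    total = yes + no
    result = 0.0
    for count in (yes, no):
        if count:
            p = count / total
            result -= p * math.log2(p)
    return result

def information_gain(parent, children):
    """Parent entropy minus the weighted entropy of the child leaves."""
    n = sum(parent)
    weighted = sum((y + m) / n * entropy(y, m) for y, m in children)
    return entropy(*parent) - weighted

# Parent node: 3 "Yes" / 4 "No", split into the two Soda leaves from earlier
print(round(information_gain((3, 4), [(3, 1), (0, 3)]), 3))  # ~0.522
```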

2) Feature Selection and Handling Missing Data

Feature Selection

- Objective: reduce the number of features used, which reduces complexity and prevents overfitting.
- Ensure each split reduces the impurity of the leaf.
Figure 11: Example of decision tree

From the scenario above, we can see that the variable Chest Pain does not assist in impurity reduction.

Figure 12: Example of decision tree

Hence, that variable is removed from the decision tree to reduce its complexity.

Handling missing data

For categorical variables, the missing data can be replaced with the most frequent category within that variable.

Figure 13: Blocked arteries data

For this scenario, there are 2 'No' values and 1 'Yes' for the variable called Blocked Arteries. Hence, the missing data can be replaced with 'No'.

Another alternative involves referring to another variable that is highly correlated with the variable that contains the missing value.

Figure 15: Example of data 

For this scenario, Blocked Arteries is correlated with Chest Pain, and the value of Chest Pain matches the value of Blocked Arteries in the other records. Hence, the missing value can be replaced with 'Yes'.

As for numerical values, the missing values can be replaced with this method.

Figure 16: Example of Data

From the scenario above, we assume the weight and height of a person are highly correlated.

Figure 17: Relationship between weight and height

A linear regression line is fitted using these 2 variables, and the corresponding value given by the best-fit line can be used to replace the missing value.
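A minimal sketch of both imputation strategies using pandas and scikit-learn (the column names are hypothetical stand-ins for the post's variables):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "blocked_arteries": ["No", "Yes", "No", None],  # categorical with a gap
    "height": [1.60, 1.70, 1.80, 1.75],
    "weight": [55.0, 68.0, 80.0, None],             # numerical with a gap
})

# Categorical: replace the missing value with the most frequent category
df["blocked_arteries"] = df["blocked_arteries"].fillna(
    df["blocked_arteries"].mode()[0]
)

# Numerical: regress weight on the correlated height and predict the gap
known = df[df["weight"].notna()]
model = LinearRegression().fit(known[["height"]], known["weight"])
gap = df["weight"].isna()
df.loc[gap, "weight"] = model.predict(df.loc[gap, ["height"]])
print(df)
```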

3) Pruning

- Pruning prevents overfitting the training data so the decision tree will do a better job on the testing data.
- The most common method of pruning is called cost complexity pruning, which calculates the Sum of Squared Residuals (SSR) for the full-size tree and then repeats the calculation for subtrees with fewer leaves.

Figure 18: SSR of a full decision tree

The SSR for the full-size tree is 543.8, as displayed.

Figure 19 and 20: Reduction of SSR

From the figures, we can observe that the SSR increases as the number of leaves in the tree decreases. This is expected, because we want the pruned tree not to fit the training data as well as the full-sized tree does.

We select the pruning result based on the tree that has the lowest tree score, which is calculated using this formula:

Tree Score = SSR + aT

a = alpha, a tuning parameter that we find using cross validation.

How to find the value of a
1) Use all of the data to build trees with different alpha values
2) Use cross validation to compare the alpha values
3) Select the alpha value that gives the lowest SSR on the testing data


T = number of leaves

The relationship between a and T is inverse: as a increases, the penalty on each leaf grows, so the tree with the lowest score has fewer leaves (T decreases).
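In scikit-learn, this kind of post-pruning is exposed through the ccp_alpha parameter. The post's example is a regression tree scored by SSR; the sketch below uses a built-in classification dataset and accuracy instead, purely for brevity:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # stand-in dataset for the sketch

# Enumerate the effective alphas for this dataset
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Cross-validate a tree for each alpha and keep the best one
best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    score = cross_val_score(tree, X, y, cv=5).mean()
    if score > best_score:
        best_alpha, best_score = alpha, score

print(f"best alpha: {best_alpha:.5f}, CV accuracy: {best_score:.3f}")
```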

Parameters that you can tune for decision tree

1. Model Complexity 

  • max_depth: the maximum depth of the decision tree.
  • min_samples_split: the minimum number of samples required to split a node.
  • min_samples_leaf: the minimum number of samples required at a leaf node.
  • max_features: the number of features to consider when searching for the best split.
  • max_leaf_nodes: the maximum number of leaf nodes the tree is allowed to have.
  • min_weight_fraction_leaf: the minimum weighted fraction of the total sample weight required at a leaf node.

2. Splitting Criteria

  • criterion: the parameter that defines how the quality of a split is measured within the decision tree.
  1. Information_gain: selects the split that gives the largest decrease in entropy.
  2. Gain_ratio: adjusts the information gain formula to reduce the bias toward attributes with many distinct values, selecting the attribute with the maximal relative information.
  3. Accuracy: selects the split that results in the best overall accuracy of the decision tree.
  4. Gini_index: measures impurity as the probability of mislabeling a randomly chosen element.
  5. Least_square: computes the squared differences between observed and predicted values, choosing the split with the least total squared error.
  • splitter: the strategy used to choose the split at each node.

3. Miscellaneous Parameters

  • random_state: controls the randomness of the estimator so that results are reproducible.
  • max_samples: regulates the number of samples drawn from the overall data to train each base estimator.
  • class_weight: assigns higher weights to the minority classes in imbalanced datasets.
  • ccp_alpha: all nodes with a cost complexity value less than the specified threshold are pruned.

4. Others

  • Apply pruning: use pruning to eliminate insignificant branches and simplify the model.
  • Confidence: defines the degree of pruning.
  • Apply pre-pruning: stop building the tree when further splits do not contribute to better decision-making.
  • Minimal gain: sets a minimum threshold on the information gain required for a split to be made.
  • Minimal leaf size: defines the least number of instances allowed in a leaf node.
  • Minimal size for split: determines the number of instances that must be present in a node before it can be split.
  • Number of pre-pruning alternatives: sets the number of alternative splits the algorithm examines before pre-pruning the tree.
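Several of the knobs above map directly onto scikit-learn's DecisionTreeClassifier arguments. The values below are illustrative, not recommendations; tune them with cross validation on your own data:

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    criterion="gini",         # or "entropy" for information-gain-style splits
    max_depth=5,              # cap model complexity
    min_samples_split=10,     # pre-pruning: don't split tiny nodes
    min_samples_leaf=4,       # pre-pruning: minimal leaf size
    max_features="sqrt",      # features considered at each split
    ccp_alpha=0.01,           # post-pruning strength (cost complexity)
    class_weight="balanced",  # compensate for imbalanced classes
    random_state=42,          # reproducibility
)
```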

Implementation of decision tree in Python

Import the dataset 

Figure 21: Importing the dataset

Getting a preview of the dataset


Figure 22: Code to see the first 5 records


Figure 23: First 5 records of the dataset

Understanding dataset information


Figure 24: Code to get the data type of each variable

Figure 25: Data type of each variable
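The figures show screenshots of the code; an equivalent minimal sketch (the file name dataset.csv is a hypothetical placeholder) looks like this:

```python
import pandas as pd

# Hypothetical file name; substitute the path to your own dataset
df = pd.read_csv("dataset.csv")

df.head()   # preview the first 5 records
df.info()   # data type and non-null count of each variable
```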

Data transformation using OneHotEncoder and OrdinalEncoder


Figure 26: Performing Ordinal Encoder


Figure 27: Performing One Hot Encoder
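A sketch of the two encoders (the column names risk_level and gender are hypothetical; sparse_output requires scikit-learn 1.2 or newer):

```python
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Ordinal encoding: for a categorical column with a meaningful order
ordinal = OrdinalEncoder(categories=[["low", "medium", "high"]])
df[["risk_level"]] = ordinal.fit_transform(df[["risk_level"]])

# One-hot encoding: for a categorical column with no inherent order
onehot = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
encoded = onehot.fit_transform(df[["gender"]])
df[onehot.get_feature_names_out(["gender"])] = encoded
df = df.drop(columns=["gender"])
```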

Understanding the correlation between features


Figure 28: Code to generate the heatmap


Figure 29: Heatmap of the dataset
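A typical way to generate such a heatmap with seaborn, assuming the encoded DataFrame from the previous step:

```python
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.show()
```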


Determining the dependent and independent variables


Figure 30: Splitting the dataset based on dependent and independent variables

Splitting the data and implementing the decision tree model

Figure 31: Implementing the decision tree model
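A sketch covering this step and the previous one (the target column name target is hypothetical):

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Dependent variable (y) versus independent variables (X)
X = df.drop(columns=["target"])
y = df["target"]

# Hold out 20% of the records for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
```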

Test model performance

Figure 32: Getting the result of the model


Figure 33: Model accuracy
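Continuing the sketch above, the accuracy can be obtained like this:

```python
from sklearn.metrics import accuracy_score

y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
```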

Generate the confusion matrix for the model


Figure 34: Generating the confusion matrix of the model


Figure 35: Confusion matrix
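And the confusion matrix, continuing from the same sketch:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(confusion_matrix=cm).plot()
plt.show()
```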


Advantages and disadvantages of decision tree

Advantages 

  1. Decision trees are intuitive and straightforward, which makes them suitable for explaining the decision-making process to non-technical stakeholders, as the results of decision trees are easy to interpret.
  2. Decision trees require minimal data preprocessing compared to other algorithms. They do not require feature scaling or normalization and can handle both numerical and categorical data.
  3. Decision trees can be applied to both classification and regression, which gives them the capability to handle situations with multiple outputs or features with non-linear relationships.

Disadvantages 

  1. Decision trees can easily overfit when they are deep and complex, because the tree might capture noise and small fluctuations within the data instead of the underlying pattern of the dataset.
  2. Minor changes within the dataset might alter the outcome of the algorithm, making this machine learning model sensitive to variations in the data. This lack of robustness is one of the cons that holds back the performance of the model.
  3. Decision trees are biased towards the classes with more instances when handling imbalanced datasets. Without class weights or other appropriate preprocessing before the model is fitted, the decision tree's results will be biased towards the majority class at the expense of the minority class.

Implementation of decision tree in real life

1. Customer Churn Prediction 

Figure 36: Customer Churn

  • AT&T applies decision trees to predict customer churn. By analyzing customer data such as call patterns and billing information, decision trees predict which customers are most likely to leave. By offering incentives or enhancing customer care for this clientele, AT&T can lower its attrition rate and boost the efficacy of its targeted retention strategy.

2. Fraud Detection 

Figure 37: Fraud detection

  • PayPal detects fraudulent transactions with the assistance of decision trees. Identifying patterns indicative of fraud is made easier by modelling customers' behavior and tracking any unusual actions. The model gains the ability to identify potentially fraudulent behavior and flag it for additional examination, guarding clients against financial losses and keeping their accounts safe from hackers or unauthorized users.

3. Personalized Marketing 

Figure 38: Personalized Marketing

  • Amazon applies decision trees for personalized marketing and recommendation systems. Decision trees can segment customers and predict which products they are interested in by analyzing their preferences, search history, and the amount of time they spend on a particular category. Amazon uses this to provide personalized recommendations and targeted advertisements, which enhances the shopping experience and increases the company's sales.
