DECISION TREE
| Figure 1: Decision Tree |
This blog post aims to introduce a machine learning model called the decision tree. After reading it, you will have a deeper understanding of the concepts behind decision trees, their terminology, their pros and cons, and their real-life applications, where they help solve complex problems and improve quality of life for many.
What is decision tree
Imagine you are wandering through a forest where each path branches off into multiple directions, and you need to make a series of decisions to escape. Now picture having a map that not only shows you all possible routes but also tells you which way to go under the specific conditions you encounter. A decision tree works in much the same way: it applies splitting criteria at each branch to guide decision making. Unlike regression models such as logistic regression and linear regression, which rely on mathematical formulas, a decision tree predicts the outcome of a scenario by repeatedly splitting nodes into leaves based on the independent variables.
Concept of decision tree
A decision tree is a flowchart-like structure in which each internal node represents a "test" on an attribute, each branch represents the outcome of the test, and each leaf node represents a class label.
The decision tree model is divided into two types:
- Classification trees: the tree predicts categorical values.
- Regression trees: the tree predicts numerical values.
However, a decision tree can contain a mix of both categorical and numerical values, and even within the same tree, the numerical threshold used for a given attribute can differ from one split to another. The terms most commonly used when discussing decision trees are the root, branches and leaves.
- Root: the node at the very top of the tree.
- Branch (internal node): a node with arrows both entering and leaving it.
- Leaf: a node with an arrow entering but none leaving it.
A decision tree keeps expanding until all the records within a leaf have the same output. At that point the leaf is considered a pure leaf, as it only contains records with the same output.
Scenario
Today you wish to take your friend out for a treat, and you want to know whether he loves Cool as Ice. However, the only information you have about him is his age and whether he loves popcorn or soda. With such limited information, can you still give your friend a wonderful hangout experience? Fear not: the decision tree model will assist you.
When applying the decision tree model, there are several aspects to take into account, which are explained within this blog post.
1) Gini Impurity
- Determines the sequence of features within the tree
- Gini impurity is calculated using the method demonstrated below.
1) Separate the dependent and independent variables.
| Figure 2: Example of Data |
For this scenario, the dependent variable is "Loves Cool as Ice" and the remaining variables are independent variables. Note, however, that the method for calculating the Gini impurity differs between categorical and numerical independent variables.
2) For categorical variables, separate the records based on True or False and count the number of records under each outcome of the independent variable.
| Figure 3: Example of decision tree |
In this example, of the people who like popcorn, only 1 likes Cool as Ice while the other 3 do not. On the other hand, of the people who dislike popcorn, 2 like Cool as Ice and only 1 does not.
| Figure 4: Example of decision tree |
As for this example, of the people who like soda, 3 like Cool as Ice while only 1 does not. On the other hand, of the people who dislike soda, none like Cool as Ice and all 3 dislike it.
3) Calculate the Gini impurity of each leaf:
Gini Impurity for a Leaf = 1 - (the probability of "Yes")^2 - (the probability of "No")^2
| Figure 5: Example of decision tree |
Gini impurity of the leaf for people who like popcorn
= 1 - (1/4)^2 - (3/4)^2
= 1 - 0.625
= 0.375
Gini impurity of the leaf for people who dislike popcorn
= 1 - (2/3)^2 - (1/3)^2
= 1 - 0.556
= 0.444
4) Calculate the total Gini impurity, which is the final Gini impurity value taken for that independent variable. The formula for the total Gini impurity can be simplified as
Total Gini Impurity = weighted average of Gini Impurities of the Leaves
| Figure 6: Example of decision tree |
The total Gini impurity for Popcorn can be calculated as
= (people who like popcorn / total people) * Gini impurity of the leaf for people who like popcorn + (people who dislike popcorn / total people) * Gini impurity of the leaf for people who dislike popcorn
= (4/7) * 0.375 + (3/7) * 0.444
= 0.405
| Figure 7: Total Gini Impurity for Soda |
By repeating the process, we find that the total Gini impurity for Soda is 0.214.
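The arithmetic above can be reproduced in a few lines of Python. The sketch below is not part of the original walkthrough; the counts are simply the ones from the popcorn and soda leaves.

```python
# A minimal sketch (not from the original post) that reproduces the
# Gini calculations above for the Popcorn and Soda splits.

def gini_leaf(yes, no):
    """Gini impurity of a leaf from its counts of 'Yes' and 'No' records."""
    total = yes + no
    return 1 - (yes / total) ** 2 - (no / total) ** 2

def total_gini(leaves):
    """Weighted average of leaf impurities; leaves is a list of (yes, no) counts."""
    n = sum(yes + no for yes, no in leaves)
    return sum((yes + no) / n * gini_leaf(yes, no) for yes, no in leaves)

# Popcorn: likes popcorn -> (1 Yes, 3 No); dislikes popcorn -> (2 Yes, 1 No)
print(round(total_gini([(1, 3), (2, 1)]), 3))  # 0.405
# Soda: likes soda -> (3 Yes, 1 No); dislikes soda -> (0 Yes, 3 No)
print(round(total_gini([(3, 1), (0, 3)]), 3))  # 0.214
```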
For numerical variables such as age, the method for calculating the total Gini impurity is demonstrated below.
1) Sort the numerical values in ascending order.
2) Compute the middle number between every two consecutive values.
3) Use each middle number as a threshold to split the data. For example, the middle number between 7 and 12 is 9.5, so 9.5 is used as a threshold to split the leaves.
4) Using the same formula, calculate the Gini impurity of each leaf.
The total Gini impurity of an independent variable also affects its position within the tree.
| Figure 9: Gini Impurity for the age with threshold 9.5 |
5) Calculate the total Gini impurity of the variable at each candidate threshold.
6) The middle number (threshold) with the lowest total Gini impurity is selected and used for the split.
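Here is a small sketch of that threshold search. The ages and labels below are made up purely for illustration; the actual values appear only in the figures.

```python
# Made-up ages and labels for illustration; the real values are in the figures.
ages   = [7, 12, 18, 35, 38, 50, 83]
labels = [0, 1, 1, 1, 1, 0, 0]            # 1 = loves Cool as Ice, 0 = does not

pairs = sorted(zip(ages, labels))          # step 1: sort by age
thresholds = [(pairs[i][0] + pairs[i + 1][0]) / 2
              for i in range(len(pairs) - 1)]   # step 2: midpoints between neighbours

def gini(leaf):
    """Gini impurity of a leaf given its list of 0/1 labels."""
    if not leaf:
        return 0.0
    p = sum(leaf) / len(leaf)
    return 1 - p ** 2 - (1 - p) ** 2

best = None
for t in thresholds:                       # steps 3-5: evaluate each candidate threshold
    left  = [l for a, l in pairs if a < t]
    right = [l for a, l in pairs if a >= t]
    total = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
    if best is None or total < best[1]:
        best = (t, total)

print(best)                                # step 6: threshold with the lowest total Gini
```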
- Entropy: The quantitative measure of the randomness of the information being processed.
A high value of Entropy means that the randomness in the system is high and thus making accurate predictions is tough.
A low value of Entropy means that the randomness in the system is low and thus making accurate predictions is easier.
- Information Gain : The measure of how much information a feature provides about a class.
Low entropy leads to increased Information Gain and high entropy leads to low Information Gain.
The relationship between Gini Impurity, Entropy and Information Gain can be concluded as below.
If the Gini impurity increases, Entropy increases but the information gain decreases.
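As a quick illustration (not from the original post), entropy and information gain can be computed directly from the Yes/No counts of the popcorn split used earlier:

```python
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(parent, children):
    """Entropy of the parent node minus the weighted entropy of its children."""
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

parent = ['Yes'] * 3 + ['No'] * 4                           # 3 like Cool as Ice, 4 do not
split  = [['Yes', 'No', 'No', 'No'], ['Yes', 'Yes', 'No']]  # the Popcorn split above
print(round(information_gain(parent, split), 3))            # roughly 0.128
```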
2) Feature Selection and Handling Missing Data
Feature Selection
- Objective: reduce the number of features used to build the tree, which reduces complexity and prevents overfitting.
- Ensure each split reduces the impurity of the leaf.
| Figure 11: Example of decision tree |
From the scenario above, we can see that the variable Chest Pain does not help reduce the impurity.
| Figure 12: Example of decision tree |
Hence, that variable is removed from the decision tree to reduce its complexity.
Handling missing data
For categorical variables, the missing data can be replaced with the most frequent category within that variable.
| Figure 13: Blocked arteries data |
For this scenario, there are 2 'No' and 1 'Yes' for the variable called Blocked Arteries. Hence, the missing data can be replaced with 'No'.
Another alternative involves referring to another variable that is highly correlated with the variable containing the missing value.
| Figure 15: Example of data |
For this scenario, Blocked Arteries is highly correlated with Chest Pain, so the value of Blocked Arteries is expected to match the value of Chest Pain. Since Chest Pain is Yes for the record with the missing value, the missing value can be replaced with Yes.
As for numerical values, the missing values can be replaced with the following method.
| Figure 16: Example of Data |
From the scenario above, we assume that a person's weight and height are highly correlated.
| Figure 17: Relationship between weight and height |
A linear regression line is fitted using these two variables, and the corresponding value on the best-fit line is used to replace the missing value.
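As a sketch of both imputation strategies, using pandas and scikit-learn (the column names and values below are hypothetical, not the data in the figures):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical records mirroring the two scenarios above (not the actual figure data).
df = pd.DataFrame({
    "blocked_arteries": ["No", None, "No", "Yes"],
    "weight": [65.0, 72.0, None, 90.0],
    "height": [1.70, 1.75, 1.68, 1.88],
})

# Categorical variable: fill with the most frequent category ('No' here).
df["blocked_arteries"] = df["blocked_arteries"].fillna(df["blocked_arteries"].mode()[0])

# Numerical variable: fit a regression against a highly correlated column (height)
# and read the missing weight off the best-fit line.
known = df[df["weight"].notna()]
reg = LinearRegression().fit(known[["height"]], known["weight"])
missing = df["weight"].isna()
df.loc[missing, "weight"] = reg.predict(df.loc[missing, ["height"]])
print(df)
```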
3) Pruning
- Pruning prevents overfitting to the training data so that the decision tree does a better job on the testing data.
- The most common method of pruning is called Cost Complexity Pruning, which calculates the Sum of Squared Residuals (SSR) for the full-sized tree and then repeats the calculation for the subtrees with fewer leaves.
| Figure 19 and 20: Reduction of SSR |
From the figures, we can observe that the SSR increases as the number of leaves in the tree decreases. This is expected, because we want the pruned tree to fit the training data less closely than the full-sized tree does.
We select the pruning result based on the tree with the lowest Tree Score, which is calculated using this formula.
Tree score = SSR + aT
a (alpha) is a tuning parameter that we find using cross validation.
How to find the value of a
1) Use all of the data to build trees with different alphas
2) Use cross validation to compare alpha
3) Select the alpha value that gives the lowest SSR with the testing data
T = number of leaves
The value of a controls how heavily each leaf is penalised: as a increases, the subtree with the lowest Tree Score has fewer leaves, so T decreases.
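In scikit-learn, cost complexity pruning is exposed through the ccp_alpha parameter, and the candidate alphas can be obtained with cost_complexity_pruning_path. Below is a rough sketch of the three steps above, using a synthetic dataset rather than the one from the figures:

```python
import numpy as np
from sklearn.datasets import make_regression          # synthetic stand-in data
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=4, noise=10, random_state=0)

# Step 1: grow the full tree and collect the candidate alphas it implies.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

# Steps 2-3: cross-validate each alpha and keep the one with the lowest
# mean squared error (highest negative-MSE score) on held-out folds.
scores = [
    cross_val_score(DecisionTreeRegressor(random_state=0, ccp_alpha=a),
                    X, y, cv=5, scoring="neg_mean_squared_error").mean()
    for a in path.ccp_alphas
]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]
print(best_alpha)
```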
Parameters that you can tune for decision tree
1. Model Complexity
- max_depth: Maximum depth of the decision tree.
- min_samples_split: The minimum number of samples required to split an internal node.
- min_samples_leaf: The minimum number of samples required to be at a leaf node.
- max_features: The number of features to consider when looking for the best split.
- max_leaf_nodes: The maximum number of leaf nodes the tree is allowed to have.
- min_weight_fraction_leaf: The minimum weighted fraction of the total sample weights required to be at a leaf node.
2. Splitting Criteria
criterion: The parameter that defines how the quality of a split within the decision tree is measured.
- Information_gain: Selects the split that gives the largest decrease in entropy.
- Gain_ratio: Adjusts the information gain formula to reduce the bias toward attributes with many distinct values, selecting the attribute with the highest relative information gain.
- Accuracy: Chooses the split that results in the best overall accuracy of the decision tree.
- Gini_index: Measures impurity from the probability of each class label among the elements of a node; splits that reduce this impurity the most are preferred.
- Least_square: Uses the squared differences between observed and predicted values, choosing the split with the smallest total squared error.
splitter: The strategy used to choose the split at each node.
3. Miscellaneous Parameters
- random_state: Controls the randomness of the estimator, making results reproducible.
- max_samples: Regulates the number of samples drawn from the overall data to train each base estimator.
- class_weight: Assigns higher weights to the minority classes in imbalanced datasets.
- ccp_alpha: All nodes with a cost complexity value below this threshold are pruned away.
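A hedged sketch of how several of these scikit-learn parameters might be tuned together with GridSearchCV; the grid values and the iris dataset are arbitrary choices for illustration:

```python
from sklearn.datasets import load_iris                 # arbitrary example dataset
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Grid values chosen only for illustration.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5, None],
    "min_samples_split": [2, 10],
    "min_samples_leaf": [1, 5],
    "ccp_alpha": [0.0, 0.01],
}

search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
print(search.best_score_)
```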
4. Others
- Apply pruning: Use pruning to eliminate insignificant branches and simplify the model.
- Confidence: Defines the degree of pruning.
- Apply pre-pruning: Stop building the tree when further splits do not contribute to better decision-making.
- Minimal gain: Sets a minimum information gain threshold that a split must achieve.
- Minimal leaf size: Defines the least number of instances allowed in a leaf node.
- Minimal size for split: Determines the number of instances that must be present in a node before the node can be split.
- Number of pre-pruning alternatives: Sets the number of alternative splits the algorithm examines before pre-pruning the tree.
Implementation of decision tree in python
Import the dataset
| Figure 21: Importing the dataset |
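Since the code appears only as a screenshot, here is roughly what the import in Figure 21 might look like; the file name is a placeholder for whatever dataset was actually used.

```python
import pandas as pd

# "dataset.csv" is a placeholder for the actual file loaded in Figure 21.
df = pd.read_csv("dataset.csv")
```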
Getting a preview of the dataset
| Figure 22: Code to see the first 5 records |
| Figure 23: First 5 records of the dataset |
Understanding dataset information
| Figure 24: Code to get the data type of each variable |
| Figure 25: Data type of each variable |
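Figures 22 to 25 correspond to the standard pandas inspection calls, along the lines of:

```python
print(df.head())   # first five records, as in Figures 22-23
df.info()          # column names, data types and non-null counts, as in Figures 24-25
```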
Data transformation using OneHotEncoder and OrdinalEncoder
| Figure 26: Performing Ordinal Encoder |
| Figure 27: Performing One Hot Encoder |
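A sketch of what the encoding step might look like; the column names below are placeholders for the dataset's actual ordinal and nominal features, and sparse_output assumes scikit-learn 1.2 or newer.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

ordinal_cols = ["education_level"]   # placeholder ordered category
nominal_cols = ["gender"]            # placeholder unordered category

# Ordinal encoding maps ordered categories to integers.
df[ordinal_cols] = OrdinalEncoder().fit_transform(df[ordinal_cols])

# One-hot encoding creates a 0/1 column per category of the nominal features.
ohe = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
encoded = pd.DataFrame(ohe.fit_transform(df[nominal_cols]),
                       columns=ohe.get_feature_names_out(nominal_cols),
                       index=df.index)
df = pd.concat([df.drop(columns=nominal_cols), encoded], axis=1)
```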
Understanding the correlation between features
| Figure 28: Code to generate the heatmap |
| Figure 29: Heatmap of the dataset |
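The heatmap in Figure 29 can be produced with seaborn, for example:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation heatmap of the now-numeric features, as in Figure 29.
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(), annot=True, cmap="coolwarm")
plt.show()
```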
Determining the dependent and independent variables
| Figure 30: Splitting the dataset based on dependent and independent variables |
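Assuming the dependent variable is stored in a column called target (a placeholder name), the split in Figure 30 would look something like:

```python
X = df.drop(columns=["target"])   # independent variables
y = df["target"]                  # dependent variable ("target" is a placeholder name)
```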
Splitting the data and implementing the decision tree model
| Figure 31: Implementing the decision tree model |
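Continuing from the sketches above, the training step in Figure 31 likely resembles the following; the split ratio and tree settings are assumptions, not the exact values used in the figure.

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hold out 20% of the data for testing; the split ratio here is an assumption.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier(criterion="gini", max_depth=5, random_state=42)
model.fit(X_train, y_train)
```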
Test model performance
| Figure 33: Model accuracy |
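The accuracy in Figure 33 can be computed with accuracy_score, continuing from the sketch above:

```python
from sklearn.metrics import accuracy_score

y_pred = model.predict(X_test)
print(accuracy_score(y_test, y_pred))   # the actual value is shown in Figure 33
```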
Generate the confusion matrix for the model
| Figure 34: Generating the confusion matrix of the model |
| Figure 35: Confusion matrix |
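And the confusion matrix of Figures 34 and 35, again continuing from the same sketch:

```python
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(confusion_matrix=cm).plot()   # plotted matrix, as in Figure 35
```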