
DECISION TREE

Figure 1: Decision Tree

    This blogpost introduces a machine learning model called the decision tree. After reading it, you will have a deeper understanding of the concepts behind the decision tree model, its terminology, its pros and cons, as well as its real-life applications, where it lends a hand in solving complex problems and thus boosts the quality of life for many.

What is a decision tree

    Imagine you're wandering through a forest, each path branching off into multiple directions, and you need to make a series of decisions to escape. Now, picture having a map that not only shows you all possible routes but also tells you which condition to check at each fork. A decision tree model, which applies various splitting criteria within its branches, assists the user in exactly this kind of decision making. Compared to regression models that apply mathematical formulas, such as logistic regression and linear regression, decision trees emphasize splitting nodes into multiple leaves in order to predict the outcome of a scenario based on its independent variables.

Concept of decision tree

A decision tree is a flowchart-like structure in which each internal node represents a "test" on an attribute, each branch represents the outcome of the test, and each leaf node represents a class label.
The decision tree model is divided into two types:

  • classification trees: the tree predicts categorical values
  • regression trees: the tree predicts numerical values

However, a decision tree can contain a mix of both categorical and numerical values, and within the same tree the numerical threshold can differ for the same attribute across different splits. The terms most commonly used when discussing decision trees are root, branch, and leaf.

  • root: the node at the very top of the tree
  • branch (internal node): a node with arrows both entering and leaving it
  • leaf: a node with arrows entering but none leaving

A decision tree keeps expanding until all the records within a leaf have the same output; at that point the leaf is considered a pure leaf, as it only contains records with the same output.

Scenario

Today you wish to take your friend out for a treat and you want to know if your friend loves Cool as Ice. However, the only information you have about your friend is his age and whether he loves popcorn or soda. With such limited information, can you still give your friend a wonderful hangout experience? Fear not, as the decision tree model will assist you.

While applying the decision tree model, there are several aspects to take into account, which are explained within this blogpost.

1) Gini Impurity

- Determines the order in which features are placed within the tree (the feature with the lowest total Gini impurity is used first)
- Gini impurity is calculated using the method demonstrated below.

1) Separate the dependent and independent variables.

Figure 2: Example of Data

For this scenario, the dependent variable is "Loves Cool as Ice" and the remaining variables are independent variables. However, the method of calculating the Gini impurity is different for categorical and numerical variables.

2) For categorical variables, separate the records based on True or False and count the number of records under each outcome of the independent variable.

Figure 3: Example of decision tree 

Within this example, among the people who like popcorn, only 1 likes Cool as Ice while the other 3 do not. On the other hand, among the people who dislike popcorn, 2 like Cool as Ice and only 1 dislikes it.
Figure 4: Example of decision tree 

As for this example, among the people who like soda, 3 like Cool as Ice while only 1 does not. On the other hand, among the people who dislike soda, none like Cool as Ice and all 3 dislike it.

3) The formula for calculating the Gini impurity of a leaf is shown below:

Gini Impurity for a Leaf = 1 - (probability of "Yes")^2 - (probability of "No")^2
Figure 5: Example of decision tree 

Gini impurity of the leaf for people who like popcorn
= 1 - (1/4)^2 - (3/4)^2
= 1 - 0.625
= 0.375

Gini impurity of the leaf for people who dislike popcorn
= 1 - (2/3)^2 - (1/3)^2
= 1 - 0.556
= 0.444

4) We calculate the total Gini impurity, which is the final Gini impurity value for that independent variable. The formula for calculating the total Gini impurity can be simplified as

Total Gini Impurity = weighted average of Gini Impurities of the Leaves

Figure 6: Example of decision tree 
                                      
The total Gini impurity for Popcorn can be calculated as
= (people who like popcorn / total people) * Gini impurity of the leaf for people who like popcorn + (people who dislike popcorn / total people) * Gini impurity of the leaf for people who dislike popcorn
= (4/7) * 0.375 + (3/7) * 0.444
= 0.405

Figure 7: Total Gini Impurity for Soda
By repeating the process, we find that the total Gini impurity of Soda is 0.214.
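To make the arithmetic above concrete, here is a minimal Python sketch (the helper names gini_leaf and weighted_total_gini are my own, not from any library) that reproduces the Popcorn and Soda totals:

```python
def gini_leaf(yes, no):
    """Gini impurity of a single leaf: 1 - p(yes)^2 - p(no)^2."""
    total = yes + no
    p_yes, p_no = yes / total, no / total
    return 1 - p_yes**2 - p_no**2

def weighted_total_gini(leaves):
    """Weighted average of leaf impurities; leaves is a list of (yes, no) counts."""
    grand_total = sum(yes + no for yes, no in leaves)
    return sum((yes + no) / grand_total * gini_leaf(yes, no) for yes, no in leaves)

# Popcorn: the "likes popcorn" leaf has 1 Yes / 3 No, the "dislikes" leaf 2 Yes / 1 No
print(round(weighted_total_gini([(1, 3), (2, 1)]), 3))  # 0.405
# Soda: the "likes soda" leaf has 3 Yes / 1 No, the "dislikes" leaf 0 Yes / 3 No
print(round(weighted_total_gini([(3, 1), (0, 3)]), 3))  # 0.214
```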


As for numerical variables like age, the method to calculate the total Gini impurity is demonstrated below.

1) Sort the numerical values in ascending order
Figure 8: Sorting the age in ascending order


2) Get the middle number between each pair of consecutive values

3) Use the middle number as a threshold to split the data. For example, the middle number between 7 and 12 is 9.5, so 9.5 is used as the threshold to split the leaves.

4) Using the same formula, calculate the Gini impurity of each leaf.
Figure 9: Gini Impurity for the age with threshold 9.5

5) Calculate the total Gini impurity of the variable with that threshold.
Figure 10: Calculating the total gini impurity

6) Repeat the process for every middle number
7) The middle number with the lowest total Gini impurity is chosen as the threshold for the split
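This threshold search can be sketched in a few lines of Python. The ages and labels below are made-up values for illustration, not the post's actual figures:

```python
# Candidate thresholds are midpoints between consecutive sorted ages;
# the threshold with the lowest total Gini impurity wins.
ages  = [7, 12, 18, 35, 38, 50, 83]                      # illustrative values
loves = [False, False, True, True, True, False, False]   # Loves Cool as Ice

def total_gini_at(threshold):
    left  = [y for a, y in zip(ages, loves) if a < threshold]
    right = [y for a, y in zip(ages, loves) if a >= threshold]

    def leaf_gini(group):
        if not group:
            return 0.0
        p = sum(group) / len(group)  # fraction of "Yes" in the leaf
        return 1 - p**2 - (1 - p)**2

    n = len(ages)
    return len(left) / n * leaf_gini(left) + len(right) / n * leaf_gini(right)

midpoints = [(a + b) / 2 for a, b in zip(ages, ages[1:])]
best = min(midpoints, key=total_gini_at)
print(best, round(total_gini_at(best), 3))
```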

The Gini impurity of an independent variable also relates to:

  • Entropy: a quantitative measure of the randomness of the information being processed.

A high value of entropy means the randomness in the system is high, which makes accurate predictions tough.

A low value of entropy means the randomness in the system is low, which makes accurate predictions easier.

  • Information gain: a measure of how much information a feature provides about a class.

Low entropy leads to high information gain, and high entropy leads to low information gain.


The relationship between Gini impurity, entropy, and information gain can be summarized as follows:

if the Gini impurity of a split increases, its entropy increases, and its information gain decreases.
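As a rough illustration of these quantities (the counts reuse the Soda split from earlier; the function names are my own), entropy and information gain for a binary split can be computed like this:

```python
import math

def entropy(yes, no):
    """Shannon entropy of a node with the given class counts."""
    total = yes + no
    result = 0.0
    for count in (yes, no):
        if count:
            p = count / total
            result -= p * math.log2(p)
    return result

def information_gain(parent, children):
    """Parent entropy minus the weighted entropy of the child leaves."""
    n = sum(parent)
    weighted = sum((y + m) / n * entropy(y, m) for y, m in children)
    return entropy(*parent) - weighted

# Parent node: 3 "Yes" / 4 "No", split into the two Soda leaves from earlier
print(round(information_gain((3, 4), [(3, 1), (0, 3)]), 3))  # ~0.522
```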

2) Feature Selection and Handling Missing Data

Feature Selection

- Objective: reduce the number of features used, which reduces complexity and prevents overfitting.
- Ensure each split reduces the impurity of the leaf.
Figure 11: Example of decision tree

From the scenario above, we can see that the variable Chest Pain does not assist in impurity reduction.

Figure 12: Example of decision tree

Hence, that variable is removed from the decision tree to reduce its complexity.

Handling missing data

For categorical variables, the missing data can be replaced with the most frequent category within that variable.

Figure 13: Blocked arteries data

For this scenario, there are 2 'No' values and 1 'Yes' for the variable called Blocked Arteries. Hence, the missing data can be replaced with 'No'.

Another alternative involves referring to another variable that is highly correlated with the variable that contains the missing value.

Figure 15: Example of data 

For this scenario, Blocked Arteries is correlated with Chest Pain, and the value of Chest Pain matches the value of Blocked Arteries in the other records. Hence, the missing value can be replaced with 'Yes'.

As for numerical values, the missing values can be replaced with this method.

Figure 16: Example of Data

From the scenario above, we assume the weight and height of a person are highly correlated.

Figure 17: Relationship between weight and height

A linear regression line is fitted using these 2 variables, and the corresponding value given by the best-fit line can be used to replace the missing value.
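A minimal sketch of both imputation strategies using pandas and scikit-learn (the column names are hypothetical stand-ins for the post's variables):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "blocked_arteries": ["No", "Yes", "No", None],  # categorical with a gap
    "height": [1.60, 1.70, 1.80, 1.75],
    "weight": [55.0, 68.0, 80.0, None],             # numerical with a gap
})

# Categorical: replace the missing value with the most frequent category
df["blocked_arteries"] = df["blocked_arteries"].fillna(
    df["blocked_arteries"].mode()[0]
)

# Numerical: regress weight on the correlated height and predict the gap
known = df[df["weight"].notna()]
model = LinearRegression().fit(known[["height"]], known["weight"])
gap = df["weight"].isna()
df.loc[gap, "weight"] = model.predict(df.loc[gap, ["height"]])
print(df)
```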

3) Pruning

- Pruning prevents overfitting the training data so the decision tree will do a better job on the testing data.
- The most common method of pruning is called cost complexity pruning, which calculates the Sum of Squared Residuals (SSR) for the full-size tree and then repeats the calculation for subtrees with fewer leaves.

Figure 18: SSR of a full decision tree

The SSR for the full-size tree is 543.8, as displayed.

Figure 19 and 20: Reduction of SSR

From the figures, we can observe that the SSR increases as the number of leaves in the tree decreases. This is expected, because we want the pruned tree not to fit the training data as well as the full-sized tree does.

We select the pruning result based on the tree that has the lowest tree score, which is calculated using this formula:

Tree Score = SSR + aT

a = alpha, a tuning parameter that we find using cross validation.

How to find the value of a
1) Use all of the data to build trees with different alpha values
2) Use cross validation to compare the alpha values
3) Select the alpha value that gives the lowest SSR on the testing data


T = number of leaves

The relationship between a and T is inverse: as a increases, the penalty on each leaf grows, so the tree with the lowest score has fewer leaves (T decreases).
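In scikit-learn, this kind of post-pruning is exposed through the ccp_alpha parameter. The post's example is a regression tree scored by SSR; the sketch below uses a built-in classification dataset and accuracy instead, purely for brevity:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # stand-in dataset for the sketch

# Enumerate the effective alphas for this dataset
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Cross-validate a tree for each alpha and keep the best one
best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    score = cross_val_score(tree, X, y, cv=5).mean()
    if score > best_score:
        best_alpha, best_score = alpha, score

print(f"best alpha: {best_alpha:.5f}, CV accuracy: {best_score:.3f}")
```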

Parameters that you can tune for decision tree

1. Model Complexity 

  • max_depth: the maximum depth of the decision tree.
  • min_samples_split: the minimum number of samples required to split a node.
  • min_samples_leaf: the minimum number of samples required at a leaf node.
  • max_features: the number of features to consider when searching for the best split.
  • max_leaf_nodes: the maximum number of leaf nodes the tree is allowed to have.
  • min_weight_fraction_leaf: the minimum weighted fraction of the total sample weight required at a leaf node.

2. Splitting Criteria

  • criterion: the parameter that defines how the quality of a split is measured within the decision tree.
  1. Information_gain: selects the split that gives the largest decrease in entropy.
  2. Gain_ratio: adjusts the information gain formula to reduce the bias toward attributes with many distinct values, selecting the attribute with the maximal relative information.
  3. Accuracy: selects the split that results in the best overall accuracy of the decision tree.
  4. Gini_index: measures impurity as the probability of mislabeling a randomly chosen element.
  5. Least_square: computes the squared differences between observed and predicted values, choosing the split with the least total squared error.
  • splitter: the strategy used to choose the split at each node.

3. Miscellaneous Parameters

  • random_state: controls the randomness of the estimator so that results are reproducible.
  • max_samples: regulates the number of samples drawn from the overall data to train each base estimator.
  • class_weight: assigns higher weights to the minority classes in imbalanced datasets.
  • ccp_alpha: all nodes with a cost complexity value less than the specified threshold are pruned.

4. Others

  • Apply pruning: use pruning to eliminate insignificant branches and simplify the model.
  • Confidence: defines the degree of pruning.
  • Apply pre-pruning: stop building the tree when further splits do not contribute to better decision-making.
  • Minimal gain: sets a minimum threshold on the information gain required for a split to be made.
  • Minimal leaf size: defines the least number of instances allowed in a leaf node.
  • Minimal size for split: determines the number of instances that must be present in a node before it can be split.
  • Number of pre-pruning alternatives: sets the number of alternative splits the algorithm examines before pre-pruning the tree.
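Several of the knobs above map directly onto scikit-learn's DecisionTreeClassifier arguments. The values below are illustrative, not recommendations; tune them with cross validation on your own data:

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    criterion="gini",         # or "entropy" for information-gain-style splits
    max_depth=5,              # cap model complexity
    min_samples_split=10,     # pre-pruning: don't split tiny nodes
    min_samples_leaf=4,       # pre-pruning: minimal leaf size
    max_features="sqrt",      # features considered at each split
    ccp_alpha=0.01,           # post-pruning strength (cost complexity)
    class_weight="balanced",  # compensate for imbalanced classes
    random_state=42,          # reproducibility
)
```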

Implementation of decision tree in Python

Import the dataset 

Figure 21: Importing the dataset

Getting a preview of the dataset


Figure 22: Code to see the first 5 records


Figure 23: First 5 records of the dataset

Understanding dataset information


Figure 24: Code to get the data type of each variable

Figure 25: Data type of each variable
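The figures show screenshots of the code; an equivalent minimal sketch (the file name dataset.csv is a hypothetical placeholder) looks like this:

```python
import pandas as pd

# Hypothetical file name; substitute the path to your own dataset
df = pd.read_csv("dataset.csv")

df.head()   # preview the first 5 records
df.info()   # data type and non-null count of each variable
```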

Data transformation using OneHotEncoder and OrdinalEncoder


Figure 26: Performing Ordinal Encoder


Figure 27: Performing One Hot Encoder
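A sketch of the two encoders (the column names risk_level and gender are hypothetical; sparse_output requires scikit-learn 1.2 or newer):

```python
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Ordinal encoding: for a categorical column with a meaningful order
ordinal = OrdinalEncoder(categories=[["low", "medium", "high"]])
df[["risk_level"]] = ordinal.fit_transform(df[["risk_level"]])

# One-hot encoding: for a categorical column with no inherent order
onehot = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
encoded = onehot.fit_transform(df[["gender"]])
df[onehot.get_feature_names_out(["gender"])] = encoded
df = df.drop(columns=["gender"])
```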

Understanding the correlation between features


Figure 28: Code to generate the heatmap


Figure 29: Heatmap of the dataset
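A typical way to generate such a heatmap with seaborn, assuming the encoded DataFrame from the previous step:

```python
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.show()
```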


Determining the dependent and independent variables


Figure 30: Splitting the dataset based on dependent and independent variables

Splitting the data and implementing the decision tree model

Figure 31: Implementing the decision tree model
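A sketch covering this step and the previous one (the target column name target is hypothetical):

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Dependent variable (y) versus independent variables (X)
X = df.drop(columns=["target"])
y = df["target"]

# Hold out 20% of the records for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
```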

Test model performance

Figure 32: Getting the result of the model


Figure 33: Model accuracy
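Continuing the sketch above, the accuracy can be obtained like this:

```python
from sklearn.metrics import accuracy_score

y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
```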

Generate the confusion matrix for the model


Figure 34: Generating the confusion matrix of the model


Figure 35: Confusion matrix
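And the confusion matrix, continuing from the same sketch:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(confusion_matrix=cm).plot()
plt.show()
```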


Advantages and disadvantages of decision tree

Advantages 

  1. Decision trees are intuitive and straightforward, which makes them suitable for explaining the decision-making process to non-technical stakeholders, as the results of decision trees are easy to interpret.
  2. Decision trees require minimal data preprocessing compared to other algorithms. They do not require feature scaling or normalization and can handle both numerical and categorical data.
  3. Decision trees can be applied to both classification and regression, which gives them the capability to handle situations with multiple outputs or features with non-linear relationships.

Disadvantages 

  1. Decision trees can easily overfit when they are deep and complex, because the tree might capture noise and small fluctuations within the data instead of the underlying pattern of the dataset.
  2. Minor changes within the dataset might alter the outcome of the algorithm, making this machine learning model sensitive to variations in the data. This lack of robustness is one of the cons that holds back the performance of the model.
  3. Decision trees are biased towards the classes with more instances when handling imbalanced datasets. Without class weights or other appropriate preprocessing before the model is fitted, the decision tree's results will be biased towards the majority class at the expense of the minority class.

Implementation of decision tree in real life

1. Customer Churn Prediction 

Figure 36: Customer Churn

  • AT&T applies decision trees to predict customer churn. By analyzing customer data such as call patterns and billing information, decision trees predict which customers are most likely to leave. By offering incentives or enhancing customer care for this clientele, AT&T can lower its attrition rate and boost the efficacy of its targeted retention strategy.

2. Fraud Detection 

Figure 37: Fraud detection

  • PayPal detects fraudulent transactions with the assistance of decision trees. Identifying patterns indicative of fraud is made easier by modelling customers' behavior and tracking any unusual actions. The model gains the ability to identify potentially fraudulent behavior and flag it for additional examination, guarding clients against financial losses and keeping their accounts safe from hackers or unauthorized users.

3. Personalized Marketing 

Figure 38: Personalized Marketing

  • Amazon applies decision trees for personalized marketing and recommendation systems. Decision trees can segment customers and predict which products they are interested in by analyzing their preferences, search history, and the amount of time they spend on a particular category. Amazon uses this to provide personalized recommendations and targeted advertisements, which enhances the shopping experience and increases the company's sales.
