


RANDOM FOREST


This blog post will provide you with a comprehensive overview of Random Forest, exploring the theory behind this ensemble algorithm and demonstrating its implementation using Python libraries. Dive in to uncover the advantages and disadvantages of Random Forest, as well as its real-world applications across various domains. With that, enjoy your journey in QDO!

WHAT IS RANDOM FOREST


Random Forest is an ensemble machine learning algorithm primarily used for classification and regression tasks. It builds upon the concept of combining multiple decision trees to create a more robust and accurate model. Each tree in the forest is trained on a different subset of the data (using random sampling with replacement, also known as bootstrap sampling) and considers only a random subset of features for each split, which promotes diversity among the trees. During prediction, each tree in the forest casts a "vote," and the class with the majority of votes becomes the model’s final prediction, making it highly resistant to overfitting.


Concept of random forest

Scenario

Today we want to determine whether a patient has heart disease based on the patient's medical records.


Bootstrap Dataset

Let's start by randomly selecting 4 records from the sample dataset. Because a bootstrap dataset is sampled randomly with replacement, it may contain duplicate records (the same record can be selected more than once).



For example, from the bootstrap dataset above, we can observe that the 3rd and 4th records are the same.
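The sampling step above can be sketched in a few lines of Python. This is a minimal illustration using hypothetical record labels, not the actual medical dataset:

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical stand-ins for the 4 records in the sample dataset
records = ["record_1", "record_2", "record_3", "record_4"]

# Bootstrap sampling: draw as many records as the dataset holds, WITH replacement
bootstrap = [random.choice(records) for _ in range(len(records))]
print(bootstrap)
```

Because every draw is independent, the same record can appear more than once, exactly like the duplicated 3rd and 4th records above.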

Bootstrap Aggregating (Bagging)

Now we randomly select a subset of attributes from the bootstrap dataset and build a decision tree from it.



The same process is repeated n times, where n depends on the parameter we set when training the model. The whole process, from bootstrapping the dataset to building a decision tree on it, is called bootstrap aggregating, or bagging for short. This is how a random forest classifier is built.
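The bagging loop can be sketched as follows. This is a sketch on hypothetical random data; a real implementation would fit a decision tree on each bootstrap view rather than just recording the indices:

```python
import numpy as np

rng = np.random.default_rng(42)

n_samples, n_features, n_trees = 100, 6, 5
X = rng.normal(size=(n_samples, n_features))  # hypothetical feature matrix
y = rng.integers(0, 2, size=n_samples)        # hypothetical Yes/No labels

forest = []
for _ in range(n_trees):
    rows = rng.integers(0, n_samples, size=n_samples)      # bootstrap rows (with replacement)
    feats = rng.choice(n_features, size=2, replace=False)  # random attribute subset
    # A decision tree would now be trained on X[rows][:, feats] and y[rows]
    forest.append((rows, feats))

print(len(forest))  # → 5
```

Each tree sees a different slice of rows and columns, which is what gives the forest its diversity.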

How to determine the performance of random forest

Test with new data

One way to measure the performance of this algorithm is to apply new data to it.


For the record above, we do not know whether the patient has heart disease. Hence, we feed its independent variables into each of the decision trees we previously trained.


Assume we trained our model with 6 decision trees: 5 of them return Yes and only 1 returns No. We therefore classify the record as Yes, and we can compare this prediction with the record's actual Heart Disease value to measure accuracy.
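The majority vote can be sketched with the standard library, using hypothetical votes matching the 5-to-1 split above:

```python
from collections import Counter

# Hypothetical votes from the 6 trained decision trees for one new record
votes = ["Yes", "Yes", "Yes", "No", "Yes", "Yes"]

# The class with the most votes becomes the forest's prediction
prediction = Counter(votes).most_common(1)[0][0]
print(prediction)  # → Yes
```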

Out-of-bag dataset

We can use the left-out records that were not selected for the bootstrap dataset to test our model. Compiling all these records into a dataset gives us the out-of-bag dataset.
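Identifying the out-of-bag records for one tree can be sketched like this, again on hypothetical record indices:

```python
import random

random.seed(0)
indices = list(range(6))  # 6 records in the original dataset

# Bootstrap sample drawn for one tree (with replacement)
chosen = [random.choice(indices) for _ in indices]

# Out-of-bag records: the ones never drawn for this tree
oob = sorted(set(indices) - set(chosen))
print("in-bag:", sorted(set(chosen)), "out-of-bag:", oob)
```

In scikit-learn, passing `oob_score=True` to `RandomForestClassifier` makes the model compute this out-of-bag accuracy for you, exposed as `oob_score_` after fitting.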


The rest of the testing process is the same as before.

How to deal with missing data

Training dataset

If we encounter missing data within the training dataset, the algorithm fills it in using the following approaches.

Mode and mean

For the dataset below, we do not know the value of Blocked Arteries or the weight of the patient.


For Blocked Arteries, we can determine the boolean value from the mode of the column. Since 2 of the 3 known Blocked Arteries values are No, we assume the missing value is No. As for the weight, we replace the missing value with the mean of the known weights.


The cleaned dataset is presented as above.
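Both imputations can be sketched with pandas, assuming a hypothetical four-record frame mirroring the example:

```python
import pandas as pd

# Hypothetical training records with the two missing values from the example
df = pd.DataFrame({
    "Blocked Arteries": ["No", "Yes", "No", None],
    "Weight": [125.0, 180.0, 210.0, None],
})

# Categorical column: fill with the mode ("No" appears in 2 of the 3 known values)
df["Blocked Arteries"] = df["Blocked Arteries"].fillna(df["Blocked Arteries"].mode()[0])
# Numerical column: fill with the mean of the known weights
df["Weight"] = df["Weight"].fillna(df["Weight"].mean())

print(df)
```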


Proximity matrix

We can further improve this imputation by using a proximity matrix to decide which value should replace each missing one.

We first run the records through a decision tree and note which leaf node each record ends up in.

Assuming the third and fourth records end up in the same leaf, we fill in the proximity matrix accordingly.

Repeat the process for each decision tree; assume we have 10 decision trees in total. The end result is shown below.


We divide each value by 10, the total number of decision trees in our random forest.


Next, we proceed to calculate the weighted frequency of the boolean value using this formula.

Weighted frequency = Frequency of the boolean value * proximity value 

For example, 

Since "No" appears 2 times among the 3 known Blocked Arteries values, the frequency of "No" is 2/3.

Assume the entries (4,1) and (4,3) of the proximity matrix correspond to leaves in which Blocked Arteries is labeled "No". The proximity value for "No" is therefore 0.1 + 0.8 = 0.9.


Hence, 

Weighted frequency for "No" = 2/3 × 0.9 = 0.6



On the other hand, "Yes" appears 1 time among the known Blocked Arteries values, so the frequency of "Yes" is 1/3.

Assume the entry (4,2) of the proximity matrix corresponds to a leaf in which Blocked Arteries is labeled "Yes". The proximity value for "Yes" is therefore 0.1.


Hence, 

Weighted frequency for "Yes" = 1/3 × 0.1 ≈ 0.03


Since the weighted frequency for "No" is higher, we replace the missing value with "No".
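The whole comparison can be written out as plain arithmetic, using the worked numbers from the example above:

```python
# Worked numbers from the proximity-matrix example
freq_no  = 2 / 3          # "No" appears in 2 of the 3 known records
prox_no  = 0.1 + 0.8      # proximity entries (4,1) and (4,3)
freq_yes = 1 / 3          # "Yes" appears in 1 of the 3 known records
prox_yes = 0.1            # proximity entry (4,2)

wf_no  = freq_no * prox_no    # 0.6
wf_yes = freq_yes * prox_yes  # ≈ 0.03

# The class with the larger weighted frequency fills the missing value
fill_value = "No" if wf_no > wf_yes else "Yes"
print(fill_value)  # → No
```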


As for the missing weight value, we take a different approach.


Since weight is a numerical value, we replace the missing value with the weighted average. 

Weighted average = (125 × 0.1) + (180 × 0.1) + (210 × 0.8) = 198.5
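The same calculation as a one-liner, with the weights and proximity values from the example:

```python
weights = [125, 180, 210]       # known Weight values
proximities = [0.1, 0.1, 0.8]   # proximity values from the matrix

# Proximity-weighted average used to fill the missing numerical value
weighted_avg = sum(w * p for w, p in zip(weights, proximities))
print(weighted_avg)  # ≈ 198.5
```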



The final cleaned dataset is displayed as above.

Extra

Distance matrix

The distance matrix measures how far apart records are in terms of similarity, and is calculated as:

Distance matrix = 1 − Proximity matrix


The distance matrix of the sample dataset is shown above. For visualization, we can plot it as a heatmap, as shown below.
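The conversion is a single element-wise subtraction. A sketch on a hypothetical 4×4 proximity matrix (values already divided by the tree count):

```python
import numpy as np

# Hypothetical proximity matrix for 4 records; 1.0 on the diagonal
proximity = np.array([
    [1.0, 0.1, 0.2, 0.1],
    [0.1, 1.0, 0.3, 0.1],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
])

distance = 1 - proximity  # identical records end up at distance 0
print(distance)
```

The resulting matrix can be passed to a plotting library such as seaborn's `heatmap` for the visualization.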


Testing dataset

But what happens if missing values occur within our testing dataset?


From the example above, we can see that Blocked Arteries contains a missing value. Hence, we prepare 2 versions of the record, as shown below.


The version with the highest accuracy rate is used to replace the missing value.
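Building the two candidate versions can be sketched like this, using a hypothetical test record (the field names mirror the example, but the values are made up):

```python
# Hypothetical test record whose Blocked Arteries value is missing
record = {"Chest Pain": "Yes", "Blocked Arteries": None, "Weight": 168.0}

# One candidate version per possible label
candidates = [{**record, "Blocked Arteries": v} for v in ("Yes", "No")]
for c in candidates:
    print(c)
```

Each candidate is then run through the forest, and the version the trees classify correctly most often is kept.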

Implementation of random forest in python

Importing libraries 

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

Loading dataset

wine=pd.read_csv('C:/Users/User/Desktop/Dataset_example/winequality-red.csv',sep=',')

Determining dependent and independent variable

X=wine.drop('quality',axis=1)
Y=wine['quality']

Splitting the dataset into testing and training 

x_train,x_test,y_train,y_test=train_test_split(X,Y,test_size=0.2,random_state=42)

Applying the model

rfc=RandomForestClassifier(n_estimators=200)
rfc.fit(x_train,y_train)

Get prediction result

pred_rfc=rfc.predict(x_test)

Get prediction accuracy

accuracy=accuracy_score(y_test, pred_rfc)

              precision    recall  f1-score   support

           0       0.92      0.97      0.94       273
           1       0.73      0.51      0.60        47

    accuracy                           0.90       320
   macro avg       0.82      0.74      0.77       320
weighted avg       0.89      0.90      0.89       320
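A report in this format is what scikit-learn's `classification_report` prints. Its precision, recall, and f1-score columns come from confusion-matrix counts, which can be reproduced by hand. A stdlib sketch on toy labels (hypothetical values, not the wine-quality results above):

```python
# Toy labels standing in for y_test and the model's predictions
y_true = [0, 0, 1, 1, 0, 1]
y_hat  = [0, 0, 1, 0, 0, 1]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_hat))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_hat))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_hat))  # false negatives

precision = tp / (tp + fp)                          # of predicted 1s, how many were right
recall = tp / (tp + fn)                             # of actual 1s, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
print(precision, round(recall, 3), round(f1, 3))
```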

Parameters that you can tune in random forest

  • Number of Trees (n_estimators): This parameter determines the total number of decision trees in the forest. More trees typically enhance the model's robustness and accuracy but also increase computational time. The optimal number balances accuracy with efficiency, as too few trees may underfit while too many can lead to diminishing returns on accuracy.
  • Maximum Depth of Trees (max_depth): This controls how deep each tree in the forest can grow. Limiting the depth prevents trees from becoming overly complex, which can reduce overfitting by simplifying each individual tree. Deeper trees may capture more details in the data but can lead to overfitting, especially with noisy datasets.
  • Minimum Samples per Split (min_samples_split) and per Leaf (min_samples_leaf): These parameters control how many samples are needed to make a split and how many samples a node must have to become a leaf. Higher values for these parameters result in trees with fewer branches, reducing the model's complexity and likelihood of overfitting. 
  • Maximum Number of Features (max_features): This parameter defines how many features to consider when splitting a node. Lower values can lead to more diverse trees because each tree is more likely to use different subsets of features. However, setting it too low may reduce the model’s overall predictive power. 
  • Bootstrap Sampling (bootstrap): This boolean parameter determines whether each tree is trained on a randomly sampled subset of the data (with replacement). Enabling bootstrap sampling promotes diversity among trees, leading to improved generalization. 
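Putting the parameters above together, a tuned classifier might be constructed as follows. The values here are illustrative assumptions, not recommendations; tune them against your own dataset (for example with cross-validation):

```python
from sklearn.ensemble import RandomForestClassifier

# Illustrative hyperparameter values only
rfc = RandomForestClassifier(
    n_estimators=200,       # number of trees in the forest
    max_depth=10,           # cap tree depth to curb overfitting
    min_samples_split=4,    # samples required to split an internal node
    min_samples_leaf=2,     # samples required at a leaf node
    max_features="sqrt",    # attributes considered at each split
    bootstrap=True,         # train each tree on a bootstrap sample
    random_state=42,        # reproducible results
)
print(rfc.get_params()["n_estimators"])  # → 200
```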

Advantages and disadvantages of random forest

Advantages

  • High Accuracy and Robustness:

    • Random Forest typically provides high accuracy due to the combination of multiple decision trees, reducing the risk of overfitting.
    • It performs well on large datasets and complex classification tasks, handling both categorical and continuous data effectively.
  • Feature Importance:

    • Random Forest can provide insights into the relative importance of features, which is useful for feature selection and understanding the underlying patterns in data.
  • Works Well with Missing Data and Noise:

    • The algorithm is resilient to noise in the dataset, and individual decision trees can handle missing values independently, which makes it robust and versatile for real-world datasets.

Disadvantages

  • High Computational Cost:

    • Training a large number of trees can be computationally expensive, especially with large datasets, which can lead to longer training times and higher memory usage.
  • Lack of Interpretability:

    • While it can show feature importance, the overall model can be difficult to interpret as it’s an ensemble of many decision trees. This makes it a “black box” compared to simpler models like linear regression or decision trees.
  • Risk of Overfitting with Large Trees:

    • Although Random Forest is generally resistant to overfitting, there is still a risk if too many trees are used, especially if trees are allowed to grow too deep without regularization.

Implementation of random forest in real life

Recommendation Engine and Review Analysis




Amazon uses Random Forest to enhance its recommendation engine and analyze customer reviews. Random Forest helps Amazon recommend products by classifying user behavior data and identifying patterns based on past purchases, browsing history, and similar users' behavior.

Fraud Detection and Demand Forecasting


Uber applies Random Forest in fraud detection and predicting ride demand. In fraud detection, Random Forest models classify rides as "fraudulent" or "legitimate" based on factors like payment method, location, and user history. For demand forecasting, Uber analyzes historical demand patterns, weather data, and local events to predict high-demand areas and times.

Price Optimization and Listing Quality



Airbnb uses Random Forest for dynamic pricing and assessing the quality of listings. Random Forest models consider various factors like location, seasonality, amenities, and local competition to recommend optimal prices for hosts. Additionally, it evaluates listing quality by classifying elements like photos, reviews, and descriptions, helping Airbnb suggest improvements.

