RANDOM FOREST
This blog post will provide you with a comprehensive overview of Random Forest, exploring the theory behind this ensemble algorithm and demonstrating its implementation using Python libraries. Dive in to uncover the advantages and disadvantages of Random Forest, as well as its real-world applications across various domains. With that, enjoy your journey in QDO!
WHAT IS RANDOM FOREST

Random Forest is an ensemble machine learning algorithm primarily used for classification and regression tasks. It builds upon the concept of combining multiple decision trees to create a more robust and accurate model. Each tree in the forest is trained on a different subset of the data (using random sampling with replacement, also known as bootstrap sampling) and considers only a random subset of features for each split, which promotes diversity among the trees. During prediction, each tree in the forest casts a "vote," and the class with the majority of votes becomes the model’s final prediction, making it highly resistant to overfitting.
Concept of random forest
Scenario
Today we want to identify whether a patient has heart disease based on the patient's medical records.
Bootstrap Dataset
For example, from the bootstrap dataset above, we can observe that the 3rd and 4th records are the same.
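As a minimal sketch of this step, bootstrap sampling can be reproduced in pandas. The four-record dataset below is a made-up stand-in for the medical records in the scenario:

import pandas as pd

# Hypothetical medical records standing in for the scenario above
records = pd.DataFrame({
    'Chest Pain':       ['No', 'Yes', 'Yes', 'No'],
    'Blocked Arteries': ['No', 'Yes', 'No',  'No'],
    'Weight':           [125,  180,   210,   167],
    'Heart Disease':    ['No', 'Yes', 'Yes', 'No'],
})

# Draw the same number of rows WITH replacement (bootstrap sampling);
# duplicate rows, like the repeated 3rd/4th record above, arise naturally
bootstrap = records.sample(n=len(records), replace=True, random_state=0)
print(bootstrap)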
Bootstrap Aggregating (Bagging)
The same process is repeated n times, depending on the parameter we set when training the model. The overall process, from bootstrapping the dataset to building a decision tree on each sample, is called bootstrap aggregating, or bagging for short. This is how the random forest classifier is built; a minimal sketch of the loop is shown below.
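The following is a rough sketch of the bagging loop, assuming X holds the (numerically encoded) independent variables and y the Heart Disease labels as pandas objects; build_forest and its defaults are illustrative names, not a library API:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def build_forest(X, y, n_trees=10, random_state=0):
    rng = np.random.RandomState(random_state)
    forest = []
    for _ in range(n_trees):
        # Bootstrap: sample row indices with replacement
        idx = rng.randint(0, len(X), size=len(X))
        # max_features='sqrt' makes each split consider a random subset of features
        tree = DecisionTreeClassifier(max_features='sqrt', random_state=rng)
        tree.fit(X.iloc[idx], y.iloc[idx])
        forest.append(tree)
    return forest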
How to determine the performance of random forest
Test with new data
One of the methods to determine the performance of this algorithm is to apply new data to it.
For the record above, we do not know whether the patient has heart disease or not. Hence, we feed its independent variables into each of the decision trees we previously trained.
Assuming we trained our model with 6 decision trees, and 5 of them return Yes while only 1 returns No, we conclude that the record is classified as Yes. We can then compare this with the record's actual boolean value for Heart Disease to determine the model's accuracy.
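As a sketch, the vote count can be reproduced with the forest list from the bagging sketch above, assuming new_record is a one-row DataFrame holding the patient's independent variables:

from collections import Counter

# Each fitted tree casts one vote for the new record
votes = [tree.predict(new_record)[0] for tree in forest]  # e.g. 5 x 'Yes', 1 x 'No'
majority = Counter(votes).most_common(1)[0][0]            # 'Yes' wins the vote
print(majority)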
Out-of-bag dataset
We can use the left-out records, the ones never selected while creating the bootstrap dataset, to test our model. By compiling all of these records into a dataset, we create the out-of-bag dataset.
The rest of the testing process is the same as before.
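scikit-learn can run this out-of-bag evaluation automatically. A small sketch, assuming X and y hold the full training records and labels:

from sklearn.ensemble import RandomForestClassifier

# With oob_score=True, every tree is scored on the records that were
# left out of its own bootstrap sample
rfc = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rfc.fit(X, y)
print(rfc.oob_score_)  # accuracy estimated from the out-of-bag records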
How to deal with missing data
Training dataset
If we encounter missing data within the training dataset, the algorithm can fill in the missing values with the following approaches.
Mode and mean
For the dataset below, we do not know the value of Blocked Arteries or the weight of the patient.
In terms of Blocked Arteries, we can determine the missing boolean value through the mode of the Blocked Arteries column. Since 2 of the 3 known records are No, we assume the unknown value is No. As for the weight, we replace the missing value with the mean of the Weight column.
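A minimal pandas sketch of this initial fill, assuming records is the training DataFrame with the missing entries stored as NaN:

# Fill the missing Blocked Arteries value with the column mode ('No')
records['Blocked Arteries'] = records['Blocked Arteries'].fillna(records['Blocked Arteries'].mode()[0])

# Fill the missing Weight value with the column mean
records['Weight'] = records['Weight'].fillna(records['Weight'].mean())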
Proximity matrix
We can further improve this rough fill by using a proximity matrix to determine which values should replace the missing ones.
We first run the records through each decision tree and determine which leaf node each record ends up in.

Assuming the third and fourth records end up in the same leaf, we fill in the proximity matrix in this manner.

Repeat the process for each decision tree; assuming we have 10 decision trees in total, the end result will be displayed as below.
We divide each value by 10, the total number of decision trees within our random forest.
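As a sketch of how such a proximity matrix could be computed with scikit-learn, assuming rfc is a fitted RandomForestClassifier and X holds the records:

import numpy as np

# rfc.apply(X) returns, for every record, the index of the leaf it lands
# in for each tree: shape (n_records, n_trees)
leaves = rfc.apply(X)
n = leaves.shape[0]
proximity = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        # Fraction of trees in which records i and j share a leaf,
        # i.e. the shared-leaf count already divided by the tree count
        proximity[i, j] = np.mean(leaves[i] == leaves[j])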
Next, we proceed to calculate the weighted frequency of the boolean value using this formula.
Weighted frequency = Frequency of the boolean value * proximity value
For example, since "No" appears twice within the Blocked Arteries column, the frequency for "No" is 2/3.
Assuming the entries (4,1) and (4,3) within the proximity matrix correspond to leaves in which Blocked Arteries is labeled as "No", the proximity value for "No" is 0.1 + 0.8 = 0.9.
Hence,
Weighted frequency for "No" = 2/3 * 0.9
= 0.6
On the other hand, "Yes" appears once within the Blocked Arteries column, so the frequency for "Yes" is 1/3.
Assuming the entry (4,2) within the proximity matrix corresponds to a leaf in which Blocked Arteries is labeled as "Yes", the proximity value for "Yes" is 0.1.
Hence,
Weighted frequency for "Yes" = 1/3 * 0.1
≈ 0.03
Since "No" has the higher weighted frequency, we replace the missing Blocked Arteries value with "No".
As for the missing value for weight, we take a different approach.
Since weight is a numerical value, we replace the missing value with the weighted average.
Weighted average = (125 * 0.1) + (180 * 0.1) + (210 * 0.8)
= 198.5
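The arithmetic above can be checked with a few lines of Python; the numbers are taken directly from the example:

# Weighted frequencies for the missing Blocked Arteries value
freq_no,  prox_no  = 2 / 3, 0.1 + 0.8   # 'No' appears twice; proximities (4,1) and (4,3)
freq_yes, prox_yes = 1 / 3, 0.1         # 'Yes' appears once; proximity (4,2)
print(freq_no * prox_no)    # -> 0.6
print(freq_yes * prox_yes)  # -> ~0.03

# Weighted average for the missing Weight (the proximity weights sum to 1)
print(125 * 0.1 + 180 * 0.1 + 210 * 0.8)  # -> 198.5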
The final cleaned dataset is displayed above.
Extra
Distance matrix
The distance matrix measures how far apart records are from one another in terms of similarity and is calculated as
Distance matrix = 1 - Proximity matrix
The distance matrix of the sample dataset is displayed above. For visualization, we can plot it as a heatmap and obtain the figure below.
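A small plotting sketch using seaborn, assuming proximity is the matrix computed earlier:

import seaborn as sns
import matplotlib.pyplot as plt

# Distance is simply 1 minus the proximity matrix
distance = 1 - proximity
sns.heatmap(distance, annot=True, cmap='viridis')
plt.show()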
Testing dataset
But what happens if the missing values occur within our testing dataset?
From the example above, we can see that Blocked Arteries contains missing values.
Hence, we prepare two versions of the record, one for each possible value of Blocked Arteries, as below.
The version with the highest accuracy rate across the trees is taken to replace the missing value.
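One rough way to sketch this idea with scikit-learn, assuming rfc is the fitted forest and test_record is a one-row DataFrame with categorical columns already encoded numerically; here the forest's highest class probability stands in for each version's "accuracy rate":

candidates = {}
for value in ['Yes', 'No']:
    version = test_record.copy()
    version['Blocked Arteries'] = value  # try each possible value in turn
    # Fraction of trees agreeing on a class, used as a confidence score
    candidates[value] = rfc.predict_proba(version)[0].max()

# Keep the value whose version the forest classifies most confidently
best_value = max(candidates, key=candidates.get)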
Implementation of random forest in python
Importing libraries
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
Loading dataset
wine=pd.read_csv('C:/Users/User/Desktop/Dataset_example/winequality-red.csv',sep=',')
Determining the dependent and independent variables
X=wine.drop('quality',axis=1)
Y=wine['quality']
Splitting the dataset into testing and training
x_train,x_test,y_train,y_test=train_test_split(X,Y,test_size=0.2,random_state=42)
Applying the model
rfc=RandomForestClassifier(n_estimators=200)
rfc.fit(x_train,y_train)
Get prediction result
pred_rfc=rfc.predict(x_test)
Get prediction accuracy
accuracy=accuracy_score(y_test, pred_rfc)
print(classification_report(y_test, pred_rfc))
              precision    recall  f1-score   support

           0       0.92      0.97      0.94       273
           1       0.73      0.51      0.60        47

    accuracy                           0.90       320
   macro avg       0.82      0.74      0.77       320
weighted avg       0.89      0.90      0.89       320
Parameters that you can tune in random forest
- Number of Trees (n_estimators): This parameter determines the total number of decision trees in the forest. More trees typically enhance the model's robustness and accuracy but also increase computational time. The optimal number balances accuracy with efficiency, as too few trees may underfit while too many can lead to diminishing returns on accuracy.
- Maximum Depth of Trees (max_depth): This controls how deep each tree in the forest can grow. Limiting the depth prevents trees from becoming overly complex, which can reduce overfitting by simplifying each individual tree. Deeper trees may capture more details in the data but can lead to overfitting, especially with noisy datasets.
- Minimum Samples per Split (min_samples_split) and per Leaf (min_samples_leaf): These parameters control how many samples are needed to make a split and how many samples a node must have to become a leaf. Higher values for these parameters result in trees with fewer branches, reducing the model's complexity and likelihood of overfitting.
- Maximum Number of Features (max_features): This parameter defines how many features to consider when splitting a node. Lower values can lead to more diverse trees because each tree is more likely to use different subsets of features. However, setting it too low may reduce the model's overall predictive power.
- Bootstrap Sampling (bootstrap): This boolean parameter determines whether each tree is trained on a randomly sampled subset of the data (with replacement). Enabling bootstrap sampling promotes diversity among trees, leading to improved generalization. A short tuning sketch using these parameters follows this list.
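As a sketch of tuning these parameters together, here is a small grid search over the training split from the implementation section above; the grid values are illustrative choices, not recommendations:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200, 500],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    'max_features': ['sqrt', 'log2'],
}

# 5-fold cross-validated search over the parameter grid
search = GridSearchCV(RandomForestClassifier(bootstrap=True), param_grid, cv=5)
search.fit(x_train, y_train)
print(search.best_params_)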
Advantages and disadvantages of random forest
Advantages
High Accuracy and Robustness:
- Random Forest typically provides high accuracy due to the combination of multiple decision trees, reducing the risk of overfitting.
- It performs well on large datasets and complex classification tasks, handling both categorical and continuous data effectively.
Feature Importance:
- Random Forest can provide insights into the relative importance of features, which is useful for feature selection and understanding the underlying patterns in data.
Works Well with Missing Data and Noise:
- The algorithm is resilient to noise in the dataset, and individual decision trees can handle missing values independently, which makes it robust and versatile for real-world datasets.
Disadvantages
High Computational Cost:
- Training a large number of trees can be computationally expensive, especially with large datasets, which can lead to longer training times and higher memory usage.
Lack of Interpretability:
- While it can show feature importance, the overall model can be difficult to interpret as it’s an ensemble of many decision trees. This makes it a “black box” compared to simpler models like linear regression or decision trees.
Risk of Overfitting with Large Trees:
- Although Random Forest is generally resistant to overfitting, there is still a risk if too many trees are used, especially if trees are allowed to grow too deep without regularization.
Implementation of random forest in real life
Recommendation Engine and Review Analysis
Amazon uses Random Forest to enhance its recommendation engine and analyze customer reviews. Random Forest helps Amazon recommend products by classifying user behavior data and identifying patterns based on past purchases, browsing history, and similar users' behavior.
Fraud Detection and Demand Forecasting
Uber applies Random Forest in fraud detection and predicting ride demand. In fraud detection, Random Forest models classify rides as "fraudulent" or "legitimate" based on factors like payment method, location, and user history. For demand forecasting, Uber analyzes historical demand patterns, weather data, and local events to predict high-demand areas and times.
Price Optimization and Listing Quality
Airbnb uses Random Forest for dynamic pricing and assessing the quality of listings. Random Forest models consider various factors like location, seasonality, amenities, and local competition to recommend optimal prices for hosts. Additionally, it evaluates listing quality by classifying elements like photos, reviews, and descriptions, helping Airbnb suggest improvements.