


RANDOM FOREST


This blog post will provide you with a comprehensive overview of Random Forest, exploring the theory behind this ensemble algorithm and demonstrating its implementation using Python libraries. Dive in to uncover the advantages and disadvantages of Random Forest, as well as its real-world applications across various domains. With that, enjoy your journey in QDO!

WHAT IS RANDOM FOREST


Random Forest is an ensemble machine learning algorithm primarily used for classification and regression tasks. It builds upon the concept of combining multiple decision trees to create a more robust and accurate model. Each tree in the forest is trained on a different subset of the data (using random sampling with replacement, also known as bootstrap sampling) and considers only a random subset of features for each split, which promotes diversity among the trees. During prediction, each tree in the forest casts a "vote," and the class with the majority of votes becomes the model’s final prediction, making it highly resistant to overfitting.


Concept of random forest

Scenario

Today we want to determine whether a patient has heart disease based on the patient's medical records.


Bootstrap Dataset

Let's start by randomly selecting 4 records from the sample dataset. Because a bootstrap dataset is sampled randomly with replacement, it may contain duplicate records (the same record can be selected more than once).



For example, from the bootstrap dataset above, we can observe that the 3rd and 4th records are the same.
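The sampling step above can be sketched in a few lines of Python. This is a minimal illustration using hypothetical record labels, not the actual medical dataset:

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical stand-ins for the 4 records in the sample dataset
records = ["record_1", "record_2", "record_3", "record_4"]

# Bootstrap sampling: draw as many records as the dataset holds, WITH replacement
bootstrap = [random.choice(records) for _ in range(len(records))]
print(bootstrap)
```

Because every draw is independent, the same record can appear more than once, exactly like the duplicated 3rd and 4th records above.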

Bootstrap Aggregating (Bagging)

Now we randomly select a subset of attributes from the bootstrap dataset and build a decision tree from it.



The same process is repeated n times, where n depends on the parameter we set when training the model. The whole process, from bootstrapping the dataset to building a decision tree on it, is called bootstrap aggregating, or bagging for short. This is how a random forest classifier is built.
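The bagging loop can be sketched as follows. This is a sketch on hypothetical random data; a real implementation would fit a decision tree on each bootstrap view rather than just recording the indices:

```python
import numpy as np

rng = np.random.default_rng(42)

n_samples, n_features, n_trees = 100, 6, 5
X = rng.normal(size=(n_samples, n_features))  # hypothetical feature matrix
y = rng.integers(0, 2, size=n_samples)        # hypothetical Yes/No labels

forest = []
for _ in range(n_trees):
    rows = rng.integers(0, n_samples, size=n_samples)      # bootstrap rows (with replacement)
    feats = rng.choice(n_features, size=2, replace=False)  # random attribute subset
    # A decision tree would now be trained on X[rows][:, feats] and y[rows]
    forest.append((rows, feats))

print(len(forest))  # → 5
```

Each tree sees a different slice of rows and columns, which is what gives the forest its diversity.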

How to determine the performance of random forest

Test with new data

One way to measure the performance of this algorithm is to apply new data to it.


For the record above, we do not know whether the patient has heart disease. Hence, we feed its independent variables into each of the decision trees we previously trained.


Assume we trained our model with 6 decision trees: 5 of them return Yes and only 1 returns No. We therefore classify the record as Yes, and we can compare this prediction with the record's actual Heart Disease value to measure accuracy.
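The majority vote can be sketched with the standard library, using hypothetical votes matching the 5-to-1 split above:

```python
from collections import Counter

# Hypothetical votes from the 6 trained decision trees for one new record
votes = ["Yes", "Yes", "Yes", "No", "Yes", "Yes"]

# The class with the most votes becomes the forest's prediction
prediction = Counter(votes).most_common(1)[0][0]
print(prediction)  # → Yes
```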

Out-of-bag dataset

We can use the left-out records that were not selected for the bootstrap dataset to test our model. Compiling all these records into a dataset gives us the out-of-bag dataset.
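Identifying the out-of-bag records for one tree can be sketched like this, again on hypothetical record indices:

```python
import random

random.seed(0)
indices = list(range(6))  # 6 records in the original dataset

# Bootstrap sample drawn for one tree (with replacement)
chosen = [random.choice(indices) for _ in indices]

# Out-of-bag records: the ones never drawn for this tree
oob = sorted(set(indices) - set(chosen))
print("in-bag:", sorted(set(chosen)), "out-of-bag:", oob)
```

In scikit-learn, passing `oob_score=True` to `RandomForestClassifier` makes the model compute this out-of-bag accuracy for you, exposed as `oob_score_` after fitting.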


The rest of the testing process is the same as before.

How to deal with missing data

Training dataset

If we encounter missing data within the training dataset, the algorithm fills it in using the following approaches.

Mode and mean

For the dataset below, we do not know the value of Blocked Arteries or the weight of the patient.


For Blocked Arteries, we can determine the boolean value from the mode of the column. Since 2 of the 3 known Blocked Arteries values are No, we assume the missing value is No. As for the weight, we replace the missing value with the mean of the known weights.


The cleaned dataset is presented as above.
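Both imputations can be sketched with pandas, assuming a hypothetical four-record frame mirroring the example:

```python
import pandas as pd

# Hypothetical training records with the two missing values from the example
df = pd.DataFrame({
    "Blocked Arteries": ["No", "Yes", "No", None],
    "Weight": [125.0, 180.0, 210.0, None],
})

# Categorical column: fill with the mode ("No" appears in 2 of the 3 known values)
df["Blocked Arteries"] = df["Blocked Arteries"].fillna(df["Blocked Arteries"].mode()[0])
# Numerical column: fill with the mean of the known weights
df["Weight"] = df["Weight"].fillna(df["Weight"].mean())

print(df)
```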


Proximity matrix

We can further improve this imputation by using a proximity matrix to decide which value should replace each missing one.

We first run the records through a decision tree and note which leaf node each record ends up in.

Assuming the third and fourth records end up in the same leaf, we fill in the proximity matrix accordingly.

Repeat the process for each decision tree; assume we have 10 decision trees in total. The end result is shown below.


We divide each value by 10, the total number of decision trees in our random forest.


Next, we proceed to calculate the weighted frequency of the boolean value using this formula.

Weighted frequency = Frequency of the boolean value * proximity value 

For example, 

Since "No" appears 2 times among the 3 known Blocked Arteries values, the frequency of "No" is 2/3.

Assume the entries (4,1) and (4,3) of the proximity matrix correspond to leaves in which Blocked Arteries is labeled "No". The proximity value for "No" is therefore 0.1 + 0.8 = 0.9.


Hence, 

Weighted frequency for "No" = 2/3 × 0.9 = 0.6



On the other hand, "Yes" appears 1 time among the known Blocked Arteries values, so the frequency of "Yes" is 1/3.

Assume the entry (4,2) of the proximity matrix corresponds to a leaf in which Blocked Arteries is labeled "Yes". The proximity value for "Yes" is therefore 0.1.


Hence, 

Weighted frequency for "Yes" = 1/3 × 0.1 ≈ 0.03


Since the weighted frequency for "No" is higher, we replace the missing value with "No".
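The whole comparison can be written out as plain arithmetic, using the worked numbers from the example above:

```python
# Worked numbers from the proximity-matrix example
freq_no  = 2 / 3          # "No" appears in 2 of the 3 known records
prox_no  = 0.1 + 0.8      # proximity entries (4,1) and (4,3)
freq_yes = 1 / 3          # "Yes" appears in 1 of the 3 known records
prox_yes = 0.1            # proximity entry (4,2)

wf_no  = freq_no * prox_no    # 0.6
wf_yes = freq_yes * prox_yes  # ≈ 0.03

# The class with the larger weighted frequency fills the missing value
fill_value = "No" if wf_no > wf_yes else "Yes"
print(fill_value)  # → No
```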


As for the missing weight value, we take a different approach.


Since weight is a numerical value, we replace the missing value with the weighted average. 

Weighted average = (125 × 0.1) + (180 × 0.1) + (210 × 0.8) = 198.5
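The same calculation as a one-liner, with the weights and proximity values from the example:

```python
weights = [125, 180, 210]       # known Weight values
proximities = [0.1, 0.1, 0.8]   # proximity values from the matrix

# Proximity-weighted average used to fill the missing numerical value
weighted_avg = sum(w * p for w, p in zip(weights, proximities))
print(weighted_avg)  # ≈ 198.5
```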



The final cleaned dataset is displayed as above.

Extra

Distance matrix

The distance matrix measures how far apart records are in terms of similarity, and is calculated as:

Distance matrix = 1 − Proximity matrix


The distance matrix of the sample dataset is shown above. For visualization, we can plot it as a heatmap, as shown below.
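The conversion is a single element-wise subtraction. A sketch on a hypothetical 4×4 proximity matrix (values already divided by the tree count):

```python
import numpy as np

# Hypothetical proximity matrix for 4 records; 1.0 on the diagonal
proximity = np.array([
    [1.0, 0.1, 0.2, 0.1],
    [0.1, 1.0, 0.3, 0.1],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
])

distance = 1 - proximity  # identical records end up at distance 0
print(distance)
```

The resulting matrix can be passed to a plotting library such as seaborn's `heatmap` for the visualization.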


Testing dataset

But what happens if missing values occur within our testing dataset?


From the example above, we can see that Blocked Arteries contains a missing value. Hence, we prepare 2 versions of the record, as shown below.


The version with the highest accuracy rate is used to replace the missing value.
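Building the two candidate versions can be sketched like this, using a hypothetical test record (the field names mirror the example, but the values are made up):

```python
# Hypothetical test record whose Blocked Arteries value is missing
record = {"Chest Pain": "Yes", "Blocked Arteries": None, "Weight": 168.0}

# One candidate version per possible label
candidates = [{**record, "Blocked Arteries": v} for v in ("Yes", "No")]
for c in candidates:
    print(c)
```

Each candidate is then run through the forest, and the version the trees classify correctly most often is kept.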

Implementation of random forest in python

Importing libraries 

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

Loading dataset

wine=pd.read_csv('C:/Users/User/Desktop/Dataset_example/winequality-red.csv',sep=',')

Determining dependent and independent variable

X=wine.drop('quality',axis=1)
Y=wine['quality']

Splitting the dataset into testing and training 

x_train,x_test,y_train,y_test=train_test_split(X,Y,test_size=0.2,random_state=42)

Applying the model

rfc=RandomForestClassifier(n_estimators=200)
rfc.fit(x_train,y_train)

Get prediction result

pred_rfc=rfc.predict(x_test)

Get prediction accuracy

accuracy=accuracy_score(y_test, pred_rfc)

              precision    recall  f1-score   support

           0       0.92      0.97      0.94       273
           1       0.73      0.51      0.60        47

    accuracy                           0.90       320
   macro avg       0.82      0.74      0.77       320
weighted avg       0.89      0.90      0.89       320
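A report in this format is what scikit-learn's `classification_report` prints. Its precision, recall, and f1-score columns come from confusion-matrix counts, which can be reproduced by hand. A stdlib sketch on toy labels (hypothetical values, not the wine-quality results above):

```python
# Toy labels standing in for y_test and the model's predictions
y_true = [0, 0, 1, 1, 0, 1]
y_hat  = [0, 0, 1, 0, 0, 1]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_hat))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_hat))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_hat))  # false negatives

precision = tp / (tp + fp)                          # of predicted 1s, how many were right
recall = tp / (tp + fn)                             # of actual 1s, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
print(precision, round(recall, 3), round(f1, 3))
```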

Parameters that you can tune in random forest

  • Number of Trees (n_estimators): This parameter determines the total number of decision trees in the forest. More trees typically enhance the model's robustness and accuracy but also increase computational time. The optimal number balances accuracy with efficiency, as too few trees may underfit while too many can lead to diminishing returns on accuracy.
  • Maximum Depth of Trees (max_depth): This controls how deep each tree in the forest can grow. Limiting the depth prevents trees from becoming overly complex, which can reduce overfitting by simplifying each individual tree. Deeper trees may capture more details in the data but can lead to overfitting, especially with noisy datasets.
  • Minimum Samples per Split (min_samples_split) and per Leaf (min_samples_leaf): These parameters control how many samples are needed to make a split and how many samples a node must have to become a leaf. Higher values for these parameters result in trees with fewer branches, reducing the model's complexity and likelihood of overfitting. 
  • Maximum Number of Features (max_features): This parameter defines how many features to consider when splitting a node. Lower values can lead to more diverse trees because each tree is more likely to use different subsets of features. However, setting it too low may reduce the model’s overall predictive power. 
  • Bootstrap Sampling (bootstrap): This boolean parameter determines whether each tree is trained on a randomly sampled subset of the data (with replacement). Enabling bootstrap sampling promotes diversity among trees, leading to improved generalization. 
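Putting the parameters above together, a tuned classifier might be constructed as follows. The values here are illustrative assumptions, not recommendations; tune them against your own dataset (for example with cross-validation):

```python
from sklearn.ensemble import RandomForestClassifier

# Illustrative hyperparameter values only
rfc = RandomForestClassifier(
    n_estimators=200,       # number of trees in the forest
    max_depth=10,           # cap tree depth to curb overfitting
    min_samples_split=4,    # samples required to split an internal node
    min_samples_leaf=2,     # samples required at a leaf node
    max_features="sqrt",    # attributes considered at each split
    bootstrap=True,         # train each tree on a bootstrap sample
    random_state=42,        # reproducible results
)
print(rfc.get_params()["n_estimators"])  # → 200
```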

Advantages and disadvantages of random forest

Advantages

  • High Accuracy and Robustness:

    • Random Forest typically provides high accuracy due to the combination of multiple decision trees, reducing the risk of overfitting.
    • It performs well on large datasets and complex classification tasks, handling both categorical and continuous data effectively.
  • Feature Importance:

    • Random Forest can provide insights into the relative importance of features, which is useful for feature selection and understanding the underlying patterns in data.
  • Works Well with Missing Data and Noise:

    • The algorithm is resilient to noise in the dataset, and individual decision trees can handle missing values independently, which makes it robust and versatile for real-world datasets.

Disadvantages

  • High Computational Cost:

    • Training a large number of trees can be computationally expensive, especially with large datasets, which can lead to longer training times and higher memory usage.
  • Lack of Interpretability:

    • While it can show feature importance, the overall model can be difficult to interpret as it’s an ensemble of many decision trees. This makes it a “black box” compared to simpler models like linear regression or decision trees.
  • Risk of Overfitting with Large Trees:

    • Although Random Forest is generally resistant to overfitting, there is still a risk if too many trees are used, especially if trees are allowed to grow too deep without regularization.

Implementation of random forest in real life

Recommendation Engine and Review Analysis




Amazon uses Random Forest to enhance its recommendation engine and analyze customer reviews. Random Forest helps Amazon recommend products by classifying user behavior data and identifying patterns based on past purchases, browsing history, and similar users' behavior.

Fraud Detection and Demand Forecasting


Uber applies Random Forest in fraud detection and predicting ride demand. In fraud detection, Random Forest models classify rides as "fraudulent" or "legitimate" based on factors like payment method, location, and user history. For demand forecasting, Uber analyzes historical demand patterns, weather data, and local events to predict high-demand areas and times.

Price Optimization and Listing Quality



Airbnb uses Random Forest for dynamic pricing and assessing the quality of listings. Random Forest models consider various factors like location, seasonality, amenities, and local competition to recommend optimal prices for hosts. Additionally, it evaluates listing quality by classifying elements like photos, reviews, and descriptions, helping Airbnb suggest improvements.

