

LINEAR REGRESSION


Figure 1: Linear regression figure

This blogpost will walk you through the concept of linear regression, a machine learning model under the regression category of supervised learning. It introduces the parameters that you can tune while applying linear regression, as well as the factors that have a significant impact on the model's performance.

What is linear regression

Linear regression is a machine learning algorithm used in predictive analysis. From predicting house prices to sales forecasting, linear regression is undoubtedly the first choice for many data scientists. In short, linear regression involves plotting your data on a graph based on its x and y coordinates and then drawing the best fit line through the points. The best fit line is used as a reference to predict the dependent variable for new inputs. But do you have the skill to conduct an excellent analysis and arrive at an accurate prediction using this algorithm?

Concept of linear regression

Scenario

Imagine you want to predict a house's price based on the information you have about the house, such as the number of rooms, the floor area, and the number of floors. How can you perform this task with that information? Fear not, as linear regression is here to save the day. Despite both being supervised learning models, linear regression is not the same as logistic regression: linear regression focuses on predicting continuous values instead of discrete ones, and its algorithm is worlds apart from logistic regression's.

The formula for linear regression
y = mx + c

This formula is commonly seen in mathematics, and the same formula applies in the linear regression algorithm. Although it looks simple, there are a few aspects to take note of while implementing linear regression on your dataset.

1) Use least squares to fit a line to the data



METHODS TO DRAW THE BEST FIT LINE

1) Draw a line through the data
2) Measure the vertical distance from the line to each data point (the residual)
3) Square each residual and sum them up
4) Rotate the line and repeat
5) The rotation with the lowest sum of squared residuals is used as the best fit line
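The rotate-and-measure procedure above can be sketched in Python. This is a minimal illustration on made-up data points: a brute-force search over candidate slopes (each line passing through the mean point) picks the slope with the lowest sum of squared residuals, and the closed-form least squares fit from `np.polyfit` should land on essentially the same line.

```python
import numpy as np

# Hypothetical data points; any small dataset works for this sketch.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

best_slope, best_ssr = None, float("inf")
# "Rotate and repeat": try many candidate slopes for a line through the mean point.
for slope in np.linspace(-5, 5, 2001):
    intercept = y.mean() - slope * x.mean()   # keep the line through the mean point
    residuals = y - (slope * x + intercept)   # distance from the line to each point
    ssr = np.sum(residuals ** 2)              # sum of squared residuals
    if ssr < best_ssr:
        best_slope, best_ssr = slope, ssr

# The closed-form least squares solution should agree with the search.
m, c = np.polyfit(x, y, 1)
print(best_slope, m)
```

The loop is only for intuition; in practice the best fit line is computed directly with the closed-form least squares solution.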


2) Calculate  r2

In the context of linear regression, r2 shows how much of the variation in the dependent variable can be explained by the independent variable(s) taken into account.


r2 = 0.6 = 60%

This means that knowing the value of the particular independent variable explains 60% of the variation in the dependent variable.

r2 = 100% means knowing the independent variable lets you fully determine the dependent variable
r2 = 0% indicates that the independent variable does not help in predicting the dependent variable
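A minimal sketch of computing r2 from the two variances, using the standard relation r2 = (Var(mean) - Var(fit)) / Var(mean). The data points are hypothetical; the slope and intercept come from an ordinary least squares fit.

```python
import numpy as np

# Hypothetical data; slope/intercept would come from the least squares fit.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])
slope, intercept = np.polyfit(x, y, 1)

var_mean = np.mean((y - y.mean()) ** 2)                 # spread around the mean line
var_fit  = np.mean((y - (slope * x + intercept)) ** 2)  # spread around the best fit line

r2 = (var_mean - var_fit) / var_mean
print(round(r2, 3))  # → 0.987
```

An r2 this close to 1 simply reflects that the made-up points were chosen to lie nearly on a line.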

Formula for r2

r2 = (Var(mean) - Var(fit)) / Var(mean)

Figure 2: Formula of r2

The methods for getting the values of Var(fit) and Var(mean) are shown below.

Var(fit)

1) Measure the vertical distance from the data to the best fit line.

Figure 3: Example of data

Figure 4: Formula of Var(fit)

In short, Var(fit) is the variance around the best fit line: square each data point's vertical distance to the best fit line, sum the squares, and divide by the number of data points plotted on the graph.

Var(mean)

1) Project all the data onto the y-axis, then draw a horizontal line at the mean of the dataset. Similar to the calculation of Var(fit), measure the distance from each data point to the mean line.
Figure 5: Example of data

Figure 6: Formula of Var(mean)

In short, Var(mean) is the variance around the mean line: square each data point's vertical distance to the mean line, sum the squares, and divide by the number of data points plotted on the graph.

3) Calculate p value for r2

The p-value for r2 tells you how likely it is that an r2 this large could have arisen by chance. It helps determine whether the predictors in a regression model significantly improve the model's ability to explain the variation in the response variable compared to a simpler model (the mean line).

Figure 7: Formula of F

The p-value for r2 is computed from a statistic called F in the context of linear regression. This is also known as the F-test in terms of hypothesis testing for this machine learning model.

The values of SS(mean) and SS(fit) were already obtained while calculating r2, but the values of p(fit), p(mean) and n remain unknown.

p(fit)

This variable represents the number of parameters on a fit line.

Example: 


Figure 8: Example of data
                                           
The best fit line can be represented with the formula

y = slope × x + y-intercept

Parameters

1) slope
2) y-intercept

Hence, we can conclude that p(fit) =2

p(mean)

p(mean), on the other hand, refers to the number of parameters on the mean line.

Figure 9: Example of data

The formula for the mean line is

y = y-intercept

Parameters

1) y-intercept

Hence, we can conclude that the value of p(mean) = 1

n

The variable n refers to the total number of observations in your dataset.
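Putting the pieces together, a sketch of the F statistic using the standard formula F = ((SS(mean) - SS(fit)) / (p(fit) - p(mean))) / (SS(fit) / (n - p(fit))). The data points are the same hypothetical ones used earlier; in a real analysis the p-value would then be read off the F distribution with (p(fit) - p(mean), n - p(fit)) degrees of freedom.

```python
import numpy as np

# Hypothetical data reused from the r2 example above.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])
slope, intercept = np.polyfit(x, y, 1)

ss_mean = np.sum((y - y.mean()) ** 2)                 # SS(mean): squared distances to the mean line
ss_fit  = np.sum((y - (slope * x + intercept)) ** 2)  # SS(fit): squared residuals around the fit

p_fit, p_mean, n = 2, 1, len(y)   # slope + intercept; intercept only; observations
F = ((ss_mean - ss_fit) / (p_fit - p_mean)) / (ss_fit / (n - p_fit))
print(round(F, 1))
```

A large F like this one means the fit line explains far more variation than the mean line, which translates into a small p-value.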


Relationship between r2 and F to the linear regression model

r2 measures the strength of the relationship between the variables in the data. Hence, the bigger the value of r2, the better the model fits.

F determines how reliable the relationship captured by r2 is. Like r2, bigger is better here: the larger the value of F, the smaller the p-value, and the more confident you can be that the relationship is real rather than due to chance.


Parameters that you can tune for linear regression

Feature selection techniques 

- This parameter is responsible for identifying and selecting a subset of features (independent variables) for the construction of the model. The types of feature selection techniques are listed below.

  • M5-prime - builds regression trees and can be used for feature selection by evaluating the importance of each feature in predicting the target variable. It can also assist with outlier detection and removal in linear regression models.
  • T-Test - evaluates whether each feature is significantly related to the target variable by testing the significance of the parameters in a linear regression model.
  • Iterative T-Test - selects the most significant features, removing the least significant one in each iteration until a specified condition is met.
  • Greedy - iteratively adds or removes features to discover the subset of features that provides the best results.
  • Min tolerance - sets a minimum threshold for the convergence criteria during model optimization, so that the algorithm stops iterating when the improvement in the objective function falls below this tolerance level.

Ridge 

- Penalizes large coefficients in the regression model in order to control the model's complexity and improve the algorithm's generalization performance.
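A minimal sketch of how the ridge penalty shrinks coefficients, using the closed-form solution w = (XᵀX + αI)⁻¹Xᵀy on synthetic data (the data and alpha value are made up for illustration; in practice you would use scikit-learn's Ridge estimator).

```python
import numpy as np

# Synthetic regression problem: 50 samples, 3 features, known true coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=50)

def ridge_fit(X, y, alpha):
    # Closed-form ridge solution: (X^T X + alpha*I)^-1 X^T y
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

w_ols   = ridge_fit(X, y, alpha=0.0)    # plain least squares (no penalty)
w_ridge = ridge_fit(X, y, alpha=10.0)   # penalized: coefficients shrink toward zero
print(np.linalg.norm(w_ols), np.linalg.norm(w_ridge))
```

The penalized coefficient vector always has a smaller norm than the unpenalized one, which is exactly how ridge controls model complexity.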

Eliminate colinear features

- Automatically identifies and removes independent variables that are highly correlated with each other.

Fit Intercept

- Whether to calculate the intercept for the model. 

Copy X

- If True, X will be copied; otherwise, it may be overwritten.

n_jobs

- The number of jobs to use for the computation. 

Implementation of linear regression in python

Importing the dataset


Figure 10: Code to import the dataset


Figure 11: Overview of the dataset
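The original screenshots are not reproduced here, so the sketch below shows a typical way to load and inspect a dataset with pandas. The file name "houses.csv" and the column names are hypothetical stand-ins; a tiny inline DataFrame keeps the example self-contained.

```python
import pandas as pd

# In practice you would load your own file, e.g.:
#   df = pd.read_csv("houses.csv")   # hypothetical file name
# For a self-contained sketch, build a small stand-in DataFrame instead.
df = pd.DataFrame({
    "area":  [650, 800, 1200, 1500, 1800, 2100],   # hypothetical feature
    "price": [70, 95, 130, 160, 200, 230],         # hypothetical target (thousands)
})
print(df.head())    # quick overview of the dataset
print(df.shape)     # (rows, columns)
```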

Determine the X and Y attribute


Figure 12: Determining the X and Y value


Figure 14: Plotting the X and Y values on the graph

Splitting the dataset for training and testing


Figure 15: Code to split the dataset

Figure 16: Reshaping the dataset
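A sketch of the splitting and reshaping step, assuming a single-feature setup (predicting price from area, both hypothetical). scikit-learn expects X as a 2-D array, which is why the reshape is needed for a single feature.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical single-feature data: predict price from area.
area  = np.array([650, 800, 1200, 1500, 1800, 2100, 2400, 2700])
price = np.array([70, 95, 130, 160, 200, 230, 255, 290])

X = area.reshape(-1, 1)   # reshape 1-D feature into a (n_samples, 1) array
y = price

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
print(X_train.shape, X_test.shape)  # → (6, 1) (2, 1)
```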

Apply model

Figure 17: Code to apply Linear Regression model


Discover the intercept and coefficient of the linear regression model

Figure 18: Code to get the intercept and coefficient

Figure 19: Intercept and coefficient

Test the model

Figure 20: Predict the training and testing model
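A sketch of fitting the model, reading off its intercept and coefficient, and predicting on unseen inputs, using scikit-learn's LinearRegression. The training and test values are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data (area -> price) and two unseen test areas.
X_train = np.array([[650], [800], [1200], [1500], [1800], [2100]])
y_train = np.array([70, 95, 130, 160, 200, 230])
X_test  = np.array([[1000], [2500]])

model = LinearRegression()
model.fit(X_train, y_train)

print(model.intercept_, model.coef_)   # the learned c and m of y = mx + c
y_pred = model.predict(X_test)
print(y_pred)                          # predicted prices for the test areas
```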

Visualize the result

Figure 21: Visualizing the result


Figure 22: Drawing the linear regression line on the graph

Determine the mean squared error

Figure 23: Code to get the mean squared error

Figure 24: Mean squared error

Obtaining the r2

Figure 25: Code to get the r2 score

Figure 26: r2 score
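The two evaluation steps above can be sketched together with scikit-learn's metrics, again on the hypothetical area/price data. MSE is the average squared residual, and the r2 score is the fraction of variance explained by the model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical data; in the blog these would be the test-set predictions.
X = np.array([[650], [800], [1200], [1500], [1800], [2100]])
y = np.array([70, 95, 130, 160, 200, 230])

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

mse = mean_squared_error(y, y_pred)   # average squared residual
r2  = r2_score(y, y_pred)             # fraction of variance explained
print(mse, r2)
```

For properly measuring generalization, compute both metrics on the held-out test set rather than on the data used for fitting.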

Advantages and disadvantages of linear regression

Advantages 

  1. Linear regression is computationally efficient which makes it suitable to be applied to perform analysis on large datasets.
  2. If the relationship between the independent variable and dependent variable is linear, linear regression provides a decent outcome.
  3. If the assumptions of linear regression are satisfied, this machine learning algorithm can be a robust model for prediction.

Disadvantages 

  1. For real-life data, the relationship between the independent and dependent variables may not be linear, so whenever linearity is wrongly assumed, the predictions will contain errors.
  2. When many predictor variables are used in training, the model can become overly complex, fitting the training data well but performing poorly on validation data (overfitting).
  3. When the independent variables are highly correlated (multicollinearity), it can cause difficulties with both the stability and the interpretation of the coefficients.

Implementation of linear regression in real life

Predicting House Prices

Figure 27: House price prediction


Zillow applies linear regression models to predict home prices based on various features such as location, number of bedrooms, and square footage, to name but a few. By understanding how each feature contributes to the overall price, Zillow is able to provide accurate, real-time estimates of home values.

Sales Forecasting

Figure 28: Sales forecasting

Linear regression is also used at Walmart, where the company applies it for sales forecasting across its stores. By using historical data together with factors such as promotions, holidays, and other economic indicators, Walmart can project upcoming sales and avoid wasting funds on excess stock.

Risk Management in Finance

Figure 29: Risk Management

Linear regression is highly relevant to risk management in finance. Such models are applied by JPMorgan Chase to estimate the possible losses of investment portfolios, based on how asset prices correlate with the factors affecting the markets. This is useful for making investment decisions and for avoiding or minimizing risks.
