LINEAR REGRESSION
| Figure 1: Linear regression figure |
This blog post will walk you through the concept of linear regression, another machine learning model under the regression category of supervised learning. It introduces the parameters that you can tune while applying linear regression, as well as the factors that have a significant impact on its performance.
What is linear regression
Linear regression is a machine learning algorithm that can be used in predictive analysis. From predicting house prices to sales forecasting, linear regression is undoubtedly the first choice of many data scientists for their datasets. In short, linear regression involves plotting your data on a graph based on the x and y coordinates and then drawing the best fit line through the points. The best fit line is used as a reference to predict the dependent variable in the future. However, do you have the skill to conduct an excellent analysis and come up with an accurate prediction using this algorithm?
Concept of linear regression
Scenario
Imagine that today you want to predict a house price based on the information you have about the house, such as the number of rooms, the floor area, how many floors it has, etc. How can you perform this task with that info? Fear not, as linear regression is here to save the day. Despite both being supervised learning methods, linear regression is not the same as logistic regression: linear regression emphasizes predicting continuous values instead of discrete classes, and its algorithm is worlds apart from logistic regression.
The formula for linear regression
y=mx+c
This formula is commonly seen in mathematics, and the same formula applies in the linear regression algorithm. Although it looks simple, there are a few aspects to take note of while implementing linear regression on your dataset.
1) Use least squares to fit a line to the data
Reference: https://youtu.be/7ArmBVF2dCs?feature=shared
METHODS TO DRAW THE BEST FIT LINE
1) Draw a line through the data
2) Measure the vertical distance from each data point to the line (the residual)
3) Square each residual and sum them up
4) Rotate the line and repeat
5) The rotation with the lowest sum of squared residuals is used as the best fit line
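The rotate-and-measure procedure above converges to the line that least squares computes directly. A minimal sketch of that closed-form fit, using a small synthetic dataset (the numbers here are illustrative, not from the blog's dataset):

```python
import numpy as np

# Hypothetical data: x could be house area, y the price (synthetic illustration).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form least squares: minimizes the sum of squared residuals directly
# instead of rotating the line by trial and error.
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
c = y.mean() - m * x.mean()

residuals = y - (m * x + c)
print(f"slope={m:.3f}, intercept={c:.3f}, SS(fit)={np.sum(residuals**2):.3f}")
```

Any other slope or intercept would give a larger sum of squared residuals, which is exactly what step 5 above is searching for.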
2) Calculate r2
In the context of linear regression, r2 displays how much of the variation in the dependent variable can be explained by the independent variable(s) taken into account.
r2 = 0.6 = 60%
This means that knowing the value of the particular independent variable lets you explain 60% of the variation in the dependent variable
r2 = 100% means knowing the independent variable fully determines the dependent variable
r2 = 0% indicates that the independent variable does not help in predicting the dependent variable
Formula for r2
| Figure 2: Formula of r2 |
The methods for getting the values of Var(fit) and Var(mean) are shown below.
Var(fit)
1) Measure the vertical distance from the data to the best fit line.
| Figure 3: Example of data |
| Figure 4: Formula of Var(fit) |
In short, Var(fit) is obtained by squaring the vertical distance from each data point to the best fit line, summing these squared distances, and dividing by the number of data points plotted on the graph.
Var(mean)
1) Push all the data points to align along the y-axis, then draw a horizontal line at the mean of the dataset. Similar to the calculation for Var(fit), measure the distance from each data point to the mean line, square these distances, sum them up, and divide by the number of data points.
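Putting the two variances together gives r2 = (Var(mean) − Var(fit)) / Var(mean). A minimal sketch on synthetic data (the arrays are illustrative, not the blog's dataset):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

# Fit the least squares line first (slope m, intercept c).
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
c = y.mean() - m * x.mean()

n = len(y)
var_fit = np.sum((y - (m * x + c)) ** 2) / n   # variance around the best fit line
var_mean = np.sum((y - y.mean()) ** 2) / n     # variance around the mean line

r2 = (var_mean - var_fit) / var_mean
print(f"r2 = {r2:.3f}")
```

The fraction tells you what share of the variation around the mean line is removed by using the fit line instead.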
3) Calculate p value for r2
The p value for r2 is derived from a statistic that compares the explained variance per extra parameter against the unexplained variance per remaining degree of freedom. It helps to determine whether the additional parameters in a regression model significantly improve the model's ability to explain the variation in the response variable compared to a simpler model.
| Figure 7: Formula of F |
The statistic behind the p-value for r2 is called F in the context of linear regression. This is also known as the F-test in terms of hypothesis testing for this machine learning model.
The values of SS(mean) and SS(fit) were already obtained while calculating r2, but the values of p(fit), p(mean) and n remain unknown.
p(fit)
This variable represents the number of parameters on a fit line.
Example:
| Figure 8: Example of data |
The formula of the best fit line could be represented with the formula
y = slope × x + y-intercept
Parameters
1) slope
2) y-intercept
Hence, we can conclude that p(fit) =2
p(mean)
p(mean), on the other hand, refers to the number of parameters on the mean line (just the mean itself)
Hence, we can conclude that the value of p(mean) = 1
n
The variable of n refers to the total number of observations in your dataset
Relationship between r2 and F to the linear regression model
r2 represents the strength of the relationship between the variables within the data. Hence, the bigger the value of r2, the better the fit.
F determines how reliable the relationship captured by r2 is. Contrary to what you might expect from small numbers being "safe", the larger the value of F, the more the extra parameters of the fit line explain beyond chance, the smaller the resulting p-value, and the more trustworthy the model.
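The F statistic can be assembled from the pieces defined above: F = ((SS(mean) − SS(fit)) / (p(fit) − p(mean))) / (SS(fit) / (n − p(fit))). A sketch on synthetic data (illustrative values, not the blog's dataset):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
c = y.mean() - m * x.mean()

ss_fit = np.sum((y - (m * x + c)) ** 2)   # sum of squares around the fit line
ss_mean = np.sum((y - y.mean()) ** 2)     # sum of squares around the mean line

# 2 parameters on the fit line (slope, intercept) vs 1 on the mean line.
n, p_fit, p_mean = len(y), 2, 1
F = ((ss_mean - ss_fit) / (p_fit - p_mean)) / (ss_fit / (n - p_fit))
print(f"F = {F:.3f}")
```

The numerator is the variance explained per extra parameter; the denominator is the variance left unexplained per remaining degree of freedom, so a large F means the extra parameter earned its keep.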
Parameters that you can tune for linear regression
Feature selection techniques
- This group of settings is responsible for identifying and selecting a subset of features, or independent variables, for the construction of the model. The types of feature selection techniques are listed below.
- M5-prime - builds regression trees and can be used for feature selection by evaluating the importance of each feature in predicting the target variable. It emphasizes outlier detection and removal in linear regression models.
- T-Test - evaluates whether each feature is significantly related to the target variable by testing the significance of its parameter in a linear regression model.
- Iterative T-Test - selects the most significant features, removing the least significant ones in each iteration until a specified condition is met.
- Greedy - iteratively adds or removes features to discover the subset of features that provides the best results.
- Min tolerance - sets a minimum threshold for the convergence criteria during model optimization so that the algorithm stops iterating when the improvement in the objective function falls below this tolerance level.
Ridge
- Penalizes large coefficients in the regression model in order to control the complexity of the model and boost the algorithm's generalization performance
Eliminate colinear features
- Automatically identifies and removes independent variables that are highly correlated with each other.
Fit Intercept
- Whether to calculate the intercept for the model.
Copy X
- If True, X will be copied; otherwise, it may be overwritten.
n_jobs
- The number of jobs to use for the computation.
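Several of the parameters above map directly onto scikit-learn's estimators: `fit_intercept`, `copy_X`, and `n_jobs` belong to `LinearRegression`, while the ridge penalty is exposed as `alpha` on `Ridge`. A minimal sketch on toy data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # toy feature matrix
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

# Plain least squares with the parameters discussed above (these are the
# scikit-learn defaults, spelled out for illustration).
lr = LinearRegression(fit_intercept=True, copy_X=True, n_jobs=None).fit(X, y)

# Ridge adds an L2 penalty (alpha) on the coefficients to control complexity;
# note how the penalized slope shrinks toward zero.
ridge = Ridge(alpha=1.0).fit(X, y)

print("OLS:  ", lr.coef_, lr.intercept_)
print("Ridge:", ridge.coef_, ridge.intercept_)
```

Increasing `alpha` shrinks the coefficients further, trading a little bias for lower variance on noisy data.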
Implementation of linear regression in python
Importing the dataset
| Figure 10: Code to import the dataset |
| Figure 11: Overview of the dataset |
Determine the X and Y attribute
| Figure 12: Determining the X and Y value |
| Figure 14: Plotting the X and Y values on the graph |
Splitting the dataset for training and testing
| Figure 15: Code to split the dataset |
| Figure 16: Reshaping the dataset |
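The splitting and reshaping shown in Figures 15 and 16 might look like the following sketch, assuming a 1-D feature (the arrays here are stand-ins for the blog's dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical 1-D feature and target.
X = np.arange(10, dtype=float)
y = 2.0 * X + 1.0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# scikit-learn expects a 2-D feature matrix, hence the reshape:
# (-1, 1) means "as many rows as needed, one column".
X_train = X_train.reshape(-1, 1)
X_test = X_test.reshape(-1, 1)
print(X_train.shape, X_test.shape)
```

Fixing `random_state` makes the split reproducible, which matters when comparing models.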
Apply model
Discover the intercept and coefficient of the linear regression model
| Figure 18: Code to get the intercept and coefficient |
| Figure 19: Intercept and coefficient |
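Reading off the intercept and coefficient (as in Figures 18 and 19) comes down to the fitted model's `intercept_` and `coef_` attributes. A sketch with noise-free synthetic data so the recovered line is exact:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data generated from y = 3x + 7.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 3.0 * X.ravel() + 7.0

model = LinearRegression().fit(X, y)
print("intercept:", model.intercept_)
print("coefficient:", model.coef_)
```

With real, noisy data the recovered slope and intercept would only approximate the underlying relationship.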
Test the model
| Figure 20: Predict the training and testing model |
Visualize the result
| Figure 21: Visualizing the result |
| Figure 22: Drawing the linear regression line on the graph |
Determine the mean squared error
| Figure 23: Code to get the mean squared error |
| Figure 24: Mean squared error |
Obtaining the r2
| Figure 25: Code to get the r2 score |
| Figure 26: r2 score |
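The evaluation steps above (predicting, computing the mean squared error, and obtaining r2) can be sketched end-to-end on synthetic data, using scikit-learn's `mean_squared_error` and `r2_score`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # toy data, not the blog's dataset
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

mse = mean_squared_error(y, y_pred)   # average squared residual
r2 = r2_score(y, y_pred)              # share of variance explained
print(f"MSE = {mse:.3f}, r2 = {r2:.3f}")
```

In practice you would compute these metrics on the held-out test split, not on the data used for fitting, to get an honest estimate of performance.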
Advantages and disadvantages of linear regression
Advantages
- Linear regression is computationally efficient, which makes it suitable for analysis on large datasets.
- If the relationship between the independent and dependent variables is linear, linear regression provides a decent outcome.
- If the assumptions of linear regression are satisfied, this machine learning algorithm can be a robust model for prediction.
Disadvantages
- For real-life data, the relationship between the independent and dependent variables may not be linear, so whenever the linearity assumption is violated, errors are introduced.
- When many predictor variables are used during training, the model may become overly complex, fitting the training data well but generalizing poorly to new data.
- When the independent variables are highly correlated with each other, it may lead to difficulties with the stability and interpretation of the coefficients.
Implementation of linear regression in real life
Predicting House Prices
Zillow applies linear regression models to predict home prices based on various features such as location, number of bedrooms, and square footage, to name but a few. By understanding how each feature contributes to the overall price, Zillow is able to provide accurate, real-time estimates of home values.
Sales Forecasting
Linear regression is also used at Walmart, where the company applies it to sales forecasting across its stores. By using historical data together with factors such as promotions, holidays, and other economic indicators, Walmart can project upcoming sales and avoid wasting funds on excess stock.
Risk Management in Finance
| Figure 29: Risk Management |
Linear regression is highly relevant to risk management in the domain of finance. JPMorgan Chase applies such models to estimate the possible losses of investment portfolios, based on the correlation of asset prices with factors affecting the markets. This is useful for decision making in investments and for avoiding or minimizing risks.

