LINEAR REGRESSION
| Figure 1: Linear regression figure |
This blog post will walk you through the concept of linear regression, another machine learning model under the regression category of supervised learning. It introduces the parameters that you can tune while applying linear regression, as well as the factors that have a significant impact on its performance.
What is linear regression
Linear regression is a machine learning algorithm that can be used in predictive analysis. From predicting house prices to sales forecasting, linear regression is undoubtedly the first choice of many data scientists for their datasets. In short, linear regression involves plotting your data on a graph based on the x and y coordinates and then drawing the best fit line through the points. The best fit line is used as a reference to predict the dependent variable in the future. However, do you have the skill to conduct an excellent analysis and come up with an accurate prediction using this algorithm?
Concept of linear regression
Scenario
Imagine that today you want to predict a house price based on the information you have about the house, such as the number of rooms, the floor area, how many floors it has, etc. How can you perform this task with that info? Fear not, as linear regression is here to save the day. Despite both being supervised learning methods, linear regression is not the same as logistic regression: linear regression emphasizes predicting continuous values instead of discrete classes, and its algorithm is worlds apart from logistic regression.
The formula for linear regression
y=mx+c
This formula is commonly seen in mathematics, and the same formula applies in the linear regression algorithm. Although it looks simple, there are a few aspects to take note of while implementing linear regression on your dataset.
1) Use least squares to fit a line to the data
Reference: https://youtu.be/7ArmBVF2dCs?feature=shared
METHODS TO DRAW THE BEST FIT LINE
1) Draw a line through the data
2) Measure the vertical distance from each data point to the line (the residual)
3) Square each residual and sum them up
4) Rotate the line and repeat
5) The rotation with the lowest sum of squared residuals is used as the best fit line
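The rotate-and-measure procedure above converges to the line that least squares computes directly. A minimal sketch of that closed-form fit, using a small synthetic dataset (the numbers here are illustrative, not from the blog's dataset):

```python
import numpy as np

# Hypothetical data: x could be house area, y the price (synthetic illustration).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form least squares: minimizes the sum of squared residuals directly
# instead of rotating the line by trial and error.
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
c = y.mean() - m * x.mean()

residuals = y - (m * x + c)
print(f"slope={m:.3f}, intercept={c:.3f}, SS(fit)={np.sum(residuals**2):.3f}")
```

Any other slope or intercept would give a larger sum of squared residuals, which is exactly what step 5 above is searching for.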
2) Calculate r2
In the context of linear regression, r2 displays how much of the variation in the dependent variable can be explained by the independent variable(s) taken into account.
r2 = 0.6 = 60%
This means that knowing the value of the particular independent variable lets you explain 60% of the variation in the dependent variable
r2 = 100% means knowing the independent variable fully determines the dependent variable
r2 = 0% indicates that the independent variable does not help in predicting the dependent variable
Formula for r2
| Figure 2: Formula of r2 |
The methods for getting the values of Var(fit) and Var(mean) are shown below.
Var(fit)
1) Measure the vertical distance from the data to the best fit line.
| Figure 3: Example of data |
| Figure 4: Formula of Var(fit) |
In short, Var(fit) is obtained by squaring the vertical distance from each data point to the best fit line, summing these squared distances, and dividing by the number of data points plotted on the graph.
Var(mean)
1) Push all the data points to align along the y-axis, then draw a horizontal line at the mean of the dataset. Similar to the calculation for Var(fit), measure the distance from each data point to the mean line, square these distances, sum them up, and divide by the number of data points.
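Putting the two variances together gives r2 = (Var(mean) − Var(fit)) / Var(mean). A minimal sketch on synthetic data (the arrays are illustrative, not the blog's dataset):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

# Fit the least squares line first (slope m, intercept c).
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
c = y.mean() - m * x.mean()

n = len(y)
var_fit = np.sum((y - (m * x + c)) ** 2) / n   # variance around the best fit line
var_mean = np.sum((y - y.mean()) ** 2) / n     # variance around the mean line

r2 = (var_mean - var_fit) / var_mean
print(f"r2 = {r2:.3f}")
```

The fraction tells you what share of the variation around the mean line is removed by using the fit line instead.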
3) Calculate p value for r2
The p value for r2 is derived from a statistic that compares the explained variance per extra parameter against the unexplained variance per remaining degree of freedom. It helps to determine whether the additional parameters in a regression model significantly improve the model's ability to explain the variation in the response variable compared to a simpler model.
| Figure 7: Formula of F |
The statistic behind the p-value for r2 is called F in the context of linear regression. This is also known as the F-test in terms of hypothesis testing for this machine learning model.
The values of SS(mean) and SS(fit) were already obtained while calculating r2, but the values of p(fit), p(mean) and n remain unknown.
p(fit)
This variable represents the number of parameters on a fit line.
Example:
| Figure 8: Example of data |
The formula of the best fit line could be represented with the formula
y = slope × x + y-intercept
Parameters
1) slope
2) y-intercept
Hence, we can conclude that p(fit) =2
p(mean)
p(mean), on the other hand, refers to the number of parameters on the mean line (just the mean itself)
Hence, we can conclude that the value of p(mean) = 1
n
The variable of n refers to the total number of observations in your dataset
Relationship between r2 and F to the linear regression model
r2 represents the strength of the relationship between the variables within the data. Hence, the bigger the value of r2, the better the fit.
F determines how reliable the relationship captured by r2 is. Contrary to what you might expect from small numbers being "safe", the larger the value of F, the more the extra parameters of the fit line explain beyond chance, the smaller the resulting p-value, and the more trustworthy the model.
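The F statistic can be assembled from the pieces defined above: F = ((SS(mean) − SS(fit)) / (p(fit) − p(mean))) / (SS(fit) / (n − p(fit))). A sketch on synthetic data (illustrative values, not the blog's dataset):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
c = y.mean() - m * x.mean()

ss_fit = np.sum((y - (m * x + c)) ** 2)   # sum of squares around the fit line
ss_mean = np.sum((y - y.mean()) ** 2)     # sum of squares around the mean line

# 2 parameters on the fit line (slope, intercept) vs 1 on the mean line.
n, p_fit, p_mean = len(y), 2, 1
F = ((ss_mean - ss_fit) / (p_fit - p_mean)) / (ss_fit / (n - p_fit))
print(f"F = {F:.3f}")
```

The numerator is the variance explained per extra parameter; the denominator is the variance left unexplained per remaining degree of freedom, so a large F means the extra parameter earned its keep.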
Parameters that you can tune for linear regression
Feature selection techniques
- This group of settings is responsible for identifying and selecting a subset of features, or independent variables, for the construction of the model. The types of feature selection techniques are listed below.
- M5-prime - builds regression trees and can be used for feature selection by evaluating the importance of each feature in predicting the target variable. It emphasizes outlier detection and removal in linear regression models.
- T-Test - evaluates whether each feature is significantly related to the target variable by testing the significance of its parameter in a linear regression model.
- Iterative T-Test - selects the most significant features, removing the least significant ones in each iteration until a specified condition is met.
- Greedy - iteratively adds or removes features to discover the subset of features that provides the best results.
- Min tolerance - sets a minimum threshold for the convergence criteria during model optimization so that the algorithm stops iterating when the improvement in the objective function falls below this tolerance level.
Ridge
- Penalizes large coefficients in the regression model in order to control the complexity of the model and boost the algorithm's generalization performance
Eliminate colinear features
- Automatically identifies and removes independent variables that are highly correlated with each other.
Fit Intercept
- Whether to calculate the intercept for the model.
Copy X
- If True, X will be copied; otherwise, it may be overwritten.
n_jobs
- The number of jobs to use for the computation.
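Several of the parameters above map directly onto scikit-learn's estimators: `fit_intercept`, `copy_X`, and `n_jobs` belong to `LinearRegression`, while the ridge penalty is exposed as `alpha` on `Ridge`. A minimal sketch on toy data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # toy feature matrix
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

# Plain least squares with the parameters discussed above (these are the
# scikit-learn defaults, spelled out for illustration).
lr = LinearRegression(fit_intercept=True, copy_X=True, n_jobs=None).fit(X, y)

# Ridge adds an L2 penalty (alpha) on the coefficients to control complexity;
# note how the penalized slope shrinks toward zero.
ridge = Ridge(alpha=1.0).fit(X, y)

print("OLS:  ", lr.coef_, lr.intercept_)
print("Ridge:", ridge.coef_, ridge.intercept_)
```

Increasing `alpha` shrinks the coefficients further, trading a little bias for lower variance on noisy data.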
Implementation of linear regression in python
Importing the dataset
| Figure 10: Code to import the dataset |
| Figure 11: Overview of the dataset |
Determine the X and Y attribute
| Figure 12: Determining the X and Y value |
| Figure 14: Plotting the X and Y values on the graph |
Splitting the dataset for training and testing
| Figure 15: Code to split the dataset |
| Figure 16: Reshaping the dataset |
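The splitting and reshaping shown in Figures 15 and 16 might look like the following sketch, assuming a 1-D feature (the arrays here are stand-ins for the blog's dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical 1-D feature and target.
X = np.arange(10, dtype=float)
y = 2.0 * X + 1.0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# scikit-learn expects a 2-D feature matrix, hence the reshape:
# (-1, 1) means "as many rows as needed, one column".
X_train = X_train.reshape(-1, 1)
X_test = X_test.reshape(-1, 1)
print(X_train.shape, X_test.shape)
```

Fixing `random_state` makes the split reproducible, which matters when comparing models.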
Apply model
Discover the intercept and coefficient of the linear regression model
| Figure 18: Code to get the intercept and coefficient |
| Figure 19: Intercept and coefficient |
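Reading off the intercept and coefficient (as in Figures 18 and 19) comes down to the fitted model's `intercept_` and `coef_` attributes. A sketch with noise-free synthetic data so the recovered line is exact:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data generated from y = 3x + 7.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 3.0 * X.ravel() + 7.0

model = LinearRegression().fit(X, y)
print("intercept:", model.intercept_)
print("coefficient:", model.coef_)
```

With real, noisy data the recovered slope and intercept would only approximate the underlying relationship.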
Test the model
| Figure 20: Predict the training and testing model |
Visualize the result
| Figure 21: Visualizing the result |
| Figure 22: Drawing the linear regression line on the graph |
Determine the mean squared error
| Figure 23: Code to get the mean squared error |
| Figure 24: Mean squared error |
Obtaining the r2
| Figure 25: Code to get the r2 score |
| Figure 26: r2 score |
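The evaluation steps above (predicting, computing the mean squared error, and obtaining r2) can be sketched end-to-end on synthetic data, using scikit-learn's `mean_squared_error` and `r2_score`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # toy data, not the blog's dataset
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

mse = mean_squared_error(y, y_pred)   # average squared residual
r2 = r2_score(y, y_pred)              # share of variance explained
print(f"MSE = {mse:.3f}, r2 = {r2:.3f}")
```

In practice you would compute these metrics on the held-out test split, not on the data used for fitting, to get an honest estimate of performance.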
Advantages and disadvantages of linear regression
Advantages
- Linear regression is computationally efficient, which makes it suitable for analysis on large datasets.
- If the relationship between the independent and dependent variables is linear, linear regression provides a decent outcome.
- If the assumptions of linear regression are satisfied, this machine learning algorithm can be a robust model for prediction.
Disadvantages
- For real-life data, the relationship between the independent and dependent variables may not be linear, so whenever the linearity assumption is violated, errors are introduced.
- When many predictor variables are used during training, the model may become overly complex, fitting the training data well but generalizing poorly to new data.
- When the independent variables are highly correlated with each other, it may lead to difficulties with the stability and interpretation of the coefficients.
Implementation of linear regression in real life
Predicting House Prices
Zillow applies linear regression models to predict home prices based on various features such as location, number of bedrooms, and square footage, to name but a few. By understanding how each feature contributes to the overall price, Zillow is able to provide accurate, real-time estimates of home values.
Sales Forecasting
Linear regression is also used at Walmart, where the company applies it to sales forecasting across its stores. By using historical data together with factors such as promotions, holidays, and other economic indicators, Walmart can project upcoming sales and avoid wasting funds on excess stock.
Risk Management in Finance
| Figure 29: Risk Management |
Linear regression is highly relevant to risk management in the domain of finance. JPMorgan Chase applies such models to estimate the possible losses of investment portfolios, based on the correlation of asset prices with factors affecting the markets. This is useful for decision making in investments and for avoiding or minimizing risks.

