
PRINCIPAL COMPONENT ANALYSIS (PCA)

Figure 1: PCA

This blog post introduces the concept of principal component analysis (PCA), one of the most commonly used techniques for dimensionality reduction. You will learn how to implement this machine learning model in Python, its advantages and disadvantages, as well as how companies benefit from it.

What is PCA

PCA is a statistical dimensionality-reduction technique. It takes a large set of variables and transforms them into a smaller set, retaining most of the information in the original set. It does this by identifying the directions along which the data varies the most. These principal components are orthogonal to one another and capture the maximum possible variance within the data, making PCA a powerful tool for simplifying datasets without losing their essential patterns and relationships.

Concept of PCA

The key idea behind PCA is to reduce the complexity of high-dimensional data while preserving as much variability as possible. Concretely, it computes the eigenvalues and eigenvectors of the covariance matrix of the data; the eigenvectors give the directions along which the data has the highest variance. By projecting the original data onto these principal components, PCA produces a new, reduced set of variables that retains the principal structure and features of the data for more efficient analysis and visualization.
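
To make this concrete, here is a minimal from-scratch sketch of the covariance/eigenvector computation. It is illustrative only (the data and variable names are made up), not the blog's own code:

import numpy as np

# Toy data: 6 samples, 3 variables (made-up values)
X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.1],
              [2.2, 2.9, 0.4],
              [1.9, 2.2, 0.6],
              [3.1, 3.0, 0.2],
              [2.3, 2.7, 0.5]])

# 1. Centre the data so its middle point sits at the origin
X_centered = X - X.mean(axis=0)

# 2. Covariance matrix of the variables
cov = np.cov(X_centered, rowvar=False)

# 3. Eigendecomposition: eigenvectors give the principal directions,
#    eigenvalues give the variance captured along each direction
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. Sort by descending eigenvalue and keep the top 2 components
order = np.argsort(eigenvalues)[::-1]
components = eigenvectors[:, order[:2]]

# 5. Project the data onto the principal components
X_reduced = X_centered @ components
print(X_reduced.shape)  # (6, 2): three variables reduced to two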

Scenario

Imagine that you are a geneticist trying to cluster mice based on their genes.

Figure 2: Example of data

If we only have 2 types of genes, we can plot the data on a 2-dimensional graph and cluster it based on the location of the plotted points.

Figure 3: Example of data

If we have 3 types of genes, we can plot the data on a 3-dimensional graph and cluster it based on the location of the plotted points.

Figure 4: Example of data

If we have 4 or more types of genes, there is no way for us to plot the graph manually. How, then, do we cluster the data?
Figure 5: PCA plotting

By implementing Principal Component Analysis (PCA), we can plot the data based on the values of its principal components, without worrying about the number of dimensions, because PCA reduces the dimensionality of the data.

1) Principal component

A principal component is simply a linear combination of the original variables, computed to capture the maximum possible variance. But how do we obtain the principal components of a graph?


Figure 6: Finding middle point

First and foremost, find the middle point of the data: the average of the x values and the average of the y values.

Figure 7: Middle point in origin

Shift the data so that this middle point becomes the origin of the graph.
Figure 10: Measure distance

Draw a line through the origin and keep rotating it until you obtain the best-fit line. While rotating the line, you will notice a trade-off: as the distance from a projected point to the origin increases, the distance from the original point to its projection (d1) decreases.

For each data point, project it onto the line and measure the distance from the projected point to the origin. Squaring these distances and adding them all up gives the sum of squared distances (SS).
Figure 11: Calculate SS value

Once the best-fit line is found, it is the line with the largest sum of squared distances, and it is called Principal Component 1 (PC1).

Figure 12: Drawing the best fit line

PC2 is perpendicular to PC1 and also passes through the origin.
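
This rotate-and-measure procedure can be checked numerically. The following sketch (illustrative code, not from the blog) sweeps a direction through 180 degrees, keeps the one with the largest sum of squared projected distances, and confirms that it matches the covariance eigenvector:

import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D toy data, centred so the middle point is the origin
x = rng.normal(size=200)
data = np.column_stack([x, 0.25 * x + rng.normal(scale=0.3, size=200)])
data -= data.mean(axis=0)

best_angle, best_ss = 0.0, -np.inf
for angle in np.linspace(0.0, np.pi, 1800):
    direction = np.array([np.cos(angle), np.sin(angle)])
    projections = data @ direction   # distance of each projected point from the origin
    ss = np.sum(projections ** 2)    # sum of squared distances (SS)
    if ss > best_ss:
        best_angle, best_ss = angle, ss

print("best-fit slope:", np.tan(best_angle))

# Cross-check: the eigenvector with the largest eigenvalue points the same way
eigvals, eigvecs = np.linalg.eigh(np.cov(data, rowvar=False))
pc1 = eigvecs[:, -1]
print("eigenvector slope:", pc1[1] / pc1[0])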


2) Eigenvectors

Figure 13: Best fit line

Assume that the slope of the best-fit line is 0.25.
This means that, in ratio terms, the PC is made up of 4 units along the x axis for every 1 unit along the y axis, so the x axis is more important.

From here, we can use the Pythagorean theorem to get the length of this vector: √(4² + 1²) = √17 ≈ 4.12.

Figure 14: Getting the hypotenuse
After obtaining the result, scale the vector so that the hypotenuse has a length of 1:

4.12 / 4.12 = 1
4 / 4.12 ≈ 0.97 (x component)
1 / 4.12 ≈ 0.24 (y component)

This unit vector, consisting of 0.97 units of x and 0.24 units of y, is known as the eigenvector of PC1.
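
A quick check of this normalization in code (illustrative, not from the blog):

import numpy as np

v = np.array([4.0, 1.0])             # 4 units of x, 1 unit of y (slope 0.25)
eigenvector = v / np.linalg.norm(v)  # divide by the hypotenuse, sqrt(17) ≈ 4.12
print(eigenvector)                   # ≈ [0.97, 0.24]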

3) Eigenvalues

On the other hand, the eigenvalue of a principal component can be obtained through this formula:

Eigenvalue = SS / (n − 1)

That is, divide the sum of squared distances (SS) by the number of data points minus 1.

Eigenvalues also represent the amount of variation in the data captured by each principal component.

Assuming the eigenvalue for PC1 is 15 and the eigenvalue for PC2 is 3, the percentages of the total variance represented by PC1 and PC2 are:

PC1
= 15/18
≈ 83%

PC2
= 3/18
≈ 17%
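
scikit-learn exposes both of these quantities directly. A quick sketch on toy data (the numbers will differ from the 15/3 example above):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2)) @ np.array([[3.0, 0.5], [0.5, 1.0]])

pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_)        # the eigenvalues, i.e. SS / (n - 1)
print(pca.explained_variance_ratio_)  # each eigenvalue divided by the total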

Parameters that you can tune for PCA

Variance threshold
- The minimum amount of variance that a principal component should capture in order to be retained.

Dimensionality reduction
- Options include:
  1. Keep variance: retains enough components to capture a specified amount of variance, as set by the variance threshold above.
  2. Fixed number: retains a fixed number of principal components.
Number of Components (n_components):
- The number of principal components to retain.
- May be an integer or a float between 0 and 1. If an integer, it sets the exact number of components to keep.
- If a float, it is interpreted as the fraction of variance to be preserved.

Whitening (whiten):

- A boolean parameter that, when True, rescales the principal components to have unit variance.
- Useful when the components should be decorrelated and standardized.

Solver (svd_solver):

- Determines the algorithm to use for the decomposition.
- Options include:
  - 'auto': Automatically chooses the best solver for the data.
  - 'full': Uses a full Singular Value Decomposition.
  - 'arpack': Uses the ARPACK package.
  - 'randomized': Uses a randomized algorithm for faster approximate decomposition.

Random State (random_state):

- Controls the randomness of the 'randomized' solver.
- Ensures reproducibility of the results by setting a specific seed.

Tolerance (tol):

- For the ARPACK solver, this parameter sets the tolerance for convergence.
- Smaller values give a more exact result but increase the computation time.

Copy (copy):

- A boolean parameter that determines whether PCA works on a copy of the input data.
- If copy is False, the input data is overwritten, which saves memory.
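
Putting these parameters together, here is a hedged scikit-learn example; the dataset and the parameter values are illustrative choices, not recommendations:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)  # PCA is scale-sensitive

pca = PCA(
    n_components=0.95,  # float: keep enough components for 95% of the variance
    whiten=False,       # True would rescale the components to unit variance
    svd_solver="full",  # exact SVD; 'randomized' is faster on very large data
    random_state=0,     # only used by the 'randomized' solver
    tol=0.0,            # convergence tolerance for the 'arpack' solver
    copy=True,          # work on a copy so X_scaled is left untouched
)
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape, pca.explained_variance_ratio_)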

Implementation of PCA in Python

Importing the dataset

Figure 15: Importing the dataset

Figure 16: Dataset value
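
The screenshots show the blog's own notebook. Since the dataset itself is not reproduced here, a hedged equivalent using a hypothetical CSV file would look like this:

import pandas as pd

df = pd.read_csv("dataset.csv")  # hypothetical file name; the real one is in Figure 15
print(df.head())                 # inspect the first few rows, as in Figure 16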


Determine the X and y attributes

Figure 17: Separating the dependent and independent variable
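
Continuing the sketch above, the usual convention (an assumption here, since the exact columns are only visible in the screenshot) is that the features occupy all columns except the last, which holds the label:

# Independent variables (X) and dependent variable (y)
X = df.iloc[:, :-1].values  # every column except the last
y = df.iloc[:, -1].values   # the last column as the target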

Transform model

Figure 18: Transforming the model
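
A typical version of this step standardizes the features before PCA, since, as noted in the disadvantages section below, PCA is sensitive to scale (continuing the sketch above):

from sklearn.preprocessing import StandardScaler

# Rescale every feature to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)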

Apply model

Figure 19: Implementation of the model
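
The corresponding hedged sketch reduces the scaled data to two principal components, two being an illustrative choice that makes plotting easy:

from sklearn.decomposition import PCA

pca = PCA(n_components=2)            # keep the first two principal components
X_pca = pca.fit_transform(X_scaled)  # rows are now (PC1, PC2) coordinates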

Model visualization

Figure 20: Model Visualization


Figure 21: PCA Visualization
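
A minimal matplotlib version of such a plot, continuing the sketch above (the styling in the blog's figure may differ):

import matplotlib.pyplot as plt
import pandas as pd

# Scatter the two principal components, coloured by the class label
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=pd.factorize(y)[0], cmap="viridis")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("PCA Visualization")
plt.show()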

Advantages and disadvantages of PCA

Advantages

  • PCA decreases the number of variables (features) while still maintaining most of the variance (information) in the dataset. This simplifies the model and reduces computational costs.
  • PCA surfaces the most important patterns in the data while filtering out noise, by focusing on the principal components.
  • It reduces complex data to two or three dimensions, making it easy to visualize and understand.
Disadvantages
  • The resulting principal components are linear combinations of the original features and are not always easy to interpret, especially for high-dimensional data.
  • Even though PCA retains maximal variance in the resulting principal components, some information is necessarily lost whenever the number of principal components is significantly smaller than the number of original features.
  • PCA is sensitive to the scaling of the data. Features with large scales will end up dominating the principal components, so the data must be standardized before applying PCA.

Implementation of PCA in real life

1. Finance and Banking

Figure 22: Finance and banking

PCA is used for reducing the dimensionality of financial data, identifying patterns in it, and managing the associated risk. For example, analyzing principal components gives JPMorgan Chase valuable information about market trends, which helps optimize asset portfolios and improve its risk assessment models. PCA helps identify the major factors that drive movements in asset prices, supporting better investment decisions.

2. E-commerce

Figure 23: E-commerce

Amazon applies PCA to huge volumes of customer data to reduce its dimensionality. Customer purchase histories and browsing patterns can be expressed as principal components, through which Amazon makes sense of its customers' behavior in order to recommend appropriate products and personalize marketing. This enhances the customer experience and improves sales.


3. Retail

Figure 24: Retail Shop

Walmart utilizes PCA to study sales data and reduce the number of variables impacting inventory levels. By focusing on the principal components, Walmart can understand sales trends, optimize inventory management, and forecast demand better, reducing stockout and overstock situations and improving operational efficiency.
