PRINCIPAL COMPONENT ANALYSIS (PCA)
*Figure 1: PCA*
This blog post will bring you the concept of principal component analysis (PCA), one of the most commonly used techniques for dimensionality reduction. You will learn how to implement this machine learning model in Python, its advantages and disadvantages, as well as how companies benefit from it.
What is PCA
PCA is a statistical dimensionality-reducing technique. It takes a large set of variables and transforms them into a smaller set, retaining most of the information in the large set. This can be done by identifying the directions along which the data varies the most. These components are orthogonal to one another, capture the maximum possible variance within the data, and hence form a powerful tool for the simplification of datasets without loss of essential patterns and relationships.
Concept of PCA
One of the key ideas behind PCA is reducing the complexity of high-dimensional data while preserving as much variability as possible. It computes the eigenvalues and eigenvectors of the covariance matrix of the data, which give the directions along which the data has the highest variance. By projecting the original data onto these principal components, PCA obtains a new, reduced set of variables that retains the principal structure and features of the data for efficient analysis and visualization.
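The steps above can be sketched directly with NumPy. This is an illustrative example, not a production implementation; the small 2-D dataset is made up:

```python
import numpy as np

# Toy 2-D data: 5 points whose variation lies mostly along one diagonal.
X = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2],
              [3.1, 3.0]])

# 1. Centre the data so each feature has zero mean.
X_centred = X - X.mean(axis=0)

# 2. Covariance matrix of the centred data.
cov = np.cov(X_centred, rowvar=False)

# 3. Eigen-decomposition: eigenvectors are the principal directions,
#    eigenvalues are the variance captured along each direction.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Sort from largest to smallest eigenvalue.
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 4. Project the data onto the top principal component.
pc1_scores = X_centred @ eigenvectors[:, 0]
print(eigenvalues)   # variance along each principal direction
print(pc1_scores)    # 1-D representation of the 2-D data
```

Note that `np.linalg.eigh` is used because a covariance matrix is symmetric; the projection in step 4 is what reduces the dimensionality.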
Scenario
Imagine that you are a geneticist and you are trying to cluster mice based on their genes.
*Figure 2: Example of data*

*Figure 3: Example of data*
If we have 3 types of genes, we can plot the data on a 3-dimensional graph and cluster it based on where the points fall.
*Figure 4: Example of data*

*Figure 5: PCA plotting*
By implementing Principal Component Analysis (PCA), we can plot the data based on the values of its principal components without worrying about the dimensionality issue, because the number of dimensions is reduced by this machine learning model.
1) Principal component
A principal component is simply a linear combination of the original variables, computed to capture the maximum possible variance. But how do we obtain the principal components of the graph?
*Figure 6: Finding middle point*
First and foremost, find the middle point of the data (the mean of the x and y values) and shift the data so that this point sits at the origin.
*Figure 7: Middle point in origin*

*Figure 10: Measure distance*
Draw a line through the origin and keep rotating it until you obtain the best fit line. While rotating the line, you will notice that as the distance from a projected point to the origin increases, the distance from the original point to its projected point (d1) decreases.
Calculate the sum of squared distances (SS) by squaring the distance from each projected point to the origin and adding them all up.
The best fit line is the line with the largest sum of squared distances, and it is called the first principal component (PC1).
*Figure 11: Calculate SS value*

*Figure 12: Drawing the best fit line*
PC2 is perpendicular to PC1 and also passes through the origin.
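The rotation procedure above can be simulated numerically: try many candidate directions through the origin and keep the one with the largest SS. A minimal sketch, assuming NumPy and synthetic data (in practice PCA uses the eigendecomposition directly, which the last lines confirm):

```python
import numpy as np

# Synthetic 2-D data with a clear dominant direction (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
X = X - X.mean(axis=0)                    # centre the data first

best_ss, best_dir = -1.0, None
for theta in np.linspace(0.0, np.pi, 1800):          # rotate the line
    direction = np.array([np.cos(theta), np.sin(theta)])
    ss = np.sum((X @ direction) ** 2)    # sum of squared projected distances
    if ss > best_ss:
        best_ss, best_dir = ss, direction

# The winning direction agrees with the top eigenvector of the covariance.
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
pc1 = eigvecs[:, np.argmax(eigvals)]
print(abs(np.dot(best_dir, pc1)))        # close to 1: same line (up to sign)
```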
2) Eigenvectors
*Figure 13: Best fit line*
Assume that the slope of the best fit line is 0.25.
This indicates that the PC is made up of 4 units along the x axis for every 1 unit along the y axis, so the x axis is more important to this component.
From here, we can use the Pythagorean theorem to get the length of the principal component vector.
*Figure 14: Getting the hypotenuse*
√(4² + 1²) = √17 ≈ 4.12
4.12 / 4.12 = 1
4 / 4.12 = 0.97
1 / 4.12 = 0.242
This unit vector, consisting of 0.97 units of x and 0.242 units of y, is known as the eigenvector of PC1.
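The arithmetic above can be checked in a few lines (NumPy assumed):

```python
import numpy as np

# Slope 0.25 means the line rises 1 unit in y for every 4 units in x,
# so the direction vector of the best fit line is (4, 1).
v = np.array([4.0, 1.0])

# Pythagorean theorem: length of the hypotenuse.
length = np.sqrt(v[0] ** 2 + v[1] ** 2)   # sqrt(17) ≈ 4.12

# Scale the vector to unit length -> the eigenvector of PC1.
unit = v / length
print(length)   # ≈ 4.12
print(unit)     # ≈ [0.970, 0.243]
```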
3) Eigenvalues
On the other hand, the eigenvalue of a principal component can be obtained through this formula:
Eigenvalue = SS / (n − 1)
That is, divide the sum of squared distances by the number of data points minus 1.
Eigenvalues also represent the amount of variation in the data captured by each principal component.
Assume the eigenvalue for PC1 is 15 and the eigenvalue for PC2 is 3. The percentages of variance represented by PC1 and PC2 respectively are:
PC1
= 15/18
≈ 83%
PC2
= 3/18
≈ 17%
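This percentage calculation is easy to verify in plain Python:

```python
# Eigenvalues from the example above: 15 for PC1 and 3 for PC2.
eigenvalues = [15.0, 3.0]
total = sum(eigenvalues)   # 18

# Each component's share of the total variance.
ratios = [v / total for v in eigenvalues]
print([round(r * 100) for r in ratios])   # [83, 17]
```

This is exactly what scikit-learn reports as `explained_variance_ratio_` after fitting a PCA model.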
Parameters that you can tune for PCA
Variance threshold
- the minimum amount of variance that a principal component should capture to be retained.
Dimensionality reduction
- Options include:
- keep variance: This will retain enough components to capture a specified amount of variance, as mentioned above with the variance threshold.
- fixed number: This mode sets a fixed number of principal components to retain
Number of Components (n_components):
- The number of principal components to retain.
- May be an integer or a float between 0 and 1. If an integer, it sets the number of components to keep.
- If a float, it is the fraction of variance to be preserved.
Whitening (whiten):
- A boolean parameter; when True, the principal components are rescaled to have unit variance.
- Useful when the components should be decorrelated and standardized.
Solver (svd_solver):
- Determines the algorithm used for the decomposition. Options include:
- 'auto': automatically chooses the best solver for the data.
- 'full': uses a full Singular Value Decomposition.
- 'arpack': uses the ARPACK package.
- 'randomized': uses a randomized algorithm for faster approximate decomposition.
Random State (random_state):
- Controls the randomness of the 'randomized' solver.
- Ensures reproducibility of the results by setting a specific seed.
Tolerance (tol):
- For the ARPACK solver, this parameter sets the tolerance for convergence.
- Smaller values give a more exact result but increase computation time.
Copy (copy):
- A boolean parameter that controls whether a copy of the input data is used.
- If copy is False, the input data may be overwritten, which saves memory.
Implementation of PCA in Python
Importing the dataset
*Figure 15: Importing the dataset*

*Figure 16: Dataset value*
Determine the X and y attributes
*Figure 17: Separating the dependent and independent variables*
Transform model
*Figure 18: Transforming the model*
Apply model
*Figure 19: Implementation of the model*
Model visualization
*Figure 20: Model Visualization*
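The steps shown in the figures can be sketched end to end as follows. This is a minimal example assuming scikit-learn, with the iris dataset standing in for the dataset used in the screenshots:

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# 1. Import the dataset (iris used here as a stand-in).
data = load_iris()
X, y = data.data, data.target          # X: independent, y: dependent

# 2. Standardise the features: PCA is sensitive to feature scales.
X_scaled = StandardScaler().fit_transform(X)

# 3. Fit PCA and transform the data down to 2 components.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print(X_pca.shape)                      # (150, 2)
print(pca.explained_variance_ratio_)    # share of variance per component

# 4. Visualise: each point coloured by its class label
#    (uncomment if matplotlib is available).
# import matplotlib.pyplot as plt
# plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y)
# plt.xlabel('PC1'); plt.ylabel('PC2'); plt.show()
```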
Advantages and disadvantages of PCA
Advantages
- PCA decreases the number of variables (features) while still retaining most of the variance (information) in the dataset. This simplifies the model and reduces computational cost.
- PCA surfaces the most important patterns in the data while filtering out noise by focusing on the principal components.
- It reduces complex data to two or three dimensions, making it easy to visualize and understand.
Disadvantages
- The resulting principal components are linear combinations of the original features and are not always easy to interpret, especially for high-dimensional data.
- Even though PCA retains maximal variance in the resulting principal components, some information is inevitably lost when the number of principal components is significantly smaller than the number of original features.
- PCA is sensitive to the scaling of the data: features with large scales will end up dominating the principal components. Therefore, the data must be standardized before applying PCA.
Implementation of PCA in real life
1. Finance and Banking
*Figure 22: Finance and banking*
PCA is used for reducing the dimensionality of financial data, identifying patterns in it, and managing the associated risk. A study of the principal components gives JPMorgan Chase valuable information about the trends in the markets, which aids in the optimization of asset portfolios and the improvement of their risk assessment models. PCA helps in the identification of the major factors that drive movements in asset prices, therefore helping with better investment decisions.
2. E-commerce
*Figure 23: E-commerce*
Amazon applies PCA to the huge volumes of customer data it collects in order to reduce dimensionality. Customer purchase history and browsing patterns can be expressed as principal components, through which Amazon makes sense of its customers' behavior in order to recommend appropriate products and personalize marketing. This enhances the customer experience and thereby improves sales.
3. Retail
*Figure 24: Retail Shop*
Walmart utilizes PCA to study sales data and reduce the number of variables impacting inventory levels. By focusing on these principal components, Walmart can understand sales trends, optimize inventory management, and produce better demand forecasts, reducing stockout and overstock situations and hence improving operational efficiency.