
K-MEANS



This blog post will provide you with a comprehensive overview of K-means, exploring the theory behind this clustering algorithm and demonstrating its implementation using Python libraries. Dive in to uncover the advantages and disadvantages of K-means, as well as its real-world applications across various domains. With that, enjoy your journey in QDO!

WHAT IS K-MEANS

                            
K-means is an unsupervised machine learning algorithm primarily used for clustering tasks. It aims to partition a dataset into a specified number of clusters, k, where each data point belongs to the cluster with the nearest mean, which serves as the "centroid" of that cluster. The algorithm starts by randomly initializing k centroids, and each data point is assigned to the nearest centroid, forming temporary clusters. After all points are assigned, the centroids are recalculated as the mean of the points within each cluster. This process of assignment and centroid recalculation repeats until the centroids stabilize or until a set number of iterations is reached.
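The assignment-and-update loop just described can be sketched in a few lines of NumPy. This is a minimal illustration only; the `kmeans` function, its arguments, and the toy data below are our own, not a library API:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-means: assign points to nearest centroid, then recompute means."""
    rng = np.random.default_rng(seed)
    # Initialize centroids as k randomly chosen data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: label each point with its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_centroids, centroids):  # centroids stabilized
            break
        centroids = new_centroids
    return centroids, labels

# Toy demo: two well-separated groups of points
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (10, 2)), rng.normal(10, 0.5, (10, 2))])
centroids, labels = kmeans(X, k=2)
```

A production implementation (such as scikit-learn's, used later in this post) adds smarter initialization and empty-cluster handling, but the core loop is exactly this assignment/update alternation.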


Concept of k-means

Let's say we want to cluster a set of genes, and we have each gene's measurement plotted on a number line


We start by selecting 3 random data points as our initial cluster centroids



Next, we assign each remaining data point to a cluster based on its distance to the nearest centroid


Next, we find the center point (mean) of each cluster, and the whole process repeats


The variance within each cluster is then computed, and the process repeats until the cluster assignments, and hence the variance, no longer change


The best number of clusters, k, can be determined through the elbow method below, in which we plot the reduction in total within-cluster variation against the number of clusters.


From the graph above, we can observe that the reduction in variation gained by each additional cluster drops off after the 3rd cluster. Hence, the ideal value of k in this situation is 3.
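The elbow plot is straightforward to reproduce with scikit-learn, whose `inertia_` attribute holds the total within-cluster variation. A sketch on synthetic data (the three-group toy dataset here is an assumed stand-in for the gene measurements above):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy stand-in data: three well-separated groups of 30 points each
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2)) for c in (0, 5, 10)])

# Total within-cluster variation (inertia) for k = 1..8
ks = range(1, 9)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]
for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 1))
```

Plotting `inertias` against `ks` gives the elbow curve; on this data the drop from k=2 to k=3 is much larger than from k=3 to k=4, marking the elbow at 3.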

Implementation of k-means in Python

Importing libraries 

from sklearn.datasets import load_iris

Loading dataset

iris = load_iris()

Applying the model

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, random_state=0)

Fitting the model

KModel = kmeans.fit(iris.data)

Comparing the cluster labels with the true species

import pandas as pd
pd.crosstab(iris.target, KModel.labels_)

col_0    0   1   2
row_0
0        0  50   0
1       47   0   3
2       14   0  36

Parameters that you can tune in k-means

Number of Clusters (k):

  • This is the number of groups you want to create. Choosing the right k is important for meaningful clusters. Common methods to find the best k include:
    • Elbow Method: Plot the total distance between points and their cluster centers for different values of k. The "elbow" point, where improvements slow down, suggests a good k.
    • Silhouette Score: A score that shows how distinct and well-separated your clusters are; higher scores are better.
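As a sketch of the silhouette approach, scikit-learn's `silhouette_score` can be compared across candidate values of k on the same iris data used above (the candidate range here is an arbitrary choice):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X = load_iris().data
scores = {}
for k in (2, 3, 4, 5):  # candidate cluster counts (an arbitrary range)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

Interestingly, on iris the silhouette score tends to favour k=2 rather than the 3 true species, because two of the species overlap heavily; this is a reminder that these heuristics guide, rather than decide, the choice of k.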

Starting Points (Initialization Method):

  • The algorithm starts by choosing k initial "centroid" points. Choosing good starting points helps K-means find better clusters:
    • Random: Picks starting points randomly, which can sometimes lead to poor results.
    • k-means++: Selects starting points that are far apart, giving better results more consistently.

Maximum Iterations:

  • This sets the maximum number of times K-means can adjust the clusters. This can be helpful if the algorithm is taking too long, allowing it to stop early after a set number of tries.

Convergence Tolerance:

  • This controls when the algorithm should stop adjusting clusters. If the centroids barely move between adjustments, the algorithm will stop. Smaller values mean the algorithm will keep fine-tuning for longer, which can improve results but also takes more time.

Number of Runs (n_init):

  • K-means can be run multiple times with different starting points to find the best grouping. The final result is chosen as the one with the lowest total distance between points and their cluster centers. More runs can improve results, but also take longer.

Algorithm Variant:

  • Different versions of K-means handle data slightly differently to improve speed:
    • Standard K-means: The most common and straightforward method.
    • Optimized K-means (like Elkan’s): Uses shortcuts to speed things up, especially helpful with large, high-dimensional data.
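The knobs described above map directly onto scikit-learn's `KMeans` arguments. A sketch using the same iris data as before (the parameter values shown are illustrative, not recommendations):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data
kmeans = KMeans(
    n_clusters=3,       # k: the number of clusters
    init="k-means++",   # initialization method ("random" is the alternative)
    n_init=10,          # number of runs with different starting centroids
    max_iter=300,       # maximum assignment/update iterations per run
    tol=1e-4,           # convergence tolerance on centroid movement
    random_state=0,
)
labels = kmeans.fit_predict(X)
print("inertia:", kmeans.inertia_)
```

Recent scikit-learn versions also expose an `algorithm` argument ("lloyd" or "elkan") that selects between the standard and optimized variants discussed above.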

Advantages and disadvantages of k-means

Advantages

  • Simple and Fast:

    • K-means is easy to understand and implement. It’s also computationally efficient, especially on large datasets, making it suitable for real-time applications.
  • Works Well with Convex Clusters:

    • K-means is effective when clusters are spherical and of similar size, often producing clear, well-separated clusters under these conditions.
  • Scalable:

    • K-means can handle large datasets and can be scaled with parallel or mini-batch versions, making it useful for big data.
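To illustrate the scalability point, scikit-learn ships a mini-batch variant that updates centroids from small random batches instead of the full dataset. The synthetic dataset below is just an assumed stand-in for "big data":

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
# Synthetic stand-in for a large dataset: 100k points in 10 dimensions
X = rng.normal(size=(100_000, 10))

# Mini-batch K-means trades a little clustering quality
# for a large speed-up on datasets of this size and beyond
mbk = MiniBatchKMeans(n_clusters=8, batch_size=1024, n_init=3, random_state=0)
labels = mbk.fit_predict(X)
```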

Disadvantages

  • Sensitive to Initial Choices:

    • The initial placement of centroids can significantly affect the final clusters, sometimes leading to suboptimal results. Different initializations can produce different outcomes.
  • Requires Specification of k:

    • You need to specify the number of clusters (k) in advance, which isn't always straightforward and can lead to trial and error or require additional methods to estimate the best k.
  • Assumes Spherical Clusters:

    • K-means assumes clusters are roughly circular and evenly sized, so it struggles with complex shapes or clusters of very different sizes, leading to poor performance in these cases.

Implementation of k-means in real life

1. Customer Segmentation


  • Amazon, Walmart, and other retailers use K-means clustering to segment customers based on their purchase behavior, demographics, browsing history, and spending patterns. By identifying clusters of customers with similar characteristics, these companies can create targeted marketing campaigns, personalize recommendations, and improve customer engagement. This segmentation helps retailers better understand their customer base, tailor product offerings, and optimize inventory based on customer preferences in different segments.

2. Image Compression


  • Companies like Google and Instagram may use K-means for image compression. K-means can reduce the number of colors in an image by clustering similar colors together, which lowers file size without significantly sacrificing image quality. By compressing images effectively, tech companies reduce storage space and improve load times, which is especially important for mobile app performance and data efficiency.
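A colour-quantization sketch of this compression idea, using a random image as a stand-in for a real photo (the image size and 16-colour palette are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in for a real photo: a random 64x64 RGB image
image = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)

pixels = image.reshape(-1, 3).astype(float)
# Cluster the pixel colours into a 16-colour palette
kmeans = KMeans(n_clusters=16, n_init=4, random_state=0).fit(pixels)
palette = kmeans.cluster_centers_.astype(np.uint8)
# Replace each pixel by its nearest palette colour
compressed = palette[kmeans.labels_].reshape(image.shape)
```

Storing a 16-entry palette plus one 4-bit index per pixel takes far less space than 24 bits per pixel, which is the essence of the compression.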

3. Anomaly Detection in Network Security


  • Banks, like JPMorgan Chase, and cybersecurity firms use K-means clustering to detect unusual patterns in network traffic or user behavior. K-means can help identify clusters of "normal" behavior, so outliers (potential security threats) stand out more clearly. This clustering approach helps detect fraud, identify potential breaches, and monitor irregular activities, which are critical for preventing financial losses and protecting sensitive data.
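One simple way to operationalize this idea (a sketch under our own assumptions, not any bank's actual pipeline) is to fit K-means on baseline "normal" activity, then flag new events whose distance to the nearest centroid exceeds a threshold derived from the training distances:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical baseline: 200 points of "normal" activity
normal = rng.normal(0, 1, size=(200, 2))
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(normal)

# Distance of each training point to its nearest centroid sets the threshold
train_d = kmeans.transform(normal).min(axis=1)
threshold = train_d.mean() + 3 * train_d.std()

# Score new events: one ordinary, one far outside normal behaviour
new = np.array([[0.2, -0.1], [12.0, 12.0]])
flags = kmeans.transform(new).min(axis=1) > threshold
```

`KMeans.transform` returns the distance from each point to every centroid; taking the minimum gives a natural anomaly score, with points far from all clusters of normal behaviour standing out.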
