

K-NEAREST NEIGHBOUR (KNN)

 


Figure 1: KNN

This blogpost will walk you through the concept of k-nearest neighbors (KNN), a machine learning model capable of both descriptive and predictive modelling. After reading this blogpost, you will know the parameters of KNN that you can tune, the advantages and disadvantages that come with it, and how the KNN model is implemented both in Python and in real life by companies to tackle daily tasks.

What is KNN

The k-nearest neighbors (KNN) algorithm is one of the simplest and most intuitive algorithms in the field of machine learning. KNN can be applied to both classification and regression. In essence, the algorithm classifies or predicts the value of a new data point by considering the "k" closest data points in the feature space. In classification, KNN assigns the new data point to the most frequent class among its k nearest neighbors, found using a distance metric (usually Euclidean distance). In regression, the value of the new data point is predicted as the average of the values of its k nearest neighbors. KNN is nonparametric: it does not assume any underlying distribution of the data, and it works well even when the decision boundary is nonlinear.

Concept of KNN

1) Start with a dataset whose raw data is plotted on a graph
Figure 2: Plotting on PCA graph

2) Plot the targeted data on the graph

Figure 3: Random Data on the graph

3) Observe the categories of the plotted points around the unknown data point

With k = 11, we look at the categories of the 11 nearest data points around the targeted point. In this case, the targeted point is predicted to be green.
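The voting procedure described above can be sketched from scratch in a few lines of Python. The two toy clusters and their "red"/"green" labels below are invented for illustration:

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest neighbours."""
    # Euclidean distance from the query to every training point
    dists = [math.dist(x, query) for x in train_X]
    # Indices of the k smallest distances
    nearest = sorted(range(len(dists)), key=dists.__getitem__)[:k]
    # Majority vote among the neighbours' labels
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data: two well-separated clusters
X = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
y = ["red", "red", "red", "green", "green", "green"]

print(knn_predict(X, y, (8.5, 8.5), k=3))  # query near the green cluster
```

A query point near the (8, 8) cluster lands among three green neighbours, so the vote is unanimous.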

Parameters that you can tune for KNN

Number of Neighbors (k):

- The number of nearest neighbors used to make a prediction. The value of k is critical: if k is too small, the model becomes sensitive to noise; if k is too large, predictions are over-smoothed. Common practice is to try various k values and use cross-validation to determine the best one.
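A minimal sketch of this practice, assuming scikit-learn and using its bundled Iris dataset as a stand-in:

```python
# Choosing k with 5-fold cross-validation on a toy dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

scores = {}
for k in range(1, 16, 2):  # odd k values reduce the chance of tied votes
    clf = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(clf, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```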


Distance Metric:

- It measures the distance between data points.

- Options include:

  1. Euclidean distance: Straight line distance between two points in Euclidean space
  2. Manhattan distance: Distance between two points is the sum of the absolute difference of their coordinates
  3. Minkowski distance: Generalization of Euclidean and Manhattan distances
  4. Hamming distance: Number of positions at which the corresponding elements of two vectors differ
  5. Mahalanobis distance: Takes into account the correlations between variables
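Most of these metrics can be computed directly with SciPy; the example vectors below are invented for illustration. Note that SciPy's `hamming` returns the fraction of differing positions rather than the raw count:

```python
import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

print(distance.euclidean(a, b))        # straight-line distance
print(distance.cityblock(a, b))        # Manhattan: sum of |a_i - b_i|
print(distance.minkowski(a, b, p=2))   # p = 2 reduces to Euclidean
print(distance.hamming([1, 0, 1], [1, 1, 0]))  # fraction of differing positions
```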


Weight Function (Weights):

Determines how the neighbors are weighted in the prediction.

- Options include:

    1. Uniform: Neighbors are weighted equally.
    2. Distance: Closer neighbors are given more weight.
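A small sketch of how the two options can disagree, assuming scikit-learn. The one-dimensional points are contrived so that two very close neighbours outvote three distant ones under distance weighting:

```python
from sklearn.neighbors import KNeighborsClassifier

# Two class-0 points sit right next to the query; three class-1 points are far away.
X = [[0.0], [0.2], [5.0], [5.1], [5.2]]
y = [0, 0, 1, 1, 1]

uniform = KNeighborsClassifier(n_neighbors=5, weights="uniform").fit(X, y)
weighted = KNeighborsClassifier(n_neighbors=5, weights="distance").fit(X, y)

# With k = 5 all points vote: uniform weighting picks class 1 (3 votes to 2),
# but distance weighting picks class 0, whose neighbours are ~50x closer.
print(uniform.predict([[0.1]])[0])   # 1
print(weighted.predict([[0.1]])[0])  # 0
```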

Algorithm:

- The algorithm used to compute the nearest neighbors.

- Options include:

  1. Brute-force: Compute distance between every pair of points in a straightforward way.
  2. Ball Tree: Uses a ball tree (a binary tree structure, scikit-learn's BallTree) to partition the data points.
  3. KD Tree: It uses a k-dimensional tree to partition the data points. 
  4. Auto: It automatically selects the best algorithm based on the input dataset.
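All four strategies perform exact (not approximate) search, so they return the same neighbours and differ only in speed; a quick sketch with scikit-learn's `NearestNeighbors` on random data:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.random((200, 3))

# Run the same query through each search strategy
query = X[:1]
results = {}
for algo in ("brute", "kd_tree", "ball_tree", "auto"):
    nn = NearestNeighbors(n_neighbors=5, algorithm=algo).fit(X)
    _, idx = nn.kneighbors(query)
    results[algo] = idx[0].tolist()

print(results)  # every strategy returns the same neighbour indices
```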

Leaf Size:

- The number of points at which the tree-based algorithms switch to brute-force search within a leaf. It affects the speed and memory required to build the tree structures. Larger leaf sizes generally make tree building faster but can slow down query times.


p (for Minkowski Distance):

- The power parameter of the Minkowski distance metric.

Options include:

    1. p=1: Equivalent to Manhattan distance.
    2. p=2: Equivalent to Euclidean distance.

Metric Parameters (metric_params):

- Additional parameters for the distance metric. Allows for fine-tuning of the distance metric beyond the standard options.

n_jobs:

- The number of parallel jobs to run for neighbors search. Setting n_jobs to -1 uses all available processors, which can speed up the computation.


Measure types

Types of measures that can be used to determine the "closeness" or similarity between data points.

Options include:

  1. MixedMeasures: Combine different types of measures for different kinds of attributes in a data set
  2. NominalMeasures: Measure dissimilarity between categories
  3. NumericalMeasures: Used for continuous or ordinal data where the mathematical notions of addition and subtraction make sense
  4. BregmanDivergences: A family of measures of the difference between two probability distributions


Implementation of KNN in Python

Figure 4: Importing dataset

Figure 5: Overview of dataset

Figure 6: Understanding value counts

Figure 7: Separating dependent and independent variables


Figure 8: Importing necessary libraries

Figure 9: LabelEncoder

Figure 10: Splitting training and testing data

Figure 11: Applying the model

Figure 12: Getting Accuracy Score

Figure 13: Accuracy score

Figure 14: Classification Report

Figure 15: Getting Confusion Matrix

Figure 16: Confusion Matrix
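The original dataset behind Figures 4-16 is not shown here, so the sketch below follows the same steps (inspect the data, separate variables, encode labels, split, fit, evaluate) using scikit-learn's bundled Iris dataset as a stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Figures 4-6: load the data and inspect it
iris = load_iris(as_frame=True)
df = iris.frame
df["species"] = iris.target_names[iris.target]  # string labels, as in a raw CSV
print(df.head())
print(df["species"].value_counts())

# Figure 7: separate independent (X) and dependent (y) variables
X = df.drop(columns=["target", "species"])
y = df["species"]

# Figure 9: encode the string labels as integers
y = LabelEncoder().fit_transform(y)

# Figure 10: split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Figure 11: fit the model and predict on the held-out set
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Figures 12-16: evaluate
acc = accuracy_score(y_test, y_pred)
print(acc)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```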

Advantages and disadvantages of KNN

Advantages

  • KNN is simple to understand and implement. It often serves as a baseline model against which more complex algorithms are compared.
  • KNN is a lazy learning algorithm: it has no explicit training phase. The algorithm simply stores the training data and defers all computation to prediction time.
  • KNN makes no assumptions about the underlying distribution of the data, which makes it versatile and applicable to a wide range of problems.

Disadvantages

  • Since KNN compares the query point against every stored data point at prediction time, it becomes very slow on large datasets.
  • Storing the whole training dataset consumes a lot of memory, which can be expensive if the dataset is huge.
  • As the number of features grows, distances between points become less and less meaningful (the "curse of dimensionality"), and accuracy degrades along with performance. Feature selection or dimensionality reduction techniques are often needed.


Application of KNN in real life

Telecommunications

Figure 17: Verizon

Verizon uses KNN to predict customer churn. KNN analyzes customer usage patterns and behaviors to identify those likely to leave, allowing proactive retention strategies.

Automotive
Figure 18: Tesla


Tesla uses KNN in its Autopilot system for object detection and classification. KNN helps in identifying objects on the road, aiding the vehicle's autonomous navigation.

Entertainment
Figure 19: Netflix movies

Netflix uses KNN to recommend content to users. Based on viewing patterns and user preferences, the algorithm suggests movies and shows that the user is likely to enjoy.
