

K-NEAREST NEIGHBOUR (KNN)

 


Figure 1: KNN

This blogpost will walk you through the concept of k-nearest neighbors (KNN), a machine learning model capable of both descriptive and predictive modelling. After reading this blogpost, you will know the parameters of KNN that you can tune, the advantages and disadvantages that come with it, and how the KNN model is implemented both in Python and in real life by companies to tackle daily tasks.

What is KNN

The k-nearest neighbors (KNN) algorithm is one of the simplest and most intuitive algorithms in the field of machine learning. KNN can be applied to both classification and regression. In essence, the algorithm classifies or predicts the value of a new data point by considering the "k" closest data points in the feature space. In classification, KNN assigns the new data point to the most frequent class among its k nearest neighbors, found using a distance metric (usually Euclidean distance). In regression, the value of the new data point is predicted as the average of the values of its k nearest neighbors. KNN is nonparametric: it does not assume any underlying distribution of the data, and it works well even when the decision boundary is nonlinear.

Concept of KNN

1) Start with a dataset whose raw data is plotted on a graph
Figure 2: Plotting on PCA graph

2) Plot the targeted data on the graph

Figure 3: Random Data on the graph

3) Observe the categories of the plotted points around the unknown data point

With k = 11, we look at the categories of the 11 nearest data points around the targeted point. In this case, the targeted point is predicted to be green.
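The voting procedure described above can be sketched from scratch in a few lines of Python. The two toy clusters and their "red"/"green" labels below are invented for illustration:

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest neighbours."""
    # Euclidean distance from the query to every training point
    dists = [math.dist(x, query) for x in train_X]
    # Indices of the k smallest distances
    nearest = sorted(range(len(dists)), key=dists.__getitem__)[:k]
    # Majority vote among the neighbours' labels
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data: two well-separated clusters
X = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
y = ["red", "red", "red", "green", "green", "green"]

print(knn_predict(X, y, (8.5, 8.5), k=3))  # query near the green cluster
```

A query point near the (8, 8) cluster lands among three green neighbours, so the vote is unanimous.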

Parameters that you can tune for KNN

Number of Neighbors (k):

- The number of nearest neighbors used to make a prediction. The value of k is critical: if k is too small, the model becomes sensitive to noise; if k is too large, predictions are over-smoothed. Common practice is to try various k values and use cross-validation to determine the best one.
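A minimal sketch of this practice, assuming scikit-learn and using its bundled Iris dataset as a stand-in:

```python
# Choosing k with 5-fold cross-validation on a toy dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

scores = {}
for k in range(1, 16, 2):  # odd k values reduce the chance of tied votes
    clf = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(clf, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```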


Distance Metric:

- It measures the distance between data points.

- Options include:

  1. Euclidean distance: Straight line distance between two points in Euclidean space
  2. Manhattan distance: Distance between two points is the sum of the absolute difference of their coordinates
  3. Minkowski distance: Generalization of Euclidean and Manhattan distances
  4. Hamming distance: Number of positions at which the corresponding elements of two vectors differ
  5. Mahalanobis distance: Takes into account the correlations between variables
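Most of these metrics can be computed directly with SciPy; the example vectors below are invented for illustration. Note that SciPy's `hamming` returns the fraction of differing positions rather than the raw count:

```python
import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

print(distance.euclidean(a, b))        # straight-line distance
print(distance.cityblock(a, b))        # Manhattan: sum of |a_i - b_i|
print(distance.minkowski(a, b, p=2))   # p = 2 reduces to Euclidean
print(distance.hamming([1, 0, 1], [1, 1, 0]))  # fraction of differing positions
```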


Weight Function (Weights):

Determines how the neighbors are weighted in the prediction.

- Options include:

    1. Uniform: Neighbors are weighted equally.
    2. Distance: Closer neighbors are given more weight.
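A small sketch of how the two options can disagree, assuming scikit-learn. The one-dimensional points are contrived so that two very close neighbours outvote three distant ones under distance weighting:

```python
from sklearn.neighbors import KNeighborsClassifier

# Two class-0 points sit right next to the query; three class-1 points are far away.
X = [[0.0], [0.2], [5.0], [5.1], [5.2]]
y = [0, 0, 1, 1, 1]

uniform = KNeighborsClassifier(n_neighbors=5, weights="uniform").fit(X, y)
weighted = KNeighborsClassifier(n_neighbors=5, weights="distance").fit(X, y)

# With k = 5 all points vote: uniform weighting picks class 1 (3 votes to 2),
# but distance weighting picks class 0, whose neighbours are ~50x closer.
print(uniform.predict([[0.1]])[0])   # 1
print(weighted.predict([[0.1]])[0])  # 0
```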

Algorithm:

- The algorithm used to compute the nearest neighbors.

- Options include:

  1. Brute-force: Compute distance between every pair of points in a straightforward way.
  2. Ball Tree: Uses a ball tree (a binary tree structure, scikit-learn's BallTree) to partition the data points.
  3. KD Tree: It uses a k-dimensional tree to partition the data points. 
  4. Auto: It automatically selects the best algorithm based on the input dataset.
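All four strategies perform exact (not approximate) search, so they return the same neighbours and differ only in speed; a quick sketch with scikit-learn's `NearestNeighbors` on random data:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.random((200, 3))

# Run the same query through each search strategy
query = X[:1]
results = {}
for algo in ("brute", "kd_tree", "ball_tree", "auto"):
    nn = NearestNeighbors(n_neighbors=5, algorithm=algo).fit(X)
    _, idx = nn.kneighbors(query)
    results[algo] = idx[0].tolist()

print(results)  # every strategy returns the same neighbour indices
```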

Leaf Size:

- The number of points at which the tree-based algorithms switch to brute-force search within a leaf. It affects the speed and memory required to build the tree structures. Larger leaf sizes generally make tree building faster but can slow down query times.


p (for Minkowski Distance):

- The power parameter of the Minkowski distance metric.

Options include:

    1. p=1: Equivalent to Manhattan distance.
    2. p=2: Equivalent to Euclidean distance.

Metric Parameters (metric_params):

- Additional parameters for the distance metric. Allows for fine-tuning of the distance metric beyond the standard options.

n_jobs:

- The number of parallel jobs to run for neighbors search. Setting n_jobs to -1 uses all available processors, which can speed up the computation.


Measure types

Types of measures that can be used to determine the "closeness" or similarity between data points.

Options include:

  1. MixedMeasures: Combine different types of measures for different kinds of attributes in a data set
  2. NominalMeasures: Measure dissimilarity between categories
  3. NumericalMeasures: Used for continuous or ordinal data where the mathematical notions of addition and subtraction make sense
  4. BregmanDivergences: A family of measures of the difference between two probability distributions


Implementation of KNN in Python

Figure 4: Importing dataset

Figure 5: Overview of dataset

Figure 6: Understanding value counts

Figure 7: Separating dependent and independent variables


Figure 8: Importing necessary libraries

Figure 9: LabelEncoder

Figure 10: Splitting training and testing data

Figure 11: Applying the model

Figure 12: Getting Accuracy Score

Figure 13: Accuracy score

Figure 14: Classification Report

Figure 15: Getting Confusion Matrix

Figure 16: Confusion Matrix
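The original dataset behind Figures 4-16 is not shown here, so the sketch below follows the same steps (inspect the data, separate variables, encode labels, split, fit, evaluate) using scikit-learn's bundled Iris dataset as a stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Figures 4-6: load the data and inspect it
iris = load_iris(as_frame=True)
df = iris.frame
df["species"] = iris.target_names[iris.target]  # string labels, as in a raw CSV
print(df.head())
print(df["species"].value_counts())

# Figure 7: separate independent (X) and dependent (y) variables
X = df.drop(columns=["target", "species"])
y = df["species"]

# Figure 9: encode the string labels as integers
y = LabelEncoder().fit_transform(y)

# Figure 10: split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Figure 11: fit the model and predict on the held-out set
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Figures 12-16: evaluate
acc = accuracy_score(y_test, y_pred)
print(acc)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```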

Advantages and disadvantages of KNN

Advantages

  • KNN is simple to understand and implement. It often serves as a baseline model against which more complex algorithms are compared.
  • KNN is a lazy learning algorithm: it has no explicit training phase. The algorithm simply stores the training data and defers all computation to prediction time.
  • KNN makes no assumptions about the underlying distribution of the data, which makes it versatile and applicable to a wide range of problems.

Disadvantages

  • Since KNN compares the query point against every stored data point at prediction time, it becomes very slow on large datasets.
  • Storing the whole training dataset consumes a lot of memory, which can be expensive if the dataset is huge.
  • As the number of features grows, distances between points become less and less meaningful (the "curse of dimensionality"), and accuracy degrades along with performance. Feature selection or dimensionality reduction techniques are often needed.


Application of KNN in real life

Telecommunications

Figure 17: Verizon

Verizon uses KNN to predict customer churn. KNN analyzes customer usage patterns and behaviors to identify those likely to leave, allowing proactive retention strategies.

Automotive
Figure 18: Tesla


Tesla uses KNN in its Autopilot system for object detection and classification. KNN helps in identifying objects on the road, aiding the vehicle's autonomous navigation.

Entertainment
Figure 19: Netflix movies

Netflix uses KNN to recommend content to users. Based on viewing patterns and user preferences, the algorithm suggests movies and shows that the user is likely to enjoy.
