We often judge people by their vicinity to the group of people they live with. People who belong to a particular group are usually considered similar based on the characteristics they possess. This is the simple principle on which the KNN algorithm works – “Birds of the same feather flock together.”
The abbreviation KNN stands for “K-Nearest Neighbor”. It is one of the simplest supervised machine learning algorithms used for classification. It’s a classifier algorithm where the learning is based on “how similar” a data (a vector) to another. It stores all available cases and classifies new cases based on similar features.
K-NN is a non-parametric algorithm, which means it does not make any assumptions on underlying data.
It is also called a lazy learner algorithm because it does not learn from the training set immediately instead it stores the dataset and at the time of classification, it performs an action on the dataset.
The KNN algorithm at the training phase just stores the dataset and when it gets new data, then it classifies that data into a category that is much similar to the new data.
Step by step to KNN algorithm:
The concept of finding nearest neighbors may be defined as “the process of finding the closest point to the input point from the given data set”. The algorithm stores all the available cases (test data) and classifies new cases by majority votes of its K neighbors.
Problem Statement: Consider the following example below to assign the new input data point to one of the two classes by using the KNN algorithm
In the below image, we have two classes of data, namely class A (squares) and Class B (triangles)
Step-1: Select the number K of the neighbors
“k in kNN algorithm represents the number of nearest neighbor points which are voting for the new test data’s class.” To choose the value of K, take the square root of n (sqrt(n)), where n is the total number of data points.
Note: K must be an odd integer point, otherwise we can’t identify the new data point to which class it belongs. If k =4, sometimes we get 2 positive and 2 negative. So usually, an odd value of K is selected to avoid confusion between two classes of data.
If k =3:-
Here, the algorithm will consider the three neighbors that are the closest to the new data point in order to decide the class of this new data point. At ‘K’ = 3, the neighbors include two squares and 1 triangle. So, if I were to classify the new data point based on ‘K’ = 3, then it would be assigned to Class A (squares).
If k= 7:– But what if the ‘K’ value is set to 7? Here, I’m basically telling my algorithm to look for the seven nearest neighbors and classify the new data point into the class it is most similar to.
Step-2: Calculate the distance of K number of neighbors
The closeness/distance between the data points is calculated with the help of any of the measures namely: Euclidean, Manhattan, or Hamming distance.
To know more about these measures mathematically please click here https://ai.plainenglish.io/the-math-behind-knn-7883aa8e314c The most commonly used method to calculate distance is Euclidean.
KNN uses Euclidean distance as a measure to check the distance between a new data point and its neighbors, let’s see how.
According to the Euclidean distance formula, the distance between two points in the plane with coordinates (x, y) and (a, b) is given by:
Step: 3 Take the K nearest neighbors as per the calculated Euclidean distance: i.e. based on the distance value, sort them in ascending order, it will choose the top K rows from the sorted array.
Step-4: Among these k neighbors, count the number of the data points in each category.
Step-5: Assign the new data points to that category for which the number of the neighbor is maximum.
So here, At ‘K’ = 7, Classifies the new data with the class that you took in step 2; the neighbors include three squares and four triangles. So, if I were to classify the new data point based on ‘K’ = 7, then it would be assigned to Class B (triangles) since the majority of its neighbors were of class B.
Step-6: Our model is ready!
We can build this k-kNN classifier model using the Python Scikit-learn package: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html