A KNN classifier predicts the class of a given test observation by identifying the training observations that are nearest to it. Because of this, the scale of the variables in the dataset is very important: variables measured on a large scale dominate the distance between observations and therefore dominate the KNN classifier as well.
An intuitive way to handle this scaling problem in KNN classification is to standardize the dataset so that every variable has a mean of zero and a standard deviation of one. Training algorithm:
- Store all the data
Prediction Algorithm:
- Calculate the distance from x to all points in your data.
- Sort the points in your data by increasing the distance from x.
- Predict the majority label of the "k" closest points (a minimal sketch of these steps is shown below).
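To make the steps above concrete, here is a minimal NumPy sketch of a single KNN prediction. This is an illustrative toy example, not the notebook's implementation; the actual analysis below relies on scikit-learn's KNeighborsClassifier.
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    # Step 1: distance from x to every stored training point (Euclidean).
    distances = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    # Step 2: sort the training points by increasing distance from x.
    nearest = np.argsort(distances)[:k]
    # Step 3: majority vote among the labels of the k closest points.
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy usage with made-up 2-D points and binary labels.
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0]])
y_toy = np.array([0, 0, 1, 1])
print(knn_predict(X_toy, y_toy, np.array([0.95, 1.05]), k=3))  # -> 1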
This project aims to classify observations with respect to a target variable given in the last column. It is important to note that this is one of the anonymized datasets provided by clients, likely because of the need to protect sensitive information.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
data = pd.read_csv('annonimizeddataset',index_col = 0)
data.head()
|   | WTT | PTI | EQW | SBI | LQE | QWG | FDJ | PJF | HQE | NXJ | TARGET CLASS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.913917 | 1.162073 | 0.567946 | 0.755464 | 0.780862 | 0.352608 | 0.759697 | 0.643798 | 0.879422 | 1.231409 | 1 |
| 1 | 0.635632 | 1.003722 | 0.535342 | 0.825645 | 0.924109 | 0.648450 | 0.675334 | 1.013546 | 0.621552 | 1.492702 | 0 |
| 2 | 0.721360 | 1.201493 | 0.921990 | 0.855595 | 1.526629 | 0.720781 | 1.626351 | 1.154483 | 0.957877 | 1.285597 | 0 |
| 3 | 1.234204 | 1.386726 | 0.653046 | 0.825624 | 1.142504 | 0.875128 | 1.409708 | 1.380003 | 1.522692 | 1.153093 | 1 |
| 4 | 1.279491 | 0.949750 | 0.627280 | 0.668976 | 1.232537 | 0.703727 | 1.115596 | 0.646691 | 1.463812 | 1.419167 | 1 |
So the data here is anonymized, with meaningless labels in place of the raw column names. The last column is the target class, which needs to be predicted.
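As a quick optional check (not in the original notebook), the balance of the target classes can be inspected before modelling:
data['TARGET CLASS'].value_counts()  # number of observations in each class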
data.columns
Index(['WTT', 'PTI', 'EQW', 'SBI', 'LQE', 'QWG', 'FDJ', 'PJF', 'HQE', 'NXJ',
'TARGET CLASS'],
dtype='object')
data.shape
(1000, 11)
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 11 columns):
WTT 1000 non-null float64
PTI 1000 non-null float64
EQW 1000 non-null float64
SBI 1000 non-null float64
LQE 1000 non-null float64
QWG 1000 non-null float64
FDJ 1000 non-null float64
PJF 1000 non-null float64
HQE 1000 non-null float64
NXJ 1000 non-null float64
TARGET CLASS 1000 non-null int64
dtypes: float64(10), int64(1)
memory usage: 93.8 KB
sns.heatmap(data.isnull(),yticklabels=False,cbar=False)
<matplotlib.axes._subplots.AxesSubplot at 0x1a25531a20>
The heatmap above shows clearly that there is no missing data in the set.
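A quick numeric confirmation (an optional addition, not in the original notebook) is to count missing values per column with pandas:
data.isnull().sum()  # should be all zeros, consistent with data.info() above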
As pointed out earlier, scaling the variables is very important in KNN.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(data.drop('TARGET CLASS',axis = 1))
StandardScaler(copy=True, with_mean=True, with_std=True)
scaled_feat = scaler.transform(data.drop('TARGET CLASS',axis = 1))
data_feat = pd.DataFrame(scaled_feat,columns=data.columns[:-1])
data_feat.head()
|   | WTT | PTI | EQW | SBI | LQE | QWG | FDJ | PJF | HQE | NXJ |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.123542 | 0.185907 | -0.913431 | 0.319629 | -1.033637 | -2.308375 | -0.798951 | -1.482368 | -0.949719 | -0.643314 |
| 1 | -1.084836 | -0.430348 | -1.025313 | 0.625388 | -0.444847 | -1.152706 | -1.129797 | -0.202240 | -1.828051 | 0.636759 |
| 2 | -0.788702 | 0.339318 | 0.301511 | 0.755873 | 2.031693 | -0.870156 | 2.599818 | 0.285707 | -0.682494 | -0.377850 |
| 3 | 0.982841 | 1.060193 | -0.621399 | 0.625299 | 0.452820 | -0.267220 | 1.750208 | 1.066491 | 1.241325 | -1.026987 |
| 4 | 1.139275 | -0.640392 | -0.709819 | -0.057175 | 0.822886 | -0.936773 | 0.596782 | -1.472352 | 1.040772 | 0.276510 |
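To verify that standardization behaved as intended (a small optional check, not in the original notebook), the column means and standard deviations of the scaled features can be inspected; means should be approximately 0 and standard deviations approximately 1:
data_feat.describe().loc[['mean', 'std']]  # per-column mean and std after scaling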
from sklearn.model_selection import train_test_split
X = data_feat
y = data['TARGET CLASS']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train,y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=1, p=2,
weights='uniform')
predictions = knn.predict(X_test)
from sklearn.metrics import classification_report,confusion_matrix
print(confusion_matrix(y_test,predictions))
print("___"*20)
print(classification_report(y_test,predictions))
[[134 8]
[ 11 147]]
____________________________________________________________
precision recall f1-score support
0 0.92 0.94 0.93 142
1 0.95 0.93 0.94 158
micro avg 0.94 0.94 0.94 300
macro avg 0.94 0.94 0.94 300
weighted avg 0.94 0.94 0.94 300
This gives an accuracy of 94%.
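The same figure can be computed directly with scikit-learn's accuracy_score (an optional addition, not in the original notebook):
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, predictions))  # roughly 0.94 for this split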
This step aims to extract more information by choosing a better k value. The process iterates over many different k values and plots their error rates, which makes it possible to see which value of k has the lowest error rate.
errorRate = []
for kvalue in range(1,40):
knn = KNeighborsClassifier(n_neighbors=kvalue)
knn.fit(X_train,y_train)
predictions = knn.predict(X_test)
errorRate.append(np.mean(predictions != y_test)) # average error rate
plt.figure(figsize=(10,6))
plt.plot(range(1,40),errorRate,color = "blue",linestyle = "dashed",marker = 'o')
plt.xlabel('k')
plt.ylabel('error rate')
plt.title('Error rate vs. k')
[<matplotlib.lines.Line2D at 0x1a2528e518>]
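The lowest point on the curve can also be read off programmatically (an optional sketch, not in the original notebook):
best_k = int(np.argmin(errorRate)) + 1  # +1 because the tested k values start at 1
print(best_k)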
knn = KNeighborsClassifier(n_neighbors=15)
knn.fit(X_train,y_train)
predictions = knn.predict(X_test)  # re-predict with the retrained model before scoring
print(confusion_matrix(y_test,predictions))
print("___"*20)
print(classification_report(y_test,predictions))
This gives a small improvement in accuracy.