KNN ALGORITHM.


About KNN classifiers.

A KNN classifier predicts the class of a given test observation by identifying the training observations that are nearest to it. Because of this, the scale of the variables in the dataset is very important: variables on a large scale dominate the distance between observations, and therefore dominate the KNN classifier's predictions.
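
To see why scale matters, here is a toy example (the two features and their values are made up for illustration): one feature is measured in the thousands, the other in the tens, and the raw Euclidean distance is driven almost entirely by the first.

import numpy as np

# Two observations: (income in dollars, age in years) -- illustrative values only.
a = np.array([50000.0, 25.0])
b = np.array([52000.0, 60.0])

# Raw distance is dominated by the income axis; the 35-year age gap barely registers.
print(np.linalg.norm(a - b))                    # ~2000.3

# Dividing each feature by its rough spread (a crude stand-in for standardization)
# lets both features contribute comparably to the distance.
spread = np.array([1000.0, 10.0])
print(np.linalg.norm(a / spread - b / spread))  # ~4.0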

An intuitive way to handle the scaling problem in KNN classification is to standardize the dataset so that every variable has a mean of zero and a standard deviation of one.

Training algorithm:

  1. Store all the training data.

Prediction Algorithm:

  1. Calculate the distance from x to all points in your training data.
  2. Sort the points by increasing distance from x.
  3. Predict the majority label of the k closest points (a minimal sketch of this procedure follows below).
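
As an illustration of the prediction steps, here is a minimal NumPy-only sketch; the function and variable names are my own and are not part of this project's code.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k):
    # 1. Distance from x to every stored training point.
    distances = np.linalg.norm(X_train - x, axis=1)
    # 2. Indices of the k points nearest to x (sorted by increasing distance).
    nearest = np.argsort(distances)[:k]
    # 3. Majority vote among the labels of the k closest points.
    return Counter(y_train[nearest]).most_common(1)[0][0]

# e.g. knn_predict(np.array([[0., 0.], [1., 1.], [5., 5.]]),
#                  np.array([0, 0, 1]), np.array([0.5, 0.5]), k=2)  -> 0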

About applying KNN with tools like NumPy, pandas, and scikit-learn.

The objective of this project is to classify observations with respect to the target variable given in the last column. It is worth noting that this is one of those anonymized datasets provided by clients, presumably to protect sensitive information.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 
data = pd.read_csv('annonimizeddataset',index_col = 0)
data.head()
WTT PTI EQW SBI LQE QWG FDJ PJF HQE NXJ TARGET CLASS
0 0.913917 1.162073 0.567946 0.755464 0.780862 0.352608 0.759697 0.643798 0.879422 1.231409 1
1 0.635632 1.003722 0.535342 0.825645 0.924109 0.648450 0.675334 1.013546 0.621552 1.492702 0
2 0.721360 1.201493 0.921990 0.855595 1.526629 0.720781 1.626351 1.154483 0.957877 1.285597 0
3 1.234204 1.386726 0.653046 0.825624 1.142504 0.875128 1.409708 1.380003 1.522692 1.153093 1
4 1.279491 0.949750 0.627280 0.668976 1.232537 0.703727 1.115596 0.646691 1.463812 1.419167 1

So here the data is anonymized, with meaningless strings as the column labels. The last column is the target class, which needs to be predicted.

Exploratory data analysis

data.columns
Index(['WTT', 'PTI', 'EQW', 'SBI', 'LQE', 'QWG', 'FDJ', 'PJF', 'HQE', 'NXJ',
       'TARGET CLASS'],
      dtype='object')
data.shape
(1000, 11)
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 11 columns):
WTT             1000 non-null float64
PTI             1000 non-null float64
EQW             1000 non-null float64
SBI             1000 non-null float64
LQE             1000 non-null float64
QWG             1000 non-null float64
FDJ             1000 non-null float64
PJF             1000 non-null float64
HQE             1000 non-null float64
NXJ             1000 non-null float64
TARGET CLASS    1000 non-null int64
dtypes: float64(10), int64(1)
memory usage: 93.8 KB
sns.heatmap(data.isnull(),yticklabels=False,cbar=False)
(Figure: missing-data heatmap of data.isnull())

The heatmap above confirms that there is no missing data in the set.

Scaling Variables.

As pointed out earlier, scaling the variables is very important in KNN.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(data.drop('TARGET CLASS',axis = 1))
StandardScaler(copy=True, with_mean=True, with_std=True)
scaled_feat = scaler.transform(data.drop('TARGET CLASS',axis = 1))
data_feat = pd.DataFrame(scaled_feat,columns=data.columns[:-1])
data_feat.head()
WTT PTI EQW SBI LQE QWG FDJ PJF HQE NXJ
0 -0.123542 0.185907 -0.913431 0.319629 -1.033637 -2.308375 -0.798951 -1.482368 -0.949719 -0.643314
1 -1.084836 -0.430348 -1.025313 0.625388 -0.444847 -1.152706 -1.129797 -0.202240 -1.828051 0.636759
2 -0.788702 0.339318 0.301511 0.755873 2.031693 -0.870156 2.599818 0.285707 -0.682494 -0.377850
3 0.982841 1.060193 -0.621399 0.625299 0.452820 -0.267220 1.750208 1.066491 1.241325 -1.026987
4 1.139275 -0.640392 -0.709819 -0.057175 0.822886 -0.936773 0.596782 -1.472352 1.040772 0.276510

Splitting the data into training and test sets

from sklearn.model_selection import train_test_split
X = data_feat
y = data['TARGET CLASS']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

Fitting the KNN model.

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train,y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=1, p=2,
           weights='uniform')
predictions = knn.predict(X_test)

Model Evaluation

from sklearn.metrics import classification_report,confusion_matrix
print(confusion_matrix(y_test,predictions))
print("___"*20)
print(classification_report(y_test,predictions))
[[134   8]
 [ 11 147]]
____________________________________________________________
              precision    recall  f1-score   support

           0       0.92      0.94      0.93       142
           1       0.95      0.93      0.94       158

   micro avg       0.94      0.94      0.94       300
   macro avg       0.94      0.94      0.94       300
weighted avg       0.94      0.94      0.94       300

With k = 1, this gives an accuracy of about 94%.
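
As a quick sanity check, the same number can be recovered from the confusion matrix above, or via scikit-learn's accuracy_score helper (a sketch, meant to run in the same session):

from sklearn.metrics import accuracy_score

# Correct predictions sit on the diagonal of the confusion matrix:
# (134 + 147) / (134 + 8 + 11 + 147) = 281 / 300 ≈ 0.937
print(accuracy_score(y_test, predictions))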

Using the elbow method to improve the model.

This process aims to improve the model by choosing a better k value: it iterates over many different k values and plots their error rates, which makes it easy to see which k has the lowest error rate.

errorRate = []

for kvalue in range(1,40):
    knn = KNeighborsClassifier(n_neighbors=kvalue)
    knn.fit(X_train,y_train)
    predictions = knn.predict(X_test)
    errorRate.append(np.mean(predictions != y_test)) # average error rate
plt.figure(figsize=(10,6))
plt.plot(range(1,40),errorRate,color = "blue",linestyle = "dashed",marker = 'o')
(Figure: error rate vs. k for k = 1 to 39)
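
If one prefers to read the best k off the errorRate list programmatically rather than from the plot, here is a minimal sketch (the k = 15 used below was chosen from the plot, so the two may not coincide exactly):

# k values start at 1, so shift the argmin index up by one.
best_k = int(np.argmin(errorRate)) + 1
print(best_k, errorRate[best_k - 1])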

knn = KNeighborsClassifier(n_neighbors=15)
knn.fit(X_train,y_train)
predictions = knn.predict(X_test)  # re-predict with the new k before evaluating
print(confusion_matrix(y_test,predictions))
print("___"*20)
print(classification_report(y_test,predictions))

With k = 15, this gives a small improvement in accuracy.