Here we implement K Nearest Neighbors on a car dataset to decide whether a given car will be unacceptable, acceptable, good, or very good. You can find the dataset [here](https://archive.ics.uci.edu/ml/datasets/Car+Evaluation)

In [2]:
# For accessing other folders
import os

import numpy as np

# For easy data loading
import pandas as pd

# For access the the KNN algorithm and relevant features
import sklearn
from sklearn.utils import shuffle
from sklearn.neighbors import KNeighborsClassifier
from sklearn import linear_model, preprocessing

# For storing and loading models
import pickle

We begin by loading in and printing out our dataset. Note that before loading in, we altered the dataset to include the relevant labels at the start.

In [3]:
data = pd.read_csv(os.path.join('data','car.csv'))
print(data.head())

  buying  maint doors persons lug_boot safety  class
0  vhigh  vhigh     2       2    small    low  unacc
1  vhigh  vhigh     2       2    small    med  unacc
2  vhigh  vhigh     2       2    small   high  unacc
3  vhigh  vhigh     2       2      med    low  unacc
4  vhigh  vhigh     2       2      med    med  unacc


To make use of this data, we need to convert these strings to numbers. To do so, we use the Label Encoder from SKLearn before splitting into train, test, and validation sets.

In [4]:
# Object which will convert our labels into numbers
# Note that this takes in a list as input
le = preprocessing.LabelEncoder()

# Encode all our feature labels
buying = le.fit_transform(list(data["buying"]))
maint = le.fit_transform(list(data["maint"]))
doors = le.fit_transform(list(data["doors"]))
persons = le.fit_transform(list(data["persons"]))
lug_boot = le.fit_transform(list(data["lug_boot"]))
safety = le.fit_transform(list(data["safety"]))
cls = le.fit_transform(list(data["class"]))

# Set our prediction feature
predict = "class"

# Setup X and y by zipping together all these np arrays
x = list(zip(buying,maint,doors,persons,lug_boot,safety))
y = cls

# Split into test, train, validation sets with 80 / 10 / 10 split
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(
    x,y,test_size=0.2)

x_test, x_dev, y_test, y_dev = sklearn.model_selection.train_test_split(
    x_test, y_test, test_size=0.5)

In [5]:
model = KNeighborsClassifier(n_neighbors = 7)
model.fit(x_train, y_train)
acc = model.score(x_dev, y_dev)
print(acc)

0.9421965317919075


In [6]:
names = ['unacc', 'acc', 'good', 'vgood']
predicted = model.predict(x_dev)
for i in range(len(x_dev)):
    print("Predicted:", names[predicted[i]], "Actual:", names[y_test[i]])

Predicted: good Actual: unacc
Predicted: vgood Actual: unacc
Predicted: good Actual: unacc
Predicted: good Actual: good
Predicted: good Actual: good
Predicted: unacc Actual: good
Predicted: good Actual: good
Predicted: good Actual: good
Predicted: good Actual: good
Predicted: good Actual: unacc
Predicted: unacc Actual: unacc
Predicted: good Actual: vgood
Predicted: acc Actual: good
Predicted: good Actual: good
Predicted: good Actual: unacc
Predicted: good Actual: good
Predicted: good Actual: good
Predicted: good Actual: good
Predicted: good Actual: good
Predicted: good Actual: unacc
Predicted: good Actual: good
Predicted: good Actual: good
Predicted: good Actual: good
Predicted: good Actual: good
Predicted: unacc Actual: good
Predicted: unacc Actual: acc
Predicted: unacc Actual: good
Predicted: good Actual: unacc
Predicted: unacc Actual: acc
Predicted: good Actual: good
Predicted: good Actual: good
Predicted: unacc Actual: unacc
Predicted: good Actual: unacc
Predicted: vgood Actual: go

Now we can tune our k parameter using a sequence of odd integers evaluated by their test accuracy. To approximate the resulting real-world accuracy we look at the validation accuracy.

In [7]:
temp = []
for i in range(100):
    if i % 2 != 0:
        temp.append(i)
best = 0
for k in temp:
    model = KNeighborsClassifier(n_neighbors = k)
    model.fit(x_train, y_train)
    acc = model.score(x_dev, y_dev)
    if acc > best:
        best = acc
        with open('carmodel.pickle','wb') as f:
            pickle.dump(model,f)
    

In [8]:
pickle_in = open('carmodel.pickle','rb')
model = pickle.load(pickle_in)
print("Training accuracy is:", model.score(x_train, y_train))
print("Dev accuracy is:", model.score(x_dev, y_dev))
print("Test accuracy is:", model.score(x_test, y_test))

Training accuracy is: 0.9580318379160637
Dev accuracy is: 0.9421965317919075
Test accuracy is: 0.9132947976878613
