399 lines
13 KiB
Plaintext
399 lines
13 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Here we implement K Nearest Neighbors on a car dataset to decide whether a given car will be unacceptable, acceptable, good, or very good. You can find the dataset [here](https://archive.ics.uci.edu/ml/datasets/Car+Evaluation)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 2,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# For accessing other folders\n",
|
|
"import os\n",
|
|
"\n",
|
|
"import numpy as np\n",
|
|
"\n",
|
|
"# For easy data loading\n",
|
|
"import pandas as pd\n",
|
|
"\n",
|
|
"# For access the the KNN algorithm and relevant features\n",
|
|
"import sklearn\n",
|
|
"from sklearn.utils import shuffle\n",
|
|
"from sklearn.neighbors import KNeighborsClassifier\n",
|
|
"from sklearn import linear_model, preprocessing\n",
|
|
"\n",
|
|
"# For storing and loading models\n",
|
|
"import pickle"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"We begin by loading in and printing out our dataset. Note that before loading in, we altered the dataset to include the relevant labels at the start."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 3,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
" buying maint doors persons lug_boot safety class\n",
|
|
"0 vhigh vhigh 2 2 small low unacc\n",
|
|
"1 vhigh vhigh 2 2 small med unacc\n",
|
|
"2 vhigh vhigh 2 2 small high unacc\n",
|
|
"3 vhigh vhigh 2 2 med low unacc\n",
|
|
"4 vhigh vhigh 2 2 med med unacc\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"data = pd.read_csv(os.path.join('data','car.csv'))\n",
|
|
"print(data.head())"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"To make use of this data, we need to convert these strings to numbers. To do so, we use the Label Encoder from SKLearn before splitting into train, test, and validation sets."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 4,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Object which will convert our labels into numbers\n",
|
|
"# Note that this takes in a list as input\n",
|
|
"le = preprocessing.LabelEncoder()\n",
|
|
"\n",
|
|
"# Encode all our feature labels\n",
|
|
"buying = le.fit_transform(list(data[\"buying\"]))\n",
|
|
"maint = le.fit_transform(list(data[\"maint\"]))\n",
|
|
"doors = le.fit_transform(list(data[\"doors\"]))\n",
|
|
"persons = le.fit_transform(list(data[\"persons\"]))\n",
|
|
"lug_boot = le.fit_transform(list(data[\"lug_boot\"]))\n",
|
|
"safety = le.fit_transform(list(data[\"safety\"]))\n",
|
|
"cls = le.fit_transform(list(data[\"class\"]))\n",
|
|
"\n",
|
|
"# Set our prediction feature\n",
|
|
"predict = \"class\"\n",
|
|
"\n",
|
|
"# Setup X and y by zipping together all these np arrays\n",
|
|
"x = list(zip(buying,maint,doors,persons,lug_boot,safety))\n",
|
|
"y = cls\n",
|
|
"\n",
|
|
"# Split into test, train, validation sets with 80 / 10 / 10 split\n",
|
|
"x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(\n",
|
|
" x,y,test_size=0.2)\n",
|
|
"\n",
|
|
"x_test, x_dev, y_test, y_dev = sklearn.model_selection.train_test_split(\n",
|
|
" x_test, y_test, test_size=0.5)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 5,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"0.9421965317919075\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"model = KNeighborsClassifier(n_neighbors = 7)\n",
|
|
"model.fit(x_train, y_train)\n",
|
|
"acc = model.score(x_dev, y_dev)\n",
|
|
"print(acc)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 6,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Predicted: good Actual: unacc\n",
|
|
"Predicted: vgood Actual: unacc\n",
|
|
"Predicted: good Actual: unacc\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: unacc Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: unacc\n",
|
|
"Predicted: unacc Actual: unacc\n",
|
|
"Predicted: good Actual: vgood\n",
|
|
"Predicted: acc Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: unacc\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: unacc\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: unacc Actual: good\n",
|
|
"Predicted: unacc Actual: acc\n",
|
|
"Predicted: unacc Actual: good\n",
|
|
"Predicted: good Actual: unacc\n",
|
|
"Predicted: unacc Actual: acc\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: unacc Actual: unacc\n",
|
|
"Predicted: good Actual: unacc\n",
|
|
"Predicted: vgood Actual: good\n",
|
|
"Predicted: good Actual: unacc\n",
|
|
"Predicted: unacc Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: unacc Actual: unacc\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: unacc Actual: unacc\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: unacc\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: unacc Actual: unacc\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: unacc Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: unacc Actual: good\n",
|
|
"Predicted: good Actual: unacc\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: unacc\n",
|
|
"Predicted: vgood Actual: good\n",
|
|
"Predicted: good Actual: unacc\n",
|
|
"Predicted: good Actual: unacc\n",
|
|
"Predicted: unacc Actual: unacc\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: unacc Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: unacc\n",
|
|
"Predicted: unacc Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: vgood Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: unacc Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: unacc\n",
|
|
"Predicted: unacc Actual: good\n",
|
|
"Predicted: unacc Actual: good\n",
|
|
"Predicted: acc Actual: good\n",
|
|
"Predicted: good Actual: unacc\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: unacc Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: unacc Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: unacc\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: unacc\n",
|
|
"Predicted: unacc Actual: vgood\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: acc\n",
|
|
"Predicted: unacc Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: vgood Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: unacc Actual: good\n",
|
|
"Predicted: good Actual: unacc\n",
|
|
"Predicted: good Actual: vgood\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: acc\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: vgood Actual: good\n",
|
|
"Predicted: unacc Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: unacc Actual: good\n",
|
|
"Predicted: good Actual: unacc\n",
|
|
"Predicted: good Actual: unacc\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: unacc Actual: good\n",
|
|
"Predicted: good Actual: unacc\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: unacc\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: vgood Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: acc\n",
|
|
"Predicted: unacc Actual: unacc\n",
|
|
"Predicted: unacc Actual: unacc\n",
|
|
"Predicted: good Actual: unacc\n",
|
|
"Predicted: unacc Actual: unacc\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: vgood Actual: good\n",
|
|
"Predicted: good Actual: unacc\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: unacc\n",
|
|
"Predicted: unacc Actual: good\n",
|
|
"Predicted: good Actual: unacc\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: unacc\n",
|
|
"Predicted: good Actual: unacc\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: unacc Actual: good\n",
|
|
"Predicted: good Actual: unacc\n",
|
|
"Predicted: unacc Actual: unacc\n",
|
|
"Predicted: unacc Actual: vgood\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: unacc\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: unacc\n",
|
|
"Predicted: unacc Actual: acc\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: unacc\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: unacc Actual: good\n",
|
|
"Predicted: unacc Actual: unacc\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: good Actual: good\n",
|
|
"Predicted: acc Actual: good\n",
|
|
"Predicted: good Actual: vgood\n",
|
|
"Predicted: good Actual: unacc\n",
|
|
"Predicted: good Actual: good\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"names = ['unacc', 'acc', 'good', 'vgood']\n",
|
|
"predicted = model.predict(x_dev)\n",
|
|
"for i in range(len(x_dev)):\n",
|
|
" print(\"Predicted:\", names[predicted[i]], \"Actual:\", names[y_test[i]])"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Now we can tune our k parameter using a sequence of odd integers evaluated by their test accuracy. To approximate the resulting real-world accuracy we look at the validation accuracy."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 7,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"temp = []\n",
|
|
"for i in range(100):\n",
|
|
" if i % 2 != 0:\n",
|
|
" temp.append(i)\n",
|
|
"best = 0\n",
|
|
"for k in temp:\n",
|
|
" model = KNeighborsClassifier(n_neighbors = k)\n",
|
|
" model.fit(x_train, y_train)\n",
|
|
" acc = model.score(x_dev, y_dev)\n",
|
|
" if acc > best:\n",
|
|
" best = acc\n",
|
|
" with open('carmodel.pickle','wb') as f:\n",
|
|
" pickle.dump(model,f)\n",
|
|
" "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 8,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Training accuracy is: 0.9580318379160637\n",
|
|
"Dev accuracy is: 0.9421965317919075\n",
|
|
"Test accuracy is: 0.9132947976878613\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"pickle_in = open('carmodel.pickle','rb')\n",
|
|
"model = pickle.load(pickle_in)\n",
|
|
"print(\"Training accuracy is:\", model.score(x_train, y_train))\n",
|
|
"print(\"Dev accuracy is:\", model.score(x_dev, y_dev))\n",
|
|
"print(\"Test accuracy is:\", model.score(x_test, y_test))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": []
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.7.4"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 2
|
|
}
|