K-Nearest Neighbors, Python3

Demos for Beginners, by Hannah

back to projects

My first machine learing project: create a classifier for characters from the My Little Pony fandom. The cartoon ponies are colorful so children (and adults) can tell them apart, but can a computer as well? The steps of this project were:

Image and Histogram

Results

Code

Image and Histogram

The background makes up the largest bar on the histogram, which means we'd need enough samples to account for all probable backgrounds.

Results

So, I did a test with 393 pictures after using 5000 pictures to train. The results were not great. As I look at the 5000 tags in the training data, I see 1646 unique names, and that probably explains the poor results well enough: there are a lot more characters than I would have expected, so not enough examples of any given character. I do notice the main characters are classified better, so reducing the character set to perhaps a dozen might have worked. (Then some pictures are black-and-white; those might have been removed).

Table of 100 Test Classifications

oc means original character -- identifies a fan-created character.

Image

3 Nearest Neighbors

Actual Tag

applejack

sunset shimmer

pear butter

prince hardhat

oc:cosmic claw

octavia melody

oc:molados

oc:shylu

pinkie pie

twilight sparkle

oc:tate langdon

twilight sparkle

oc:fuji flash

oc:abysmal soul

oc:eve

oc:sheon

pinkie pie

oc:vaporwave

oc:comet chaser

twilight sparkle

oc:blue dye

oc:dark lightning

vinyl scratch

oc:blue dye

nimbus dash

hyacinth dawn

hyacinth dawn

windy whistles

rainbow dash

oc:magic shield

rainbow dash

rainbow dash

...continue table...

Code

Pieces of code were involved in scraping the files, preparing the training data, creating this page, and doing some analysis of the results, but unless someone asks, I'll just include code that would attempt to identify the character in an image.

# identify_pony.py
#  by Hannah Leitheiser
#  2018-07-09
# Attempts to use k-nearnes neighbors to identify a My Little Pony
# character.  Images are converted to 64-color histograms.
# 
# requires python3, PIL, scipy,
# 	and appropriately prepared training data

from PIL import Image
import struct
import scipy
import scipy.spatial.distance

# ------------------- read data files -------------------------------

# data.bin - contains arrays of 64 2-byte unsigned ints representing
# 	the 64 colors.
# tags.txt - plan text tags, one per line that corespond to each
#       block of 128-bytes (64 integers) in data.bin

dataFile = open("data.bin", "rb")
tagsFile = open("tags.txt", "r")

# colordata -- will contain a list of 64-item lists with training 
# data from each photo
colordata = []

while dataFile.read(1):
	# Seek backward to account for the while
	dataFile.seek(-1, 1)
	colorrecord = [0]*64
	for x in range(64):
		# > specifies big endian, H 2-byte unsigned integer
		colorrecord[x] = (
			struct.unpack(">H", dataFile.read(2))[0])
	colordata.append( colorrecord )

names = tagsFile.readlines()

dataFile.close()
tagsFile.close()

# ----------------- prepare image for comparison --------------------

im = Image.open("testimage.png").convert("RGBA")
pixels = im.load()

# the histogram for the test image
colorbucket = [0] * 64

maxpixels = 0

for x in range(im.size[0]):
	for y in range(im.size[1]):

		# divide the image into 2 bits/color channel,
		# or 6 bit color, or 64 colors
		r = pixels[x,y][0] // 64
		g = pixels[x,y][1] // 64
		b = pixels[x,y][2] // 64

		# ignore transparent pixels, otherwise 
		# increment the correct bin
		if im.mode == "RBGA":
			if pixels[x,y][3] > 0:
				colorbucket[r+g*4+b*16]+=1
		else:
			colorbucket[r+g*4+b*16]+=1

# find the largest bin
for x in range(64):
	if colorbucket[x] > maxpixels:
		maxpixels = colorbucket[x]

# ------------------- find nearest neighbors ------------------------

# resize so the largest bin is the maximum range of 2-byte unsigned 
# integers
colorbucket = list(
	map( lambda x:int((x/maxpixels) * 65535), colorbucket )
			)

# use scipy to do the math: find eucliean distance between our
# dimensional vectors.
distancesOutput = scipy.spatial.distance.cdist(
		[ colorbucket ], colordata )[0]

# k=3
nearestThree = distancesOutput.argsort()[:3]

for x in range(3):
	print( names[ nearestThree[x] ][:-1] )