The objective of this lab is to work on some simple data clustering scenarios for better understanding of the k-means and EM algorithms. We will be running some Python scripts to demonstrate the algorithms. You may wish to get familiar with Python which can be added to your skill set as a fast prototyping tool for data mining.
Let us begin with the demo written in NumPy as presented in Lecture 2.
Start IDLE from your 'Program' list. Choose 'File'/ 'New Window', type in the following lines (well, do a copy-n-paste is okay), and save it as 'task1.py' into your working directory (preferably on S drive):
## import the necessary libaries import numpy as np import pylab as pl ## generate two random data 'clouds', each with 100 points d1 = np.random.randn(100)+2 d2 = np.random.randn(100)+8 ## show them up on the screen. note that the array index is used for x-axis pl.figure(1) pl.plot(d1,'g+') pl.plot(d2,'bx') ## mix two data clouds into one array (200 points) data=np.concatenate((d1,d2),axis=0) len=np.size(data) print len, 'points to cluster' ## randomly initialise the membership for Cluster 1 memb1=np.random.randint(0,2,len) ## what is the membership to Cluster 2? Easy ;-) memb2=1-memb1 ## Iterate for a number of times and see if convergence is reached. ## c1 and c2 are the two cluster centres. ## This updates the centres, then membership, and loops around for it in range(20): c1=sum(data*memb1)/sum(memb1) c2=sum(data*memb2)/sum(memb2) memb1=(np.sign(np.abs(data-c2)-np.abs(data-c1))+1)/2 memb2=1-memb1 print c1,c2 ## Display the cluster centers pl.plot(len/4,c1,'r^') pl.plot(len/4,c2,'r^') pl.show()
The code is commented, but feel free to ask questions if you are unclear about anything.
We can extend the code and do the clustering using EM. Save 'task1.py' (currently in the Window) as 'task2.py', adding the following lines at the end:
## Try EM; can be initialized by k-means
#np.random.seed(13317)
#memb1=np.random.randint(0,2,len)
#memb2=1-memb1
## now the iterations
for it in range(200):
# update centres
c1=sum(memb1*data)/sum(memb1)
c2=sum(memb2*data)/sum(memb2)
# update 'spread' - the std
s1=sum(memb1*(data-c1)*(data-c1))/sum(memb1)
s2=sum(memb2*(data-c2)*(data-c2))/sum(memb2)
# print the center and std
print 'it=',it,':', c1,'/',np.sqrt(s1),'||',c2,'/',np.sqrt(s2)
# re-calculate memberships using the center and std values
memb1=np.exp(-(data-c1)*(data-c1)/s1/2)
memb2=np.exp(-(data-c2)*(data-c2)/s2/2)
# normalise
memb1/=(memb1+memb2)
memb2/=(memb1+memb2)
#print memb1
As you may have noticed, the two centres in the EM code are initialised by the k-Means code. Run it, and see whether it manages to calculate the center and the spread of the two gaussians. Are the clusters correctly located and compact enough?
Now, uncomment the first three code lines (i.e., after the first comment line), so that the membership functions are randomly set. Run the code for a few times. Is the outcome always ideal, and stable? Why?
Now, let us take a look whether EM can model data "clouds" of different sizes. For this, comment of the random initialization again, and change the spread of one Gaussian. Find the line with 'd2' assignment, and modify it into
d2 = np.random.randn(100)*1.5+8
This means the second cloud has now a bigger spread (stdev=1.5). Run the code again, and note what the result tells you about the Gaussian centres and their spreads. If the result is not quite right, try a few more runs.
import numpy as np import pylab as pl ## generate two random data 'clouds' around [2,2] and [8,8], each with 100 points d1 = np.random.randn(100,2)+2 d2 = np.random.randn(100,2)+8 ## show them up on the screen. pl.figure(1) pl.plot(d1[:,0],d1[:,1],'g+') pl.plot(d2[:,0],d2[:,1],'bx') ## mix two data clouds into one array (200 points) #data=np.concatenate((d1,d2),axis=0) data=np.vstack((d1,d2)) len=np.shape(data)[0] #print len memb=np.random.randint(0,2,len) #print memb for it in range(10): c1=np.sum(np.transpose(np.ones((2,200))*memb)*data,axis=0)/np.sum(memb) c2=np.sum(np.transpose(np.ones((2,200))*(1-memb))*data,axis=0)/np.sum(1-memb) memb=(np.sign(np.sum(np.abs(data-c2),axis=1)-np.sum(np.abs(data-c1),axis=1))+1)/2 print c1,c2 ## Display the centers pl.plot(c1[0],c1[1],'r^') pl.plot(c2[0],c2[1],'r^') pl.show()
Run the code. Do you think it manages to locate the cluster centers in the right places?
Maybe you start to wonder what if we have more than two clusters to work out, i.e., with k>2? For that, the membership updating won't be so straightforward, but that can be done. We leave this to our 1st assignment where a more powerful code can be used to work on some real world datasets :-).