INFO411 Lab 2 - Clustering

The objective of this lab is to work on some simple data clustering scenarios for better understanding of the k-means and EM algorithms. We will be running some Python scripts to demonstrate the algorithms. You may wish to get familiar with Python which can be added to your skill set as a fast prototyping tool for data mining.

Task 1.

Let us begin with the demo written in NumPy as presented in Lecture 2.

Start IDLE from your 'Program' list. Choose 'File'/ 'New Window', type in the following lines (well, do a copy-n-paste is okay), and save it as 'task1.py' into your working directory (preferably on S drive):

## import the necessary libaries
import numpy as np
import pylab as pl

## generate two random data 'clouds', each with 100 points
d1 = np.random.randn(100)+2
d2 = np.random.randn(100)+8
## show them up on the screen. note that the array index is used for x-axis
pl.figure(1)
pl.plot(d1,'g+')
pl.plot(d2,'bx')

## mix two data clouds into one array (200 points)
data=np.concatenate((d1,d2),axis=0)
len=np.size(data)
print len, 'points to cluster'
## randomly initialise the membership for Cluster 1
memb1=np.random.randint(0,2,len)
## what is the membership to Cluster 2? Easy ;-)
memb2=1-memb1

## Iterate for a number of times and see if convergence is reached. 
## c1 and c2 are the two cluster centres. 
## This updates the centres, then membership, and loops around
for it in range(20):
   c1=sum(data*memb1)/sum(memb1)
   c2=sum(data*memb2)/sum(memb2)
   memb1=(np.sign(np.abs(data-c2)-np.abs(data-c1))+1)/2
   memb2=1-memb1
   print c1,c2

## Display the cluster centers
pl.plot(len/4,c1,'r^')
pl.plot(len/4,c2,'r^')
pl.show()

The code is commented, but feel free to ask questions if you are unclear about anything.

Task 2.

We can extend the code and do the clustering using EM. Save 'task1.py' (currently in the Window) as 'task2.py', adding the following lines at the end:

## Try EM; can be initialized by k-means
#np.random.seed(13317)
#memb1=np.random.randint(0,2,len)
#memb2=1-memb1
## now the iterations
for it in range(200):
    # update centres
    c1=sum(memb1*data)/sum(memb1)
    c2=sum(memb2*data)/sum(memb2)
    # update 'spread' - the std
    s1=sum(memb1*(data-c1)*(data-c1))/sum(memb1)
    s2=sum(memb2*(data-c2)*(data-c2))/sum(memb2)
    # print the center and std
    print 'it=',it,':', c1,'/',np.sqrt(s1),'||',c2,'/',np.sqrt(s2)
    # re-calculate memberships using the center and std values
    memb1=np.exp(-(data-c1)*(data-c1)/s1/2)
    memb2=np.exp(-(data-c2)*(data-c2)/s2/2)
    # normalise
    memb1/=(memb1+memb2)
    memb2/=(memb1+memb2)
    #print memb1

As you may have noticed, the two centres in the EM code are initialised by the k-Means code. Run it, and see whether it manages to calculate the center and the spread of the two gaussians. Are the clusters correctly located and compact enough?

Now, uncomment the first three code lines (i.e., after the first comment line), so that the membership functions are randomly set. Run the code for a few times. Is the outcome always ideal, and stable? Why?

Now, let us take a look whether EM can model data "clouds" of different sizes. For this, comment of the random initialization again, and change the spread of one Gaussian. Find the line with 'd2' assignment, and modify it into

d2 = np.random.randn(100)*1.5+8

This means the second cloud has now a bigger spread (stdev=1.5). Run the code again, and note what the result tells you about the Gaussian centres and their spreads. If the result is not quite right, try a few more runs.

Task 3.

So far we have been working on 1-D data points so it's kind of boring. Let's add another dimension so that they can be displayed properly displayed on the screen. The code, is as follows - note there are quite a few changes made to the original 'task1.py', mainly to deal with matrices of higher dimensionality. The algorithm is simplified, without looking at stdev of the clusters. Save it as 'task3.py'.

import numpy as np
import pylab as pl

## generate two random data 'clouds' around [2,2] and [8,8], each with 100 points
d1 = np.random.randn(100,2)+2
d2 = np.random.randn(100,2)+8
## show them up on the screen. 
pl.figure(1)
pl.plot(d1[:,0],d1[:,1],'g+')
pl.plot(d2[:,0],d2[:,1],'bx')

## mix two data clouds into one array (200 points)
#data=np.concatenate((d1,d2),axis=0)
data=np.vstack((d1,d2))
len=np.shape(data)[0]
#print len

memb=np.random.randint(0,2,len)
#print memb

for it in range(10):
   c1=np.sum(np.transpose(np.ones((2,200))*memb)*data,axis=0)/np.sum(memb)
   c2=np.sum(np.transpose(np.ones((2,200))*(1-memb))*data,axis=0)/np.sum(1-memb) 
   memb=(np.sign(np.sum(np.abs(data-c2),axis=1)-np.sum(np.abs(data-c1),axis=1))+1)/2
   print c1,c2

## Display the centers
pl.plot(c1[0],c1[1],'r^')
pl.plot(c2[0],c2[1],'r^')
pl.show()

Run the code. Do you think it manages to locate the cluster centers in the right places?

Maybe you start to wonder what if we have more than two clusters to work out, i.e., with k>2? For that, the membership updating won't be so straightforward, but that can be done. We leave this to our 1st assignment where a more powerful code can be used to work on some real world datasets :-).

End of Lab 2.