R

Beginner's Guide to Statistical Cluster Analysis in Detail: Part 2


In the last part of the Statistical Cluster Analysis series we discussed hierarchical cluster analysis (HCA), the first of the exploratory data analysis techniques.

Here, in the second part, we cover the second of these exploratory data analysis techniques.

Non-Hierarchical Cluster Analysis (Non-HCA):

Features:

  • Clusters do not have a hierarchy.
  • No distance matrix has to be calculated.
  • Better suited to large datasets.

Non-HCA methods start from either:

  • an initial partition of items/objects into groups

or

  • an initial set of seed points, which will form the nuclei of the initial clusters.

NOTE: One way to start is to randomly select seed points from among the items, or to randomly partition the data (i.e. items/objects) into initial groups.
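The two random starting options in the note can be sketched in a few lines of Python. This is only an illustration: the function names `random_partition` and `random_seeds`, the fixed random seed, and the four toy points are all made up for this example.

```python
import random

def random_partition(items, k, seed=0):
    """Randomly split the items into k non-empty groups; returns one label per item."""
    rng = random.Random(seed)
    n = len(items)
    if k > n:
        raise ValueError("need at least k items")
    # Give every group one guaranteed member, label the rest at random, then shuffle.
    labels = list(range(k)) + [rng.randrange(k) for _ in range(n - k)]
    rng.shuffle(labels)
    return labels

def random_seeds(items, k, seed=0):
    """Pick k distinct items to act as the initial seed points (nuclei)."""
    rng = random.Random(seed)
    return rng.sample(items, k)

# Toy data, invented for illustration only.
items = [(5, 3), (-1, 1), (1, -2), (-3, -2)]
labels = random_partition(items, k=2)
seeds = random_seeds(items, k=2)
```

Either result can be used as the starting point: `labels` is an initial partition into groups, while `seeds` is an initial set of nuclei.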

K-means method, or the method of iterative relocation:

K-means is an algorithm that assigns each object to the cluster having the nearest centroid (nucleus).

Algorithm:

  1. Partition the items/objects into ‘k’ initial clusters.
  2. Go through the items/objects one by one and reassign each to the cluster whose centroid is nearest. Recalculate the centroid/nucleus for the cluster receiving the new item and for the cluster losing the item.
  3. Repeat step 2 until no more reassignments take place.
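The three steps above can be sketched in plain Python as follows. This is a minimal sketch, not a reference implementation: it assumes squared Euclidean distance, moves an item only when another centroid is strictly nearer (so a cluster is never emptied), and uses made-up names (`kmeans_relocate`, `centroid`, `sq_dist`) and made-up toy data.

```python
def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def sq_dist(a, b):
    """Squared Euclidean distance (an assumption; other distances could be used)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_relocate(items, labels, k, max_passes=100):
    """Item-by-item relocation starting from an initial partition (step 1)."""
    labels = list(labels)

    def members(c):
        return [x for x, lab in zip(items, labels) if lab == c]

    # Assumes every one of the k initial clusters has at least one item.
    cents = [centroid(members(c)) for c in range(k)]
    for _ in range(max_passes):
        moved = False
        for i, x in enumerate(items):
            old = labels[i]
            new = min(range(k), key=lambda c: sq_dist(x, cents[c]))
            # Step 2: move only if another centroid is strictly nearer.
            if sq_dist(x, cents[new]) < sq_dist(x, cents[old]):
                labels[i] = new
                # Update the cluster losing the item and the one receiving it.
                cents[old] = centroid(members(old))
                cents[new] = centroid(members(new))
                moved = True
        if not moved:  # Step 3: stop once a full pass makes no reassignment.
            break
    return labels, cents

# Toy data, invented for illustration; initial partition is {A, B} and {C, D}.
items = [(5, 3), (-1, 1), (1, -2), (-3, -2)]
final_labels, final_centroids = kmeans_relocate(items, [0, 0, 1, 1], k=2)
```

On this toy data the second item moves to the other cluster in the first pass, and the procedure ends with `final_labels == [0, 1, 1, 1]` and centroids `(5.0, 3.0)` and `(-1.0, -1.0)`.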

NOTE: Rather than starting with a partition of all items into ‘k’ initial groups (as in step 1), we can also specify ‘k’ initial centroids/nuclei (seed points) and then proceed to step 2, assigning each item to its nearest seed point in a first pass through the data.

A handwritten solved example is attached below. Go through the steps and try to understand them; if anything is unclear, please send me an email at irrfankhann29@gmail.com.

Statistical non-hierarchical cluster analysis (PDF file of the example)

CLUSTER CRITERIA: Comparing different partitions

Objective: to have a criterion for the optimum partition of the data. Given a set of cases and a given number of clusters, the problem reduces to partitioning the data into ‘g’ clusters so that the clustering criterion is optimized.

  • Let the ‘n’ data points (cases) be x(1), x(2), ..., x(n).
  • The sample variance-covariance matrix is given by
    • Σ = (1/n) ∑ (x(i) − m)(x(i) − m)^T, where m = (1/n) ∑ x(i) is the sample mean and both sums run over i = 1, ..., n.
  • Let there be ‘g’ clusters and define

 

 

We can write this as follows:

 

 

 

Then the within-cluster sum of squares (SS) and cross-product matrix is

 

 

 

==> the pooled within-cluster scatter matrix for the ‘g’ clusters
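One standard way to write the within-cluster quantities is sketched below. The indicator z_{ji}, the cluster size n_j and the cluster mean m_j are notation assumed here for illustration, and the 1/n scaling is chosen to match the sample variance-covariance matrix defined earlier; S_W corresponds to what the text calls S(w).

```latex
z_{ji} =
\begin{cases}
1 & \text{if } x_i \text{ belongs to cluster } j \\
0 & \text{otherwise}
\end{cases},
\qquad
n_j = \sum_{i=1}^{n} z_{ji},
\qquad
m_j = \frac{1}{n_j} \sum_{i=1}^{n} z_{ji}\, x_i

S_W = \frac{1}{n} \sum_{j=1}^{g} \sum_{i=1}^{n} z_{ji}\, (x_i - m_j)(x_i - m_j)^{T}
```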

The between-cluster SS and cross-product matrix is
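A standard form of the between-cluster matrix is sketched below, assuming n_j is the number of cases in cluster j, m_j is the mean of cluster j, m is the overall sample mean, and the same 1/n scaling as the sample variance-covariance matrix Σ. Here S_B and S_W correspond to what the text calls S(b) and S(w). With these definitions the total scatter splits into a within part and a between part.

```latex
S_B = \sum_{j=1}^{g} \frac{n_j}{n}\, (m_j - m)(m_j - m)^{T},
\qquad
\Sigma = S_W + S_B
```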

 

 

 

 

Popular clustering criteria are based on univariate functions of S(b), S(w) or Σ.

I will share the criteria in the next part of the cluster analysis series. Until then, stay tuned and practice your skills on cluster analysis. If you have any doubts, please email me at irrfankhann29@gmail.com.
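To make these matrices concrete, here is a small plain-Python sketch that computes Σ, S(w) and S(b) for a given partition and checks that Σ = S(w) + S(b). It assumes the common 1/n scaling for all three matrices; the function name `scatter_matrices`, the helper names and the toy data are made up for illustration. The trace of S(w) is shown as one example of a univariate function of these matrices.

```python
def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def outer(u, v):
    """Outer product u v^T as a list of rows."""
    return [[a * b for b in v] for a in u]

def mat_add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def scale(A, s):
    return [[a * s for a in row] for row in A]

def scatter_matrices(items, labels):
    """Return (sigma, s_w, s_b), each scaled by 1/n (an assumed convention)."""
    n, p = len(items), len(items[0])
    m = mean(items)
    total = [[0.0] * p for _ in range(p)]
    within = [[0.0] * p for _ in range(p)]
    between = [[0.0] * p for _ in range(p)]
    for x in items:
        d = [a - b for a, b in zip(x, m)]
        total = mat_add(total, outer(d, d))
    for c in sorted(set(labels)):
        group = [x for x, lab in zip(items, labels) if lab == c]
        mc = mean(group)
        for x in group:
            d = [a - b for a, b in zip(x, mc)]
            within = mat_add(within, outer(d, d))
        d = [a - b for a, b in zip(mc, m)]
        # Each cluster contributes n_j * (m_j - m)(m_j - m)^T.
        between = mat_add(between, scale(outer(d, d), len(group)))
    return scale(total, 1 / n), scale(within, 1 / n), scale(between, 1 / n)

# Toy data and partition, invented for illustration only.
items = [(5, 3), (-1, 1), (1, -2), (-3, -2)]
labels = [0, 1, 1, 1]
sigma, s_w, s_b = scatter_matrices(items, labels)
trace_w = sum(s_w[d][d] for d in range(len(s_w)))  # one univariate criterion
```

For this partition `s_w` is `[[2.0, 0.0], [0.0, 1.5]]`, so `trace_w` is 3.5; a partition with a smaller trace has tighter clusters under this particular criterion.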

 

 
