The idea behind hierarchical cluster analysis is to show which of a (potentially large) set
of samples are most similar to one another, and to group these similar samples in the same
limb of a tree.
Each of the samples can be thought of as sitting in an m-dimensional space, defined by the m variables (columns) in the dataframe. We define similarity on the basis of the distance between two samples in this m-dimensional space.
Several different distance measures could be used, but the default is Euclidean distance, and this is used to work out the distance from every sample to every other sample.
For the other options, check
?dist
This quantitative dissimilarity structure of the data is stored in a distance matrix produced by the dist function.
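As a small illustration of what dist returns, the sketch below uses three made-up samples (not the country data) and checks one entry of the distance matrix by hand. The object names x, d, by_hand and d_manhattan are only for this example.

```r
# Three made-up samples (rows) measured on two variables (columns).
x <- matrix(c(0, 0,
              3, 4,
              6, 8),
            ncol = 2, byrow = TRUE,
            dimnames = list(c("a", "b", "c"), c("v1", "v2")))

# Pairwise distances between all rows; Euclidean is the default.
d <- dist(x)
print(d)

# The same distance between samples a and b, computed by hand:
# square root of the sum of squared differences over the variables.
by_hand <- sqrt(sum((x["a", ] - x["b", ])^2))
print(by_hand)

# A different measure is selected through the method argument.
d_manhattan <- dist(x, method = "manhattan")
print(d_manhattan)
```

The entry for a and b in d should match by_hand, since both apply the Euclidean formula to the same two rows.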
Initially, each sample is assigned to its own cluster, and then the hclust algorithm proceeds iteratively, at each stage joining the two most similar clusters, continuing until there is just a single cluster.
For more details about the hclust function, check
?hclust
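The joining process can be inspected directly on the object that hclust returns. The following is a minimal sketch on four made-up one-dimensional samples; the names x and hc are only for this example.

```r
# Four made-up samples measured on a single variable.
x <- matrix(c(1, 2, 6, 10), ncol = 1,
            dimnames = list(c("s1", "s2", "s3", "s4"), "value"))

hc <- hclust(dist(x))   # complete linkage by default

# One row per joining step. Negative numbers are single samples;
# positive numbers refer to clusters formed at an earlier step.
print(hc$merge)

# The distance at which each join happened, in the order of the joins.
print(hc$height)
```

With n samples there are always n - 1 joins, so here merge has three rows. The last join ends with every sample in a single cluster.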
In this example we will group countries on the basis of their similarity, which can make decisions about them easier. To find the similarities between observations and group the data, we perform a cluster analysis on the following table.
country | area | gdp | inflation | life expect | military | pop growth | unemployment |
Austria | 83871 | 41600 | 3.5 | 79 | 0.8 | 0.5 | 4 |
Belgium | 95326 | 37589 | 3.5 | 78 | 1.3 | 0.4 | 2 |
Bulgaria | 56356 | 13456 | 2.6 | 78 | 2.3 | 0.3 | 3 |
Croatia | 73569 | 18000 | 4.5 | 79 | 1.5 | 0.2 | 5 |
Czech Republic | 43568 | 27156 | 4 | 78 | 1.6 | -2 | 1 |
Denmark | 338155 | 37256 | 2 | 56 | 4 | 2 | 1.5 |
Estonia | 152632 | 20156 | 3 | 78 | 2 | 1.9 | 4 |
Germany | 132562 | 36252 | 4.9 | 74 | 2 | 1.8 | 3 |
Hungary | 93265 | 38265 | 5.9 | 69 | 3.1 | 1.5 | 3.5 |
Iceland | 100000 | 25655 | 1.5 | 65 | 4 | 1.2 | 3.6 |
Italy | 70125 | 19654 | 2.8 | 86 | 2 | -0.8 | 2.5 |
Latvia | 302325 | 38569 | 3.6 | 72 | 1.2 | 1.9 | 4 |
Lithuania | 64523 | 40256 | 5.6 | 88 | 1.3 | -1.5 | 4.01 |
Luxembourg | 65235 | 32565 | 4.5 | 98 | 1.5 | 1.6 | 1.8 |
Netherlands | 41256 | 12568 | 2.6 | 67 | 1.4 | 0.6 | 2.5 |
Norway | 326598 | 19568 | 7.2 | 73 | 1.69 | 0.3 | 1.23 |
Portugal | 312654 | 18652 | 1.53 | 74 | 2.6 | -1.2 | 1.6 |
Slovakia | 92356 | 45895 | 0.26 | 72 | 3.1 | 0.6 | 5 |
Slovenia | 49265 | 123654 | 2.25 | 75 | 1.5 | 0.5 | 6 |
Spain | 20125 | 26651 | 23.5 | 76.5 | 2 | 0.5 | 4.2 |
Sweden | 502354 | 21561 | 26.2 | 86.3 | 1.9 | -0.2 | 2.356 |
Switzerland | 495632 | 125465 | 56 | 56.9 | 1.8 | 0.003 | 1.8 |
In this example we will use hierarchical cluster analysis to group the countries. Cluster analysis also lets us summarise the data by grouping similar observations into clusters. Observations are similar when they have similar values across a number of variables, i.e. when the Euclidean distance between two observations is small, they are grouped together. We can perform the cluster analysis with the dist and hclust functions.
dist function: calculates a distance matrix for the supplied values, using Euclidean distance by default. Hierarchical clustering is then derived from this distance matrix with the hclust function.
The hclust function has a method argument that specifies how the clustering is done. The available methods include average, Ward (ward.D and ward.D2 in current versions of R), single, median, complete and centroid, with complete linkage being the default.
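To see how the choice of method changes the tree, the sketch below clusters the same made-up data with three of the methods and draws the results side by side. The data are random numbers, not the country table, and the object names are only for this example.

```r
# Made-up data for illustration: ten samples, two variables.
set.seed(1)
x <- matrix(rnorm(20), ncol = 2)
d <- dist(x)

hc_complete <- hclust(d)                      # default method: "complete"
hc_single   <- hclust(d, method = "single")
hc_average  <- hclust(d, method = "average")

# Draw the three trees next to each other to compare their shapes.
op <- par(mfrow = c(1, 3))
plot(hc_complete, main = "complete")
plot(hc_single,   main = "single")
plot(hc_average,  main = "average")
par(op)   # restore the previous plotting layout
```

The distance matrix is computed once and reused, so any difference between the trees comes from the linkage method alone.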
Steps to perform hierarchical clustering in R
step1). First we load the dataset into the R workspace and save it in a variable named survey
survey<-read.csv("survey.csv", header=TRUE)
step2). The syntax to perform hierarchical clustering is hclust of dist of the dataset
surveyclust<-hclust(dist(survey[-1]))
The clustering result is saved in the variable surveyclust.
survey[-1] drops the first column, i.e. the country name, since it is not numeric and cannot contribute to a distance.
step3). Plot the dendrogram
plot(surveyclust)
We pass the clustering result to plot to draw the dendrogram.
The numbers you see on the dendrogram are row numbers, each one standing for a country in the table.
Countries are placed on the tree according to their similarity.
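If country names are easier to read than row numbers, they can be supplied through the labels argument of plot. This is a minimal sketch using a few rows and two variables copied from the table; the column names country, area and gdp are an assumption about how survey.csv is laid out.

```r
# A few rows copied from the table above. The column names are an
# assumption about the layout of survey.csv.
survey <- data.frame(
  country = c("Austria", "Belgium", "Bulgaria", "Croatia"),
  area    = c(83871, 95326, 56356, 73569),
  gdp     = c(41600, 37589, 13456, 18000)
)

# Cluster on the numeric columns only, as in step 2.
surveyclust <- hclust(dist(survey[-1]))

# Use the country names as leaf labels instead of row numbers.
plot(surveyclust, labels = survey$country)
```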
step4). We can also mark clusters on the dendrogram using
rect.hclust(surveyclust, k = 5)
The arguments are the clustering result and the number of clusters.
The dendrogram will now show the 5 clusters outlined with colored rectangles.
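To get the cluster membership as data rather than as rectangles on a plot, the cutree function can be applied to a clustering result. The sketch below uses made-up random data and an arbitrary choice of three clusters; the names x, hc and groups are only for this example.

```r
# Made-up data for illustration: ten samples, two variables.
set.seed(42)
x <- matrix(rnorm(20), ncol = 2)
hc <- hclust(dist(x))

# Cut the tree into three clusters. The result holds one cluster
# number per sample, in the original row order.
groups <- cutree(hc, k = 3)
print(groups)

# Number of samples that fall in each cluster.
print(table(groups))
```

The resulting vector can be added to the original data frame as a new column, so each row carries its cluster number.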