{"id":652,"date":"2013-05-13T09:10:15","date_gmt":"2013-05-13T08:10:15","guid":{"rendered":"http:\/\/it4bus.vn\/itersdesktop\/?p=652"},"modified":"2013-05-13T09:10:15","modified_gmt":"2013-05-13T08:10:15","slug":"cluster-analysis-in-r","status":"publish","type":"post","link":"https:\/\/www.itersdesktop.com\/fr\/2013\/05\/13\/cluster-analysis-in-r\/","title":{"rendered":"Cluster Analysis in R"},"content":{"rendered":"<p><strong style=\"line-height: 20.799999237060547px;\">R<\/strong><span style=\"line-height: 20.799999237060547px;\">\u00a0has an\u00a0<\/span><a style=\"line-height: 20.799999237060547px;\" href=\"http:\/\/wiki.math.yorku.ca\/index.php\/R:_Cluster_analysis\">amazing variety\u00a0<\/a><span style=\"line-height: 20.799999237060547px;\">of functions for\u00a0<\/span><a style=\"line-height: 20.799999237060547px;\" href=\"http:\/\/cran.cnr.berkeley.edu\/web\/views\/Cluster.html\">cluster analysis<\/a><span style=\"line-height: 20.799999237060547px;\">. In this section, I will describe three of the many approaches: hierarchical\u00a0agglomeration<\/span><span style=\"line-height: 20.799999237060547px;\">\u00a0 partitioning, and model based. 
While there is no single best solution to the problem of determining the number of clusters to extract, several approaches are given below.<\/span><\/p>\n<h2><span class=\"ez-toc-section\" id=\"data-preparation\"><\/span>Data Preparation<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Prior to clustering data, you may want to remove or estimate missing data and rescale variables for comparability.<\/p>\n<p><code># Prepare Data<br \/>\nmydata &lt;- na.omit(mydata) # listwise deletion of missing values<br \/>\nmydata &lt;- scale(mydata) # standardize variables<\/code><\/p>\n<h2><span class=\"ez-toc-section\" id=\"partitioning\"><\/span>Partitioning<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p><strong>K-means<\/strong>\u00a0clustering is the most popular partitioning method. It requires the analyst to specify the number of clusters to extract. A plot of the within-groups sum of squares by number of clusters extracted can help determine the appropriate number of clusters: the analyst looks for a bend in the plot, similar to a scree test in factor analysis. See\u00a0<a href=\"http:\/\/www.statmethods.net\/about\/books.html\">Everitt &amp; Hothorn (pg. 
251)<\/a>.<\/p>\n<p><code># Determine number of clusters<br \/>\nwss &lt;- (nrow(mydata)-1)*sum(apply(mydata,2,var))<br \/>\nfor (i in 2:15) wss[i] &lt;- sum(kmeans(mydata,<br \/>\ncenters=i)$withinss)<br \/>\nplot(1:15, wss, type=\"b\", xlab=\"Number of Clusters\",<br \/>\nylab=\"Within groups sum of squares\")<\/code><\/p>\n<p><code># K-Means Cluster Analysis<br \/>\nfit &lt;- kmeans(mydata, 5) # 5 cluster solution<br \/>\n# get cluster means<br \/>\naggregate(mydata,by=list(fit$cluster),FUN=mean)<br \/>\n# append cluster assignment<br \/>\nmydata &lt;- data.frame(mydata, fit$cluster)<\/code><\/p>\n<p>A robust version of\u00a0<strong>K-means<\/strong>\u00a0based on medoids can be invoked by using\u00a0<strong>pam( )<\/strong>\u00a0instead of\u00a0<strong>kmeans( )<\/strong>. The function\u00a0<strong>pamk( )<\/strong>\u00a0in the\u00a0<strong><a href=\"http:\/\/cran.r-project.org\/web\/packages\/fpc\/index.html\">fpc<\/a><\/strong>\u00a0package is a wrapper for\u00a0<strong>pam( )<\/strong>\u00a0that also prints the suggested number of clusters based on optimum average silhouette width.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"hierarchical-agglomerative\"><\/span>Hierarchical Agglomerative<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>There is a wide range of hierarchical clustering approaches. 
I have had good luck with Ward&rsquo;s method, described below.<\/p>\n<p><code># Ward Hierarchical Clustering<br \/>\nd &lt;- dist(mydata, method = \"euclidean\") # distance matrix<br \/>\nfit &lt;- hclust(d, method=\"ward\")<br \/>\nplot(fit) # display dendrogram<br \/>\ngroups &lt;- cutree(fit, k=5) # cut tree into 5 clusters<br \/>\n# draw dendrogram with red borders around the 5 clusters<br \/>\nrect.hclust(fit, k=5, border=\"red\")<\/code><\/p>\n<p><a href=\"http:\/\/www.statmethods.net\/advstats\/images\/cluster1.jpg\"><img loading=\"lazy\" decoding=\"async\" alt=\"dendrogram\" src=\"http:\/\/www.statmethods.net\/advstats\/images\/smcluster1.jpg\" width=\"103\" height=\"103\" \/><\/a>\u00a0click to view<\/p>\n<p>The\u00a0<strong>pvclust( )<\/strong>\u00a0function in the\u00a0<strong><a href=\"http:\/\/cran.r-project.org\/web\/packages\/pvclust\/index.html\">pvclust<\/a><\/strong>\u00a0package provides p-values for hierarchical clustering based on multiscale bootstrap resampling. Clusters that are highly supported by the data will have large p-values. Interpretation details are provided by\u00a0<a href=\"http:\/\/www.is.titech.ac.jp\/~shimo\/prog\/pvclust\/\">Suzuki<\/a>. Be aware that\u00a0<strong><a href=\"http:\/\/cran.r-project.org\/web\/packages\/pvclust\/index.html\">pvclust<\/a><\/strong>\u00a0clusters columns, not rows. 
Transpose your data before using it.<\/p>\n<p><code># Ward Hierarchical Clustering with Bootstrapped p values<br \/>\nlibrary(pvclust)<br \/>\nfit &lt;- pvclust(mydata, method.hclust=\"ward\",<br \/>\nmethod.dist=\"euclidean\")<br \/>\nplot(fit) # dendrogram with p values<br \/>\n# add rectangles around groups highly supported by the data<br \/>\npvrect(fit, alpha=.95)<\/code><\/p>\n<p><a href=\"http:\/\/www.statmethods.net\/advstats\/images\/cluster2.jpg\"><img loading=\"lazy\" decoding=\"async\" alt=\"clustering with p values\" src=\"http:\/\/www.statmethods.net\/advstats\/images\/smcluster2.jpg\" width=\"103\" height=\"103\" \/><\/a>\u00a0click to view<\/p>\n<h2><span class=\"ez-toc-section\" id=\"model-based\"><\/span>Model Based<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Model-based approaches assume a variety of data models and apply maximum likelihood estimation and Bayes criteria to identify the most likely model and number of clusters. Specifically, the\u00a0<strong>Mclust( )<\/strong>\u00a0function in the\u00a0<strong><a href=\"http:\/\/cran.r-project.org\/web\/packages\/mclust\/index.html\">mclust<\/a><\/strong>\u00a0package selects the optimal model according to BIC for EM initialized by hierarchical clustering for parameterized Gaussian mixture models (phew!). One chooses the model and number of clusters with the largest BIC. 
See\u00a0<a href=\"http:\/\/finzi.psych.upenn.edu\/R\/library\/mclust\/html\/mclustModelNames.html\">help(mclustModelNames)<\/a>\u00a0for details on the model chosen as best.<\/p>\n<p><code># Model Based Clustering<br \/>\nlibrary(mclust)<br \/>\nfit &lt;- Mclust(mydata)<br \/>\nplot(fit, mydata) # plot results<br \/>\nprint(fit) # display the best model<\/code><\/p>\n<p><a href=\"http:\/\/www.statmethods.net\/advstats\/images\/cluster3.jpg\"><img loading=\"lazy\" decoding=\"async\" alt=\"model based clustering\" src=\"http:\/\/www.statmethods.net\/advstats\/images\/smcluster3.jpg\" width=\"103\" height=\"103\" \/><\/a>\u00a0<a href=\"http:\/\/www.statmethods.net\/advstats\/images\/cluster4.jpg\"><img loading=\"lazy\" decoding=\"async\" alt=\"cluster scatter plots\" src=\"http:\/\/www.statmethods.net\/advstats\/images\/smcluster4.jpg\" width=\"103\" height=\"103\" \/><\/a>\u00a0click to view<\/p>\n<h2><span class=\"ez-toc-section\" id=\"plotting-cluster-solutions\"><\/span>Plotting Cluster Solutions<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>It is always a good idea to look at the cluster results.<\/p>\n<p><code># K-Means Clustering with 5 clusters<br \/>\nfit &lt;- kmeans(mydata, 5)<br \/>\n# Cluster Plot against 1st 2 principal components<br \/>\n# vary parameters for most readable graph<br \/>\nlibrary(cluster)<br \/>\nclusplot(mydata, fit$cluster, color=TRUE, shade=TRUE,<br \/>\nlabels=2, lines=0)<br \/>\n# Centroid Plot against 1st 2 discriminant functions<br \/>\nlibrary(fpc)<br \/>\nplotcluster(mydata, fit$cluster)<\/code><\/p>\n<p><a href=\"http:\/\/www.statmethods.net\/advstats\/images\/cluster5.jpg\"><img loading=\"lazy\" decoding=\"async\" alt=\"clusplot\" src=\"http:\/\/www.statmethods.net\/advstats\/images\/smcluster5.jpg\" width=\"103\" height=\"103\" \/><\/a>\u00a0<a href=\"http:\/\/www.statmethods.net\/advstats\/images\/cluster6.jpg\"><img loading=\"lazy\" decoding=\"async\" alt=\"discriminant plot\" 
src=\"http:\/\/www.statmethods.net\/advstats\/images\/smcluster6.jpg\" width=\"103\" height=\"103\" \/><\/a>\u00a0click to view<\/p>\n<h2><span class=\"ez-toc-section\" id=\"validating-cluster-solutions\"><\/span>Validating cluster solutions<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The function\u00a0<strong>cluster.stats()\u00a0<\/strong>in the\u00a0<strong><a href=\"http:\/\/cran.r-project.org\/web\/packages\/fpc\/index.html\">fpc<\/a><\/strong>\u00a0package provides a mechanism for comparing the similarity of two cluster solutions using a variety of validation criteria (Hubert&rsquo;s gamma coefficient, the Dunn index and the corrected rand index)<\/p>\n<p><code># comparing 2 cluster solutions<br \/>\nlibrary(fpc)<br \/>\ncluster.stats(d, fit1$cluster, fit2$cluster)<\/code><\/p>\n<p>where\u00a0<strong>d\u00a0<\/strong>is a distance matrix among objects, and\u00a0<strong>fit1$cluster<\/strong>\u00a0and\u00a0<strong>fit$cluste<\/strong>r are integer vectors containing classification results from two different clusterings of the same data.<\/p>\n<p>Source:\u00a0<a href=\"http:\/\/www.statmethods.net\/advstats\/cluster.html\">http:\/\/www.statmethods.net\/advstats\/cluster.html<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>R\u00a0has an\u00a0amazing variety\u00a0of functions for\u00a0cluster analysis. In this section, I will describe three of the many approaches: hierarchical\u00a0agglomeration\u00a0 partitioning, and model based. 
While there are no best solutions for the&hellip; <\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[17],"tags":[150,95],"class_list":["post-652","post","type-post","status-publish","format-standard","hentry","category-r-language","tag-cluster-analysis","tag-r"],"_links":{"self":[{"href":"https:\/\/www.itersdesktop.com\/fr\/wp-json\/wp\/v2\/posts\/652","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.itersdesktop.com\/fr\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.itersdesktop.com\/fr\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.itersdesktop.com\/fr\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.itersdesktop.com\/fr\/wp-json\/wp\/v2\/comments?post=652"}],"version-history":[{"count":1,"href":"https:\/\/www.itersdesktop.com\/fr\/wp-json\/wp\/v2\/posts\/652\/revisions"}],"predecessor-version":[{"id":653,"href":"https:\/\/www.itersdesktop.com\/fr\/wp-json\/wp\/v2\/posts\/652\/revisions\/653"}],"wp:attachment":[{"href":"https:\/\/www.itersdesktop.com\/fr\/wp-json\/wp\/v2\/media?parent=652"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.itersdesktop.com\/fr\/wp-json\/wp\/v2\/categories?post=652"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.itersdesktop.com\/fr\/wp-json\/wp\/v2\/tags?post=652"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}