Cluster Analysis: Basic Concepts and Algorithms
Cluster analysis divides data into groups (clusters) that are meaningful, useful, or both. If meaningful groups are the goal, then the clusters should capture the natural structure of the data. In some cases, however, cluster analysis is only a useful starting point for other purposes, such as data summarization. Whether for understanding or utility, cluster analysis has long played an important role in a wide variety of fields: psychology and other social sciences, biology, statistics, pattern recognition, information retrieval, machine learning, and data mining.
There have been many applications of cluster analysis to practical problems. We provide some specific examples, organized by whether the purpose of the clustering is understanding or utility.
Clustering for Understanding Classes, or conceptually meaningful groups of objects that share common characteristics, play an important role in how people analyze and describe the world. Indeed, human beings are skilled at dividing objects into groups (clustering) and assigning particular objects to these groups (classification). For example, even relatively young children can quickly label the objects in a photograph as buildings, vehicles, people, animals, plants, etc. In the context of understanding data, clusters are potential classes and cluster analysis is the study of techniques for automatically finding classes. The following are some examples:
• Biology. Biologists have spent many years creating a taxonomy (hierarchical classification) of all living things: kingdom, phylum, class, order, family, genus, and species. Thus, it is perhaps not surprising that much of the early work in cluster analysis sought to create a discipline of mathematical taxonomy that could automatically find such classification structures. More recently, biologists have applied clustering to analyze the large amounts of genetic information that are now available. For example, clustering has been used to find groups of genes that have similar functions.
• Information Retrieval. The World Wide Web consists of billions of Web pages, and the results of a query to a search engine can return thousands of pages. Clustering can be used to group these search results into a small number of clusters, each of which captures a particular aspect of the query. For instance, a query of "movie" might return Web pages grouped into categories such as reviews, trailers, stars, and theaters. Each category (cluster) can be broken into subcategories (subclusters), producing a hierarchical structure that further assists a user's exploration of the query results.
• Climate. Understanding the Earth's climate requires finding patterns in the atmosphere and ocean. To that end, cluster analysis has been applied to find patterns in the atmospheric pressure of polar regions and areas of the ocean that have a significant impact on land climate.
• Psychology and Medicine. An illness or condition frequently has a number of variations, and cluster analysis can be used to identify these different subcategories. For example, clustering has been used to identify different types of depression. Cluster analysis can also be used to detect patterns in the spatial or temporal distribution of a disease.
• Business. Businesses collect large amounts of information on current and potential customers. Clustering can be used to segment customers into a small number of groups for additional analysis and marketing activities.
Clustering for Utility Cluster analysis provides an abstraction from individual data objects to the clusters in which those data objects reside. Additionally, some clustering techniques characterize each cluster in terms of a cluster prototype; i.e., a data object that is representative of the other objects in the cluster. These cluster prototypes can be used as the basis for a number of data analysis or data processing techniques.
Therefore, in the context of utility, cluster analysis is the study of techniques for finding the most representative cluster prototypes.
• Summarization. Many data analysis techniques, such as regression or PCA, have a time or space complexity of O(m^2) or higher (where m is the number of objects), and thus, are not practical for large data sets. However, instead of applying the algorithm to the entire data set, it can be applied to a reduced data set consisting only of cluster prototypes. Depending on the type of analysis, the number of prototypes, and the accuracy with which the prototypes represent the data, the results can be comparable to those that would have been obtained if all the data could have been used.
• Compression. Cluster prototypes can also be used for data compression. In particular, a table is created that consists of the prototypes for each cluster; i.e., each prototype is assigned an integer value that is its position (index) in the table. Each object is represented by the index of the prototype associated with its cluster. This type of compression is known as vector quantization and is often applied to image, sound, and video data, where (1) many of the data objects are highly similar to one another, (2) some loss of information is acceptable, and (3) a substantial reduction in the data size is desired.
• Efficiently Finding Nearest Neighbors. Finding nearest neighbors can require computing the pairwise distance between all points. Often clusters and their cluster prototypes can be found much more efficiently. If objects are relatively close to the prototype of their cluster, then we can use the prototypes to reduce the number of distance computations that are necessary to find the nearest neighbors of an object. Intuitively, if two cluster prototypes are far apart, then the objects in the corresponding clusters cannot be nearest neighbors of each other. Consequently, to find an object's nearest neighbors it is only necessary to compute the distance to objects in nearby clusters, where the nearness of two clusters is measured by the distance between their prototypes. This idea is made more precise in Exercise 25 on page 94.
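The pruning idea in the last bullet can be sketched in a few lines of Python. The fragment below is only an illustration (the function name, the NumPy usage, and the n_probe parameter are ours, not from the text); it assumes the clusters, their integer labels, and their prototypes have already been computed, and it searches only the few clusters whose prototypes are closest to the query. As a heuristic, it can miss the true nearest neighbor if too few clusters are probed.

import numpy as np

def nearest_neighbor_via_prototypes(query, points, labels, prototypes, n_probe=2):
    # Distances from the query point to every cluster prototype.
    proto_dist = np.linalg.norm(prototypes - query, axis=1)
    # Search only the n_probe clusters whose prototypes are closest to the query.
    candidate_clusters = np.argsort(proto_dist)[:n_probe]
    candidates = points[np.isin(labels, candidate_clusters)]
    # Exact search restricted to points in the nearby clusters.
    return candidates[np.argmin(np.linalg.norm(candidates - query, axis=1))]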
This chapter provides an introduction to cluster analysis. We begin with a high-level overview of clustering, including a discussion of the various approaches to dividing objects into sets of clusters and the different types of clusters. We then describe three specific clustering techniques that represent
broad categories of algorithms and illustrate a variety of concepts: K-means, agglomerative hierarchical clustering, and DBSCAN. The final section of this chapter is devoted to cluster validity—methods for evaluating the goodness of the clusters produced by a clustering algorithm. More advanced clustering concepts and algorithms will be discussed in Chapter 9. Whenever possible, we discuss the strengths and weaknesses of different schemes. In addition, the bibliographic notes provide references to relevant books and papers that explore cluster analysis in greater depth.
8.1 Overview
Before discussing specific clustering techniques, we provide some necessary background. First, we further define cluster analysis, illustrating why it is difficult and explaining its relationship to other techniques that group data. Then we explore two important topics: (1) different ways to group a set of objects into a set of clusters, and (2) types of clusters.
8.1.1 What Is Cluster Analysis?
Cluster analysis groups data objects based only on information found in the data that describes the objects and their relationships. The goal is that the objects within a group be similar (or related) to one another and different from (or unrelated to) the objects in other groups. The greater the similarity (or homogeneity) within a group and the greater the difference between groups, the better or more distinct the clustering.
In many applications, the notion of a cluster is not well defined. To better understand the difficulty of deciding what constitutes a cluster, consider Figure 8.1, which shows twenty points and three different ways of dividing them into clusters. The shapes of the markers indicate cluster membership. Figures 8.1(b) and 8.1(d) divide the data into two and six parts, respectively. However, the apparent division of each of the two larger clusters into three subclusters may simply be an artifact of the human visual system. Also, it may not be unreasonable to say that the points form four clusters, as shown in Figure 8.1(c). This figure illustrates that the definition of a cluster is imprecise and that the best definition depends on the nature of data and the desired results.
Cluster analysis is related to other techniques that are used to divide data objects into groups. For instance, clustering can be regarded as a form of classification in that it creates a labeling of objects with class (cluster) labels. However, it derives these labels only from the data.
Figure 8.1. Different ways of clustering the same set of points: (a) original points, (b) two clusters, (c) four clusters, (d) six clusters.
In contrast, classification in the sense of Chapter 4 is supervised classification; i.e., new, unlabeled objects are assigned a class label using a model developed from objects with known class labels. For this reason, cluster analysis is sometimes referred to as unsupervised classification. When the term classification is used without any qualification within data mining, it typically refers to supervised classification.
Also, while the terms segmentation and partitioning are sometimes used as synonyms for clustering, these terms are frequently used for approaches outside the traditional bounds of cluster analysis. For example, the term partitioning is often used in connection with techniques that divide graphs into subgraphs and that are not strongly connected to clustering. Segmentation often refers to the division of data into groups using simple techniques; e.g., an image can be split into segments based only on pixel intensity and color, or people can be divided into groups based on their income. Nonetheless, some work in graph partitioning and in image and market segmentation is related to cluster analysis.
8.1.2 Different Types of Clusterings
An entire collection of clusters is commonly referred to as a clustering, and in this section, we distinguish various types of clusterings: hierarchical (nested) versus partitional (unnested), exclusive versus overlapping versus fuzzy, and complete versus partial.
Hierarchical versus Partitional The most commonly discussed distinction among different types of clusterings is whether the set of clusters is nested or unnested, or in more traditional terminology, hierarchical or partitional.
A partitional clustering is simply a division of the set of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset. Taken individually, each collection of clusters in Figures 8.1(b–d) is a partitional clustering.
If we permit clusters to have subclusters, then we obtain a hierarchical clustering, which is a set of nested clusters that are organized as a tree. Each node (cluster) in the tree (except for the leaf nodes) is the union of its children (subclusters), and the root of the tree is the cluster containing all the objects. Often, but not always, the leaves of the tree are singleton clusters of individual data objects. If we allow clusters to be nested, then one interpretation of Figure 8.1(a) is that it has two subclusters (Figure 8.1(b)), each of which, in turn, has three subclusters (Figure 8.1(d)). The clusters shown in Figures 8.1(a–d), when taken in that order, also form a hierarchical (nested) clustering with, respectively, 1, 2, 4, and 6 clusters on each level. Finally, note that a hierarchical clustering can be viewed as a sequence of partitional clusterings and a partitional clustering can be obtained by taking any member of that sequence; i.e., by cutting the hierarchical tree at a particular level.
Exclusive versus Overlapping versus Fuzzy The clusterings shown in Figure 8.1 are all exclusive, as they assign each object to a single cluster. There are many situations in which a point could reasonably be placed in more than one cluster, and these situations are better addressed by non-exclusive clustering. In the most general sense, an overlapping or non-exclusive clustering is used to reflect the fact that an object can simultaneously belong to more than one group (class). For instance, a person at a university can be both an enrolled student and an employee of the university. A non-exclusive clustering is also often used when, for example, an object is "between" two or more clusters and could reasonably be assigned to any of these clusters. Imagine a point halfway between two of the clusters of Figure 8.1. Rather than make a somewhat arbitrary assignment of the object to a single cluster, it is placed in all of the "equally good" clusters.
In a fuzzy clustering, every object belongs to every cluster with a membership weight that is between 0 (absolutely doesn't belong) and 1 (absolutely belongs). In other words, clusters are treated as fuzzy sets. (Mathematically, a fuzzy set is one in which an object belongs to any set with a weight that is between 0 and 1. In fuzzy clustering, we often impose the additional constraint that the sum of the weights for each object must equal 1.) Similarly, probabilistic clustering techniques compute the probability with which each point belongs to each cluster, and these probabilities must also sum to 1.
Because the membership weights or probabilities for any object sum to 1, a fuzzy or probabilistic clustering does not address true multiclass situations, such as the case of a student employee, where an object belongs to multiple classes. Instead, these approaches are most appropriate for avoiding the arbitrariness of assigning an object to only one cluster when it may be close to several. In practice, a fuzzy or probabilistic clustering is often converted to an exclusive clustering by assigning each object to the cluster in which its membership weight or probability is highest.
Complete versus Partial A complete clustering assigns every object to a cluster, whereas a partial clustering does not. The motivation for a partial clustering is that some objects in a data set may not belong to well-defined groups. Many times objects in the data set may represent noise, outliers, or "uninteresting background." For example, some newspaper stories may share a common theme, such as global warming, while other stories are more generic or one-of-a-kind. Thus, to find the important topics in last month's stories, we may want to search only for clusters of documents that are tightly related by a common theme. In other cases, a complete clustering of the objects is desired. For example, an application that uses clustering to organize documents for browsing needs to guarantee that all documents can be browsed.
8.1.3 Different Types of Clusters
Clustering aims to find useful groups of objects (clusters), where usefulness is defined by the goals of the data analysis. Not surprisingly, there are several different notions of a cluster that prove useful in practice. In order to visually illustrate the differences among these types of clusters, we use two-dimensional points, as shown in Figure 8.2, as our data objects. We stress, however, that the types of clusters described here are equally valid for other kinds of data.
Well-Separated A cluster is a set of objects in which each object is closer (or more similar) to every other object in the cluster than to any object not in the cluster. Sometimes a threshold is used to specify that all the objects in a cluster must be sufficiently close (or similar) to one another. This idealistic definition of a cluster is satisfied only when the data contains natural clusters that are quite far from each other. Figure 8.2(a) gives an example of well-separated clusters that consists of two groups of points in a two-dimensional space.
The distance between any two points in different groups is larger than the distance between any two points within a group. Well-separated clusters do not need to be globular, but can have any shape.
Prototype-Based A cluster is a set of objects in which each object is closer (more similar) to the prototype that defines the cluster than to the prototype of any other cluster. For data with continuous attributes, the prototype of a cluster is often a centroid, i.e., the average (mean) of all the points in the cluster. When a centroid is not meaningful, such as when the data has categorical attributes, the prototype is often a medoid, i.e., the most representative point of a cluster. For many types of data, the prototype can be regarded as the most central point, and in such instances, we commonly refer to prototype-based clusters as center-based clusters. Not surprisingly, such clusters tend to be globular. Figure 8.2(b) shows an example of center-based clusters.
Graph-Based If the data is represented as a graph, where the nodes are objects and the links represent connections among objects (see Section 2.1.2), then a cluster can be defined as a connected component; i.e., a group of objects that are connected to one another, but that have no connection to objects outside the group. An important example of graph-based clusters are contiguity-based clusters, where two objects are connected only if they are within a specified distance of each other. This implies that each object in a contiguity-based cluster is closer to some other object in the cluster than to any point in a different cluster. Figure 8.2(c) shows an example of such clusters for two-dimensional points. This definition of a cluster is useful when clusters are irregular or intertwined, but can have trouble when noise is present since, as illustrated by the two spherical clusters of Figure 8.2(c), a small bridge of points can merge two distinct clusters.
Other types of graph-based clusters are also possible. One such approach (Section 8.3.2) defines a cluster as a clique; i.e., a set of nodes in a graph that are completely connected to each other. Specifically, if we add connections between objects in the order of their distance from one another, a cluster is formed when a set of objects forms a clique. Like prototype-based clusters, such clusters tend to be globular.
Density-Based A cluster is a dense region of objects that is surrounded by a region of low density. Figure 8.2(d) shows some density-based clusters for data created by adding noise to the data of Figure 8.2(c). The two circular clusters are not merged, as in Figure 8.2(c), because the bridge between them fades into the noise.
Likewise, the curve that is present in Figure 8.2(c) also fades into the noise and does not form a cluster in Figure 8.2(d). A density-based definition of a cluster is often employed when the clusters are irregular or intertwined, and when noise and outliers are present. By contrast, a contiguity-based definition of a cluster would not work well for the data of Figure 8.2(d) since the noise would tend to form bridges between clusters.
Shared-Property (Conceptual Clusters) More generally, we can define a cluster as a set of objects that share some property. This definition encompasses all the previous definitions of a cluster; e.g., objects in a center-based cluster share the property that they are all closest to the same centroid or medoid. However, the shared-property approach also includes new types of clusters. Consider the clusters shown in Figure 8.2(e). A triangular area (cluster) is adjacent to a rectangular one, and there are two intertwined circles (clusters). In both cases, a clustering algorithm would need a very specific concept of a cluster to successfully detect these clusters. The process of finding such clusters is called conceptual clustering. However, too sophisticated a notion of a cluster would take us into the area of pattern recognition, and thus, we only consider simpler types of clusters in this book.
Road Map
In this chapter, we use the following three simple, but important techniques to introduce many of the concepts involved in cluster analysis.
• K-means. This is a prototype-based, partitional clustering technique that attempts to find a user-specified number of clusters (K), which are represented by their centroids.
• Agglomerative Hierarchical Clustering. This clustering approach refers to a collection of closely related clustering techniques that produce a hierarchical clustering by starting with each point as a singleton cluster and then repeatedly merging the two closest clusters until a single, all-encompassing cluster remains. Some of these techniques have a natural interpretation in terms of graph-based clustering, while others have an interpretation in terms of a prototype-based approach.
• DBSCAN. This is a density-based clustering algorithm that produces a partitional clustering, in which the number of clusters is automatically determined by the algorithm. Points in low-density regions are classified as noise and omitted; thus, DBSCAN does not produce a complete clustering.
Figure 8.2. Different types of clusters as illustrated by sets of two-dimensional points. (a) Well-separated clusters: each point is closer to all of the points in its cluster than to any point in another cluster. (b) Center-based clusters: each point is closer to the center of its cluster than to the center of any other cluster. (c) Contiguity-based clusters: each point is closer to at least one point in its cluster than to any point in another cluster. (d) Density-based clusters: clusters are regions of high density separated by regions of low density. (e) Conceptual clusters: points in a cluster share some general property that derives from the entire set of points. (Points in the intersection of the circles belong to both.)
8.2 K-means
Prototype-based clustering techniques create a one-level partitioning of the data objects. There are a number of such techniques, but two of the most prominent are K-means and K-medoid.
K-means defines a prototype in terms of a centroid, which is usually the mean of a group of points, and is typically applied to objects in a continuous n-dimensional space. K-medoid defines a prototype in terms of a medoid, which is the most representative point for a group of points, and can be applied to a wide range of data since it requires only a proximity measure for a pair of objects. While a centroid almost never corresponds to an actual data point, a medoid, by its definition, must be an actual data point. In this section, we will focus solely on K-means, which is one of the oldest and most widely used clustering algorithms.
8.2.1 The Basic K-means Algorithm
The K-means clustering technique is simple, and we begin with a description of the basic algorithm. We first choose K initial centroids, where K is a user-specified parameter, namely, the number of clusters desired. Each point is then assigned to the closest centroid, and each collection of points assigned to a centroid is a cluster. The centroid of each cluster is then updated based on the points assigned to the cluster. We repeat the assignment and update steps until no point changes clusters, or equivalently, until the centroids remain the same.
K-means is formally described by Algorithm 8.1. The operation of K-means is illustrated in Figure 8.3, which shows how, starting from three centroids, the final clusters are found in four assignment-update steps. In these and other figures displaying K-means clustering, each subfigure shows (1) the centroids at the start of the iteration and (2) the assignment of the points to those centroids. The centroids are indicated by the "+" symbol; all points belonging to the same cluster have the same marker shape.
Algorithm 8.1 Basic K-means algorithm.
1: Select K points as initial centroids.
2: repeat
3:   Form K clusters by assigning each point to its closest centroid.
4:   Recompute the centroid of each cluster.
5: until Centroids do not change.
In the first step, shown in Figure 8.3(a), points are assigned to the initial centroids, which are all in the larger group of points. For this example, we use the mean as the centroid. After points are assigned to a centroid, the centroid is then updated. Again, the figure for each step shows the centroid at the beginning of the step and the assignment of points to those centroids.
Figure 8.3. Using the K-means algorithm to find three clusters in sample data. (a) Iteration 1. (b) Iteration 2. (c) Iteration 3. (d) Iteration 4.
In the second step, points are assigned to the updated centroids, and the centroids are updated again. In steps 2, 3, and 4, which are shown in Figures 8.3(b), (c), and (d), respectively, two of the centroids move to the two small groups of points at the bottom of the figures. When the K-means algorithm terminates in Figure 8.3(d), because no more changes occur, the centroids have identified the natural groupings of points.
For some combinations of proximity functions and types of centroids, K-means always converges to a solution; i.e., K-means reaches a state in which no points are shifting from one cluster to another, and hence, the centroids don't change. Because most of the convergence occurs in the early steps, however, the condition on line 5 of Algorithm 8.1 is often replaced by a weaker condition, e.g., repeat until only 1% of the points change clusters.
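A minimal sketch of Algorithm 8.1 in Python with NumPy is shown below. It is not taken from the text; it simply follows the assign and update steps described above, uses Euclidean distance with the mean as the centroid, starts from randomly chosen data points, and stops when the centroids no longer change (or after a maximum number of iterations). Empty clusters simply keep their old centroid in this sketch.

import numpy as np

def basic_kmeans(points, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: select K points as initial centroids (randomly chosen data points).
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: form K clusters by assigning each point to its closest centroid.
        dist = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Step 4: recompute the centroid of each cluster (keep the old centroid if empty).
        new_centroids = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # Step 5: stop when the centroids do not change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels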
We consider each of the steps in the basic K-means algorithm in more detail and then provide an analysis of the algorithm's space and time complexity.
Assigning Points to the Closest Centroid
To assign a point to the closest centroid, we need a proximity measure that quantifies the notion of "closest" for the specific data under consideration. Euclidean (L2) distance is often used for data points in Euclidean space, while cosine similarity is more appropriate for documents. However, there may be several types of proximity measures that are appropriate for a given type of data. For example, Manhattan (L1) distance can be used for Euclidean data, while the Jaccard measure is often employed for documents.
Usually, the similarity measures used for K-means are relatively simple since the algorithm repeatedly calculates the similarity of each point to each centroid.
Table 8.1. Table of notation.
Symbol   Description
x        An object.
Ci       The ith cluster.
ci       The centroid of cluster Ci.
c        The centroid of all points.
mi       The number of objects in the ith cluster.
m        The number of objects in the data set.
K        The number of clusters.
In some cases, however, such as when the data is in low-dimensional Euclidean space, it is possible to avoid computing many of the similarities, thus significantly speeding up the K-means algorithm. Bisecting K-means (described in Section 8.2.3) is another approach that speeds up K-means by reducing the number of similarities computed.
Centroids and Objective Functions
Step 4 of the K-means algorithm was stated rather generally as "recompute the centroid of each cluster," since the centroid can vary, depending on the proximity measure for the data and the goal of the clustering. The goal of the clustering is typically expressed by an objective function that depends on the proximities of the points to one another or to the cluster centroids; e.g., minimize the squared distance of each point to its closest centroid. We illustrate this with two examples. However, the key point is this: once we have specified a proximity measure and an objective function, the centroid that we should choose can often be determined mathematically. We provide mathematical details in Section 8.2.6, and provide a non-mathematical discussion of this observation here.
Data in Euclidean Space Consider data whose proximity measure is Euclidean distance. For our objective function, which measures the quality of a clustering, we use the sum of the squared error (SSE), which is also known as scatter. In other words, we calculate the error of each data point, i.e., its Euclidean distance to the closest centroid, and then compute the total sum of the squared errors. Given two different sets of clusters that are produced by two different runs of K-means, we prefer the one with the smallest squared error since this means that the prototypes (centroids) of this clustering are a better representation of the points in their cluster. Using the notation in Table 8.1, the SSE is formally defined as follows:
SSE = \sum_{i=1}^{K} \sum_{x \in C_i} dist(c_i, x)^2        (8.1)
where dist is the standard Euclidean (L2) distance between two objects in Euclidean space.
Given these assumptions, it can be shown (see Section 8.2.6) that the centroid that minimizes the SSE of the cluster is the mean. Using the notation in Table 8.1, the centroid (mean) of the ith cluster is defined by Equation 8.2.
c_i = \frac{1}{m_i} \sum_{x \in C_i} x        (8.2)
To illustrate, the centroid of a cluster containing the three two-dimensional points, (1,1), (2,3), and (6,2), is ((1 + 2 + 6)/3, (1 + 3 + 2)/3) = (3, 2).
Steps 3 and 4 of the K-means algorithm directly attempt to minimize the SSE (or more generally, the objective function). Step 3 forms clusters by assigning points to their nearest centroid, which minimizes the SSE for the given set of centroids. Step 4 recomputes the centroids so as to further minimize the SSE. However, the actions of K-means in Steps 3 and 4 are only guaranteed to find a local minimum with respect to the SSE since they are based on optimizing the SSE for specific choices of the centroids and clusters, rather than for all possible choices. We will later see an example in which this leads to a suboptimal clustering.
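Equations 8.1 and 8.2 are easy to check numerically. The following short sketch (our own illustration) computes the centroid and the SSE for the single cluster of three points used in the example above.

import numpy as np

points = np.array([[1.0, 1.0], [2.0, 3.0], [6.0, 2.0]])
centroid = points.mean(axis=0)    # Equation 8.2: the mean, which is (3, 2) here
# Equation 8.1 restricted to one cluster: sum of squared Euclidean distances to the centroid.
sse = np.sum(np.linalg.norm(points - centroid, axis=1) ** 2)
print(centroid, sse)              # [3. 2.] 16.0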
Document Data To illustrate that K-means is not restricted to data in Euclidean space, we consider document data and the cosine similarity measure. Here we assume that the document data is represented as a document-term matrix as described on page 31. Our objective is to maximize the similarity of the documents in a cluster to the cluster centroid; this quantity is known as the cohesion of the cluster. For this objective it can be shown that the cluster centroid is, as for Euclidean data, the mean. The analogous quantity to the total SSE is the total cohesion, which is given by Equation 8.3.
Total Cohesion = \sum_{i=1}^{K} \sum_{x \in C_i} cosine(x, c_i)        (8.3)
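As a small illustration (ours, not from the text), Equation 8.3 can be computed as follows for a dense document-term matrix docs and a cluster assignment labels; the sketch assumes no document vector is all zeros.

import numpy as np

def total_cohesion(docs, labels, k):
    cohesion = 0.0
    for j in range(k):
        cluster = docs[labels == j]
        c = cluster.mean(axis=0)  # the cluster centroid is the mean, as noted above
        # Sum of cosine similarities of each document in the cluster to its centroid.
        sims = cluster @ c / (np.linalg.norm(cluster, axis=1) * np.linalg.norm(c))
        cohesion += sims.sum()
    return cohesion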
The General Case There are a number of choices for the proximity function, centroid, and objective function that can be used in the basic K-means algorithm and that are guaranteed to converge.
Table 8.2. K-means: Common choices for proximity, centroids, and objective functions.
Proximity Function        Centroid   Objective Function
Manhattan (L1)            median     Minimize sum of the L1 distance of an object to its cluster centroid
Squared Euclidean (L2^2)  mean       Minimize sum of the squared L2 distance of an object to its cluster centroid
cosine                    mean       Maximize sum of the cosine similarity of an object to its cluster centroid
Bregman divergence        mean       Minimize sum of the Bregman divergence of an object to its cluster centroid
Table 8.2 shows some possible choices, including the two that we have just discussed. Notice that for Manhattan (L1) distance and the objective of minimizing the sum of the distances, the appropriate centroid is the median of the points in a cluster.
The last entry in the table, Bregman divergence (Section 2.4.5), is actually a class of proximity measures that includes the squared Euclidean distance, L2^2, the Mahalanobis distance, and cosine similarity. The importance of Bregman divergence functions is that any such function can be used as the basis of a K-means style clustering algorithm with the mean as the centroid. Specifically, if we use a Bregman divergence as our proximity function, then the resulting clustering algorithm has the usual properties of K-means with respect to convergence, local minima, etc. Furthermore, the properties of such a clustering algorithm can be developed for all possible Bregman divergences. Indeed, K-means algorithms that use cosine similarity or squared Euclidean distance are particular instances of a general clustering algorithm based on Bregman divergences.
For the rest of our K-means discussion, we use two-dimensional data since it is easy to explain K-means and its properties for this type of data. But, as suggested by the last few paragraphs, K-means is a very general clustering algorithm and can be used with a wide variety of data types, such as documents and time series.
Choosing Initial Centroids
When random initialization of centroids is used, different runs of K-means typically produce different total SSEs. We illustrate this with the set of two-dimensional points shown in Figure 8.3, which has three natural clusters of points.
Figure 8.4. Three optimal and non-optimal clusters. (a) Optimal clustering. (b) Suboptimal clustering.
Figure 8.4(a) shows a clustering solution that is the global minimum of the SSE for three clusters, while Figure 8.4(b) shows a suboptimal clustering that is only a local minimum.
Choosing the proper initial centroids is the key step of the basic K-means procedure. A common approach is to choose the initial centroids randomly, but the resulting clusters are often poor.
Example 8.1 (Poor Initial Centroids). Randomly selected initial centroids may be poor. We provide an example of this using the same data set used in Figures 8.3 and 8.4. Figures 8.3 and 8.5 show the clusters that result from two particular choices of initial centroids. (For both figures, the positions of the cluster centroids in the various iterations are indicated by crosses.) In Figure 8.3, even though all the initial centroids are from one natural cluster, the minimum SSE clustering is still found. In Figure 8.5, however, even though the initial centroids seem to be better distributed, we obtain a suboptimal clustering, with higher squared error.
Example 8.2 (Limits of Random Initialization). One technique that is commonly used to address the problem of choosing initial centroids is to perform multiple runs, each with a different set of randomly chosen initial centroids, and then select the set of clusters with the minimum SSE. While simple, this strategy may not work very well, depending on the data set and the number of clusters sought. We demonstrate this using the sample data set shown in Figure 8.6(a). The data consists of two pairs of clusters, where the clusters in each (top-bottom) pair are closer to each other than to the clusters in the other pair.
Figure 8.5. Poor starting centroids for K-means. (a) Iteration 1. (b) Iteration 2. (c) Iteration 3. (d) Iteration 4.
Figure 8.6(b–d) shows that if we start with two initial centroids per pair of clusters, then even when both centroids are in a single cluster, the centroids will redistribute themselves so that the "true" clusters are found. However, Figure 8.7 shows that if a pair of clusters has only one initial centroid and the other pair has three, then two of the true clusters will be combined and one true cluster will be split.
Note that an optimal clustering will be obtained as long as two initial centroids fall anywhere in a pair of clusters, since the centroids will redistribute themselves, one to each cluster. Unfortunately, as the number of clusters becomes larger, it is increasingly likely that at least one pair of clusters will have only one initial centroid. (See Exercise 4 on page 559.) In this case, because the pairs of clusters are farther apart than clusters within a pair, the K-means algorithm will not redistribute the centroids between pairs of clusters, and thus, only a local minimum will be achieved.
Because of the problems with using randomly selected initial centroids, which even repeated runs may not overcome, other techniques are often employed for initialization. One effective approach is to take a sample of points and cluster them using a hierarchical clustering technique. K clusters are extracted from the hierarchical clustering, and the centroids of those clusters are used as the initial centroids. This approach often works well, but is practical only if (1) the sample is relatively small, e.g., a few hundred to a few thousand (hierarchical clustering is expensive), and (2) K is relatively small compared to the sample size.
The following procedure is another approach to selecting initial centroids. Select the first point at random or take the centroid of all points. Then, for each successive initial centroid, select the point that is farthest from any of the initial centroids already selected.
Figure 8.6. Two pairs of clusters with a pair of initial centroids within each pair of clusters. (a) Initial points. (b) Iteration 1. (c) Iteration 2. (d) Iteration 3.
In this way, we obtain a set of initial centroids that is guaranteed to be not only randomly selected but also well separated. Unfortunately, such an approach can select outliers, rather than points in dense regions (clusters). Also, it is expensive to compute the farthest point from the current set of initial centroids. To overcome these problems, this approach is often applied to a sample of the points. Since outliers are rare, they tend not to show up in a random sample. In contrast, points from every dense region are likely to be included unless the sample size is very small. Also, the computation involved in finding the initial centroids is greatly reduced because the sample size is typically much smaller than the number of points.
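The farthest-point procedure just described might look as follows in Python (an illustrative sketch, applied to a sample of the points as suggested above; the names are ours).

import numpy as np

def farthest_first_centroids(sample, k, seed=0):
    rng = np.random.default_rng(seed)
    # First centroid: a random point (the centroid of all points would also work).
    centroids = [sample[rng.integers(len(sample))]]
    for _ in range(k - 1):
        # Distance from every point to its closest already-chosen centroid.
        d = np.min([np.linalg.norm(sample - c, axis=1) for c in centroids], axis=0)
        # Next centroid: the point farthest from the centroids chosen so far.
        centroids.append(sample[np.argmax(d)])
    return np.array(centroids)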
Later on, we will discuss two other approaches that are useful for producing better-quality (lower SSE) clusterings: using a variant of K-means that is less susceptible to initialization problems (bisecting K-means) and using postprocessing to "fix up" the set of clusters produced.
Figure 8.7. Two pairs of clusters with more or fewer than two initial centroids within a pair of clusters. (a) Iteration 1. (b) Iteration 2. (c) Iteration 3. (d) Iteration 4.
Time and Space Complexity
The space requirements for K-means are modest because only the data points and centroids are stored. Specifically, the storage required is O((m + K)n), where m is the number of points and n is the number of attributes. The time requirements for K-means are also modest—basically linear in the number of data points. In particular, the time required is O(I ∗ K ∗ m ∗ n), where I is the number of iterations required for convergence. As mentioned, I is often small and can usually be safely bounded, as most changes typically occur in the first few iterations.
Therefore, K-means is linear in m, the number of points, and is efficient as well as simple provided that K, the number of clusters, is significantly less than m.
8.2.2 K-means: Additional Issues
Handling Empty Clusters
One of the problems with the basic K-means algorithm given earlier is that empty clusters can be obtained if no points are allocated to a cluster during the assignment step. If this happens, then a strategy is needed to choose a replacement centroid, since otherwise, the squared error will be larger than necessary. One approach is to choose the point that is farthest away from any current centroid. If nothing else, this eliminates the point that currently contributes most to the total squared error. Another approach is to choose the replacement centroid from the cluster that has the highest SSE. This will typically split the cluster and reduce the overall SSE of the clustering. If there are several empty clusters, then this process can be repeated several times.
Outliers
When the squared error criterion is used, outliers can unduly influence the clusters that are found. In particular, when outliers are present, the resulting cluster centroids (prototypes) may not be as representative as they otherwise would be and thus, the SSE will be higher as well. Because of this, it is often useful to discover outliers and eliminate them beforehand. It is important, however, to appreciate that there are certain clustering applications for which outliers should not be eliminated. When clustering is used for data compression, every point must be clustered, and in some cases, such as financial analysis, apparent outliers, e.g., unusually profitable customers, can be the most interesting points.
An obvious issue is how to identify outliers. A number of techniques for identifying outliers will be discussed in Chapter 10. If we use approaches that remove outliers before clustering, we avoid clustering points that will not cluster well. Alternatively, outliers can also be identified in a postprocessing step. For instance, we can keep track of the SSE contributed by each point, and eliminate those points with unusually high contributions, especially over multiple runs. Also, we may want to eliminate small clusters since they frequently represent groups of outliers.
Reducing the SSE with Postprocessing
An obvious way to reduce the SSE is to find more clusters, i.e., to use a larger K. However, in many cases, we would like to improve the SSE, but don't want to increase the number of clusters. This is often possible because K-means typically converges to a local minimum. Various techniques are used to "fix up" the resulting clusters in order to produce a clustering that has lower SSE. The strategy is to focus on individual clusters since the total SSE is simply the sum of the SSE contributed by each cluster. (We will use the terminology total SSE and cluster SSE, respectively, to avoid any potential confusion.) We can change the total SSE by performing various operations on the clusters, such as splitting or merging clusters. One commonly used approach is to use alternate cluster splitting and merging phases. During a splitting phase, clusters are divided, while during a merging phase, clusters are combined. In this way, it is often possible to escape local SSE minima and still produce a clustering solution with the desired number of clusters. The following are some techniques used in the splitting and merging phases.
Two strategies that decrease the total SSE by increasing the number of clusters are the following:
Split a cluster: The cluster with the largest SSE is usually chosen, but we could also split the cluster with the largest standard deviation for one particular attribute.
Introduce a new cluster centroid: Often the point that is farthest from any cluster center is chosen. We can easily determine this if we keep track of the SSE contributed by each point. Another approach is to choose randomly from all points or from the points with the highest SSE.
Two strategies that decrease the number of clusters, while trying to minimize the increase in total SSE, are the following:
Disperse a cluster: This is accomplished by removing the centroid that corresponds to the cluster and reassigning the points to other clusters. Ideally, the cluster that is dispersed should be the one that increases the total SSE the least.
Merge two clusters: The clusters with the closest centroids are typically chosen, although another, perhaps better, approach is to merge the two clusters that result in the smallest increase in total SSE. These two merging strategies are the same ones that are used in the hierarchical clustering techniques known as the centroid method and Ward's method, respectively. Both methods are discussed in Section 8.3.
Updating Centroids Incrementally
Instead of updating cluster centroids after all points have been assigned to a cluster, the centroids can be updated incrementally, after each assignment of a point to a cluster. Notice that this requires either zero or two updates to cluster centroids at each step, since a point either moves to a new cluster (two updates) or stays in its current cluster (zero updates). Using an incremental update strategy guarantees that empty clusters are not produced since all clusters start with a single point, and if a cluster ever has only one point, then that point will always be reassigned to the same cluster.
In addition, if incremental updating is used, the relative weight of the point being added may be adjusted; e.g., the weight of points is often decreased as the clustering proceeds. While this can result in better accuracy and faster convergence, it can be difficult to make a good choice for the relative weight, especially in a wide variety of situations. These update issues are similar to those involved in updating weights for artificial neural networks.
Yet another benefit of incremental updates has to do with using objectives other than "minimize SSE." Suppose that we are given an arbitrary objective function to measure the goodness of a set of clusters. When we process an individual point, we can compute the value of the objective function for each possible cluster assignment, and then choose the one that optimizes the objective. Specific examples of alternative objective functions are given in Section 8.5.2.
On the negative side, updating centroids incrementally introduces an order dependency. In other words, the clusters produced may depend on the order in which the points are processed. Although this can be addressed by randomizing the order in which the points are processed, the basic K-means approach of updating the centroids after all points have been assigned to clusters has no order dependency. Also, incremental updates are slightly more expensive. However, K-means converges rather quickly, and therefore, the number of points switching clusters quickly becomes relatively small.
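The two-update bookkeeping described above can be pictured with the following sketch (ours, not from the text). It reassigns a single point to its closest centroid and adjusts the running means of the old and new clusters; centroids is a float array and counts holds the current cluster sizes.

import numpy as np

def incremental_reassign(x, current, centroids, counts):
    # Closest centroid for point x.
    new = int(np.argmin(np.linalg.norm(centroids - x, axis=1)))
    if new != current:
        # Two updates: remove x from the old cluster's running mean ...
        counts[current] -= 1
        if counts[current] > 0:
            centroids[current] -= (x - centroids[current]) / counts[current]
        # ... and add it to the new cluster's running mean.
        counts[new] += 1
        centroids[new] += (x - centroids[new]) / counts[new]
    return new   # zero updates are performed when the point stays in its cluster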
8.2.3 Bisecting K-means
The bisecting K-means algorithm is a straightforward extension of the basic K-means algorithm that is based on a simple idea: to obtain K clusters, split the set of all points into two clusters, select one of these clusters to split, and so on, until K clusters have been produced.
The details of bisecting K-means are given by Algorithm 8.2.
Algorithm 8.2 Bisecting K-means algorithm.
1: Initialize the list of clusters to contain the cluster consisting of all points.
2: repeat
3:   Remove a cluster from the list of clusters.
4:   {Perform several "trial" bisections of the chosen cluster.}
5:   for i = 1 to number of trials do
6:     Bisect the selected cluster using basic K-means.
7:   end for
8:   Select the two clusters from the bisection with the lowest total SSE.
9:   Add these two clusters to the list of clusters.
10: until The list of clusters contains K clusters.
There are a number of different ways to choose which cluster to split. We can choose the largest cluster at each step, choose the one with the largest SSE, or use a criterion based on both size and SSE. Different choices result in different clusters.
We often refine the resulting clusters by using their centroids as the initial centroids for the basic K-means algorithm. This is necessary because, although the K-means algorithm is guaranteed to find a clustering that represents a local minimum with respect to the SSE, in bisecting K-means we are using the K-means algorithm "locally," i.e., to bisect individual clusters. Therefore, the final set of clusters does not represent a clustering that is a local minimum with respect to the total SSE.
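A compact, self-contained sketch of Algorithm 8.2 is given below (illustrative Python, not from the text). It bisects the cluster with the largest SSE, one of the criteria mentioned above, and keeps the best of several trial bisections; it assumes every cluster chosen for splitting has at least two points.

import numpy as np

def sse(points, centroid):
    return float(np.sum((points - centroid) ** 2))

def two_means(points, n_trials=5, seed=0):
    # Bisect one cluster with basic 2-means; return the split with the lowest total SSE.
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_trials):
        c = points[rng.choice(len(points), size=2, replace=False)]
        for _ in range(100):
            labels = np.argmin(np.linalg.norm(points[:, None] - c[None, :], axis=2), axis=1)
            new_c = np.array([points[labels == j].mean(axis=0) if np.any(labels == j) else c[j]
                              for j in range(2)])
            if np.allclose(new_c, c):
                break
            c = new_c
        total = sum(sse(points[labels == j], c[j]) for j in range(2))
        if best is None or total < best[0]:
            best = (total, labels)
    return best[1]

def bisecting_kmeans(points, k):
    clusters = [points]                      # start with one all-inclusive cluster
    while len(clusters) < k:
        # Choose the cluster with the largest SSE to split (one possible criterion).
        i = int(np.argmax([sse(cl, cl.mean(axis=0)) for cl in clusters]))
        chosen = clusters.pop(i)
        labels = two_means(chosen)
        clusters.extend([chosen[labels == 0], chosen[labels == 1]])
    return clusters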
Example 8.3 (Bisecting K-means and Initialization). To illustrate that bisecting K-means is less susceptible to initialization problems, we show, in Figure 8.8, how bisecting K-means finds four clusters in the data set originally shown in Figure 8.6(a). In iteration 1, two pairs of clusters are found; in iteration 2, the rightmost pair of clusters is split; and in iteration 3, the leftmost pair of clusters is split. Bisecting K-means has less trouble with initialization because it performs several trial bisections and takes the one with the lowest SSE, and because there are only two centroids at each step.
Finally, by recording the sequence of clusterings produced as K-means bisects clusters, we can also use bisecting K-means to produce a hierarchical clustering.
Figure 8.8. Bisecting K-means on the four clusters example. (a) Iteration 1. (b) Iteration 2. (c) Iteration 3.
8.2.4 K-means and Different Types of Clusters
K-means and its variations have a number of limitations with respect to finding different types of clusters. In particular, K-means has difficulty detecting the "natural" clusters, when clusters have non-spherical shapes or widely different sizes or densities. This is illustrated by Figures 8.9, 8.10, and 8.11. In Figure 8.9, K-means cannot find the three natural clusters because one of the clusters is much larger than the other two, and hence, the larger cluster is broken, while one of the smaller clusters is combined with a portion of the larger cluster. In Figure 8.10, K-means fails to find the three natural clusters because the two smaller clusters are much denser than the larger cluster. Finally, in Figure 8.11, K-means finds two clusters that mix portions of the two natural clusters because the shape of the natural clusters is not globular.
The difficulty in these three situations is that the K-means objective function is a mismatch for the kinds of clusters we are trying to find since it is minimized by globular clusters of equal size and density or by clusters that are well separated. However, these limitations can be overcome, in some sense, if the user is willing to accept a clustering that breaks the natural clusters into a number of subclusters. Figure 8.12 shows what happens to the three previous data sets if we find six clusters instead of two or three. Each smaller cluster is pure in the sense that it contains only points from one of the natural clusters.
8.2.5 Strengths and Weaknesses
K-means is simple and can be used for a wide variety of data types. It is also quite efficient, even though multiple runs are often performed. Some variants, including bisecting K-means, are even more efficient, and are less susceptible to initialization problems.
Figure 8.9. K-means with clusters of different size. (a) Original points. (b) Three K-means clusters.
Figure 8.10. K-means with clusters of different density. (a) Original points. (b) Three K-means clusters.
Figure 8.11. K-means with non-globular clusters. (a) Original points. (b) Two K-means clusters.
Figure 8.12. Using K-means to find clusters that are subclusters of the natural clusters. (a) Unequal sizes. (b) Unequal densities. (c) Non-spherical shapes.
K-means is not suitable for all types of data, however. It cannot handle non-globular clusters or clusters of different sizes and densities, although it can typically find pure subclusters if a large enough number of clusters is specified. K-means also has trouble clustering data that contains outliers. Outlier detection and removal can help significantly in such situations. Finally, K-means is restricted to data for which there is a notion of a center (centroid). A related technique, K-medoid clustering, does not have this restriction, but is more expensive.
8.2.6 K-means as an Optimization Problem
Here, we delve into the mathematics behind K-means. This section, which can be skipped without loss of continuity, requires knowledge of calculus through partial derivatives. Familiarity with optimization techniques, especially those based on gradient descent, may also be helpful.
As mentioned earlier, given an objective function such as "minimize SSE," clustering can be treated as an optimization problem. One way to solve this problem—to find a global optimum—is to enumerate all possible ways of dividing the points into clusters and then choose the set of clusters that best satisfies the objective function, e.g., that minimizes the total SSE. Of course, this exhaustive strategy is computationally infeasible and as a result, a more practical approach is needed, even if such an approach finds solutions that are not guaranteed to be optimal. One technique, which is known as gradient descent, is based on picking an initial solution and then repeating the following two steps: compute the change to the solution that best optimizes the objective function and then update the solution.
We assume that the data is one-dimensional, i.e., dist(x, y) = (x − y)^2. This does not change anything essential, but greatly simplifies the notation.
Derivation of K-means as an Algorithm to Minimize the SSE
In this section, we show how the centroid for the K-means algorithm can be mathematically derived when the proximity function is Euclidean distance and the objective is to minimize the SSE. Specifically, we investigate how we can best update a cluster centroid so that the cluster SSE is minimized. In mathematical terms, we seek to minimize Equation 8.1, which we repeat here, specialized for one-dimensional data.
SSE = \sum_{i=1}^{K} \sum_{x \in C_i} (c_i - x)^2        (8.4)
Here, Ci is the ith cluster, x is a point in Ci, and ci is the mean of the ith cluster. See Table 8.1 for a complete list of notation.
We can solve for the kth centroid ck, which minimizes Equation 8.4, by differentiating the SSE, setting it equal to 0, and solving, as indicated below.
\frac{\partial}{\partial c_k} SSE = \frac{\partial}{\partial c_k} \sum_{i=1}^{K} \sum_{x \in C_i} (c_i - x)^2
                                  = \sum_{i=1}^{K} \sum_{x \in C_i} \frac{\partial}{\partial c_k} (c_i - x)^2
                                  = \sum_{x \in C_k} 2 (c_k - x_k) = 0
\sum_{x \in C_k} 2 (c_k - x_k) = 0 \;\Rightarrow\; m_k c_k = \sum_{x \in C_k} x_k \;\Rightarrow\; c_k = \frac{1}{m_k} \sum_{x \in C_k} x_k
Thus, as previously indicated, the best centroid for minimizing the SSE of a cluster is the mean of the points in the cluster.
Derivation of K-means for SAE
To demonstrate that the K-means algorithm can be applied to a variety of different objective functions, we consider how to partition the data into K clusters such that the sum of the Manhattan (L1) distances of points from the center of their clusters is minimized. We are seeking to minimize the sum of the L1 absolute errors (SAE) as given by the following equation, where distL1 is the L1 distance. Again, for notational simplicity, we use one-dimensional data, i.e., distL1 = |ci − x|.
SAE = \sum_{i=1}^{K} \sum_{x \in C_i} dist_{L_1}(c_i, x)        (8.5)
We can solve for the kth centroid ck, which minimizes Equation 8.5, by differentiating the SAE, setting it equal to 0, and solving.
\frac{\partial}{\partial c_k} SAE = \frac{\partial}{\partial c_k} \sum_{i=1}^{K} \sum_{x \in C_i} |c_i - x|
                                  = \sum_{i=1}^{K} \sum_{x \in C_i} \frac{\partial}{\partial c_k} |c_i - x|
                                  = \sum_{x \in C_k} \frac{\partial}{\partial c_k} |c_k - x| = 0
\sum_{x \in C_k} \frac{\partial}{\partial c_k} |c_k - x| = 0 \;\Rightarrow\; \sum_{x \in C_k} sign(x - c_k) = 0
If we solve for ck, we find that ck = median{x ∈ Ck}, the median of the points in the cluster. The median of a group of points is straightforward to compute and less susceptible to distortion by outliers.
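The claim about the median is easy to verify numerically, as in the following one-dimensional sketch (ours), which compares the SAE at the median with the SAE at the mean for a small cluster containing an outlier.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # a one-dimensional cluster with an outlier

def sae(c):
    return np.sum(np.abs(x - c))

# The median (3.0) gives SAE = 101, while the mean (22.0) gives SAE = 156,
# so the median is the better centroid under the L1 objective.
print(sae(np.median(x)), sae(np.mean(x)))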
8.3 Agglomerative Hierarchical Clustering
Hierarchical clustering techniques are a second important category of clustering methods. As with K-means, these approaches are relatively old compared to many clustering algorithms, but they still enjoy widespread use. There are two basic approaches for generating a hierarchical clustering:
Agglomerative: Start with the points as individual clusters and, at each step, merge the closest pair of clusters. This requires defining a notion of cluster proximity.
Divisive: Start with one, all-inclusive cluster and, at each step, split a cluster until only singleton clusters of individual points remain. In this case, we need to decide which cluster to split at each step and how to do the splitting.
Agglomerative hierarchical clustering techniques are by far the most common, and, in this section, we will focus exclusively on these methods. A divisive hierarchical clustering technique is described in Section 9.4.2.
A hierarchical clustering is often displayed graphically using a tree-like diagram called a dendrogram, which displays both the cluster-subcluster relationships and the order in which the clusters were merged (agglomerative view) or split (divisive view).
Figure 8.13. A hierarchical clustering of four points shown as a dendrogram and as nested clusters. (a) Dendrogram. (b) Nested cluster diagram.
For sets of two-dimensional points, such as those that we will use as examples, a hierarchical clustering can also be graphically represented using a nested cluster diagram. Figure 8.13 shows an example of these two types of figures for a set of four two-dimensional points. These points were clustered using the single-link technique that is described in Section 8.3.2.
8.3.1 Basic Agglomerative Hierarchical Clustering Algorithm
Many agglomerative hierarchical clustering techniques are variations on a single approach: starting with individual points as clusters, successively merge the two closest clusters until only one cluster remains. This approach is expressed more formally in Algorithm 8.3.
Algorithm 8.3 Basic agglomerative hierarchical clustering algorithm.
1: Compute the proximity matrix, if necessary.
2: repeat
3:   Merge the closest two clusters.
4:   Update the proximity matrix to reflect the proximity between the new cluster and the original clusters.
5: until Only one cluster remains.
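Algorithm 8.3 can be sketched in a few lines of Python (our own illustration, not from the text). The version below uses the single link (MIN) proximity discussed in Section 8.3.2 and the naive linear search over the proximity matrix analyzed later in this section; it takes a symmetric distance matrix and returns the sequence of merges.

import numpy as np

def single_link_agglomerative(dist):
    m = len(dist)
    d = dist.astype(float).copy()
    np.fill_diagonal(d, np.inf)               # ignore self-distances
    active = list(range(m))                   # indices of the current clusters
    members = {i: [i] for i in range(m)}
    merges = []
    while len(active) > 1:
        # Step 3: merge the closest two clusters (linear search of the proximity matrix).
        sub = d[np.ix_(active, active)]
        i, j = np.unravel_index(np.argmin(sub), sub.shape)
        a, b = active[i], active[j]
        merges.append((members[a], members[b], d[a, b]))
        members[a] = members[a] + members[b]
        # Step 4: update the proximity matrix; single link keeps the minimum distance.
        d[a, :] = np.minimum(d[a, :], d[b, :])
        d[:, a] = d[a, :]
        d[a, a] = np.inf
        active.remove(b)
    return merges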
Defining Proximity between Clusters
The key operation of Algorithm 8.3 is the computation of the proximity between two clusters, and it is the definition of cluster proximity that differentiates the various agglomerative hierarchical techniques that we will discuss. Cluster proximity is typically defined with a particular type of cluster in mind—see Section 8.1.2. For example, many agglomerative hierarchical clustering techniques, such as MIN, MAX, and Group Average, come from a graph-based view of clusters. MIN defines cluster proximity as the proximity between the closest two points that are in different clusters, or using graph terms, the shortest edge between two nodes in different subsets of nodes. This yields contiguity-based clusters as shown in Figure 8.2(c). Alternatively, MAX takes the proximity between the farthest two points in different clusters to be the cluster proximity, or using graph terms, the longest edge between two nodes in different subsets of nodes. (If our proximities are distances, then the names, MIN and MAX, are short and suggestive. For similarities, however, where higher values indicate closer points, the names seem reversed. For that reason, we usually prefer to use the alternative names, single link and complete link, respectively.) Another graph-based approach, the group average technique, defines cluster proximity to be the average pairwise proximities (average length of edges) of all pairs of points from different clusters. Figure 8.14 illustrates these three approaches.
Figure 8.14. Graph-based definitions of cluster proximity. (a) MIN (single link). (b) MAX (complete link). (c) Group average.
If, instead, we take a prototype-based view, in which each cluster is represented by a centroid, different definitions of cluster proximity are more natural. When using centroids, the cluster proximity is commonly defined as the proximity between cluster centroids. An alternative technique, Ward's method, also assumes that a cluster is represented by its centroid, but it measures the proximity between two clusters in terms of the increase in the SSE that results from merging the two clusters.
Like K-means, Ward's method attempts to minimize the sum of the squared distances of points from their cluster centroids.
Time and Space Complexity
The basic agglomerative hierarchical clustering algorithm just presented uses a proximity matrix. This requires the storage of (1/2)m^2 proximities (assuming the proximity matrix is symmetric), where m is the number of data points. The space needed to keep track of the clusters is proportional to the number of clusters, which is m − 1, excluding singleton clusters. Hence, the total space complexity is O(m^2).
The analysis of the basic agglomerative hierarchical clustering algorithm is also straightforward with respect to computational complexity. O(m^2) time is required to compute the proximity matrix. After that step, there are m − 1 iterations involving steps 3 and 4 because there are m clusters at the start and two clusters are merged during each iteration. If performed as a linear search of the proximity matrix, then for the ith iteration, step 3 requires O((m − i + 1)^2) time, which is proportional to the current number of clusters squared. Step 4 only requires O(m − i + 1) time to update the proximity matrix after the merger of two clusters. (A cluster merger affects only O(m − i + 1) proximities for the techniques that we consider.) Without modification, this would yield a time complexity of O(m^3). If the distances from each cluster to all other clusters are stored as a sorted list (or heap), it is possible to reduce the cost of finding the two closest clusters to O(m − i + 1). However, because of the additional complexity of keeping data in a sorted list or heap, the overall time required for a hierarchical clustering based on Algorithm 8.3 is O(m^2 log m).
The space and time complexity of hierarchical clustering severely limits the size of data sets that can be processed. We discuss scalability approaches for clustering algorithms, including hierarchical clustering techniques, in Section 9.5.
8.3.2 Specific Techniques
Sample Data
To illustrate the behavior of the various hierarchical clustering algorithms, we shall use sample data that consists of 6 two-dimensional points, which are shown in Figure 8.15. The x and y coordinates of the points and the Euclidean distances between them are shown in Tables 8.3 and 8.4, respectively.
Figure 8.15. Set of 6 two-dimensional points.
Table 8.3. xy coordinates of 6 points.
Point   x Coordinate   y Coordinate
p1      0.40           0.53
p2      0.22           0.38
p3      0.35           0.32
p4      0.26           0.19
p5      0.08           0.41
p6      0.45           0.30
Table 8.4. Euclidean distance matrix for 6 points.
       p1     p2     p3     p4     p5     p6
p1     0.00   0.24   0.22   0.37   0.34   0.23
p2     0.24   0.00   0.15   0.20   0.14   0.25
p3     0.22   0.15   0.00   0.15   0.28   0.11
p4     0.37   0.20   0.15   0.00   0.29   0.22
p5     0.34   0.14   0.28   0.29   0.00   0.39
p6     0.23   0.25   0.11   0.22   0.39   0.00
Single Link or MIN
For the single link or MIN version of hierarchical clustering, the proximity of two clusters is defined as the minimum of the distance (maximum of the similarity) between any two points in the two different clusters. Using graph terminology, if you start with all points as singleton clusters and add links between points one at a time, shortest links first, then these single links combine the points into clusters. The single link technique is good at handling non-elliptical shapes, but is sensitive to noise and outliers.
Example 8.4 (Single Link). Figure 8.16 shows the result of applying the single link technique to our example data set of six points. Figure 8.16(a) shows the nested clusters as a sequence of nested ellipses, where the numbers associated with the ellipses indicate the order of the clustering. Figure 8.16(b) shows the same information, but as a dendrogram. The height at which two clusters are merged in the dendrogram reflects the distance of the two clusters. For instance, from Table 8.4, we see that the distance between points 3 and 6 is 0.11, and that is the height at which they are joined into one cluster in the dendrogram.
Figure 8.16. Single link clustering of the six points shown in Figure 8.15. (a) Single link clustering. (b) Single link dendrogram.
As another example, the distance between clusters {3,6} and {2,5} is given by
dist({3,6}, {2,5}) = min(dist(3,2), dist(6,2), dist(3,5), dist(6,5))
                   = min(0.15, 0.25, 0.28, 0.39) = 0.15.
Complete Link or MAX or CLIQUE
For the complete link or MAX version of hierarchical clustering, the proximity of two clusters is defined as the maximum of the distance (minimum of the similarity) between any two points in the two different clusters. Using graph terminology, if you start with all points as singleton clusters and add links between points one at a time, shortest links first, then a group of points is not a cluster until all the points in it are completely linked, i.e., form a clique. Complete link is less susceptible to noise and outliers, but it can break large clusters and it favors globular shapes.
Example 8.5 (Complete Link). Figure 8.17 shows the results of applying MAX to the sample data set of six points. As with single link, points 3 and 6 are merged first.
Figure 8.17. Complete link clustering of the six points shown in Figure 8.15. (a) Complete link clustering. (b) Complete link dendrogram.
However, {3,6} is merged with {4}, instead of {2,5} or {1} because
dist({3,6}, {4}) = max(dist(3,4), dist(6,4))
                 = max(0.15, 0.22) = 0.22.
dist({3,6}, {2,5}) = max(dist(3,2), dist(6,2), dist(3,5), dist(6,5))
                   = max(0.15, 0.25, 0.28, 0.39) = 0.39.
dist({3,6}, {1}) = max(dist(3,1), dist(6,1))
                 = max(0.22, 0.23) = 0.23.
Group Average
For the group average version of hierarchical clustering, the proximity of two clusters is defined as the average pairwise proximity among all pairs of points in the different clusters. This is an intermediate approach between the single and complete link approaches.
Figure 8.18. Group average clustering of the six points shown in Figure 8.15. (a) Group average clustering. (b) Group average dendrogram.
Thus, for group average, the cluster proximity proximity(Ci, Cj) of clusters Ci and Cj, which are of size mi and mj, respectively, is expressed by the following equation:
proximity(C_i, C_j) = \frac{\sum_{x \in C_i,\, y \in C_j} proximity(x, y)}{m_i \, m_j}        (8.6)
Example 8.6 (Group Average). Figure 8.18 shows the results of applying the group average approach to the sample data set of six points. To illustrate how group average works, we calculate the distance between some clusters.
dist({3,6,4}, {1}) = (0.22 + 0.37 + 0.23)/(3 ∗ 1) = 0.28
dist({2,5}, {1}) = (0.2357 + 0.3421)/(2 ∗ 1) = 0.28
dist({3,6,4}, {2,5}) = (0.15 + 0.28 + 0.25 + 0.39 + 0.20 + 0.29)/(3 ∗ 2) = 0.26
Because dist({3,6,4}, {2,5}) is smaller than dist({3,6,4}, {1}) and dist({2,5}, {1}), clusters {3,6,4} and {2,5} are merged at the fourth stage.
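Equation 8.6 translates directly into code. The sketch below (ours, not from the text) computes the group average proximity of two clusters from a precomputed distance matrix; applied to the matrix of Table 8.4 it reproduces the values in Example 8.6 up to the rounding of the stored distances.

import numpy as np

def group_average(dist, ci, cj):
    # Average pairwise distance between clusters ci and cj (lists of point indices).
    return dist[np.ix_(ci, cj)].sum() / (len(ci) * len(cj))

# Distance matrix from Table 8.4; points are 0-indexed here, so p1 -> 0, ..., p6 -> 5.
D = np.array([[0.00, 0.24, 0.22, 0.37, 0.34, 0.23],
              [0.24, 0.00, 0.15, 0.20, 0.14, 0.25],
              [0.22, 0.15, 0.00, 0.15, 0.28, 0.11],
              [0.37, 0.20, 0.15, 0.00, 0.29, 0.22],
              [0.34, 0.14, 0.28, 0.29, 0.00, 0.39],
              [0.23, 0.25, 0.11, 0.22, 0.39, 0.00]])

print(group_average(D, [2, 5, 3], [0]))       # {3,6,4} vs {1}: about 0.27
print(group_average(D, [1, 4], [0]))          # {2,5} vs {1}: about 0.29
print(group_average(D, [2, 5, 3], [1, 4]))    # {3,6,4} vs {2,5}: 0.26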
Figure 8.19. Ward's clustering of the six points shown in Figure 8.15: (a) nested cluster diagram, (b) dendrogram.
Ward's Method and Centroid Methods
ForWard’smethod,theproximitybetweentwoclustersisdefinedasthein-creaseinthesquarederrorthatresultswhentwoclustersaremerged.Thus,thismethodusesthesameobjectivefunctionasK-meansclustering.WhileitmayseemthatthisfeaturemakesWard’smethodsomewhatdistinctfromotherhierarchicaltechniques,itcanbeshownmathematicallythatWard’smethodisverysimilartothegroupaveragemethodwhentheproximitybe-tweentwopointsistakentobethesquareofthedistancebetweenthem.Example8.7(Ward’sMethod).Figure8.19showstheresultsofapplyingWard’smethodtothesampledatasetofsixpoints.Theclusteringthatisproducedisdifferentfromthoseproducedbysinglelink,completelink,andgroupaverage.
Centroidmethodscalculatetheproximitybetweentwoclustersbycalcu-latingthedistancebetweenthecentroidsofclusters.ThesetechniquesmayseemsimilartoK-means,butaswehaveremarked,Ward’smethodisthecorrecthierarchicalanalog.
Centroidmethodsalsohaveacharacteristic—oftenconsideredbad—thatisnotpossessedbytheotherhierarchicalclusteringtechniquesthatwehavediscussed:thepossibilityofinversions.Specifically,twoclustersthataremergedmaybemoresimilar(lessdistant)thanthepairofclustersthatweremergedinapreviousstep.Fortheothermethods,thedistancebetween
Table 8.5. Table of Lance-Williams coefficients for common hierarchical clustering approaches.

Clustering Method | αA | αB | β | γ
Single Link       | 1/2 | 1/2 | 0 | −1/2
Complete Link     | 1/2 | 1/2 | 0 | 1/2
Group Average     | mA/(mA + mB) | mB/(mA + mB) | 0 | 0
Centroid          | mA/(mA + mB) | mB/(mA + mB) | −mA mB/(mA + mB)² | 0
Ward's            | (mA + mQ)/(mA + mB + mQ) | (mB + mQ)/(mA + mB + mQ) | −mQ/(mA + mB + mQ) | 0
mergedclustersmonotonicallyincreases(oris,atworst,non-increasing)asweproceedfromsingletonclusterstooneall-inclusivecluster.
8.3.3 The Lance-Williams Formula for Cluster Proximity
Anyoftheclusterproximitiesthatwehavediscussedinthissectioncanbeviewedasachoiceofdifferentparameters(intheLance-WilliamsformulashownbelowinEquation8.7)fortheproximitybetweenclustersQandR,whereRisformedbymergingclustersAandB.Inthisequation,p(.,.)isaproximityfunction,whilemA,mB,andmQarethenumberofpointsinclustersA,B,andQ,respectively.Inotherwords,afterwemergeclustersAandBtoformclusterR,theproximityofthenewcluster,R,toanexistingcluster,Q,isalinearfunctionoftheproximitiesofQwithrespecttotheoriginalclustersAandB.Table8.5showsthevaluesofthesecoefficientsforthetechniquesthatwehavediscussed.
p(R, Q) = αA p(A, Q) + αB p(B, Q) + β p(A, B) + γ |p(A, Q) − p(B, Q)|     (8.7)

Any hierarchical clustering technique that can be expressed using the Lance-Williams formula does not need to keep the original data points. Instead, the proximity matrix is updated as clustering occurs. While a general formula is appealing, especially for implementation, it is easier to understand the different hierarchical methods by looking directly at the definition of cluster proximity that each method uses.
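A minimal sketch of one Lance-Williams update, assuming the proximities are distances; the function and argument names are ours. With the single link coefficients from Table 8.5 the update reduces to the minimum of the two old proximities.

```python
# Sketch of a single Lance-Williams proximity update (Equation 8.7).
def lance_williams(p_AQ, p_BQ, p_AB, alpha_A, alpha_B, beta, gamma):
    """p(R, Q) for R = A merged with B, as a linear function of the old proximities."""
    return alpha_A * p_AQ + alpha_B * p_BQ + beta * p_AB + gamma * abs(p_AQ - p_BQ)

# Single link: alpha_A = alpha_B = 1/2, beta = 0, gamma = -1/2, which reduces to min.
p_RQ = lance_williams(p_AQ=0.15, p_BQ=0.25, p_AB=0.11,
                      alpha_A=0.5, alpha_B=0.5, beta=0.0, gamma=-0.5)
print(p_RQ)  # ~0.15, i.e., min(0.15, 0.25)
```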
8.3.4 Key Issues in Hierarchical Clustering
Lack of a Global Objective Function
Wepreviouslymentionedthatagglomerativehierarchicalclusteringcannotbeviewedasgloballyoptimizinganobjectivefunction.Instead,agglomerativehierarchicalclusteringtechniquesusevariouscriteriatodecidelocally,ateach
step,whichclustersshouldbemerged(orsplitfordivisiveapproaches).Thisapproachyieldsclusteringalgorithmsthatavoidthedifficultyofattemptingtosolveahardcombinatorialoptimizationproblem.(Itcanbeshownthatthegeneralclusteringproblemforanobjectivefunctionsuchas“minimizeSSE”iscomputationallyinfeasible.)Furthermore,suchapproachesdonothaveproblemswithlocalminimaordifficultiesinchoosinginitialpoints.Ofcourse,thetimecomplexityofO(m2logm)andthespacecomplexityofO(m2)areprohibitiveinmanycases.
Ability to Handle Different Cluster Sizes
Oneaspectofagglomerativehierarchicalclusteringthatwehavenotyetdis-cussedishowtotreattherelativesizesofthepairsofclustersthataremerged.(Thisdiscussionappliesonlytoclusterproximityschemesthatinvolvesums,suchascentroid,Ward’s,andgroupaverage.)Therearetwoapproaches:weighted,whichtreatsallclustersequally,andunweighted,whichtakesthenumberofpointsineachclusterintoaccount.Notethattheterminologyofweightedorunweightedreferstothedatapoints,nottheclusters.Inotherwords,treatingclustersofunequalsizeequallygivesdifferentweightstothepointsindifferentclusters,whiletakingtheclustersizeintoaccountgivespointsindifferentclustersthesameweight.
We will illustrate this using the group average technique discussed in Section 8.3.2, which is the unweighted version of the group average technique. In the clustering literature, the full name of this approach is the Unweighted Pair Group Method using Arithmetic averages (UPGMA). In Table 8.5, which gives the formula for updating cluster similarity, the coefficients for UPGMA involve the size of each of the clusters that were merged: αA = mA/(mA + mB), αB = mB/(mA + mB), β = 0, γ = 0. For the weighted version of group average—known as WPGMA—the coefficients are constants: αA = 1/2, αB = 1/2, β = 0, γ = 0. In general, unweighted approaches are preferred unless there is reason to believe that individual points should have different weights; e.g., perhaps classes of objects have been unevenly sampled.

Merging Decisions Are Final
Agglomerativehierarchicalclusteringalgorithmstendtomakegoodlocalde-cisionsaboutcombiningtwoclusterssincetheycanuseinformationaboutthepairwisesimilarityofallpoints.However,onceadecisionismadetomergetwoclusters,itcannotbeundoneatalatertime.Thisapproachpreventsalocaloptimizationcriterionfrombecomingaglobaloptimizationcriterion.
Forexample,althoughthe“minimizesquarederror”criterionfromK-meansisusedindecidingwhichclusterstomergeinWard’smethod,theclustersateachleveldonotrepresentlocalminimawithrespecttothetotalSSE.Indeed,theclustersarenotevenstable,inthesensethatapointinoneclustermaybeclosertothecentroidofsomeotherclusterthanitistothecentroidofitscurrentcluster.Nonetheless,Ward’smethodisoftenusedasarobustmethodofinitializingaK-meansclustering,indicatingthatalocal“minimizesquarederror”objectivefunctiondoeshaveaconnectiontoaglobal“minimizesquarederror”objectivefunction.
Therearesometechniquesthatattempttoovercomethelimitationthatmergesarefinal.Oneapproachattemptstofixupthehierarchicalclusteringbymovingbranchesofthetreearoundsoastoimproveaglobalobjectivefunction.AnotherapproachusesapartitionalclusteringtechniquesuchasK-meanstocreatemanysmallclusters,andthenperformshierarchicalclusteringusingthesesmallclustersasthestartingpoint.
8.3.5 Strengths and Weaknesses
Thestrengthsandweaknessofspecificagglomerativehierarchicalclusteringalgorithmswerediscussedabove.Moregenerally,suchalgorithmsaretypi-callyusedbecausetheunderlyingapplication,e.g.,creationofataxonomy,requiresahierarchy.Also,therehavebeensomestudiesthatsuggestthatthesealgorithmscanproducebetter-qualityclusters.However,agglomerativehierarchicalclusteringalgorithmsareexpensiveintermsoftheircomputa-tionalandstoragerequirements.Thefactthatallmergesarefinalcanalsocausetroublefornoisy,high-dimensionaldata,suchasdocumentdata.Inturn,thesetwoproblemscanbeaddressedtosomedegreebyfirstpartiallyclusteringthedatausinganothertechnique,suchasK-means.
8.4 DBSCAN
Density-basedclusteringlocatesregionsofhighdensitythatareseparatedfromoneanotherbyregionsoflowdensity.DBSCANisasimpleandeffec-tivedensity-basedclusteringalgorithmthatillustratesanumberofimportantconceptsthatareimportantforanydensity-basedclusteringapproach.Inthissection,wefocussolelyonDBSCANafterfirstconsideringthekeynotionofdensity.Otheralgorithmsforfindingdensity-basedclustersaredescribedinthenextchapter.
8.4.1 Traditional Density: Center-Based Approach
Althoughtherearenotasmanyapproachesfordefiningdensityastherearefordefiningsimilarity,thereareseveraldistinctmethods.Inthissectionwedis-cussthecenter-basedapproachonwhichDBSCANisbased.OtherdefinitionsofdensitywillbepresentedinChapter9.
Inthecenter-basedapproach,densityisestimatedforaparticularpointinthedatasetbycountingthenumberofpointswithinaspecifiedradius,Eps,ofthatpoint.Thisincludesthepointitself.ThistechniqueisgraphicallyillustratedbyFigure8.20.ThenumberofpointswithinaradiusofEpsofpointAis7,includingAitself.
This method is simple to implement, but the density of any point will depend on the specified radius. For instance, if the radius is large enough, then all points will have a density of m, the number of points in the data set. Likewise, if the radius is too small, then all points will have a density of 1. An approach for deciding on the appropriate radius for low-dimensional data is given in the next section in the context of our discussion of DBSCAN.

Classification of Points According to Center-Based Density
Thecenter-basedapproachtodensityallowsustoclassifyapointasbeing(1)intheinteriorofadenseregion(acorepoint),(2)ontheedgeofadenseregion(aborderpoint),or(3)inasparselyoccupiedregion(anoiseorbackgroundpoint).Figure8.21graphicallyillustratestheconceptsofcore,border,andnoisepointsusingacollectionoftwo-dimensionalpoints.Thefollowingtextprovidesamoreprecisedescription.
Core points: These points are in the interior of a density-based cluster. A point is a core point if the number of points within a given neighborhood around the point, as determined by the distance function and a user-specified distance parameter, Eps, exceeds a certain threshold, MinPts, which is also a user-specified parameter. In Figure 8.21, point A is a core point, for the indicated radius (Eps), if MinPts ≤ 7.

Border points: A border point is not a core point, but falls within the neighborhood of a core point. In Figure 8.21, point B is a border point. A border point can fall within the neighborhoods of several core points.

Noise points: A noise point is any point that is neither a core point nor a border point. In Figure 8.21, point C is a noise point.
Figure 8.20. Center-based density.
Figure 8.21. Core, border, and noise points.
8.4.2 The DBSCAN Algorithm
Given the previous definitions of core points, border points, and noise points, the DBSCAN algorithm can be informally described as follows. Any two core points that are close enough—within a distance Eps of one another—are put in the same cluster. Likewise, any border point that is close enough to a core point is put in the same cluster as the core point. (Ties may need to be resolved if a border point is close to core points from different clusters.) Noise points are discarded. The formal details are given in Algorithm 8.4. This algorithm uses the same concepts and finds the same clusters as the original DBSCAN, but is optimized for simplicity, not efficiency.

Algorithm 8.4 DBSCAN algorithm.
1: Label all points as core, border, or noise points.
2: Eliminate noise points.
3: Put an edge between all core points that are within Eps of each other.
4: Make each group of connected core points into a separate cluster.
5: Assign each border point to one of the clusters of its associated core points.
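The following is a minimal sketch of Algorithm 8.4 in Python, assuming Euclidean distance and NumPy/SciPy; the function name and parameters (`eps`, `min_pts`) are ours, not part of the text, and the code favors clarity over efficiency, exactly as the algorithm statement does.

```python
import numpy as np
from scipy.spatial.distance import cdist

def dbscan(points, eps, min_pts):
    """Sketch of Algorithm 8.4: returns one cluster label per point (-1 = noise)."""
    m = len(points)
    d = cdist(points, points)                                  # full O(m^2) distance matrix
    neighbors = [np.where(d[i] <= eps)[0] for i in range(m)]   # Eps-neighborhoods (include self)

    # Step 1: a core point has at least min_pts points within Eps (itself included).
    core = np.array([len(nb) >= min_pts for nb in neighbors])

    # Steps 3-4: connect core points within Eps of each other; each connected
    # component of core points becomes a separate cluster.
    labels = np.full(m, -1)
    cluster_id = 0
    for i in range(m):
        if core[i] and labels[i] == -1:
            stack = [i]
            labels[i] = cluster_id
            while stack:
                j = stack.pop()
                for k in neighbors[j]:
                    if core[k] and labels[k] == -1:
                        labels[k] = cluster_id
                        stack.append(k)
            cluster_id += 1

    # Step 5: assign each border point to a cluster of one of its core neighbors;
    # points with no core neighbor keep the noise label (step 2).
    for i in range(m):
        if not core[i]:
            core_nb = [k for k in neighbors[i] if core[k]]
            if core_nb:
                labels[i] = labels[core_nb[0]]
    return labels
```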
Time and Space Complexity
ThebasictimecomplexityoftheDBSCANalgorithmisO(m×timetofindpointsintheEps-neighborhood),wheremisthenumberofpoints.Intheworstcase,thiscomplexityisO(m2).However,inlow-dimensionalspaces,therearedatastructures,suchaskd-trees,thatallowefficientretrievalofall
points within a given distance of a specified point, and the time complexity can be as low as O(m log m). The space requirement of DBSCAN, even for high-dimensional data, is O(m) because it is only necessary to keep a small amount of data for each point, i.e., the cluster label and the identification of each point as a core, border, or noise point.

Selection of DBSCAN Parameters
Thereis,ofcourse,theissueofhowtodeterminetheparametersEpsandMinPts.Thebasicapproachistolookatthebehaviorofthedistancefromapointtoitskthnearestneighbor,whichwewillcallthek-dist.Forpointsthatbelongtosomecluster,thevalueofk-distwillbesmallifkisnotlargerthantheclustersize.Notethattherewillbesomevariation,dependingonthedensityoftheclusterandtherandomdistributionofpoints,butonaverage,therangeofvariationwillnotbehugeiftheclusterdensitiesarenotradicallydifferent.However,forpointsthatarenotinacluster,suchasnoisepoints,thek-distwillberelativelylarge.Therefore,ifwecomputethek-distforallthedatapointsforsomek,sorttheminincreasingorder,andthenplotthesortedvalues,weexpecttoseeasharpchangeatthevalueofk-distthatcorrespondstoasuitablevalueofEps.IfweselectthisdistanceastheEpsparameterandtakethevalueofkastheMinPtsparameter,thenpointsforwhichk-distislessthanEpswillbelabeledascorepoints,whileotherpointswillbelabeledasnoiseorborderpoints.
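A sketch of the k-dist computation just described, assuming the data is an m × d NumPy array; the default k = 4 follows the value recommended in the next paragraph for two-dimensional data.

```python
# Sketch of the k-dist heuristic for choosing Eps: compute each point's distance to
# its k-th nearest neighbor, sort the values, and look for the sharp bend ("knee").
import numpy as np
from scipy.spatial.distance import cdist

def sorted_k_dist(points, k=4):
    d = cdist(points, points)
    d.sort(axis=1)               # row i ascending; column 0 is the distance to itself (0)
    return np.sort(d[:, k])      # k-th nearest neighbor distance, sorted for plotting

# Eps = value of sorted_k_dist(points) at the sharp increase; MinPts = k.
```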
Figure 8.22 shows a sample data set, while the k-dist graph for the data is given in Figure 8.23. The value of Eps that is determined in this way depends on k, but does not change dramatically as k changes. If the value of k is too small, then even a small number of closely spaced points that are noise or outliers will be incorrectly labeled as clusters. If the value of k is too large, then small clusters (of size less than k) are likely to be labeled as noise. The original DBSCAN algorithm used a value of k = 4, which appears to be a reasonable value for most two-dimensional data sets.

Clusters of Varying Density
DBSCANcanhavetroublewithdensityifthedensityofclustersvarieswidely.ConsiderFigure8.24,whichshowsfourclustersembeddedinnoise.Theden-sityoftheclustersandnoiseregionsisindicatedbytheirdarkness.Thenoisearoundthepairofdenserclusters,AandB,hasthesamedensityasclustersCandD.IftheEpsthresholdislowenoughthatDBSCANfindsCandDasclusters,thenAandBandthepointssurroundingthemwillbecomeasingle
Figure 8.22. Sample data.
Figure 8.23. K-dist plot for sample data (sorted distance to the 4th nearest neighbor for each point).
Figure 8.24. Four clusters (A, B, C, D) embedded in noise.
cluster. If the Eps threshold is high enough that DBSCAN finds A and B as separate clusters, and the points surrounding them are marked as noise, then C and D and the points surrounding them will also be marked as noise.

An Example
ToillustratetheuseofDBSCAN,weshowtheclustersthatitfindsintherelativelycomplicatedtwo-dimensionaldatasetshowninFigure8.22.Thisdatasetconsistsof3000two-dimensionalpoints.TheEpsthresholdforthisdatawasfoundbyplottingthesorteddistancesofthefourthnearestneighborofeachpoint(Figure8.23)andidentifyingthevalueatwhichthereisasharpincrease.WeselectedEps=10,whichcorrespondstothekneeofthecurve.TheclustersfoundbyDBSCANusingtheseparameters,i.e.,MinPts=4andEps=10,areshowninFigure8.25(a).Thecorepoints,borderpoints,andnoisepointsaredisplayedinFigure8.25(b).
8.4.3 Strengths and Weaknesses
BecauseDBSCANusesadensity-baseddefinitionofacluster,itisrelativelyresistanttonoiseandcanhandleclustersofarbitraryshapesandsizes.Thus,
Figure 8.25. DBSCAN clustering of 3000 two-dimensional points: (a) clusters found by DBSCAN; (b) core, border, and noise points ("+" marks border points, "x" marks noise points).
DBSCANcanfindmanyclustersthatcouldnotbefoundusingK-means,suchasthoseinFigure8.22.Asindicatedpreviously,however,DBSCANhastroublewhentheclustershavewidelyvaryingdensities.Italsohastroublewithhigh-dimensionaldatabecausedensityismoredifficulttodefineforsuchdata.OnepossibleapproachtodealingwithsuchissuesisgiveninSection9.4.8.Finally,DBSCANcanbeexpensivewhenthecomputationofnearestneighborsrequirescomputingallpairwiseproximities,asisusuallythecaseforhigh-dimensionaldata.
8.5 Cluster Evaluation
Insupervisedclassification,theevaluationoftheresultingclassificationmodelisanintegralpartoftheprocessofdevelopingaclassificationmodel,andtherearewell-acceptedevaluationmeasuresandprocedures,e.g.,accuracyandcross-validation,respectively.However,becauseofitsverynature,clusterevaluationisnotawell-developedorcommonlyusedpartofclusteranalysis.Nonetheless,clusterevaluation,orclustervalidationasitismoretradition-allycalled,isimportant,andthissectionwillreviewsomeofthemostcommonandeasilyappliedapproaches.
Theremightbesomeconfusionastowhyclusterevaluationisnecessary.Manytimes,clusteranalysisisconductedasapartofanexploratorydataanalysis.Hence,evaluationseemslikeanunnecessarilycomplicatedadditiontowhatissupposedtobeaninformalprocess.Furthermore,sincethereareanumberofdifferenttypesofclusters—insomesense,eachclusteringalgorithmdefinesitsowntypeofcluster—itmayseemthateachsituationmightrequireadifferentevaluationmeasure.Forinstance,K-meansclustersmightbeevaluatedintermsoftheSSE,butfordensity-basedclusters,whichneednotbeglobular,SSEwouldnotworkwellatall.
Nonetheless,clusterevaluationshouldbeapartofanyclusteranalysis.Akeymotivationisthatalmosteveryclusteringalgorithmwillfindclustersinadataset,evenifthatdatasethasnonaturalclusterstructure.Forinstance,considerFigure8.26,whichshowstheresultofclustering100pointsthatarerandomly(uniformly)distributedontheunitsquare.TheoriginalpointsareshowninFigure8.26(a),whiletheclustersfoundbyDBSCAN,K-means,andcompletelinkareshowninFigures8.26(b),8.26(c),and8.26(d),respectively.SinceDBSCANfoundthreeclusters(afterwesetEpsbylookingatthedistancesofthefourthnearestneighbors),wesetK-meansandcompletelinktofindthreeclustersaswell.(InFigure8.26(b)thenoiseisshownbythesmallmarkers.)However,theclustersdonotlookcompellingforanyof
thethreemethods.Inhigherdimensions,suchproblemscannotbesoeasilydetected.
8.5.1 Overview
Beingabletodistinguishwhetherthereisnon-randomstructureinthedataisjustoneimportantaspectofclustervalidation.Thefollowingisalistofseveralimportantissuesforclustervalidation.
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Determining the correct number of clusters.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information.
4. Comparing the results of a cluster analysis to externally known results, such as externally provided class labels.
5. Comparing two sets of clusters to determine which is better.
Noticethatitems1,2,and3donotmakeuseofanyexternalinformation—theyareunsupervisedtechniques—whileitem4requiresexternalinformation.Item5canbeperformedineitherasupervisedoranunsupervisedmanner.Afurtherdistinctioncanbemadewithrespecttoitems3,4,and5:Dowewanttoevaluatetheentireclusteringorjustindividualclusters?
Whileitispossibletodevelopvariousnumericalmeasurestoassessthedifferentaspectsofclustervaliditymentionedabove,thereareanumberofchallenges.First,ameasureofclustervaliditymaybequitelimitedinthescopeofitsapplicability.Forexample,mostworkonmeasuresofclusteringtendencyhasbeendonefortwo-orthree-dimensionalspatialdata.Second,weneedaframeworktointerpretanymeasure.Ifweobtainavalueof10forameasurethatevaluateshowwellclusterlabelsmatchexternallyprovidedclasslabels,doesthisvaluerepresentagood,fair,orpoormatch?Thegoodnessofamatchoftencanbemeasuredbylookingatthestatisticaldistributionofthisvalue,i.e.,howlikelyitisthatsuchavalueoccursbychance.Finally,ifameasureistoocomplicatedtoapplyortounderstand,thenfewwilluseit.
Theevaluationmeasures,orindices,thatareappliedtojudgevariousaspectsofclustervalidityaretraditionallyclassifiedintothefollowingthreetypes.
Figure 8.26. Clustering of 100 uniformly distributed points: (a) original points; (b) three clusters found by DBSCAN; (c) three clusters found by K-means; (d) three clusters found by complete link.
Unsupervised. Measures the goodness of a clustering structure without respect to external information. An example of this is the SSE. Unsupervised measures of cluster validity are often further divided into two classes: measures of cluster cohesion (compactness, tightness), which determine how closely related the objects in a cluster are, and measures of cluster separation (isolation), which determine how distinct or well-separated a cluster is from other clusters. Unsupervised measures are often called internal indices because they use only information present in the data set.

Supervised. Measures the extent to which the clustering structure discovered by a clustering algorithm matches some external structure. An example of a supervised index is entropy, which measures how well cluster labels match externally supplied class labels. Supervised measures are often called external indices because they use information not present in the data set.

Relative. Compares different clusterings or clusters. A relative cluster evaluation measure is a supervised or unsupervised evaluation measure that is used for the purpose of comparison. Thus, relative measures are not actually a separate type of cluster evaluation measure, but are instead a specific use of such measures. As an example, two K-means clusterings can be compared using either the SSE or entropy.

In the remainder of this section, we provide specific details concerning cluster validity. We first describe topics related to unsupervised cluster evaluation, beginning with (1) measures based on cohesion and separation, and (2) two techniques based on the proximity matrix. Since these approaches are useful only for partitional sets of clusters, we also describe the popular cophenetic correlation coefficient, which can be used for the unsupervised evaluation of a hierarchical clustering. We end our discussion of unsupervised evaluation with brief discussions about finding the correct number of clusters and evaluating clustering tendency. We then consider supervised approaches to cluster validity, such as entropy, purity, and the Jaccard measure. We conclude this section with a short discussion of how to interpret the values of (unsupervised or supervised) validity measures.
8.5.2 Unsupervised Cluster Evaluation Using Cohesion and Separation
Manyinternalmeasuresofclustervalidityforpartitionalclusteringschemesarebasedonthenotionsofcohesionorseparation.Inthissection,weuseclustervaliditymeasuresforprototype-andgraph-basedclusteringtechniquestoexplorethesenotionsinsomedetail.Intheprocess,wewillalsoseesomeinterestingrelationshipsbetweenprototype-andgraph-basedclustering.
Ingeneral,wecanconsiderexpressingoverallclustervalidityforasetofKclustersasaweightedsumofthevalidityofindividualclusters,
$$ \text{overall validity} = \sum_{i=1}^{K} w_i\, \text{validity}(C_i). \qquad (8.8) $$
The validity function can be cohesion, separation, or some combination of these quantities. The weights will vary depending on the cluster validity measure. In some cases, the weights are simply 1 or the size of the cluster, while in other cases they reflect a more complicated property, such as the square root of the cohesion. See Table 8.6. If the validity function is cohesion, then higher values are better. If it is separation, then lower values are better.

Graph-Based View of Cohesion and Separation
Forgraph-basedclusters,thecohesionofaclustercanbedefinedasthesumoftheweightsofthelinksintheproximitygraphthatconnectpointswithinthecluster.SeeFigure8.27(a).(Recallthattheproximitygraphhasdataobjectsasnodes,alinkbetweeneachpairofdataobjects,andaweightassignedtoeachlinkthatistheproximitybetweenthetwodataobjectsconnectedbythelink.)Likewise,theseparationbetweentwoclusterscanbemeasuredbythesumoftheweightsofthelinksfrompointsinoneclustertopointsintheothercluster.ThisisillustratedinFigure8.27(b).
Mathematically,cohesionandseparationforagraph-basedclustercanbeexpressedusingEquations8.9and8.10,respectively.Theproximityfunctioncanbeasimilarity,adissimilarity,orasimplefunctionofthesequantities.
$$ \text{cohesion}(C_i) = \sum_{\mathbf{x} \in C_i} \sum_{\mathbf{y} \in C_i} \text{proximity}(\mathbf{x}, \mathbf{y}) \qquad (8.9) $$

$$ \text{separation}(C_i, C_j) = \sum_{\mathbf{x} \in C_i} \sum_{\mathbf{y} \in C_j} \text{proximity}(\mathbf{x}, \mathbf{y}) \qquad (8.10) $$
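A sketch of Equations 8.9 and 8.10, assuming a precomputed m × m proximity matrix and clusters given as lists of point indices (these names are ours). Note that the literal double sum counts each unordered pair twice and includes each point's proximity to itself, exactly as the formulas are written.

```python
import numpy as np

def cohesion(prox, cluster):
    """Equation 8.9: sum of proximities between all pairs of points in the cluster."""
    idx = np.asarray(cluster)
    return prox[np.ix_(idx, idx)].sum()

def separation(prox, cluster_i, cluster_j):
    """Equation 8.10: sum of proximities between points in the two clusters."""
    return prox[np.ix_(np.asarray(cluster_i), np.asarray(cluster_j))].sum()
```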
Figure 8.27. Graph-based view of cluster cohesion (a) and separation (b).
Prototype-Based View of Cohesion and Separation
Forprototype-basedclusters,thecohesionofaclustercanbedefinedasthesumoftheproximitieswithrespecttotheprototype(centroidormedoid)ofthecluster.Similarly,theseparationbetweentwoclusterscanbemeasuredbytheproximityofthetwoclusterprototypes.ThisisillustratedinFigure8.28,wherethecentroidofaclusterisindicatedbya“+”.
Cohesionforaprototype-basedclusterisgiveninEquation8.11,whiletwomeasuresforseparationaregiveninEquations8.12and8.13,respec-tively,whereciistheprototype(centroid)ofclusterCiandcistheoverallprototype(centroid).Therearetwomeasuresforseparationbecause,aswewillseeshortly,theseparationofclusterprototypesfromanoverallprototypeissometimesdirectlyrelatedtotheseparationofclusterprototypesfromoneanother.NotethatEquation8.11istheclusterSSEifweletproximitybethesquaredEuclideandistance.
$$ \text{cohesion}(C_i) = \sum_{\mathbf{x} \in C_i} \text{proximity}(\mathbf{x}, \mathbf{c}_i) \qquad (8.11) $$

$$ \text{separation}(C_i, C_j) = \text{proximity}(\mathbf{c}_i, \mathbf{c}_j) \qquad (8.12) $$

$$ \text{separation}(C_i) = \text{proximity}(\mathbf{c}_i, \mathbf{c}) \qquad (8.13) $$
Overall Measures of Cohesion and Separation
Thepreviousdefinitionsofclustercohesionandseparationgaveussomesim-pleandwell-definedmeasuresofclustervaliditythatcanbecombinedintoanoverallmeasureofclustervaliditybyusingaweightedsum,asindicated
Figure 8.28. Prototype-based view of cluster cohesion (a) and separation (b); the centroid of each cluster is indicated by a "+".
inEquation8.8.However,weneedtodecidewhatweightstouse.Notsur-prisingly,theweightsusedcanvarywidely,althoughtypicallytheyaresomemeasureofclustersize.
Table8.6providesexamplesofvaliditymeasuresbasedoncohesionandseparation.I1isameasureofcohesionintermsofthepairwiseproximityofobjectsintheclusterdividedbytheclustersize.I2isameasureofcohesionbasedonthesumoftheproximitiesofobjectsintheclustertotheclustercentroid.E1isameasureofseparationdefinedastheproximityofaclustercentroidtotheoverallcentroidmultipliedbythenumberofobjectsinthecluster.G1,whichisameasurebasedonbothcohesionandseparation,isthesumofthepairwiseproximityofallobjectsintheclusterwithallobjectsoutsidethecluster—thetotalweightoftheedgesoftheproximitygraphthatmustbecuttoseparatetheclusterfromallotherclusters—dividedbythesumofthepairwiseproximityofobjectsinthecluster.
Table 8.6. Table of graph-based cluster evaluation measures.

Name | Cluster Measure | Cluster Weight | Type
I1 | Σ_{x∈Ci} Σ_{y∈Ci} proximity(x, y) | 1/mi | graph-based cohesion
I2 | Σ_{x∈Ci} proximity(x, ci) | 1 | prototype-based cohesion
E1 | proximity(ci, c) | mi | prototype-based separation
G1 | Σ_{j=1, j≠i}^{k} Σ_{x∈Ci} Σ_{y∈Cj} proximity(x, y) | 1 / (Σ_{x∈Ci} Σ_{y∈Ci} proximity(x, y)) | graph-based separation and cohesion
Notethatanyunsupervisedmeasureofclustervaliditypotentiallycanbeusedasanobjectivefunctionforaclusteringalgorithmandviceversa.TheCLUsteringTOolkit(CLUTO)(seethebibliographicnotes)usestheclusterevaluationmeasuresdescribedinTable8.6,aswellassomeotherevaluationmeasuresnotmentionedhere,todrivetheclusteringprocess.ItdoesthisbyusinganalgorithmthatissimilartotheincrementalK-meansalgorithmdis-cussedinSection8.2.2.Specifically,eachpointisassignedtotheclusterthatproducesthebestvaluefortheclusterevaluationfunction.Theclustereval-uationmeasureI2correspondstotraditionalK-meansandproducesclustersthathavegoodSSEvalues.TheothermeasuresproduceclustersthatarenotasgoodwithrespecttoSSE,butthataremoreoptimalwithrespecttothespecifiedclustervaliditymeasure.
Relationship between Prototype-Based Cohesion and Graph-Based Cohesion
Whilethegraph-basedandprototype-basedapproachestomeasuringtheco-hesionandseparationofaclusterseemdistinct,forsomeproximitymeasurestheyareequivalent.Forinstance,fortheSSEandpointsinEuclideanspace,itcanbeshown(Equation8.14)thattheaveragepairwisedistancebetweenthepointsinaclusterisequivalenttotheSSEofthecluster.SeeExercise27onpage566.
$$ \text{Cluster SSE} = \sum_{\mathbf{x} \in C_i} \text{dist}(\mathbf{c}_i, \mathbf{x})^2 = \frac{1}{2 m_i} \sum_{\mathbf{x} \in C_i} \sum_{\mathbf{y} \in C_i} \text{dist}(\mathbf{x}, \mathbf{y})^2 \qquad (8.14) $$
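Equation 8.14 is easy to check numerically; the random points below are illustrative only and are not data from the text.

```python
# Quick numerical check of Equation 8.14: the cluster SSE equals the sum of
# squared pairwise distances divided by 2 * mi.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((10, 2))                      # one cluster of mi = 10 points
c = X.mean(axis=0)                           # centroid

sse = ((X - c) ** 2).sum()
pairwise_sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)  # squared distances
print(np.isclose(sse, pairwise_sq.sum() / (2 * len(X))))          # True
```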
Two Approaches to Prototype-Based Separation
WhenproximityismeasuredbyEuclideandistance,thetraditionalmeasureofseparationbetweenclustersisthebetweengroupsumofsquares(SSB),whichisthesumofthesquareddistanceofaclustercentroid,ci,totheoverallmean,c,ofallthedatapoints.BysummingtheSSBoverallclusters,weobtainthetotalSSB,whichisgivenbyEquation8.15,whereciisthemeanoftheithclusterandcistheoverallmean.ThehigherthetotalSSBofaclustering,themoreseparatedtheclustersarefromoneanother.
$$ \text{Total SSB} = \sum_{i=1}^{K} m_i\, \text{dist}(\mathbf{c}_i, \mathbf{c})^2 \qquad (8.15) $$
ItisstraightforwardtoshowthatthetotalSSBisdirectlyrelatedtothe
pairwisedistancesbetweenthecentroids.Inparticular,iftheclustersizesare
equal,i.e.,mi=m/K,thenthisrelationshiptakesthesimpleformgivenbyEquation8.16.(SeeExercise28onpage566.)ItisthistypeofequivalencethatmotivatesthedefinitionofprototypeseparationintermsofbothEquations8.12and8.13.
$$ \text{Total SSB} = \frac{1}{2K} \sum_{i=1}^{K} \sum_{j=1}^{K} \frac{m}{K}\, \text{dist}(\mathbf{c}_i, \mathbf{c}_j)^2 \qquad (8.16) $$
Relationship between Cohesion and Separation
Insomecases,thereisalsoastrongrelationshipbetweencohesionandsepara-tion.Specifically,itispossibletoshowthatthesumofthetotalSSEandthetotalSSBisaconstant;i.e.,thatitisequaltothetotalsumofsquares(TSS),whichisthesumofsquaresofthedistanceofeachpointtotheoverallmeanofthedata.TheimportanceofthisresultisthatminimizingSSE(cohesion)isequivalenttomaximizingSSB(separation).
We provide the proof of this fact below, since the approach illustrates techniques that are also applicable to proving the relationships stated in the last two sections. To simplify the notation, we assume that the data is one-dimensional, i.e., dist(x, y) = (x − y)². Also, we use the fact that the cross-term $\sum_{i=1}^{K} \sum_{x \in C_i} (x - c_i)(c - c_i)$ is 0. (See Exercise 29 on page 566.)
$$
\begin{aligned}
\text{TSS} &= \sum_{i=1}^{K} \sum_{x \in C_i} (x - c)^2 \\
 &= \sum_{i=1}^{K} \sum_{x \in C_i} \bigl((x - c_i) - (c - c_i)\bigr)^2 \\
 &= \sum_{i=1}^{K} \sum_{x \in C_i} (x - c_i)^2 \;-\; 2 \sum_{i=1}^{K} \sum_{x \in C_i} (x - c_i)(c - c_i) \;+\; \sum_{i=1}^{K} \sum_{x \in C_i} (c - c_i)^2 \\
 &= \sum_{i=1}^{K} \sum_{x \in C_i} (x - c_i)^2 \;+\; \sum_{i=1}^{K} \sum_{x \in C_i} (c - c_i)^2 \\
 &= \sum_{i=1}^{K} \sum_{x \in C_i} (x - c_i)^2 \;+\; \sum_{i=1}^{K} |C_i| (c - c_i)^2 \\
 &= \text{SSE} + \text{SSB}
\end{aligned}
$$
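The identity can also be checked numerically; the two one-dimensional clusters below are arbitrary synthetic data, not data from the text.

```python
# Illustrative check that TSS = SSE + SSB for an arbitrary partition of 1-D data.
import numpy as np

rng = np.random.default_rng(1)
clusters = [rng.random(5), rng.random(8) + 2.0]   # two arbitrary 1-D clusters
x_all = np.concatenate(clusters)
c = x_all.mean()                                  # overall mean

sse = sum(((xs - xs.mean()) ** 2).sum() for xs in clusters)
ssb = sum(len(xs) * (xs.mean() - c) ** 2 for xs in clusters)
tss = ((x_all - c) ** 2).sum()
print(np.isclose(tss, sse + ssb))                 # True
```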
Evaluating Individual Clusters and Objects
Sofar,wehavefocusedonusingcohesionandseparationintheoveralleval-uationofagroupofclusters.Manyofthesemeasuresofclustervalidityalsocanbeusedtoevaluateindividualclustersandobjects.Forexample,wecanrankindividualclustersaccordingtotheirspecificvalueofclustervalidity,i.e.,clustercohesionorseparation.Aclusterthathasahighvalueofcohesionmaybeconsideredbetterthanaclusterthathasalowervalue.Thisinformationoftencanbeusedtoimprovethequalityofaclustering.If,forexample,aclusterisnotverycohesive,thenwemaywanttosplititintoseveralsubclus-ters.Ontheotherhand,iftwoclustersarerelativelycohesive,butnotwellseparated,wemaywanttomergethemintoasinglecluster.
We can also evaluate the objects within a cluster in terms of their contribution to the overall cohesion or separation of the cluster. Objects that contribute more to the cohesion and separation are near the "interior" of the cluster. Those objects for which the opposite is true are probably near the "edge" of the cluster. In the following section, we consider a cluster evaluation measure that uses an approach based on these ideas to evaluate points, clusters, and the entire set of clusters.

The Silhouette Coefficient
The popular method of silhouette coefficients combines both cohesion and separation. The following steps explain how to compute the silhouette coefficient for an individual point, a process that consists of the following three steps. We use distances, but an analogous approach can be used for similarities.

1. For the ith object, calculate its average distance to all other objects in its cluster. Call this value ai.
2. For the ith object and any cluster not containing the object, calculate the object's average distance to all the objects in the given cluster. Find the minimum such value with respect to all clusters; call this value bi.
3. For the ith object, the silhouette coefficient is si = (bi − ai)/max(ai, bi).

The value of the silhouette coefficient can vary between −1 and 1. A negative value is undesirable because this corresponds to a case in which ai, the average distance to points in the cluster, is greater than bi, the minimum average distance to points in another cluster. We want the silhouette coefficient to be positive (ai < bi), and for ai to be as close to 0 as possible, since the coefficient assumes its maximum value of 1 when ai = 0.

Figure 8.29. Silhouette coefficients for points in ten clusters.

We can compute the average silhouette coefficient of a cluster by simply taking the average of the silhouette coefficients of points belonging to the cluster. An overall measure of the goodness of a clustering can be obtained by computing the average silhouette coefficient of all points.

Example 8.8 (Silhouette Coefficient). Figure 8.29 shows a plot of the silhouette coefficients for points in 10 clusters. Darker shades indicate lower silhouette coefficients.

8.5.3 Unsupervised Cluster Evaluation Using the Proximity Matrix

In this section, we examine a couple of unsupervised approaches for assessing cluster validity that are based on the proximity matrix. The first compares an actual and idealized proximity matrix, while the second uses visualization.

Measuring Cluster Validity via Correlation

If we are given the similarity matrix for a data set and the cluster labels from a cluster analysis of the data set, then we can evaluate the "goodness" of the clustering by looking at the correlation between the similarity matrix and an ideal version of the similarity matrix based on the cluster labels. (With minor changes, the following applies to proximity matrices, but for simplicity, we discuss only similarity matrices.) More specifically, an ideal cluster is one whose points have a similarity of 1 to all points in the cluster, and a similarity of 0 to all points in other clusters. Thus, if we sort the rows and columns of the similarity matrix so that all objects belonging to the same class are together, then an ideal similarity matrix has a block diagonal structure. In other words, the similarity is non-zero, i.e., 1, inside the blocks of the similarity matrix whose entries represent intra-cluster similarity, and 0 elsewhere. The ideal similarity matrix is constructed by creating a matrix that has one row and one column for each data point—just like an actual similarity matrix—and assigning a 1 to an entry if the associated pair of points belongs to the same cluster. All other entries are 0.

High correlation between the ideal and actual similarity matrices indicates that the points that belong to the same cluster are close to each other, while low correlation indicates the opposite. (Since the actual and ideal similarity matrices are symmetric, the correlation is calculated only among the n(n − 1)/2 entries below or above the diagonal of the matrices.) Consequently, this is not a good measure for many density- or contiguity-based clusters, because they are not globular and may be closely intertwined with other clusters.
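A sketch of this correlation measure, assuming an m × m similarity matrix and an array of cluster labels; only the entries above the diagonal are used, as described above.

```python
# Sketch of the correlation between an actual similarity matrix and the ideal
# similarity matrix implied by a set of cluster labels.
import numpy as np

def ideal_similarity_correlation(sim, labels):
    ideal = (labels[:, None] == labels[None, :]).astype(float)  # 1 if same cluster, else 0
    iu = np.triu_indices(len(labels), k=1)                      # entries above the diagonal
    return np.corrcoef(sim[iu], ideal[iu])[0, 1]
```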
Example 8.9 (Correlation of Actual and Ideal Similarity Matrices). To illustrate this measure, we calculated the correlation between the ideal and actual similarity matrices for the K-means clusters shown in Figure 8.26(c) (random data) and Figure 8.30(a) (data with three well-separated clusters). The correlations were 0.5810 and 0.9235, respectively, which reflects the expected result that the clusters found by K-means in the random data are worse than the clusters found by K-means in data with well-separated clusters.

Judging a Clustering Visually by Its Similarity Matrix

The previous technique suggests a more general, qualitative approach to judging a set of clusters: Order the similarity matrix with respect to cluster labels and then plot it. In theory, if we have well-separated clusters, then the similarity matrix should be roughly block-diagonal. If not, then the patterns displayed in the similarity matrix can reveal the relationships between clusters. Again, all of this can be applied to dissimilarity matrices, but for simplicity, we will only discuss similarity matrices.

Example 8.10 (Visualizing a Similarity Matrix). Consider the points in Figure 8.30(a), which form three well-separated clusters. If we use K-means to group these points into three clusters, then we should have no trouble finding these clusters since they are well-separated. The separation of these clusters is illustrated by the reordered similarity matrix shown in Figure 8.30(b). (For uniformity, we have transformed the distances into similarities using the formula s = 1 − (d − min d)/(max d − min d).) Figure 8.31 shows the reordered similarity matrices for clusters found in the random data set of Figure 8.26 by DBSCAN, K-means, and complete link.

Figure 8.30. Similarity matrix for well-separated clusters: (a) well-separated clusters; (b) similarity matrix sorted by K-means cluster labels.

The well-separated clusters in Figure 8.30 show a very strong, block-diagonal pattern in the reordered similarity matrix. However, there are also weak block-diagonal patterns—see Figure 8.31—in the reordered similarity matrices of the clusterings found by K-means, DBSCAN, and complete link in the random data. Just as people can find patterns in clouds, data mining algorithms can find clusters in random data. While it is entertaining to find patterns in clouds, it is pointless and perhaps embarrassing to find clusters in noise.

Figure 8.31. Similarity matrices for clusters from random data: (a) sorted by DBSCAN cluster labels; (b) sorted by K-means cluster labels; (c) sorted by complete link cluster labels.

This approach may seem hopelessly expensive for large data sets, since the computation of the proximity matrix takes O(m²) time, where m is the number of objects, but with sampling, this method can still be used. We can take a sample of data points from each cluster, compute the similarity between these points, and plot the result. It may be necessary to oversample small clusters and undersample large ones to obtain an adequate representation of all clusters.

8.5.4 Unsupervised Evaluation of Hierarchical Clustering

The previous approaches to cluster evaluation are intended for partitional clusterings. Here we discuss the cophenetic correlation, a popular evaluation measure for hierarchical clusterings.
The cophenetic distance between two objects is the proximity at which an agglomerative hierarchical clustering technique puts the objects in the same cluster for the first time. For example, if at some point in the agglomerative hierarchical clustering process, the smallest distance between the two clusters that are merged is 0.1, then all points in one cluster have a cophenetic distance of 0.1 with respect to the points in the other cluster. In a cophenetic distance matrix, the entries are the cophenetic distances between each pair of objects. The cophenetic distance is different for each hierarchical clustering of a set of points.

Example 8.11 (Cophenetic Distance Matrix). Table 8.7 shows the cophenetic distance matrix for the single link clustering shown in Figure 8.16. (The data for this figure consists of the 6 two-dimensional points given in Table 8.3.)

Table 8.7. Cophenetic distance matrix for single link and data in Table 8.3.

Point | P1 | P2 | P3 | P4 | P5 | P6
P1 | 0 | 0.222 | 0.222 | 0.222 | 0.222 | 0.222
P2 | 0.222 | 0 | 0.148 | 0.151 | 0.139 | 0.148
P3 | 0.222 | 0.148 | 0 | 0.151 | 0.148 | 0.110
P4 | 0.222 | 0.151 | 0.151 | 0 | 0.151 | 0.151
P5 | 0.222 | 0.139 | 0.148 | 0.151 | 0 | 0.148
P6 | 0.222 | 0.148 | 0.110 | 0.151 | 0.148 | 0

The CoPhenetic Correlation Coefficient (CPCC) is the correlation between the entries of this matrix and the original dissimilarity matrix and is a standard measure of how well a hierarchical clustering (of a particular type) fits the data. One of the most common uses of this measure is to evaluate which type of hierarchical clustering is best for a particular type of data.

Example 8.12 (Cophenetic Correlation Coefficient). We calculated the CPCC for the hierarchical clusterings shown in Figures 8.16–8.19. These values are shown in Table 8.8. The hierarchical clustering produced by the single link technique seems to fit the data less well than the clusterings produced by complete link, group average, and Ward's method.

Table 8.8. Cophenetic correlation coefficient for data of Table 8.3 and four agglomerative hierarchical clustering techniques.

Technique | CPCC
Single Link | 0.44
Complete Link | 0.63
Group Average | 0.66
Ward's | 0.

8.5.5 Determining the Correct Number of Clusters

Various unsupervised cluster evaluation measures can be used to approximately determine the correct or natural number of clusters.

Example 8.13 (Number of Clusters). The data set of Figure 8.29 has 10 natural clusters. Figure 8.32 shows a plot of the SSE versus the number of clusters for a (bisecting) K-means clustering of the data set, while Figure 8.33 shows the average silhouette coefficient versus the number of clusters for the same data. There is a distinct knee in the SSE and a distinct peak in the silhouette coefficient when the number of clusters is equal to 10.

Thus, we can try to find the natural number of clusters in a data set by looking for the number of clusters at which there is a knee, peak, or dip in the plot of the evaluation measure when it is plotted against the number of clusters. Of course, such an approach does not always work well. Clusters may be considerably more intertwined or overlapping than those shown in Figure 8.29. Also, the data may consist of nested clusters. Actually, the clusters in Figure 8.29 are somewhat nested; i.e., there are 5 pairs of clusters since the clusters are closer top to bottom than they are left to right. There is a knee that indicates this in the SSE curve, but the silhouette coefficient curve is not as clear. In summary, while caution is needed, the technique we have just described can provide insight into the number of clusters in the data.

Figure 8.32. SSE versus number of clusters for the data of Figure 8.29.
Figure 8.33. Average silhouette coefficient versus number of clusters for the data of Figure 8.29.
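For reference, a sketch of how the CPCC can be computed with SciPy's hierarchical clustering utilities; the random points here merely stand in for the six points of Table 8.3, which are not reproduced in this section.

```python
# Sketch of the CPCC for several agglomerative linkage methods.
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

X = np.random.default_rng(2).random((6, 2))   # stand-in for the points of Table 8.3
d = pdist(X)                                   # condensed pairwise distance matrix
for method in ("single", "complete", "average", "ward"):
    cpcc, _ = cophenet(linkage(d, method=method), d)
    print(method, round(cpcc, 2))
```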
8.5.6 Clustering Tendency

One obvious way to determine if a data set has clusters is to try to cluster it. However, almost all clustering algorithms will dutifully find clusters when given data. To address this issue, we could evaluate the resulting clusters and only claim that a data set has clusters if at least some of the clusters are of good quality. However, this approach does not address the fact that the clusters in the data can be of a different type than those sought by our clustering algorithm. To handle this additional problem, we could use multiple algorithms and again evaluate the quality of the resulting clusters. If the clusters are uniformly poor, then this may indeed indicate that there are no clusters in the data.

Alternatively, and this is the focus of measures of clustering tendency, we can try to evaluate whether a data set has clusters without clustering. The most common approach, especially for data in Euclidean space, has been to use statistical tests for spatial randomness. Unfortunately, choosing the correct model, estimating the parameters, and evaluating the statistical significance of the hypothesis that the data is non-random can be quite challenging. Nonetheless, many approaches have been developed, most of them for points in low-dimensional Euclidean space.

Example 8.14 (Hopkins Statistic). For this approach, we generate p points that are randomly distributed across the data space and also sample p actual data points. For both sets of points we find the distance to the nearest neighbor in the original data set. Let the ui be the nearest neighbor distances of the artificially generated points, while the wi are the nearest neighbor distances of the sample of points from the original data set. The Hopkins statistic H is then defined by Equation 8.17.

$$ H = \frac{\sum_{i=1}^{p} w_i}{\sum_{i=1}^{p} u_i + \sum_{i=1}^{p} w_i} \qquad (8.17) $$

If the randomly generated points and the sample of data points have roughly the same nearest neighbor distances, then H will be near 0.5. Values of H near 0 and 1 indicate, respectively, data that is highly clustered and data that is regularly distributed in the data space. To give an example, the Hopkins statistic for the data of Figure 8.26 was computed for p = 20 and 100 different trials. The average value of H was 0.56 with a standard deviation of 0.03. The same experiment was performed for the well-separated points of Figure 8.30. The average value of H was 0.95 with a standard deviation of 0.006.
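A sketch of the Hopkins statistic following Equation 8.17 as printed above (with the wi in the numerator); the data is assumed to be an m × d NumPy array, the uniform points are generated over the bounding box of the data, and p = 20 follows the example.

```python
# Sketch of the Hopkins statistic of Equation 8.17.
import numpy as np
from scipy.spatial import cKDTree

def hopkins(X, p=20, rng=np.random.default_rng(0)):
    m, d = X.shape
    tree = cKDTree(X)
    uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(p, d))
    u = tree.query(uniform, k=1)[0]                  # nearest data point to each random point
    sample_idx = rng.choice(m, size=p, replace=False)
    w = tree.query(X[sample_idx], k=2)[0][:, 1]      # nearest *other* data point to each sampled point
    return w.sum() / (u.sum() + w.sum())
```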
8.5.7 Supervised Measures of Cluster Validity

When we have external information about data, it is typically in the form of externally derived class labels for the data objects. In such cases, the usual procedure is to measure the degree of correspondence between the cluster labels and the class labels. But why is this of interest? After all, if we have the class labels, then what is the point in performing a cluster analysis? Motivations for such an analysis are the comparison of clustering techniques with the "ground truth" or the evaluation of the extent to which a manual classification process can be automatically produced by cluster analysis.

We consider two different kinds of approaches. The first set of techniques use measures from classification, such as entropy, purity, and the F-measure. These measures evaluate the extent to which a cluster contains objects of a single class. The second group of methods is related to the similarity measures for binary data, such as the Jaccard measure that we saw in Chapter 2. These approaches measure the extent to which two objects that are in the same class are in the same cluster and vice versa. For convenience, we will refer to these two types of measures as classification-oriented and similarity-oriented, respectively.

Classification-Oriented Measures of Cluster Validity

There are a number of measures—entropy, purity, precision, recall, and the F-measure—that are commonly used to evaluate the performance of a classification model. In the case of classification, we measure the degree to which predicted class labels correspond to actual class labels, but for the measures just mentioned, nothing fundamental is changed by using cluster labels instead of predicted class labels. Next, we quickly review the definitions of these measures, which were discussed in Chapter 4.

Entropy: The degree to which each cluster consists of objects of a single class. For each cluster, the class distribution of the data is calculated first, i.e., for cluster i we compute p_ij, the probability that a member of cluster i belongs to class j, as p_ij = m_ij/m_i, where m_i is the number of objects in cluster i and m_ij is the number of objects of class j in cluster i. Using this class distribution, the entropy of each cluster i is calculated using the standard formula, $e_i = -\sum_{j=1}^{L} p_{ij} \log_2 p_{ij}$, where L is the number of classes. The total entropy for a set of clusters is calculated as the sum of the entropies of each cluster weighted by the size of each cluster, i.e., $e = \sum_{i=1}^{K} \frac{m_i}{m} e_i$, where K is the number of clusters and m is the total number of data points.

Purity: Another measure of the extent to which a cluster contains objects of a single class. Using the previous terminology, the purity of cluster i is $p_i = \max_j p_{ij}$, and the overall purity of a clustering is $\text{purity} = \sum_{i=1}^{K} \frac{m_i}{m} p_i$.

Precision: The fraction of a cluster that consists of objects of a specified class. The precision of cluster i with respect to class j is precision(i, j) = p_ij.

Recall: The extent to which a cluster contains all objects of a specified class. The recall of cluster i with respect to class j is recall(i, j) = m_ij/m_j, where m_j is the number of objects in class j.

F-measure: A combination of both precision and recall that measures the extent to which a cluster contains only objects of a particular class and all objects of that class. The F-measure of cluster i with respect to class j is F(i, j) = (2 × precision(i, j) × recall(i, j))/(precision(i, j) + recall(i, j)).

Example 8.15 (Supervised Evaluation Measures). We present an example to illustrate these measures. Specifically, we use K-means with the cosine similarity measure to cluster 3204 newspaper articles from the Los Angeles Times. These articles come from six different classes: Entertainment, Financial, Foreign, Metro, National, and Sports. Table 8.9 shows the results of a K-means clustering to find six clusters. The first column indicates the cluster, while the next six columns together form the confusion matrix; i.e., these columns indicate how the documents of each category are distributed among the clusters. The last two columns are the entropy and purity of each cluster, respectively.

Table 8.9. K-means clustering results for the LA Times document data set.

Cluster | Entertainment | Financial | Foreign | Metro | National | Sports | Entropy | Purity
1 | 3 | 5 | 40 | 506 | 96 | 27 | 1.2270 | 0.7474
2 | 4 | 7 | 280 | 29 | 39 | 2 | 1.1472 | 0.7756
3 | 1 | 1 | 1 | 7 | 4 | 671 | 0.1813 | 0.9796
4 | 10 | 162 | 3 | 119 | 73 | 2 | 1.7487 | 0.4390
5 | 331 | 22 | 5 | 70 | 13 | 23 | 1.3976 | 0.7134
6 | 5 | 358 | 12 | 212 | 48 | 13 | 1.5523 | 0.5525
Total | 354 | 555 | 341 | 943 | 273 | 738 | 1.1450 | 0.7203

Ideally, each cluster will contain documents from only one class. In reality, each cluster contains documents from many classes. Nevertheless, many clusters contain documents primarily from just one class. In particular, cluster 3, which contains mostly documents from the Sports section, is exceptionally good, both in terms of purity and entropy. The purity and entropy of the other clusters is not as good, but can typically be greatly improved if the data is partitioned into a larger number of clusters.
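A sketch of the entropy and purity computations, assuming a confusion matrix whose rows are clusters and columns are classes, as in Table 8.9; the single-row example reproduces the values 0.1813 and 0.9796 reported for cluster 3 (up to rounding).

```python
import numpy as np

def entropy_and_purity(confusion):
    """Weighted total entropy and purity from a clusters-by-classes count matrix."""
    confusion = np.asarray(confusion, dtype=float)
    sizes = confusion.sum(axis=1)                       # m_i
    p = confusion / sizes[:, None]                      # p_ij = m_ij / m_i
    e_i = np.array([-(row[row > 0] * np.log2(row[row > 0])).sum() for row in p])
    purity_i = p.max(axis=1)
    weights = sizes / sizes.sum()                       # m_i / m
    return (weights * e_i).sum(), (weights * purity_i).sum()

print(entropy_and_purity([[1, 1, 1, 7, 4, 671]]))       # (~0.1813, ~0.9796), cluster 3 of Table 8.9
```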
Precision, recall, and the F-measure can be calculated for each cluster. To give a concrete example, we consider cluster 1 and the Metro class of Table 8.9. The precision is 506/677 = 0.75, recall is 506/943 = 0.26, and hence, the F value is 0.39. In contrast, the F value for cluster 3 and Sports is 0.94.

Similarity-Oriented Measures of Cluster Validity

The measures that we discuss in this section are all based on the premise that any two objects that are in the same cluster should be in the same class and vice versa. We can view this approach to cluster validity as involving the comparison of two matrices: (1) the ideal cluster similarity matrix discussed previously, which has a 1 in the ijth entry if two objects, i and j, are in the same cluster and 0, otherwise, and (2) an ideal class similarity matrix defined with respect to class labels, which has a 1 in the ijth entry if two objects, i and j, belong to the same class, and a 0 otherwise. As before, we can take the correlation of these two matrices as the measure of cluster validity. This measure is known as the Γ statistic in the clustering validation literature.

Example 8.16 (Correlation between Cluster and Class Matrices). To demonstrate this idea more concretely, we give an example involving five data points, p1, p2, p3, p4, p5, two clusters, C1 = {p1, p2, p3} and C2 = {p4, p5}, and two classes, L1 = {p1, p2} and L2 = {p3, p4, p5}. The ideal cluster and class similarity matrices are given in Tables 8.10 and 8.11. The correlation between the entries of these two matrices is 0.359.

Table 8.10. Ideal cluster similarity matrix.

Point | p1 | p2 | p3 | p4 | p5
p1 | 1 | 1 | 1 | 0 | 0
p2 | 1 | 1 | 1 | 0 | 0
p3 | 1 | 1 | 1 | 0 | 0
p4 | 0 | 0 | 0 | 1 | 1
p5 | 0 | 0 | 0 | 1 | 1

Table 8.11. Ideal class similarity matrix.

Point | p1 | p2 | p3 | p4 | p5
p1 | 1 | 1 | 0 | 0 | 0
p2 | 1 | 1 | 0 | 0 | 0
p3 | 0 | 0 | 1 | 1 | 1
p4 | 0 | 0 | 1 | 1 | 1
p5 | 0 | 0 | 1 | 1 | 1

More generally, we can use any of the measures for binary similarity that we saw in Section 2.4.5. (For example, we can convert these two matrices into binary vectors by appending the rows.) We repeat the definitions of the four quantities used to define those similarity measures, but modify our descriptive text to fit the current context. Specifically, we need to compute the following four quantities for all pairs of distinct objects. (There are m(m − 1)/2 such pairs, if m is the number of objects.)

f00 = number of pairs of objects having a different class and a different cluster
f01 = number of pairs of objects having a different class and the same cluster
f10 = number of pairs of objects having the same class and a different cluster
f11 = number of pairs of objects having the same class and the same cluster

In particular, the simple matching coefficient, which is known as the Rand statistic in this context, and the Jaccard coefficient are two of the most frequently used cluster validity measures.

$$ \text{Rand statistic} = \frac{f_{00} + f_{11}}{f_{00} + f_{01} + f_{10} + f_{11}} \qquad (8.18) $$

$$ \text{Jaccard coefficient} = \frac{f_{11}}{f_{01} + f_{10} + f_{11}} \qquad (8.19) $$

Example 8.17 (Rand and Jaccard Measures). Based on these formulas, we can readily compute the Rand statistic and Jaccard coefficient for the example based on Tables 8.10 and 8.11. Noting that f00 = 4, f01 = 2, f10 = 2, and f11 = 2, the Rand statistic = (2 + 4)/10 = 0.6 and the Jaccard coefficient = 2/(2 + 2 + 2) = 0.33.
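A sketch of the pair-counting quantities and the Rand and Jaccard measures of Equations 8.18 and 8.19, using the cluster and class labels of Example 8.16 encoded as integer lists.

```python
from itertools import combinations

def pair_counts(cluster_labels, class_labels):
    """f00, f01, f10, f11 over all pairs of distinct objects."""
    f00 = f01 = f10 = f11 = 0
    for i, j in combinations(range(len(cluster_labels)), 2):
        same_cluster = cluster_labels[i] == cluster_labels[j]
        same_class = class_labels[i] == class_labels[j]
        f11 += same_cluster and same_class
        f10 += (not same_cluster) and same_class
        f01 += same_cluster and not same_class
        f00 += not (same_cluster or same_class)
    return f00, f01, f10, f11

# Example 8.16: clusters {p1,p2,p3},{p4,p5}; classes {p1,p2},{p3,p4,p5}.
f00, f01, f10, f11 = pair_counts([1, 1, 1, 2, 2], [1, 1, 2, 2, 2])
print((f00 + f11) / (f00 + f01 + f10 + f11),   # Rand statistic = 0.6
      f11 / (f01 + f10 + f11))                 # Jaccard coefficient ~ 0.33
```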
We also note that the four quantities, f00, f01, f10, and f11, define a contingency table as shown in Table 8.12.

Table 8.12. Two-way contingency table for determining whether pairs of objects are in the same class and same cluster.

                | Same Cluster | Different Cluster
Same Class      | f11          | f10
Different Class | f01          | f00

Previously, in the context of association analysis—see Section 6.7.1—we presented an extensive discussion of measures of association that can be used for this type of contingency table. (Compare Table 8.12 with Table 6.7.) Those measures can also be applied to cluster validity.

Cluster Validity for Hierarchical Clusterings

So far in this section, we have discussed supervised measures of cluster validity only for partitional clusterings. Supervised evaluation of a hierarchical clustering is more difficult for a variety of reasons, including the fact that a preexisting hierarchical structure often does not exist. Here, we will give an example of an approach for evaluating a hierarchical clustering in terms of a (flat) set of class labels, which are more likely to be available than a preexisting hierarchical structure.

The key idea of this approach is to evaluate whether a hierarchical clustering contains, for each class, at least one cluster that is relatively pure and includes most of the objects of that class. To evaluate a hierarchical clustering with respect to this goal, we compute, for each class, the F-measure for each cluster in the cluster hierarchy. For each class, we take the maximum F-measure attained for any cluster. Finally, we calculate an overall F-measure for the hierarchical clustering by computing the weighted average of all per-class F-measures, where the weights are based on the class sizes. More formally, this hierarchical F-measure is defined as follows:

$$ F = \sum_{j} \frac{m_j}{m} \max_{i} F(i, j) $$

where the maximum is taken over all clusters i at all levels, m_j is the number of objects in class j, and m is the total number of objects.

8.5.8 Assessing the Significance of Cluster Validity Measures

Cluster validity measures are intended to help us measure the goodness of the clusters that we have obtained. Indeed, they typically give us a single number as a measure of that goodness. However, we are then faced with the problem of interpreting the significance of this number, a task that may be even more difficult.

The minimum and maximum values of cluster evaluation measures may provide some guidance in many cases. For instance, by definition, a purity of 0 is bad, while a purity of 1 is good, at least if we trust our class labels and want our cluster structure to reflect the class structure. Likewise, an entropy of 0 is good, as is an SSE of 0.

Sometimes, however, there may not be a minimum or maximum value, or the scale of the data may affect the interpretation. Also, even if there are minimum and maximum values with obvious interpretations, intermediate values still need to be interpreted. In some cases, we can use an absolute standard. If, for example, we are clustering for utility, we may be willing to tolerate only a certain level of error in the approximation of our points by a cluster centroid.

But if this is not the case, then we must do something else. A common approach is to interpret the value of our validity measure in statistical terms. Specifically, we attempt to judge how likely it is that our observed value may be achieved by random chance. The value is good if it is unusual; i.e., if it is unlikely to be the result of random chance. The motivation for this approach is that we are only interested in clusters that reflect non-random structure in the data, and such structures should generate unusually high (low) values of our cluster validity measure, at least if the validity measures are designed to reflect the presence of strong cluster structure.

Example 8.18 (Significance of SSE). To show how this works, we present an example based on K-means and the SSE. Suppose that we want a measure of how good the well-separated clusters of Figure 8.30 are with respect to random data. We generate many random sets of 100 points having the same range as

Figure 8.34. Histogram of SSE for 500 random data sets.
thepointsinthethreeclusters,findthreeclustersineachdatasetusingK-means,andaccumulatethedistributionofSSEvaluesfortheseclusterings.ByusingthisdistributionoftheSSEvalues,wecanthenestimatetheprobabilityoftheSSEvaluefortheoriginalclusters.Figure8.34showsthehistogramoftheSSEfrom500randomruns.ThelowestSSEshowninFigure8.34is0.0173.ForthethreeclustersofFigure8.30,theSSEis0.0050.Wecouldthereforeconservativelyclaimthatthereislessthana1%chancethataclusteringsuchasthatofFigure8.30couldoccurbychance. Toconclude,westressthatthereismoretoclusterevaluation—supervisedorunsupervised—thanobtaininganumericalmeasureofclustervalidity.Un-lessthisvaluehasanaturalinterpretationbasedonthedefinitionofthemea-sure,weneedtointerpretthisvalueinsomeway.Ifourclusterevaluationmeasureisdefinedsuchthatlowervaluesindicatestrongerclusters,thenwecanusestatisticstoevaluatewhetherthevaluewehaveobtainedisunusuallylow,providedwehaveadistributionfortheevaluationmeasure.Wehavepre-sentedanexampleofhowtofindsuchadistribution,butthereisconsiderablymoretothistopic,andwereferthereadertothebibliographicnotesformorepointers. Finally,evenwhenanevaluationmeasureisusedasarelativemeasure,i.e.,tocomparetwoclusterings,westillneedtoassessthesignificanceinthedifferencebetweentheevaluationmeasuresofthetwoclusterings.Althoughonevaluewillalmostalwaysbebetterthananother,itcanbedifficulttodetermineifthedifferenceissignificant.Notethattherearetwoaspectstothissignificance:whetherthedifferenceisstatisticallysignificant(repeatable) 8.6BibliographicNotes555 andwhetherthemagnitudeofthedifferenceismeaningfulwithrespecttotheapplication.Manywouldnotregardadifferenceof0.1%assignificant,evenifitisconsistentlyreproducible. 8.6BibliographicNotes DiscussioninthischapterhasbeenmostheavilyinfluencedbythebooksonclusteranalysiswrittenbyJainandDubes[396],Anderberg[374],andKauf-manandRousseeuw[400].AdditionalclusteringbooksthatmayalsobeofinterestincludethosebyAldenderferandBlashfield[373],Everittetal.[388],Hartigan[394],Mirkin[405],Murtagh[407],Romesburg[409],andSp¨ath[413].AmorestatisticallyorientedapproachtoclusteringisgivenbythepatternrecognitionbookofDudaetal.[385],themachinelearningbookofMitchell[406],andthebookonstatisticallearningbyHastieetal.[395].AgeneralsurveyofclusteringisgivenbyJainetal.[397],whileasurveyofspatialdataminingtechniquesisprovidedbyHanetal.[393].Behrkin[379]providesasurveyofclusteringtechniquesfordatamining.AgoodsourceofreferencestoclusteringoutsideofthedataminingfieldisthearticlebyArabieandHu-bert[376].ApaperbyKleinberg[401]providesadiscussionofsomeofthetrade-offsthatclusteringalgorithmsmakeandprovesthatitisimpossibletoforaclusteringalgorithmtosimultaneouslypossessthreesimpleproperties. 
TheK-meansalgorithmhasalonghistory,butisstillthesubjectofcurrentresearch.TheoriginalK-meansalgorithmwasproposedbyMacQueen[403].TheISODATAalgorithmbyBallandHall[377]wasanearly,butsophisticatedversionofK-meansthatemployedvariouspre-andpostprocessingtechniquestoimproveonthebasicalgorithm.TheK-meansalgorithmandmanyofitsvariationsaredescribedindetailinthebooksbyAnderberg[374]andJainandDubes[396].ThebisectingK-meansalgorithmdiscussedinthischapterwasdescribedinapaperbySteinbachetal.[414],andanimplementationofthisandotherclusteringapproachesisfreelyavailableforacademicuseintheCLUTO(CLUsteringTOolkit)packagecreatedbyKarypis[382].Boley[380]hascreatedadivisivepartitioningclusteringalgorithm(PDDP)basedonfindingthefirstprincipaldirection(component)ofthedata,andSavaresiandBoley[411]haveexploreditsrelationshiptobisectingK-means.RecentvariationsofK-meansareanewincrementalversionofK-means(Dhillonetal.[383]),X-means(PellegandMoore[408]),andK-harmonicmeans(Zhangetal[416]).HamerlyandElkan[392]discusssomeclusteringalgorithmsthatpro-ducebetterresultsthanK-means.WhilesomeofthepreviouslymentionedapproachesaddresstheinitializationproblemofK-meansinsomemanner, 556Chapter8ClusterAnalysis:BasicConceptsandAlgorithms otherapproachestoimprovingK-meansinitializationcanalsobefoundintheworkofBradleyandFayyad[381].DhillonandModha[384]presentagen-eralizationofK-means,calledsphericalK-means,thatworkswithcommonlyusedsimilarityfunctions.AgeneralframeworkforK-meansclusteringthatusesdissimilarityfunctionsbasedonBregmandivergenceswasconstructedbyBanerjeeetal.[378]. Hierarchicalclusteringtechniquesalsohavealonghistory.MuchoftheinitialactivitywasintheareaoftaxonomyandiscoveredinbooksbyJardineandSibson[398]andSneathandSokal[412].General-purposediscussionsofhierarchicalclusteringarealsoavailableinmostoftheclusteringbooksmen-tionedabove.Agglomerativehierarchicalclusteringisthefocusofmostworkintheareaofhierarchicalclustering,butdivisiveapproacheshavealsoreceivedsomeattention.Forexample,Zahn[415]describesadivisivehierarchicaltech-niquethatusestheminimumspanningtreeofagraph.Whilebothdivisiveandagglomerativeapproachestypicallytaketheviewthatmerging(splitting)decisionsarefinal,therehasbeensomeworkbyFisher[3]andKarypisetal.[399]toovercometheselimitations. Esteretal.proposedDBSCAN[387],whichwaslatergeneralizedtotheGDBSCANalgorithmbySanderetal.[410]inordertohandlemoregeneraltypesofdataanddistancemeasures,suchaspolygonswhoseclosenessismea-suredbythedegreeofintersection.AnincrementalversionofDBSCANwasdevelopedbyKriegeletal.[386].OneinterestingoutgrowthofDBSCANisOPTICS(OrderingPointsToIdentifytheClusteringStructure)(Ankerstetal.[375]),whichallowsthevisualizationofclusterstructureandcanalsobeusedforhierarchicalclustering. Anauthoritativediscussionofclustervalidity,whichstronglyinfluencedthediscussioninthischapter,isprovidedinChapter4ofJainandDubes’clusteringbook[396].MorerecentreviewsofclustervalidityarethoseofHalkidietal.[390,391]andMilligan[404].SilhouettecoefficientsaredescribedinKaufmanandRousseeuw’sclusteringbook[400].ThesourceofthecohesionandseparationmeasuresinTable8.6isapaperbyZhaoandKarypis[417],whichalsocontainsadiscussionofentropy,purity,andthehierarchicalF-measure.TheoriginalsourceofthehierarchicalF-measureisanarticlebyLarsenandAone[402]. Bibliography [373]M.S.AldenderferandR.K.Blashfield.ClusterAnalysis.SagePublications,Los Angeles,1985. [374]M.R.Anderberg.ClusterAnalysisforApplications.AcademicPress,NewYork, December1973. 
[375] M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering Points To Identify the Clustering Structure. In Proc. of 1999 ACM-SIGMOD Intl. Conf. on Management of Data, pages 49–60, Philadelphia, Pennsylvania, June 1999. ACM Press.

[376] P. Arabie, L. Hubert, and G. D. Soete. An overview of combinatorial data analysis. In P. Arabie, L. Hubert, and G. D. Soete, editors, Clustering and Classification, pages 188–217. World Scientific, Singapore, January 1996.

[377] G. Ball and D. Hall. A Clustering Technique for Summarizing Multivariate Data. Behavior Science, 12:153–155, March 1967.

[378] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with Bregman Divergences. In Proc. of the 2004 SIAM Intl. Conf. on Data Mining, pages 234–245, Lake Buena Vista, FL, April 2004.

[379] P. Berkhin. Survey of Clustering Data Mining Techniques. Technical report, Accrue Software, San Jose, CA, 2002.

[380] D. Boley. Principal Direction Divisive Partitioning. Data Mining and Knowledge Discovery, 2(4):325–344, 1998.

[381] P. S. Bradley and U. M. Fayyad. Refining Initial Points for K-Means Clustering. In Proc. of the 15th Intl. Conf. on Machine Learning, pages 91–99, Madison, WI, July 1998. Morgan Kaufmann Publishers Inc.

[382] CLUTO 2.1.1: Software for Clustering High-Dimensional Datasets. www.cs.umn.edu/~karypis, November 2003.

[383] I. S. Dhillon, Y. Guan, and J. Kogan. Iterative Clustering of High Dimensional Text Data Augmented by Local Search. In Proc. of the 2002 IEEE Intl. Conf. on Data Mining, pages 131–138. IEEE Computer Society, 2002.

[384] I. S. Dhillon and D. S. Modha. Concept Decompositions for Large Sparse Text Data Using Clustering. Machine Learning, 42(1/2):143–175, 2001.

[385] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley & Sons, Inc., New York, second edition, 2001.

[386] M. Ester, H.-P. Kriegel, J. Sander, M. Wimmer, and X. Xu. Incremental Clustering for Mining in a Data Warehousing Environment. In Proc. of the 24th VLDB Conf., pages 323–333, New York City, August 1998. Morgan Kaufmann.

[387] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proc. of the 2nd Intl. Conf. on Knowledge Discovery and Data Mining, pages 226–231, Portland, Oregon, August 1996. AAAI Press.

[388] B. S. Everitt, S. Landau, and M. Leese. Cluster Analysis. Arnold Publishers, London, fourth edition, May 2001.

[389] D. Fisher. Iterative Optimization and Simplification of Hierarchical Clusterings. Journal of Artificial Intelligence Research, 4:147–179, 1996.

[390] M. Halkidi, Y. Batistakis, and M. Vazirgiannis. Cluster validity methods: part I. SIGMOD Record (ACM Special Interest Group on Management of Data), 31(2):40–45, June 2002.

[391] M. Halkidi, Y. Batistakis, and M. Vazirgiannis. Clustering validity checking methods: part II. SIGMOD Record (ACM Special Interest Group on Management of Data), 31(3):19–27, Sept. 2002.

[392] G. Hamerly and C. Elkan. Alternatives to the k-means algorithm that find better clusterings. In Proc. of the 11th Intl. Conf. on Information and Knowledge Management, pages 600–607, McLean, Virginia, 2002. ACM Press.

[393] J. Han, M. Kamber, and A. Tung. Spatial Clustering Methods in Data Mining: A Review. In H. J. Miller and J. Han, editors, Geographic Data Mining and Knowledge Discovery, pages 188–217. Taylor and Francis, London, December 2001.

[394] J. Hartigan. Clustering Algorithms. Wiley, New York, 1975.

[395] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, Prediction. Springer, New York, 2001.

[396] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall Advanced Reference Series. Prentice Hall, March 1988. Book available online at http://www.cse.msu.edu/~jain/ClusteringJainDubes.pdf.

[397] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: A review. ACM Computing Surveys, 31(3):264–323, September 1999.
[398] N. Jardine and R. Sibson. Mathematical Taxonomy. Wiley, New York, 1971.

[399] G. Karypis, E.-H. Han, and V. Kumar. Multilevel Refinement for Hierarchical Clustering. Technical Report TR 99-020, University of Minnesota, Minneapolis, MN, 1999.

[400] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley Series in Probability and Statistics. John Wiley and Sons, New York, November 1990.

[401] J. M. Kleinberg. An Impossibility Theorem for Clustering. In Proc. of the 16th Annual Conf. on Neural Information Processing Systems, December 9–14, 2002.

[402] B. Larsen and C. Aone. Fast and Effective Text Mining Using Linear-Time Document Clustering. In Proc. of the 5th Intl. Conf. on Knowledge Discovery and Data Mining, pages 16–22, San Diego, California, 1999. ACM Press.

[403] J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proc. of the 5th Berkeley Symp. on Mathematical Statistics and Probability, pages 281–297. University of California Press, 1967.

[404] G. W. Milligan. Clustering Validation: Results and Implications for Applied Analyses. In P. Arabie, L. Hubert, and G. D. Soete, editors, Clustering and Classification, pages 345–375. World Scientific, Singapore, January 1996.

[405] B. Mirkin. Mathematical Classification and Clustering, volume 11 of Nonconvex Optimization and Its Applications. Kluwer Academic Publishers, August 1996.

[406] T. Mitchell. Machine Learning. McGraw-Hill, Boston, MA, 1997.

[407] F. Murtagh. Multidimensional Clustering Algorithms. Physica-Verlag, Heidelberg and Vienna, 1985.

[408] D. Pelleg and A. W. Moore. X-means: Extending K-means with Efficient Estimation of the Number of Clusters. In Proc. of the 17th Intl. Conf. on Machine Learning, pages 727–734. Morgan Kaufmann, San Francisco, CA, 2000.

[409] C. Romesburg. Cluster Analysis for Researchers. Life Time Learning, Belmont, CA, 1984.

[410] J. Sander, M. Ester, H.-P. Kriegel, and X. Xu. Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and its Applications. Data Mining and Knowledge Discovery, 2(2):169–194, 1998.

[411] S. M. Savaresi and D. Boley. A comparative analysis on the bisecting K-means and the PDDP clustering algorithms. Intelligent Data Analysis, 8(4):345–362, 2004.

[412] P. H. A. Sneath and R. R. Sokal. Numerical Taxonomy. Freeman, San Francisco, 1971.

[413] H. Späth. Cluster Analysis Algorithms for Data Reduction and Classification of Objects, volume 4 of Computers and Their Application. Ellis Horwood Publishers, Chichester, 1980. ISBN 0-85312-141-9.

[414] M. Steinbach, G. Karypis, and V. Kumar. A Comparison of Document Clustering Techniques. In Proc. of KDD Workshop on Text Mining, Proc. of the 6th Intl. Conf. on Knowledge Discovery and Data Mining, Boston, MA, August 2000.

[415] C. T. Zahn. Graph-Theoretical Methods for Detecting and Describing Gestalt Clusters. IEEE Transactions on Computers, C-20(1):68–86, Jan. 1971.

[416] B. Zhang, M. Hsu, and U. Dayal. K-Harmonic Means - A Data Clustering Algorithm. Technical Report HPL-1999-124, Hewlett Packard Laboratories, October 29, 1999.

[417] Y. Zhao and G. Karypis. Empirical and theoretical comparisons of selected criterion functions for document clustering. Machine Learning, 55(3):311–331, 2004.

8.7 Exercises

1. Consider a data set consisting of 2^20 data vectors, where each vector has 32 components and each component is a 4-byte value. Suppose that vector quantization is used for compression and that 2^16 prototype vectors are used. How many bytes of storage does that data set take before and after compression and what is the compression ratio?

2. Find all well-separated clusters in the set of points shown in Figure 8.35.

Figure 8.35. Points for Exercise 2.
3. Many partitional clustering algorithms that automatically determine the number of clusters claim that this is an advantage. List two situations in which this is not the case.

4. Given K equally sized clusters, the probability that a randomly chosen initial centroid will come from any given cluster is 1/K, but the probability that each cluster will have exactly one initial centroid is much lower. (It should be clear that having one initial centroid in each cluster is a good starting situation for K-means.) In general, if there are K clusters and each cluster has n points, then the probability, p, of selecting in a sample of size K one initial centroid from each cluster is given by Equation 8.20. (This assumes sampling with replacement.) From this formula we can calculate, for example, that the chance of having one initial centroid from each of four clusters is 4!/4^4 = 0.0938.

p = (number of ways to select one centroid from each cluster) / (number of ways to select K centroids) = K! n^K / (Kn)^K = K! / K^K     (8.20)

(a) Plot the probability of obtaining one point from each cluster in a sample of size K for values of K between 2 and 100.

(b) For K clusters, K = 10, 100, and 1000, find the probability that a sample of size 2K contains at least one point from each cluster. You can use either mathematical methods or statistical simulation to determine the answer.

5. Identify the clusters in Figure 8.36 using the center-, contiguity-, and density-based definitions. Also indicate the number of clusters for each case and give a brief indication of your reasoning. Note that darkness or the number of dots indicates density. If it helps, assume center-based means K-means, contiguity-based means single link, and density-based means DBSCAN.

Figure 8.36. Clusters for Exercise 5 (panels (a)–(d)).

6. For the following sets of two-dimensional points, (1) provide a sketch of how they would be split into clusters by K-means for the given number of clusters and (2) indicate approximately where the resulting centroids would be. Assume that we are using the squared error objective function. If you think that there is more than one possible solution, then please indicate whether each solution is a global or local minimum. Note that the label of each diagram in Figure 8.37 matches the corresponding part of this question, e.g., Figure 8.37(a) goes with part (a).

(a) K = 2. Assuming that the points are uniformly distributed in the circle, how many possible ways are there (in theory) to partition the points into two clusters? What can you say about the positions of the two centroids? (Again, you don't need to provide exact centroid locations, just a qualitative description.)

(b) K = 3. The distance between the edges of the circles is slightly greater than the radii of the circles.

(c) K = 3. The distance between the edges of the circles is much less than the radii of the circles.

(d) K = 2.

(e) K = 3. Hint: Use the symmetry of the situation and remember that we are looking for a rough sketch of what the result would be.

Figure 8.37. Diagrams for Exercise 6 (panels (a)–(e)).

7. Suppose that for a data set

• there are m points and K clusters,
• half the points and clusters are in "more dense" regions,
• half the points and clusters are in "less dense" regions, and
• the two regions are well-separated from each other.

For the given data set, which of the following should occur in order to minimize the squared error when finding K clusters:

(a) Centroids should be equally distributed between more dense and less dense regions.

(b) More centroids should be allocated to the less dense region.

(c) More centroids should be allocated to the denser region.

Note: Do not get distracted by special cases or bring in factors other than density. However, if you feel the true answer is different from any given above, justify your response.
8. Consider the mean of a cluster of objects from a binary transaction data set. What are the minimum and maximum values of the components of the mean? What is the interpretation of components of the cluster mean? Which components most accurately characterize the objects in the cluster?

9. Give an example of a data set consisting of three natural clusters, for which (almost always) K-means would likely find the correct clusters, but bisecting K-means would not.

10. Would the cosine measure be the appropriate similarity measure to use with K-means clustering for time series data? Why or why not? If not, what similarity measure would be more appropriate?

11. Total SSE is the sum of the SSE for each separate attribute. What does it mean if the SSE for one variable is low for all clusters? Low for just one cluster? High for all clusters? High for just one cluster? How could you use the per-variable SSE information to improve your clustering?

12. The leader algorithm (Hartigan [394]) represents each cluster using a point, known as a leader, and assigns each point to the cluster corresponding to the closest leader, unless this distance is above a user-specified threshold. In that case, the point becomes the leader of a new cluster. (An illustrative code sketch of this algorithm is given after the exercises.)

(a) What are the advantages and disadvantages of the leader algorithm as compared to K-means?

(b) Suggest ways in which the leader algorithm might be improved.

13. The Voronoi diagram for a set of K points in the plane is a partition of all the points of the plane into K regions, such that every point (of the plane) is assigned to the closest point among the K specified points. (See Figure 8.38.) What is the relationship between Voronoi diagrams and K-means clusters? What do Voronoi diagrams tell us about the possible shapes of K-means clusters?

Figure 8.38. Voronoi diagram for Exercise 13.

14. You are given a data set with 100 records and are asked to cluster the data. You use K-means to cluster the data, but for all values of K, 1 ≤ K ≤ 100, the K-means algorithm returns only one non-empty cluster. You then apply an incremental version of K-means, but obtain exactly the same result. How is this possible? How would single link or DBSCAN handle such data?

15. Traditional agglomerative hierarchical clustering routines merge two clusters at each step. Does it seem likely that such an approach accurately captures the (nested) cluster structure of a set of data points? If not, explain how you might postprocess the data to obtain a more accurate view of the cluster structure.

16. Use the similarity matrix in Table 8.13 to perform single and complete link hierarchical clustering. Show your results by drawing a dendrogram. The dendrogram should clearly show the order in which the points are merged.

Table 8.13. Similarity matrix for Exercise 16.

        p1     p2     p3     p4     p5
  p1   1.00   0.10   0.41   0.55   0.35
  p2   0.10   1.00   0.64   0.47   0.98
  p3   0.41   0.64   1.00   0.44   0.85
  p4   0.55   0.47   0.44   1.00   0.76
  p5   0.35   0.98   0.85   0.76   1.00

17. Hierarchical clustering is sometimes used to generate K clusters, K > 1, by taking the clusters at the Kth level of the dendrogram. (Root is at level 1.) By looking at the clusters produced in this way, we can evaluate the behavior of hierarchical clustering on different types of data and clusters, and also compare hierarchical approaches to K-means.

The following is a set of one-dimensional points: {6, 12, 18, 24, 30, 42, 48}.

(a) For each of the following sets of initial centroids, create two clusters by assigning each point to the nearest centroid, and then calculate the total squared error for each set of two clusters. Show both the clusters and the total squared error for each set of centroids.

i. {18, 45}
ii. {15, 40}

(b) Do both sets of centroids represent stable solutions; i.e., if the K-means algorithm was run on this set of points using the given centroids as the starting centroids, would there be any change in the clusters generated?

(c) What are the two clusters produced by single link?
(d) Which technique, K-means or single link, seems to produce the "most natural" clustering in this situation? (For K-means, take the clustering with the lowest squared error.)

(e) What definition(s) of clustering does this natural clustering correspond to? (Well-separated, center-based, contiguous, or density.)

(f) What well-known characteristic of the K-means algorithm explains the previous behavior?

18. Suppose we find K clusters using Ward's method, bisecting K-means, and ordinary K-means. Which of these solutions represents a local or global minimum? Explain.

19. Hierarchical clustering algorithms require O(m^2 log(m)) time, and consequently, are impractical to use directly on larger data sets. One possible technique for reducing the time required is to sample the data set. For example, if K clusters are desired and √m points are sampled from the m points, then a hierarchical clustering algorithm will produce a hierarchical clustering in roughly O(m) time. K clusters can be extracted from this hierarchical clustering by taking the clusters on the Kth level of the dendrogram. The remaining points can then be assigned to a cluster in linear time, by using various strategies. To give a specific example, the centroids of the K clusters can be computed, and then each of the m − √m remaining points can be assigned to the cluster associated with the closest centroid.

For each of the following types of data or clusters, discuss briefly if (1) sampling will cause problems for this approach and (2) what those problems are. Assume that the sampling technique randomly chooses points from the total set of m points and that any unmentioned characteristics of the data or clusters are as optimal as possible. In other words, focus only on problems caused by the particular characteristic mentioned. Finally, assume that K is very much less than m.

(a) Data with very different sized clusters.
(b) High-dimensional data.
(c) Data with outliers, i.e., atypical points.
(d) Data with highly irregular regions.
(e) Data with globular clusters.
(f) Data with widely different densities.
(g) Data with a small percentage of noise points.
(h) Non-Euclidean data.
(i) Euclidean data.
(j) Data with many and mixed attribute types.

20. Consider the following four faces shown in Figure 8.39. Again, darkness or number of dots represents density. Lines are used only to distinguish regions and do not represent points.

Figure 8.39. Figure for Exercise 20 (panels (a)–(d)).

(a) For each figure, could you use single link to find the patterns represented by the nose, eyes, and mouth? Explain.

(b) For each figure, could you use K-means to find the patterns represented by the nose, eyes, and mouth? Explain.

(c) What limitation does clustering have in detecting all the patterns formed by the points in Figure 8.39(c)?

21. Compute the entropy and purity for the confusion matrix in Table 8.14.

Table 8.14. Confusion matrix for Exercise 21.

  Cluster   Entertainment   Financial   Foreign   Metro   National   Sports   Total
  #1                1             1         0       11        4        676     693
  #2               27            89       333      827      253         33    1562
  #3              326           465         8      105       16         29     949
  Total           354           555       341      943      273        738    3204

22. You are given two sets of 100 points that fall within the unit square. One set of points is arranged so that the points are uniformly spaced. The other set of points is generated from a uniform distribution over the unit square.

(a) Is there a difference between the two sets of points?

(b) If so, which set of points will typically have a smaller SSE for K = 10 clusters?

(c) What will be the behavior of DBSCAN on the uniform data set? The random data set?

23. Using the data in Exercise 24, compute the silhouette coefficient for each point, each of the two clusters, and the overall clustering.

24. Given the set of cluster labels and similarity matrix shown in Tables 8.15 and 8.16, respectively, compute the correlation between the similarity matrix and the ideal similarity matrix, i.e., the matrix whose ijth entry is 1 if two objects belong to the same cluster, and 0 otherwise.
Table 8.15. Table of cluster labels for Exercise 24.

  Point   Cluster Label
  P1            1
  P2            1
  P3            2
  P4            2

Table 8.16. Similarity matrix for Exercise 24.

        P1     P2     P3     P4
  P1   1.00   0.80   0.65   0.55
  P2   0.80   1.00   0.70   0.60
  P3   0.65   0.70   1.00   0.90
  P4   0.55   0.60   0.90   1.00

25. Compute the hierarchical F-measure for the eight objects {p1, p2, p3, p4, p5, p6, p7, p8} and the hierarchical clustering shown in Figure 8.40. Class A contains points p1, p2, and p3, while p4, p5, p6, p7, and p8 belong to class B.

Figure 8.40. Hierarchical clustering for Exercise 25: the root {p1, p2, p3, p4, p5, p6, p7, p8} splits into {p1, p2, p4, p5} and {p3, p6, p7, p8}; {p1, p2, p4, p5} splits into {p1, p2} and {p4, p5}, and {p3, p6, p7, p8} splits into {p3, p6} and {p7, p8}.

26. Compute the cophenetic correlation coefficient for the hierarchical clusterings in Exercise 16. (You will need to convert the similarities into dissimilarities.)

27. Prove Equation 8.14.

28. Prove Equation 8.16.

29. Prove that Σ_{i=1}^{K} Σ_{x∈C_i} (x − m_i)(m − m_i) = 0. This fact was used in the proof that TSS = SSE + SSB in Section 8.5.2.

30. Clusters of documents can be summarized by finding the top terms (words) for the documents in the cluster, e.g., by taking the most frequent k terms, where k is a constant, say 10, or by taking all terms that occur more frequently than a specified threshold. Suppose that K-means is used to find clusters of both documents and words for a document data set.

(a) How might a set of term clusters defined by the top terms in a document cluster differ from the word clusters found by clustering the terms with K-means?

(b) How could term clustering be used to define clusters of documents?

31. We can represent a data set as a collection of object nodes and a collection of attribute nodes, where there is a link between each object and each attribute, and where the weight of that link is the value of the object for that attribute. For sparse data, if the value is 0, the link is omitted. Bipartite clustering attempts to partition this graph into disjoint clusters, where each cluster consists of a set of object nodes and a set of attribute nodes. The objective is to maximize the weight of links between the object and attribute nodes of a cluster, while minimizing the weight of links between object and attribute nodes in different clusters. This type of clustering is also known as co-clustering since the objects and attributes are clustered at the same time.

(a) How is bipartite clustering (co-clustering) different from clustering the sets of objects and attributes separately?

(b) Are there any cases in which these approaches yield the same clusters?

(c) What are the strengths and weaknesses of co-clustering as compared to ordinary clustering?

32. In Figure 8.41, match the similarity matrices, which are sorted according to cluster labels, with the sets of points. Differences in shading and marker shape distinguish between clusters, and each set of points contains 100 points and three clusters. In the set of points labeled 2, there are three very tight, equal-sized clusters.

Figure 8.41. Points and similarity matrices for Exercise 32.
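As a supplement to Exercise 12, the following is a minimal Python sketch of the leader algorithm as described there. The function name leader_clustering, the use of Euclidean distance, and the example threshold are illustrative assumptions and are not part of the exercise.

```python
# Minimal sketch of the leader algorithm (Exercise 12): each cluster is
# represented by a single point, its leader; a point is assigned to the
# cluster of the closest leader unless that distance exceeds a
# user-specified threshold, in which case the point becomes the leader
# of a new cluster.
import numpy as np

def leader_clustering(points, threshold):
    leaders = []   # one representative (leader) point per cluster
    labels = []    # cluster index assigned to each point, in input order
    for x in points:
        if leaders:
            dists = np.linalg.norm(np.asarray(leaders) - x, axis=1)
            nearest = int(np.argmin(dists))
            if dists[nearest] <= threshold:
                labels.append(nearest)
                continue
        # No leader is close enough, so x starts (and leads) a new cluster.
        leaders.append(np.asarray(x, dtype=float))
        labels.append(len(leaders) - 1)
    return np.asarray(leaders), np.asarray(labels)

# Illustrative usage on random two-dimensional points.
rng = np.random.default_rng(0)
X = rng.uniform(size=(20, 2))
leaders, labels = leader_clustering(X, threshold=0.3)
print(f"{len(leaders)} clusters found for threshold 0.3")
```

Note that, unlike K-means, this procedure makes a single pass over the data and its result depends on the order in which the points are presented, which is relevant to part (a) of the exercise.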