Cluster Analysis: Basic Concepts and Algorithms
Cluster analysis divides data into groups (clusters) that are meaningful, useful, or both. If meaningful groups are the goal, then the clusters should capture the natural structure of the data. In some cases, however, cluster analysis is only a useful starting point for other purposes, such as data summarization. Whether for understanding or utility, cluster analysis has long played an important role in a wide variety of fields: psychology and other social sciences, biology, statistics, pattern recognition, information retrieval, machine learning, and data mining.
There have been many applications of cluster analysis to practical problems. We provide some specific examples, organized by whether the purpose of the clustering is understanding or utility.
Clustering for Understanding Classes, or conceptually meaningful groups of objects that share common characteristics, play an important role in how people analyze and describe the world. Indeed, human beings are skilled at dividing objects into groups (clustering) and assigning particular objects to these groups (classification). For example, even relatively young children can quickly label the objects in a photograph as buildings, vehicles, people, animals, plants, etc. In the context of understanding data, clusters are potential classes and cluster analysis is the study of techniques for automatically finding classes. The following are some examples:
• Biology. Biologists have spent many years creating a taxonomy (hierarchical classification) of all living things: kingdom, phylum, class, order, family, genus, and species. Thus, it is perhaps not surprising that much of the early work in cluster analysis sought to create a discipline of mathematical taxonomy that could automatically find such classification structures. More recently, biologists have applied clustering to analyze the large amounts of genetic information that are now available. For example, clustering has been used to find groups of genes that have similar functions.
• Information Retrieval. The World Wide Web consists of billions of Web pages, and the results of a query to a search engine can return thousands of pages. Clustering can be used to group these search results into a small number of clusters, each of which captures a particular aspect of the query. For instance, a query of "movie" might return Web pages grouped into categories such as reviews, trailers, stars, and theaters. Each category (cluster) can be broken into subcategories (subclusters), producing a hierarchical structure that further assists a user's exploration of the query results.
• Climate. Understanding the Earth's climate requires finding patterns in the atmosphere and ocean. To that end, cluster analysis has been applied to find patterns in the atmospheric pressure of polar regions and areas of the ocean that have a significant impact on land climate.
• Psychology and Medicine. An illness or condition frequently has a number of variations, and cluster analysis can be used to identify these different subcategories. For example, clustering has been used to identify different types of depression. Cluster analysis can also be used to detect patterns in the spatial or temporal distribution of a disease.
• Business. Businesses collect large amounts of information on current and potential customers. Clustering can be used to segment customers into a small number of groups for additional analysis and marketing activities.
Clustering for Utility Cluster analysis provides an abstraction from individual data objects to the clusters in which those data objects reside. Additionally, some clustering techniques characterize each cluster in terms of a cluster prototype; i.e., a data object that is representative of the other objects in the cluster. These cluster prototypes can be used as the basis for a number of data analysis or data processing techniques.
Therefore, in the context of utility, cluster analysis is the study of techniques for finding the most representative cluster prototypes.
• Summarization. Many data analysis techniques, such as regression or PCA, have a time or space complexity of O(m^2) or higher (where m is the number of objects), and thus, are not practical for large data sets. However, instead of applying the algorithm to the entire data set, it can be applied to a reduced data set consisting only of cluster prototypes. Depending on the type of analysis, the number of prototypes, and the accuracy with which the prototypes represent the data, the results can be comparable to those that would have been obtained if all the data could have been used.
• Compression. Cluster prototypes can also be used for data compression. In particular, a table is created that consists of the prototypes for each cluster; i.e., each prototype is assigned an integer value that is its position (index) in the table. Each object is represented by the index of the prototype associated with its cluster. This type of compression is known as vector quantization and is often applied to image, sound, and video data, where (1) many of the data objects are highly similar to one another, (2) some loss of information is acceptable, and (3) a substantial reduction in the data size is desired.
• Efficiently Finding Nearest Neighbors. Finding nearest neighbors can require computing the pairwise distance between all points. Often clusters and their cluster prototypes can be found much more efficiently. If objects are relatively close to the prototype of their cluster, then we can use the prototypes to reduce the number of distance computations that are necessary to find the nearest neighbors of an object. Intuitively, if two cluster prototypes are far apart, then the objects in the corresponding clusters cannot be nearest neighbors of each other. Consequently, to find an object's nearest neighbors it is only necessary to compute the distance to objects in nearby clusters, where the nearness of two clusters is measured by the distance between their prototypes. This idea is made more precise in Exercise 25 on page 94.
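The pruning idea in the last bullet can be sketched in a few lines of Python. The fragment below is only an illustration (the function name, the NumPy usage, and the n_probe parameter are ours, not from the text); it assumes the clusters, their integer labels, and their prototypes have already been computed, and it searches only the few clusters whose prototypes are closest to the query. As a heuristic, it can miss the true nearest neighbor if too few clusters are probed.

import numpy as np

def nearest_neighbor_via_prototypes(query, points, labels, prototypes, n_probe=2):
    # Distances from the query point to every cluster prototype.
    proto_dist = np.linalg.norm(prototypes - query, axis=1)
    # Search only the n_probe clusters whose prototypes are closest to the query.
    candidate_clusters = np.argsort(proto_dist)[:n_probe]
    candidates = points[np.isin(labels, candidate_clusters)]
    # Exact search restricted to points in the nearby clusters.
    return candidates[np.argmin(np.linalg.norm(candidates - query, axis=1))]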
This chapter provides an introduction to cluster analysis. We begin with a high-level overview of clustering, including a discussion of the various approaches to dividing objects into sets of clusters and the different types of clusters. We then describe three specific clustering techniques that represent
broad categories of algorithms and illustrate a variety of concepts: K-means, agglomerative hierarchical clustering, and DBSCAN. The final section of this chapter is devoted to cluster validity—methods for evaluating the goodness of the clusters produced by a clustering algorithm. More advanced clustering concepts and algorithms will be discussed in Chapter 9. Whenever possible, we discuss the strengths and weaknesses of different schemes. In addition, the bibliographic notes provide references to relevant books and papers that explore cluster analysis in greater depth.
8.1 Overview
Before discussing specific clustering techniques, we provide some necessary background. First, we further define cluster analysis, illustrating why it is difficult and explaining its relationship to other techniques that group data. Then we explore two important topics: (1) different ways to group a set of objects into a set of clusters, and (2) types of clusters.
8.1.1 What Is Cluster Analysis?
Cluster analysis groups data objects based only on information found in the data that describes the objects and their relationships. The goal is that the objects within a group be similar (or related) to one another and different from (or unrelated to) the objects in other groups. The greater the similarity (or homogeneity) within a group and the greater the difference between groups, the better or more distinct the clustering.
In many applications, the notion of a cluster is not well defined. To better understand the difficulty of deciding what constitutes a cluster, consider Figure 8.1, which shows twenty points and three different ways of dividing them into clusters. The shapes of the markers indicate cluster membership. Figures 8.1(b) and 8.1(d) divide the data into two and six parts, respectively. However, the apparent division of each of the two larger clusters into three subclusters may simply be an artifact of the human visual system. Also, it may not be unreasonable to say that the points form four clusters, as shown in Figure 8.1(c). This figure illustrates that the definition of a cluster is imprecise and that the best definition depends on the nature of data and the desired results.
Cluster analysis is related to other techniques that are used to divide data objects into groups. For instance, clustering can be regarded as a form of classification in that it creates a labeling of objects with class (cluster) labels. However, it derives these labels only from the data.
Figure 8.1. Different ways of clustering the same set of points: (a) original points, (b) two clusters, (c) four clusters, (d) six clusters.
In contrast, classification in the sense of Chapter 4 is supervised classification; i.e., new, unlabeled objects are assigned a class label using a model developed from objects with known class labels. For this reason, cluster analysis is sometimes referred to as unsupervised classification. When the term classification is used without any qualification within data mining, it typically refers to supervised classification.
Also, while the terms segmentation and partitioning are sometimes used as synonyms for clustering, these terms are frequently used for approaches outside the traditional bounds of cluster analysis. For example, the term partitioning is often used in connection with techniques that divide graphs into subgraphs and that are not strongly connected to clustering. Segmentation often refers to the division of data into groups using simple techniques; e.g., an image can be split into segments based only on pixel intensity and color, or people can be divided into groups based on their income. Nonetheless, some work in graph partitioning and in image and market segmentation is related to cluster analysis.
8.1.2 Different Types of Clusterings
An entire collection of clusters is commonly referred to as a clustering, and in this section, we distinguish various types of clusterings: hierarchical (nested) versus partitional (unnested), exclusive versus overlapping versus fuzzy, and complete versus partial.
Hierarchical versus Partitional The most commonly discussed distinction among different types of clusterings is whether the set of clusters is nested or unnested, or in more traditional terminology, hierarchical or partitional.
A partitional clustering is simply a division of the set of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset. Taken individually, each collection of clusters in Figures 8.1(b–d) is a partitional clustering.
If we permit clusters to have subclusters, then we obtain a hierarchical clustering, which is a set of nested clusters that are organized as a tree. Each node (cluster) in the tree (except for the leaf nodes) is the union of its children (subclusters), and the root of the tree is the cluster containing all the objects. Often, but not always, the leaves of the tree are singleton clusters of individual data objects. If we allow clusters to be nested, then one interpretation of Figure 8.1(a) is that it has two subclusters (Figure 8.1(b)), each of which, in turn, has three subclusters (Figure 8.1(d)). The clusters shown in Figures 8.1(a–d), when taken in that order, also form a hierarchical (nested) clustering with, respectively, 1, 2, 4, and 6 clusters on each level. Finally, note that a hierarchical clustering can be viewed as a sequence of partitional clusterings and a partitional clustering can be obtained by taking any member of that sequence; i.e., by cutting the hierarchical tree at a particular level.
Exclusive versus Overlapping versus Fuzzy The clusterings shown in Figure 8.1 are all exclusive, as they assign each object to a single cluster. There are many situations in which a point could reasonably be placed in more than one cluster, and these situations are better addressed by non-exclusive clustering. In the most general sense, an overlapping or non-exclusive clustering is used to reflect the fact that an object can simultaneously belong to more than one group (class). For instance, a person at a university can be both an enrolled student and an employee of the university. A non-exclusive clustering is also often used when, for example, an object is "between" two or more clusters and could reasonably be assigned to any of these clusters. Imagine a point halfway between two of the clusters of Figure 8.1. Rather than make a somewhat arbitrary assignment of the object to a single cluster, it is placed in all of the "equally good" clusters.
In a fuzzy clustering, every object belongs to every cluster with a membership weight that is between 0 (absolutely doesn't belong) and 1 (absolutely belongs). In other words, clusters are treated as fuzzy sets. (Mathematically, a fuzzy set is one in which an object belongs to any set with a weight that is between 0 and 1. In fuzzy clustering, we often impose the additional constraint that the sum of the weights for each object must equal 1.) Similarly, probabilistic clustering techniques compute the probability with which each point belongs to each cluster, and these probabilities must also sum to 1.
Because the membership weights or probabilities for any object sum to 1, a fuzzy or probabilistic clustering does not address true multiclass situations, such as the case of a student employee, where an object belongs to multiple classes. Instead, these approaches are most appropriate for avoiding the arbitrariness of assigning an object to only one cluster when it may be close to several. In practice, a fuzzy or probabilistic clustering is often converted to an exclusive clustering by assigning each object to the cluster in which its membership weight or probability is highest.
Complete versus Partial A complete clustering assigns every object to a cluster, whereas a partial clustering does not. The motivation for a partial clustering is that some objects in a data set may not belong to well-defined groups. Many times objects in the data set may represent noise, outliers, or "uninteresting background." For example, some newspaper stories may share a common theme, such as global warming, while other stories are more generic or one-of-a-kind. Thus, to find the important topics in last month's stories, we may want to search only for clusters of documents that are tightly related by a common theme. In other cases, a complete clustering of the objects is desired. For example, an application that uses clustering to organize documents for browsing needs to guarantee that all documents can be browsed.
8.1.3 Different Types of Clusters
Clustering aims to find useful groups of objects (clusters), where usefulness is defined by the goals of the data analysis. Not surprisingly, there are several different notions of a cluster that prove useful in practice. In order to visually illustrate the differences among these types of clusters, we use two-dimensional points, as shown in Figure 8.2, as our data objects. We stress, however, that the types of clusters described here are equally valid for other kinds of data.
Well-Separated A cluster is a set of objects in which each object is closer (or more similar) to every other object in the cluster than to any object not in the cluster. Sometimes a threshold is used to specify that all the objects in a cluster must be sufficiently close (or similar) to one another. This idealistic definition of a cluster is satisfied only when the data contains natural clusters that are quite far from each other. Figure 8.2(a) gives an example of well-separated clusters that consists of two groups of points in a two-dimensional space.
The distance between any two points in different groups is larger than the distance between any two points within a group. Well-separated clusters do not need to be globular, but can have any shape.
Prototype-Based A cluster is a set of objects in which each object is closer (more similar) to the prototype that defines the cluster than to the prototype of any other cluster. For data with continuous attributes, the prototype of a cluster is often a centroid, i.e., the average (mean) of all the points in the cluster. When a centroid is not meaningful, such as when the data has categorical attributes, the prototype is often a medoid, i.e., the most representative point of a cluster. For many types of data, the prototype can be regarded as the most central point, and in such instances, we commonly refer to prototype-based clusters as center-based clusters. Not surprisingly, such clusters tend to be globular. Figure 8.2(b) shows an example of center-based clusters.
Graph-Based If the data is represented as a graph, where the nodes are objects and the links represent connections among objects (see Section 2.1.2), then a cluster can be defined as a connected component; i.e., a group of objects that are connected to one another, but that have no connection to objects outside the group. An important example of graph-based clusters are contiguity-based clusters, where two objects are connected only if they are within a specified distance of each other. This implies that each object in a contiguity-based cluster is closer to some other object in the cluster than to any point in a different cluster. Figure 8.2(c) shows an example of such clusters for two-dimensional points. This definition of a cluster is useful when clusters are irregular or intertwined, but can have trouble when noise is present since, as illustrated by the two spherical clusters of Figure 8.2(c), a small bridge of points can merge two distinct clusters.
Other types of graph-based clusters are also possible. One such approach (Section 8.3.2) defines a cluster as a clique; i.e., a set of nodes in a graph that are completely connected to each other. Specifically, if we add connections between objects in the order of their distance from one another, a cluster is formed when a set of objects forms a clique. Like prototype-based clusters, such clusters tend to be globular.
Density-Based A cluster is a dense region of objects that is surrounded by a region of low density. Figure 8.2(d) shows some density-based clusters for data created by adding noise to the data of Figure 8.2(c). The two circular clusters are not merged, as in Figure 8.2(c), because the bridge between them fades into the noise.
Likewise, the curve that is present in Figure 8.2(c) also fades into the noise and does not form a cluster in Figure 8.2(d). A density-based definition of a cluster is often employed when the clusters are irregular or intertwined, and when noise and outliers are present. By contrast, a contiguity-based definition of a cluster would not work well for the data of Figure 8.2(d) since the noise would tend to form bridges between clusters.
Shared-Property (Conceptual Clusters) More generally, we can define a cluster as a set of objects that share some property. This definition encompasses all the previous definitions of a cluster; e.g., objects in a center-based cluster share the property that they are all closest to the same centroid or medoid. However, the shared-property approach also includes new types of clusters. Consider the clusters shown in Figure 8.2(e). A triangular area (cluster) is adjacent to a rectangular one, and there are two intertwined circles (clusters). In both cases, a clustering algorithm would need a very specific concept of a cluster to successfully detect these clusters. The process of finding such clusters is called conceptual clustering. However, too sophisticated a notion of a cluster would take us into the area of pattern recognition, and thus, we only consider simpler types of clusters in this book.
Road Map
In this chapter, we use the following three simple, but important techniques to introduce many of the concepts involved in cluster analysis.
• K-means. This is a prototype-based, partitional clustering technique that attempts to find a user-specified number of clusters (K), which are represented by their centroids.
• Agglomerative Hierarchical Clustering. This clustering approach refers to a collection of closely related clustering techniques that produce a hierarchical clustering by starting with each point as a singleton cluster and then repeatedly merging the two closest clusters until a single, all-encompassing cluster remains. Some of these techniques have a natural interpretation in terms of graph-based clustering, while others have an interpretation in terms of a prototype-based approach.
• DBSCAN. This is a density-based clustering algorithm that produces a partitional clustering, in which the number of clusters is automatically determined by the algorithm. Points in low-density regions are classified as noise and omitted; thus, DBSCAN does not produce a complete clustering.
Figure 8.2. Different types of clusters as illustrated by sets of two-dimensional points. (a) Well-separated clusters: each point is closer to all of the points in its cluster than to any point in another cluster. (b) Center-based clusters: each point is closer to the center of its cluster than to the center of any other cluster. (c) Contiguity-based clusters: each point is closer to at least one point in its cluster than to any point in another cluster. (d) Density-based clusters: clusters are regions of high density separated by regions of low density. (e) Conceptual clusters: points in a cluster share some general property that derives from the entire set of points. (Points in the intersection of the circles belong to both.)
8.2 K-means
Prototype-based clustering techniques create a one-level partitioning of the data objects. There are a number of such techniques, but two of the most prominent are K-means and K-medoid.
K-means defines a prototype in terms of a centroid, which is usually the mean of a group of points, and is typically applied to objects in a continuous n-dimensional space. K-medoid defines a prototype in terms of a medoid, which is the most representative point for a group of points, and can be applied to a wide range of data since it requires only a proximity measure for a pair of objects. While a centroid almost never corresponds to an actual data point, a medoid, by its definition, must be an actual data point. In this section, we will focus solely on K-means, which is one of the oldest and most widely used clustering algorithms.
8.2.1 The Basic K-means Algorithm
The K-means clustering technique is simple, and we begin with a description of the basic algorithm. We first choose K initial centroids, where K is a user-specified parameter, namely, the number of clusters desired. Each point is then assigned to the closest centroid, and each collection of points assigned to a centroid is a cluster. The centroid of each cluster is then updated based on the points assigned to the cluster. We repeat the assignment and update steps until no point changes clusters, or equivalently, until the centroids remain the same.
K-means is formally described by Algorithm 8.1. The operation of K-means is illustrated in Figure 8.3, which shows how, starting from three centroids, the final clusters are found in four assignment-update steps. In these and other figures displaying K-means clustering, each subfigure shows (1) the centroids at the start of the iteration and (2) the assignment of the points to those centroids. The centroids are indicated by the "+" symbol; all points belonging to the same cluster have the same marker shape.
Algorithm 8.1 Basic K-means algorithm.
1: Select K points as initial centroids.
2: repeat
3:   Form K clusters by assigning each point to its closest centroid.
4:   Recompute the centroid of each cluster.
5: until Centroids do not change.
In the first step, shown in Figure 8.3(a), points are assigned to the initial centroids, which are all in the larger group of points. For this example, we use the mean as the centroid. After points are assigned to a centroid, the centroid is then updated. Again, the figure for each step shows the centroid at the beginning of the step and the assignment of points to those centroids.
Figure 8.3. Using the K-means algorithm to find three clusters in sample data. (a) Iteration 1. (b) Iteration 2. (c) Iteration 3. (d) Iteration 4.
In the second step, points are assigned to the updated centroids, and the centroids are updated again. In steps 2, 3, and 4, which are shown in Figures 8.3(b), (c), and (d), respectively, two of the centroids move to the two small groups of points at the bottom of the figures. When the K-means algorithm terminates in Figure 8.3(d), because no more changes occur, the centroids have identified the natural groupings of points.
For some combinations of proximity functions and types of centroids, K-means always converges to a solution; i.e., K-means reaches a state in which no points are shifting from one cluster to another, and hence, the centroids don't change. Because most of the convergence occurs in the early steps, however, the condition on line 5 of Algorithm 8.1 is often replaced by a weaker condition, e.g., repeat until only 1% of the points change clusters.
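A minimal sketch of Algorithm 8.1 in Python with NumPy is shown below. It is not taken from the text; it simply follows the assign and update steps described above, uses Euclidean distance with the mean as the centroid, starts from randomly chosen data points, and stops when the centroids no longer change (or after a maximum number of iterations). Empty clusters simply keep their old centroid in this sketch.

import numpy as np

def basic_kmeans(points, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: select K points as initial centroids (randomly chosen data points).
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: form K clusters by assigning each point to its closest centroid.
        dist = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Step 4: recompute the centroid of each cluster (keep the old centroid if empty).
        new_centroids = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # Step 5: stop when the centroids do not change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels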
We consider each of the steps in the basic K-means algorithm in more detail and then provide an analysis of the algorithm's space and time complexity.
Assigning Points to the Closest Centroid
To assign a point to the closest centroid, we need a proximity measure that quantifies the notion of "closest" for the specific data under consideration. Euclidean (L2) distance is often used for data points in Euclidean space, while cosine similarity is more appropriate for documents. However, there may be several types of proximity measures that are appropriate for a given type of data. For example, Manhattan (L1) distance can be used for Euclidean data, while the Jaccard measure is often employed for documents.
Usually, the similarity measures used for K-means are relatively simple since the algorithm repeatedly calculates the similarity of each point to each centroid.
Table 8.1. Table of notation.
Symbol   Description
x        An object.
Ci       The ith cluster.
ci       The centroid of cluster Ci.
c        The centroid of all points.
mi       The number of objects in the ith cluster.
m        The number of objects in the data set.
K        The number of clusters.
In some cases, however, such as when the data is in low-dimensional Euclidean space, it is possible to avoid computing many of the similarities, thus significantly speeding up the K-means algorithm. Bisecting K-means (described in Section 8.2.3) is another approach that speeds up K-means by reducing the number of similarities computed.
Centroids and Objective Functions
Step 4 of the K-means algorithm was stated rather generally as "recompute the centroid of each cluster," since the centroid can vary, depending on the proximity measure for the data and the goal of the clustering. The goal of the clustering is typically expressed by an objective function that depends on the proximities of the points to one another or to the cluster centroids; e.g., minimize the squared distance of each point to its closest centroid. We illustrate this with two examples. However, the key point is this: once we have specified a proximity measure and an objective function, the centroid that we should choose can often be determined mathematically. We provide mathematical details in Section 8.2.6, and provide a non-mathematical discussion of this observation here.
Data in Euclidean Space Consider data whose proximity measure is Euclidean distance. For our objective function, which measures the quality of a clustering, we use the sum of the squared error (SSE), which is also known as scatter. In other words, we calculate the error of each data point, i.e., its Euclidean distance to the closest centroid, and then compute the total sum of the squared errors. Given two different sets of clusters that are produced by two different runs of K-means, we prefer the one with the smallest squared error since this means that the prototypes (centroids) of this clustering are a better representation of the points in their cluster. Using the notation in Table 8.1, the SSE is formally defined as follows:
SSE = \sum_{i=1}^{K} \sum_{x \in C_i} dist(c_i, x)^2        (8.1)
where dist is the standard Euclidean (L2) distance between two objects in Euclidean space.
Given these assumptions, it can be shown (see Section 8.2.6) that the centroid that minimizes the SSE of the cluster is the mean. Using the notation in Table 8.1, the centroid (mean) of the ith cluster is defined by Equation 8.2.
c_i = \frac{1}{m_i} \sum_{x \in C_i} x        (8.2)
To illustrate, the centroid of a cluster containing the three two-dimensional points, (1,1), (2,3), and (6,2), is ((1 + 2 + 6)/3, (1 + 3 + 2)/3) = (3, 2).
Steps 3 and 4 of the K-means algorithm directly attempt to minimize the SSE (or more generally, the objective function). Step 3 forms clusters by assigning points to their nearest centroid, which minimizes the SSE for the given set of centroids. Step 4 recomputes the centroids so as to further minimize the SSE. However, the actions of K-means in Steps 3 and 4 are only guaranteed to find a local minimum with respect to the SSE since they are based on optimizing the SSE for specific choices of the centroids and clusters, rather than for all possible choices. We will later see an example in which this leads to a suboptimal clustering.
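Equations 8.1 and 8.2 are easy to check numerically. The following short sketch (our own illustration) computes the centroid and the SSE for the single cluster of three points used in the example above.

import numpy as np

points = np.array([[1.0, 1.0], [2.0, 3.0], [6.0, 2.0]])
centroid = points.mean(axis=0)    # Equation 8.2: the mean, which is (3, 2) here
# Equation 8.1 restricted to one cluster: sum of squared Euclidean distances to the centroid.
sse = np.sum(np.linalg.norm(points - centroid, axis=1) ** 2)
print(centroid, sse)              # [3. 2.] 16.0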
Document Data To illustrate that K-means is not restricted to data in Euclidean space, we consider document data and the cosine similarity measure. Here we assume that the document data is represented as a document-term matrix as described on page 31. Our objective is to maximize the similarity of the documents in a cluster to the cluster centroid; this quantity is known as the cohesion of the cluster. For this objective it can be shown that the cluster centroid is, as for Euclidean data, the mean. The analogous quantity to the total SSE is the total cohesion, which is given by Equation 8.3.
Total Cohesion = \sum_{i=1}^{K} \sum_{x \in C_i} cosine(x, c_i)        (8.3)
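As a small illustration (ours, not from the text), Equation 8.3 can be computed as follows for a dense document-term matrix docs and a cluster assignment labels; the sketch assumes no document vector is all zeros.

import numpy as np

def total_cohesion(docs, labels, k):
    cohesion = 0.0
    for j in range(k):
        cluster = docs[labels == j]
        c = cluster.mean(axis=0)  # the cluster centroid is the mean, as noted above
        # Sum of cosine similarities of each document in the cluster to its centroid.
        sims = cluster @ c / (np.linalg.norm(cluster, axis=1) * np.linalg.norm(c))
        cohesion += sims.sum()
    return cohesion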
The General Case There are a number of choices for the proximity function, centroid, and objective function that can be used in the basic K-means algorithm and that are guaranteed to converge.
Table 8.2. K-means: Common choices for proximity, centroids, and objective functions.
Proximity Function        Centroid   Objective Function
Manhattan (L1)            median     Minimize sum of the L1 distance of an object to its cluster centroid
Squared Euclidean (L2^2)  mean       Minimize sum of the squared L2 distance of an object to its cluster centroid
cosine                    mean       Maximize sum of the cosine similarity of an object to its cluster centroid
Bregman divergence        mean       Minimize sum of the Bregman divergence of an object to its cluster centroid
Table 8.2 shows some possible choices, including the two that we have just discussed. Notice that for Manhattan (L1) distance and the objective of minimizing the sum of the distances, the appropriate centroid is the median of the points in a cluster.
The last entry in the table, Bregman divergence (Section 2.4.5), is actually a class of proximity measures that includes the squared Euclidean distance, L2^2, the Mahalanobis distance, and cosine similarity. The importance of Bregman divergence functions is that any such function can be used as the basis of a K-means style clustering algorithm with the mean as the centroid. Specifically, if we use a Bregman divergence as our proximity function, then the resulting clustering algorithm has the usual properties of K-means with respect to convergence, local minima, etc. Furthermore, the properties of such a clustering algorithm can be developed for all possible Bregman divergences. Indeed, K-means algorithms that use cosine similarity or squared Euclidean distance are particular instances of a general clustering algorithm based on Bregman divergences.
For the rest of our K-means discussion, we use two-dimensional data since it is easy to explain K-means and its properties for this type of data. But, as suggested by the last few paragraphs, K-means is a very general clustering algorithm and can be used with a wide variety of data types, such as documents and time series.
Choosing Initial Centroids
When random initialization of centroids is used, different runs of K-means typically produce different total SSEs. We illustrate this with the set of two-dimensional points shown in Figure 8.3, which has three natural clusters of points.
Figure 8.4. Three optimal and non-optimal clusters. (a) Optimal clustering. (b) Suboptimal clustering.
Figure 8.4(a) shows a clustering solution that is the global minimum of the SSE for three clusters, while Figure 8.4(b) shows a suboptimal clustering that is only a local minimum.
Choosing the proper initial centroids is the key step of the basic K-means procedure. A common approach is to choose the initial centroids randomly, but the resulting clusters are often poor.
Example 8.1 (Poor Initial Centroids). Randomly selected initial centroids may be poor. We provide an example of this using the same data set used in Figures 8.3 and 8.4. Figures 8.3 and 8.5 show the clusters that result from two particular choices of initial centroids. (For both figures, the positions of the cluster centroids in the various iterations are indicated by crosses.) In Figure 8.3, even though all the initial centroids are from one natural cluster, the minimum SSE clustering is still found. In Figure 8.5, however, even though the initial centroids seem to be better distributed, we obtain a suboptimal clustering, with higher squared error.
Example 8.2 (Limits of Random Initialization). One technique that is commonly used to address the problem of choosing initial centroids is to perform multiple runs, each with a different set of randomly chosen initial centroids, and then select the set of clusters with the minimum SSE. While simple, this strategy may not work very well, depending on the data set and the number of clusters sought. We demonstrate this using the sample data set shown in Figure 8.6(a). The data consists of two pairs of clusters, where the clusters in each (top-bottom) pair are closer to each other than to the clusters in the other pair.
Figure 8.5. Poor starting centroids for K-means. (a) Iteration 1. (b) Iteration 2. (c) Iteration 3. (d) Iteration 4.
Figure 8.6(b–d) shows that if we start with two initial centroids per pair of clusters, then even when both centroids are in a single cluster, the centroids will redistribute themselves so that the "true" clusters are found. However, Figure 8.7 shows that if a pair of clusters has only one initial centroid and the other pair has three, then two of the true clusters will be combined and one true cluster will be split.
Note that an optimal clustering will be obtained as long as two initial centroids fall anywhere in a pair of clusters, since the centroids will redistribute themselves, one to each cluster. Unfortunately, as the number of clusters becomes larger, it is increasingly likely that at least one pair of clusters will have only one initial centroid. (See Exercise 4 on page 559.) In this case, because the pairs of clusters are farther apart than clusters within a pair, the K-means algorithm will not redistribute the centroids between pairs of clusters, and thus, only a local minimum will be achieved.
Because of the problems with using randomly selected initial centroids, which even repeated runs may not overcome, other techniques are often employed for initialization. One effective approach is to take a sample of points and cluster them using a hierarchical clustering technique. K clusters are extracted from the hierarchical clustering, and the centroids of those clusters are used as the initial centroids. This approach often works well, but is practical only if (1) the sample is relatively small, e.g., a few hundred to a few thousand (hierarchical clustering is expensive), and (2) K is relatively small compared to the sample size.
The following procedure is another approach to selecting initial centroids. Select the first point at random or take the centroid of all points. Then, for each successive initial centroid, select the point that is farthest from any of the initial centroids already selected.
Figure 8.6. Two pairs of clusters with a pair of initial centroids within each pair of clusters. (a) Initial points. (b) Iteration 1. (c) Iteration 2. (d) Iteration 3.
In this way, we obtain a set of initial centroids that is guaranteed to be not only randomly selected but also well separated. Unfortunately, such an approach can select outliers, rather than points in dense regions (clusters). Also, it is expensive to compute the farthest point from the current set of initial centroids. To overcome these problems, this approach is often applied to a sample of the points. Since outliers are rare, they tend not to show up in a random sample. In contrast, points from every dense region are likely to be included unless the sample size is very small. Also, the computation involved in finding the initial centroids is greatly reduced because the sample size is typically much smaller than the number of points.
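The farthest-point procedure just described might look as follows in Python (an illustrative sketch, applied to a sample of the points as suggested above; the names are ours).

import numpy as np

def farthest_first_centroids(sample, k, seed=0):
    rng = np.random.default_rng(seed)
    # First centroid: a random point (the centroid of all points would also work).
    centroids = [sample[rng.integers(len(sample))]]
    for _ in range(k - 1):
        # Distance from every point to its closest already-chosen centroid.
        d = np.min([np.linalg.norm(sample - c, axis=1) for c in centroids], axis=0)
        # Next centroid: the point farthest from the centroids chosen so far.
        centroids.append(sample[np.argmax(d)])
    return np.array(centroids)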
Later on, we will discuss two other approaches that are useful for producing better-quality (lower SSE) clusterings: using a variant of K-means that is less susceptible to initialization problems (bisecting K-means) and using postprocessing to "fix up" the set of clusters produced.
Figure 8.7. Two pairs of clusters with more or fewer than two initial centroids within a pair of clusters. (a) Iteration 1. (b) Iteration 2. (c) Iteration 3. (d) Iteration 4.
Time and Space Complexity
The space requirements for K-means are modest because only the data points and centroids are stored. Specifically, the storage required is O((m + K)n), where m is the number of points and n is the number of attributes. The time requirements for K-means are also modest—basically linear in the number of data points. In particular, the time required is O(I ∗ K ∗ m ∗ n), where I is the number of iterations required for convergence. As mentioned, I is often small and can usually be safely bounded, as most changes typically occur in the first few iterations.
Therefore, K-means is linear in m, the number of points, and is efficient as well as simple provided that K, the number of clusters, is significantly less than m.
8.2.2 K-means: Additional Issues
Handling Empty Clusters
One of the problems with the basic K-means algorithm given earlier is that empty clusters can be obtained if no points are allocated to a cluster during the assignment step. If this happens, then a strategy is needed to choose a replacement centroid, since otherwise, the squared error will be larger than necessary. One approach is to choose the point that is farthest away from any current centroid. If nothing else, this eliminates the point that currently contributes most to the total squared error. Another approach is to choose the replacement centroid from the cluster that has the highest SSE. This will typically split the cluster and reduce the overall SSE of the clustering. If there are several empty clusters, then this process can be repeated several times.
Outliers
When the squared error criterion is used, outliers can unduly influence the clusters that are found. In particular, when outliers are present, the resulting cluster centroids (prototypes) may not be as representative as they otherwise would be and thus, the SSE will be higher as well. Because of this, it is often useful to discover outliers and eliminate them beforehand. It is important, however, to appreciate that there are certain clustering applications for which outliers should not be eliminated. When clustering is used for data compression, every point must be clustered, and in some cases, such as financial analysis, apparent outliers, e.g., unusually profitable customers, can be the most interesting points.
An obvious issue is how to identify outliers. A number of techniques for identifying outliers will be discussed in Chapter 10. If we use approaches that remove outliers before clustering, we avoid clustering points that will not cluster well. Alternatively, outliers can also be identified in a postprocessing step. For instance, we can keep track of the SSE contributed by each point, and eliminate those points with unusually high contributions, especially over multiple runs. Also, we may want to eliminate small clusters since they frequently represent groups of outliers.
Reducing the SSE with Postprocessing
An obvious way to reduce the SSE is to find more clusters, i.e., to use a larger K. However, in many cases, we would like to improve the SSE, but don't want to increase the number of clusters. This is often possible because K-means typically converges to a local minimum. Various techniques are used to "fix up" the resulting clusters in order to produce a clustering that has lower SSE. The strategy is to focus on individual clusters since the total SSE is simply the sum of the SSE contributed by each cluster. (We will use the terminology total SSE and cluster SSE, respectively, to avoid any potential confusion.) We can change the total SSE by performing various operations on the clusters, such as splitting or merging clusters. One commonly used approach is to use alternate cluster splitting and merging phases. During a splitting phase, clusters are divided, while during a merging phase, clusters are combined. In this way, it is often possible to escape local SSE minima and still produce a clustering solution with the desired number of clusters. The following are some techniques used in the splitting and merging phases.
Two strategies that decrease the total SSE by increasing the number of clusters are the following:
Split a cluster: The cluster with the largest SSE is usually chosen, but we could also split the cluster with the largest standard deviation for one particular attribute.
Introduce a new cluster centroid: Often the point that is farthest from any cluster center is chosen. We can easily determine this if we keep track of the SSE contributed by each point. Another approach is to choose randomly from all points or from the points with the highest SSE.
Two strategies that decrease the number of clusters, while trying to minimize the increase in total SSE, are the following:
Disperse a cluster: This is accomplished by removing the centroid that corresponds to the cluster and reassigning the points to other clusters. Ideally, the cluster that is dispersed should be the one that increases the total SSE the least.
Merge two clusters: The clusters with the closest centroids are typically chosen, although another, perhaps better, approach is to merge the two clusters that result in the smallest increase in total SSE. These two merging strategies are the same ones that are used in the hierarchical clustering techniques known as the centroid method and Ward's method, respectively. Both methods are discussed in Section 8.3.
Updating Centroids Incrementally
Instead of updating cluster centroids after all points have been assigned to a cluster, the centroids can be updated incrementally, after each assignment of a point to a cluster. Notice that this requires either zero or two updates to cluster centroids at each step, since a point either moves to a new cluster (two updates) or stays in its current cluster (zero updates). Using an incremental update strategy guarantees that empty clusters are not produced since all clusters start with a single point, and if a cluster ever has only one point, then that point will always be reassigned to the same cluster.
In addition, if incremental updating is used, the relative weight of the point being added may be adjusted; e.g., the weight of points is often decreased as the clustering proceeds. While this can result in better accuracy and faster convergence, it can be difficult to make a good choice for the relative weight, especially in a wide variety of situations. These update issues are similar to those involved in updating weights for artificial neural networks.
Yet another benefit of incremental updates has to do with using objectives other than "minimize SSE." Suppose that we are given an arbitrary objective function to measure the goodness of a set of clusters. When we process an individual point, we can compute the value of the objective function for each possible cluster assignment, and then choose the one that optimizes the objective. Specific examples of alternative objective functions are given in Section 8.5.2.
On the negative side, updating centroids incrementally introduces an order dependency. In other words, the clusters produced may depend on the order in which the points are processed. Although this can be addressed by randomizing the order in which the points are processed, the basic K-means approach of updating the centroids after all points have been assigned to clusters has no order dependency. Also, incremental updates are slightly more expensive. However, K-means converges rather quickly, and therefore, the number of points switching clusters quickly becomes relatively small.
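The two-update bookkeeping described above can be pictured with the following sketch (ours, not from the text). It reassigns a single point to its closest centroid and adjusts the running means of the old and new clusters; centroids is a float array and counts holds the current cluster sizes.

import numpy as np

def incremental_reassign(x, current, centroids, counts):
    # Closest centroid for point x.
    new = int(np.argmin(np.linalg.norm(centroids - x, axis=1)))
    if new != current:
        # Two updates: remove x from the old cluster's running mean ...
        counts[current] -= 1
        if counts[current] > 0:
            centroids[current] -= (x - centroids[current]) / counts[current]
        # ... and add it to the new cluster's running mean.
        counts[new] += 1
        centroids[new] += (x - centroids[new]) / counts[new]
    return new   # zero updates are performed when the point stays in its cluster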
8.2.3 Bisecting K-means
The bisecting K-means algorithm is a straightforward extension of the basic K-means algorithm that is based on a simple idea: to obtain K clusters, split the set of all points into two clusters, select one of these clusters to split, and so on, until K clusters have been produced.
The details of bisecting K-means are given by Algorithm 8.2.
Algorithm 8.2 Bisecting K-means algorithm.
1: Initialize the list of clusters to contain the cluster consisting of all points.
2: repeat
3:   Remove a cluster from the list of clusters.
4:   {Perform several "trial" bisections of the chosen cluster.}
5:   for i = 1 to number of trials do
6:     Bisect the selected cluster using basic K-means.
7:   end for
8:   Select the two clusters from the bisection with the lowest total SSE.
9:   Add these two clusters to the list of clusters.
10: until The list of clusters contains K clusters.
There are a number of different ways to choose which cluster to split. We can choose the largest cluster at each step, choose the one with the largest SSE, or use a criterion based on both size and SSE. Different choices result in different clusters.
We often refine the resulting clusters by using their centroids as the initial centroids for the basic K-means algorithm. This is necessary because, although the K-means algorithm is guaranteed to find a clustering that represents a local minimum with respect to the SSE, in bisecting K-means we are using the K-means algorithm "locally," i.e., to bisect individual clusters. Therefore, the final set of clusters does not represent a clustering that is a local minimum with respect to the total SSE.
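A compact, self-contained sketch of Algorithm 8.2 is given below (illustrative Python, not from the text). It bisects the cluster with the largest SSE, one of the criteria mentioned above, and keeps the best of several trial bisections; it assumes every cluster chosen for splitting has at least two points.

import numpy as np

def sse(points, centroid):
    return float(np.sum((points - centroid) ** 2))

def two_means(points, n_trials=5, seed=0):
    # Bisect one cluster with basic 2-means; return the split with the lowest total SSE.
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_trials):
        c = points[rng.choice(len(points), size=2, replace=False)]
        for _ in range(100):
            labels = np.argmin(np.linalg.norm(points[:, None] - c[None, :], axis=2), axis=1)
            new_c = np.array([points[labels == j].mean(axis=0) if np.any(labels == j) else c[j]
                              for j in range(2)])
            if np.allclose(new_c, c):
                break
            c = new_c
        total = sum(sse(points[labels == j], c[j]) for j in range(2))
        if best is None or total < best[0]:
            best = (total, labels)
    return best[1]

def bisecting_kmeans(points, k):
    clusters = [points]                      # start with one all-inclusive cluster
    while len(clusters) < k:
        # Choose the cluster with the largest SSE to split (one possible criterion).
        i = int(np.argmax([sse(cl, cl.mean(axis=0)) for cl in clusters]))
        chosen = clusters.pop(i)
        labels = two_means(chosen)
        clusters.extend([chosen[labels == 0], chosen[labels == 1]])
    return clusters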
Example 8.3 (Bisecting K-means and Initialization). To illustrate that bisecting K-means is less susceptible to initialization problems, we show, in Figure 8.8, how bisecting K-means finds four clusters in the data set originally shown in Figure 8.6(a). In iteration 1, two pairs of clusters are found; in iteration 2, the rightmost pair of clusters is split; and in iteration 3, the leftmost pair of clusters is split. Bisecting K-means has less trouble with initialization because it performs several trial bisections and takes the one with the lowest SSE, and because there are only two centroids at each step.
Finally, by recording the sequence of clusterings produced as K-means bisects clusters, we can also use bisecting K-means to produce a hierarchical clustering.
Figure 8.8. Bisecting K-means on the four clusters example. (a) Iteration 1. (b) Iteration 2. (c) Iteration 3.
8.2.4 K-means and Different Types of Clusters
K-means and its variations have a number of limitations with respect to finding different types of clusters. In particular, K-means has difficulty detecting the "natural" clusters, when clusters have non-spherical shapes or widely different sizes or densities. This is illustrated by Figures 8.9, 8.10, and 8.11. In Figure 8.9, K-means cannot find the three natural clusters because one of the clusters is much larger than the other two, and hence, the larger cluster is broken, while one of the smaller clusters is combined with a portion of the larger cluster. In Figure 8.10, K-means fails to find the three natural clusters because the two smaller clusters are much denser than the larger cluster. Finally, in Figure 8.11, K-means finds two clusters that mix portions of the two natural clusters because the shape of the natural clusters is not globular.
The difficulty in these three situations is that the K-means objective function is a mismatch for the kinds of clusters we are trying to find since it is minimized by globular clusters of equal size and density or by clusters that are well separated. However, these limitations can be overcome, in some sense, if the user is willing to accept a clustering that breaks the natural clusters into a number of subclusters. Figure 8.12 shows what happens to the three previous data sets if we find six clusters instead of two or three. Each smaller cluster is pure in the sense that it contains only points from one of the natural clusters.
8.2.5 Strengths and Weaknesses
K-means is simple and can be used for a wide variety of data types. It is also quite efficient, even though multiple runs are often performed. Some variants, including bisecting K-means, are even more efficient, and are less susceptible to initialization problems.
Figure 8.9. K-means with clusters of different size. (a) Original points. (b) Three K-means clusters.
Figure 8.10. K-means with clusters of different density. (a) Original points. (b) Three K-means clusters.
Figure 8.11. K-means with non-globular clusters. (a) Original points. (b) Two K-means clusters.
Figure 8.12. Using K-means to find clusters that are subclusters of the natural clusters. (a) Unequal sizes. (b) Unequal densities. (c) Non-spherical shapes.
K-means is not suitable for all types of data, however. It cannot handle non-globular clusters or clusters of different sizes and densities, although it can typically find pure subclusters if a large enough number of clusters is specified. K-means also has trouble clustering data that contains outliers. Outlier detection and removal can help significantly in such situations. Finally, K-means is restricted to data for which there is a notion of a center (centroid). A related technique, K-medoid clustering, does not have this restriction, but is more expensive.
8.2.6 K-means as an Optimization Problem
Here, we delve into the mathematics behind K-means. This section, which can be skipped without loss of continuity, requires knowledge of calculus through partial derivatives. Familiarity with optimization techniques, especially those based on gradient descent, may also be helpful.
As mentioned earlier, given an objective function such as "minimize SSE," clustering can be treated as an optimization problem. One way to solve this problem—to find a global optimum—is to enumerate all possible ways of dividing the points into clusters and then choose the set of clusters that best satisfies the objective function, e.g., that minimizes the total SSE. Of course, this exhaustive strategy is computationally infeasible and as a result, a more practical approach is needed, even if such an approach finds solutions that are not guaranteed to be optimal. One technique, which is known as gradient descent, is based on picking an initial solution and then repeating the following two steps: compute the change to the solution that best optimizes the objective function and then update the solution.
We assume that the data is one-dimensional, i.e., dist(x, y) = (x − y)^2. This does not change anything essential, but greatly simplifies the notation.
Derivation of K-means as an Algorithm to Minimize the SSE
In this section, we show how the centroid for the K-means algorithm can be mathematically derived when the proximity function is Euclidean distance and the objective is to minimize the SSE. Specifically, we investigate how we can best update a cluster centroid so that the cluster SSE is minimized. In mathematical terms, we seek to minimize Equation 8.1, which we repeat here, specialized for one-dimensional data.
SSE = \sum_{i=1}^{K} \sum_{x \in C_i} (c_i - x)^2        (8.4)
Here, Ci is the ith cluster, x is a point in Ci, and ci is the mean of the ith cluster. See Table 8.1 for a complete list of notation.
We can solve for the kth centroid ck, which minimizes Equation 8.4, by differentiating the SSE, setting it equal to 0, and solving, as indicated below.
\frac{\partial}{\partial c_k} SSE = \frac{\partial}{\partial c_k} \sum_{i=1}^{K} \sum_{x \in C_i} (c_i - x)^2
                                  = \sum_{i=1}^{K} \sum_{x \in C_i} \frac{\partial}{\partial c_k} (c_i - x)^2
                                  = \sum_{x \in C_k} 2 (c_k - x_k) = 0
\sum_{x \in C_k} 2 (c_k - x_k) = 0 \;\Rightarrow\; m_k c_k = \sum_{x \in C_k} x_k \;\Rightarrow\; c_k = \frac{1}{m_k} \sum_{x \in C_k} x_k
Thus, as previously indicated, the best centroid for minimizing the SSE of a cluster is the mean of the points in the cluster.
Derivation of K-means for SAE
To demonstrate that the K-means algorithm can be applied to a variety of different objective functions, we consider how to partition the data into K clusters such that the sum of the Manhattan (L1) distances of points from the center of their clusters is minimized. We are seeking to minimize the sum of the L1 absolute errors (SAE) as given by the following equation, where distL1 is the L1 distance. Again, for notational simplicity, we use one-dimensional data, i.e., distL1 = |ci − x|.
SAE = \sum_{i=1}^{K} \sum_{x \in C_i} dist_{L_1}(c_i, x)        (8.5)
We can solve for the kth centroid ck, which minimizes Equation 8.5, by differentiating the SAE, setting it equal to 0, and solving.
\frac{\partial}{\partial c_k} SAE = \frac{\partial}{\partial c_k} \sum_{i=1}^{K} \sum_{x \in C_i} |c_i - x|
                                  = \sum_{i=1}^{K} \sum_{x \in C_i} \frac{\partial}{\partial c_k} |c_i - x|
                                  = \sum_{x \in C_k} \frac{\partial}{\partial c_k} |c_k - x| = 0
\sum_{x \in C_k} \frac{\partial}{\partial c_k} |c_k - x| = 0 \;\Rightarrow\; \sum_{x \in C_k} sign(x - c_k) = 0
If we solve for ck, we find that ck = median{x ∈ Ck}, the median of the points in the cluster. The median of a group of points is straightforward to compute and less susceptible to distortion by outliers.
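The claim about the median is easy to verify numerically, as in the following one-dimensional sketch (ours), which compares the SAE at the median with the SAE at the mean for a small cluster containing an outlier.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # a one-dimensional cluster with an outlier

def sae(c):
    return np.sum(np.abs(x - c))

# The median (3.0) gives SAE = 101, while the mean (22.0) gives SAE = 156,
# so the median is the better centroid under the L1 objective.
print(sae(np.median(x)), sae(np.mean(x)))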
8.3 Agglomerative Hierarchical Clustering
Hierarchical clustering techniques are a second important category of clustering methods. As with K-means, these approaches are relatively old compared to many clustering algorithms, but they still enjoy widespread use. There are two basic approaches for generating a hierarchical clustering:
Agglomerative: Start with the points as individual clusters and, at each step, merge the closest pair of clusters. This requires defining a notion of cluster proximity.
Divisive: Start with one, all-inclusive cluster and, at each step, split a cluster until only singleton clusters of individual points remain. In this case, we need to decide which cluster to split at each step and how to do the splitting.
Agglomerative hierarchical clustering techniques are by far the most common, and, in this section, we will focus exclusively on these methods. A divisive hierarchical clustering technique is described in Section 9.4.2.
A hierarchical clustering is often displayed graphically using a tree-like diagram called a dendrogram, which displays both the cluster-subcluster relationships and the order in which the clusters were merged (agglomerative view) or split (divisive view).
Figure 8.13. A hierarchical clustering of four points shown as a dendrogram and as nested clusters. (a) Dendrogram. (b) Nested cluster diagram.
For sets of two-dimensional points, such as those that we will use as examples, a hierarchical clustering can also be graphically represented using a nested cluster diagram. Figure 8.13 shows an example of these two types of figures for a set of four two-dimensional points. These points were clustered using the single-link technique that is described in Section 8.3.2.
8.3.1 Basic Agglomerative Hierarchical Clustering Algorithm
Many agglomerative hierarchical clustering techniques are variations on a single approach: starting with individual points as clusters, successively merge the two closest clusters until only one cluster remains. This approach is expressed more formally in Algorithm 8.3.
Algorithm 8.3 Basic agglomerative hierarchical clustering algorithm.
1: Compute the proximity matrix, if necessary.
2: repeat
3:   Merge the closest two clusters.
4:   Update the proximity matrix to reflect the proximity between the new cluster and the original clusters.
5: until Only one cluster remains.
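Algorithm 8.3 can be sketched in a few lines of Python (our own illustration, not from the text). The version below uses the single link (MIN) proximity discussed in Section 8.3.2 and the naive linear search over the proximity matrix analyzed later in this section; it takes a symmetric distance matrix and returns the sequence of merges.

import numpy as np

def single_link_agglomerative(dist):
    m = len(dist)
    d = dist.astype(float).copy()
    np.fill_diagonal(d, np.inf)               # ignore self-distances
    active = list(range(m))                   # indices of the current clusters
    members = {i: [i] for i in range(m)}
    merges = []
    while len(active) > 1:
        # Step 3: merge the closest two clusters (linear search of the proximity matrix).
        sub = d[np.ix_(active, active)]
        i, j = np.unravel_index(np.argmin(sub), sub.shape)
        a, b = active[i], active[j]
        merges.append((members[a], members[b], d[a, b]))
        members[a] = members[a] + members[b]
        # Step 4: update the proximity matrix; single link keeps the minimum distance.
        d[a, :] = np.minimum(d[a, :], d[b, :])
        d[:, a] = d[a, :]
        d[a, a] = np.inf
        active.remove(b)
    return merges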
Defining Proximity between Clusters
The key operation of Algorithm 8.3 is the computation of the proximity between two clusters, and it is the definition of cluster proximity that differentiates the various agglomerative hierarchical techniques that we will discuss. Cluster proximity is typically defined with a particular type of cluster in mind—see Section 8.1.2. For example, many agglomerative hierarchical clustering techniques, such as MIN, MAX, and Group Average, come from a graph-based view of clusters. MIN defines cluster proximity as the proximity between the closest two points that are in different clusters, or using graph terms, the shortest edge between two nodes in different subsets of nodes. This yields contiguity-based clusters as shown in Figure 8.2(c). Alternatively, MAX takes the proximity between the farthest two points in different clusters to be the cluster proximity, or using graph terms, the longest edge between two nodes in different subsets of nodes. (If our proximities are distances, then the names, MIN and MAX, are short and suggestive. For similarities, however, where higher values indicate closer points, the names seem reversed. For that reason, we usually prefer to use the alternative names, single link and complete link, respectively.) Another graph-based approach, the group average technique, defines cluster proximity to be the average pairwise proximities (average length of edges) of all pairs of points from different clusters. Figure 8.14 illustrates these three approaches.
Figure 8.14. Graph-based definitions of cluster proximity. (a) MIN (single link). (b) MAX (complete link). (c) Group average.
If, instead, we take a prototype-based view, in which each cluster is represented by a centroid, different definitions of cluster proximity are more natural. When using centroids, the cluster proximity is commonly defined as the proximity between cluster centroids. An alternative technique, Ward's method, also assumes that a cluster is represented by its centroid, but it measures the proximity between two clusters in terms of the increase in the SSE that results from merging the two clusters.
Like K-means, Ward's method attempts to minimize the sum of the squared distances of points from their cluster centroids.
Time and Space Complexity
The basic agglomerative hierarchical clustering algorithm just presented uses a proximity matrix. This requires the storage of (1/2)m^2 proximities (assuming the proximity matrix is symmetric), where m is the number of data points. The space needed to keep track of the clusters is proportional to the number of clusters, which is m − 1, excluding singleton clusters. Hence, the total space complexity is O(m^2).
The analysis of the basic agglomerative hierarchical clustering algorithm is also straightforward with respect to computational complexity. O(m^2) time is required to compute the proximity matrix. After that step, there are m − 1 iterations involving steps 3 and 4 because there are m clusters at the start and two clusters are merged during each iteration. If performed as a linear search of the proximity matrix, then for the ith iteration, step 3 requires O((m − i + 1)^2) time, which is proportional to the current number of clusters squared. Step 4 only requires O(m − i + 1) time to update the proximity matrix after the merger of two clusters. (A cluster merger affects only O(m − i + 1) proximities for the techniques that we consider.) Without modification, this would yield a time complexity of O(m^3). If the distances from each cluster to all other clusters are stored as a sorted list (or heap), it is possible to reduce the cost of finding the two closest clusters to O(m − i + 1). However, because of the additional complexity of keeping data in a sorted list or heap, the overall time required for a hierarchical clustering based on Algorithm 8.3 is O(m^2 log m).
The space and time complexity of hierarchical clustering severely limits the size of data sets that can be processed. We discuss scalability approaches for clustering algorithms, including hierarchical clustering techniques, in Section 9.5.
8.3.2 Specific Techniques
Sample Data
To illustrate the behavior of the various hierarchical clustering algorithms, we shall use sample data that consists of 6 two-dimensional points, which are shown in Figure 8.15. The x and y coordinates of the points and the Euclidean distances between them are shown in Tables 8.3 and 8.4, respectively.
Figure 8.15. Set of 6 two-dimensional points.
Table 8.3. xy coordinates of 6 points.
Point   x Coordinate   y Coordinate
p1      0.40           0.53
p2      0.22           0.38
p3      0.35           0.32
p4      0.26           0.19
p5      0.08           0.41
p6      0.45           0.30
Table 8.4. Euclidean distance matrix for 6 points.
       p1     p2     p3     p4     p5     p6
p1     0.00   0.24   0.22   0.37   0.34   0.23
p2     0.24   0.00   0.15   0.20   0.14   0.25
p3     0.22   0.15   0.00   0.15   0.28   0.11
p4     0.37   0.20   0.15   0.00   0.29   0.22
p5     0.34   0.14   0.28   0.29   0.00   0.39
p6     0.23   0.25   0.11   0.22   0.39   0.00
Single Link or MIN
For the single link or MIN version of hierarchical clustering, the proximity of two clusters is defined as the minimum of the distance (maximum of the similarity) between any two points in the two different clusters. Using graph terminology, if you start with all points as singleton clusters and add links between points one at a time, shortest links first, then these single links combine the points into clusters. The single link technique is good at handling non-elliptical shapes, but is sensitive to noise and outliers.
Example 8.4 (Single Link). Figure 8.16 shows the result of applying the single link technique to our example data set of six points. Figure 8.16(a) shows the nested clusters as a sequence of nested ellipses, where the numbers associated with the ellipses indicate the order of the clustering. Figure 8.16(b) shows the same information, but as a dendrogram. The height at which two clusters are merged in the dendrogram reflects the distance of the two clusters. For instance, from Table 8.4, we see that the distance between points 3 and 6 is 0.11, and that is the height at which they are joined into one cluster in the dendrogram.
Figure 8.16. Single link clustering of the six points shown in Figure 8.15. (a) Single link clustering. (b) Single link dendrogram.
As another example, the distance between clusters {3,6} and {2,5} is given by
dist({3,6}, {2,5}) = min(dist(3,2), dist(6,2), dist(3,5), dist(6,5))
                   = min(0.15, 0.25, 0.28, 0.39) = 0.15.
Complete Link or MAX or CLIQUE
For the complete link or MAX version of hierarchical clustering, the proximity of two clusters is defined as the maximum of the distance (minimum of the similarity) between any two points in the two different clusters. Using graph terminology, if you start with all points as singleton clusters and add links between points one at a time, shortest links first, then a group of points is not a cluster until all the points in it are completely linked, i.e., form a clique. Complete link is less susceptible to noise and outliers, but it can break large clusters and it favors globular shapes.
Example 8.5 (Complete Link). Figure 8.17 shows the results of applying MAX to the sample data set of six points. As with single link, points 3 and 6 are merged first.
Figure 8.17. Complete link clustering of the six points shown in Figure 8.15. (a) Complete link clustering. (b) Complete link dendrogram.
However, {3,6} is merged with {4}, instead of {2,5} or {1} because
dist({3,6}, {4}) = max(dist(3,4), dist(6,4))
                 = max(0.15, 0.22) = 0.22.
dist({3,6}, {2,5}) = max(dist(3,2), dist(6,2), dist(3,5), dist(6,5))
                   = max(0.15, 0.25, 0.28, 0.39) = 0.39.
dist({3,6}, {1}) = max(dist(3,1), dist(6,1))
                 = max(0.22, 0.23) = 0.23.
Group Average
For the group average version of hierarchical clustering, the proximity of two clusters is defined as the average pairwise proximity among all pairs of points in the different clusters. This is an intermediate approach between the single and complete link approaches.
Figure 8.18. Group average clustering of the six points shown in Figure 8.15. (a) Group average clustering. (b) Group average dendrogram.
Thus, for group average, the cluster proximity proximity(Ci, Cj) of clusters Ci and Cj, which are of size mi and mj, respectively, is expressed by the following equation:
proximity(C_i, C_j) = \frac{\sum_{x \in C_i,\, y \in C_j} proximity(x, y)}{m_i \, m_j}        (8.6)
Example 8.6 (Group Average). Figure 8.18 shows the results of applying the group average approach to the sample data set of six points. To illustrate how group average works, we calculate the distance between some clusters.
dist({3,6,4}, {1}) = (0.22 + 0.37 + 0.23)/(3 ∗ 1) = 0.28
dist({2,5}, {1}) = (0.2357 + 0.3421)/(2 ∗ 1) = 0.28
dist({3,6,4}, {2,5}) = (0.15 + 0.28 + 0.25 + 0.39 + 0.20 + 0.29)/(3 ∗ 2) = 0.26
Because dist({3,6,4}, {2,5}) is smaller than dist({3,6,4}, {1}) and dist({2,5}, {1}), clusters {3,6,4} and {2,5} are merged at the fourth stage.
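Equation 8.6 translates directly into code. The sketch below (ours, not from the text) computes the group average proximity of two clusters from a precomputed distance matrix; applied to the matrix of Table 8.4 it reproduces the values in Example 8.6 up to the rounding of the stored distances.

import numpy as np

def group_average(dist, ci, cj):
    # Average pairwise distance between clusters ci and cj (lists of point indices).
    return dist[np.ix_(ci, cj)].sum() / (len(ci) * len(cj))

# Distance matrix from Table 8.4; points are 0-indexed here, so p1 -> 0, ..., p6 -> 5.
D = np.array([[0.00, 0.24, 0.22, 0.37, 0.34, 0.23],
              [0.24, 0.00, 0.15, 0.20, 0.14, 0.25],
              [0.22, 0.15, 0.00, 0.15, 0.28, 0.11],
              [0.37, 0.20, 0.15, 0.00, 0.29, 0.22],
              [0.34, 0.14, 0.28, 0.29, 0.00, 0.39],
              [0.23, 0.25, 0.11, 0.22, 0.39, 0.00]])

print(group_average(D, [2, 5, 3], [0]))       # {3,6,4} vs {1}: about 0.27
print(group_average(D, [1, 4], [0]))          # {2,5} vs {1}: about 0.29
print(group_average(D, [2, 5, 3], [1, 4]))    # {3,6,4} vs {2,5}: 0.26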
Figure 8.19. Ward's clustering of the six points shown in Figure 8.15: (a) nested cluster diagram, (b) dendrogram.
Ward's Method and Centroid Methods
ForWard’smethod,theproximitybetweentwoclustersisdefinedasthein-creaseinthesquarederrorthatresultswhentwoclustersaremerged.Thus,thismethodusesthesameobjectivefunctionasK-meansclustering.WhileitmayseemthatthisfeaturemakesWard’smethodsomewhatdistinctfromotherhierarchicaltechniques,itcanbeshownmathematicallythatWard’smethodisverysimilartothegroupaveragemethodwhentheproximitybe-tweentwopointsistakentobethesquareofthedistancebetweenthem.Example8.7(Ward’sMethod).Figure8.19showstheresultsofapplyingWard’smethodtothesampledatasetofsixpoints.Theclusteringthatisproducedisdifferentfromthoseproducedbysinglelink,completelink,andgroupaverage.
Centroidmethodscalculatetheproximitybetweentwoclustersbycalcu-latingthedistancebetweenthecentroidsofclusters.ThesetechniquesmayseemsimilartoK-means,butaswehaveremarked,Ward’smethodisthecorrecthierarchicalanalog.
Centroidmethodsalsohaveacharacteristic—oftenconsideredbad—thatisnotpossessedbytheotherhierarchicalclusteringtechniquesthatwehavediscussed:thepossibilityofinversions.Specifically,twoclustersthataremergedmaybemoresimilar(lessdistant)thanthepairofclustersthatweremergedinapreviousstep.Fortheothermethods,thedistancebetween
Table 8.5. Table of Lance-Williams coefficients for common hierarchical clustering approaches.

Clustering Method | αA | αB | β | γ
Single Link       | 1/2 | 1/2 | 0 | −1/2
Complete Link     | 1/2 | 1/2 | 0 | 1/2
Group Average     | mA/(mA + mB) | mB/(mA + mB) | 0 | 0
Centroid          | mA/(mA + mB) | mB/(mA + mB) | −mA mB/(mA + mB)² | 0
Ward's            | (mA + mQ)/(mA + mB + mQ) | (mB + mQ)/(mA + mB + mQ) | −mQ/(mA + mB + mQ) | 0
mergedclustersmonotonicallyincreases(oris,atworst,non-increasing)asweproceedfromsingletonclusterstooneall-inclusivecluster.
8.3.3 The Lance-Williams Formula for Cluster Proximity
Anyoftheclusterproximitiesthatwehavediscussedinthissectioncanbeviewedasachoiceofdifferentparameters(intheLance-WilliamsformulashownbelowinEquation8.7)fortheproximitybetweenclustersQandR,whereRisformedbymergingclustersAandB.Inthisequation,p(.,.)isaproximityfunction,whilemA,mB,andmQarethenumberofpointsinclustersA,B,andQ,respectively.Inotherwords,afterwemergeclustersAandBtoformclusterR,theproximityofthenewcluster,R,toanexistingcluster,Q,isalinearfunctionoftheproximitiesofQwithrespecttotheoriginalclustersAandB.Table8.5showsthevaluesofthesecoefficientsforthetechniquesthatwehavediscussed.
p(R, Q) = αA p(A, Q) + αB p(B, Q) + β p(A, B) + γ |p(A, Q) − p(B, Q)|     (8.7)

Any hierarchical clustering technique that can be expressed using the Lance-Williams formula does not need to keep the original data points. Instead, the proximity matrix is updated as clustering occurs. While a general formula is appealing, especially for implementation, it is easier to understand the different hierarchical methods by looking directly at the definition of cluster proximity that each method uses.
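A minimal sketch of one Lance-Williams update, assuming the proximities are distances; the function and argument names are ours. With the single link coefficients from Table 8.5 the update reduces to the minimum of the two old proximities.

```python
# Sketch of a single Lance-Williams proximity update (Equation 8.7).
def lance_williams(p_AQ, p_BQ, p_AB, alpha_A, alpha_B, beta, gamma):
    """p(R, Q) for R = A merged with B, as a linear function of the old proximities."""
    return alpha_A * p_AQ + alpha_B * p_BQ + beta * p_AB + gamma * abs(p_AQ - p_BQ)

# Single link: alpha_A = alpha_B = 1/2, beta = 0, gamma = -1/2, which reduces to min.
p_RQ = lance_williams(p_AQ=0.15, p_BQ=0.25, p_AB=0.11,
                      alpha_A=0.5, alpha_B=0.5, beta=0.0, gamma=-0.5)
print(p_RQ)  # ~0.15, i.e., min(0.15, 0.25)
```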
8.3.4 Key Issues in Hierarchical Clustering
Lack of a Global Objective Function
Wepreviouslymentionedthatagglomerativehierarchicalclusteringcannotbeviewedasgloballyoptimizinganobjectivefunction.Instead,agglomerativehierarchicalclusteringtechniquesusevariouscriteriatodecidelocally,ateach
step,whichclustersshouldbemerged(orsplitfordivisiveapproaches).Thisapproachyieldsclusteringalgorithmsthatavoidthedifficultyofattemptingtosolveahardcombinatorialoptimizationproblem.(Itcanbeshownthatthegeneralclusteringproblemforanobjectivefunctionsuchas“minimizeSSE”iscomputationallyinfeasible.)Furthermore,suchapproachesdonothaveproblemswithlocalminimaordifficultiesinchoosinginitialpoints.Ofcourse,thetimecomplexityofO(m2logm)andthespacecomplexityofO(m2)areprohibitiveinmanycases.
Ability to Handle Different Cluster Sizes
Oneaspectofagglomerativehierarchicalclusteringthatwehavenotyetdis-cussedishowtotreattherelativesizesofthepairsofclustersthataremerged.(Thisdiscussionappliesonlytoclusterproximityschemesthatinvolvesums,suchascentroid,Ward’s,andgroupaverage.)Therearetwoapproaches:weighted,whichtreatsallclustersequally,andunweighted,whichtakesthenumberofpointsineachclusterintoaccount.Notethattheterminologyofweightedorunweightedreferstothedatapoints,nottheclusters.Inotherwords,treatingclustersofunequalsizeequallygivesdifferentweightstothepointsindifferentclusters,whiletakingtheclustersizeintoaccountgivespointsindifferentclustersthesameweight.
We will illustrate this using the group average technique discussed in Section 8.3.2, which is the unweighted version of the group average technique. In the clustering literature, the full name of this approach is the Unweighted Pair Group Method using Arithmetic averages (UPGMA). In Table 8.5, which gives the formula for updating cluster similarity, the coefficients for UPGMA involve the size of each of the clusters that were merged: αA = mA/(mA + mB), αB = mB/(mA + mB), β = 0, γ = 0. For the weighted version of group average—known as WPGMA—the coefficients are constants: αA = 1/2, αB = 1/2, β = 0, γ = 0. In general, unweighted approaches are preferred unless there is reason to believe that individual points should have different weights; e.g., perhaps classes of objects have been unevenly sampled.

Merging Decisions Are Final
Agglomerativehierarchicalclusteringalgorithmstendtomakegoodlocalde-cisionsaboutcombiningtwoclusterssincetheycanuseinformationaboutthepairwisesimilarityofallpoints.However,onceadecisionismadetomergetwoclusters,itcannotbeundoneatalatertime.Thisapproachpreventsalocaloptimizationcriterionfrombecomingaglobaloptimizationcriterion.
Forexample,althoughthe“minimizesquarederror”criterionfromK-meansisusedindecidingwhichclusterstomergeinWard’smethod,theclustersateachleveldonotrepresentlocalminimawithrespecttothetotalSSE.Indeed,theclustersarenotevenstable,inthesensethatapointinoneclustermaybeclosertothecentroidofsomeotherclusterthanitistothecentroidofitscurrentcluster.Nonetheless,Ward’smethodisoftenusedasarobustmethodofinitializingaK-meansclustering,indicatingthatalocal“minimizesquarederror”objectivefunctiondoeshaveaconnectiontoaglobal“minimizesquarederror”objectivefunction.
Therearesometechniquesthatattempttoovercomethelimitationthatmergesarefinal.Oneapproachattemptstofixupthehierarchicalclusteringbymovingbranchesofthetreearoundsoastoimproveaglobalobjectivefunction.AnotherapproachusesapartitionalclusteringtechniquesuchasK-meanstocreatemanysmallclusters,andthenperformshierarchicalclusteringusingthesesmallclustersasthestartingpoint.
8.3.5 Strengths and Weaknesses
Thestrengthsandweaknessofspecificagglomerativehierarchicalclusteringalgorithmswerediscussedabove.Moregenerally,suchalgorithmsaretypi-callyusedbecausetheunderlyingapplication,e.g.,creationofataxonomy,requiresahierarchy.Also,therehavebeensomestudiesthatsuggestthatthesealgorithmscanproducebetter-qualityclusters.However,agglomerativehierarchicalclusteringalgorithmsareexpensiveintermsoftheircomputa-tionalandstoragerequirements.Thefactthatallmergesarefinalcanalsocausetroublefornoisy,high-dimensionaldata,suchasdocumentdata.Inturn,thesetwoproblemscanbeaddressedtosomedegreebyfirstpartiallyclusteringthedatausinganothertechnique,suchasK-means.
8.4 DBSCAN
Density-basedclusteringlocatesregionsofhighdensitythatareseparatedfromoneanotherbyregionsoflowdensity.DBSCANisasimpleandeffec-tivedensity-basedclusteringalgorithmthatillustratesanumberofimportantconceptsthatareimportantforanydensity-basedclusteringapproach.Inthissection,wefocussolelyonDBSCANafterfirstconsideringthekeynotionofdensity.Otheralgorithmsforfindingdensity-basedclustersaredescribedinthenextchapter.
8.4.1 Traditional Density: Center-Based Approach
Althoughtherearenotasmanyapproachesfordefiningdensityastherearefordefiningsimilarity,thereareseveraldistinctmethods.Inthissectionwedis-cussthecenter-basedapproachonwhichDBSCANisbased.OtherdefinitionsofdensitywillbepresentedinChapter9.
Inthecenter-basedapproach,densityisestimatedforaparticularpointinthedatasetbycountingthenumberofpointswithinaspecifiedradius,Eps,ofthatpoint.Thisincludesthepointitself.ThistechniqueisgraphicallyillustratedbyFigure8.20.ThenumberofpointswithinaradiusofEpsofpointAis7,includingAitself.
This method is simple to implement, but the density of any point will depend on the specified radius. For instance, if the radius is large enough, then all points will have a density of m, the number of points in the data set. Likewise, if the radius is too small, then all points will have a density of 1. An approach for deciding on the appropriate radius for low-dimensional data is given in the next section in the context of our discussion of DBSCAN.

Classification of Points According to Center-Based Density
Thecenter-basedapproachtodensityallowsustoclassifyapointasbeing(1)intheinteriorofadenseregion(acorepoint),(2)ontheedgeofadenseregion(aborderpoint),or(3)inasparselyoccupiedregion(anoiseorbackgroundpoint).Figure8.21graphicallyillustratestheconceptsofcore,border,andnoisepointsusingacollectionoftwo-dimensionalpoints.Thefollowingtextprovidesamoreprecisedescription.
Core points: These points are in the interior of a density-based cluster. A point is a core point if the number of points within a given neighborhood around the point, as determined by the distance function and a user-specified distance parameter, Eps, exceeds a certain threshold, MinPts, which is also a user-specified parameter. In Figure 8.21, point A is a core point, for the indicated radius (Eps), if MinPts ≤ 7.

Border points: A border point is not a core point, but falls within the neighborhood of a core point. In Figure 8.21, point B is a border point. A border point can fall within the neighborhoods of several core points.

Noise points: A noise point is any point that is neither a core point nor a border point. In Figure 8.21, point C is a noise point.
Figure 8.20. Center-based density.
Figure 8.21. Core, border, and noise points.
8.4.2 The DBSCAN Algorithm
Given the previous definitions of core points, border points, and noise points, the DBSCAN algorithm can be informally described as follows. Any two core points that are close enough—within a distance Eps of one another—are put in the same cluster. Likewise, any border point that is close enough to a core point is put in the same cluster as the core point. (Ties may need to be resolved if a border point is close to core points from different clusters.) Noise points are discarded. The formal details are given in Algorithm 8.4. This algorithm uses the same concepts and finds the same clusters as the original DBSCAN, but is optimized for simplicity, not efficiency.

Algorithm 8.4 DBSCAN algorithm.
1: Label all points as core, border, or noise points.
2: Eliminate noise points.
3: Put an edge between all core points that are within Eps of each other.
4: Make each group of connected core points into a separate cluster.
5: Assign each border point to one of the clusters of its associated core points.
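The following is a minimal sketch of Algorithm 8.4 in Python, assuming Euclidean distance and NumPy/SciPy; the function name and parameters (`eps`, `min_pts`) are ours, not part of the text, and the code favors clarity over efficiency, exactly as the algorithm statement does.

```python
import numpy as np
from scipy.spatial.distance import cdist

def dbscan(points, eps, min_pts):
    """Sketch of Algorithm 8.4: returns one cluster label per point (-1 = noise)."""
    m = len(points)
    d = cdist(points, points)                                  # full O(m^2) distance matrix
    neighbors = [np.where(d[i] <= eps)[0] for i in range(m)]   # Eps-neighborhoods (include self)

    # Step 1: a core point has at least min_pts points within Eps (itself included).
    core = np.array([len(nb) >= min_pts for nb in neighbors])

    # Steps 3-4: connect core points within Eps of each other; each connected
    # component of core points becomes a separate cluster.
    labels = np.full(m, -1)
    cluster_id = 0
    for i in range(m):
        if core[i] and labels[i] == -1:
            stack = [i]
            labels[i] = cluster_id
            while stack:
                j = stack.pop()
                for k in neighbors[j]:
                    if core[k] and labels[k] == -1:
                        labels[k] = cluster_id
                        stack.append(k)
            cluster_id += 1

    # Step 5: assign each border point to a cluster of one of its core neighbors;
    # points with no core neighbor keep the noise label (step 2).
    for i in range(m):
        if not core[i]:
            core_nb = [k for k in neighbors[i] if core[k]]
            if core_nb:
                labels[i] = labels[core_nb[0]]
    return labels
```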
Time and Space Complexity
ThebasictimecomplexityoftheDBSCANalgorithmisO(m×timetofindpointsintheEps-neighborhood),wheremisthenumberofpoints.Intheworstcase,thiscomplexityisO(m2).However,inlow-dimensionalspaces,therearedatastructures,suchaskd-trees,thatallowefficientretrievalofall
points within a given distance of a specified point, and the time complexity can be as low as O(m log m). The space requirement of DBSCAN, even for high-dimensional data, is O(m) because it is only necessary to keep a small amount of data for each point, i.e., the cluster label and the identification of each point as a core, border, or noise point.

Selection of DBSCAN Parameters
Thereis,ofcourse,theissueofhowtodeterminetheparametersEpsandMinPts.Thebasicapproachistolookatthebehaviorofthedistancefromapointtoitskthnearestneighbor,whichwewillcallthek-dist.Forpointsthatbelongtosomecluster,thevalueofk-distwillbesmallifkisnotlargerthantheclustersize.Notethattherewillbesomevariation,dependingonthedensityoftheclusterandtherandomdistributionofpoints,butonaverage,therangeofvariationwillnotbehugeiftheclusterdensitiesarenotradicallydifferent.However,forpointsthatarenotinacluster,suchasnoisepoints,thek-distwillberelativelylarge.Therefore,ifwecomputethek-distforallthedatapointsforsomek,sorttheminincreasingorder,andthenplotthesortedvalues,weexpecttoseeasharpchangeatthevalueofk-distthatcorrespondstoasuitablevalueofEps.IfweselectthisdistanceastheEpsparameterandtakethevalueofkastheMinPtsparameter,thenpointsforwhichk-distislessthanEpswillbelabeledascorepoints,whileotherpointswillbelabeledasnoiseorborderpoints.
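A sketch of the k-dist computation just described, assuming the data is an m × d NumPy array; the default k = 4 follows the value recommended in the next paragraph for two-dimensional data.

```python
# Sketch of the k-dist heuristic for choosing Eps: compute each point's distance to
# its k-th nearest neighbor, sort the values, and look for the sharp bend ("knee").
import numpy as np
from scipy.spatial.distance import cdist

def sorted_k_dist(points, k=4):
    d = cdist(points, points)
    d.sort(axis=1)               # row i ascending; column 0 is the distance to itself (0)
    return np.sort(d[:, k])      # k-th nearest neighbor distance, sorted for plotting

# Eps = value of sorted_k_dist(points) at the sharp increase; MinPts = k.
```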
Figure 8.22 shows a sample data set, while the k-dist graph for the data is given in Figure 8.23. The value of Eps that is determined in this way depends on k, but does not change dramatically as k changes. If the value of k is too small, then even a small number of closely spaced points that are noise or outliers will be incorrectly labeled as clusters. If the value of k is too large, then small clusters (of size less than k) are likely to be labeled as noise. The original DBSCAN algorithm used a value of k = 4, which appears to be a reasonable value for most two-dimensional data sets.

Clusters of Varying Density
DBSCANcanhavetroublewithdensityifthedensityofclustersvarieswidely.ConsiderFigure8.24,whichshowsfourclustersembeddedinnoise.Theden-sityoftheclustersandnoiseregionsisindicatedbytheirdarkness.Thenoisearoundthepairofdenserclusters,AandB,hasthesamedensityasclustersCandD.IftheEpsthresholdislowenoughthatDBSCANfindsCandDasclusters,thenAandBandthepointssurroundingthemwillbecomeasingle
Figure 8.22. Sample data.
Figure 8.23. K-dist plot for sample data (sorted distance to the 4th nearest neighbor for each point).
Figure 8.24. Four clusters (A, B, C, D) embedded in noise.
cluster. If the Eps threshold is high enough that DBSCAN finds A and B as separate clusters, and the points surrounding them are marked as noise, then C and D and the points surrounding them will also be marked as noise.

An Example
ToillustratetheuseofDBSCAN,weshowtheclustersthatitfindsintherelativelycomplicatedtwo-dimensionaldatasetshowninFigure8.22.Thisdatasetconsistsof3000two-dimensionalpoints.TheEpsthresholdforthisdatawasfoundbyplottingthesorteddistancesofthefourthnearestneighborofeachpoint(Figure8.23)andidentifyingthevalueatwhichthereisasharpincrease.WeselectedEps=10,whichcorrespondstothekneeofthecurve.TheclustersfoundbyDBSCANusingtheseparameters,i.e.,MinPts=4andEps=10,areshowninFigure8.25(a).Thecorepoints,borderpoints,andnoisepointsaredisplayedinFigure8.25(b).
8.4.3 Strengths and Weaknesses
BecauseDBSCANusesadensity-baseddefinitionofacluster,itisrelativelyresistanttonoiseandcanhandleclustersofarbitraryshapesandsizes.Thus,
Figure 8.25. DBSCAN clustering of 3000 two-dimensional points: (a) clusters found by DBSCAN; (b) core, border, and noise points ("+" marks border points, "x" marks noise points).
DBSCANcanfindmanyclustersthatcouldnotbefoundusingK-means,suchasthoseinFigure8.22.Asindicatedpreviously,however,DBSCANhastroublewhentheclustershavewidelyvaryingdensities.Italsohastroublewithhigh-dimensionaldatabecausedensityismoredifficulttodefineforsuchdata.OnepossibleapproachtodealingwithsuchissuesisgiveninSection9.4.8.Finally,DBSCANcanbeexpensivewhenthecomputationofnearestneighborsrequirescomputingallpairwiseproximities,asisusuallythecaseforhigh-dimensionaldata.
8.5 Cluster Evaluation
Insupervisedclassification,theevaluationoftheresultingclassificationmodelisanintegralpartoftheprocessofdevelopingaclassificationmodel,andtherearewell-acceptedevaluationmeasuresandprocedures,e.g.,accuracyandcross-validation,respectively.However,becauseofitsverynature,clusterevaluationisnotawell-developedorcommonlyusedpartofclusteranalysis.Nonetheless,clusterevaluation,orclustervalidationasitismoretradition-allycalled,isimportant,andthissectionwillreviewsomeofthemostcommonandeasilyappliedapproaches.
Theremightbesomeconfusionastowhyclusterevaluationisnecessary.Manytimes,clusteranalysisisconductedasapartofanexploratorydataanalysis.Hence,evaluationseemslikeanunnecessarilycomplicatedadditiontowhatissupposedtobeaninformalprocess.Furthermore,sincethereareanumberofdifferenttypesofclusters—insomesense,eachclusteringalgorithmdefinesitsowntypeofcluster—itmayseemthateachsituationmightrequireadifferentevaluationmeasure.Forinstance,K-meansclustersmightbeevaluatedintermsoftheSSE,butfordensity-basedclusters,whichneednotbeglobular,SSEwouldnotworkwellatall.
Nonetheless,clusterevaluationshouldbeapartofanyclusteranalysis.Akeymotivationisthatalmosteveryclusteringalgorithmwillfindclustersinadataset,evenifthatdatasethasnonaturalclusterstructure.Forinstance,considerFigure8.26,whichshowstheresultofclustering100pointsthatarerandomly(uniformly)distributedontheunitsquare.TheoriginalpointsareshowninFigure8.26(a),whiletheclustersfoundbyDBSCAN,K-means,andcompletelinkareshowninFigures8.26(b),8.26(c),and8.26(d),respectively.SinceDBSCANfoundthreeclusters(afterwesetEpsbylookingatthedistancesofthefourthnearestneighbors),wesetK-meansandcompletelinktofindthreeclustersaswell.(InFigure8.26(b)thenoiseisshownbythesmallmarkers.)However,theclustersdonotlookcompellingforanyof
thethreemethods.Inhigherdimensions,suchproblemscannotbesoeasilydetected.
8.5.1 Overview
Beingabletodistinguishwhetherthereisnon-randomstructureinthedataisjustoneimportantaspectofclustervalidation.Thefollowingisalistofseveralimportantissuesforclustervalidation.
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Determining the correct number of clusters.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information.
4. Comparing the results of a cluster analysis to externally known results, such as externally provided class labels.
5. Comparing two sets of clusters to determine which is better.
Noticethatitems1,2,and3donotmakeuseofanyexternalinformation—theyareunsupervisedtechniques—whileitem4requiresexternalinformation.Item5canbeperformedineitherasupervisedoranunsupervisedmanner.Afurtherdistinctioncanbemadewithrespecttoitems3,4,and5:Dowewanttoevaluatetheentireclusteringorjustindividualclusters?
Whileitispossibletodevelopvariousnumericalmeasurestoassessthedifferentaspectsofclustervaliditymentionedabove,thereareanumberofchallenges.First,ameasureofclustervaliditymaybequitelimitedinthescopeofitsapplicability.Forexample,mostworkonmeasuresofclusteringtendencyhasbeendonefortwo-orthree-dimensionalspatialdata.Second,weneedaframeworktointerpretanymeasure.Ifweobtainavalueof10forameasurethatevaluateshowwellclusterlabelsmatchexternallyprovidedclasslabels,doesthisvaluerepresentagood,fair,orpoormatch?Thegoodnessofamatchoftencanbemeasuredbylookingatthestatisticaldistributionofthisvalue,i.e.,howlikelyitisthatsuchavalueoccursbychance.Finally,ifameasureistoocomplicatedtoapplyortounderstand,thenfewwilluseit.
Theevaluationmeasures,orindices,thatareappliedtojudgevariousaspectsofclustervalidityaretraditionallyclassifiedintothefollowingthreetypes.
Figure 8.26. Clustering of 100 uniformly distributed points: (a) original points; (b) three clusters found by DBSCAN; (c) three clusters found by K-means; (d) three clusters found by complete link.
Unsupervised. Measures the goodness of a clustering structure without respect to external information. An example of this is the SSE. Unsupervised measures of cluster validity are often further divided into two classes: measures of cluster cohesion (compactness, tightness), which determine how closely related the objects in a cluster are, and measures of cluster separation (isolation), which determine how distinct or well-separated a cluster is from other clusters. Unsupervised measures are often called internal indices because they use only information present in the data set.

Supervised. Measures the extent to which the clustering structure discovered by a clustering algorithm matches some external structure. An example of a supervised index is entropy, which measures how well cluster labels match externally supplied class labels. Supervised measures are often called external indices because they use information not present in the data set.

Relative. Compares different clusterings or clusters. A relative cluster evaluation measure is a supervised or unsupervised evaluation measure that is used for the purpose of comparison. Thus, relative measures are not actually a separate type of cluster evaluation measure, but are instead a specific use of such measures. As an example, two K-means clusterings can be compared using either the SSE or entropy.

In the remainder of this section, we provide specific details concerning cluster validity. We first describe topics related to unsupervised cluster evaluation, beginning with (1) measures based on cohesion and separation, and (2) two techniques based on the proximity matrix. Since these approaches are useful only for partitional sets of clusters, we also describe the popular cophenetic correlation coefficient, which can be used for the unsupervised evaluation of a hierarchical clustering. We end our discussion of unsupervised evaluation with brief discussions about finding the correct number of clusters and evaluating clustering tendency. We then consider supervised approaches to cluster validity, such as entropy, purity, and the Jaccard measure. We conclude this section with a short discussion of how to interpret the values of (unsupervised or supervised) validity measures.
8.5.2 Unsupervised Cluster Evaluation Using Cohesion and Separation
Manyinternalmeasuresofclustervalidityforpartitionalclusteringschemesarebasedonthenotionsofcohesionorseparation.Inthissection,weuseclustervaliditymeasuresforprototype-andgraph-basedclusteringtechniquestoexplorethesenotionsinsomedetail.Intheprocess,wewillalsoseesomeinterestingrelationshipsbetweenprototype-andgraph-basedclustering.
Ingeneral,wecanconsiderexpressingoverallclustervalidityforasetofKclustersasaweightedsumofthevalidityofindividualclusters,
$$ \text{overall validity} = \sum_{i=1}^{K} w_i\, \text{validity}(C_i). \qquad (8.8) $$
The validity function can be cohesion, separation, or some combination of these quantities. The weights will vary depending on the cluster validity measure. In some cases, the weights are simply 1 or the size of the cluster, while in other cases they reflect a more complicated property, such as the square root of the cohesion. See Table 8.6. If the validity function is cohesion, then higher values are better. If it is separation, then lower values are better.

Graph-Based View of Cohesion and Separation
Forgraph-basedclusters,thecohesionofaclustercanbedefinedasthesumoftheweightsofthelinksintheproximitygraphthatconnectpointswithinthecluster.SeeFigure8.27(a).(Recallthattheproximitygraphhasdataobjectsasnodes,alinkbetweeneachpairofdataobjects,andaweightassignedtoeachlinkthatistheproximitybetweenthetwodataobjectsconnectedbythelink.)Likewise,theseparationbetweentwoclusterscanbemeasuredbythesumoftheweightsofthelinksfrompointsinoneclustertopointsintheothercluster.ThisisillustratedinFigure8.27(b).
Mathematically,cohesionandseparationforagraph-basedclustercanbeexpressedusingEquations8.9and8.10,respectively.Theproximityfunctioncanbeasimilarity,adissimilarity,orasimplefunctionofthesequantities.
$$ \text{cohesion}(C_i) = \sum_{\mathbf{x} \in C_i} \sum_{\mathbf{y} \in C_i} \text{proximity}(\mathbf{x}, \mathbf{y}) \qquad (8.9) $$

$$ \text{separation}(C_i, C_j) = \sum_{\mathbf{x} \in C_i} \sum_{\mathbf{y} \in C_j} \text{proximity}(\mathbf{x}, \mathbf{y}) \qquad (8.10) $$
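A sketch of Equations 8.9 and 8.10, assuming a precomputed m × m proximity matrix and clusters given as lists of point indices (these names are ours). Note that the literal double sum counts each unordered pair twice and includes each point's proximity to itself, exactly as the formulas are written.

```python
import numpy as np

def cohesion(prox, cluster):
    """Equation 8.9: sum of proximities between all pairs of points in the cluster."""
    idx = np.asarray(cluster)
    return prox[np.ix_(idx, idx)].sum()

def separation(prox, cluster_i, cluster_j):
    """Equation 8.10: sum of proximities between points in the two clusters."""
    return prox[np.ix_(np.asarray(cluster_i), np.asarray(cluster_j))].sum()
```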
Figure 8.27. Graph-based view of cluster cohesion (a) and separation (b).
Prototype-Based View of Cohesion and Separation
Forprototype-basedclusters,thecohesionofaclustercanbedefinedasthesumoftheproximitieswithrespecttotheprototype(centroidormedoid)ofthecluster.Similarly,theseparationbetweentwoclusterscanbemeasuredbytheproximityofthetwoclusterprototypes.ThisisillustratedinFigure8.28,wherethecentroidofaclusterisindicatedbya“+”.
Cohesionforaprototype-basedclusterisgiveninEquation8.11,whiletwomeasuresforseparationaregiveninEquations8.12and8.13,respec-tively,whereciistheprototype(centroid)ofclusterCiandcistheoverallprototype(centroid).Therearetwomeasuresforseparationbecause,aswewillseeshortly,theseparationofclusterprototypesfromanoverallprototypeissometimesdirectlyrelatedtotheseparationofclusterprototypesfromoneanother.NotethatEquation8.11istheclusterSSEifweletproximitybethesquaredEuclideandistance.
$$ \text{cohesion}(C_i) = \sum_{\mathbf{x} \in C_i} \text{proximity}(\mathbf{x}, \mathbf{c}_i) \qquad (8.11) $$

$$ \text{separation}(C_i, C_j) = \text{proximity}(\mathbf{c}_i, \mathbf{c}_j) \qquad (8.12) $$

$$ \text{separation}(C_i) = \text{proximity}(\mathbf{c}_i, \mathbf{c}) \qquad (8.13) $$
Overall Measures of Cohesion and Separation
Thepreviousdefinitionsofclustercohesionandseparationgaveussomesim-pleandwell-definedmeasuresofclustervaliditythatcanbecombinedintoanoverallmeasureofclustervaliditybyusingaweightedsum,asindicated
Figure 8.28. Prototype-based view of cluster cohesion (a) and separation (b); the centroid of each cluster is indicated by a "+".
inEquation8.8.However,weneedtodecidewhatweightstouse.Notsur-prisingly,theweightsusedcanvarywidely,althoughtypicallytheyaresomemeasureofclustersize.
Table8.6providesexamplesofvaliditymeasuresbasedoncohesionandseparation.I1isameasureofcohesionintermsofthepairwiseproximityofobjectsintheclusterdividedbytheclustersize.I2isameasureofcohesionbasedonthesumoftheproximitiesofobjectsintheclustertotheclustercentroid.E1isameasureofseparationdefinedastheproximityofaclustercentroidtotheoverallcentroidmultipliedbythenumberofobjectsinthecluster.G1,whichisameasurebasedonbothcohesionandseparation,isthesumofthepairwiseproximityofallobjectsintheclusterwithallobjectsoutsidethecluster—thetotalweightoftheedgesoftheproximitygraphthatmustbecuttoseparatetheclusterfromallotherclusters—dividedbythesumofthepairwiseproximityofobjectsinthecluster.
Table 8.6. Table of graph-based cluster evaluation measures.

Name | Cluster Measure | Cluster Weight | Type
I1 | Σ_{x∈Ci} Σ_{y∈Ci} proximity(x, y) | 1/mi | graph-based cohesion
I2 | Σ_{x∈Ci} proximity(x, ci) | 1 | prototype-based cohesion
E1 | proximity(ci, c) | mi | prototype-based separation
G1 | Σ_{j=1, j≠i}^{k} Σ_{x∈Ci} Σ_{y∈Cj} proximity(x, y) | 1 / (Σ_{x∈Ci} Σ_{y∈Ci} proximity(x, y)) | graph-based separation and cohesion
Notethatanyunsupervisedmeasureofclustervaliditypotentiallycanbeusedasanobjectivefunctionforaclusteringalgorithmandviceversa.TheCLUsteringTOolkit(CLUTO)(seethebibliographicnotes)usestheclusterevaluationmeasuresdescribedinTable8.6,aswellassomeotherevaluationmeasuresnotmentionedhere,todrivetheclusteringprocess.ItdoesthisbyusinganalgorithmthatissimilartotheincrementalK-meansalgorithmdis-cussedinSection8.2.2.Specifically,eachpointisassignedtotheclusterthatproducesthebestvaluefortheclusterevaluationfunction.Theclustereval-uationmeasureI2correspondstotraditionalK-meansandproducesclustersthathavegoodSSEvalues.TheothermeasuresproduceclustersthatarenotasgoodwithrespecttoSSE,butthataremoreoptimalwithrespecttothespecifiedclustervaliditymeasure.
Relationship between Prototype-Based Cohesion and Graph-Based Cohesion
Whilethegraph-basedandprototype-basedapproachestomeasuringtheco-hesionandseparationofaclusterseemdistinct,forsomeproximitymeasurestheyareequivalent.Forinstance,fortheSSEandpointsinEuclideanspace,itcanbeshown(Equation8.14)thattheaveragepairwisedistancebetweenthepointsinaclusterisequivalenttotheSSEofthecluster.SeeExercise27onpage566.
$$ \text{Cluster SSE} = \sum_{\mathbf{x} \in C_i} \text{dist}(\mathbf{c}_i, \mathbf{x})^2 = \frac{1}{2 m_i} \sum_{\mathbf{x} \in C_i} \sum_{\mathbf{y} \in C_i} \text{dist}(\mathbf{x}, \mathbf{y})^2 \qquad (8.14) $$
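Equation 8.14 is easy to check numerically; the random points below are illustrative only and are not data from the text.

```python
# Quick numerical check of Equation 8.14: the cluster SSE equals the sum of
# squared pairwise distances divided by 2 * mi.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((10, 2))                      # one cluster of mi = 10 points
c = X.mean(axis=0)                           # centroid

sse = ((X - c) ** 2).sum()
pairwise_sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)  # squared distances
print(np.isclose(sse, pairwise_sq.sum() / (2 * len(X))))          # True
```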
Two Approaches to Prototype-Based Separation
WhenproximityismeasuredbyEuclideandistance,thetraditionalmeasureofseparationbetweenclustersisthebetweengroupsumofsquares(SSB),whichisthesumofthesquareddistanceofaclustercentroid,ci,totheoverallmean,c,ofallthedatapoints.BysummingtheSSBoverallclusters,weobtainthetotalSSB,whichisgivenbyEquation8.15,whereciisthemeanoftheithclusterandcistheoverallmean.ThehigherthetotalSSBofaclustering,themoreseparatedtheclustersarefromoneanother.
$$ \text{Total SSB} = \sum_{i=1}^{K} m_i\, \text{dist}(\mathbf{c}_i, \mathbf{c})^2 \qquad (8.15) $$
ItisstraightforwardtoshowthatthetotalSSBisdirectlyrelatedtothe
pairwisedistancesbetweenthecentroids.Inparticular,iftheclustersizesare
equal,i.e.,mi=m/K,thenthisrelationshiptakesthesimpleformgivenbyEquation8.16.(SeeExercise28onpage566.)ItisthistypeofequivalencethatmotivatesthedefinitionofprototypeseparationintermsofbothEquations8.12and8.13.
$$ \text{Total SSB} = \frac{1}{2K} \sum_{i=1}^{K} \sum_{j=1}^{K} \frac{m}{K}\, \text{dist}(\mathbf{c}_i, \mathbf{c}_j)^2 \qquad (8.16) $$
Relationship between Cohesion and Separation
Insomecases,thereisalsoastrongrelationshipbetweencohesionandsepara-tion.Specifically,itispossibletoshowthatthesumofthetotalSSEandthetotalSSBisaconstant;i.e.,thatitisequaltothetotalsumofsquares(TSS),whichisthesumofsquaresofthedistanceofeachpointtotheoverallmeanofthedata.TheimportanceofthisresultisthatminimizingSSE(cohesion)isequivalenttomaximizingSSB(separation).
We provide the proof of this fact below, since the approach illustrates techniques that are also applicable to proving the relationships stated in the last two sections. To simplify the notation, we assume that the data is one-dimensional, i.e., dist(x, y) = (x − y)². Also, we use the fact that the cross-term $\sum_{i=1}^{K} \sum_{x \in C_i} (x - c_i)(c - c_i)$ is 0. (See Exercise 29 on page 566.)
$$
\begin{aligned}
\text{TSS} &= \sum_{i=1}^{K} \sum_{x \in C_i} (x - c)^2 \\
 &= \sum_{i=1}^{K} \sum_{x \in C_i} \bigl((x - c_i) - (c - c_i)\bigr)^2 \\
 &= \sum_{i=1}^{K} \sum_{x \in C_i} (x - c_i)^2 \;-\; 2 \sum_{i=1}^{K} \sum_{x \in C_i} (x - c_i)(c - c_i) \;+\; \sum_{i=1}^{K} \sum_{x \in C_i} (c - c_i)^2 \\
 &= \sum_{i=1}^{K} \sum_{x \in C_i} (x - c_i)^2 \;+\; \sum_{i=1}^{K} \sum_{x \in C_i} (c - c_i)^2 \\
 &= \sum_{i=1}^{K} \sum_{x \in C_i} (x - c_i)^2 \;+\; \sum_{i=1}^{K} |C_i| (c - c_i)^2 \\
 &= \text{SSE} + \text{SSB}
\end{aligned}
$$
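The identity can also be checked numerically; the two one-dimensional clusters below are arbitrary synthetic data, not data from the text.

```python
# Illustrative check that TSS = SSE + SSB for an arbitrary partition of 1-D data.
import numpy as np

rng = np.random.default_rng(1)
clusters = [rng.random(5), rng.random(8) + 2.0]   # two arbitrary 1-D clusters
x_all = np.concatenate(clusters)
c = x_all.mean()                                  # overall mean

sse = sum(((xs - xs.mean()) ** 2).sum() for xs in clusters)
ssb = sum(len(xs) * (xs.mean() - c) ** 2 for xs in clusters)
tss = ((x_all - c) ** 2).sum()
print(np.isclose(tss, sse + ssb))                 # True
```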
Evaluating Individual Clusters and Objects
Sofar,wehavefocusedonusingcohesionandseparationintheoveralleval-uationofagroupofclusters.Manyofthesemeasuresofclustervalidityalsocanbeusedtoevaluateindividualclustersandobjects.Forexample,wecanrankindividualclustersaccordingtotheirspecificvalueofclustervalidity,i.e.,clustercohesionorseparation.Aclusterthathasahighvalueofcohesionmaybeconsideredbetterthanaclusterthathasalowervalue.Thisinformationoftencanbeusedtoimprovethequalityofaclustering.If,forexample,aclusterisnotverycohesive,thenwemaywanttosplititintoseveralsubclus-ters.Ontheotherhand,iftwoclustersarerelativelycohesive,butnotwellseparated,wemaywanttomergethemintoasinglecluster.
We can also evaluate the objects within a cluster in terms of their contribution to the overall cohesion or separation of the cluster. Objects that contribute more to the cohesion and separation are near the "interior" of the cluster. Those objects for which the opposite is true are probably near the "edge" of the cluster. In the following section, we consider a cluster evaluation measure that uses an approach based on these ideas to evaluate points, clusters, and the entire set of clusters.

The Silhouette Coefficient
The popular method of silhouette coefficients combines both cohesion and separation. The following steps explain how to compute the silhouette coefficient for an individual point, a process that consists of the following three steps. We use distances, but an analogous approach can be used for similarities.

1. For the ith object, calculate its average distance to all other objects in its cluster. Call this value ai.
2. For the ith object and any cluster not containing the object, calculate the object's average distance to all the objects in the given cluster. Find the minimum such value with respect to all clusters; call this value bi.
3. For the ith object, the silhouette coefficient is si = (bi − ai)/max(ai, bi).

The value of the silhouette coefficient can vary between −1 and 1. A negative value is undesirable because this corresponds to a case in which ai, the average distance to points in the cluster, is greater than bi, the minimum average distance to points in another cluster. We want the silhouette coefficient to be positive (ai < bi), and for ai to be as close to 0 as possible, since the coefficient assumes its maximum value of 1 when ai = 0.

Figure 8.29. Silhouette coefficients for points in ten clusters.

We can compute the average silhouette coefficient of a cluster by simply taking the average of the silhouette coefficients of points belonging to the cluster. An overall measure of the goodness of a clustering can be obtained by computing the average silhouette coefficient of all points.

Example 8.8 (Silhouette Coefficient). Figure 8.29 shows a plot of the silhouette coefficients for points in 10 clusters. Darker shades indicate lower silhouette coefficients.

8.5.3 Unsupervised Cluster Evaluation Using the Proximity Matrix

In this section, we examine a couple of unsupervised approaches for assessing cluster validity that are based on the proximity matrix. The first compares an actual and idealized proximity matrix, while the second uses visualization.

Measuring Cluster Validity via Correlation

If we are given the similarity matrix for a data set and the cluster labels from a cluster analysis of the data set, then we can evaluate the "goodness" of the clustering by looking at the correlation between the similarity matrix and an ideal version of the similarity matrix based on the cluster labels. (With minor changes, the following applies to proximity matrices, but for simplicity, we discuss only similarity matrices.) More specifically, an ideal cluster is one whose points have a similarity of 1 to all points in the cluster, and a similarity of 0 to all points in other clusters. Thus, if we sort the rows and columns of the similarity matrix so that all objects belonging to the same class are together, then an ideal similarity matrix has a block diagonal structure. In other words, the similarity is non-zero, i.e., 1, inside the blocks of the similarity matrix whose entries represent intra-cluster similarity, and 0 elsewhere. The ideal similarity matrix is constructed by creating a matrix that has one row and one column for each data point—just like an actual similarity matrix—and assigning a 1 to an entry if the associated pair of points belongs to the same cluster. All other entries are 0.

High correlation between the ideal and actual similarity matrices indicates that the points that belong to the same cluster are close to each other, while low correlation indicates the opposite. (Since the actual and ideal similarity matrices are symmetric, the correlation is calculated only among the n(n − 1)/2 entries below or above the diagonal of the matrices.) Consequently, this is not a good measure for many density- or contiguity-based clusters, because they are not globular and may be closely intertwined with other clusters.
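A sketch of this correlation measure, assuming an m × m similarity matrix and an array of cluster labels; only the entries above the diagonal are used, as described above.

```python
# Sketch of the correlation between an actual similarity matrix and the ideal
# similarity matrix implied by a set of cluster labels.
import numpy as np

def ideal_similarity_correlation(sim, labels):
    ideal = (labels[:, None] == labels[None, :]).astype(float)  # 1 if same cluster, else 0
    iu = np.triu_indices(len(labels), k=1)                      # entries above the diagonal
    return np.corrcoef(sim[iu], ideal[iu])[0, 1]
```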
Example 8.9 (Correlation of Actual and Ideal Similarity Matrices). To illustrate this measure, we calculated the correlation between the ideal and actual similarity matrices for the K-means clusters shown in Figure 8.26(c) (random data) and Figure 8.30(a) (data with three well-separated clusters). The correlations were 0.5810 and 0.9235, respectively, which reflects the expected result that the clusters found by K-means in the random data are worse than the clusters found by K-means in data with well-separated clusters.

Judging a Clustering Visually by Its Similarity Matrix

The previous technique suggests a more general, qualitative approach to judging a set of clusters: Order the similarity matrix with respect to cluster labels and then plot it. In theory, if we have well-separated clusters, then the similarity matrix should be roughly block-diagonal. If not, then the patterns displayed in the similarity matrix can reveal the relationships between clusters. Again, all of this can be applied to dissimilarity matrices, but for simplicity, we will only discuss similarity matrices.

Example 8.10 (Visualizing a Similarity Matrix). Consider the points in Figure 8.30(a), which form three well-separated clusters. If we use K-means to group these points into three clusters, then we should have no trouble finding these clusters since they are well-separated. The separation of these clusters is illustrated by the reordered similarity matrix shown in Figure 8.30(b). (For uniformity, we have transformed the distances into similarities using the formula s = 1 − (d − min d)/(max d − min d).) Figure 8.31 shows the reordered similarity matrices for clusters found in the random data set of Figure 8.26 by DBSCAN, K-means, and complete link.

Figure 8.30. Similarity matrix for well-separated clusters: (a) well-separated clusters; (b) similarity matrix sorted by K-means cluster labels.

The well-separated clusters in Figure 8.30 show a very strong, block-diagonal pattern in the reordered similarity matrix. However, there are also weak block-diagonal patterns—see Figure 8.31—in the reordered similarity matrices of the clusterings found by K-means, DBSCAN, and complete link in the random data. Just as people can find patterns in clouds, data mining algorithms can find clusters in random data. While it is entertaining to find patterns in clouds, it is pointless and perhaps embarrassing to find clusters in noise.

Figure 8.31. Similarity matrices for clusters from random data: (a) sorted by DBSCAN cluster labels; (b) sorted by K-means cluster labels; (c) sorted by complete link cluster labels.

This approach may seem hopelessly expensive for large data sets, since the computation of the proximity matrix takes O(m²) time, where m is the number of objects, but with sampling, this method can still be used. We can take a sample of data points from each cluster, compute the similarity between these points, and plot the result. It may be necessary to oversample small clusters and undersample large ones to obtain an adequate representation of all clusters.

8.5.4 Unsupervised Evaluation of Hierarchical Clustering

The previous approaches to cluster evaluation are intended for partitional clusterings. Here we discuss the cophenetic correlation, a popular evaluation measure for hierarchical clusterings.
The cophenetic distance between two objects is the proximity at which an agglomerative hierarchical clustering technique puts the objects in the same cluster for the first time. For example, if at some point in the agglomerative hierarchical clustering process, the smallest distance between the two clusters that are merged is 0.1, then all points in one cluster have a cophenetic distance of 0.1 with respect to the points in the other cluster. In a cophenetic distance matrix, the entries are the cophenetic distances between each pair of objects. The cophenetic distance is different for each hierarchical clustering of a set of points.

Example 8.11 (Cophenetic Distance Matrix). Table 8.7 shows the cophenetic distance matrix for the single link clustering shown in Figure 8.16. (The data for this figure consists of the 6 two-dimensional points given in Table 8.3.)

Table 8.7. Cophenetic distance matrix for single link and data in Table 8.3.

Point | P1 | P2 | P3 | P4 | P5 | P6
P1 | 0 | 0.222 | 0.222 | 0.222 | 0.222 | 0.222
P2 | 0.222 | 0 | 0.148 | 0.151 | 0.139 | 0.148
P3 | 0.222 | 0.148 | 0 | 0.151 | 0.148 | 0.110
P4 | 0.222 | 0.151 | 0.151 | 0 | 0.151 | 0.151
P5 | 0.222 | 0.139 | 0.148 | 0.151 | 0 | 0.148
P6 | 0.222 | 0.148 | 0.110 | 0.151 | 0.148 | 0

The CoPhenetic Correlation Coefficient (CPCC) is the correlation between the entries of this matrix and the original dissimilarity matrix and is a standard measure of how well a hierarchical clustering (of a particular type) fits the data. One of the most common uses of this measure is to evaluate which type of hierarchical clustering is best for a particular type of data.

Example 8.12 (Cophenetic Correlation Coefficient). We calculated the CPCC for the hierarchical clusterings shown in Figures 8.16–8.19. These values are shown in Table 8.8. The hierarchical clustering produced by the single link technique seems to fit the data less well than the clusterings produced by complete link, group average, and Ward's method.

Table 8.8. Cophenetic correlation coefficient for data of Table 8.3 and four agglomerative hierarchical clustering techniques.

Technique | CPCC
Single Link | 0.44
Complete Link | 0.63
Group Average | 0.66
Ward's | 0.

8.5.5 Determining the Correct Number of Clusters

Various unsupervised cluster evaluation measures can be used to approximately determine the correct or natural number of clusters.

Example 8.13 (Number of Clusters). The data set of Figure 8.29 has 10 natural clusters. Figure 8.32 shows a plot of the SSE versus the number of clusters for a (bisecting) K-means clustering of the data set, while Figure 8.33 shows the average silhouette coefficient versus the number of clusters for the same data. There is a distinct knee in the SSE and a distinct peak in the silhouette coefficient when the number of clusters is equal to 10.

Thus, we can try to find the natural number of clusters in a data set by looking for the number of clusters at which there is a knee, peak, or dip in the plot of the evaluation measure when it is plotted against the number of clusters. Of course, such an approach does not always work well. Clusters may be considerably more intertwined or overlapping than those shown in Figure 8.29. Also, the data may consist of nested clusters. Actually, the clusters in Figure 8.29 are somewhat nested; i.e., there are 5 pairs of clusters since the clusters are closer top to bottom than they are left to right. There is a knee that indicates this in the SSE curve, but the silhouette coefficient curve is not as clear. In summary, while caution is needed, the technique we have just described can provide insight into the number of clusters in the data.

Figure 8.32. SSE versus number of clusters for the data of Figure 8.29.
Figure 8.33. Average silhouette coefficient versus number of clusters for the data of Figure 8.29.
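For reference, a sketch of how the CPCC can be computed with SciPy's hierarchical clustering utilities; the random points here merely stand in for the six points of Table 8.3, which are not reproduced in this section.

```python
# Sketch of the CPCC for several agglomerative linkage methods.
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

X = np.random.default_rng(2).random((6, 2))   # stand-in for the points of Table 8.3
d = pdist(X)                                   # condensed pairwise distance matrix
for method in ("single", "complete", "average", "ward"):
    cpcc, _ = cophenet(linkage(d, method=method), d)
    print(method, round(cpcc, 2))
```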
8.5.6 Clustering Tendency

One obvious way to determine if a data set has clusters is to try to cluster it. However, almost all clustering algorithms will dutifully find clusters when given data. To address this issue, we could evaluate the resulting clusters and only claim that a data set has clusters if at least some of the clusters are of good quality. However, this approach does not address the fact that the clusters in the data can be of a different type than those sought by our clustering algorithm. To handle this additional problem, we could use multiple algorithms and again evaluate the quality of the resulting clusters. If the clusters are uniformly poor, then this may indeed indicate that there are no clusters in the data.

Alternatively, and this is the focus of measures of clustering tendency, we can try to evaluate whether a data set has clusters without clustering. The most common approach, especially for data in Euclidean space, has been to use statistical tests for spatial randomness. Unfortunately, choosing the correct model, estimating the parameters, and evaluating the statistical significance of the hypothesis that the data is non-random can be quite challenging. Nonetheless, many approaches have been developed, most of them for points in low-dimensional Euclidean space.

Example 8.14 (Hopkins Statistic). For this approach, we generate p points that are randomly distributed across the data space and also sample p actual data points. For both sets of points we find the distance to the nearest neighbor in the original data set. Let the ui be the nearest neighbor distances of the artificially generated points, while the wi are the nearest neighbor distances of the sample of points from the original data set. The Hopkins statistic H is then defined by Equation 8.17.

$$ H = \frac{\sum_{i=1}^{p} w_i}{\sum_{i=1}^{p} u_i + \sum_{i=1}^{p} w_i} \qquad (8.17) $$

If the randomly generated points and the sample of data points have roughly the same nearest neighbor distances, then H will be near 0.5. Values of H near 0 and 1 indicate, respectively, data that is highly clustered and data that is regularly distributed in the data space. To give an example, the Hopkins statistic for the data of Figure 8.26 was computed for p = 20 and 100 different trials. The average value of H was 0.56 with a standard deviation of 0.03. The same experiment was performed for the well-separated points of Figure 8.30. The average value of H was 0.95 with a standard deviation of 0.006.
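A sketch of the Hopkins statistic following Equation 8.17 as printed above (with the wi in the numerator); the data is assumed to be an m × d NumPy array, the uniform points are generated over the bounding box of the data, and p = 20 follows the example.

```python
# Sketch of the Hopkins statistic of Equation 8.17.
import numpy as np
from scipy.spatial import cKDTree

def hopkins(X, p=20, rng=np.random.default_rng(0)):
    m, d = X.shape
    tree = cKDTree(X)
    uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(p, d))
    u = tree.query(uniform, k=1)[0]                  # nearest data point to each random point
    sample_idx = rng.choice(m, size=p, replace=False)
    w = tree.query(X[sample_idx], k=2)[0][:, 1]      # nearest *other* data point to each sampled point
    return w.sum() / (u.sum() + w.sum())
```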
8.5.7 Supervised Measures of Cluster Validity

When we have external information about data, it is typically in the form of externally derived class labels for the data objects. In such cases, the usual procedure is to measure the degree of correspondence between the cluster labels and the class labels. But why is this of interest? After all, if we have the class labels, then what is the point in performing a cluster analysis? Motivations for such an analysis are the comparison of clustering techniques with the "ground truth" or the evaluation of the extent to which a manual classification process can be automatically produced by cluster analysis.

We consider two different kinds of approaches. The first set of techniques use measures from classification, such as entropy, purity, and the F-measure. These measures evaluate the extent to which a cluster contains objects of a single class. The second group of methods is related to the similarity measures for binary data, such as the Jaccard measure that we saw in Chapter 2. These approaches measure the extent to which two objects that are in the same class are in the same cluster and vice versa. For convenience, we will refer to these two types of measures as classification-oriented and similarity-oriented, respectively.

Classification-Oriented Measures of Cluster Validity

There are a number of measures—entropy, purity, precision, recall, and the F-measure—that are commonly used to evaluate the performance of a classification model. In the case of classification, we measure the degree to which predicted class labels correspond to actual class labels, but for the measures just mentioned, nothing fundamental is changed by using cluster labels instead of predicted class labels. Next, we quickly review the definitions of these measures, which were discussed in Chapter 4.

Entropy: The degree to which each cluster consists of objects of a single class. For each cluster, the class distribution of the data is calculated first, i.e., for cluster i we compute p_ij, the probability that a member of cluster i belongs to class j, as p_ij = m_ij/m_i, where m_i is the number of objects in cluster i and m_ij is the number of objects of class j in cluster i. Using this class distribution, the entropy of each cluster i is calculated using the standard formula, $e_i = -\sum_{j=1}^{L} p_{ij} \log_2 p_{ij}$, where L is the number of classes. The total entropy for a set of clusters is calculated as the sum of the entropies of each cluster weighted by the size of each cluster, i.e., $e = \sum_{i=1}^{K} \frac{m_i}{m} e_i$, where K is the number of clusters and m is the total number of data points.

Purity: Another measure of the extent to which a cluster contains objects of a single class. Using the previous terminology, the purity of cluster i is $p_i = \max_j p_{ij}$, and the overall purity of a clustering is $\text{purity} = \sum_{i=1}^{K} \frac{m_i}{m} p_i$.

Precision: The fraction of a cluster that consists of objects of a specified class. The precision of cluster i with respect to class j is precision(i, j) = p_ij.

Recall: The extent to which a cluster contains all objects of a specified class. The recall of cluster i with respect to class j is recall(i, j) = m_ij/m_j, where m_j is the number of objects in class j.

F-measure: A combination of both precision and recall that measures the extent to which a cluster contains only objects of a particular class and all objects of that class. The F-measure of cluster i with respect to class j is F(i, j) = (2 × precision(i, j) × recall(i, j))/(precision(i, j) + recall(i, j)).

Example 8.15 (Supervised Evaluation Measures). We present an example to illustrate these measures. Specifically, we use K-means with the cosine similarity measure to cluster 3204 newspaper articles from the Los Angeles Times. These articles come from six different classes: Entertainment, Financial, Foreign, Metro, National, and Sports. Table 8.9 shows the results of a K-means clustering to find six clusters. The first column indicates the cluster, while the next six columns together form the confusion matrix; i.e., these columns indicate how the documents of each category are distributed among the clusters. The last two columns are the entropy and purity of each cluster, respectively.

Table 8.9. K-means clustering results for the LA Times document data set.

Cluster | Entertainment | Financial | Foreign | Metro | National | Sports | Entropy | Purity
1 | 3 | 5 | 40 | 506 | 96 | 27 | 1.2270 | 0.7474
2 | 4 | 7 | 280 | 29 | 39 | 2 | 1.1472 | 0.7756
3 | 1 | 1 | 1 | 7 | 4 | 671 | 0.1813 | 0.9796
4 | 10 | 162 | 3 | 119 | 73 | 2 | 1.7487 | 0.4390
5 | 331 | 22 | 5 | 70 | 13 | 23 | 1.3976 | 0.7134
6 | 5 | 358 | 12 | 212 | 48 | 13 | 1.5523 | 0.5525
Total | 354 | 555 | 341 | 943 | 273 | 738 | 1.1450 | 0.7203

Ideally, each cluster will contain documents from only one class. In reality, each cluster contains documents from many classes. Nevertheless, many clusters contain documents primarily from just one class. In particular, cluster 3, which contains mostly documents from the Sports section, is exceptionally good, both in terms of purity and entropy. The purity and entropy of the other clusters is not as good, but can typically be greatly improved if the data is partitioned into a larger number of clusters.
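A sketch of the entropy and purity computations, assuming a confusion matrix whose rows are clusters and columns are classes, as in Table 8.9; the single-row example reproduces the values 0.1813 and 0.9796 reported for cluster 3 (up to rounding).

```python
import numpy as np

def entropy_and_purity(confusion):
    """Weighted total entropy and purity from a clusters-by-classes count matrix."""
    confusion = np.asarray(confusion, dtype=float)
    sizes = confusion.sum(axis=1)                       # m_i
    p = confusion / sizes[:, None]                      # p_ij = m_ij / m_i
    e_i = np.array([-(row[row > 0] * np.log2(row[row > 0])).sum() for row in p])
    purity_i = p.max(axis=1)
    weights = sizes / sizes.sum()                       # m_i / m
    return (weights * e_i).sum(), (weights * purity_i).sum()

print(entropy_and_purity([[1, 1, 1, 7, 4, 671]]))       # (~0.1813, ~0.9796), cluster 3 of Table 8.9
```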
Precision, recall, and the F-measure can be calculated for each cluster. To give a concrete example, we consider cluster 1 and the Metro class of Table 8.9. The precision is 506/677 = 0.75, recall is 506/943 = 0.26, and hence, the F value is 0.39. In contrast, the F value for cluster 3 and Sports is 0.94.

Similarity-Oriented Measures of Cluster Validity

The measures that we discuss in this section are all based on the premise that any two objects that are in the same cluster should be in the same class and vice versa. We can view this approach to cluster validity as involving the comparison of two matrices: (1) the ideal cluster similarity matrix discussed previously, which has a 1 in the ijth entry if two objects, i and j, are in the same cluster and 0, otherwise, and (2) an ideal class similarity matrix defined with respect to class labels, which has a 1 in the ijth entry if two objects, i and j, belong to the same class, and a 0 otherwise. As before, we can take the correlation of these two matrices as the measure of cluster validity. This measure is known as the Γ statistic in the clustering validation literature.

Example 8.16 (Correlation between Cluster and Class Matrices). To demonstrate this idea more concretely, we give an example involving five data points, p1, p2, p3, p4, p5, two clusters, C1 = {p1, p2, p3} and C2 = {p4, p5}, and two classes, L1 = {p1, p2} and L2 = {p3, p4, p5}. The ideal cluster and class similarity matrices are given in Tables 8.10 and 8.11. The correlation between the entries of these two matrices is 0.359.

Table 8.10. Ideal cluster similarity matrix.

Point | p1 | p2 | p3 | p4 | p5
p1 | 1 | 1 | 1 | 0 | 0
p2 | 1 | 1 | 1 | 0 | 0
p3 | 1 | 1 | 1 | 0 | 0
p4 | 0 | 0 | 0 | 1 | 1
p5 | 0 | 0 | 0 | 1 | 1

Table 8.11. Ideal class similarity matrix.

Point | p1 | p2 | p3 | p4 | p5
p1 | 1 | 1 | 0 | 0 | 0
p2 | 1 | 1 | 0 | 0 | 0
p3 | 0 | 0 | 1 | 1 | 1
p4 | 0 | 0 | 1 | 1 | 1
p5 | 0 | 0 | 1 | 1 | 1

More generally, we can use any of the measures for binary similarity that we saw in Section 2.4.5. (For example, we can convert these two matrices into binary vectors by appending the rows.) We repeat the definitions of the four quantities used to define those similarity measures, but modify our descriptive text to fit the current context. Specifically, we need to compute the following four quantities for all pairs of distinct objects. (There are m(m − 1)/2 such pairs, if m is the number of objects.)

f00 = number of pairs of objects having a different class and a different cluster
f01 = number of pairs of objects having a different class and the same cluster
f10 = number of pairs of objects having the same class and a different cluster
f11 = number of pairs of objects having the same class and the same cluster

In particular, the simple matching coefficient, which is known as the Rand statistic in this context, and the Jaccard coefficient are two of the most frequently used cluster validity measures.

$$ \text{Rand statistic} = \frac{f_{00} + f_{11}}{f_{00} + f_{01} + f_{10} + f_{11}} \qquad (8.18) $$

$$ \text{Jaccard coefficient} = \frac{f_{11}}{f_{01} + f_{10} + f_{11}} \qquad (8.19) $$

Example 8.17 (Rand and Jaccard Measures). Based on these formulas, we can readily compute the Rand statistic and Jaccard coefficient for the example based on Tables 8.10 and 8.11. Noting that f00 = 4, f01 = 2, f10 = 2, and f11 = 2, the Rand statistic = (2 + 4)/10 = 0.6 and the Jaccard coefficient = 2/(2 + 2 + 2) = 0.33.
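A sketch of the pair-counting quantities and the Rand and Jaccard measures of Equations 8.18 and 8.19, using the cluster and class labels of Example 8.16 encoded as integer lists.

```python
from itertools import combinations

def pair_counts(cluster_labels, class_labels):
    """f00, f01, f10, f11 over all pairs of distinct objects."""
    f00 = f01 = f10 = f11 = 0
    for i, j in combinations(range(len(cluster_labels)), 2):
        same_cluster = cluster_labels[i] == cluster_labels[j]
        same_class = class_labels[i] == class_labels[j]
        f11 += same_cluster and same_class
        f10 += (not same_cluster) and same_class
        f01 += same_cluster and not same_class
        f00 += not (same_cluster or same_class)
    return f00, f01, f10, f11

# Example 8.16: clusters {p1,p2,p3},{p4,p5}; classes {p1,p2},{p3,p4,p5}.
f00, f01, f10, f11 = pair_counts([1, 1, 1, 2, 2], [1, 1, 2, 2, 2])
print((f00 + f11) / (f00 + f01 + f10 + f11),   # Rand statistic = 0.6
      f11 / (f01 + f10 + f11))                 # Jaccard coefficient ~ 0.33
```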
We also note that the four quantities, f00, f01, f10, and f11, define a contingency table as shown in Table 8.12.

Table 8.12. Two-way contingency table for determining whether pairs of objects are in the same class and same cluster.

                | Same Cluster | Different Cluster
Same Class      | f11          | f10
Different Class | f01          | f00

Previously, in the context of association analysis—see Section 6.7.1—we presented an extensive discussion of measures of association that can be used for this type of contingency table. (Compare Table 8.12 with Table 6.7.) Those measures can also be applied to cluster validity.

Cluster Validity for Hierarchical Clusterings

So far in this section, we have discussed supervised measures of cluster validity only for partitional clusterings. Supervised evaluation of a hierarchical clustering is more difficult for a variety of reasons, including the fact that a preexisting hierarchical structure often does not exist. Here, we will give an example of an approach for evaluating a hierarchical clustering in terms of a (flat) set of class labels, which are more likely to be available than a preexisting hierarchical structure.

The key idea of this approach is to evaluate whether a hierarchical clustering contains, for each class, at least one cluster that is relatively pure and includes most of the objects of that class. To evaluate a hierarchical clustering with respect to this goal, we compute, for each class, the F-measure for each cluster in the cluster hierarchy. For each class, we take the maximum F-measure attained for any cluster. Finally, we calculate an overall F-measure for the hierarchical clustering by computing the weighted average of all per-class F-measures, where the weights are based on the class sizes. More formally, this hierarchical F-measure is defined as follows:

$$ F = \sum_{j} \frac{m_j}{m} \max_{i} F(i, j) $$

where the maximum is taken over all clusters i at all levels, m_j is the number of objects in class j, and m is the total number of objects.

8.5.8 Assessing the Significance of Cluster Validity Measures

Cluster validity measures are intended to help us measure the goodness of the clusters that we have obtained. Indeed, they typically give us a single number as a measure of that goodness. However, we are then faced with the problem of interpreting the significance of this number, a task that may be even more difficult.

The minimum and maximum values of cluster evaluation measures may provide some guidance in many cases. For instance, by definition, a purity of 0 is bad, while a purity of 1 is good, at least if we trust our class labels and want our cluster structure to reflect the class structure. Likewise, an entropy of 0 is good, as is an SSE of 0.

Sometimes, however, there may not be a minimum or maximum value, or the scale of the data may affect the interpretation. Also, even if there are minimum and maximum values with obvious interpretations, intermediate values still need to be interpreted. In some cases, we can use an absolute standard. If, for example, we are clustering for utility, we may be willing to tolerate only a certain level of error in the approximation of our points by a cluster centroid.

But if this is not the case, then we must do something else. A common approach is to interpret the value of our validity measure in statistical terms. Specifically, we attempt to judge how likely it is that our observed value may be achieved by random chance. The value is good if it is unusual; i.e., if it is unlikely to be the result of random chance. The motivation for this approach is that we are only interested in clusters that reflect non-random structure in the data, and such structures should generate unusually high (low) values of our cluster validity measure, at least if the validity measures are designed to reflect the presence of strong cluster structure.

Example 8.18 (Significance of SSE). To show how this works, we present an example based on K-means and the SSE. Suppose that we want a measure of how good the well-separated clusters of Figure 8.30 are with respect to random data. We generate many random sets of 100 points having the same range as

Figure 8.34. Histogram of SSE for 500 random data sets.
thepointsinthethreeclusters,findthreeclustersineachdatasetusingK-means,andaccumulatethedistributionofSSEvaluesfortheseclusterings.ByusingthisdistributionoftheSSEvalues,wecanthenestimatetheprobabilityoftheSSEvaluefortheoriginalclusters.Figure8.34showsthehistogramoftheSSEfrom500randomruns.ThelowestSSEshowninFigure8.34is0.0173.ForthethreeclustersofFigure8.30,theSSEis0.0050.Wecouldthereforeconservativelyclaimthatthereislessthana1%chancethataclusteringsuchasthatofFigure8.30couldoccurbychance. Toconclude,westressthatthereismoretoclusterevaluation—supervisedorunsupervised—thanobtaininganumericalmeasureofclustervalidity.Un-lessthisvaluehasanaturalinterpretationbasedonthedefinitionofthemea-sure,weneedtointerpretthisvalueinsomeway.Ifourclusterevaluationmeasureisdefinedsuchthatlowervaluesindicatestrongerclusters,thenwecanusestatisticstoevaluatewhetherthevaluewehaveobtainedisunusuallylow,providedwehaveadistributionfortheevaluationmeasure.Wehavepre-sentedanexampleofhowtofindsuchadistribution,butthereisconsiderablymoretothistopic,andwereferthereadertothebibliographicnotesformorepointers. Finally,evenwhenanevaluationmeasureisusedasarelativemeasure,i.e.,tocomparetwoclusterings,westillneedtoassessthesignificanceinthedifferencebetweentheevaluationmeasuresofthetwoclusterings.Althoughonevaluewillalmostalwaysbebetterthananother,itcanbedifficulttodetermineifthedifferenceissignificant.Notethattherearetwoaspectstothissignificance:whetherthedifferenceisstatisticallysignificant(repeatable) 8.6BibliographicNotes555 andwhetherthemagnitudeofthedifferenceismeaningfulwithrespecttotheapplication.Manywouldnotregardadifferenceof0.1%assignificant,evenifitisconsistentlyreproducible. 8.6BibliographicNotes DiscussioninthischapterhasbeenmostheavilyinfluencedbythebooksonclusteranalysiswrittenbyJainandDubes[396],Anderberg[374],andKauf-manandRousseeuw[400].AdditionalclusteringbooksthatmayalsobeofinterestincludethosebyAldenderferandBlashfield[373],Everittetal.[388],Hartigan[394],Mirkin[405],Murtagh[407],Romesburg[409],andSp¨ath[413].AmorestatisticallyorientedapproachtoclusteringisgivenbythepatternrecognitionbookofDudaetal.[385],themachinelearningbookofMitchell[406],andthebookonstatisticallearningbyHastieetal.[395].AgeneralsurveyofclusteringisgivenbyJainetal.[397],whileasurveyofspatialdataminingtechniquesisprovidedbyHanetal.[393].Behrkin[379]providesasurveyofclusteringtechniquesfordatamining.AgoodsourceofreferencestoclusteringoutsideofthedataminingfieldisthearticlebyArabieandHu-bert[376].ApaperbyKleinberg[401]providesadiscussionofsomeofthetrade-offsthatclusteringalgorithmsmakeandprovesthatitisimpossibletoforaclusteringalgorithmtosimultaneouslypossessthreesimpleproperties. 
TheK-meansalgorithmhasalonghistory,butisstillthesubjectofcurrentresearch.TheoriginalK-meansalgorithmwasproposedbyMacQueen[403].TheISODATAalgorithmbyBallandHall[377]wasanearly,butsophisticatedversionofK-meansthatemployedvariouspre-andpostprocessingtechniquestoimproveonthebasicalgorithm.TheK-meansalgorithmandmanyofitsvariationsaredescribedindetailinthebooksbyAnderberg[374]andJainandDubes[396].ThebisectingK-meansalgorithmdiscussedinthischapterwasdescribedinapaperbySteinbachetal.[414],andanimplementationofthisandotherclusteringapproachesisfreelyavailableforacademicuseintheCLUTO(CLUsteringTOolkit)packagecreatedbyKarypis[382].Boley[380]hascreatedadivisivepartitioningclusteringalgorithm(PDDP)basedonfindingthefirstprincipaldirection(component)ofthedata,andSavaresiandBoley[411]haveexploreditsrelationshiptobisectingK-means.RecentvariationsofK-meansareanewincrementalversionofK-means(Dhillonetal.[383]),X-means(PellegandMoore[408]),andK-harmonicmeans(Zhangetal[416]).HamerlyandElkan[392]discusssomeclusteringalgorithmsthatpro-ducebetterresultsthanK-means.WhilesomeofthepreviouslymentionedapproachesaddresstheinitializationproblemofK-meansinsomemanner, 556Chapter8ClusterAnalysis:BasicConceptsandAlgorithms otherapproachestoimprovingK-meansinitializationcanalsobefoundintheworkofBradleyandFayyad[381].DhillonandModha[384]presentagen-eralizationofK-means,calledsphericalK-means,thatworkswithcommonlyusedsimilarityfunctions.AgeneralframeworkforK-meansclusteringthatusesdissimilarityfunctionsbasedonBregmandivergenceswasconstructedbyBanerjeeetal.[378]. Hierarchicalclusteringtechniquesalsohavealonghistory.MuchoftheinitialactivitywasintheareaoftaxonomyandiscoveredinbooksbyJardineandSibson[398]andSneathandSokal[412].General-purposediscussionsofhierarchicalclusteringarealsoavailableinmostoftheclusteringbooksmen-tionedabove.Agglomerativehierarchicalclusteringisthefocusofmostworkintheareaofhierarchicalclustering,butdivisiveapproacheshavealsoreceivedsomeattention.Forexample,Zahn[415]describesadivisivehierarchicaltech-niquethatusestheminimumspanningtreeofagraph.Whilebothdivisiveandagglomerativeapproachestypicallytaketheviewthatmerging(splitting)decisionsarefinal,therehasbeensomeworkbyFisher[3]andKarypisetal.[399]toovercometheselimitations. Esteretal.proposedDBSCAN[387],whichwaslatergeneralizedtotheGDBSCANalgorithmbySanderetal.[410]inordertohandlemoregeneraltypesofdataanddistancemeasures,suchaspolygonswhoseclosenessismea-suredbythedegreeofintersection.AnincrementalversionofDBSCANwasdevelopedbyKriegeletal.[386].OneinterestingoutgrowthofDBSCANisOPTICS(OrderingPointsToIdentifytheClusteringStructure)(Ankerstetal.[375]),whichallowsthevisualizationofclusterstructureandcanalsobeusedforhierarchicalclustering. Anauthoritativediscussionofclustervalidity,whichstronglyinfluencedthediscussioninthischapter,isprovidedinChapter4ofJainandDubes’clusteringbook[396].MorerecentreviewsofclustervalidityarethoseofHalkidietal.[390,391]andMilligan[404].SilhouettecoefficientsaredescribedinKaufmanandRousseeuw’sclusteringbook[400].ThesourceofthecohesionandseparationmeasuresinTable8.6isapaperbyZhaoandKarypis[417],whichalsocontainsadiscussionofentropy,purity,andthehierarchicalF-measure.TheoriginalsourceofthehierarchicalF-measureisanarticlebyLarsenandAone[402]. Bibliography [373]M.S.AldenderferandR.K.Blashfield.ClusterAnalysis.SagePublications,Los Angeles,1985. [374]M.R.Anderberg.ClusterAnalysisforApplications.AcademicPress,NewYork, December1973. 
[375] M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering Points To Identify the Clustering Structure. In Proc. of 1999 ACM-SIGMOD Intl. Conf. on Management of Data, pages 49–60, Philadelphia, Pennsylvania, June 1999. ACM Press.

[376] P. Arabie, L. Hubert, and G. D. Soete. An overview of combinatorial data analysis. In P. Arabie, L. Hubert, and G. D. Soete, editors, Clustering and Classification, pages 188–217. World Scientific, Singapore, January 1996.

[377] G. Ball and D. Hall. A Clustering Technique for Summarizing Multivariate Data. Behavior Science, 12:153–155, March 1967.

[378] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with Bregman Divergences. In Proc. of the 2004 SIAM Intl. Conf. on Data Mining, pages 234–245, Lake Buena Vista, FL, April 2004.

[379] P. Berkhin. Survey of Clustering Data Mining Techniques. Technical report, Accrue Software, San Jose, CA, 2002.

[380] D. Boley. Principal Direction Divisive Partitioning. Data Mining and Knowledge Discovery, 2(4):325–344, 1998.

[381] P. S. Bradley and U. M. Fayyad. Refining Initial Points for K-Means Clustering. In Proc. of the 15th Intl. Conf. on Machine Learning, pages 91–99, Madison, WI, July 1998. Morgan Kaufmann Publishers Inc.

[382] CLUTO 2.1.1: Software for Clustering High-Dimensional Datasets. www.cs.umn.edu/~karypis, November 2003.

[383] I. S. Dhillon, Y. Guan, and J. Kogan. Iterative Clustering of High Dimensional Text Data Augmented by Local Search. In Proc. of the 2002 IEEE Intl. Conf. on Data Mining, pages 131–138. IEEE Computer Society, 2002.

[384] I. S. Dhillon and D. S. Modha. Concept Decompositions for Large Sparse Text Data Using Clustering. Machine Learning, 42(1/2):143–175, 2001.

[385] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley & Sons, Inc., New York, second edition, 2001.

[386] M. Ester, H.-P. Kriegel, J. Sander, M. Wimmer, and X. Xu. Incremental Clustering for Mining in a Data Warehousing Environment. In Proc. of the 24th VLDB Conf., pages 323–333, New York City, August 1998. Morgan Kaufmann.

[387] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proc. of the 2nd Intl. Conf. on Knowledge Discovery and Data Mining, pages 226–231, Portland, Oregon, August 1996. AAAI Press.

[388] B. S. Everitt, S. Landau, and M. Leese. Cluster Analysis. Arnold Publishers, London, fourth edition, May 2001.

[389] D. Fisher. Iterative Optimization and Simplification of Hierarchical Clusterings. Journal of Artificial Intelligence Research, 4:147–179, 1996.

[390] M. Halkidi, Y. Batistakis, and M. Vazirgiannis. Cluster validity methods: part I. SIGMOD Record (ACM Special Interest Group on Management of Data), 31(2):40–45, June 2002.

[391] M. Halkidi, Y. Batistakis, and M. Vazirgiannis. Clustering validity checking methods: part II. SIGMOD Record (ACM Special Interest Group on Management of Data), 31(3):19–27, Sept. 2002.

[392] G. Hamerly and C. Elkan. Alternatives to the k-means algorithm that find better clusterings. In Proc. of the 11th Intl. Conf. on Information and Knowledge Management, pages 600–607, McLean, Virginia, 2002. ACM Press.

[393] J. Han, M. Kamber, and A. Tung. Spatial Clustering Methods in Data Mining: A Review. In H. J. Miller and J. Han, editors, Geographic Data Mining and Knowledge Discovery, pages 188–217. Taylor and Francis, London, December 2001.

[394] J. Hartigan. Clustering Algorithms. Wiley, New York, 1975.

[395] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, Prediction. Springer, New York, 2001.

[396] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall Advanced Reference Series. Prentice Hall, March 1988. Book available online at http://www.cse.msu.edu/~jain/ClusteringJainDubes.pdf.

[397] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: A review. ACM Computing Surveys, 31(3):264–323, September 1999.
[398] N. Jardine and R. Sibson. Mathematical Taxonomy. Wiley, New York, 1971.

[399] G. Karypis, E.-H. Han, and V. Kumar. Multilevel Refinement for Hierarchical Clustering. Technical Report TR 99-020, University of Minnesota, Minneapolis, MN, 1999.

[400] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley Series in Probability and Statistics. John Wiley and Sons, New York, November 1990.

[401] J. M. Kleinberg. An Impossibility Theorem for Clustering. In Proc. of the 16th Annual Conf. on Neural Information Processing Systems, December 9–14, 2002.

[402] B. Larsen and C. Aone. Fast and Effective Text Mining Using Linear-Time Document Clustering. In Proc. of the 5th Intl. Conf. on Knowledge Discovery and Data Mining, pages 16–22, San Diego, California, 1999. ACM Press.

[403] J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proc. of the 5th Berkeley Symp. on Mathematical Statistics and Probability, pages 281–297. University of California Press, 1967.

[404] G. W. Milligan. Clustering Validation: Results and Implications for Applied Analyses. In P. Arabie, L. Hubert, and G. D. Soete, editors, Clustering and Classification, pages 345–375. World Scientific, Singapore, January 1996.

[405] B. Mirkin. Mathematical Classification and Clustering, volume 11 of Nonconvex Optimization and Its Applications. Kluwer Academic Publishers, August 1996.

[406] T. Mitchell. Machine Learning. McGraw-Hill, Boston, MA, 1997.

[407] F. Murtagh. Multidimensional Clustering Algorithms. Physica-Verlag, Heidelberg and Vienna, 1985.

[408] D. Pelleg and A. W. Moore. X-means: Extending K-means with Efficient Estimation of the Number of Clusters. In Proc. of the 17th Intl. Conf. on Machine Learning, pages 727–734. Morgan Kaufmann, San Francisco, CA, 2000.

[409] C. Romesburg. Cluster Analysis for Researchers. Life Time Learning, Belmont, CA, 1984.

[410] J. Sander, M. Ester, H.-P. Kriegel, and X. Xu. Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and its Applications. Data Mining and Knowledge Discovery, 2(2):169–194, 1998.

[411] S. M. Savaresi and D. Boley. A comparative analysis on the bisecting K-means and the PDDP clustering algorithms. Intelligent Data Analysis, 8(4):345–362, 2004.

[412] P. H. A. Sneath and R. R. Sokal. Numerical Taxonomy. Freeman, San Francisco, 1971.

[413] H. Späth. Cluster Analysis Algorithms for Data Reduction and Classification of Objects, volume 4 of Computers and Their Application. Ellis Horwood Publishers, Chichester, 1980. ISBN 0-85312-141-9.

[414] M. Steinbach, G. Karypis, and V. Kumar. A Comparison of Document Clustering Techniques. In Proc. of KDD Workshop on Text Mining, Proc. of the 6th Intl. Conf. on Knowledge Discovery and Data Mining, Boston, MA, August 2000.

[415] C. T. Zahn. Graph-Theoretical Methods for Detecting and Describing Gestalt Clusters. IEEE Transactions on Computers, C-20(1):68–86, Jan. 1971.

[416] B. Zhang, M. Hsu, and U. Dayal. K-Harmonic Means - A Data Clustering Algorithm. Technical Report HPL-1999-124, Hewlett Packard Laboratories, October 29, 1999.

[417] Y. Zhao and G. Karypis. Empirical and theoretical comparisons of selected criterion functions for document clustering. Machine Learning, 55(3):311–331, 2004.

8.7 Exercises

1. Consider a data set consisting of 2^20 data vectors, where each vector has 32 components and each component is a 4-byte value. Suppose that vector quantization is used for compression and that 2^16 prototype vectors are used. How many bytes of storage does that data set take before and after compression and what is the compression ratio?

2. Find all well-separated clusters in the set of points shown in Figure 8.35.

Figure 8.35. Points for Exercise 2.
3. Many partitional clustering algorithms that automatically determine the number of clusters claim that this is an advantage. List two situations in which this is not the case.

4. Given K equally sized clusters, the probability that a randomly chosen initial centroid will come from any given cluster is 1/K, but the probability that each cluster will have exactly one initial centroid is much lower. (It should be clear that having one initial centroid in each cluster is a good starting situation for K-means.) In general, if there are K clusters and each cluster has n points, then the probability, p, of selecting in a sample of size K one initial centroid from each cluster is given by Equation 8.20. (This assumes sampling with replacement.) From this formula we can calculate, for example, that the chance of having one initial centroid from each of four clusters is 4!/4^4 = 0.0938.

p = (number of ways to select one centroid from each cluster) / (number of ways to select K centroids) = K! n^K / (Kn)^K = K! / K^K     (8.20)

(a) Plot the probability of obtaining one point from each cluster in a sample of size K for values of K between 2 and 100.

(b) For K clusters, K = 10, 100, and 1000, find the probability that a sample of size 2K contains at least one point from each cluster. You can use either mathematical methods or statistical simulation to determine the answer.

5. Identify the clusters in Figure 8.36 using the center-, contiguity-, and density-based definitions. Also indicate the number of clusters for each case and give a brief indication of your reasoning. Note that darkness or the number of dots indicates density. If it helps, assume center-based means K-means, contiguity-based means single link, and density-based means DBSCAN.

Figure 8.36. Clusters for Exercise 5 (panels (a)–(d)).

6. For the following sets of two-dimensional points, (1) provide a sketch of how they would be split into clusters by K-means for the given number of clusters and (2) indicate approximately where the resulting centroids would be. Assume that we are using the squared error objective function. If you think that there is more than one possible solution, then please indicate whether each solution is a global or local minimum. Note that the label of each diagram in Figure 8.37 matches the corresponding part of this question, e.g., Figure 8.37(a) goes with part (a).

(a) K = 2. Assuming that the points are uniformly distributed in the circle, how many possible ways are there (in theory) to partition the points into two clusters? What can you say about the positions of the two centroids? (Again, you don't need to provide exact centroid locations, just a qualitative description.)

(b) K = 3. The distance between the edges of the circles is slightly greater than the radii of the circles.

(c) K = 3. The distance between the edges of the circles is much less than the radii of the circles.

(d) K = 2.

(e) K = 3. Hint: Use the symmetry of the situation and remember that we are looking for a rough sketch of what the result would be.

Figure 8.37. Diagrams for Exercise 6 (panels (a)–(e)).

7. Suppose that for a data set

• there are m points and K clusters,
• half the points and clusters are in "more dense" regions,
• half the points and clusters are in "less dense" regions, and
• the two regions are well-separated from each other.

For the given data set, which of the following should occur in order to minimize the squared error when finding K clusters:

(a) Centroids should be equally distributed between more dense and less dense regions.

(b) More centroids should be allocated to the less dense region.

(c) More centroids should be allocated to the denser region.

Note: Do not get distracted by special cases or bring in factors other than density. However, if you feel the true answer is different from any given above, justify your response.
8. Consider the mean of a cluster of objects from a binary transaction data set. What are the minimum and maximum values of the components of the mean? What is the interpretation of components of the cluster mean? Which components most accurately characterize the objects in the cluster?

9. Give an example of a data set consisting of three natural clusters, for which (almost always) K-means would likely find the correct clusters, but bisecting K-means would not.

10. Would the cosine measure be the appropriate similarity measure to use with K-means clustering for time series data? Why or why not? If not, what similarity measure would be more appropriate?

11. Total SSE is the sum of the SSE for each separate attribute. What does it mean if the SSE for one variable is low for all clusters? Low for just one cluster? High for all clusters? High for just one cluster? How could you use the per-variable SSE information to improve your clustering?

12. The leader algorithm (Hartigan [394]) represents each cluster using a point, known as a leader, and assigns each point to the cluster corresponding to the closest leader, unless this distance is above a user-specified threshold. In that case, the point becomes the leader of a new cluster. (An illustrative code sketch of this algorithm is given after the exercises.)

(a) What are the advantages and disadvantages of the leader algorithm as compared to K-means?

(b) Suggest ways in which the leader algorithm might be improved.

13. The Voronoi diagram for a set of K points in the plane is a partition of all the points of the plane into K regions, such that every point (of the plane) is assigned to the closest point among the K specified points. (See Figure 8.38.) What is the relationship between Voronoi diagrams and K-means clusters? What do Voronoi diagrams tell us about the possible shapes of K-means clusters?

Figure 8.38. Voronoi diagram for Exercise 13.

14. You are given a data set with 100 records and are asked to cluster the data. You use K-means to cluster the data, but for all values of K, 1 ≤ K ≤ 100, the K-means algorithm returns only one non-empty cluster. You then apply an incremental version of K-means, but obtain exactly the same result. How is this possible? How would single link or DBSCAN handle such data?

15. Traditional agglomerative hierarchical clustering routines merge two clusters at each step. Does it seem likely that such an approach accurately captures the (nested) cluster structure of a set of data points? If not, explain how you might postprocess the data to obtain a more accurate view of the cluster structure.

16. Use the similarity matrix in Table 8.13 to perform single and complete link hierarchical clustering. Show your results by drawing a dendrogram. The dendrogram should clearly show the order in which the points are merged.

Table 8.13. Similarity matrix for Exercise 16.

        p1     p2     p3     p4     p5
  p1   1.00   0.10   0.41   0.55   0.35
  p2   0.10   1.00   0.64   0.47   0.98
  p3   0.41   0.64   1.00   0.44   0.85
  p4   0.55   0.47   0.44   1.00   0.76
  p5   0.35   0.98   0.85   0.76   1.00

17. Hierarchical clustering is sometimes used to generate K clusters, K > 1, by taking the clusters at the Kth level of the dendrogram. (Root is at level 1.) By looking at the clusters produced in this way, we can evaluate the behavior of hierarchical clustering on different types of data and clusters, and also compare hierarchical approaches to K-means.

The following is a set of one-dimensional points: {6, 12, 18, 24, 30, 42, 48}.

(a) For each of the following sets of initial centroids, create two clusters by assigning each point to the nearest centroid, and then calculate the total squared error for each set of two clusters. Show both the clusters and the total squared error for each set of centroids.

i. {18, 45}
ii. {15, 40}

(b) Do both sets of centroids represent stable solutions; i.e., if the K-means algorithm was run on this set of points using the given centroids as the starting centroids, would there be any change in the clusters generated?

(c) What are the two clusters produced by single link?
(d) Which technique, K-means or single link, seems to produce the "most natural" clustering in this situation? (For K-means, take the clustering with the lowest squared error.)

(e) What definition(s) of clustering does this natural clustering correspond to? (Well-separated, center-based, contiguous, or density.)

(f) What well-known characteristic of the K-means algorithm explains the previous behavior?

18. Suppose we find K clusters using Ward's method, bisecting K-means, and ordinary K-means. Which of these solutions represents a local or global minimum? Explain.

19. Hierarchical clustering algorithms require O(m^2 log(m)) time, and consequently, are impractical to use directly on larger data sets. One possible technique for reducing the time required is to sample the data set. For example, if K clusters are desired and √m points are sampled from the m points, then a hierarchical clustering algorithm will produce a hierarchical clustering in roughly O(m) time. K clusters can be extracted from this hierarchical clustering by taking the clusters on the Kth level of the dendrogram. The remaining points can then be assigned to a cluster in linear time, by using various strategies. To give a specific example, the centroids of the K clusters can be computed, and then each of the m − √m remaining points can be assigned to the cluster associated with the closest centroid.

For each of the following types of data or clusters, discuss briefly if (1) sampling will cause problems for this approach and (2) what those problems are. Assume that the sampling technique randomly chooses points from the total set of m points and that any unmentioned characteristics of the data or clusters are as optimal as possible. In other words, focus only on problems caused by the particular characteristic mentioned. Finally, assume that K is very much less than m.

(a) Data with very different sized clusters.
(b) High-dimensional data.
(c) Data with outliers, i.e., atypical points.
(d) Data with highly irregular regions.
(e) Data with globular clusters.
(f) Data with widely different densities.
(g) Data with a small percentage of noise points.
(h) Non-Euclidean data.
(i) Euclidean data.
(j) Data with many and mixed attribute types.

20. Consider the following four faces shown in Figure 8.39. Again, darkness or number of dots represents density. Lines are used only to distinguish regions and do not represent points.

Figure 8.39. Figure for Exercise 20 (panels (a)–(d)).

(a) For each figure, could you use single link to find the patterns represented by the nose, eyes, and mouth? Explain.

(b) For each figure, could you use K-means to find the patterns represented by the nose, eyes, and mouth? Explain.

(c) What limitation does clustering have in detecting all the patterns formed by the points in Figure 8.39(c)?

21. Compute the entropy and purity for the confusion matrix in Table 8.14.

Table 8.14. Confusion matrix for Exercise 21.

  Cluster   Entertainment   Financial   Foreign   Metro   National   Sports   Total
  #1                1             1         0       11        4        676     693
  #2               27            89       333      827      253         33    1562
  #3              326           465         8      105       16         29     949
  Total           354           555       341      943      273        738    3204

22. You are given two sets of 100 points that fall within the unit square. One set of points is arranged so that the points are uniformly spaced. The other set of points is generated from a uniform distribution over the unit square.

(a) Is there a difference between the two sets of points?

(b) If so, which set of points will typically have a smaller SSE for K = 10 clusters?

(c) What will be the behavior of DBSCAN on the uniform data set? The random data set?

23. Using the data in Exercise 24, compute the silhouette coefficient for each point, each of the two clusters, and the overall clustering.

24. Given the set of cluster labels and similarity matrix shown in Tables 8.15 and 8.16, respectively, compute the correlation between the similarity matrix and the ideal similarity matrix, i.e., the matrix whose ijth entry is 1 if two objects belong to the same cluster, and 0 otherwise.
Table 8.15. Table of cluster labels for Exercise 24.

  Point   Cluster Label
  P1            1
  P2            1
  P3            2
  P4            2

Table 8.16. Similarity matrix for Exercise 24.

        P1     P2     P3     P4
  P1   1.00   0.80   0.65   0.55
  P2   0.80   1.00   0.70   0.60
  P3   0.65   0.70   1.00   0.90
  P4   0.55   0.60   0.90   1.00

25. Compute the hierarchical F-measure for the eight objects {p1, p2, p3, p4, p5, p6, p7, p8} and the hierarchical clustering shown in Figure 8.40. Class A contains points p1, p2, and p3, while p4, p5, p6, p7, and p8 belong to class B.

Figure 8.40. Hierarchical clustering for Exercise 25: the root {p1, p2, p3, p4, p5, p6, p7, p8} splits into {p1, p2, p4, p5} and {p3, p6, p7, p8}; {p1, p2, p4, p5} splits into {p1, p2} and {p4, p5}, and {p3, p6, p7, p8} splits into {p3, p6} and {p7, p8}.

26. Compute the cophenetic correlation coefficient for the hierarchical clusterings in Exercise 16. (You will need to convert the similarities into dissimilarities.)

27. Prove Equation 8.14.

28. Prove Equation 8.16.

29. Prove that Σ_{i=1}^{K} Σ_{x∈C_i} (x − m_i)(m − m_i) = 0. This fact was used in the proof that TSS = SSE + SSB in Section 8.5.2.

30. Clusters of documents can be summarized by finding the top terms (words) for the documents in the cluster, e.g., by taking the most frequent k terms, where k is a constant, say 10, or by taking all terms that occur more frequently than a specified threshold. Suppose that K-means is used to find clusters of both documents and words for a document data set.

(a) How might a set of term clusters defined by the top terms in a document cluster differ from the word clusters found by clustering the terms with K-means?

(b) How could term clustering be used to define clusters of documents?

31. We can represent a data set as a collection of object nodes and a collection of attribute nodes, where there is a link between each object and each attribute, and where the weight of that link is the value of the object for that attribute. For sparse data, if the value is 0, the link is omitted. Bipartite clustering attempts to partition this graph into disjoint clusters, where each cluster consists of a set of object nodes and a set of attribute nodes. The objective is to maximize the weight of links between the object and attribute nodes of a cluster, while minimizing the weight of links between object and attribute nodes in different clusters. This type of clustering is also known as co-clustering since the objects and attributes are clustered at the same time.

(a) How is bipartite clustering (co-clustering) different from clustering the sets of objects and attributes separately?

(b) Are there any cases in which these approaches yield the same clusters?

(c) What are the strengths and weaknesses of co-clustering as compared to ordinary clustering?

32. In Figure 8.41, match the similarity matrices, which are sorted according to cluster labels, with the sets of points. Differences in shading and marker shape distinguish between clusters, and each set of points contains 100 points and three clusters. In the set of points labeled 2, there are three very tight, equal-sized clusters.

Figure 8.41. Points and similarity matrices for Exercise 32.
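As a supplement to Exercise 12, the following is a minimal Python sketch of the leader algorithm as described there. The function name leader_clustering, the use of Euclidean distance, and the example threshold are illustrative assumptions and are not part of the exercise.

```python
# Minimal sketch of the leader algorithm (Exercise 12): each cluster is
# represented by a single point, its leader; a point is assigned to the
# cluster of the closest leader unless that distance exceeds a
# user-specified threshold, in which case the point becomes the leader
# of a new cluster.
import numpy as np

def leader_clustering(points, threshold):
    leaders = []   # one representative (leader) point per cluster
    labels = []    # cluster index assigned to each point, in input order
    for x in points:
        if leaders:
            dists = np.linalg.norm(np.asarray(leaders) - x, axis=1)
            nearest = int(np.argmin(dists))
            if dists[nearest] <= threshold:
                labels.append(nearest)
                continue
        # No leader is close enough, so x starts (and leads) a new cluster.
        leaders.append(np.asarray(x, dtype=float))
        labels.append(len(leaders) - 1)
    return np.asarray(leaders), np.asarray(labels)

# Illustrative usage on random two-dimensional points.
rng = np.random.default_rng(0)
X = rng.uniform(size=(20, 2))
leaders, labels = leader_clustering(X, threshold=0.3)
print(f"{len(leaders)} clusters found for threshold 0.3")
```

Note that, unlike K-means, this procedure makes a single pass over the data and its result depends on the order in which the points are presented, which is relevant to part (a) of the exercise.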