Learning Large Scale Common Sense Models of Everyday Life

William Pentney
Department of Computer Science & Engineering
University of Washington
bill@cs.washington.edu

Jeff Bilmes
Department of Electrical Engineering
University of Washington
bilmes@ee.washington.edu
Abstract
Recent work has shown promise in using large, publicly available, hand-contributed common sense databases as joint models that can be used to infer human state from day-to-day sensor data. The parameters of these models are mined from the web. We show in this paper that learning these parameters using sensor data (with the mined parameters as priors) can improve performance of the models significantly. The primary challenge in learning is scale. Since the model comprises roughly 50,000 irregularly connected nodes in each time slice, it is intractable either to completely label observed data manually or to compute the expected likelihood of even a single time slice. We show how to solve the resulting semi-supervised learning problem by combining a variety of conventional approximation techniques with a novel technique for simplifying the model called context-based pruning. We show empirically that the learned model is substantially better at interpreting sensor data, and we present a detailed analysis of how the various techniques contribute to this performance.
Introduction
Sensor-based methods for inferring the state of people and their environment have a variety of applications such as eldercare management (Wilson et al. 2005) and institutional workflow management. A system that could tell whether an elderly person living alone has a cold, has taken medication, or is depressed, for instance, could substantially reduce the financial burden of care. A key challenge in building such systems is the need for models that relate low-level sensor signals (e.g., vision) to high-level concepts (e.g., depression). The conventional approach to acquiring such models is to apply machine learning techniques to labeled sensor data, given a "structure" prior on the dependencies between variables. The structure itself is often provided by application developers in a manual fashion. However, if many of the tens of thousands of aspects of daily life are to be tracked, the variables and their relationships amount to a "common sense" encoding of daily life, and manual specification becomes impractical for a single individual.
Matthai Philipose
Intel Research Seattle
matthai.philipose@intel.com

Henry Kautz
Department of Computer Science
University of Rochester
kautz@cs.rochester.edu
graph represents a wide variety of random variables about the state of the world at each step (from people's emotions to the state of their roof), it is realistic to expect manual labeling of no more than a very small fraction of the nodes at each time step. The end result is a very large semi-supervised learning problem (Zhu 2005) over a mixed directed/undirected probabilistic graphical model.
Although general approaches to semi-supervised learning of graphical models do exist, solving large problems requires considerable sophistication in exploiting problem structure in practice. In this light, this paper makes three main contributions. First, it formulates the common sense model learning problem as maximum likelihood estimation on a graphical model with mined priors, and applies a variety of existing techniques to make learning tractable. Second, it presents a simple novel technique called context-based pruning to speed up inference substantially. Context-based pruning captures the intuition that we may be able to associate with sets of observations a subset of variables (called a context) that are likely to be affected by these observations. By identifying, and avoiding reasoning about, irrelevant contexts, it may be possible to substantially reduce the number of variables considered during inference. Third, it provides a detailed evaluation that shows that learning is both reasonably fast and improves the SRCS model significantly, and that many of our design choices contribute significantly to this result. To our knowledge, this work is the first to show how to improve a large common sense database by observing sensor data of human activity.
Preliminaries
The model on which we perform learning is a slight variation of the SRCS model (Pentney et al. 2006). We are given a set of observations o = o_1, o_2, ..., o_r; in our context, these observations will be Boolean random variables. We wish to infer the state of a set of query random variables s = s_1, s_2, ..., s_{n-r}, also Boolean, for a given slice of time t. Collectively, we will speak of the state of the observations and queries as the set of variables x = x_1, x_2, ..., x_n, where x_1, ..., x_r = o_1, ..., o_r and x_{r+1}, ..., x_n = s_1, ..., s_{n-r}, and the set of the observations and queries at some specific point in time t as the state vector x_t = x_{1t}, x_{2t}, x_{3t}, ..., x_{nt}. We assume that we have a set of weighted propositional Horn clauses (c_1, q_1), (c_2, q_2), ..., (c_m, q_m), where each clause c_i is a relation over Boolean variables about the state of the environment, weighted by a quality score q_i in [0, 1]. For instance, the clause used(bowl) → action(make cereal) may have weight 0.9, and represents the proposition "if the bowl is being used, the user is making cereal."
Given a set of Horn clauses, we produce a probabilistic graphical model designed to model the state of the environment over time. As in SRCS, time slice t may be modeled with a Markov random field, in which x_t will represent the state of the world at t. Some of the variables x_t can be grounded in actual observations; if we can observe the use of a bowl at time t, for instance, the variable representing used(bowl) may be provided as evidence. In a slight deviation from SRCS (which produces one feature per Horn clause), we represent each Horn clause c_i by two binary-valued feature functions φ_i(x_{c_i}) and φ_{m+i}(x_{c_i}), where x_{c_i} is the set of all variables x_j which appear in c_i. The function φ_i(x_{c_i}) is a positive feature, and set to 1 if all variables in x_{c_i} are true (i.e. both the left and right sides of the Horn clause are true, and thus c_i is satisfied), and 0 otherwise. The function φ_{m+i}(x_{c_i}), conversely, is a negative feature, and set to 1 if the assignment to x_{c_i} violates the Horn clause and 0 otherwise. We assign weight λ_i = q_i to positive feature φ_i(x_{c_i}), and λ_{m+i} = -q_i to negative feature φ_{m+i}(x_{c_i}). For simplicity, we may sometimes refer to φ_i(x̂), where x̂ is an entire state vector; this should be read as φ_i(x̂_{c_i}), where x̂_{c_i} is the assignment imposed on x_{c_i} by x̂. The full set of all such feature functions φ = φ_1, φ_2, ..., φ_{2m} will represent the relationships between different variables at time t.
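As a concrete illustration, the pair of feature functions induced by a single weighted Horn clause can be sketched as follows (a minimal sketch; the function and variable names are ours, not from the SRCS implementation):

```python
# Sketch of SRCS-style clause features: a Horn clause c_i over variables
# x_{c_i} yields a positive feature (1 iff all clause variables are true)
# and a negative feature (1 iff the clause is violated), with weights
# lambda_i = q_i and lambda_{m+i} = -q_i respectively.

def make_clause_features(body, head, quality):
    """body: list of variable names; head: variable name; quality: q_i in [0,1]."""
    variables = body + [head]

    def positive(assignment):
        # 1 iff every variable in the clause (body and head) is true
        return 1 if all(assignment[v] for v in variables) else 0

    def negative(assignment):
        # 1 iff the Horn clause is violated: body all true, head false
        return 1 if all(assignment[v] for v in body) and not assignment[head] else 0

    # (feature function, weight) pairs
    return [(positive, quality), (negative, -quality)]

feats = make_clause_features(["used(bowl)"], "action(make cereal)", 0.9)
(pos, w_pos), (neg, w_neg) = feats
state = {"used(bowl)": True, "action(make cereal)": True}
assert pos(state) == 1 and neg(state) == 0   # clause satisfied: positive fires
assert w_pos == 0.9 and w_neg == -0.9
```

Note that when all clause variables are false, neither feature fires: the Horn clause is vacuously satisfied, but the positive feature requires every variable to be true.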
We follow SRCS in modeling how variables x_t change over time. For simplicity, we will assume that, given that x_{i,t-1} = p, x_{it} = p with some fixed probability σ_p. This is represented with another set of feature functions ψ = {ψ_1, ψ_2, ..., ψ_n}, where ψ_i is a function of (x_{i,t-1}, x_{it}), ψ_i(p, p) = σ_p, and ψ_i(p, q) = 1 - σ_p, where p, q ∈ {0, 1}, p ≠ q. Intuitively, this simplifying model means that, given no other evidence, variables will remain in the state they are in, but with declining probability over time.
The probability of the state x_t given the state x_{t-1} as evidence may be calculated in this model as

    P(x_t | x_{t-1}) = (1/Z_{x̂_{t-1}}) exp( Σ_i λ_i φ_i(x_t) + Σ_i μ_i ψ_i(x_t, x_{t-1}) )

Given a training trace x̂ = x̂_1, ..., x̂_d, the log-likelihood LL(λ, μ) = log P(x̂ | λ, μ) decomposes over time slices by the Markov property, so log P(x̂ | λ, μ) = Σ_{t=1}^{d} log P(x̂_t | x̂_{t-1}, λ, μ). The gradient of LL with respect to each λ_j is

    ∂LL/∂λ_j = Σ_t [ φ_j(x̂_t) - Σ_{x_t} P(x_t) φ_j(x_t) ]

and with respect to each μ_j is

    ∂LL/∂μ_j = Σ_t [ ψ_j(x̂_t, x̂_{t-1}) - Σ_{x_{it}, x_{i,t-1}} P(x_{it}, x_{i,t-1}) ψ_j(x_{it}, x_{i,t-1}) ]

The second terms of these equations are equal to E(φ_j(x̂)) and E(ψ_j(x̂)), respectively, and may be calculated efficiently by computing the marginal probabilities of each x̂_{c_i} using the well-known belief propagation (BP) algorithm (Pearl 1988). The partition functions log Z_{x̂_{t-1},x̂_t\y_t} and log Z_{x̂_{t-1}} (which arise in the partially labeled setting described below) can be prohibitively expensive to calculate, even with many standard approximation techniques such as MCMC. To efficiently calculate an approximation to both, we use the Bethe free energy approximation technique, described in (Yedidia, Freeman, & Weiss 2004); this technique approximates the value of log Z_{x̂_{t-1}} using the marginal probabilities of the variables x_{it} and the feature functions φ_j and ψ_j. We may then optimize LL by finding the model parameters (λ, μ) for which the gradient is zero using stochastic gradient descent (SGD) methods. SGD optimization has been shown to be effective for optimization of conditional random fields in other contexts (Vishwanathan et al. 2006). It should be noted that while some other researchers have found the L-BFGS algorithm to be effective for optimization (Sutton, Rohanimanesh, & McCallum 2004), we found SGD to provide much faster convergence in our experiments.
Let ∇θ_k(x) be the gradient of the likelihood function on training instance x under model θ_k. The standard SGD algorithm begins with an initial model θ_0 = (λ, μ) and iteratively updates the model by selecting a single training example x, calculating the gradient ∇θ_{k-1}(x), and updating the model with the equation

    θ_k = θ_{k-1} - η_k ∇θ_{k-1}(x)

where η_k is the gain factor for iteration k. We select training instances for each iteration by selecting a random permutation of the entire set of traces and proceeding through this permutation, one iteration per trace. The gain factor controls the rate of convergence of SGD, and much research has been concerned with effective selection and iterative updating of η_k. We begin with η_1 = 0.1 and follow a technique similar to that of (LeCun et al. 1998) for gain selection. Our algorithm typically ran for 500-550 iterations.
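The update loop above can be sketched as follows. This is an illustrative sketch on a toy one-parameter objective, not the paper's implementation; for simplicity it holds the gain constant rather than decaying it per the LeCun et al. schedule, and the toy gradient stands in for the CRF gradient of the previous section:

```python
import random

# Minimal SGD driver for the update theta_k = theta_{k-1} - eta_k * grad(x):
# one iteration per trace, traces visited in a fresh random permutation each
# pass (as in the paper). Constant gain eta is an assumption made for brevity.

def sgd(theta, traces, gradient, eta=0.1, epochs=200, seed=0):
    rng = random.Random(seed)
    for _ in range(epochs):
        order = traces[:]
        rng.shuffle(order)              # random permutation of the trace set
        for trace in order:             # one SGD iteration per trace
            g = gradient(theta, trace)
            theta = [t - eta * gi for t, gi in zip(theta, g)]
    return theta

# Toy objective: squared distance to each trace (its negative log-likelihood
# gradient is 2 * (theta - x)); the optimum is the mean of the traces.
traces = [1.0, 2.0, 3.0]
grad = lambda th, x: [2.0 * (th[0] - x)]
theta = sgd([0.0], traces, grad)
assert abs(theta[0] - 2.0) < 0.2        # settles near the mean of the traces
```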
In practice, we will probably not have access to a fully labeled data trace x̂; providing correct labels for every query about the state of the world at every point in a period of time can be quite expensive, if not impossible. We will thus assume that each slice x̂_t may be only partially labeled (perhaps very sparsely so), with some set of variables y_t unlabeled at each t. The computation of each term in LL is then

    log P(x̂_t | x̂_{t-1}) = log [ Σ_{ŷ_t} exp( Σ_i λ_i φ_i(x̂_t|ŷ_t) + Σ_i μ_i ψ_i(x̂_t|ŷ_t, x̂_{t-1}|ŷ_{t-1}) ) ] - log Z_{x̂_{t-1}}
                          = log Z_{x̂_{t-1},x̂_t\y_t} - log Z_{x̂_{t-1}}

where x̂_t|ŷ_t is the partial assignment x̂_t with the unlabeled variables y_t set to ŷ_t. Unfortunately, both of these log partition functions are expensive to compute exactly; the Bethe approximation of log Z_{x̂_{t-1}} takes the form
    log Z_{x̂_{t-1}} = -[ Σ_{x̂_t} Σ_{φ_j} P(φ_j) log λ_j φ_j(x̂_t) + Σ_{ψ_j} P(ψ_j) log μ_j ψ_j(x̂_t, x̂_{t-1})
                          - Σ_{φ_j} P(φ_j) log P(φ_j) + Σ_{ψ_j} P(ψ_j) log P(ψ_j) ]
                      + Σ_k (d_k - 1) P(x̂_{kt}) log P(x̂_{kt})

where d_k is the degree of variable x_k, or the number of features φ_j and ψ_j dependent upon x_k. Since the marginals can generally be calculated correctly and efficiently using BP on the entire model with no variables fixed, this can be computed efficiently for learning, and we have generally found it to be a good approximation in practice. Similarly, the Bethe approximation of log Z_{x̂_{t-1},x̂_t\y_t} may be calculated by performing BP with the variables fixed by x̂_t and x̂_{t-1} clamped to the appropriate values.
To improve our optimization procedure, we make two more changes. First, it should be noted that our optimization procedure may result in overfitting. To limit this, we actually optimize the function RLL(λ, μ) = LL(λ, μ) - N(λ, μ), where N(λ, μ) is a Gaussian function of the vector (λ, μ) with mean zero and variance η (set to 5 in practice). This term penalizes excessively large weights in the vectors λ and μ. We then have ∂RLL/∂λ_j = ∂LL/∂λ_j - λ_j/η and ∂RLL/∂μ_j = ∂LL/∂μ_j - μ_j/η, and may optimize as described. Second, in practice, because true instances of variables are relatively rare, the optimization procedure may favor a model which is disproportionately likely to label variables as false. We thus give more weight to certain slices containing true instances of queries that are less frequently true in the training data. For the labeled query variables q_1, ..., q_l, let t_{q_i} be the number of true appearances of q_i in a time slice in the training data x̂ = x̂_1, ..., x̂_d, and α(q_i) = (d - t_{q_i})/…
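Assuming the Gaussian penalty takes the standard form N(λ, μ) = Σ_j w_j²/(2η) over all weights w_j (an assumption consistent with the gradient ∂RLL/∂w_j = ∂LL/∂w_j - w_j/η above), the regularized gradient can be sketched as follows; the names are illustrative:

```python
# Sketch of the regularized objective RLL = LL - N(lambda, mu): a zero-mean
# Gaussian penalty with variance eta on the weights, whose gradient subtracts
# w_j / eta from each log-likelihood gradient component (eta = 5 in the paper).

ETA = 5.0

def penalty(weights, eta=ETA):
    # N(w) = sum_j w_j^2 / (2 * eta): penalizes excessively large weights
    return sum(w * w for w in weights) / (2.0 * eta)

def regularized_gradient(ll_gradient, weights, eta=ETA):
    # d RLL / d w_j = d LL / d w_j - w_j / eta
    return [g - w / eta for g, w in zip(ll_gradient, weights)]

grads = regularized_gradient([0.3, -0.1], [10.0, 0.5])
assert grads == [0.3 - 2.0, -0.1 - 0.1]   # large weights are pulled toward zero
```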
Figure 1: Example of our dynamic graphical model with use of context for inference. Shaded nodes represent observations, and are true when black; the dashed regions are the observations' contexts. When observations are true, features in those observations' contexts are applied to that time slice. The directed arrows represent the temporal features between slices.
Figure 2: Distribution of sizes of feature sets over time slices with context-based pruning (number of slices, log scale, vs. pct. of total features in the feature set).
breadth-first traversals to a depth d on each fact p_i (in the paper, d = 2); this eliminates a large number of the nodes and features in the graph. While this is an effective technique, it requires foreknowledge of what predicates one wishes to query over before constructing the graph.
To improve efficiency in another manner, we take advantage of intuition about the structure of our graph. Our collection of everyday common sense predicates contains many predicates whose respective states are likely to be mostly unrelated. For example, if we observe that a kitchen implement has been used, this is unlikely to affect the state of the bathroom sink, whose intuitive "context" is different. Thus, if observations relating to the bathroom sink have not changed at this time as well, it is unlikely that the bathroom sink will be affected by the new observations this time slice. Given this intuition, we seek to partition the time slice graph into subgraphs based on the features defined upon the graph. We refer to this as context-based pruning.
Most variables, at any given point in time, are false. Given an observation o represented by random variable x_o, it is our goal to find an appropriate set of nodes C_o which are most likely to become true as a result of o being true. To find these nodes, we calculate the point-wise mutual information between each x_i and o for true values of x_i and o in a training set x̂. The point-wise mutual information between values x̂_i and ô of x_i and o in x̂ is calculated as

    PMI(x̂_i, ô) = log [ P(x_i = x̂_i, o = ô) / (P(x_i = x̂_i) P(o = ô)) ] · 1_{x̂_i,ô}

where 1_{x̂_i,ô} = 1 if x̂_i = 1 and ô = 1, and 0 otherwise. However, x_i might not be labeled in the training data. In this case, we may estimate PMI(x̂_i, ô) by estimating P(x_i = x̂_i, o = ô), P(x_i = x̂_i) and P(o = ô). We estimate these values by performing BP on our existing model with all labeled x_i fixed appropriately. To find the context of an observation o, we then select a threshold ǫ_o and let the context of o be C_o = {x_i : PMI(x_i, o) > ǫ_o}. In our experimental environment, which contained 24 distinct observations, the size of the resulting contexts ranges from a couple of nodes to several thousand. This technique permits the pruning of many features from the graph at each time slice that are unlikely to carry meaningful information about queries based on observations, and many nodes which are not sufficiently relevant to any possible observation may be pruned from the graph. We produce a set of contexts C = (C_{o_1}, C_{o_2}, ..., C_{o_r}) for the full set of observations based on the training data as described. Let ζ(C) = {φ_i : x_{c_i} ⊂ C}, let O_{x_t} be the set of all observations o_i set to true in assignment x_t, and let ζ_{x_t} = ζ( ∪_{o_i ∈ O_{x_t}} C_{o_i} ).
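A minimal sketch of the PMI computation and context selection on fully labeled traces follows; the function names and toy traces are ours, and the probabilities are estimated here by simple counting rather than by BP:

```python
from math import log

# Illustrative context-based pruning: compute PMI between true values of each
# candidate variable x_i and an observation o over aligned Boolean traces,
# then keep the variables whose PMI exceeds a threshold eps_o.

def pmi_true(xi_trace, o_trace):
    """Point-wise mutual information for x_i = 1, o = 1 over aligned traces."""
    n = len(o_trace)
    p_joint = sum(1 for x, o in zip(xi_trace, o_trace) if x and o) / n
    p_xi = sum(xi_trace) / n
    p_o = sum(o_trace) / n
    if p_joint == 0 or p_xi == 0 or p_o == 0:
        return float("-inf")               # never co-occur: exclude from context
    return log(p_joint / (p_xi * p_o))

def context(variables, o_trace, eps):
    """C_o = {x_i : PMI(x_i, o) > eps}."""
    return {name for name, trace in variables.items()
            if pmi_true(trace, o_trace) > eps}

o = [1, 1, 0, 0]                           # observation trace, e.g. used(bowl)
variables = {
    "action(make cereal)": [1, 1, 0, 0],   # co-occurs exactly with o
    "stateof(roof,leaky)": [0, 1, 1, 0],   # statistically unrelated to o
}
assert context(variables, o, eps=0.1) == {"action(make cereal)"}
```

The perfectly co-occurring variable gets PMI log 2 ≈ 0.69 and survives the threshold, while the unrelated variable gets PMI 0 and is pruned from the context.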
To perform inference, we then compute the probability

    P(x_t | x_{t-1}) = (1/Z_{x_{t-1}}) exp( Σ_{φ_i ∈ ζ_{x_t}} λ_i φ_i(x_t) + Σ_i μ_i ψ_i(x_t, x_{t-1}) )

restricting the slice's clause features to those in ζ_{x_t}.
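Assembling the active feature set ζ_{x_t} from the contexts of the currently true observations can be sketched as follows (the names and toy data are ours):

```python
# Sketch of zeta_{x_t}: take the union of the contexts C_o of every observation
# o set to true at time t, and keep only the features whose clause variables
# x_{c_i} all fall inside that union.

def active_features(features, contexts, observations_true):
    # union of contexts C_o for the observations true at this slice
    active_vars = set()
    for o in observations_true:
        active_vars |= contexts[o]
    # zeta = {phi_i : x_{c_i} subset of the active variable set}
    return [name for name, scope in features if set(scope) <= active_vars]

contexts = {"used(bowl)": {"used(bowl)", "action(make cereal)"},
            "used(soap)": {"used(soap)", "location(shower)"}}
features = [("phi_cereal", ["used(bowl)", "action(make cereal)"]),
            ("phi_shower", ["used(soap)", "location(shower)"])]
# only used(bowl) is true, so only the cereal clause feature stays active
assert active_features(features, contexts, ["used(bowl)"]) == ["phi_cereal"]
```

Features outside every active context are simply skipped for the slice, which is what shrinks the per-slice feature sets shown in Figure 2.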
Table: recall, F-measure, and inference time with and without learning (FL/SGD) and with and without (d = 2) context-based pruning. Reported recall rises from 24.85%-72.18% without learning to 69.31%-94.87% with FL/SGD learning, with corresponding gains in F-measure; reported inference times include 8:13, 2:41, and 1:12.
Table: per-query accuracy and F-measure with no learning vs. with learning, for queries including location(shower), action(addmilkto), and stateof(window,dirty). Accuracies with no learning range from 61.65% to 96.12% (with F-measures of 0.00); with learning, accuracies rise to 94.87%-99.47%, with F-measures including 0.00, 5.55, and 47.48.
Table: average inference time per slice (s) for (d = 2)-based context vs. PMI-based context; reported values include 22.26 and 22.98, alongside 15.576 (avg. time/slice), 5.12 (d = 1 subgraph), and 1.08 (d = 3 subgraph).