Statistical Framework

$\hat{W} = \arg\max_W P(W \mid O) = \arg\max_W \underbrace{p(O \mid W)}_{\text{acoustic model}} \; \underbrace{P(W)}_{\text{language model}}$

Language Model $P(W)$ — prior probability of word sequence $W = w_1, w_2, \ldots, w_n$

LM does not depend on acoustics — can be obtained from text data
n-grams (1)

Without assuming anything:
$P(w_1, w_2, \ldots, w_n) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_n \mid w_1, \ldots, w_{n-1})$

Map context into a finite set of equivalence classes

$n$-gram model assumes equivalence classes are the previous $n-1$ words:
$P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$

trigram: $P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-2}, w_{i-1})$
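As a small illustration, here is a minimal Python sketch of the trigram assumption: a sentence probability is computed as a product of conditional probabilities over the previous two words only. The probability function `trigram_prob` and the padding symbol `<s>` are illustrative placeholders, not part of the lecture.

```python
import math

def sentence_logprob(words, trigram_prob):
    """Trigram approximation: log P(w_1..w_n) ~= sum_i log P(w_i | w_{i-2}, w_{i-1})."""
    padded = ["<s>", "<s>"] + words          # pad the start-of-sentence history
    logp = 0.0
    for i in range(2, len(padded)):
        logp += math.log10(trigram_prob(padded[i - 2], padded[i - 1], padded[i]))
    return logp
```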
n-grams (2)

How large should $n$ be?

Larger $n$: much better context, e.g. "He'll play the world number one Pete Sampras"

Smaller $n$: far fewer parameters to estimate, since the total number of $n$-grams scales exponentially with $n$:

Model      Number of parameters
unigram    $V$
bigram     $V^2$
trigram    $V^3$
four-gram  $V^4$

(for a vocabulary of $V$ words)
Building n-grams

Remove punctuation and normalize text

Finite vocabulary (e.g. a fixed word list): map out-of-vocabulary (OOV) words to an unknown symbol

Estimate conditional probabilities as a ratio of joint probabilities:
$P(w_3 \mid w_1, w_2) = \dfrac{P(w_1, w_2, w_3)}{P(w_1, w_2)}$
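A rough sketch of the text preparation step described above. The vocabulary, the `<UNK>` symbol, and the punctuation handling are illustrative choices rather than a prescribed pipeline.

```python
import string

def normalize(line, vocab):
    """Strip punctuation, uppercase, and map OOV words to an unknown symbol."""
    line = line.translate(str.maketrans("", "", string.punctuation))
    words = line.upper().split()
    return [w if w in vocab else "<UNK>" for w in words]

vocab = {"PLAY", "THE", "WORLD", "NUMBER", "ONE"}     # toy vocabulary
print(normalize("Play the world, number one: Pete Sampras!", vocab))
# ['PLAY', 'THE', 'WORLD', 'NUMBER', 'ONE', '<UNK>', '<UNK>']
```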
Maximum Likelihood Estimation

Use relative frequency as a probability estimate:
$P(w_1, w_2, w_3) = \dfrac{c(w_1, w_2, w_3)}{N}$
where $c(w_1, \ldots, w_n)$ is the count of word sequence $w_1 \ldots w_n$ in the training data. So,
$P(w_3 \mid w_1, w_2) = \dfrac{c(w_1, w_2, w_3)}{c(w_1, w_2)}$

This is the maximum likelihood (ML) solution: the model that maximizes the probability of the training data.
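A minimal ML estimator built directly from the count ratio above (the corpus handling and the `<s>` padding are illustrative):

```python
from collections import Counter

def mle_trigram_model(sentences):
    """Relative-frequency estimates: P(w3 | w1, w2) = c(w1, w2, w3) / c(w1, w2)."""
    tri, bi = Counter(), Counter()
    for words in sentences:
        padded = ["<s>", "<s>"] + words
        for i in range(2, len(padded)):
            tri[tuple(padded[i - 2:i + 1])] += 1
            bi[tuple(padded[i - 2:i])] += 1
    def prob(w1, w2, w3):
        return tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    return prob

p = mle_trigram_model([["HE'LL", "PLAY", "THE", "WORLD"]])
print(p("HE'LL", "PLAY", "THE"))   # 1.0
print(p("PLAY", "THE", "NUMBER"))  # 0.0: an unseen trigram gets zero probability
```

The zero in the last line is exactly the sparsity problem discussed next.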
Data Sparseness

Language modelling suffers from data sparsity: e.g. trigrams over a large word vocabulary. A large training set may contain many millions of words — there are 1 million times as many possible trigrams...

Data sparsity + maximum likelihood = zero probabilities! If word sequence $(w_1, w_2, w_3)$ is not observed, then $P(w_3 \mid w_1, w_2)$ is set to $0$ by maximum likelihood; this means any word sequence including that sub-sequence will have probability $0$.

With ML-estimated $n$-grams, any word sequence that was not present in the training data cannot be recognized.
Smoothing

An alternative solution is to smooth the probability estimates so that no word sequence is given a probability of $0$.

Discounting: adjust our probability estimators so that a zero relative frequency in the training data does not imply a zero probability; estimates are no longer simply relative counts.

Model Combination: combine models (unigram, bigram, trigram, ...) in such a way that we use the most precise model available: interpolation and back-off.
Comparison of Discounting Methods

Church and Gale (1991) compared various discounting strategies on a newswire corpus split into a training set of 22 million words and a test set of 22 million words. The corpus contains many different word types in total, not all of which occur in the training set.
Discounting: Notation

$N$ is the total number of training instances
$B$ is the total number of possible bigrams
$N_r$ is the number of bigrams that occur $r$ times in the training corpus
$c(w_1, w_2)$ is the training corpus count of bigram $(w_1, w_2)$
$r^*$ is the discounted count for bigrams occurring $r$ times
$P(w_1, w_2)$ is the probability estimate for $(w_1, w_2)$
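To make the notation concrete, a small sketch that computes $N$, $B$, $N_r$ and the bigram counts from a toy corpus (the corpus and variable names are made up for illustration):

```python
from collections import Counter

words = "the cat sat on the mat the cat slept".split()
bigrams = list(zip(words, words[1:]))

c = Counter(bigrams)              # c(w1, w2): training count of each bigram
N = len(bigrams)                  # N: total number of training instances (8)
B = len(set(words)) ** 2          # B: total number of possible bigrams (36)
N_r = Counter(c.values())         # N_r: number of bigrams occurring r times

print(N, B, dict(N_r))            # 8 36 {1: 6, 2: 1}
```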
Comparison of Discounting Methods for Bigrams

(Table from Church and Gale (1991) comparing the adjusted bigram counts produced by each discounting method.)
Discounting (1): Laplace

Oldest solution (Laplace, 1814). Add 1 to every count:
$P_{\text{Lap}}(w_1, w_2) = \dfrac{c(w_1, w_2) + 1}{N + B}$

On the Church and Gale data: 99.96% of the probability mass is assigned to unseen bigrams!!
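A one-line sketch of the add-one estimate, using the notation above (the toy values of $N$ and $B$ carry over from the earlier sketch):

```python
def laplace_prob(count, N, B):
    """Add-one (Laplace) estimate: P_Lap(w1, w2) = (c(w1, w2) + 1) / (N + B)."""
    return (count + 1) / (N + B)

# With B vastly larger than N, the many unseen bigrams (count = 0) soak up
# almost all of the probability mass, as in the Church and Gale figures.
print(laplace_prob(0, N=8, B=36))   # every unseen bigram still gets 1/44
```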
Discounting (2): Validating on Held-Out Data

We can empirically validate our models using a held-out set of data (the 22 million word test set in Church and Gale (1991))

A language model is good if the probabilities estimated on the training data are close to those that would have been estimated on held-out data

$T_r$: the total number of times that all bigrams with frequency $r$ in the training data appeared in the held-out data

Average held-out frequency of training-frequency-$r$ bigrams is $T_r / N_r$.

The validation estimate for a bigram probability is:
$P_{\text{ho}}(w_1, w_2) = \dfrac{T_r}{N_r \, N} \quad \text{where } c(w_1, w_2) = r$
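A rough sketch of the held-out estimate: for each training frequency $r$, total up how often those bigrams occur in the held-out data and share that mass among the $N_r$ bigram types. The corpus handling is simplified, and the held-out size is used for $N$ (training and held-out sets are the same size in the Church and Gale setup).

```python
from collections import Counter

def held_out_probs(train_bigrams, heldout_bigrams):
    """P_ho(w1, w2) = T_r / (N_r * N) for each bigram with training count r."""
    c_train, c_held = Counter(train_bigrams), Counter(heldout_bigrams)
    N = len(heldout_bigrams)
    N_r, T_r = Counter(), Counter()
    for bg, r in c_train.items():
        N_r[r] += 1                  # N_r: bigram types seen r times in training
        T_r[r] += c_held[bg]         # T_r: their total count in the held-out data
    return {bg: T_r[r] / (N_r[r] * N) for bg, r in c_train.items()}
```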
Discounting (3): Deleted Estimation

Similar idea to the validation estimate above (but that is a cheat, as we use the held-out data as a test set!)

Deleted estimation (cross-validation):
1. Split the training data into $K$ sections
2. For each section $j$: hold out section $j$, compute the counts $N_r^j$ from the remaining $K-1$ sections, and compute $T_r^j$ on the held-out section
3. Estimate probabilities by averaging over all sections:
$P_{\text{del}}(w_1, w_2) = \dfrac{\sum_j T_r^j}{N \sum_j N_r^j} \quad \text{where } c(w_1, w_2) = r$

Works reliably well in practice (if you have enough data)
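A sketch of the common two-section special case ($K = 2$), following the deleted-estimation formula in Manning and Schütze (1999); it returns, for each training frequency $r$, the probability assigned to a bigram seen $r$ times.

```python
from collections import Counter

def deleted_estimates(section_a, section_b):
    """Two-way deleted estimation: pool T_r and N_r from both hold-out directions."""
    c_a, c_b = Counter(section_a), Counter(section_b)
    N = len(section_a) + len(section_b)
    N_r, T_r = Counter(), Counter()
    for c_train, c_held in ((c_a, c_b), (c_b, c_a)):
        for bg, r in c_train.items():
            N_r[r] += 1
            T_r[r] += c_held[bg]
    return {r: T_r[r] / (N * N_r[r]) for r in N_r}
```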
Discounting (4a): Good-Turing

The limit of deleted estimation is "leave-one-out": one bigram in each section

It can be shown that this form of discounting results in the following adjusted frequency count:
$r^* = (r + 1)\, \dfrac{N_{r+1}}{N_r}$

This is known as Good-Turing discounting. In practice a threshold $k$ is placed on $r$ (next slide). Note that the adjusted count $r^*$ is typically smaller than $r$.
Discounting (4b): Good-Turing

The discounted probabilities are thus:
$P_{\text{GT}}(w_1, w_2) = \dfrac{r^*}{N} \quad \text{where } c(w_1, w_2) = r$

Formula only applies when $r \le k$, where $k$ is small (e.g., 10)

Need to renormalize to ensure everything sums to 1
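A sketch of the Good-Turing adjustment with a small threshold $k$, as described above; renormalization is left out, and the fallback for large or problematic counts is an illustrative choice.

```python
from collections import Counter

def good_turing_counts(bigram_counts, k=10):
    """r* = (r + 1) * N_{r+1} / N_r for small r; larger counts are left unchanged."""
    N_r = Counter(bigram_counts.values())
    adjusted = {}
    for bg, r in bigram_counts.items():
        if r <= k and N_r[r + 1] > 0:
            adjusted[bg] = (r + 1) * N_r[r + 1] / N_r[r]
        else:
            adjusted[bg] = float(r)
    return adjusted            # P_GT(w1, w2) = r* / N, then renormalize
```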
Discounting (5): Absolute Discounting

Absolute discounting subtracts a constant $D$ from each non-zero count, and the "leftover" probability is distributed over unseen events:

$P_{\text{abs}}(w_1, w_2) = \begin{cases} \dfrac{r - D}{N} & \text{if } r > 0 \\[1ex] \dfrac{(B - N_0)\, D}{N_0\, N} & \text{if } r = 0 \end{cases}$

Estimate $D$ from held-out data — an appropriately chosen value (between 0 and 1) would work well in the example.
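A direct transcription of the two cases above (values of $D$, $N$, $B$ and $N_0$ are supplied by the caller; nothing here is specific to the lecture's example):

```python
def absolute_discount_prob(r, D, N, B, N_0):
    """Seen bigrams lose a constant D; the freed mass is split over the N_0 unseen ones.
    (B - N_0 is the number of distinct bigrams actually seen in training.)"""
    if r > 0:
        return (r - D) / N
    return (B - N_0) * D / (N_0 * N)
```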
Combination (1): Interpolation

As well as adjusting counts by discounting, it is better to use less-specific models, rather than relying on the default "unseen probability"

Interpolation: interpolate lower-order models with higher-order models, e.g. for trigrams:
$\hat{P}(w_3 \mid w_1, w_2) = \lambda_1 P(w_3 \mid w_1, w_2) + \lambda_2 P(w_3 \mid w_2) + \lambda_3 P(w_3)$

Constraints: $0 \le \lambda_i \le 1$; $\sum_i \lambda_i = 1$.

More generally, each $\lambda_i$ can be a function of the context (e.g. of $w_1, w_2$).
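A minimal sketch of the interpolation formula; the component models are passed in as functions, and the fixed weights are illustrative (in practice the $\lambda$'s are estimated, e.g. on held-out data).

```python
def interpolated_prob(w1, w2, w3, p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    """lambda1*P(w3|w1,w2) + lambda2*P(w3|w2) + lambda3*P(w3), weights summing to 1."""
    l1, l2, l3 = lambdas
    return l1 * p_tri(w1, w2, w3) + l2 * p_bi(w2, w3) + l3 * p_uni(w3)
```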
Combination (2a): Back-Off

Basic idea of back-off is not to smooth together different order models, but to choose the most appropriate model order:

$P_{\text{BO}}(w_3 \mid w_1, w_2) = \begin{cases} P^*(w_3 \mid w_1, w_2) & \text{if } c(w_1, w_2, w_3) > 0 \\ \alpha(w_1, w_2)\, P^*(w_3 \mid w_2) & \text{if } c(w_1, w_2, w_3) = 0 \text{ and } c(w_2, w_3) > 0 \\ \alpha(w_2)\, P(w_3) & \text{otherwise} \end{cases}$

$P^*$ is the discounted probability estimate, computed using e.g. Good-Turing or Absolute Discounting

The probability mass that discounting would assign to unseen trigrams is partitioned among the (backed-off) bigram probabilities
Combination (2b): Back-Off

$\alpha(w_1, w_2)$ and $\alpha(w_2)$ are chosen so that the probabilities normalize to 1

Normalization coefficients may be precomputed

This is a recursion: compute $P_{\text{BO}}(w_3 \mid w_2)$ in a similar fashion.

Back-off smoothing path for trigrams: $P(w_3 \mid w_1, w_2) \rightarrow P(w_3 \mid w_2) \rightarrow P(w_3)$
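A simplified sketch of the back-off decision, with the discounted estimates, the counts and the normalization weights $\alpha$ all passed in as functions (computing the $\alpha$'s exactly, as in Katz (1987), is the part omitted here):

```python
def backoff_prob(w1, w2, w3, c3, c2, p_star_tri, p_star_bi, p_uni, alpha_tri, alpha_bi):
    """Use the most specific model whose n-gram was actually observed in training."""
    if c3(w1, w2, w3) > 0:
        return p_star_tri(w1, w2, w3)                 # discounted trigram estimate
    if c2(w2, w3) > 0:
        return alpha_tri(w1, w2) * p_star_bi(w2, w3)  # back off to the bigram
    return alpha_bi(w2) * p_uni(w3)                   # back off again to the unigram
```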
Example of Back-Off Trigram (1)

"HE'LL PLAY THE WORLD NUMBER ONE PETE SAMPRAS"

log P(HE'LL | # #) = -4.23 (P = 0.000059)
log P(PLAY | # HE'LL) = -2.18 (P = 0.0066)
log P(THE | HE'LL PLAY) = -1.53 (P = 0.029)
log P(WORLD | PLAY THE) = -2.65 (P = 0.0023)
log P(NUMBER | THE WORLD) = -3.48 (P = 0.00033)
log P(ONE | WORLD NUMBER) = -0.43 (P = 0.37)
log P(PETE | NUMBER ONE) = -3.10 (P = 0.00079)
log P(SAMPRAS | ONE PETE) = -0.027 (P = 0.94)

sum log prob = -17.63    Test Set Perplexity = 159.65
Language Model Smoothing

Avoid zero probability estimates by discounting

Use the appropriate order model by interpolation or back-off

Most usual approach in large-vocabulary speech recognition: trigram language model, Good-Turing discounting, back-off combination
Evaluating Language Models

Back to the Shannon Game (played by a language model this time) — the best language model will average the fewest guesses over a large piece of text

Compute the average log probability ($LP$) of the language model's probability estimates ($P(w_i \mid w_{i-2}, w_{i-1})$) over a test text (with $n$ words):
$LP = \dfrac{1}{n} \sum_{i=1}^{n} \log_2 P(w_i \mid w_{i-2}, w_{i-1})$

Perplexity ($PP$): the average branching factor:
$PP = 2^{-LP}$
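A direct sketch of this evaluation (the trigram probability function and `<s>` padding are placeholders):

```python
import math

def perplexity(test_words, trigram_prob):
    """LP = (1/n) * sum_i log2 P(w_i | w_{i-2}, w_{i-1});  PP = 2 ** (-LP)."""
    padded = ["<s>", "<s>"] + test_words
    n = len(test_words)
    LP = sum(math.log2(trigram_prob(padded[i - 2], padded[i - 1], padded[i]))
             for i in range(2, len(padded))) / n
    return 2 ** (-LP)
```

Applied to the eight probabilities in the back-off example above, the same calculation (with base-10 logs) reproduces the quoted test-set perplexity of about 160.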
References

C. D. Manning and H. Schütze (1999). Foundations of Statistical Natural Language Processing, MIT Press, chapter 6.

F. Jelinek (1997). Statistical Methods for Speech Recognition, MIT Press, chapters 4, 8, 15.

K. W. Church and W. A. Gale (1991). "A comparison of the enhanced Good-Turing and deleted estimation methods for estimating probabilities of English bigrams", Computer Speech and Language, 5, 19–54.

H. Ney, U. Essen and R. Kneser (1995). "On the estimation of small probabilities by leaving one out", IEEE Trans. Pattern Analysis and Machine Intelligence, 17, 1202–1212.

S. Katz (1987). "Estimation of probabilities from sparse data for the language model component of a speech recognizer", IEEE Trans. Acoustics, Speech and Signal Processing, 35, 400–401.
Summary

Statistical language modelling: estimates of the probability of a sequence of words, $P(w_1, w_2, \ldots, w_n)$

$n$-gram models only consider word sequences of length $n$

Estimating $n$-gram probabilities requires smoothing to avoid zero probabilities

Discounting: assign some probability to unseen events

Back-off modelling: choose the model of the appropriate order (ask the training data).