Statistical Framework

$\hat{W} = \arg\max_W P(W \mid O) = \arg\max_W \underbrace{p(O \mid W)}_{\text{acoustic model}} \; \underbrace{P(W)}_{\text{language model}}$

Language Model $P(W)$ — prior probability of word sequence $W = w_1, w_2, \ldots, w_n$

LM does not depend on acoustics — can be obtained from text data
n-grams (1)

Without assuming anything:
$P(w_1, w_2, \ldots, w_n) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_n \mid w_1, \ldots, w_{n-1})$

Map context into a finite set of equivalence classes

$n$-gram model assumes equivalence classes are the previous $n-1$ words:
$P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$

trigram: $P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-2}, w_{i-1})$
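As a small illustration, here is a minimal Python sketch of the trigram assumption: a sentence probability is computed as a product of conditional probabilities over the previous two words only. The probability function `trigram_prob` and the padding symbol `<s>` are illustrative placeholders, not part of the lecture.

```python
import math

def sentence_logprob(words, trigram_prob):
    """Trigram approximation: log P(w_1..w_n) ~= sum_i log P(w_i | w_{i-2}, w_{i-1})."""
    padded = ["<s>", "<s>"] + words          # pad the start-of-sentence history
    logp = 0.0
    for i in range(2, len(padded)):
        logp += math.log10(trigram_prob(padded[i - 2], padded[i - 1], padded[i]))
    return logp
```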
n-grams (2)

How large should $n$ be?

Larger $n$: much better context, e.g. "He'll play the world number one Pete Sampras"

Smaller $n$: far fewer parameters to estimate, since the total number of $n$-grams scales exponentially with $n$:

Model      Number of parameters
unigram    $V$
bigram     $V^2$
trigram    $V^3$
four-gram  $V^4$

(for a vocabulary of $V$ words)
Building n-grams

Remove punctuation and normalize text

Finite vocabulary (e.g. a fixed word list): map out-of-vocabulary (OOV) words to an unknown symbol

Estimate conditional probabilities as a ratio of joint probabilities:
$P(w_3 \mid w_1, w_2) = \dfrac{P(w_1, w_2, w_3)}{P(w_1, w_2)}$
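A rough sketch of the text preparation step described above. The vocabulary, the `<UNK>` symbol, and the punctuation handling are illustrative choices rather than a prescribed pipeline.

```python
import string

def normalize(line, vocab):
    """Strip punctuation, uppercase, and map OOV words to an unknown symbol."""
    line = line.translate(str.maketrans("", "", string.punctuation))
    words = line.upper().split()
    return [w if w in vocab else "<UNK>" for w in words]

vocab = {"PLAY", "THE", "WORLD", "NUMBER", "ONE"}     # toy vocabulary
print(normalize("Play the world, number one: Pete Sampras!", vocab))
# ['PLAY', 'THE', 'WORLD', 'NUMBER', 'ONE', '<UNK>', '<UNK>']
```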
Maximum Likelihood Estimation

Use relative frequency as a probability estimate:
$P(w_1, w_2, w_3) = \dfrac{c(w_1, w_2, w_3)}{N}$
where $c(w_1, \ldots, w_n)$ is the count of word sequence $w_1 \ldots w_n$ in the training data. So,
$P(w_3 \mid w_1, w_2) = \dfrac{c(w_1, w_2, w_3)}{c(w_1, w_2)}$

This is the maximum likelihood (ML) solution: the model that maximizes the probability of the training data.
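A minimal ML estimator built directly from the count ratio above (the corpus handling and the `<s>` padding are illustrative):

```python
from collections import Counter

def mle_trigram_model(sentences):
    """Relative-frequency estimates: P(w3 | w1, w2) = c(w1, w2, w3) / c(w1, w2)."""
    tri, bi = Counter(), Counter()
    for words in sentences:
        padded = ["<s>", "<s>"] + words
        for i in range(2, len(padded)):
            tri[tuple(padded[i - 2:i + 1])] += 1
            bi[tuple(padded[i - 2:i])] += 1
    def prob(w1, w2, w3):
        return tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    return prob

p = mle_trigram_model([["HE'LL", "PLAY", "THE", "WORLD"]])
print(p("HE'LL", "PLAY", "THE"))   # 1.0
print(p("PLAY", "THE", "NUMBER"))  # 0.0: an unseen trigram gets zero probability
```

The zero in the last line is exactly the sparsity problem discussed next.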
Data Sparseness

Language modelling suffers from data sparsity: e.g. trigrams over a large word vocabulary. A large training set may contain many millions of words — there are 1 million times as many possible trigrams...

Data sparsity + maximum likelihood = zero probabilities! If word sequence $(w_1, w_2, w_3)$ is not observed, then $P(w_3 \mid w_1, w_2)$ is set to $0$ by maximum likelihood; this means any word sequence including that sub-sequence will have probability $0$.

With ML-estimated $n$-grams, any word sequence that was not present in the training data cannot be recognized.
Smoothing

An alternative solution is to smooth the probability estimates so that no word sequence is given a probability of $0$.

Discounting: adjust our probability estimators so that a zero relative frequency in the training data does not imply a zero probability; estimates are no longer simply relative counts.

Model Combination: combine models (unigram, bigram, trigram, ...) in such a way that we use the most precise model available: interpolation and back-off.
Comparison of Discounting Methods

Church and Gale (1991) compared various discounting strategies on a newswire corpus split into a training set of 22 million words and a test set of 22 million words. The corpus contains many different word types in total, not all of which occur in the training set.
Discounting: Notation

$N$ is the total number of training instances
$B$ is the total number of possible bigrams
$N_r$ is the number of bigrams that occur $r$ times in the training corpus
$c(w_1, w_2)$ is the training corpus count of bigram $(w_1, w_2)$
$r^*$ is the discounted count for bigrams occurring $r$ times
$P(w_1, w_2)$ is the probability estimate for $(w_1, w_2)$
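To make the notation concrete, a small sketch that computes $N$, $B$, $N_r$ and the bigram counts from a toy corpus (the corpus and variable names are made up for illustration):

```python
from collections import Counter

words = "the cat sat on the mat the cat slept".split()
bigrams = list(zip(words, words[1:]))

c = Counter(bigrams)              # c(w1, w2): training count of each bigram
N = len(bigrams)                  # N: total number of training instances (8)
B = len(set(words)) ** 2          # B: total number of possible bigrams (36)
N_r = Counter(c.values())         # N_r: number of bigrams occurring r times

print(N, B, dict(N_r))            # 8 36 {1: 6, 2: 1}
```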
Comparison of Discounting Methods for Bigrams

(Table from Church and Gale (1991) comparing the adjusted bigram counts produced by each discounting method.)
Discounting (1): Laplace

Oldest solution (Laplace, 1814). Add 1 to every count:
$P_{\text{Lap}}(w_1, w_2) = \dfrac{c(w_1, w_2) + 1}{N + B}$

On the Church and Gale data: 99.96% of the probability mass is assigned to unseen bigrams!!
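A one-line sketch of the add-one estimate, using the notation above (the toy values of $N$ and $B$ carry over from the earlier sketch):

```python
def laplace_prob(count, N, B):
    """Add-one (Laplace) estimate: P_Lap(w1, w2) = (c(w1, w2) + 1) / (N + B)."""
    return (count + 1) / (N + B)

# With B vastly larger than N, the many unseen bigrams (count = 0) soak up
# almost all of the probability mass, as in the Church and Gale figures.
print(laplace_prob(0, N=8, B=36))   # every unseen bigram still gets 1/44
```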
Discounting (2): Validating on Held-Out Data

We can empirically validate our models using a held-out set of data (the 22 million word test set in Church and Gale (1991))

A language model is good if the probabilities estimated on the training data are close to those that would have been estimated on held-out data

$T_r$: the total number of times that all bigrams with frequency $r$ in the training data appeared in the held-out data

Average held-out frequency of training-frequency-$r$ bigrams is $T_r / N_r$.

The validation estimate for a bigram probability is:
$P_{\text{ho}}(w_1, w_2) = \dfrac{T_r}{N_r \, N} \quad \text{where } c(w_1, w_2) = r$
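A rough sketch of the held-out estimate: for each training frequency $r$, total up how often those bigrams occur in the held-out data and share that mass among the $N_r$ bigram types. The corpus handling is simplified, and the held-out size is used for $N$ (training and held-out sets are the same size in the Church and Gale setup).

```python
from collections import Counter

def held_out_probs(train_bigrams, heldout_bigrams):
    """P_ho(w1, w2) = T_r / (N_r * N) for each bigram with training count r."""
    c_train, c_held = Counter(train_bigrams), Counter(heldout_bigrams)
    N = len(heldout_bigrams)
    N_r, T_r = Counter(), Counter()
    for bg, r in c_train.items():
        N_r[r] += 1                  # N_r: bigram types seen r times in training
        T_r[r] += c_held[bg]         # T_r: their total count in the held-out data
    return {bg: T_r[r] / (N_r[r] * N) for bg, r in c_train.items()}
```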
Discounting (3): Deleted Estimation

Similar idea to the validation estimate above (but that is a cheat, as we use the held-out data as a test set!)

Deleted estimation (cross-validation):
1. Split the training data into $K$ sections
2. For each section $j$: hold out section $j$, compute the counts $N_r^j$ from the remaining $K-1$ sections, and compute $T_r^j$ on the held-out section
3. Estimate probabilities by averaging over all sections:
$P_{\text{del}}(w_1, w_2) = \dfrac{\sum_j T_r^j}{N \sum_j N_r^j} \quad \text{where } c(w_1, w_2) = r$

Works reliably well in practice (if you have enough data)
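A sketch of the common two-section special case ($K = 2$), following the deleted-estimation formula in Manning and Schütze (1999); it returns, for each training frequency $r$, the probability assigned to a bigram seen $r$ times.

```python
from collections import Counter

def deleted_estimates(section_a, section_b):
    """Two-way deleted estimation: pool T_r and N_r from both hold-out directions."""
    c_a, c_b = Counter(section_a), Counter(section_b)
    N = len(section_a) + len(section_b)
    N_r, T_r = Counter(), Counter()
    for c_train, c_held in ((c_a, c_b), (c_b, c_a)):
        for bg, r in c_train.items():
            N_r[r] += 1
            T_r[r] += c_held[bg]
    return {r: T_r[r] / (N * N_r[r]) for r in N_r}
```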
Discounting (4a): Good-Turing

The limit of deleted estimation is "leave-one-out": one bigram in each section

It can be shown that this form of discounting results in the following adjusted frequency count:
$r^* = (r + 1)\, \dfrac{N_{r+1}}{N_r}$

This is known as Good-Turing discounting. In practice a threshold $k$ is placed on $r$ (next slide). Note that the adjusted count $r^*$ is typically smaller than $r$.
Discounting (4b): Good-Turing

The discounted probabilities are thus:
$P_{\text{GT}}(w_1, w_2) = \dfrac{r^*}{N} \quad \text{where } c(w_1, w_2) = r$

Formula only applies when $r \le k$, where $k$ is small (e.g., 10)

Need to renormalize to ensure everything sums to 1
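A sketch of the Good-Turing adjustment with a small threshold $k$, as described above; renormalization is left out, and the fallback for large or problematic counts is an illustrative choice.

```python
from collections import Counter

def good_turing_counts(bigram_counts, k=10):
    """r* = (r + 1) * N_{r+1} / N_r for small r; larger counts are left unchanged."""
    N_r = Counter(bigram_counts.values())
    adjusted = {}
    for bg, r in bigram_counts.items():
        if r <= k and N_r[r + 1] > 0:
            adjusted[bg] = (r + 1) * N_r[r + 1] / N_r[r]
        else:
            adjusted[bg] = float(r)
    return adjusted            # P_GT(w1, w2) = r* / N, then renormalize
```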
Discounting (5): Absolute Discounting

Absolute discounting subtracts a constant $D$ from each non-zero count, and the "leftover" probability is distributed over unseen events:

$P_{\text{abs}}(w_1, w_2) = \begin{cases} \dfrac{r - D}{N} & \text{if } r > 0 \\[1ex] \dfrac{(B - N_0)\, D}{N_0\, N} & \text{if } r = 0 \end{cases}$

Estimate $D$ from held-out data — an appropriately chosen value (between 0 and 1) would work well in the example.
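A direct transcription of the two cases above (values of $D$, $N$, $B$ and $N_0$ are supplied by the caller; nothing here is specific to the lecture's example):

```python
def absolute_discount_prob(r, D, N, B, N_0):
    """Seen bigrams lose a constant D; the freed mass is split over the N_0 unseen ones.
    (B - N_0 is the number of distinct bigrams actually seen in training.)"""
    if r > 0:
        return (r - D) / N
    return (B - N_0) * D / (N_0 * N)
```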
Combination (1): Interpolation

As well as adjusting counts by discounting, it is better to use less-specific models, rather than relying on the default "unseen probability"

Interpolation: interpolate lower-order models with higher-order models, e.g. for trigrams:
$\hat{P}(w_3 \mid w_1, w_2) = \lambda_1 P(w_3 \mid w_1, w_2) + \lambda_2 P(w_3 \mid w_2) + \lambda_3 P(w_3)$

Constraints: $0 \le \lambda_i \le 1$; $\sum_i \lambda_i = 1$.

More generally, each $\lambda_i$ can be a function of the context (e.g. of $w_1, w_2$).
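A minimal sketch of the interpolation formula; the component models are passed in as functions, and the fixed weights are illustrative (in practice the $\lambda$'s are estimated, e.g. on held-out data).

```python
def interpolated_prob(w1, w2, w3, p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    """lambda1*P(w3|w1,w2) + lambda2*P(w3|w2) + lambda3*P(w3), weights summing to 1."""
    l1, l2, l3 = lambdas
    return l1 * p_tri(w1, w2, w3) + l2 * p_bi(w2, w3) + l3 * p_uni(w3)
```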
Combination (2a): Back-Off

Basic idea of back-off is not to smooth together different order models, but to choose the most appropriate model order:

$P_{\text{BO}}(w_3 \mid w_1, w_2) = \begin{cases} P^*(w_3 \mid w_1, w_2) & \text{if } c(w_1, w_2, w_3) > 0 \\ \alpha(w_1, w_2)\, P^*(w_3 \mid w_2) & \text{if } c(w_1, w_2, w_3) = 0 \text{ and } c(w_2, w_3) > 0 \\ \alpha(w_2)\, P(w_3) & \text{otherwise} \end{cases}$

$P^*$ is the discounted probability estimate, computed using e.g. Good-Turing or Absolute Discounting

The probability mass that discounting would assign to unseen trigrams is partitioned among the (backed-off) bigram probabilities
Combination (2b): Back-Off

$\alpha(w_1, w_2)$ and $\alpha(w_2)$ are chosen so that the probabilities normalize to 1

Normalization coefficients may be precomputed

This is a recursion: compute $P_{\text{BO}}(w_3 \mid w_2)$ in a similar fashion.

Back-off smoothing path for trigrams: $P(w_3 \mid w_1, w_2) \rightarrow P(w_3 \mid w_2) \rightarrow P(w_3)$
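A simplified sketch of the back-off decision, with the discounted estimates, the counts and the normalization weights $\alpha$ all passed in as functions (computing the $\alpha$'s exactly, as in Katz (1987), is the part omitted here):

```python
def backoff_prob(w1, w2, w3, c3, c2, p_star_tri, p_star_bi, p_uni, alpha_tri, alpha_bi):
    """Use the most specific model whose n-gram was actually observed in training."""
    if c3(w1, w2, w3) > 0:
        return p_star_tri(w1, w2, w3)                 # discounted trigram estimate
    if c2(w2, w3) > 0:
        return alpha_tri(w1, w2) * p_star_bi(w2, w3)  # back off to the bigram
    return alpha_bi(w2) * p_uni(w3)                   # back off again to the unigram
```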
Example of Back-Off Trigram (1)

"HE'LL PLAY THE WORLD NUMBER ONE PETE SAMPRAS"

log P(HE'LL | # #) = -4.23 (P = 0.000059)
log P(PLAY | # HE'LL) = -2.18 (P = 0.0066)
log P(THE | HE'LL PLAY) = -1.53 (P = 0.029)
log P(WORLD | PLAY THE) = -2.65 (P = 0.0023)
log P(NUMBER | THE WORLD) = -3.48 (P = 0.00033)
log P(ONE | WORLD NUMBER) = -0.43 (P = 0.37)
log P(PETE | NUMBER ONE) = -3.10 (P = 0.00079)
log P(SAMPRAS | ONE PETE) = -0.027 (P = 0.94)

sum log prob = -17.63    Test Set Perplexity = 159.65
Language Model Smoothing

Avoid zero probability estimates by discounting

Use the appropriate order model by interpolation or back-off

Most usual approach in large-vocabulary speech recognition: trigram language model, Good-Turing discounting, back-off combination
Evaluating Language Models

Back to the Shannon Game (played by a language model this time) — the best language model will average the fewest guesses over a large piece of text

Compute the average log probability ($LP$) of the language model's probability estimates ($P(w_i \mid w_{i-2}, w_{i-1})$) over a test text (with $n$ words):
$LP = \dfrac{1}{n} \sum_{i=1}^{n} \log_2 P(w_i \mid w_{i-2}, w_{i-1})$

Perplexity ($PP$): the average branching factor:
$PP = 2^{-LP}$
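A direct sketch of this evaluation (the trigram probability function and `<s>` padding are placeholders):

```python
import math

def perplexity(test_words, trigram_prob):
    """LP = (1/n) * sum_i log2 P(w_i | w_{i-2}, w_{i-1});  PP = 2 ** (-LP)."""
    padded = ["<s>", "<s>"] + test_words
    n = len(test_words)
    LP = sum(math.log2(trigram_prob(padded[i - 2], padded[i - 1], padded[i]))
             for i in range(2, len(padded))) / n
    return 2 ** (-LP)
```

Applied to the eight probabilities in the back-off example above, the same calculation (with base-10 logs) reproduces the quoted test-set perplexity of about 160.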
References

C. D. Manning and H. Schütze (1999). Foundations of Statistical Natural Language Processing, MIT Press, chapter 6.

F. Jelinek (1997). Statistical Methods for Speech Recognition, MIT Press, chapters 4, 8, 15.

K. W. Church and W. A. Gale (1991). "A comparison of the enhanced Good-Turing and deleted estimation methods for estimating probabilities of English bigrams", Computer Speech and Language, 5, 19–54.

H. Ney, U. Essen and R. Kneser (1995). "On the estimation of small probabilities by leaving one out", IEEE Trans. Pattern Analysis and Machine Intelligence, 17, 1202–1212.

S. Katz (1987). "Estimation of probabilities from sparse data for the language model component of a speech recognizer", IEEE Trans. Acoustics, Speech and Signal Processing, 35, 400–401.
Summary

Statistical language modelling: estimates of the probability of a sequence of words, $P(w_1, w_2, \ldots, w_n)$

$n$-gram models only consider word sequences of length $n$

Estimating $n$-gram probabilities requires smoothing to avoid zero probabilities

Discounting: assign some probability to unseen events

Back-off modelling: choose the model of the appropriate order (ask the training data).