Statistical Language Modelling

Statistical framework: the recognizer chooses the word sequence W that maximizes

    W* = arg max_W p(X | W) P(W)

where p(X | W) is the acoustic model and P(W) is the language model.

Language Model

P(W) — the prior probability of the word sequence W = w_1, w_2, ..., w_M.

The LM does not depend on the acoustics — it can be obtained from text data alone.

n-grams (1)

Without assuming anything, the chain rule gives:

    P(w_1, w_2, ..., w_M) = Π_{m=1}^{M} P(w_m | w_1, ..., w_{m-1})

Map the context w_1, ..., w_{m-1} into a finite set of equivalence classes.

An n-gram model assumes the equivalence classes are the previous n-1 words:

    P(w_m | w_1, ..., w_{m-1}) ≈ P(w_m | w_{m-n+1}, ..., w_{m-1})

For a trigram (n = 3):

    P(w_m | w_1, ..., w_{m-1}) ≈ P(w_m | w_{m-2}, w_{m-1})

n-grams (2)

How large should n be?

Larger n: much better context, e.g. "He'll play the world number one Pete Sampras"

Smaller n: the total number of possible n-grams scales exponentially with n. For a vocabulary of V words:

Model      Num parameters
unigram    V
bigram     V^2
trigram    V^3
fourgram   V^4

Building n-grams

Remove punctuation and normalize the text.

Finite vocabulary of, say, V words: map out-of-vocabulary (OOV) words to an unknown-word symbol.

Estimate conditional probabilities as a ratio of joint probabilities, e.g. for a trigram:

    P(w_3 | w_1, w_2) = P(w_1, w_2, w_3) / P(w_1, w_2)
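To make the preprocessing and counting steps concrete, here is a minimal Python sketch (not from the lecture): it assumes whitespace-tokenizable sentences, a set `vocab` for the finite vocabulary, and a hypothetical `<unk>` symbol for OOV words.

```python
from collections import Counter
import re

UNK = "<unk>"  # hypothetical unknown-word symbol

def normalize(line):
    """Lower-case, replace punctuation with spaces, and tokenize on whitespace."""
    return re.sub(r"[^\w\s']", " ", line.lower()).split()

def count_ngrams(sentences, vocab, n=3):
    """Count n-grams and their (n-1)-word histories, mapping OOV words to UNK."""
    ngram_counts, history_counts = Counter(), Counter()
    for sent in sentences:
        words = [w if w in vocab else UNK for w in normalize(sent)]
        words = ["<s>"] * (n - 1) + words + ["</s>"]   # sentence-boundary padding
        for i in range(n - 1, len(words)):
            history = tuple(words[i - n + 1:i])
            ngram_counts[history + (words[i],)] += 1
            history_counts[history] += 1
    return ngram_counts, history_counts
```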

Maximum Likelihood Estimation

Use relative frequency as a probability estimate:

    P(w_1, w_2, w_3) ≈ c(w_1, w_2, w_3) / N

where c(w_1, w_2, w_3) is the count of the word sequence w_1 w_2 w_3 in the training data and N is the total number of training instances. So,

    P(w_3 | w_1, w_2) = c(w_1, w_2, w_3) / c(w_1, w_2)

This is the maximum likelihood (ML) solution: the model that maximizes the probability of the training data.
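A hypothetical helper that turns counts such as those from the previous sketch into ML estimates; returning 0.0 for unseen events makes the zero-probability problem of the next slide explicit.

```python
def ml_prob(word, history, ngram_counts, history_counts):
    """Maximum-likelihood estimate P(word | history) = c(history, word) / c(history).

    ngram_counts / history_counts are Counters such as those built above; an unseen
    history or n-gram gives 0.0, which is exactly the data-sparseness problem.
    """
    if history_counts[history] == 0:
        return 0.0
    return ngram_counts[history + (word,)] / history_counts[history]
```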

Data Sparseness

Language modelling suffers from data sparsity: e.g. trigrams over a large word vocabulary. A large training set may contain many millions of words — yet there are about a million times as many possible trigrams...

Data sparsity + maximum likelihood = zero probabilities! If a word sequence (w_1, w_2, w_3) is not observed, then P(w_3 | w_1, w_2) is set to 0 by maximum likelihood; this means any word sequence including the sub-sequence (w_1, w_2, w_3) will have probability 0.

With ML-estimated n-grams, any word sequence that was not present in the training data cannot be recognized.

Smoothing

An alternative solution is to smooth the probability estimates so that no word sequence is given a probability of 0.

Discounting: adjust the probability estimators so that a relative frequency of 0 in the training data does not imply a probability of 0 (estimates are no longer simply relative counts).

Model combination: combine models (unigram, bigram, trigram, ...) in such a way that we use the most precise model available: interpolation and back-off.

Comparison of Discounting Methods

Church and Gale (1991) compared various discounting strategies on a newswire corpus split into a training set of 22 million words and a test set of 22 million words.

... different words in total; ... different words in the training set.

Discounting: Notation

N    is the total number of training instances
B    is the total number of possible bigrams
N_r  is the number of bigrams that occur r times in the training corpus
c(w_1, w_2)  is the training-corpus count of the bigram (w_1, w_2)
r*   is the discounted count for bigrams occurring r times
P(w_1, w_2)  is the probability estimate for (w_1, w_2)

Comparison of Discounting Methods for Bigrams

[Table of empirical and discounted bigram counts from Church and Gale (1991) — not reproduced here.]

Discounting (1): Laplace

Oldest solution (Laplace, 1814). Add 1 to every count:

    P_Lap(w_1, w_2) = (c(w_1, w_2) + 1) / (N + B)

On the Church and Gale data: 99.96% of the probability mass is assigned to unseen bigrams!!
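A one-line sketch of the add-one estimate using the notation of the slides (the function and argument names are illustrative):

```python
def laplace_prob(bigram, counts, N, B):
    """Add-one (Laplace) estimate: P(w1, w2) = (c(w1, w2) + 1) / (N + B).

    counts : Counter mapping bigram -> training count
    N      : total number of training bigram instances
    B      : number of possible bigrams (V * V for a V-word vocabulary)
    """
    return (counts[bigram] + 1) / (N + B)
```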

Discounting (2): Validating on Held-Out Data

We can empirically validate our models using a held-out set of data (the 22 million word test set in Church and Gale (1991)).

A language model is good if the probabilities estimated on training data are close to those that would have been estimated on held-out data.

T_r: the total number of times that all bigrams with frequency r in the training data appeared in the held-out data.

The average held-out frequency of bigrams with training frequency r is T_r / N_r.

The validation estimate for a bigram with training count r is:

    P_ho(w_1, w_2) = T_r / (N_r N)        where c(w_1, w_2) = r
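A possible implementation of the held-out estimator, assuming bigram Counters for the training and held-out sets; it returns the probability assigned to a single bigram with training count r.

```python
from collections import Counter, defaultdict

def held_out_estimates(train_counts, heldout_counts):
    """Held-out estimator: P_ho = T_r / (N_r * N) for a bigram with training count r.

    train_counts, heldout_counts : Counters of bigram -> count on each data set.
    Returns a dict mapping r -> the probability assigned to ONE such bigram.
    """
    N = sum(train_counts.values())           # total training instances (notation slide)
    N_r = Counter(train_counts.values())     # number of bigrams seen r times in training
    T_r = defaultdict(int)                   # held-out occurrences of those bigrams
    for bigram, r in train_counts.items():
        T_r[r] += heldout_counts[bigram]
    return {r: T_r[r] / (N_r[r] * N) for r in N_r}
```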

Discounting (3): Deleted Estimation

Similar idea to the validation estimate above (but that is a cheat, as we use the held-out data as a test set!)

Deleted estimation (cross-validation):
1. Split the training data into K sections.
2. For each section s: hold out section s; compute the counts N_r^(s) from the remaining K-1 sections and T_r^(s) from the held-out section.
3. Estimate probabilities by averaging over all sections:

    P_del(w_1, w_2) = Σ_s T_r^(s) / (N Σ_s N_r^(s))        where c(w_1, w_2) = r

Works reliably well in practice (if you have enough data).
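A sketch of deleted estimation along the same lines, assuming the corpus has already been split into sections, each represented by its own bigram Counter:

```python
from collections import Counter, defaultdict

def deleted_estimates(section_counts):
    """Deleted estimation: each section in turn is held out while the remaining
    sections supply the training counts; T_r and N_r are accumulated over all folds.

    section_counts : list of bigram Counters, one per section of the corpus.
    Returns a dict mapping training count r -> probability for one bigram with that count.
    """
    total = Counter()
    for c in section_counts:
        total.update(c)
    N = sum(total.values())                          # total number of training instances

    T_r, N_r = defaultdict(int), defaultdict(int)
    for heldout in section_counts:
        train = total - heldout                      # counts from the other sections
        for r, num_bigrams in Counter(train.values()).items():
            N_r[r] += num_bigrams
        for bigram, r in train.items():
            T_r[r] += heldout[bigram]
    return {r: T_r[r] / (N * N_r[r]) for r in N_r}
```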

Discounting (4a): Good-Turing

The limit of deleted estimation is "leave-one-out": one bigram in each section.

It can be shown that this form of discounting results in the following adjusted frequency count:

    r* = (r + 1) N_{r+1} / N_r

This is known as Good-Turing discounting. (Here k is a threshold on r — see the next slide.)

Note that for unseen bigrams (r = 0), the adjusted count is r* = N_1 / N_0.

Discounting (4b): Good-Turing

The discounted probabilities are thus:

    P_GT(w_1, w_2) = r* / N        where c(w_1, w_2) = r

The formula is only applied when r ≤ k, where k is small (e.g. 10); larger counts are treated as reliable.

Need to renormalize to ensure everything sums to 1.
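A sketch of the Good-Turing count adjustment, assuming a bigram Counter; it skips counts where N_{r+1} = 0 rather than smoothing the counts-of-counts, and leaves out the renormalization mentioned above:

```python
from collections import Counter

def good_turing_counts(counts, k=10):
    """Good-Turing adjusted counts r* = (r + 1) * N_{r+1} / N_r, applied for r <= k only.

    counts : Counter mapping bigram -> training count r.
    Returns a dict mapping each observed count r (r <= k) to its adjusted count r*.
    """
    N_r = Counter(counts.values())                 # counts of counts
    adjusted = {}
    for r in sorted(N_r):
        if r > k:
            break
        if N_r.get(r + 1, 0) == 0:                 # raw formula breaks if N_{r+1} = 0;
            continue                               # in practice the N_r are smoothed first
        adjusted[r] = (r + 1) * N_r[r + 1] / N_r[r]
    return adjusted
```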

Discounting (5): Absolute Discounting

Absolute discounting subtracts a constant b from each non-zero count, and the "leftover" probability is distributed over the unseen events:

    P_abs(w_1, w_2) = (r - b) / N                    if r > 0
    P_abs(w_1, w_2) = (B - N_0) b / (N_0 N)          if r = 0

Estimate b from held-out data — a value of around ... would work well in the example.
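A sketch of absolute discounting under the slide's notation; the default discount of 0.75 is only an illustrative value, not one taken from the lecture:

```python
def absolute_discount_prob(bigram, counts, B, b=0.75):
    """Absolute discounting: subtract a constant b from every non-zero count and
    share the freed probability mass equally among the unseen bigrams.

    counts : Counter of bigram -> count;  B : number of possible bigrams.
    b      : discount constant (0 < b < 1), tuned on held-out data in practice;
             the 0.75 default here is purely illustrative.
    """
    N = sum(counts.values())          # total training instances
    seen = len(counts)                # B - N_0: number of distinct observed bigrams
    N_0 = B - seen                    # number of unseen bigrams
    r = counts[bigram]
    if r > 0:
        return (r - b) / N
    return seen * b / (N_0 * N)
```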

Combination (1): Interpolation

As well as adjusting counts by discounting, it is better to use less-specific models, rather than relying on the default "unseen probability".

Interpolation: interpolate lower-order models with higher-order models, e.g. for trigrams:

    P_int(w_m | w_{m-2}, w_{m-1}) = λ_3 P(w_m | w_{m-2}, w_{m-1}) + λ_2 P(w_m | w_{m-1}) + λ_1 P(w_m)

Constraints: λ_i ≥ 0; λ_1 + λ_2 + λ_3 = 1.

More generally, the λ_i can be functions of the context (e.g. of w_{m-2}, w_{m-1}).
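A minimal sketch of linear interpolation; the weights shown are arbitrary placeholders, since in practice the λ_i are tuned on held-out data (e.g. with EM):

```python
def interpolated_prob(word, w1, w2, p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    """Linear interpolation of unigram, bigram and trigram estimates.

    p_uni, p_bi, p_tri : functions returning P(word), P(word | w2), P(word | w1, w2).
    lambdas            : placeholder weights (non-negative, summing to 1).
    """
    l1, l2, l3 = lambdas
    return l1 * p_uni(word) + l2 * p_bi(word, w2) + l3 * p_tri(word, w1, w2)
```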

Combination (2a): Back-Off

The basic idea of back-off is not to smooth different-order models together, but to choose the most appropriate model order:

    P_bo(w_3 | w_1, w_2) = P̂(w_3 | w_1, w_2)              if c(w_1, w_2, w_3) > 0
    P_bo(w_3 | w_1, w_2) = α(w_1, w_2) P_bo(w_3 | w_2)     if c(w_1, w_2, w_3) = 0 and c(w_1, w_2) > 0
    P_bo(w_3 | w_1, w_2) = P_bo(w_3 | w_2)                 if c(w_1, w_2) = 0

P̂ is the discounted probability estimate, computed using e.g. Good-Turing or absolute discounting.

The probability mass that discounting would assign to unseen trigrams is partitioned according to the (backed-off) bigram probabilities.

Combination (2b): Back-Off

The back-off weights (α(w_1, w_2) and α(w_2)) are chosen so that the probabilities normalize to 1.

The normalization coefficients may be precomputed.

This is a recursion: compute P_bo(w_3 | w_2) in a similar fashion, backing off to the unigram P(w_3).

Back-off smoothing path for trigrams:

    P(w_3 | w_1, w_2)  →  P(w_3 | w_2)  →  P(w_3)
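A sketch of the back-off control flow above; the discounted estimates and back-off weights are passed in as functions, since computing them is exactly the discounting and normalization machinery of the earlier slides (all names are illustrative):

```python
def backoff_trigram_prob(w3, w1, w2, tri_counts, bi_counts,
                         p_hat_tri, p_hat_bi, alpha_tri, alpha_bi, p_uni):
    """Back-off for trigrams, mirroring the three cases on the slide.

    tri_counts / bi_counts : Counters of trigram / bigram counts
    p_hat_*                : discounted estimates (e.g. Good-Turing)
    alpha_*                : precomputed back-off weights that make things sum to 1
    """
    def p_bo_bigram(w3, w2):
        if bi_counts[(w2, w3)] > 0:
            return p_hat_bi(w3, w2)
        return alpha_bi(w2) * p_uni(w3)

    if tri_counts[(w1, w2, w3)] > 0:
        return p_hat_tri(w3, w1, w2)
    if bi_counts[(w1, w2)] > 0:
        return alpha_tri(w1, w2) * p_bo_bigram(w3, w2)
    return p_bo_bigram(w3, w2)
```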

Example of Back-Off Trigram (1)

"HE'LL PLAY THE WORLD NUMBER ONE PETE SAMPRAS"

log P(HE'LL | # #)         = -4.23    (P = 0.000059)
log P(PLAY | # HE'LL)      = -2.18    (P = 0.0066)
log P(THE | HE'LL PLAY)    = -1.53    (P = 0.029)
log P(WORLD | PLAY THE)    = -2.64    (P = 0.0023)
log P(NUMBER | THE WORLD)  = -3.48    (P = 0.00033)
log P(ONE | WORLD NUMBER)  = -0.43    (P = 0.37)
log P(PETE | NUMBER ONE)   = -3.10    (P = 0.00079)
log P(SAMPRAS | ONE PETE)  = -0.027   (P = 0.94)

sum log prob = -17.63;  test set perplexity = 159.65
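The arithmetic on this slide can be checked directly (the -2.64 entry is reconstructed from P = 0.0023; the logs are base 10):

```python
# Check the slide's perplexity from the per-word base-10 log probabilities.
logprobs = [-4.23, -2.18, -1.53, -2.64, -3.48, -0.43, -3.10, -0.027]
avg = sum(logprobs) / len(logprobs)   # average log10 probability, about -2.20
print(10 ** (-avg))                   # about 160, matching the slide's 159.65 up to rounding
```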

Language Model Smoothing

Avoid zero probability estimates by discounting.

Use the appropriate-order model by interpolation or back-off.

The most usual approach in large-vocabulary speech recognition: a trigram language model, Good-Turing discounting, back-off combination.

Evaluating Language Models

Back to the Shannon game (played by a language model this time) — the best language model will average the fewest guesses over a large piece of text.

Compute the average log probability (LP) of the language model's probability estimates P̂(w_m | w_1, ..., w_{m-1}) over a test text with M words:

    LP = (1/M) Σ_{m=1}^{M} log P̂(w_m | w_1, ..., w_{m-1})

Perplexity (PP): the average branching factor:

    PP = b^{-LP} = P̂(w_1, ..., w_M)^{-1/M}

where b is the base of the logarithm (base 10 in the example above).
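A small, self-contained perplexity function, assuming a smoothed `model_prob(word, history)` that never returns zero (names are illustrative):

```python
import math

def perplexity(model_prob, test_words, n=3):
    """Test-set perplexity PP = prod_m P(w_m | history)^(-1/M).

    model_prob : function (word, history_tuple) -> probability (must be > 0, i.e. smoothed)
    test_words : list of test tokens; the history is the previous n-1 words, padded with "<s>".
    """
    padded = ["<s>"] * (n - 1) + test_words
    log_sum, M = 0.0, len(test_words)
    for i in range(n - 1, len(padded)):
        log_sum += math.log(model_prob(padded[i], tuple(padded[i - n + 1:i])))
    return math.exp(-log_sum / M)
```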

References

C. D. Manning and H. Schütze (1999). Foundations of Statistical Natural Language Processing, MIT Press, chapter 6.

F. Jelinek (1997). Statistical Methods for Speech Recognition, MIT Press, chapters 4, 8, 15.

K. W. Church and W. A. Gale (1991). "A comparison of the enhanced Good-Turing and deleted estimation methods for estimating probabilities of English bigrams", Computer Speech and Language, 5, 19–54.

H. Ney, U. Essen and R. Kneser (1995). "On the estimation of small probabilities by leaving one out", IEEE Trans. Pattern Analysis and Machine Intelligence, 17, 1202–1212.

S. Katz (1987). "Estimation of probabilities from sparse data for the language model component of a speech recognizer", IEEE Trans. Acoustics, Speech and Signal Processing, 35, 400–401.

Summary

Statistical language modelling: estimates of the probability of a sequence of words, P(w_1, ..., w_M).

n-gram models only consider word sequences of length n.

Estimating n-gram probabilities requires smoothing to avoid zero probabilities.

Discounting: assign some probability to unseen events.

Back-off modelling: choose the model of the appropriate order (ask the training data).
