<! $Id: ngram-discount.7,v 1.5 2019/09/09 22:35:37 stolcke Exp $>
<HTML>
<HEAD>
<TITLE>ngram-discount</TITLE>
</HEAD>
<BODY>
<H1>ngram-discount</H1>
<H2> NAME </H2>
ngram-discount - notes on the N-gram smoothing implementations in SRILM
<H2> NOTATION </H2>
<DL>
<DT><I>a</I>_<I>z</I>
<DD>
An N-gram where <I>a</I> is the first word, <I>z</I> is the last word, and "_" represents 0 or more words in between.
<DT><I>p</I>(<I>a</I>_<I>z</I>)
<DD>
The estimated conditional probability of the <I>n</I>th word <I>z</I> given the first <I>n</I>-1 words (<I>a</I>_) of an N-gram.
<DT><I>a</I>_
<DD>
The <I>n</I>-1 word prefix of the N-gram <I>a</I>_<I>z</I>.
<DT>_<I>z</I>
<DD>
The <I>n</I>-1 word suffix of the N-gram <I>a</I>_<I>z</I>.
<DT><I>c</I>(<I>a</I>_<I>z</I>)
<DD>
The count of the N-gram <I>a</I>_<I>z</I> in the training data.
<DT><I>n</I>(*_<I>z</I>)
<DD>
The number of unique N-grams that match a given pattern; "(*)" represents a wildcard matching a single word.
<DT><I>n1</I>, <I>n</I>[1]
<DD>
The number of unique N-grams with count = 1.
</DD>
</DL>
<H2> DESCRIPTION </H2>
<P>
N-gram models try to estimate the probability of a word <I>z</I> in the context of the previous <I>n</I>-1 words (<I>a</I>_), i.e., <I>Pr</I>(<I>z</I>|<I>a</I>_).
We will denote this conditional probability using <I>p</I>(<I>a</I>_<I>z</I>) for convenience.
One way to estimate <I>p</I>(<I>a</I>_<I>z</I>) is to look at the number of times word <I>z</I> has followed the previous <I>n</I>-1 words (<I>a</I>_):
<PRE>

    (1)  <I>p</I>(<I>a</I>_<I>z</I>) = <I>c</I>(<I>a</I>_<I>z</I>)/<I>c</I>(<I>a</I>_)

</PRE>
This is known as the maximum likelihood (ML) estimate.
Unfortunately it does not work very well because it assigns zero probability to N-grams that have not been observed in the training data.
To avoid the zero probabilities, we take some probability mass from the observed N-grams and distribute it to unobserved N-grams.
Such redistribution is known as smoothing or discounting.
<P>
Most existing smoothing algorithms can be described by the following equation:
<PRE>

    (2)  <I>p</I>(<I>a</I>_<I>z</I>) = (<I>c</I>(<I>a</I>_<I>z</I>) > 0) ? <I>f</I>(<I>a</I>_<I>z</I>) : bow(<I>a</I>_) <I>p</I>(_<I>z</I>)

</PRE>
If the N-gram <I>a</I>_<I>z</I> has been observed in the training data, we use the distribution <I>f</I>(<I>a</I>_<I>z</I>).
Typically <I>f</I>(<I>a</I>_<I>z</I>) is discounted to be less than the ML estimate, so we have some leftover probability for the <I>z</I> words unseen in the context (<I>a</I>_).
Different algorithms mainly differ in how they discount the ML estimate to get <I>f</I>(<I>a</I>_<I>z</I>).
<P>
If the N-gram <I>a</I>_<I>z</I> has not been observed in the training data, we use the lower order distribution <I>p</I>(_<I>z</I>).
If the context has never been observed (<I>c</I>(<I>a</I>_) = 0), we can use the lower order distribution directly (bow(<I>a</I>_) = 1).
Otherwise we need to compute a backoff weight (bow) to make sure probabilities are normalized:
<PRE>

    Sum_<I>z</I> <I>p</I>(<I>a</I>_<I>z</I>) = 1

</PRE>
<P>
Let <I>Z</I> be the set of all words in the vocabulary, <I>Z0</I> be the set of all words with <I>c</I>(<I>a</I>_<I>z</I>) = 0, and <I>Z1</I> be the set of all words with <I>c</I>(<I>a</I>_<I>z</I>) > 0.
Given <I>f</I>(<I>a</I>_<I>z</I>), bow(<I>a</I>_) can be determined as follows:
<PRE>

    (3)  Sum_<I>Z</I> <I>p</I>(<I>a</I>_<I>z</I>) = 1
         Sum_<I>Z1</I> <I>f</I>(<I>a</I>_<I>z</I>) + Sum_<I>Z0</I> bow(<I>a</I>_) <I>p</I>(_<I>z</I>) = 1
         bow(<I>a</I>_) = (1 - Sum_<I>Z1</I> <I>f</I>(<I>a</I>_<I>z</I>)) / Sum_<I>Z0</I> <I>p</I>(_<I>z</I>)
                  = (1 - Sum_<I>Z1</I> <I>f</I>(<I>a</I>_<I>z</I>)) / (1 - Sum_<I>Z1</I> <I>p</I>(_<I>z</I>))
                  = (1 - Sum_<I>Z1</I> <I>f</I>(<I>a</I>_<I>z</I>)) / (1 - Sum_<I>Z1</I> <I>f</I>(_<I>z</I>))

</PRE>
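<P>
The following short Python sketch (an illustration only, not SRILM code) computes bow(<I>a</I>_) from equation (3), assuming hypothetical dictionaries that hold <I>f</I>(<I>a</I>_<I>z</I>) and <I>f</I>(_<I>z</I>) for the observed words <I>Z1</I> of one context:
<PRE>
# Hypothetical illustration of Eqn. 3: "f_ctx" maps each observed word z (the set Z1)
# to f(a_z), and "f_lower" maps the same words to f(_z) in the lower order model.
def backoff_weight(f_ctx, f_lower):
    left = 1.0 - sum(f_ctx.values())             # mass reserved for unseen words
    norm = 1.0 - sum(f_lower[z] for z in f_ctx)  # lower order mass of the same words
    return left / norm

# Example: two observed words after the context "a_"
bow = backoff_weight({"b": 0.4, "c": 0.3}, {"b": 0.2, "c": 0.1})
print(bow)   # (1 - 0.7) / (1 - 0.3) = 0.4286...
</PRE>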
<P>
Smoothing is generally done in one of two ways.
Backoff models compute <I>p</I>(<I>a</I>_<I>z</I>) based on the N-gram counts <I>c</I>(<I>a</I>_<I>z</I>) when <I>c</I>(<I>a</I>_<I>z</I>) > 0, and only consider the lower order counts <I>c</I>(_<I>z</I>) when <I>c</I>(<I>a</I>_<I>z</I>) = 0.
Interpolated models take the lower order counts into account when <I>c</I>(<I>a</I>_<I>z</I>) > 0 as well.
A common way to express an interpolated model is:
<PRE>

    (4)  <I>p</I>(<I>a</I>_<I>z</I>) = <I>g</I>(<I>a</I>_<I>z</I>) + bow(<I>a</I>_) <I>p</I>(_<I>z</I>)

</PRE>
where <I>g</I>(<I>a</I>_<I>z</I>) = 0 when <I>c</I>(<I>a</I>_<I>z</I>) = 0, and <I>g</I>(<I>a</I>_<I>z</I>) is discounted to be less than the ML estimate when <I>c</I>(<I>a</I>_<I>z</I>) > 0, so as to reserve some probability mass for the unseen <I>z</I> words.
Given <I>g</I>(<I>a</I>_<I>z</I>), bow(<I>a</I>_) can be determined as follows:
<PRE>

    (5)  Sum_<I>Z</I> <I>p</I>(<I>a</I>_<I>z</I>) = 1
         Sum_<I>Z1</I> <I>g</I>(<I>a</I>_<I>z</I>) + Sum_<I>Z</I> bow(<I>a</I>_) <I>p</I>(_<I>z</I>) = 1
         bow(<I>a</I>_) = 1 - Sum_<I>Z1</I> <I>g</I>(<I>a</I>_<I>z</I>)

</PRE>
<P>
An interpolated model can also be expressed in the form of equation (2), which is the way it is represented in the ARPA format model files in SRILM:
<PRE>

    (6)  <I>f</I>(<I>a</I>_<I>z</I>) = <I>g</I>(<I>a</I>_<I>z</I>) + bow(<I>a</I>_) <I>p</I>(_<I>z</I>)
         <I>p</I>(<I>a</I>_<I>z</I>) = (<I>c</I>(<I>a</I>_<I>z</I>) > 0) ? <I>f</I>(<I>a</I>_<I>z</I>) : bow(<I>a</I>_) <I>p</I>(_<I>z</I>)

</PRE>
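<P>
As an illustration of equations (2) and (6) (not an excerpt from SRILM), the following Python sketch evaluates a model stored in backoff form, assuming a hypothetical dictionary <I>prob</I> of <I>f</I> values keyed by N-gram tuples and a dictionary <I>bow</I> of backoff weights keyed by context tuples:
<PRE>
# Hypothetical lookup following Eqn. 2: use f(a_z) if the N-gram is stored,
# otherwise back off to the shorter context, multiplying by bow(a_).
def p(ngram, prob, bow):
    if ngram in prob:
        return prob[ngram]
    if len(ngram) == 1:
        return 0.0                       # unseen unigram (or use an unknown-word estimate)
    context = ngram[:-1]
    return bow.get(context, 1.0) * p(ngram[1:], prob, bow)

prob = {("a", "z"): 0.25, ("z",): 0.1, ("b",): 0.2}
bow  = {("a",): 0.6}
print(p(("a", "z"), prob, bow))   # stored bigram: 0.25
print(p(("a", "b"), prob, bow))   # backoff: 0.6 * 0.2 = 0.12
</PRE>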
<P>
Most algorithms in SRILM have both backoff and interpolated versions.
Empirically, the interpolated algorithms usually do better than the backoff ones, and Kneser-Ney does better than the other methods.
<H2> OPTIONS </H2>
<P>
This section describes the formulation of each discounting option in <A HREF="ngram-count.1.html">ngram-count(1)</A>.
After giving the motivation for each discounting method, we give expressions for <I>f</I>(<I>a</I>_<I>z</I>) and bow(<I>a</I>_) of Equation 2 in terms of the counts.
Note that some counts may not be included in the model file because of the <B>-gtmin</B> options; see Warning 4 in the next section.
<P>
Backoff versions are the default, but interpolated versions of most models are available using the <B>-interpolate</B> option.
In that case we also express <I>g</I>(<I>a</I>_<I>z</I>) and bow(<I>a</I>_) of Equation 4 in terms of the counts.
Note that the ARPA format model files store interpolated models and backoff models the same way, using <I>f</I>(<I>a</I>_<I>z</I>) and bow(<I>a</I>_); see Warning 3 in the next section.
The conversion between the backoff and interpolated formulations is given in Equation 6.
<P>
A discounting option may be followed by a digit (1-9) to indicate that only the specified N-gram order is affected.
See <A HREF="ngram-count.1.html">ngram-count(1)</A> for more details.
<DL>
<DT><B>-cdiscount</B> <I>D</I>
<DD>
Ney's absolute discounting, using <I>D</I> as the constant to subtract.
<I>D</I> should be between 0 and 1.
If <I>Z1</I> is the set of all words <I>z</I> with <I>c</I>(<I>a</I>_<I>z</I>) > 0:
<PRE>

    <I>f</I>(<I>a</I>_<I>z</I>)  = (<I>c</I>(<I>a</I>_<I>z</I>) - <I>D</I>) / <I>c</I>(<I>a</I>_)
    <I>p</I>(<I>a</I>_<I>z</I>)  = (<I>c</I>(<I>a</I>_<I>z</I>) > 0) ? <I>f</I>(<I>a</I>_<I>z</I>) : bow(<I>a</I>_) <I>p</I>(_<I>z</I>)    ; Eqn.2
    bow(<I>a</I>_) = (1 - Sum_<I>Z1</I> <I>f</I>(<I>a</I>_<I>z</I>)) / (1 - Sum_<I>Z1</I> <I>f</I>(_<I>z</I>))  ; Eqn.3

</PRE>
With the <B>-interpolate</B> option we have:
<PRE>

    <I>g</I>(<I>a</I>_<I>z</I>)  = max(0, <I>c</I>(<I>a</I>_<I>z</I>) - <I>D</I>) / <I>c</I>(<I>a</I>_)
    <I>p</I>(<I>a</I>_<I>z</I>)  = <I>g</I>(<I>a</I>_<I>z</I>) + bow(<I>a</I>_) <I>p</I>(_<I>z</I>)    ; Eqn.4
    bow(<I>a</I>_) = 1 - Sum_<I>Z1</I> <I>g</I>(<I>a</I>_<I>z</I>)        ; Eqn.5
             = <I>D</I> <I>n</I>(<I>a</I>_*) / <I>c</I>(<I>a</I>_)

</PRE>
The suggested discount factor is:
<PRE>

    <I>D</I> = <I>n1</I> / (<I>n1</I> + 2*<I>n2</I>)

</PRE>
where <I>n1</I> and <I>n2</I> are the total number of N-grams with exactly one and two counts, respectively.
Different discounting constants can be specified for different N-gram orders using the options <B>-cdiscount1</B>, <B>-cdiscount2</B>, etc.
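The computation for a single context can be sketched in Python as follows (an illustration of the formulas above, not the SRILM implementation; the function and argument names are made up):
<PRE>
# Hypothetical absolute-discounting sketch for one context "a_".
# counts maps each observed word z to c(a_z); n1, n2 are global counts-of-counts.
def absolute_discount(counts, n1, n2):
    D = n1 / (n1 + 2 * n2)                # suggested discount
    total = sum(counts.values())          # c(a_)
    f = {z: (c - D) / total for z, c in counts.items()}
    # Interpolated form: leftover mass D * n(a_*) / c(a_) goes to the lower order model
    bow_interp = D * len(counts) / total
    return f, bow_interp

f, bow = absolute_discount({"x": 3, "y": 1}, n1=1000, n2=400)
print(f, bow)   # D = 5/9; bow = 2*D/4
</PRE>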
<DT><B>-kndiscount</B> and <B>-ukndiscount</B>
<DD>
Kneser-Ney discounting.
This is similar to absolute discounting in that the discounted probability is computed by subtracting a constant <I>D</I> from the N-gram count.
The options <B>-kndiscount</B> and <B>-ukndiscount</B> differ as to how this constant is computed.
<BR>
The main idea of Kneser-Ney is to use a modified probability estimate for the lower order N-grams used for backoff.
Specifically, the modified probability for a lower order N-gram is taken to be proportional to the number of unique words that precede it in the training data.
With discounting and normalization we get:
<PRE>

    <I>f</I>(<I>a</I>_<I>z</I>) = (<I>c</I>(<I>a</I>_<I>z</I>) - <I>D0</I>) / <I>c</I>(<I>a</I>_)    ;; for highest order N-grams
    <I>f</I>(_<I>z</I>)   = (<I>n</I>(*_<I>z</I>) - <I>D1</I>) / <I>n</I>(*_*)   ;; for lower order N-grams

</PRE>
where the <I>n</I>(*_<I>z</I>) notation represents the number of unique N-grams that match a given pattern, with (*) used as a wildcard for a single word.
<I>D0</I> and <I>D1</I> represent two different discounting constants; each N-gram order uses its own discounting constant.
The resulting conditional probability and backoff weight are calculated as given in equations (2) and (3):
<PRE>

    <I>p</I>(<I>a</I>_<I>z</I>)  = (<I>c</I>(<I>a</I>_<I>z</I>) > 0) ? <I>f</I>(<I>a</I>_<I>z</I>) : bow(<I>a</I>_) <I>p</I>(_<I>z</I>)    ; Eqn.2
    bow(<I>a</I>_) = (1 - Sum_<I>Z1</I> <I>f</I>(<I>a</I>_<I>z</I>)) / (1 - Sum_<I>Z1</I> <I>f</I>(_<I>z</I>))  ; Eqn.3

</PRE>
The option <B>-interpolate</B> is used to create the interpolated versions of <B>-kndiscount</B> and <B>-ukndiscount</B>.
In this case we have:
<PRE>

    <I>p</I>(<I>a</I>_<I>z</I>) = <I>g</I>(<I>a</I>_<I>z</I>) + bow(<I>a</I>_) <I>p</I>(_<I>z</I>)    ; Eqn.4

</PRE>
Let <I>Z1</I> be the set {<I>z</I>: <I>c</I>(<I>a</I>_<I>z</I>) > 0}.
For the highest order N-grams we have:
<PRE>

    <I>g</I>(<I>a</I>_<I>z</I>)  = max(0, <I>c</I>(<I>a</I>_<I>z</I>) - <I>D</I>) / <I>c</I>(<I>a</I>_)
    bow(<I>a</I>_) = 1 - Sum_<I>Z1</I> <I>g</I>(<I>a</I>_<I>z</I>)
             = 1 - Sum_<I>Z1</I> <I>c</I>(<I>a</I>_<I>z</I>) / <I>c</I>(<I>a</I>_) + Sum_<I>Z1</I> <I>D</I> / <I>c</I>(<I>a</I>_)
             = <I>D</I> <I>n</I>(<I>a</I>_*) / <I>c</I>(<I>a</I>_)

</PRE>
Let <I>Z2</I> be the set {<I>z</I>: <I>n</I>(*_<I>z</I>) > 0}.
For the lower order N-grams we have:
<PRE>

    <I>g</I>(_<I>z</I>) = max(0, <I>n</I>(*_<I>z</I>) - <I>D</I>) / <I>n</I>(*_*)
    bow(_) = 1 - Sum_<I>Z2</I> <I>g</I>(_<I>z</I>)
           = 1 - Sum_<I>Z2</I> <I>n</I>(*_<I>z</I>) / <I>n</I>(*_*) + Sum_<I>Z2</I> <I>D</I> / <I>n</I>(*_*)
           = <I>D</I> <I>n</I>(_*) / <I>n</I>(*_*)

</PRE>
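To make the modified counts concrete, here is a small Python sketch (not SRILM code; the names are hypothetical) that derives the lower order type counts <I>n</I>(*_<I>z</I>) and <I>n</I>(*_*) from raw bigram counts:
<PRE>
# Hypothetical sketch: derive Kneser-Ney modified counts for unigrams from bigrams.
# bigram_counts maps (a, z) pairs to c(a z).
from collections import defaultdict

def kn_modified_counts(bigram_counts):
    n_star_z = defaultdict(int)           # n(*z): number of unique predecessors of z
    for (a, z), c in bigram_counts.items():
        if c > 0:
            n_star_z[z] += 1
    n_star_star = sum(n_star_z.values())  # n(**): total number of unique bigram types
    return dict(n_star_z), n_star_star

counts = {("a", "z"): 5, ("b", "z"): 1, ("a", "y"): 2}
print(kn_modified_counts(counts))   # ({'z': 2, 'y': 1}, 3)
</PRE>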
The original Kneser-Ney discounting (<B>-ukndiscount</B>) uses one discounting constant for each N-gram order.
These constants are estimated as
<PRE>

    <I>D</I> = <I>n1</I> / (<I>n1</I> + 2*<I>n2</I>)

</PRE>
where <I>n1</I> and <I>n2</I> are the total number of N-grams with exactly one and two counts, respectively.
<BR>
Chen and Goodman's modified Kneser-Ney discounting (<B>-kndiscount</B>) uses three discounting constants for each N-gram order: one for one-count N-grams, one for two-count N-grams, and one for three-plus-count N-grams:
<PRE>

    <I>Y</I>   = <I>n1</I> / (<I>n1</I> + 2*<I>n2</I>)
    <I>D1</I>  = 1 - 2<I>Y</I>(<I>n2</I>/<I>n1</I>)
    <I>D2</I>  = 2 - 3<I>Y</I>(<I>n3</I>/<I>n2</I>)
    <I>D3+</I> = 3 - 4<I>Y</I>(<I>n4</I>/<I>n3</I>)

</PRE>
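These constants follow directly from the counts-of-counts; the following Python lines (an illustration only, not SRILM code) mirror the formulas above:
<PRE>
# Hypothetical computation of the Chen & Goodman discounts from counts-of-counts
# n1..n4 (numbers of N-grams occurring exactly 1..4 times) for one N-gram order.
def modified_kn_discounts(n1, n2, n3, n4):
    Y = n1 / (n1 + 2 * n2)
    D1 = 1 - 2 * Y * (n2 / n1)
    D2 = 2 - 3 * Y * (n3 / n2)
    D3plus = 3 - 4 * Y * (n4 / n3)
    return D1, D2, D3plus

print(modified_kn_discounts(1000, 400, 200, 100))
</PRE>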
<DT><B> Warning: </B>
<DD>
SRILM implements Kneser-Ney discounting by actually modifying the counts of the lower order N-grams.
Thus, when the <B>-write</B> option is used to write out the counts with <B>-kndiscount</B> or <B>-ukndiscount</B>, only the highest order N-grams and the N-grams that start with &lt;s&gt; will have their regular counts <I>c</I>(<I>a</I>_<I>z</I>); all others will have the modified counts <I>n</I>(*_<I>z</I>) instead.
See Warning 2 in the next section.
<DT><B> -wbdiscount </B>
<DD>
Witten-Bell discounting.
The intuition is that the weight given to the lower order model should be proportional to the probability of observing an unseen word in the current context (<I>a</I>_).
Witten-Bell computes this weight as:
<PRE>

    bow(<I>a</I>_) = <I>n</I>(<I>a</I>_*) / (<I>n</I>(<I>a</I>_*) + <I>c</I>(<I>a</I>_))

</PRE>
Here <I>n</I>(<I>a</I>_*) represents the number of unique words following the context (<I>a</I>_) in the training data.
Witten-Bell was originally formulated as an interpolated discounting method, so with the <B>-interpolate</B> option we get:
<PRE>

    <I>g</I>(<I>a</I>_<I>z</I>) = <I>c</I>(<I>a</I>_<I>z</I>) / (<I>n</I>(<I>a</I>_*) + <I>c</I>(<I>a</I>_))
    <I>p</I>(<I>a</I>_<I>z</I>) = <I>g</I>(<I>a</I>_<I>z</I>) + bow(<I>a</I>_) <I>p</I>(_<I>z</I>)    ; Eqn.4

</PRE>
Without the <B>-interpolate</B> option we have the backoff version, which is implemented by taking <I>f</I>(<I>a</I>_<I>z</I>) to be the same as the interpolated <I>g</I>(<I>a</I>_<I>z</I>):
<PRE>

    <I>f</I>(<I>a</I>_<I>z</I>)  = <I>c</I>(<I>a</I>_<I>z</I>) / (<I>n</I>(<I>a</I>_*) + <I>c</I>(<I>a</I>_))
    <I>p</I>(<I>a</I>_<I>z</I>)  = (<I>c</I>(<I>a</I>_<I>z</I>) > 0) ? <I>f</I>(<I>a</I>_<I>z</I>) : bow(<I>a</I>_) <I>p</I>(_<I>z</I>)    ; Eqn.2
    bow(<I>a</I>_) = (1 - Sum_<I>Z1</I> <I>f</I>(<I>a</I>_<I>z</I>)) / (1 - Sum_<I>Z1</I> <I>f</I>(_<I>z</I>))  ; Eqn.3

</PRE>
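A minimal Python sketch of the interpolated Witten-Bell estimate for one context (illustrative only, not the SRILM implementation):
<PRE>
# Hypothetical Witten-Bell sketch for one context "a_": counts maps z to c(a_z).
def witten_bell(counts):
    c_total = sum(counts.values())        # c(a_)
    n_types = len(counts)                 # n(a_*)
    bow = n_types / (n_types + c_total)   # weight of the lower order model
    g = {z: c / (n_types + c_total) for z, c in counts.items()}
    return g, bow

g, bow = witten_bell({"x": 3, "y": 1})
print(g, bow)   # g sums to 4/6, bow = 2/6; together with p(_z) this sums to 1
</PRE>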
<DT><B> -ndiscount </B>
<DD>
Ristad's natural discounting law.
See Ristad's technical report "A natural law of succession" for a justification of the discounting factor.
The <B>-interpolate</B> option has no effect; only a backoff version has been implemented.
<PRE>

              <I>c</I>(<I>a</I>_<I>z</I>)   <I>c</I>(<I>a</I>_) (<I>c</I>(<I>a</I>_) + 1) + <I>n</I>(<I>a</I>_*) (1 - <I>n</I>(<I>a</I>_*))
    <I>f</I>(<I>a</I>_<I>z</I>) = -------  -----------------------------------------
              <I>c</I>(<I>a</I>_)          <I>c</I>(<I>a</I>_)^2 + <I>c</I>(<I>a</I>_) + 2 <I>n</I>(<I>a</I>_*)

    <I>p</I>(<I>a</I>_<I>z</I>)  = (<I>c</I>(<I>a</I>_<I>z</I>) > 0) ? <I>f</I>(<I>a</I>_<I>z</I>) : bow(<I>a</I>_) <I>p</I>(_<I>z</I>)    ; Eqn.2
    bow(<I>a</I>_) = (1 - Sum_<I>Z1</I> <I>f</I>(<I>a</I>_<I>z</I>)) / (1 - Sum_<I>Z1</I> <I>f</I>(_<I>z</I>))  ; Eqn.3

</PRE>
<DT><B> -count-lm </B>
<DD>
Estimate a count-based interpolated LM using Jelinek-Mercer smoothing (Chen & Goodman, 1998), also known as "deleted interpolation."
Note that this does not produce a backoff model; instead, a count-LM parameter file in the format described in <A HREF="ngram.1.html">ngram(1)</A> needs to be specified using <B>-init-lm</B>, and a reestimated file in the same format is produced.
In the process, the mixture weights that interpolate the ML estimates at all levels of N-grams are estimated using an expectation-maximization (EM) algorithm.
The options <B>-em-iters</B> and <B>-em-delta</B> control termination of the EM algorithm.
Note that the N-gram counts used to compute the maximum-likelihood estimates are specified in the <B>-init-lm</B> model file.
The counts specified with <B>-read</B> or <B>-text</B> are used only to estimate the interpolation weights.
<DT><B>-addsmooth</B> <I>D</I>
<DD>
Smooth by adding <I>D</I> to each N-gram count.
This is usually a poor smoothing method, included mainly for instructional purposes.
<PRE>

    <I>p</I>(<I>a</I>_<I>z</I>) = (<I>c</I>(<I>a</I>_<I>z</I>) + <I>D</I>) / (<I>c</I>(<I>a</I>_) + <I>D</I> <I>n</I>(*))

</PRE>
Here <I>n</I>(*) is the number of unique words, i.e., the vocabulary size.
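A minimal Python version of this estimate (an illustration only; the names are made up) could look like:
<PRE>
# Hypothetical add-D smoothing: counts maps z to c(a_z); vocab_size is n(*),
# the number of words in the vocabulary.
def addsmooth(counts, vocab_size, D=1.0):
    total = sum(counts.values())          # c(a_)
    return lambda z: (counts.get(z, 0) + D) / (total + D * vocab_size)

p = addsmooth({"x": 3, "y": 1}, vocab_size=10, D=0.5)
print(p("x"), p("unseen"))   # 3.5/9 and 0.5/9
</PRE>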
<DT>default
<DD>
If the user does not specify any discounting options, <B>ngram-count</B> uses Good-Turing discounting (a.k.a. Katz smoothing) by default.
The Good-Turing estimate states that for any N-gram that occurs <I>r</I> times, we should pretend that it occurs <I>r</I>' times, where
<PRE>

    <I>r</I>' = (<I>r</I>+1) <I>n</I>[<I>r</I>+1]/<I>n</I>[<I>r</I>]

</PRE>
Here <I>n</I>[<I>r</I>] is the number of N-grams that occur exactly <I>r</I> times in the training data.
<BR>
Large counts are taken to be reliable, so they are not subject to any discounting.
By default, unigram counts larger than 1 and other N-gram counts larger than 7 are taken to be reliable, and maximum likelihood estimates are used.
These limits can be modified using the <B>-gt</B><I>n</I><B>max</B> options.
<PRE>

    <I>f</I>(<I>a</I>_<I>z</I>) = <I>c</I>(<I>a</I>_<I>z</I>) / <I>c</I>(<I>a</I>_)    if <I>c</I>(<I>a</I>_<I>z</I>) > <I>gtmax</I>

</PRE>
The lower counts are discounted proportional to the Good-Turing estimate, with a small correction <I>A</I> to account for the high-count N-grams not being discounted.
If 1 &lt;= <I>c</I>(<I>a</I>_<I>z</I>) &lt;= <I>gtmax</I>:
<PRE>

                      <I>n</I>[<I>gtmax</I> + 1]
    <I>A</I> = (<I>gtmax</I> + 1) --------------
                         <I>n</I>[1]

                               <I>n</I>[<I>c</I>(<I>a</I>_<I>z</I>) + 1]
    <I>c</I>'(<I>a</I>_<I>z</I>) = (<I>c</I>(<I>a</I>_<I>z</I>) + 1) ---------------
                                <I>n</I>[<I>c</I>(<I>a</I>_<I>z</I>)]

              <I>c</I>(<I>a</I>_<I>z</I>)   (<I>c</I>'(<I>a</I>_<I>z</I>) / <I>c</I>(<I>a</I>_<I>z</I>) - <I>A</I>)
    <I>f</I>(<I>a</I>_<I>z</I>) = --------  ------------------------
              <I>c</I>(<I>a</I>_)            (1 - <I>A</I>)

</PRE>
The <B>-interpolate</B> option has no effect in this case; only a backoff version has been implemented, thus:
<PRE>

    <I>p</I>(<I>a</I>_<I>z</I>)  = (<I>c</I>(<I>a</I>_<I>z</I>) > 0) ? <I>f</I>(<I>a</I>_<I>z</I>) : bow(<I>a</I>_) <I>p</I>(_<I>z</I>)    ; Eqn.2
    bow(<I>a</I>_) = (1 - Sum_<I>Z1</I> <I>f</I>(<I>a</I>_<I>z</I>)) / (1 - Sum_<I>Z1</I> <I>f</I>(_<I>z</I>))  ; Eqn.3

</PRE>
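As an illustration (not the SRILM code; the function and variable names are made up), the adjusted count and the discounted <I>f</I>(<I>a</I>_<I>z</I>) above can be computed as follows in Python:
<PRE>
# Hypothetical Good-Turing sketch: n_r[r] is the number of N-grams with count r.
def good_turing_f(c_az, c_a, n_r, gtmax=7):
    if c_az > gtmax:                       # large counts are taken at face value
        return c_az / c_a
    A = (gtmax + 1) * n_r[gtmax + 1] / n_r[1]
    c_adj = (c_az + 1) * n_r[c_az + 1] / n_r[c_az]   # r' = (r+1) n[r+1]/n[r]
    return (c_az / c_a) * (c_adj / c_az - A) / (1 - A)

n_r = {1: 1000, 2: 400, 3: 200, 4: 120, 5: 80, 6: 60, 7: 40, 8: 30}
print(good_turing_f(2, 50, n_r))
</PRE>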
</DD>
</DL>
<H2> FILE FORMATS </H2>
SRILM can generate simple N-gram counts from plain text files with the following command:
<PRE>
    ngram-count -order <I>N</I> -text <I>file.txt</I> -write <I>file.cnt</I>
</PRE>
The <B>-order</B> option determines the maximum length of the N-grams.
The file <I>file.txt</I> should contain one sentence per line with tokens separated by whitespace.
The output <I>file.cnt</I> contains the N-gram tokens followed by a tab and a count on each line:
<PRE>

    <I>a</I>_<I>z</I> &lt;tab&gt; <I>c</I>(<I>a</I>_<I>z</I>)

</PRE>
A couple of warnings:
<DL>
<DT><B> Warning 1 </B>
<DD>
SRILM implicitly assumes an &lt;s&gt; token at the beginning of each line and an &lt;/s&gt; token at the end of each line, and counts N-grams that start with &lt;s&gt; and end with &lt;/s&gt;.
You do not need to include these tags in <I>file.txt</I>.
<DT><B> Warning 2 </B>
<DD>
When the <B>-kndiscount</B> or <B>-ukndiscount</B> options are used, the count file contains modified counts.
Specifically, all N-grams of the maximum order, and all N-grams that start with &lt;s&gt;, have their regular counts <I>c</I>(<I>a</I>_<I>z</I>), but shorter N-grams that do not start with &lt;s&gt; have the number of unique words preceding them, <I>n</I>(*<I>a</I>_<I>z</I>), instead.
See the descriptions of <B>-kndiscount</B> and <B>-ukndiscount</B> for details.
</DD>
</DL>
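<P>
A count file produced this way is plain text and easy to post-process; for example, the following Python snippet (not part of SRILM) reads it back into a dictionary keyed by word tuples:
<PRE>
# Hypothetical reader for the tab-separated count format shown above.
def read_counts(path):
    counts = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            ngram, count = line.rstrip("\n").rsplit("\t", 1)
            counts[tuple(ngram.split())] = int(count)
    return counts

# counts = read_counts("file.cnt")
# print(counts.get(("the", "cat")))
</PRE>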
<P>
For most smoothing methods (except <B>-count-lm</B>) SRILM generates and uses N-gram model files in the ARPA format.
A typical command to generate a model file would be:
<PRE>
    ngram-count -order <I>N</I> -text <I>file.txt</I> -lm <I>file.lm</I>
</PRE>
The ARPA format output <I>file.lm</I> will contain the following information about an N-gram on each line:
<PRE>

    log10(<I>f</I>(<I>a</I>_<I>z</I>)) &lt;tab&gt; <I>a</I>_<I>z</I> &lt;tab&gt; log10(bow(<I>a</I>_<I>z</I>))

</PRE>
Based on Equation 2, the first entry represents the base-10 logarithm of the conditional probability (logprob) for the N-gram <I>a</I>_<I>z</I>.
This is followed by the actual words in the N-gram, separated by spaces.
The last, optional entry is the base-10 logarithm of the backoff weight for (<I>n</I>+1)-grams starting with <I>a</I>_<I>z</I>.
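<P>
For illustration (this is not a full ARPA parser, and the function name is made up), a single N-gram line of this form can be split into its logprob, words, and optional backoff weight like this:
<PRE>
# Hypothetical parse of one ARPA-format N-gram line:
#   log10(f(a_z)) (tab) a_z (tab) log10(bow(a_z))   -- the last field is optional
def parse_arpa_line(line):
    fields = line.rstrip("\n").split("\t")
    logprob = float(fields[0])
    words = tuple(fields[1].split())
    log_bow = float(fields[2]) if len(fields) == 3 else 0.0   # implicit bow of 1
    return logprob, words, log_bow

print(parse_arpa_line("-1.2041\tin the\t-0.3010"))
</PRE>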
<DL>
<DT><B> Warning 3 </B>
<DD>
Both backoff and interpolated models are represented in the same format.
This means interpolation is done during model building, and the result is represented in the ARPA format with logprob and backoff weight using equation (6).
<DT><B> Warning 4 </B>
<DD>
Not all N-grams in the count file necessarily end up in the model file.
The options <B>-gtmin</B>, <B>-gt1min</B>, ..., <B>-gt9min</B> specify the minimum counts for N-grams to be included in the LM (not only for Good-Turing discounting but for the other methods as well).
By default all unigrams and bigrams are included, but for higher order N-grams only those with count >= 2 are included.
Some exceptions arise because, if one N-gram is included in the model file, all its prefix N-grams have to be included as well.
This causes some higher order 1-count N-grams to be included when using KN discounting, which uses modified counts as described in Warning 2.
<DT><B> Warning 5 </B>
<DD>
Not all N-grams in the model file have backoff weights.
The highest order N-grams do not need a backoff weight.
For lower order N-grams, backoff weights are only recorded for those that appear as the prefix of a longer N-gram included in the model.
For other lower order N-grams the backoff weight is implicitly 1 (or 0, in log representation).
</DD>
</DL>
<H2> SEE ALSO </H2>
<A HREF="ngram.1.html">ngram(1)</A>, <A HREF="ngram-count.1.html">ngram-count(1)</A>, <A HREF="ngram-format.5.html">ngram-format(5)</A>
<BR>
S. F. Chen and J. Goodman, "An Empirical Study of Smoothing Techniques for Language Modeling," TR-10-98, Computer Science Group, Harvard University, 1998.
<H2> BUGS </H2>
Work in progress.
<H2> AUTHOR </H2>
Deniz Yuret &lt;dyuret@ku.edu.tr&gt;,
Andreas Stolcke &lt;stolcke@icsi.berkeley.edu&gt;
<BR>
Copyright (c) 2007 SRI International
</BODY>
</HTML>