competition update

This commit is contained in:
nckcard
2025-07-02 12:18:09 -07:00
parent 9e17716a4a
commit 77dbcf868f
2615 changed files with 1648116 additions and 125 deletions


@@ -0,0 +1,15 @@
.\" $Id: TEMPLATE.7,v 1.2 2019/09/09 22:35:38 stolcke Exp $
.TH XXX 7 "$Date: 2019/09/09 22:35:38 $" "SRILM Miscellaneous"
.SH NAME
XXX \- XXX
.SH SYNOPSIS
.nf
.B XXX
.fi
.SH DESCRIPTION
.SH "SEE ALSO"
.SH BUGS
.SH AUTHOR
Andreas Stolcke <stolcke@icsi.berkeley.edu>
.br
Copyright (c) 2007 SRI International


@@ -0,0 +1,672 @@
.\" $Id: ngram-discount.7,v 1.5 2019/09/09 22:35:37 stolcke Exp $
.TH ngram-discount 7 "$Date: 2019/09/09 22:35:37 $" "SRILM Miscellaneous"
.SH NAME
ngram-discount \- notes on the N-gram smoothing implementations in SRILM
.SH NOTATION
.TP 10
.IR a _ z
An N-gram where
.I a
is the first word,
.I z
is the last word, and "_" represents 0 or more words in between.
.TP
.IR p ( a _ z )
The estimated conditional probability of the \fIn\fPth word
.I z
given the first \fIn\fP-1 words
.RI ( a _)
of an N-gram.
.TP
.IR a _
The \fIn\fP-1 word prefix of the N-gram
.IR a _ z .
.TP
.RI _ z
The \fIn\fP-1 word suffix of the N-gram
.IR a _ z .
.TP
.IR c ( a _ z )
The count of N-gram
.IR a _ z
in the training data.
.TP
.IR n (*_ z )
The number of unique N-grams that match a given pattern.
``(*)'' represents a wildcard matching a single word.
.TP
.IR n1 , n [1]
The number of unique N-grams with count = 1.
.SH DESCRIPTION
.PP
N-gram models try to estimate the probability of a word
.I z
in the context of the previous \fIn\fP-1 words
.RI ( a _),
i.e.,
.IR Pr ( z | a _).
We will
denote this conditional probability using
.IR p ( a _ z )
for convenience.
One way to estimate
.IR p ( a _ z )
is to look at the number of times word
.I z
has followed the previous \fIn\fP-1 words
.RI ( a _):
.nf
(1) \fIp\fP(\fIa\fP_\fIz\fP) = \fIc\fP(\fIa\fP_\fIz\fP)/\fIc\fP(\fIa\fP_)
.fi
This is known as the maximum likelihood (ML) estimate.
Unfortunately it does not work very well because it assigns zero probability to
N-grams that have not been observed in the training data.
To avoid the zero probabilities, we take some probability mass from the observed
N-grams and distribute it to unobserved N-grams.
Such redistribution is known as smoothing or discounting.
.PP
Most existing smoothing algorithms can be described by the following equation:
.nf
(2) \fIp\fP(\fIa\fP_\fIz\fP) = (\fIc\fP(\fIa\fP_\fIz\fP) > 0) ? \fIf\fP(\fIa\fP_\fIz\fP) : bow(\fIa\fP_) \fIp\fP(_\fIz\fP)
.fi
If the N-gram
.IR a _ z
has been observed in the training data, we use the
distribution
.IR f ( a _ z ).
Typically
.IR f ( a _ z )
is discounted to be less than
the ML estimate so we have some leftover probability for the
.I z
words unseen in the context
.RI ( a _).
Different algorithms mainly differ on how
they discount the ML estimate to get
.IR f ( a _ z ).
.PP
If the N-gram
.IR a _ z
has not been observed in the training data, we use
the lower order distribution
.IR p (_ z ).
If the context has never been
observed (\fIc\fP(\fIa\fP_) = 0),
we can use the lower order distribution directly (bow(\fIa\fP_) = 1).
Otherwise we need to compute a backoff weight (bow) to
make sure probabilities are normalized:
.nf
Sum_\fIz\fP \fIp\fP(\fIa\fP_\fIz\fP) = 1
.fi
.PP
Let
.I Z
be the set of all words in the vocabulary,
.I Z0
be the set of all words with \fIc\fP(\fIa\fP_\fIz\fP) = 0, and
.I Z1
be the set of all words with \fIc\fP(\fIa\fP_\fIz\fP) > 0.
Given
.IR f ( a _ z ),
.RI bow( a _)
can be determined as follows:
.nf
(3) Sum_\fIZ\fP \fIp\fP(\fIa\fP_\fIz\fP) = 1
Sum_\fIZ1\fP \fIf\fP(\fIa\fP_\fIz\fP) + Sum_\fIZ0\fP bow(\fIa\fP_) \fIp\fP(_\fIz\fP) = 1
bow(\fIa\fP_) = (1 - Sum_\fIZ1\fP \fIf\fP(\fIa\fP_\fIz\fP)) / Sum_\fIZ0\fP \fIp\fP(_\fIz\fP)
= (1 - Sum_\fIZ1\fP \fIf\fP(\fIa\fP_\fIz\fP)) / (1 - Sum_\fIZ1\fP \fIp\fP(_\fIz\fP))
= (1 - Sum_\fIZ1\fP \fIf\fP(\fIa\fP_\fIz\fP)) / (1 - Sum_\fIZ1\fP \fIf\fP(_\fIz\fP))
.fi
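As a purely illustrative example with made-up values, if the discounted
probabilities of the observed words sum to Sum_\fIZ1\fP \fIf\fP(\fIa\fP_\fIz\fP) = 0.7
and the lower order model assigns those same words Sum_\fIZ1\fP \fIf\fP(_\fIz\fP) = 0.4, then:
.nf
bow(\fIa\fP_) = (1 - 0.7) / (1 - 0.4) = 0.5
.fi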
.PP
Smoothing is generally done in one of two ways.
The backoff models compute
.IR p ( a _ z )
based on the N-gram counts
.IR c ( a _ z )
when \fIc\fP(\fIa\fP_\fIz\fP) > 0, and
only consider lower order counts
.IR c (_ z )
when \fIc\fP(\fIa\fP_\fIz\fP) = 0.
Interpolated models take lower order counts into account when
\fIc\fP(\fIa\fP_\fIz\fP) > 0 as well.
A common way to express an interpolated model is:
.nf
(4) \fIp\fP(\fIa\fP_\fIz\fP) = \fIg\fP(\fIa\fP_\fIz\fP) + bow(\fIa\fP_) \fIp\fP(_\fIz\fP)
.fi
Where \fIg\fP(\fIa\fP_\fIz\fP) = 0 when \fIc\fP(\fIa\fP_\fIz\fP) = 0
and it is discounted to be less than
the ML estimate when \fIc\fP(\fIa\fP_\fIz\fP) > 0
to reserve some probability mass for
the unseen
.I z
words.
Given
.IR g ( a _ z ),
.RI bow( a _)
can be determined as follows:
.nf
(5) Sum_\fIZ\fP \fIp\fP(\fIa\fP_\fIz\fP) = 1
Sum_\fIZ1\fP \fIg\fP(\fIa\fP_\fIz\fP) + Sum_\fIZ\fP bow(\fIa\fP_) \fIp\fP(_\fIz\fP) = 1
bow(\fIa\fP_) = 1 - Sum_\fIZ1\fP \fIg\fP(\fIa\fP_\fIz\fP)
.fi
.PP
An interpolated model can also be expressed in the form of equation
(2), which is the way it is represented in the ARPA format model files
in SRILM:
.nf
(6) \fIf\fP(\fIa\fP_\fIz\fP) = \fIg\fP(\fIa\fP_\fIz\fP) + bow(\fIa\fP_) \fIp\fP(_\fIz\fP)
\fIp\fP(\fIa\fP_\fIz\fP) = (\fIc\fP(\fIa\fP_\fIz\fP) > 0) ? \fIf\fP(\fIa\fP_\fIz\fP) : bow(\fIa\fP_) \fIp\fP(_\fIz\fP)
.fi
.PP
Most algorithms in SRILM have both backoff and interpolated versions.
Empirically, interpolated algorithms usually do better than the backoff
ones, and Kneser-Ney does better than others.
.SH OPTIONS
.PP
This section describes the formulation of each discounting option in
.BR ngram-count (1).
After giving the motivation for each discounting method,
we will give expressions for
.IR f ( a _ z )
and
.RI bow( a _)
of Equation 2 in terms of the counts.
Note that some counts may not be included in the model
file because of the
.B \-gtmin
options; see Warning 4 in the next section.
.PP
Backoff versions are the default but interpolated versions of most
models are available using the
.B \-interpolate
option.
In this case we will express
.IR g ( a _ z )
and
.RI bow( a _)
of Equation 4 in terms of the counts as well.
Note that the ARPA format model files store the interpolated
models and the backoff models the same way using
.IR f ( a _ z )
and
.RI bow( a _);
see Warning 3 in the next section.
The conversion between backoff and
interpolated formulations is given in Equation 6.
.PP
The discounting options may be followed by a digit (1-9) to indicate
that only specific N-gram orders be affected.
See
.BR ngram-count (1)
for more details.
.TP
.BI \-cdiscount " D"
Ney's absolute discounting using
.I D
as the constant to subtract.
.I D
should be between 0 and 1.
If
.I Z1
is the set
of all words
.I z
with \fIc\fP(\fIa\fP_\fIz\fP) > 0:
.nf
\fIf\fP(\fIa\fP_\fIz\fP) = (\fIc\fP(\fIa\fP_\fIz\fP) - \fID\fP) / \fIc\fP(\fIa\fP_)
\fIp\fP(\fIa\fP_\fIz\fP) = (\fIc\fP(\fIa\fP_\fIz\fP) > 0) ? \fIf\fP(\fIa\fP_\fIz\fP) : bow(\fIa\fP_) \fIp\fP(_\fIz\fP) ; Eqn.2
bow(\fIa\fP_) = (1 - Sum_\fIZ1\fP f(\fIa\fP_\fIz\fP)) / (1 - Sum_\fIZ1\fP \fIf\fP(_\fIz\fP)) ; Eqn.3
.fi
With the
.B \-interpolate
option we have:
.nf
\fIg\fP(\fIa\fP_\fIz\fP) = max(0, \fIc\fP(\fIa\fP_\fIz\fP) - \fID\fP) / \fIc\fP(\fIa\fP_)
\fIp\fP(\fIa\fP_\fIz\fP) = \fIg\fP(\fIa\fP_\fIz\fP) + bow(\fIa\fP_) \fIp\fP(_\fIz\fP) ; Eqn.4
bow(\fIa\fP_) = 1 - Sum_\fIZ1\fP \fIg\fP(\fIa\fP_\fIz\fP) ; Eqn.5
= \fID\fP \fIn\fP(\fIa\fP_*) / \fIc\fP(\fIa\fP_)
.fi
The suggested discount factor is:
.nf
\fID\fP = \fIn1\fP / (\fIn1\fP + 2*\fIn2\fP)
.fi
where
.I n1
and
.I n2
are the total number of N-grams with exactly one and
two counts, respectively.
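As a purely illustrative example (with made-up counts-of-counts), a corpus with
\fIn1\fP = 1000 singleton N-grams and \fIn2\fP = 400 doubleton N-grams would suggest
.nf
\fID\fP = 1000 / (1000 + 2*400) ~= 0.56
.fi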
Different discounting constants can be
specified for different N-gram orders using options
.BR \-cdiscount1 ,
.BR \-cdiscount2 ,
etc.
.TP
.BR \-kndiscount " and " \-ukndiscount
Kneser-Ney discounting.
This is similar to absolute discounting in
that the discounted probability is computed by subtracting a constant
.I D
from the N-gram count.
The options
.B \-kndiscount
and
.B \-ukndiscount
differ as to how this constant is computed.
.br
The main idea of Kneser-Ney is to use a modified probability estimate
for lower order N-grams used for backoff.
Specifically, the modified
probability for a lower order N-gram is taken to be proportional to the
number of unique words that precede it in the training data.
With discounting and normalization we get:
.nf
\fIf\fP(\fIa\fP_\fIz\fP) = (\fIc\fP(\fIa\fP_\fIz\fP) - \fID0\fP) / \fIc\fP(\fIa\fP_) ;; for highest order N-grams
\fIf\fP(_\fIz\fP) = (\fIn\fP(*_\fIz\fP) - \fID1\fP) / \fIn\fP(*_*) ;; for lower order N-grams
.fi
where the
.IR n (*_ z )
notation represents the number of unique N-grams that
match a given pattern with (*) used as a wildcard for a single word.
.I D0
and
.I D1
represent two different discounting constants, as each N-gram
order uses a different discounting constant.
The resulting
conditional probability and the backoff weight are calculated as given
in equations (2) and (3):
.nf
\fIp\fP(\fIa\fP_\fIz\fP) = (\fIc\fP(\fIa\fP_\fIz\fP) > 0) ? \fIf\fP(\fIa\fP_\fIz\fP) : bow(\fIa\fP_) \fIp\fP(_\fIz\fP) ; Eqn.2
bow(\fIa\fP_) = (1 - Sum_\fIZ1\fP f(\fIa\fP_\fIz\fP)) / (1 - Sum_\fIZ1\fP \fIf\fP(_\fIz\fP)) ; Eqn.3
.fi
The option
.B \-interpolate
is used to create the interpolated versions of
.B \-kndiscount
and
.BR \-ukndiscount .
In this case we have:
.nf
\fIp\fP(\fIa\fP_\fIz\fP) = \fIg\fP(\fIa\fP_\fIz\fP) + bow(\fIa\fP_) \fIp\fP(_\fIz\fP) ; Eqn.4
.fi
Let
.I Z1
be the set {\fIz\fP: \fIc\fP(\fIa\fP_\fIz\fP) > 0}.
For highest order N-grams we have:
.nf
\fIg\fP(\fIa\fP_\fIz\fP) = max(0, \fIc\fP(\fIa\fP_\fIz\fP) - \fID\fP) / \fIc\fP(\fIa\fP_)
bow(\fIa\fP_) = 1 - Sum_\fIZ1\fP \fIg\fP(\fIa\fP_\fIz\fP)
= 1 - Sum_\fIZ1\fP \fIc\fP(\fIa\fP_\fIz\fP) / \fIc\fP(\fIa\fP_) + Sum_\fIZ1\fP \fID\fP / \fIc\fP(\fIa\fP_)
= \fID\fP \fIn\fP(\fIa\fP_*) / \fIc\fP(\fIa\fP_)
.fi
Let
.I Z2
be the set {\fIz\fP: \fIn\fP(*_\fIz\fP) > 0}.
For lower order N-grams we have:
.nf
\fIg\fP(_\fIz\fP) = max(0, \fIn\fP(*_\fIz\fP) - \fID\fP) / \fIn\fP(*_*)
bow(_) = 1 - Sum_\fIZ2\fP \fIg\fP(_\fIz\fP)
= 1 - Sum_\fIZ2\fP \fIn\fP(*_\fIz\fP) / \fIn\fP(*_*) + Sum_\fIZ2\fP \fID\fP / \fIn\fP(*_*)
= \fID\fP \fIn\fP(_*) / \fIn\fP(*_*)
.fi
The original Kneser-Ney discounting
.RB ( \-ukndiscount )
uses one discounting constant for each N-gram order.
These constants are estimated as
.nf
\fID\fP = \fIn1\fP / (\fIn1\fP + 2*\fIn2\fP)
.fi
where
.I n1
and
.I n2
are the total number of N-grams with exactly one and
two counts, respectively.
.br
Chen and Goodman's modified Kneser-Ney discounting
.RB ( \-kndiscount )
uses three discounting constants for each N-gram order, one for one-count
N-grams, one for two-count N-grams, and one for three-plus-count N-grams:
.nf
\fIY\fP = \fIn1\fP/(\fIn1\fP+2*\fIn2\fP)
\fID1\fP = 1 - 2\fIY\fP(\fIn2\fP/\fIn1\fP)
\fID2\fP = 2 - 3\fIY\fP(\fIn3\fP/\fIn2\fP)
\fID3+\fP = 3 - 4\fIY\fP(\fIn4\fP/\fIn3\fP)
.fi
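As a purely illustrative example (with made-up counts-of-counts
\fIn1\fP = 1000, \fIn2\fP = 400, \fIn3\fP = 200, \fIn4\fP = 120):
.nf
\fIY\fP = 1000/(1000 + 2*400) ~= 0.556
\fID1\fP = 1 - 2*0.556*(400/1000) ~= 0.56
\fID2\fP = 2 - 3*0.556*(200/400) ~= 1.17
\fID3+\fP = 3 - 4*0.556*(120/200) ~= 1.67
.fi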
.TP
.B Warning:
SRILM implements Kneser-Ney discounting by actually modifying the
counts of the lower order N-grams. Thus, when the
.B \-write
option is
used to write the counts with
.B \-kndiscount
or
.BR \-ukndiscount ,
only the highest order N-grams and N-grams that start with <s> will have their
regular counts
.IR c ( a _ z );
all others will have the modified counts
.IR n (*_ z )
instead.
See Warning 2 in the next section.
.TP
.B \-wbdiscount
Witten-Bell discounting.
The intuition is that the weight given
to the lower order model should be proportional to the probability of
observing an unseen word in the current context
.RI ( a _).
Witten-Bell computes this weight as:
.nf
bow(\fIa\fP_) = \fIn\fP(\fIa\fP_*) / (\fIn\fP(\fIa\fP_*) + \fIc\fP(\fIa\fP_))
.fi
Here
.IR n ( a _*)
represents the number of unique words following the
context
.RI ( a _)
in the training data.
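As a purely illustrative example with made-up numbers, if the context has been
seen \fIc\fP(\fIa\fP_) = 100 times and is followed by \fIn\fP(\fIa\fP_*) = 10 distinct words, then:
.nf
bow(\fIa\fP_) = 10 / (10 + 100) ~= 0.09
.fi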
Witten-Bell was originally formulated as an interpolated discounting method,
so with the
.B \-interpolate
option we get:
.nf
\fIg\fP(\fIa\fP_\fIz\fP) = \fIc\fP(\fIa\fP_\fIz\fP) / (\fIn\fP(\fIa\fP_*) + \fIc\fP(\fIa\fP_))
\fIp\fP(\fIa\fP_\fIz\fP) = \fIg\fP(\fIa\fP_\fIz\fP) + bow(\fIa\fP_) \fIp\fP(_\fIz\fP) ; Eqn.4
.fi
Without the
.B \-interpolate
option we have the backoff version which is
implemented by taking
.IR f ( a _ z )
to be the same as the interpolated
.IR g ( a _ z ).
.nf
\fIf\fP(\fIa\fP_\fIz\fP) = \fIc\fP(\fIa\fP_\fIz\fP) / (\fIn\fP(\fIa\fP_*) + \fIc\fP(\fIa\fP_))
\fIp\fP(\fIa\fP_\fIz\fP) = (\fIc\fP(\fIa\fP_\fIz\fP) > 0) ? \fIf\fP(\fIa\fP_\fIz\fP) : bow(\fIa\fP_) \fIp\fP(_\fIz\fP) ; Eqn.2
bow(\fIa\fP_) = (1 - Sum_\fIZ1\fP \fIf\fP(\fIa\fP_\fIz\fP)) / (1 - Sum_\fIZ1\fP \fIf\fP(_\fIz\fP)) ; Eqn.3
.fi
.TP
.B \-ndiscount
Ristad's natural discounting law.
See Ristad's technical report "A natural law of succession"
for a justification of the discounting factor.
The
.B \-interpolate
option has no effect, only a backoff version has been implemented.
.nf
\fIf\fP(\fIa\fP_\fIz\fP) = [\fIc\fP(\fIa\fP_\fIz\fP) / \fIc\fP(\fIa\fP_)] *
    [\fIc\fP(\fIa\fP_) (\fIc\fP(\fIa\fP_) + 1) + \fIn\fP(\fIa\fP_*) (1 - \fIn\fP(\fIa\fP_*))] / [\fIc\fP(\fIa\fP_)^2 + \fIc\fP(\fIa\fP_) + 2 \fIn\fP(\fIa\fP_*)]
\fIp\fP(\fIa\fP_\fIz\fP) = (\fIc\fP(\fIa\fP_\fIz\fP) > 0) ? \fIf\fP(\fIa\fP_\fIz\fP) : bow(\fIa\fP_) \fIp\fP(_\fIz\fP) ; Eqn.2
bow(\fIa\fP_) = (1 - Sum_\fIZ1\fP f(\fIa\fP_\fIz\fP)) / (1 - Sum_\fIZ1\fP \fIf\fP(_\fIz\fP)) ; Eqn.3
.fi
.TP
.B \-count-lm
Estimate a count-based interpolated LM using Jelinek-Mercer smoothing
(Chen & Goodman, 1998), also known as "deleted interpolation."
Note that this does not produce a backoff model; instead, a
count-LM parameter file in the format described in
.BR ngram (1)
needs to be specified using
.BR \-init-lm ,
and a reestimated file in the same format is produced.
In the process, the mixture weights that interpolate the ML estimates
at all levels of N-grams are estimated using an expectation-maximization (EM)
algorithm.
The options
.B \-em-iters
and
.B \-em-delta
control termination of the EM algorithm.
Note that the N-gram counts used to estimate the maximum-likelihood
estimates are specified in the
.B \-init-lm
model file.
The counts specified with
.B \-read
or
.B \-text
are used only to estimate the interpolation weights.
\" ???What does this all mean in terms of the math???
.TP
.BI \-addsmooth " D"
Smooth by adding
.I D
to each N-gram count.
This is usually a poor smoothing method,
included mainly for instructional purposes.
.nf
\fIp\fP(\fIa\fP_\fIz\fP) = (\fIc\fP(\fIa\fP_\fIz\fP) + \fID\fP) / (\fIc\fP(\fIa\fP_) + \fID\fP \fIn\fP(*))
.fi
.TP
default
If the user does not specify any discounting options,
.B ngram-count
uses Good-Turing discounting (aka Katz smoothing) by default.
The Good-Turing estimate states that for any N-gram that occurs
.I r
times, we should pretend that it occurs
.IR r '
times where
.nf
\fIr\fP' = (\fIr\fP+1) \fIn\fP[\fIr\fP+1]/\fIn\fP[\fIr\fP]
.fi
Here
.IR n [ r ]
is the number of N-grams that occur exactly
.I r
times in the training data.
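As a purely illustrative example with made-up counts-of-counts
\fIn\fP[1] = 1000 and \fIn\fP[2] = 400, a singleton N-gram (\fIr\fP = 1) is treated
as if it had occurred
.nf
\fIr\fP' = (1+1) * 400/1000 = 0.8
.fi
times.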
.br
Large counts are taken to be reliable, thus they are not subject to
any discounting.
By default unigram counts larger than 1 and other N-gram counts larger
than 7 are taken to be reliable and maximum
likelihood estimates are used.
These limits can be modified using the
.BI \-gt n max
options.
.nf
\fIf\fP(\fIa\fP_\fIz\fP) = (\fIc\fP(\fIa\fP_\fIz\fP) / \fIc\fP(\fIa\fP_)) if \fIc\fP(\fIa\fP_\fIz\fP) > \fIgtmax\fP
.fi
The lower counts are discounted proportional to the Good-Turing
estimate with a small correction
.I A
to account for the high-count N-grams not being discounted.
If 1 <= \fIc\fP(\fIa\fP_\fIz\fP) <= \fIgtmax\fP:
.nf
\fIA\fP = (\fIgtmax\fP + 1) \fIn\fP[\fIgtmax\fP + 1] / \fIn\fP[1]
\fIc\fP'(\fIa\fP_\fIz\fP) = (\fIc\fP(\fIa\fP_\fIz\fP) + 1) \fIn\fP[\fIc\fP(\fIa\fP_\fIz\fP) + 1] / \fIn\fP[\fIc\fP(\fIa\fP_\fIz\fP)]
\fIf\fP(\fIa\fP_\fIz\fP) = (\fIc\fP(\fIa\fP_\fIz\fP) / \fIc\fP(\fIa\fP_)) * (\fIc\fP'(\fIa\fP_\fIz\fP) / \fIc\fP(\fIa\fP_\fIz\fP) - \fIA\fP) / (1 - \fIA\fP)
.fi
The
.B \-interpolate
option has no effect in this case, only a backoff
version has been implemented, thus:
.nf
\fIp\fP(\fIa\fP_\fIz\fP) = (\fIc\fP(\fIa\fP_\fIz\fP) > 0) ? \fIf\fP(\fIa\fP_\fIz\fP) : bow(\fIa\fP_) \fIp\fP(_\fIz\fP) ; Eqn.2
bow(\fIa\fP_) = (1 - Sum_\fIZ1\fP \fIf\fP(\fIa\fP_\fIz\fP)) / (1 - Sum_\fIZ1\fP \fIf\fP(_\fIz\fP)) ; Eqn.3
.fi
.SH "FILE FORMATS"
SRILM can generate simple N-gram counts from plain text files with the
following command:
.nf
ngram-count -order \fIN\fP -text \fIfile.txt\fP -write \fIfile.cnt\fP
.fi
The
.B \-order
option determines the maximum length of the N-grams.
The file
.I file.txt
should contain one sentence per line with tokens
separated by whitespace.
The output
.I file.cnt
contains the N-gram
tokens followed by a tab and a count on each line:
.nf
\fIa\fP_\fIz\fP <tab> \fIc\fP(\fIa\fP_\fIz\fP)
.fi
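For example, a bigram count file might contain lines like the following
(the words and counts are made up; the separator is a literal tab character):
.nf
in the <tab> 312
the end <tab> 97
.fi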
A couple of warnings:
.TP
.B "Warning 1"
SRILM implicitly assumes an <s> token in the beginning of each line
and an </s> token at the end of each line and counts N-grams that start
with <s> and end with </s>.
You do not need to include these tags in
.IR file.txt .
.TP
.B "Warning 2"
When
.B \-kndiscount
or
.B \-ukndiscount
options are used, the count file contains modified counts.
Specifically, all N-grams of the maximum
order, and all N-grams that start with <s> have their regular counts
.IR c ( a _ z ),
but shorter N-grams that do not start with <s> have the number
of unique words preceding them
.IR n (* a _ z )
instead.
See the description of
.B \-kndiscount
and
.B \-ukndiscount
for details.
.PP
For most smoothing methods (except
.BR \-count-lm )
SRILM generates and uses N-gram model files in the ARPA format.
A typical command to generate a model file would be:
.nf
ngram-count -order \fIN\fP -text \fIfile.txt\fP -lm \fIfile.lm\fP
.fi
The ARPA format output
.I file.lm
will contain the following information about an N-gram on each line:
.nf
log10(\fIf\fP(\fIa\fP_\fIz\fP)) <tab> \fIa\fP_\fIz\fP <tab> log10(bow(\fIa\fP_\fIz\fP))
.fi
Based on Equation 2, the first entry represents the base 10 logarithm
of the conditional probability (logprob) for the N-gram
.IR a _ z .
This is followed by the actual words in the N-gram separated by spaces.
The last and optional entry is the base-10 logarithm of the backoff weight
for (\fIn\fP+1)-grams starting with
.IR a _ z .
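For example, a purely made-up trigram entry might read
.nf
-2.2553 <tab> in the end <tab> -0.3785
.fi
meaning that the conditional probability of "end" following "in the" has
log10 value -2.2553, and the backoff weight for 4-grams starting with
"in the end" has log10 value -0.3785.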
.TP
.B "Warning 3"
Both backoff and interpolated models are represented in the same
format.
This means interpolation is done during model building and
represented in the ARPA format with logprob and backoff weight using
equation (6).
.TP
.B "Warning 4"
Not all N-grams in the count file necessarily end up in the model file.
The options
.BR \-gtmin ,
.BR \-gt1min ,
\&...,
.B \-gt9min
specify the minimum counts
for N-grams to be included in the LM (not only for Good-Turing
discounting but for the other methods as well).
By default all unigrams and bigrams
are included, but for higher order N-grams only those with count >= 2 are
included.
Some exceptions arise, because if one N-gram is included in
the model file, all its prefix N-grams have to be included as well.
This causes some higher order 1-count N-grams to be included when using
KN discounting, which uses modified counts as described in Warning 2.
.TP
.B "Warning 5"
Not all N-grams in the model file have backoff weights.
The highest order N-grams do not need a backoff weight.
For lower order N-grams
backoff weights are only recorded for those that appear as the prefix
of a longer N-gram included in the model.
For other lower order N-grams
the backoff weight is implicitly 1 (or 0, in log representation).
.SH "SEE ALSO"
ngram(1), ngram-count(1), ngram-format(5),
.br
S. F. Chen and J. Goodman, ``An Empirical Study of Smoothing Techniques for
Language Modeling,'' TR-10-98, Computer Science Group, Harvard Univ., 1998.
.SH BUGS
Work in progress.
.SH AUTHOR
Deniz Yuret <dyuret@ku.edu.tr>,
Andreas Stolcke <stolcke@icsi.berkeley.edu>
.br
Copyright (c) 2007 SRI International


@@ -0,0 +1,745 @@
.\" $Id: srilm-faq.7,v 1.13 2019/09/09 22:35:37 stolcke Exp $
.TH SRILM-FAQ 7 "$Date: 2019/09/09 22:35:37 $" "SRILM Miscellaneous"
.SH NAME
SRILM-FAQ \- Frequently asked questions about SRI LM tools
.SH SYNOPSIS
.nf
man srilm-faq
.fi
.SH DESCRIPTION
This document tries to answer some of the most frequently asked questions
about SRILM.
.SS Build issues
.TP 4
.B A1) I ran ``make World'' but the $SRILM/bin/$MACHINE_TYPE directory is empty.
Building the binaries can fail for a variety of reasons.
Check the following:
.RS
.IP a)
Make sure the SRILM environment variable is set, or specified on the
make command line, e.g.:
.nf
make SRILM=$PWD
.fi
.IP b)
Make sure the
.B $SRILM/sbin/machine-type
script returns a valid string for the platform you are trying to build on.
Known platforms have machine-specific makefiles called
.nf
$SRILM/common/Makefile.machine.$MACHINE_TYPE
.fi
If
.B machine-type
does not work for some reason, you can override its output on the command line:
.nf
make MACHINE_TYPE=xyz
.fi
If you are building for an unsupported platform, create a new machine-specific
makefile and mail it to stolcke@speech.sri.com.
.IP c)
Make sure your compiler works and is invoked correctly.
You will probably have to edit the
.B CC
and
.B CXX
variables in the platform-specific makefile.
If you have questions about compiler invocation and best options
consult a local expert; these things differ widely between sites.
.IP d)
The default is to compile with Tcl support.
This is in fact only used for some testing binaries (which are
not built by default),
so it can be turned off if Tcl is not available or presents problems.
Edit the machine-specific makefile accordingly.
To use Tcl, locate the
.B tcl.h
header file and the library itself, and set (for example)
.nf
TCL_INCLUDE = -I/path/to/include
TCL_LIBRARY = -L/path/to/lib -ltcl8.4
.fi
To disable Tcl support set
.nf
NO_TCL = X
TCL_INCLUDE =
TCL_LIBRARY =
.fi
.IP e)
Make sure you have the C-shell (/bin/csh) installed on your system.
Otherwise you will see something like
.nf
make: /sbin/machine-type: Command not found
.fi
early in the build process.
On Ubuntu Linux and Cygwin systems "csh" or "tcsh" needs to be installed
as an optional package.
.IP f)
If you cannot get SRILM to build, save the make output to a file
.nf
make World >& make.output
.fi
and look for messages indicating errors.
If you still cannot figure out what the problem is, send the error message
and immediately preceding lines to the srilm-user list.
Also include information about your operating system ("uname -a" output)
and compiler version ("gcc -v" or equivalent for other compilers).
.RE
.TP
.B A2) The regression test outputs differ for all tests. What did I do wrong?
Most likely the binaries didn't get built or aren't executable
for some reason.
Check issue A1).
.TP
.B A3) I get differing outputs for some of the regression tests. Is that OK?
It might be.
The comparison of reference to actual output allows for small numerical
differences, but
some of the algorithms make hard decisions based on floating-point computations
that can result in different outputs as a result of different compiler
optimizations, machine floating point precisions (Intel versus IEEE format),
and math libraries.
Tests of this nature include
.BR ngram-class ,
.BR disambig ,
and
.BR nbest-rover .
When encountering differences, compare the output in the
$SRILM/test/outputs/\fITEST\fP.$MACHINE_TYPE.stdout file against the corresponding
$SRILM/test/reference/\fITEST\fP.stdout, where
.I TEST
is the name of the test that failed.
Also compare the corresponding .stderr files;
differences there usually indicate operating-system related problems.
.SS Large data and memory issues
.TP 4
.B B1) I'm getting a message saying ``Assertion `body != 0' failed.''
You are running out of memory.
See subsequent questions depending on what you are trying to do.
.IP Note:
The above message means you are running
out of "virtual" memory on your computer, which could be because of
limits in swap space, administrative resource limits, or limitations of
the machine architecture (a 32-bit machine cannot address more than
4GB no matter how many resources your system has).
Another symptom of not enough memory is that your program runs, but
very, very slowly, i.e., it is "paging" or "swapping" as it tries to
use more memory than the machine has RAM installed.
.TP
.B B2) I am trying to count N-grams in a text file and running out of memory.
Don't use
.B ngram-count
directly to count N-grams.
Instead, use the
.B make-batch-counts
and
.B merge-batch-counts
scripts described in
.BR training-scripts (1).
That way you can create N-gram counts limited only by the maximum file size
on your system.
.TP
.B B3) I am trying to build an N-gram LM and ngram-count runs out of memory.
You are running out of memory either because of the size of the N-gram counts
or because of the size of the LM being built. The following are strategies for reducing the
memory requirements for training LMs.
.RS
.IP a)
Assuming you are using Good-Turing or Kneser-Ney discounting, don't use
.B ngram-count
in "raw" form.
Instead, use the
.B make-big-lm
wrapper script described in the
.BR training-scripts (1)
man page.
.IP b)
Switch to using the "_c" or "_s" versions of the SRI binaries.
For
instructions on how to build them, see the INSTALL file.
Once built, set your executable search path accordingly, and try
.B make-big-lm
again.
.IP c)
Raise the minimum counts for N-grams included in the LM, i.e.,
the values of the options
.BR \-gt2min ,
.BR \-gt3min ,
.BR \-gt4min ,
etc.
The higher order N-grams typically get higher minimum counts.
.IP d)
Get a machine with more memory.
If you are hitting the limitations of a 32-bit machine architecture,
get a 64-bit machine and recompile SRILM to take advantage of the expanded
address space.
(The MACHINE_TYPE=i686-m64 setting is for systems based on
64-bit AMD processors, as well as recent compatibles from Intel.)
Note that 64-bit pointers will require a memory overhead in
themselves, so you will need a machine with significantly, not just a
little, more memory than 4GB.
.RE
.TP
.B B4) I am trying to apply a large LM to some data and am running out of memory.
Again, there are several strategies to reduce memory requirements.
.RS
.IP a)
Use the "_c" or "_s" versions of the SRI binaries.
See 3b) above.
.IP b)
Precompute the vocabulary of your test data and use the
.B "ngram \-limit-vocab"
option to load only the N-gram parameters relevant to your data.
This approach should allow you to use arbitrarily
large LMs provided the data is divided into small enough chunks.
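For example (the file names here are only placeholders):
.nf
ngram-count -text test.txt -write-vocab test.vocab
ngram -lm big.lm -vocab test.vocab -limit-vocab -ppl test.txt
.fi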
.IP c)
If the LM can be built on a large machine, but then is to be used on
machines with limited memory, use
.B "ngram \-prune"
to remove the less important parameters of the model.
This usually gives huge size reductions with relatively modest performance
degradation.
The tradeoff is adjustable by varying the pruning parameter.
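For example (the file names and the threshold value are only placeholders):
.nf
ngram -lm big.lm -prune 1e-8 -write-lm pruned.lm
.fi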
.RE
.TP
.B B5) How can I reduce the time it takes to load large LMs into memory?
The techniques described in 4b) and 4c) above also reduce the load time
of the LM.
Additional steps to try are:
.RS
.IP a)
Convert the LM into binary format, using
.nf
ngram -order \fIN\fP -lm \fIOLDLM\fP -write-bin-lm \fINEWLM\fP
.fi
(This is currently only supported for N-gram-based LMs.)
You can also generate the LM directly in binary format, using
.nf
ngram-count ... -lm \fINEWLM\fP -write-binary-lm
.fi
The resulting
.I NEWLM
file (which should not be compressed) can be used
in place of a textual LM file with all compiled SRILM tools
(but not with
.BR lm-scripts (1)).
The format is machine-independent, i.e., it can be read on machines with
different word sizes or byte-order.
Loading binary LMs is faster because
(1) it reduces the overhead of parsing the input data, and
(2) in combination with
.B \-limit-vocab
(see 4b)
it is much faster to skip sections of the LM that are out-of-vocabulary.
.IP Note:
There is also a binary format for N-gram counts.
It can be generated using
.nf
ngram-count -write-binary \fICOUNTS\fP
.fi
and has similar advantages as binary LM files.
.IP b)
Start a "probability server" that loads the LM ahead of time, and
then have "LM clients" query the server instead of computing the
probabilities themselves.
.br
The server is started on a machine named
.I HOST
using
.nf
ngram \fILMOPTIONS\fP -server-port \fIP\fP &
.fi
where
.I P
is an integer < 2^16 that specifies the TCP/IP port number the
server will listen on, and
.I LMOPTIONS
are whatever options necessary to define the LM to be used.
.br
One or more clients (programs such as
.BR ngram (1),
.BR disambig (1),
.BR lattice-tool (1))
can then query the server using the options
.nf
-use-server \fIP\fP@\fIHOST\fP -cache-served-ngrams
.fi
instead of the usual "-lm \fIFILE\fP".
The
.B \-cache-served-ngrams
option is not required but often speeds up performance dramatically by
saving the results of server lookups in the client for reuse.
Server-based LMs may be combined with file-based LMs by interpolation;
see
.BR ngram (1)
for details.
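For example, assuming a placeholder LM file big.lm, port 2525, and server
host lmhost:
.nf
ngram -lm big.lm -server-port 2525 &
ngram -use-server 2525@lmhost -cache-served-ngrams -ppl test.txt
.fi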
.RE
.TP
.B B6) How can I use the Google Web N-gram corpus to build an LM?
Google has made a corpus of 5-grams extracted from 1 tera-words of web data
available via LDC.
However, the data is too large to build a standard backoff N-gram, even
using the techniques described above.
Instead, we recommend a "count-based" LM smoothed with deleted interpolation.
Such an LM computes probabilities on the fly from the counts, of which only
the subsets needed for a given test set need to be loaded into memory.
LM construction proceeds in the following steps:
.RS
.IP a)
Make sure you have built SRI binaries either for a 64-bit machine
(e.g., MACHINE_TYPE=i686-m64 OPTION=_c) or using 64-bit counts (OPTION=_l).
This is necessary because the data contains N-gram counts exceeding
the range of 32-bit integers.
Be sure to invoke all commands below using the path to the appropriate
binary executable directory.
.IP b)
Prepare a mapping file for some vocabulary mismatches and call it
.BR google.aliases :
.nf
<S> <s>
</S> </s>
<UNK> <unk>
.fi
.IP c)
Prepare an initial count-LM parameter file
.BR google.countlm.0 :
.nf
order 5
vocabsize 13588391
totalcount 1024908267229
countmodulus 40
mixweights 15
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
google-counts \fIPATH\fP
.fi
where
.I PATH
points to the location of the Google N-grams, i.e., the directory containing
subdirectories "1gms", "2gms", etc.
Note that the
.B vocabsize
and
.B totalcount
were obtained from the 1gms/vocab.gz and 1gms/total files, respectively.
(Check that they match and modify as needed.)
For an explanation of the parameters see the
.BR ngram (1)
.B \-count-lm
option.
.IP d)
Prepare a text file
.B tune.text
containing data for estimating the mixture weights.
This data should be representative of, but different from, your test data.
Compute the vocabulary of this data using
.nf
ngram-count -text tune.text -write-vocab tune.vocab
.fi
The vocabulary size should not exceed a few thousand to keep memory
requirements in the following steps manageable.
.IP e)
Estimate the mixture weights:
.nf
ngram-count -debug 1 -order 5 -count-lm \\
-text tune.text -vocab tune.vocab \\
-vocab-aliases google.aliases \\
-limit-vocab \\
-init-lm google.countlm.0 \\
-em-iters 100 \\
-lm google.countlm
.fi
This will write the estimated LM to
.BR google.countlm .
The output will be identical to the initial LM file, except for the
updated interpolation weights.
.IP f)
Prepare a test data file
.BR test.text ,
and its vocabulary
.B test.vocab
as in Step d) above.
Then apply the LM to the test data:
.nf
ngram -debug 2 -order 5 -count-lm \\
-lm google.countlm \\
-vocab test.vocab \\
-vocab-aliases google.aliases \\
-limit-vocab \\
-ppl test.text > test.ppl
.fi
The perplexity output will appear in
.BR test.ppl .
.IP g)
Note that the Google data uses mixed case spellings.
To apply the LM to lowercase data one needs to prepare a much more
extensive vocabulary mapping table for the
.B \-vocab-aliases
option, namely, one that maps all
upper- and mixed-case spellings to lowercase strings.
This mapping file should be restricted to the words appearing in
.B tune.text
and
.BR test.text ,
respectively, to avoid defeating the effect of
.B \-limit-vocab .
.RE
.SS "Smoothing issues"
.TP 4
.B C1) What is smoothing and discounting all about?
.I Smoothing
refers to methods that assign probabilities to events (N-grams) that
do not occur in the training data.
According to a pure maximum-likelihood estimator these events would have
probability zero, which is plainly wrong since previously unseen events
in general do occur in independent test data.
Because the probability mass is redistributed away from the seen events
toward the unseen events the resulting model is "smoother" (closer to uniform)
than the ML model.
.I Discounting
refers to the approach used by many smoothing methods of adjusting the
empirical counts of seen events downwards.
The ML estimator (count divided by total number of events) is then applied
to the discounted count, resulting in a smoother estimate.
.TP
.B C2) What smoothing methods are there?
There are many, and SRILM implements a fairly large selection of the
most popular ones.
A detailed discussion of these is found in a separate document,
.BR ngram-discount (7).
.TP
.B C3) Why am I getting errors or warnings from the smoothing method I'm using?
The Good-Turing and Kneser-Ney smoothing methods rely on statistics called
"count-of-counts", the number of words occurring one, twice, three times, etc.
The formulae for these methods become undefined if the counts-of-counts
are zero, or not strictly decreasing.
Some conditions are fatal (such as when the count of singleton words is zero),
others lead to less smoothing (and warnings).
To avoid these problems, check for the following possibilities:
.RS
.IP a)
The data could be very sparse, i.e., the training corpus very small.
Try using the Witten-Bell discounting method.
.IP b)
The vocabulary could be very small, such as when training an LM based on
characters or parts-of-speech.
Smoothing is less of an issue in those cases, and the Witten-Bell method
should work well.
.IP c)
The data was manipulated in some way, or artificially generated.
For example, duplicating data eliminates the odd-numbered counts-of-counts.
.IP d)
The vocabulary is limited during counts collection using the
.BR ngram-count
.B \-vocab
option, with the effect that many low-frequency N-grams are eliminated.
The proper approach is to compute smoothing parameters on the full vocabulary.
This happens automatically in the
.B make-big-lm
wrapper script, which is preferable to direct use of
.BR ngram-count
for other reasons (see issue B3-a above).
.IP e)
You are estimating an LM from N-gram counts that have been truncated beforehand,
e.g., by removing singleton events.
If you cannot go back to the original data and recompute the counts
there is a heuristic to extrapolate low counts-of-counts from higher ones.
The heuristic is invoked automatically (and an informational message is output)
when
.B make-big-lm
is used to estimate LMs with Kneser-Ney smoothing.
For details see the paper by W. Wang et al. in ASRU-2007, listed under
"SEE ALSO".
.RE
.TP
.B C4) How does discounting work in the case of unigrams?
First, unigrams are discounted in the same way as higher-order
N-grams, using the specified discounting method.
The probability mass freed up in this way
is then either spread evenly over all word types
that would otherwise have zero probability (this is essentially
simulating a backoff to zero-grams), or
if all unigrams already have non-zero probabilities, the
left-over mass is added to
.I all
unigrams.
In either case all unigram probabilities will sum to 1.
An informational message from
.B ngram-count
will tell which case applies.
.TP
.B C5) Why do I get a different number of trigrams when building a 4-gram model compared to just a trigram model?
This can happen when Kneser-Ney smoothing is used and the trigram cut-off
.RB ( \-gt3min )
is greater than 1 (as with the default, 2).
The count cutoffs are applied to the modified counts generated as part of KN smoothing,
so in the case of a 4-gram model the trigram counts are modified and the set of N-grams above the cutoff changes.
.SS "Out-of-vocabulary, zeroprob, and `unknown' words"
.TP 4
.B D1) What is the perplexity of an OOV (out of vocabulary) word?
By default any word not observed in the training data is considered
OOV, and OOV words are silently ignored by
.BR ngram (1)
during perplexity (ppl) calculation.
For example:
.nf
$ ngram-count -text turkish.train -lm turkish.lm
$ ngram -lm turkish.lm -ppl turkish.test
file turkish.test: 61031 sentences, 1000015 words, 34153 OOVs
0 zeroprobs, logprob= -3.20177e+06 ppl= 1311.97 ppl1= 2065.09
.fi
The statistics printed in the last two lines have the following meanings:
.RS
.TP
.B "34153 OOVs"
This is the number of unknown word tokens, i.e. tokens
that appear in
.B turkish.test
but not in
.B turkish.train
from which
.B turkish.lm
was generated.
.TP
.B "logprob= -3.20177e+06"
This gives us the total logprob ignoring the 34153 unknown word tokens.
The logprob does include the probabilities
assigned to </s> tokens which are introduced by
.BR ngram-count (1).
Thus the total number of tokens which this logprob is based on is
.nf
words - OOVs + sentences = 1000015 - 34153 + 61031
.fi
.TP
.B "ppl = 1311.97"
This gives us the geometric average of 1/probability of
each token, i.e., perplexity.
The exact expression is:
.nf
ppl = 10^(-logprob / (words - OOVs + sentences))
.fi
.TP
.B "ppl1 = 2065.09"
This gives us the average perplexity per word excluding the </s> tokens.
The exact expression is:
.nf
ppl1 = 10^(-logprob / (words - OOVs))
.fi
.RE
You can verify these numbers by running the
.B ngram
program with the
.B "\-debug 2"
option, which gives the probability assigned to each token.
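For instance, plugging the numbers from the example above into these expressions
(the reported logprob is rounded, so the results agree only approximately):
.nf
ppl  = 10^(3.20177e+06 / (1000015 - 34153 + 61031)) = 10^(3201770/1026893) ~= 1312
ppl1 = 10^(3.20177e+06 / (1000015 - 34153)) = 10^(3201770/965862) ~= 2065
.fi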
.TP
.B D2) What happens when the OOV word is in the context of an N-gram?
Exact details depend on the discounting algorithm used, but typically
the backed-off probability from a lower order N-gram is used. If the
.B \-unk
option is used as explained below, an <unk> token is assumed to
take the place of the OOV word and no back-off may be necessary
if a corresponding N-gram containing <unk> is found in the LM.
.TP
.B D3) Isn't it wrong to assign 0 logprob to OOV words?
That depends on the application.
If you are comparing multiple language
models which all consider the same set of words as OOV, it may be OK to
ignore OOV words.
Note that perplexity comparisons are only ever meaningful
if the vocabularies of all LMs are the same.
Therefore, to compare LMs with different sets of OOV words
(such as when using different tokenization strategies for morphologically
complex languages) it becomes important
to take into account the true cost of the OOV words, or to model all words,
including OOVs.
.TP
.B D4) How do I take into account the true cost of the OOV words?
A simple strategy is to "explode" the OOV words, i.e., split them into
characters in the training and test data.
Typically words that appear more than once in the training data are
considered to be vocabulary words.
All other words are split into their characters and the
individual characters are considered tokens.
Assuming that all characters occur at least once in the training data there
will be no OOV tokens in the test data.
Note that this strategy changes the number of tokens in the data set,
so even though the logprob is meaningful, be careful when reporting ppl results.
.TP
.B D5) What if I want to model the OOV words explicitly?
Maybe a better strategy is to have a separate "letter" model for OOV words.
This can be easily created using SRILM by using a training
file listing the OOV words one per line with their characters
separated by spaces.
The
.B ngram-count
options
.B \-ukndiscount
and
.B "\-order 7"
seem to work well for this purpose.
The final logprob results are obtained in two steps.
First do regular training and testing on your data using
.B \-vocab
and
.B \-unk
options.
The resulting logprob will include the cost of the vocabulary words and an
<unk> token for each OOV word.
Then apply the letter model to each OOV word in the test set.
Add the logprobs.
Here is an example:
.nf
# Determine vocabulary:
ngram-count -text turkish.train -write-order 1 -write turkish.train.1cnt
awk '$2>1' turkish.train.1cnt | cut -f1 | sort > turkish.train.vocab
awk '$2==1' turkish.train.1cnt | cut -f1 | sort > turkish.train.oov
# Word model:
ngram-count -kndiscount -interpolate -order 4 -vocab turkish.train.vocab -unk -text turkish.train -lm turkish.train.model
ngram -order 4 -unk -lm turkish.train.model -ppl turkish.test > turkish.test.ppl
# Letter model:
perl -C -lne 'print join(" ", split(""))' turkish.train.oov > turkish.train.oov.split
ngram-count -ukndiscount -interpolate -order 7 -text turkish.train.oov.split -lm turkish.train.oov.model
perl -pe 's/\\s+/\\n/g' turkish.test | sort > turkish.test.words
comm -23 turkish.test.words turkish.train.vocab > turkish.test.oov
perl -C -lne 'print join(" ", split(""))' turkish.test.oov > turkish.test.oov.split
ngram -order 7 -ppl turkish.test.oov.split -lm turkish.train.oov.model > turkish.test.oov.ppl
# Add the logprobs in turkish.test.ppl and turkish.test.oov.ppl.
.fi
Again, perplexities are not directly meaningful as computed by SRILM, but you
can recompute them by hand using the combined logprob value, and the number of
original word tokens in the test set.
.TP
.B D6) What are zeroprob words and when do they occur?
In-vocabulary words that get zero probability are counted as
"zeroprobs" in the ppl output.
As with OOV words, they are excluded from the perplexity
computation since otherwise the perplexity value would be infinity.
There are three reasons why zeroprobs could occur in a
closed vocabulary setting (the default for SRILM):
.RS
.IP a)
If the same vocabulary is used at test time as was used during
training, and smoothing is enabled, then the occurrence of zeroprobs
indicates an anomalous condition and, possibly, a broken language model.
.IP b)
If smoothing has been disabled (e.g., by using the option
.BR "\-cdiscount 0" ),
then the LM will use maximum likelihood estimates for
the N-grams and then any unseen N-gram is a zeroprob.
.IP c)
If a different vocabulary file is specified at test time than
the one used in training, then the definition of what counts as an OOV
will change.
In particular, a word that wasn't seen in the training data (but is in the
test vocabulary) will
.I not
be mapped to
.B <unk>
and, therefore, not
count as an OOV in the perplexity computation.
However, it will still get zero probability and, therefore, be tallied
as a zeroprob.
.RE
.TP
.B D7) What is the point of using the \fB<unk>\fP token?
Using
.B <unk>
is a practical convenience employed by SRILM.
Words not in the specified vocabulary are mapped to
.BR <unk> ,
which is equivalent to performing the same mapping
in a data pre-processing step outside of SRILM.
Other than that,
for both LM estimation and evaluation purposes,
.B <unk>
is treated like any other word.
(In particular, in the computation of discounted probabilities
there is no special handling of
.BR <unk> .)
.TP
.B D8) So how do I train an open-vocabulary LM with \fB<unk>\fP?
First, make sure to use the
.B ngram-count
.B \-unk
option, which simply indicates that the
.B <unk>
word should be included in the LM vocabulary, as required for an
open-vocabulary LM.
Without this option, N-grams containing
.B <unk>
would simply be discarded.
An "open vocabulary" LM is simply one that contains
.BR <unk> ,
and can therefore (by virtue of the mapping of OOVs to
.BR <unk> )
assign a non-zero probability to them.
Next, we need to ensure there are actual occurrences of
.B <unk>
N-grams
in the training data so we can obtain meaningful probability estimates
for them
(otherwise
.B <unk>
would only get probability via unigram discounting; see item C4).
To get a proper estimate
of the
.B <unk>
probability, we need to explicitly specify a vocabulary that is not
a superset of the training data.
One way to do that is to extract the vocabulary from an independent
data set, or to include only words with some minimum count (greater than 1)
in the training data.
.TP
.B D9) Doesn't ngram-count \-addsmooth deal with OOV words by adding a constant to occurrence counts?
No, all smoothing is applied when building the LM at training time,
so it must use the
.B <unk>
mechanism to assign probability to words that are first seen in the
test data.
Furthermore, even add-constant smoothing requires a fixed, finite
vocabulary to compute the denominator of its estimator.
.SH "SEE ALSO"
ngram(1), ngram-count(1), training-scripts(1), ngram-discount(7).
.br
$SRILM/INSTALL
.br
http://www.speech.sri.com/projects/srilm/mail-archive/srilm-user/
.br
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13
.br
W. Wang, A. Stolcke, & J. Zheng,
Reranking Machine Translation Hypotheses With Structured and Web-based Language Models. Proc. IEEE Automatic Speech Recognition and Understanding Workshop, pp. 159-164, Kyoto, 2007.
http://www.speech.sri.com/cgi-bin/run-distill?papers/asru2007-mt-lm.ps.gz
.SH BUGS
This document is work in progress.
.SH AUTHOR
Andreas Stolcke <andreas.stolcke@microsoft.com>,
Deniz Yuret <dyuret@ku.edu.tr>,
Nitin Madnani <nmadnani@umiacs.umd.edu>
.br
Copyright (c) 2007\-2010 SRI International
.br
Copyright (c) 2011\-2017 Andreas Stolcke
.br
Copyright (c) 2011\-2017 Microsoft Corp.