competition update
language_model/srilm-1.7.3/man/man7/TEMPLATE.7 (new file, 15 lines)
@@ -0,0 +1,15 @@
.\" $Id: TEMPLATE.7,v 1.2 2019/09/09 22:35:38 stolcke Exp $
.TH XXX 7 "$Date: 2019/09/09 22:35:38 $" "SRILM Miscellaneous"
.SH NAME
XXX \- XXX
.SH SYNOPSIS
.nf
.B XXX
.fi
.SH DESCRIPTION
.SH "SEE ALSO"
.SH BUGS
.SH AUTHOR
Andreas Stolcke <stolcke@icsi.berkeley.edu>
.br
Copyright (c) 2007 SRI International
language_model/srilm-1.7.3/man/man7/ngram-discount.7 (new file, 672 lines)
@@ -0,0 +1,672 @@
|
||||
.\" $Id: ngram-discount.7,v 1.5 2019/09/09 22:35:37 stolcke Exp $
|
||||
.TH ngram-discount 7 "$Date: 2019/09/09 22:35:37 $" "SRILM Miscellaneous"
|
||||
.SH NAME
|
||||
ngram-discount \- notes on the N-gram smoothing implementations in SRILM
|
||||
.SH NOTATION
|
||||
.TP 10
|
||||
.IR a _ z
|
||||
An N-gram where
|
||||
.I a
|
||||
is the first word,
|
||||
.I z
|
||||
is the last word, and "_" represents 0 or more words in between.
|
||||
.TP
|
||||
.IR p ( a _ z )
|
||||
The estimated conditional probability of the \fIn\fPth word
|
||||
.I z
|
||||
given the first \fIn\fP-1 words
|
||||
.RI ( a _)
|
||||
of an N-gram.
|
||||
.TP
|
||||
.IR a _
|
||||
The \fIn\fP-1 word prefix of the N-gram
|
||||
.IR a _ z .
|
||||
.TP
|
||||
.RI _ z
|
||||
The \fIn\fP-1 word suffix of the N-gram
|
||||
.IR a _ z .
|
||||
.TP
|
||||
.IR c ( a _ z )
|
||||
The count of N-gram
|
||||
.IR a _ z
|
||||
in the training data.
|
||||
.TP
|
||||
.IR n (*_ z )
|
||||
The number of unique N-grams that match a given pattern.
|
||||
``(*)'' represents a wildcard matching a single word.
|
||||
.TP
|
||||
.IR n1 , n [1]
|
||||
The number of unique N-grams with count = 1.
|
||||
.SH DESCRIPTION
|
||||
.PP
|
||||
N-gram models try to estimate the probability of a word
|
||||
.I z
|
||||
in the context of the previous \fIn\fP-1 words
|
||||
.RI ( a _),
|
||||
i.e.,
|
||||
.IR Pr ( z | a _).
|
||||
We will
|
||||
denote this conditional probability using
|
||||
.IR p ( a _ z )
|
||||
for convenience.
|
||||
One way to estimate
|
||||
.IR p ( a _ z )
|
||||
is to look at the number of times word
|
||||
.I z
|
||||
has followed the previous \fIn\fP-1 words
|
||||
.RI ( a _):
|
||||
.nf
|
||||
|
||||
(1) \fIp\fP(\fIa\fP_\fIz\fP) = \fIc\fP(\fIa\fP_\fIz\fP)/\fIc\fP(\fIa\fP_)
|
||||
|
||||
.fi
|
||||
This is known as the maximum likelihood (ML) estimate.
|
||||
Unfortunately it does not work very well because it assigns zero probability to
|
||||
N-grams that have not been observed in the training data.
|
||||
To avoid the zero probabilities, we take some probability mass from the observed
|
||||
N-grams and distribute it to unobserved N-grams.
|
||||
Such redistribution is known as smoothing or discounting.
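.PP
As a concrete illustration (a minimal Python sketch, not SRILM code;
the toy data and function names are invented), the following computes
the ML estimate of Equation 1 and shows how unseen N-grams receive
probability zero:
.nf

# Hypothetical illustration of Eqn. 1; not part of SRILM.
from collections import defaultdict

trigram_counts = defaultdict(int)
context_counts = defaultdict(int)

for sent in [["<s>", "a", "b", "c", "</s>"], ["<s>", "a", "b", "d", "</s>"]]:
    for i in range(len(sent) - 2):
        a_, z = tuple(sent[i:i+2]), sent[i+2]
        trigram_counts[(a_, z)] += 1
        context_counts[a_] += 1

def p_ml(a_, z):
    # Eqn. 1: c(a_z) / c(a_); zero for any unseen N-gram.
    if context_counts[a_] == 0:
        return 0.0
    return trigram_counts[(a_, z)] / context_counts[a_]

print(p_ml(("a", "b"), "c"))   # 0.5
print(p_ml(("a", "b"), "e"))   # 0.0, the problem smoothing addresses
.fi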
|
||||
.PP
|
||||
Most existing smoothing algorithms can be described by the following equation:
|
||||
.nf
|
||||
|
||||
(2) \fIp\fP(\fIa\fP_\fIz\fP) = (\fIc\fP(\fIa\fP_\fIz\fP) > 0) ? \fIf\fP(\fIa\fP_\fIz\fP) : bow(\fIa\fP_) \fIp\fP(_\fIz\fP)
|
||||
|
||||
.fi
|
||||
If the N-gram
|
||||
.IR a _ z
|
||||
has been observed in the training data, we use the
|
||||
distribution
|
||||
.IR f ( a _ z ).
|
||||
Typically
|
||||
.IR f ( a _ z )
|
||||
is discounted to be less than
|
||||
the ML estimate so we have some leftover probability for the
|
||||
.I z
|
||||
words unseen in the context
|
||||
.RI ( a _).
|
||||
Different algorithms mainly differ on how
|
||||
they discount the ML estimate to get
|
||||
.IR f ( a _ z ).
|
||||
.PP
|
||||
If the N-gram
|
||||
.IR a _ z
|
||||
has not been observed in the training data, we use
|
||||
the lower order distribution
|
||||
.IR p (_ z ).
|
||||
If the context has never been
|
||||
observed (\fIc\fP(\fIa\fP_) = 0),
|
||||
we can use the lower order distribution directly (bow(\fIa\fP_) = 1).
|
||||
Otherwise we need to compute a backoff weight (bow) to
|
||||
make sure probabilities are normalized:
|
||||
.nf
|
||||
|
||||
Sum_\fIz\fP \fIp\fP(\fIa\fP_\fIz\fP) = 1
|
||||
|
||||
.fi
|
||||
.PP
|
||||
Let
|
||||
.I Z
|
||||
be the set of all words in the vocabulary,
|
||||
.I Z0
|
||||
be the set of all words with \fIc\fP(\fIa\fP_\fIz\fP) = 0, and
|
||||
.I Z1
|
||||
be the set of all words with \fIc\fP(\fIa\fP_\fIz\fP) > 0.
|
||||
Given
|
||||
.IR f ( a _ z ),
|
||||
.RI bow( a _)
|
||||
can be determined as follows:
|
||||
.nf
|
||||
|
||||
(3) Sum_\fIZ\fP \fIp\fP(\fIa\fP_\fIz\fP) = 1
|
||||
Sum_\fIZ1\fP \fIf\fP(\fIa\fP_\fIz\fP) + Sum_\fIZ0\fP bow(\fIa\fP_) \fIp\fP(_\fIz\fP) = 1
|
||||
bow(\fIa\fP_) = (1 - Sum_\fIZ1\fP \fIf\fP(\fIa\fP_\fIz\fP)) / Sum_\fIZ0\fP \fIp\fP(_\fIz\fP)
|
||||
= (1 - Sum_\fIZ1\fP \fIf\fP(\fIa\fP_\fIz\fP)) / (1 - Sum_\fIZ1\fP \fIp\fP(_\fIz\fP))
|
||||
= (1 - Sum_\fIZ1\fP \fIf\fP(\fIa\fP_\fIz\fP)) / (1 - Sum_\fIZ1\fP \fIf\fP(_\fIz\fP))
|
||||
|
||||
.fi
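.PP
To make the normalization concrete, here is a minimal Python sketch
(hypothetical; the data structures are invented and this is not how SRILM
stores its models) that computes bow(\fIa\fP_) from already-discounted
estimates, following the last line of Equation 3:
.nf

# Sketch of Eqn. 3. f_high maps (context, word) to discounted f(a_z)
# for observed N-grams only; f_low maps word to the lower-order f(_z).
def backoff_weight(a_, f_high, f_low):
    z1 = [z for (ctx, z) in f_high if ctx == a_]     # words seen after a_
    num = 1.0 - sum(f_high[(a_, z)] for z in z1)
    den = 1.0 - sum(f_low[z] for z in z1)
    return num / den if den > 0 else 1.0

f_high = {(("a",), "b"): 0.625, (("a",), "c"): 0.125}
f_low = {"b": 0.5, "c": 0.25, "d": 0.25}
print(backoff_weight(("a",), f_high, f_low))         # 0.25 / 0.25 = 1.0
.fi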
|
||||
.PP
|
||||
Smoothing is generally done in one of two ways.
|
||||
The backoff models compute
|
||||
.IR p ( a _ z )
|
||||
based on the N-gram counts
|
||||
.IR c ( a _ z )
|
||||
when \fIc\fP(\fIa\fP_\fIz\fP) > 0, and
|
||||
only consider lower order counts
|
||||
.IR c (_ z )
|
||||
when \fIc\fP(\fIa\fP_\fIz\fP) = 0.
|
||||
Interpolated models take lower order counts into account when
|
||||
\fIc\fP(\fIa\fP_\fIz\fP) > 0 as well.
|
||||
A common way to express an interpolated model is:
|
||||
.nf
|
||||
|
||||
(4) \fIp\fP(\fIa\fP_\fIz\fP) = \fIg\fP(\fIa\fP_\fIz\fP) + bow(\fIa\fP_) \fIp\fP(_\fIz\fP)
|
||||
|
||||
.fi
|
||||
Where \fIg\fP(\fIa\fP_\fIz\fP) = 0 when \fIc\fP(\fIa\fP_\fIz\fP) = 0
|
||||
and it is discounted to be less than
|
||||
the ML estimate when \fIc\fP(\fIa\fP_\fIz\fP) > 0
|
||||
to reserve some probability mass for
|
||||
the unseen
|
||||
.I z
|
||||
words.
|
||||
Given
|
||||
.IR g ( a _ z ),
|
||||
.RI bow( a _)
|
||||
can be determined as follows:
|
||||
.nf
|
||||
|
||||
(5) Sum_\fIZ\fP \fIp(\fP\fIa_\fP\fIz)\fP = 1
|
||||
Sum_\fIZ1\fP \fIg(\fP\fIa_\fP\fIz\fP) + Sum_\fIZ\fP bow(\fIa\fP_) \fIp\fP(_\fIz\fP) = 1
|
||||
bow(\fIa\fP_) = 1 - Sum_\fIZ1\fP \fIg\fP(\fIa\fP_\fIz\fP)
|
||||
|
||||
.fi
|
||||
.PP
|
||||
An interpolated model can also be expressed in the form of equation
|
||||
(2), which is the way it is represented in the ARPA format model files
|
||||
in SRILM:
|
||||
.nf
|
||||
|
||||
(6) \fIf\fP(\fIa\fP_\fIz\fP) = \fIg\fP(\fIa\fP_\fIz\fP) + bow(\fIa\fP_) \fIp\fP(_\fIz\fP)
|
||||
\fIp\fP(\fIa\fP_\fIz\fP) = (\fIc\fP(\fIa\fP_\fIz\fP) > 0) ? \fIf\fP(\fIa\fP_\fIz\fP) : bow(\fIa\fP_) \fIp\fP(_\fIz\fP)
|
||||
|
||||
.fi
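.PP
As an illustration of Equation 6, the following Python fragment
(a hypothetical sketch with invented names, not SRILM code) folds the
lower-order probability into the interpolated estimate to obtain the
\fIf\fP values that would be stored in an ARPA format file:
.nf

# Sketch of Eqn. 6: f(a_z) = g(a_z) + bow(a_) * p(_z), for observed N-grams.
def to_backoff_form(g, bow, p_low):
    return {(ctx, z): gv + bow[ctx] * p_low[z] for (ctx, z), gv in g.items()}

g = {(("a",), "b"): 0.25, (("a",), "c"): 0.125}      # interpolated weights
bow = {("a",): 0.5}
p_low = {"b": 0.5, "c": 0.25, "d": 0.25}
print(to_backoff_form(g, bow, p_low))
# {(('a',), 'b'): 0.5, (('a',), 'c'): 0.25}
.fi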
|
||||
.PP
|
||||
Most algorithms in SRILM have both backoff and interpolated versions.
|
||||
Empirically, interpolated algorithms usually do better than the backoff
|
||||
ones, and Kneser-Ney does better than others.
|
||||
|
||||
.SH OPTIONS
|
||||
.PP
|
||||
This section describes the formulation of each discounting option in
|
||||
.BR ngram-count (1).
|
||||
After giving the motivation for each discounting method,
|
||||
we will give expressions for
|
||||
.IR f ( a _ z )
|
||||
and
|
||||
.RI bow( a _)
|
||||
of Equation 2 in terms of the counts.
|
||||
Note that some counts may not be included in the model
|
||||
file because of the
|
||||
.B \-gtmin
|
||||
options; see Warning 4 in the next section.
|
||||
.PP
|
||||
Backoff versions are the default but interpolated versions of most
|
||||
models are available using the
|
||||
.B \-interpolate
|
||||
option.
|
||||
In this case we will express
|
||||
.IR g ( a _ z )
|
||||
and
|
||||
.RI bow( a _)
|
||||
of Equation 4 in terms of the counts as well.
|
||||
Note that the ARPA format model files store the interpolated
|
||||
models and the backoff models the same way using
|
||||
.IR f ( a _ z )
|
||||
and
|
||||
.RI bow( a _);
|
||||
see Warning 3 in the next section.
|
||||
The conversion between backoff and
|
||||
interpolated formulations is given in Equation 6.
|
||||
.PP
|
||||
The discounting options may be followed by a digit (1-9) to indicate
that only specific N-gram orders should be affected.
|
||||
See
|
||||
.BR ngram-count (1)
|
||||
for more details.
|
||||
.TP
|
||||
.BI \-cdiscount " D"
|
||||
Ney's absolute discounting using
|
||||
.I D
|
||||
as the constant to subtract.
|
||||
.I D
|
||||
should be between 0 and 1.
|
||||
If
|
||||
.I Z1
|
||||
is the set
|
||||
of all words
|
||||
.I z
|
||||
with \fIc\fP(\fIa\fP_\fIz\fP) > 0:
|
||||
.nf
|
||||
|
||||
\fIf\fP(\fIa\fP_\fIz\fP) = (\fIc\fP(\fIa\fP_\fIz\fP) - \fID\fP) / \fIc\fP(\fIa\fP_)
|
||||
\fIp\fP(\fIa\fP_\fIz\fP) = (\fIc\fP(\fIa\fP_\fIz\fP) > 0) ? \fIf\fP(\fIa\fP_\fIz\fP) : bow(\fIa\fP_) \fIp\fP(_\fIz\fP) ; Eqn.2
|
||||
bow(\fIa\fP_) = (1 - Sum_\fIZ1\fP f(\fIa\fP_\fIz\fP)) / (1 - Sum_\fIZ1\fP \fIf\fP(_\fIz\fP)) ; Eqn.3
|
||||
|
||||
.fi
|
||||
With the
|
||||
.B \-interpolate
|
||||
option we have:
|
||||
.nf
|
||||
|
||||
\fIg\fP(\fIa\fP_\fIz\fP) = max(0, \fIc\fP(\fIa\fP_\fIz\fP) - \fID\fP) / \fIc\fP(\fIa\fP_)
|
||||
\fIp\fP(\fIa\fP_\fIz\fP) = \fIg\fP(\fIa\fP_\fIz\fP) + bow(\fIa\fP_) \fIp\fP(_\fIz\fP) ; Eqn.4
|
||||
bow(\fIa\fP_) = 1 - Sum_\fIZ1\fP \fIg\fP(\fIa\fP_\fIz\fP) ; Eqn.5
|
||||
= \fID\fP \fIn\fP(\fIa\fP_*) / \fIc\fP(\fIa\fP_)
|
||||
|
||||
.fi
|
||||
The suggested discount factor is:
|
||||
.nf
|
||||
|
||||
\fID\fP = \fIn1\fP / (\fIn1\fP + 2*\fIn2\fP)
|
||||
|
||||
.fi
|
||||
where
|
||||
.I n1
|
||||
and
|
||||
.I n2
|
||||
are the total number of N-grams with exactly one and
|
||||
two counts, respectively.
|
||||
Different discounting constants can be
|
||||
specified for different N-gram orders using options
|
||||
.BR \-cdiscount1 ,
|
||||
.BR \-cdiscount2 ,
|
||||
etc.
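The following Python sketch (hypothetical; not the SRILM implementation)
pulls the interpolated formulas above together:
.nf

# Sketch of interpolated absolute discounting (Eqns. 4 and 5).
from collections import Counter

def absolute_discount(counts, p_low, D):
    # counts: Counter over (context, word); p_low: lower-order p(_z).
    ctx_totals = Counter()
    for (ctx, z), c in counts.items():
        ctx_totals[ctx] += c                             # c(a_)
    p = {}
    for (ctx, z), c in counts.items():
        n_ctx = sum(1 for (c2, w) in counts if c2 == ctx)    # n(a_*)
        g = max(0, c - D) / ctx_totals[ctx]
        bow = D * n_ctx / ctx_totals[ctx]                # = 1 - Sum_Z1 g(a_z)
        p[(ctx, z)] = g + bow * p_low[z]                 # Eqn. 4
    return p

# D would typically be set to n1 / (n1 + 2*n2) as suggested above.
counts = Counter({(("the",), "cat"): 3, (("the",), "dog"): 1})
print(absolute_discount(counts, {"cat": 0.6, "dog": 0.4}, D=0.5))
.fi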
|
||||
.TP
|
||||
.BR \-kndiscount " and " \-ukndiscount
|
||||
Kneser-Ney discounting.
|
||||
This is similar to absolute discounting in
|
||||
that the discounted probability is computed by subtracting a constant
|
||||
.I D
|
||||
from the N-gram count.
|
||||
The options
|
||||
.B \-kndiscount
|
||||
and
|
||||
.B \-ukndiscount
|
||||
differ as to how this constant is computed.
|
||||
.br
|
||||
The main idea of Kneser-Ney is to use a modified probability estimate
|
||||
for lower order N-grams used for backoff.
|
||||
Specifically, the modified
|
||||
probability for a lower order N-gram is taken to be proportional to the
|
||||
number of unique words that precede it in the training data.
|
||||
With discounting and normalization we get:
|
||||
.nf
|
||||
|
||||
\fIf\fP(\fIa\fP_\fIz\fP) = (\fIc\fP(\fIa\fP_\fIz\fP) - \fID0\fP) / \fIc\fP(\fIa\fP_) ;; for highest order N-grams
|
||||
\fIf\fP(_\fIz\fP) = (\fIn\fP(*_\fIz\fP) - \fID1\fP) / \fIn\fP(*_*) ;; for lower order N-grams
|
||||
|
||||
.fi
|
||||
where the
|
||||
.IR n (*_ z )
|
||||
notation represents the number of unique N-grams that
|
||||
match a given pattern with (*) used as a wildcard for a single word.
|
||||
.I D0
|
||||
and
|
||||
.I D1
|
||||
represent two different discounting constants, as each N-gram
|
||||
order uses a different discounting constant.
|
||||
The resulting
|
||||
conditional probability and the backoff weight are calculated as given
|
||||
in equations (2) and (3):
|
||||
.nf
|
||||
|
||||
\fIp\fP(\fIa\fP_\fIz\fP) = (\fIc\fP(\fIa\fP_\fIz\fP) > 0) ? \fIf\fP(\fIa\fP_\fIz\fP) : bow(\fIa\fP_) \fIp\fP(_\fIz\fP) ; Eqn.2
|
||||
bow(\fIa\fP_) = (1 - Sum_\fIZ1\fP f(\fIa\fP_\fIz\fP)) / (1 - Sum_\fIZ1\fP \fIf\fP(_\fIz\fP)) ; Eqn.3
|
||||
|
||||
.fi
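The modified lower-order counts are easy to compute directly;
the following Python sketch (hypothetical, not the SRILM code path)
counts, for each lower-order N-gram, the number of distinct words
that precede it in the training data:
.nf

# Sketch of the modified Kneser-Ney counts n(*_z).
def kn_modified_counts(ngram_counts):
    # ngram_counts: dict mapping word tuples (a, ..., z) to counts.
    preceders = {}
    for ngram in ngram_counts:
        suffix, first = ngram[1:], ngram[0]
        preceders.setdefault(suffix, set()).add(first)
    return {suffix: len(ws) for suffix, ws in preceders.items()}

bigrams = {("san", "francisco"): 50, ("the", "cat"): 3, ("a", "cat"): 2}
print(kn_modified_counts(bigrams))
# {('francisco',): 1, ('cat',): 2}  -- "francisco" follows only one word,
# so its lower-order (unigram) probability stays small despite its count.
.fi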
|
||||
The option
|
||||
.B \-interpolate
|
||||
is used to create the interpolated versions of
|
||||
.B \-kndiscount
|
||||
and
|
||||
.BR \-ukndiscount .
|
||||
In this case we have:
|
||||
.nf
|
||||
|
||||
\fIp\fP(\fIa\fP_\fIz\fP) = \fIg\fP(\fIa\fP_\fIz\fP) + bow(\fIa\fP_) \fIp\fP(_\fIz\fP) ; Eqn.4
|
||||
|
||||
.fi
|
||||
Let
|
||||
.I Z1
|
||||
be the set {\fIz\fP: \fIc\fP(\fIa\fP_\fIz\fP) > 0}.
|
||||
For highest order N-grams we have:
|
||||
.nf
|
||||
|
||||
\fIg\fP(\fIa\fP_\fIz\fP) = max(0, \fIc\fP(\fIa\fP_\fIz\fP) - \fID\fP) / \fIc\fP(\fIa\fP_)
|
||||
bow(\fIa\fP_) = 1 - Sum_\fIZ1\fP \fIg\fP(\fIa\fP_\fIz\fP)
|
||||
= 1 - Sum_\fIZ1\fP \fIc\fP(\fIa\fP_\fIz\fP) / \fIc\fP(\fIa\fP_) + Sum_\fIZ1\fP \fID\fP / \fIc\fP(\fIa\fP_)
|
||||
= \fID\fP \fIn\fP(\fIa\fP_*) / \fIc\fP(\fIa\fP_)
|
||||
|
||||
.fi
|
||||
Let
|
||||
.I Z2
|
||||
be the set {\fIz\fP: \fIn\fP(*_\fIz\fP) > 0}.
|
||||
For lower order N-grams we have:
|
||||
.nf
|
||||
|
||||
\fIg\fP(_\fIz\fP) = max(0, \fIn\fP(*_\fIz\fP) - \fID\fP) / \fIn\fP(*_*)
|
||||
bow(_) = 1 - Sum_\fIZ2\fP \fIg\fP(_\fIz\fP)
|
||||
= 1 - Sum_\fIZ2\fP \fIn\fP(*_\fIz\fP) / \fIn\fP(*_*) + Sum_\fIZ2\fP \fID\fP / \fIn\fP(*_*)
|
||||
= \fID\fP \fIn\fP(_*) / \fIn\fP(*_*)
|
||||
|
||||
.fi
|
||||
The original Kneser-Ney discounting
|
||||
.RB ( \-ukndiscount )
|
||||
uses one discounting constant for each N-gram order.
|
||||
These constants are estimated as
|
||||
.nf
|
||||
|
||||
\fID\fP = \fIn1\fP / (\fIn1\fP + 2*\fIn2\fP)
|
||||
|
||||
.fi
|
||||
where
|
||||
.I n1
|
||||
and
|
||||
.I n2
|
||||
are the total number of N-grams with exactly one and
|
||||
two counts, respectively.
|
||||
.br
|
||||
Chen and Goodman's modified Kneser-Ney discounting
|
||||
.RB ( \-kndiscount )
|
||||
uses three discounting constants for each N-gram order, one for one-count
|
||||
N-grams, one for two-count N-grams, and one for three-plus-count N-grams:
|
||||
.nf
|
||||
|
||||
\fIY\fP = \fIn1\fP/(\fIn1\fP+2*\fIn2\fP)
|
||||
\fID1\fP = 1 - 2\fIY\fP(\fIn2\fP/\fIn1\fP)
|
||||
\fID2\fP = 2 - 3\fIY\fP(\fIn3\fP/\fIn2\fP)
|
||||
\fID3+\fP = 3 - 4\fIY\fP(\fIn4\fP/\fIn3\fP)
|
||||
|
||||
.fi
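These constants are easy to compute from the count-of-count statistics;
a hypothetical Python helper (not part of SRILM) mirroring the formulas above:
.nf

# Sketch of the Chen-Goodman (modified KN) discounting constants.
def kn_discount_constants(n1, n2, n3, n4):
    # n1..n4: number of N-grams of a given order occurring exactly 1..4 times.
    Y = n1 / (n1 + 2.0 * n2)
    D1 = 1 - 2 * Y * (n2 / n1)
    D2 = 2 - 3 * Y * (n3 / n2)
    D3plus = 3 - 4 * Y * (n4 / n3)
    return D1, D2, D3plus

print(kn_discount_constants(n1=1000, n2=400, n3=200, n4=120))
.fi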
|
||||
.TP
|
||||
.B Warning:
|
||||
SRILM implements Kneser-Ney discounting by actually modifying the
|
||||
counts of the lower order N-grams. Thus, when the
|
||||
.B \-write
|
||||
option is
|
||||
used to write the counts with
|
||||
.B \-kndiscount
|
||||
or
|
||||
.BR \-ukndiscount ,
|
||||
only the highest order N-grams and N-grams that start with <s> will have their
|
||||
regular counts
|
||||
.IR c ( a _ z );
|
||||
all others will have the modified counts
|
||||
.IR n (*_ z )
|
||||
instead.
|
||||
See Warning 2 in the next section.
|
||||
.TP
|
||||
.B \-wbdiscount
|
||||
Witten-Bell discounting.
|
||||
The intuition is that the weight given
|
||||
to the lower order model should be proportional to the probability of
|
||||
observing an unseen word in the current context
|
||||
.RI ( a _).
|
||||
Witten-Bell computes this weight as:
|
||||
.nf
|
||||
|
||||
bow(\fIa\fP_) = \fIn\fP(\fIa\fP_*) / (\fIn\fP(\fIa\fP_*) + \fIc\fP(\fIa\fP_))
|
||||
|
||||
.fi
|
||||
Here
|
||||
.IR n ( a _*)
|
||||
represents the number of unique words following the
|
||||
context
|
||||
.RI ( a _)
|
||||
in the training data.
|
||||
Witten-Bell was originally proposed as an interpolated discounting method.
|
||||
So with the
|
||||
.B \-interpolate
|
||||
option we get:
|
||||
.nf
|
||||
|
||||
\fIg\fP(\fIa\fP_\fIz\fP) = \fIc\fP(\fIa\fP_\fIz\fP) / (\fIn\fP(\fIa\fP_*) + \fIc\fP(\fIa\fP_))
|
||||
\fIp\fP(\fIa\fP_\fIz\fP) = \fIg\fP(\fIa\fP_\fIz\fP) + bow(\fIa\fP_) \fIp\fP(_\fIz\fP) ; Eqn.4
|
||||
|
||||
.fi
|
||||
Without the
|
||||
.B \-interpolate
|
||||
option we have the backoff version which is
|
||||
implemented by taking
|
||||
.IR f ( a _ z )
|
||||
to be the same as the interpolated
|
||||
.IR g ( a _ z ).
|
||||
.nf
|
||||
|
||||
\fIf\fP(\fIa\fP_\fIz\fP) = \fIc\fP(\fIa\fP_\fIz\fP) / (\fIn\fP(\fIa\fP_*) + \fIc\fP(\fIa\fP_))
|
||||
\fIp\fP(\fIa\fP_\fIz\fP) = (\fIc\fP(\fIa\fP_\fIz\fP) > 0) ? \fIf\fP(\fIa\fP_\fIz\fP) : bow(\fIa\fP_) \fIp\fP(_\fIz\fP) ; Eqn.2
|
||||
bow(\fIa\fP_) = (1 - Sum_\fIZ1\fP \fIf\fP(\fIa\fP_\fIz\fP)) / (1 - Sum_\fIZ1\fP \fIf\fP(_\fIz\fP)) ; Eqn.3
|
||||
|
||||
.fi
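A compact Python sketch of these Witten-Bell quantities
(hypothetical code, with invented names) follows directly from the counts:
.nf

# Sketch of Witten-Bell: f(a_z) = c(a_z) / (n(a_*) + c(a_)) and the
# weight n(a_*) / (n(a_*) + c(a_)) given to the lower-order model.
def witten_bell(counts):
    ctx_total, ctx_types = {}, {}
    for (ctx, z), c in counts.items():
        ctx_total[ctx] = ctx_total.get(ctx, 0) + c       # c(a_)
        ctx_types[ctx] = ctx_types.get(ctx, 0) + 1       # n(a_*)
    f = {(ctx, z): c / (ctx_types[ctx] + ctx_total[ctx])
         for (ctx, z), c in counts.items()}
    wb_weight = {ctx: ctx_types[ctx] / (ctx_types[ctx] + ctx_total[ctx])
                 for ctx in ctx_total}
    return f, wb_weight

f, wb_weight = witten_bell({(("on",), "monday"): 4, (("on",), "top"): 2})
print(f)            # {(('on',), 'monday'): 0.5, (('on',), 'top'): 0.25}
print(wb_weight)    # {('on',): 0.25}
.fi
In the backoff version the final bow(\fIa\fP_) is then obtained by the
renormalization of Equation 3 rather than used directly.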
|
||||
.TP
|
||||
.B \-ndiscount
|
||||
Ristad's natural discounting law.
|
||||
See Ristad's technical report "A natural law of succession"
|
||||
for a justification of the discounting factor.
|
||||
The
|
||||
.B \-interpolate
|
||||
option has no effect, only a backoff version has been implemented.
|
||||
.nf
|
||||
|
||||
\fIc\fP(\fIa\fP_\fIz\fP) \fIc\fP(\fIa\fP_) (\fIc\fP(\fIa\fP_) + 1) + \fIn\fP(\fIa\fP_*) (1 - \fIn\fP(\fIa\fP_*))
|
||||
\fIf\fP(\fIa\fP_\fIz\fP) = ------ ---------------------------------------
|
||||
\fIc\fP(\fIa\fP_) \fIc\fP(\fIa\fP_)^2 + \fIc\fP(\fIa\fP_) + 2 \fIn\fP(\fIa\fP_*)
|
||||
|
||||
\fIp\fP(\fIa\fP_\fIz\fP) = (\fIc\fP(\fIa\fP_\fIz\fP) > 0) ? \fIf\fP(\fIa\fP_\fIz\fP) : bow(\fIa\fP_) \fIp\fP(_\fIz\fP) ; Eqn.2
|
||||
bow(\fIa\fP_) = (1 - Sum_\fIZ1\fP f(\fIa\fP_\fIz\fP)) / (1 - Sum_\fIZ1\fP \fIf\fP(_\fIz\fP)) ; Eqn.3
|
||||
|
||||
.fi
|
||||
.TP
|
||||
.B \-count-lm
|
||||
Estimate a count-based interpolated LM using Jelinek-Mercer smoothing
|
||||
(Chen & Goodman, 1998), also known as "deleted interpolation."
|
||||
Note that this does not produce a backoff model; instead, a
count-LM parameter file in the format described in
.BR ngram (1)
needs to be specified using
.BR \-init-lm ,
and a reestimated file in the same format is produced.
|
||||
In the process, the mixture weights that interpolate the ML estimates
|
||||
at all levels of N-grams are estimated using an expectation-maximization (EM)
|
||||
algorithm.
|
||||
The options
|
||||
.B \-em-iters
|
||||
and
|
||||
.B \-em-delta
|
||||
control termination of the EM algorithm.
|
||||
Note that the N-gram counts used to estimate the maximum-likelihood
|
||||
estimates are specified in the
|
||||
.B \-init-lm
|
||||
model file.
|
||||
The counts specified with
|
||||
.B \-read
|
||||
or
|
||||
.B \-text
|
||||
are used only to estimate the interpolation weights.
|
||||
\" ???What does this all mean in terms of the math???
|
||||
.TP
|
||||
.BI \-addsmooth " D"
|
||||
Smooth by adding
|
||||
.I D
|
||||
to each N-gram count.
|
||||
This is usually a poor smoothing method,
|
||||
included mainly for instructional purposes.
|
||||
.nf
|
||||
|
||||
\fIp\fP(\fIa\fP_\fIz\fP) = (\fIc\fP(\fIa\fP_\fIz\fP) + \fID\fP) / (\fIc\fP(\fIa\fP_) + \fID\fP \fIn\fP(*))
|
||||
|
||||
.fi
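In Python this is a one-liner (shown as a hypothetical sketch, where
\fIV\fP is the vocabulary size \fIn\fP(*)):
.nf

# Sketch of additive smoothing: p(a_z) = (c(a_z) + D) / (c(a_) + D*V).
def addsmooth(c_az, c_a, D, V):
    return (c_az + D) / (c_a + D * V)

print(addsmooth(c_az=0, c_a=100, D=1.0, V=10000))   # unseen N-gram, ~9.9e-05
.fi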
|
||||
.TP
|
||||
default
|
||||
If the user does not specify any discounting options,
|
||||
.B ngram-count
|
||||
uses Good-Turing discounting (aka Katz smoothing) by default.
|
||||
The Good-Turing estimate states that for any N-gram that occurs
|
||||
.I r
|
||||
times, we should pretend that it occurs
|
||||
.IR r '
|
||||
times where
|
||||
.nf
|
||||
|
||||
\fIr\fP' = (\fIr\fP+1) \fIn\fP[\fIr\fP+1]/\fIn\fP[\fIr\fP]
|
||||
|
||||
.fi
|
||||
Here
|
||||
.IR n [ r ]
|
||||
is the number of N-grams that occur exactly
|
||||
.I r
|
||||
times in the training data.
|
||||
.br
|
||||
Large counts are taken to be reliable, thus they are not subject to
|
||||
any discounting.
|
||||
By default unigram counts larger than 1 and other N-gram counts larger
|
||||
than 7 are taken to be reliable and maximum
|
||||
likelihood estimates are used.
|
||||
These limits can be modified using the
|
||||
.BI \-gt n max
|
||||
options.
|
||||
.nf
|
||||
|
||||
\fIf\fP(\fIa\fP_\fIz\fP) = (\fIc\fP(\fIa\fP_\fIz\fP) / \fIc\fP(\fIa\fP_)) if \fIc\fP(\fIa\fP_\fIz\fP) > \fIgtmax\fP
|
||||
|
||||
.fi
|
||||
The lower counts are discounted proportional to the Good-Turing
|
||||
estimate with a small correction
|
||||
.I A
|
||||
to account for the high-count N-grams not being discounted.
|
||||
If 1 <= \fIc\fP(\fIa\fP_\fIz\fP) <= \fIgtmax\fP:
|
||||
.nf
|
||||
|
||||
\fIn\fP[\fIgtmax\fP + 1]
|
||||
\fIA\fP = (\fIgtmax\fP + 1) --------------
|
||||
\fIn\fP[1]
|
||||
|
||||
\fIn\fP[\fIc\fP(\fIa\fP_\fIz\fP) + 1]
|
||||
\fIc\fP'(\fIa\fP_\fIz\fP) = (\fIc\fP(\fIa\fP_\fIz\fP) + 1) ---------------
|
||||
\fIn\fP[\fIc\fP(\fIa\fP_\fIz\fP)]
|
||||
|
||||
\fIc\fP(\fIa\fP_\fIz\fP) (\fIc\fP'(\fIa\fP_\fIz\fP) / \fIc\fP(\fIa\fP_\fIz\fP) - \fIA\fP)
|
||||
\fIf\fP(\fIa\fP_\fIz\fP) = -------- ----------------------
|
||||
\fIc\fP(\fIa\fP_) (1 - \fIA\fP)
|
||||
|
||||
.fi
|
||||
The
|
||||
.B \-interpolate
|
||||
option has no effect in this case, only a backoff
|
||||
version has been implemented, thus:
|
||||
.nf
|
||||
|
||||
\fIp\fP(\fIa\fP_\fIz\fP) = (\fIc\fP(\fIa\fP_\fIz\fP) > 0) ? \fIf\fP(\fIa\fP_\fIz\fP) : bow(\fIa\fP_) \fIp\fP(_\fIz\fP) ; Eqn.2
|
||||
bow(\fIa\fP_) = (1 - Sum_\fIZ1\fP \fIf\fP(\fIa\fP_\fIz\fP)) / (1 - Sum_\fIZ1\fP \fIf\fP(_\fIz\fP)) ; Eqn.3
|
||||
|
||||
.fi
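The Good-Turing computation can be written out as a short Python sketch
(hypothetical; SRILM's actual implementation also handles irregular
count-of-count statistics):
.nf

# Sketch of Katz/Good-Turing discounting for one N-gram.
def good_turing_f(c_az, c_a, n, gtmax):
    # n: count-of-counts, n[r] = number of N-grams occurring exactly r times.
    if c_az > gtmax:                    # large counts are taken as reliable
        return c_az / c_a
    A = (gtmax + 1) * n[gtmax + 1] / n[1]
    c_adj = (c_az + 1) * n[c_az + 1] / n[c_az]     # r' = (r+1) n[r+1] / n[r]
    return (c_az / c_a) * (c_adj / c_az - A) / (1 - A)

n = {1: 1000, 2: 400, 3: 200, 4: 120, 5: 80, 6: 60, 7: 45, 8: 35}
print(good_turing_f(c_az=2, c_a=50, n=n, gtmax=7))
.fi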
|
||||
.SH "FILE FORMATS"
|
||||
SRILM can generate simple N-gram counts from plain text files with the
|
||||
following command:
|
||||
.nf
|
||||
ngram-count -order \fIN\fP -text \fIfile.txt\fP -write \fIfile.cnt\fP
|
||||
.fi
|
||||
The
|
||||
.B \-order
|
||||
option determines the maximum length of the N-grams.
|
||||
The file
|
||||
.I file.txt
|
||||
should contain one sentence per line with tokens
|
||||
separated by whitespace.
|
||||
The output
|
||||
.I file.cnt
|
||||
contains the N-gram
|
||||
tokens followed by a tab and a count on each line:
|
||||
.nf
|
||||
|
||||
\fIa\fP_\fIz\fP <tab> \fIc\fP(\fIa\fP_\fIz\fP)
|
||||
|
||||
.fi
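.PP
Such a count file is easy to post-process; for example, the following
Python sketch (hypothetical, reusing the file name from the example above)
reads it back into a dictionary keyed by N-gram:
.nf

# Sketch: read an ngram-count output file into a dict.
def read_counts(path):
    counts = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.split()       # N-gram words followed by the count
            if fields:
                counts[tuple(fields[:-1])] = int(fields[-1])
    return counts

counts = read_counts("file.cnt")
print(counts.get(("the", "cat"), 0))
.fi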
|
||||
A couple of warnings:
|
||||
.TP
|
||||
.B "Warning 1"
|
||||
SRILM implicitly assumes an <s> token in the beginning of each line
|
||||
and an </s> token at the end of each line and counts N-grams that start
|
||||
with <s> and end with </s>.
|
||||
You do not need to include these tags in
|
||||
.IR file.txt .
|
||||
.TP
|
||||
.B "Warning 2"
|
||||
When
|
||||
.B \-kndiscount
|
||||
or
|
||||
.B \-ukndiscount
|
||||
options are used, the count file contains modified counts.
|
||||
Specifically, all N-grams of the maximum
|
||||
order, and all N-grams that start with <s> have their regular counts
|
||||
.IR c ( a _ z ),
|
||||
but shorter N-grams that do not start with <s> have the number
|
||||
of unique words preceding them
|
||||
.IR n (* a _ z )
|
||||
instead.
|
||||
See the description of
|
||||
.B \-kndiscount
|
||||
and
|
||||
.B \-ukndiscount
|
||||
for details.
|
||||
.PP
|
||||
For most smoothing methods (except
|
||||
.BR \-count-lm )
|
||||
SRILM generates and uses N-gram model files in the ARPA format.
|
||||
A typical command to generate a model file would be:
|
||||
.nf
|
||||
ngram-count -order \fIN\fP -text \fIfile.txt\fP -lm \fIfile.lm\fP
|
||||
.fi
|
||||
The ARPA format output
|
||||
.I file.lm
|
||||
will contain the following information about an N-gram on each line:
|
||||
.nf
|
||||
|
||||
log10(\fIf\fP(\fIa\fP_\fIz\fP)) <tab> \fIa\fP_\fIz\fP <tab> log10(bow(\fIa\fP_\fIz\fP))
|
||||
|
||||
.fi
|
||||
Based on Equation 2, the first entry represents the base 10 logarithm
|
||||
of the conditional probability (logprob) for the N-gram
|
||||
.IR a _ z .
|
||||
This is followed by the actual words in the N-gram separated by spaces.
|
||||
The last and optional entry is the base-10 logarithm of the backoff weight
|
||||
for (\fIn\fP+1)-grams starting with
|
||||
.IR a _ z .
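.PP
For illustration, a minimal Python sketch that parses such N-gram lines
(hypothetical; it ignores the headers and section markers of the full format
described in
.BR ngram-format (5),
and a real parser would use the known N-gram order rather than guessing
whether the trailing field is a backoff weight):
.nf

# Sketch: parse one N-gram line of an ARPA format model file.
def parse_arpa_ngram(line):
    fields = line.split()
    logprob = float(fields[0])
    try:                                # last field may be a backoff weight
        bow = float(fields[-1])
        words = tuple(fields[1:-1])
    except ValueError:
        bow = None
        words = tuple(fields[1:])
    return words, logprob, bow

print(parse_arpa_ngram("-2.1 the cat -0.35"))
# (('the', 'cat'), -2.1, -0.35)
print(parse_arpa_ngram("-3.0 the cat sat"))
# (('the', 'cat', 'sat'), -3.0, None)
.fi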
|
||||
.TP
|
||||
.B "Warning 3"
|
||||
Both backoff and interpolated models are represented in the same
|
||||
format.
|
||||
This means interpolation is done during model building and
|
||||
represented in the ARPA format with logprob and backoff weight using
|
||||
equation (6).
|
||||
.TP
|
||||
.B "Warning 4"
|
||||
Not all N-grams in the count file necessarily end up in the model file.
|
||||
The options
|
||||
.BR \-gtmin ,
|
||||
.BR \-gt1min ,
|
||||
\&...,
|
||||
.B \-gt9min
|
||||
specify the minimum counts
|
||||
for N-grams to be included in the LM (not only for Good-Turing
|
||||
discounting but for the other methods as well).
|
||||
By default all unigrams and bigrams
|
||||
are included, but for higher order N-grams only those with count >= 2 are
|
||||
included.
|
||||
Some exceptions arise, because if one N-gram is included in
|
||||
the model file, all its prefix N-grams have to be included as well.
|
||||
This causes some higher order 1-count N-grams to be included when using
|
||||
KN discounting, which uses modified counts as described in Warning 2.
|
||||
.TP
|
||||
.B "Warning 5"
|
||||
Not all N-grams in the model file have backoff weights.
|
||||
The highest order N-grams do not need a backoff weight.
|
||||
For lower order N-grams
|
||||
backoff weights are only recorded for those that appear as the prefix
|
||||
of a longer N-gram included in the model.
|
||||
For other lower order N-grams
|
||||
the backoff weight is implicitly 1 (or 0, in log representation).
|
||||
|
||||
.SH "SEE ALSO"
|
||||
ngram(1), ngram-count(1), ngram-format(5),
|
||||
.br
|
||||
S. F. Chen and J. Goodman, ``An Empirical Study of Smoothing Techniques for
|
||||
Language Modeling,'' TR-10-98, Computer Science Group, Harvard Univ., 1998.
|
||||
.SH BUGS
|
||||
Work in progress.
|
||||
.SH AUTHOR
|
||||
Deniz Yuret <dyuret@ku.edu.tr>,
|
||||
Andreas Stolcke <stolcke@icsi.berkeley.edu>
|
||||
.br
|
||||
Copyright (c) 2007 SRI International
|
||||
language_model/srilm-1.7.3/man/man7/srilm-faq.7 (new file, 745 lines)
@@ -0,0 +1,745 @@
|
||||
.\" $Id: srilm-faq.7,v 1.13 2019/09/09 22:35:37 stolcke Exp $
|
||||
.TH SRILM-FAQ 7 "$Date: 2019/09/09 22:35:37 $" "SRILM Miscellaneous"
|
||||
.SH NAME
|
||||
SRILM-FAQ \- Frequently asked questions about SRI LM tools
|
||||
.SH SYNOPSIS
|
||||
.nf
|
||||
man srilm-faq
|
||||
.fi
|
||||
.SH DESCRIPTION
|
||||
This document tries to answer some of the most frequently asked questions
|
||||
about SRILM.
|
||||
.SS Build issues
|
||||
.TP 4
|
||||
.B A1) I ran ``make World'' but the $SRILM/bin/$MACHINE_TYPE directory is empty.
|
||||
Building the binaries can fail for a variety of reasons.
|
||||
Check the following:
|
||||
.RS
|
||||
.IP a)
|
||||
Make sure the SRILM environment variable is set, or specified on the
|
||||
make command line, e.g.:
|
||||
.nf
|
||||
make SRILM=$PWD
|
||||
.fi
|
||||
.IP b)
|
||||
Make sure the
|
||||
.B $SRILM/sbin/machine-type
|
||||
script returns a valid string for the platform you are trying to build on.
|
||||
Known platforms have machine-specific makefiles called
|
||||
.nf
|
||||
$SRILM/common/Makefile.machine.$MACHINE_TYPE
|
||||
.fi
|
||||
If
|
||||
.B machine-type
|
||||
does not work for some reason, you can override its output on the command line:
|
||||
.nf
|
||||
make MACHINE_TYPE=xyz
|
||||
.fi
|
||||
If you are building for an unsupported platform create a new machine-specific
|
||||
makefile and mail it to stolcke@speech.sri.com.
|
||||
.IP c)
|
||||
Make sure your compiler works and is invoked correctly.
|
||||
You will probably have to edit the
|
||||
.B CC
|
||||
and
|
||||
.B CXX
|
||||
variables in the platform-specific makefile.
|
||||
If you have questions about compiler invocation and best options
|
||||
consult a local expert; these things differ widely between sites.
|
||||
.IP d)
|
||||
The default is to compile with Tcl support.
|
||||
This is in fact only used for some testing binaries (which are
|
||||
not built by default),
|
||||
so it can be turned off if Tcl is not available or presents problems.
|
||||
Edit the machine-specific makefile accordingly.
|
||||
To use Tcl, locate the
|
||||
.B tcl.h
|
||||
header file and the library itself, and set (for example)
|
||||
.nf
|
||||
TCL_INCLUDE = -I/path/to/include
|
||||
TCL_LIBRARY = -L/path/to/lib -ltcl8.4
|
||||
.fi
|
||||
To disable Tcl support set
|
||||
.nf
|
||||
NO_TCL = X
|
||||
TCL_INCLUDE =
|
||||
TCL_LIBRARY =
|
||||
.fi
|
||||
.IP e)
|
||||
Make sure you have the C-shell (/bin/csh) installed on your system.
|
||||
Otherwise you will see something like
|
||||
.nf
|
||||
make: /sbin/machine-type: Command not found
|
||||
.fi
|
||||
early in the build process.
|
||||
On Ubuntu Linux and Cygwin systems "csh" or "tcsh" needs to be installed
|
||||
as an optional package.
|
||||
.IP f)
|
||||
If you cannot get SRILM to build, save the make output to a file
|
||||
.nf
|
||||
make World >& make.output
|
||||
.fi
|
||||
and look for messages indicating errors.
|
||||
If you still cannot figure out what the problem is, send the error message
|
||||
and immediately preceding lines to the srilm-user list.
|
||||
Also include information about your operating system ("uname -a" output)
|
||||
and compiler version ("gcc -v" or equivalent for other compilers).
|
||||
.RE
|
||||
.TP
|
||||
.B A2) The regression test outputs differ for all tests. What did I do wrong?
|
||||
Most likely the binaries didn't get built or aren't executable
|
||||
for some reason.
|
||||
Check issue A1).
|
||||
.TP
|
||||
.B A3) I get differing outputs for some of the regression tests. Is that OK?
|
||||
It might be.
|
||||
The comparison of reference to actual output allows for small numerical
|
||||
differences, but
|
||||
some of the algorithms make hard decisions based on floating-point computations
|
||||
that can result in different outputs as a result of different compiler
|
||||
optimizations, machine floating point precisions (Intel versus IEEE format),
|
||||
and math libraries.
|
||||
Tests of this nature include
|
||||
.BR ngram-class ,
|
||||
.BR disambig ,
|
||||
and
|
||||
.BR nbest-rover .
|
||||
When encountering differences, diff the output in the
|
||||
$SRILM/test/outputs/\fITEST\fP.$MACHINE_TYPE.stdout file to the corresponding
|
||||
$SRILM/test/reference/\fITEST\fP.stdout, where
|
||||
.I TEST
|
||||
is the name of the test that failed.
|
||||
Also compare the corresponding .stderr files;
|
||||
differences there usually indicate operating-system related problems.
|
||||
.SS Large data and memory issues
|
||||
.TP 4
|
||||
.B B1) I'm getting a message saying ``Assertion `body != 0' failed.''
|
||||
You are running out of memory.
|
||||
See subsequent questions depending on what you are trying to do.
|
||||
.IP Note:
|
||||
The above message means you are running
|
||||
out of "virtual" memory on your computer, which could be because of
|
||||
limits in swap space, administrative resource limits, or limitations of
|
||||
the machine architecture (a 32-bit machine cannot address more than
|
||||
4GB no matter how many resources your system has).
|
||||
Another symptom of not enough memory is that your program runs, but
|
||||
very, very slowly, i.e., it is "paging" or "swapping" as it tries to
|
||||
use more memory than the machine has RAM installed.
|
||||
.TP
|
||||
.B B2) I am trying to count N-grams in a text file and running out of memory.
|
||||
Don't use
|
||||
.B ngram-count
|
||||
directly to count N-grams.
|
||||
Instead, use the
|
||||
.B make-batch-counts
|
||||
and
|
||||
.B merge-batch-counts
|
||||
scripts described in
|
||||
.BR training-scripts (1).
|
||||
That way you can create N-gram counts limited only by the maximum file size
|
||||
on your system.
|
||||
.TP
|
||||
.B B3) I am trying to build an N-gram LM and ngram-count runs out of memory.
|
||||
You are running out of memory either because of the size of ngram counts,
|
||||
or of the LM being built. The following are strategies for reducing the
|
||||
memory requirements for training LMs.
|
||||
.RS
|
||||
.IP a)
|
||||
Assuming you are using Good-Turing or Kneser-Ney discounting, don't use
|
||||
.B ngram-count
|
||||
in "raw" form.
|
||||
Instead, use the
|
||||
.B make-big-lm
|
||||
wrapper script described in the
|
||||
.BR training-scripts (1)
|
||||
man page.
|
||||
.IP b)
|
||||
Switch to using the "_c" or "_s" versions of the SRI binaries.
|
||||
For
|
||||
instructions on how to build them, see the INSTALL file.
|
||||
Once built, set your executable search path accordingly, and try
|
||||
.B make-big-lm
|
||||
again.
|
||||
.IP c)
|
||||
Raise the minimum counts for N-grams included in the LM, i.e.,
|
||||
the values of the options
|
||||
.BR \-gt2min ,
|
||||
.BR \-gt3min ,
|
||||
.BR \-gt4min ,
|
||||
etc.
|
||||
The higher order N-grams typically get higher minimum counts.
|
||||
.IP d)
|
||||
Get a machine with more memory.
|
||||
If you are hitting the limitations of a 32-bit machine architecture,
|
||||
get a 64-bit machine and recompile SRILM to take advantage of the expanded
|
||||
address space.
|
||||
(The MACHINE_TYPE=i686-m64 setting is for systems based on
|
||||
64-bit AMD processors, as well as recent compatibles from Intel.)
|
||||
Note that 64-bit pointers will require a memory overhead in
|
||||
themselves, so you will need a machine with significantly, not just a
|
||||
little, more memory than 4GB.
|
||||
.RE
|
||||
.TP
|
||||
.B B4) I am trying to apply a large LM to some data and am running out of memory.
|
||||
Again, there are several strategies to reduce memory requirements.
|
||||
.RS
|
||||
.IP a)
|
||||
Use the "_c" or "_s" versions of the SRI binaries.
|
||||
See 3b) above.
|
||||
.IP b)
|
||||
Precompute the vocabulary of your test data and use the
|
||||
.B "ngram \-limit-vocab"
|
||||
option to load only the N-gram parameters relevant to your data.
|
||||
This approach should allow you to use arbitrarily
|
||||
large LMs provided the data is divided into small enough chunks.
|
||||
.IP c)
|
||||
If the LM can be built on a large machine, but then is to be used on
|
||||
machines with limited memory, use
|
||||
.B "ngram \-prune"
|
||||
to remove the less important parameters of the model.
|
||||
This usually gives huge size reductions with relatively modest performance
|
||||
degradation.
|
||||
The tradeoff is adjustable by varying the pruning parameter.
|
||||
.RE
|
||||
.TP
|
||||
.B B5) How can I reduce the time it takes to load large LMs into memory?
|
||||
The techniques described in 4b) and 4c) above also reduce the load time
|
||||
of the LM.
|
||||
Additional steps to try are:
|
||||
.RS
|
||||
.IP a)
|
||||
Convert the LM into binary format, using
|
||||
.nf
|
||||
ngram -order \fIN\fP -lm \fIOLDLM\fP -write-bin-lm \fINEWLM\fP
|
||||
.fi
|
||||
(This is currently only supported for N-gram-based LMs.)
|
||||
You can also generate the LM directly in binary format, using
|
||||
.nf
|
||||
ngram-count ... -lm \fINEWLM\fP -write-binary-lm
|
||||
.fi
|
||||
The resulting
|
||||
.I NEWLM
|
||||
file (which should not be compressed) can be used
|
||||
in place of a textual LM file with all compiled SRILM tools
|
||||
(but not with
|
||||
.BR lm-scripts (1)).
|
||||
The format is machine-independent, i.e., it can be read on machines with
|
||||
different word sizes or byte-order.
|
||||
Loading binary LMs is faster because
|
||||
(1) it reduces the overhead of parsing the input data, and
|
||||
(2) in combination with
|
||||
.B \-limit-vocab
|
||||
(see 4b)
|
||||
it is much faster to skip sections of the LM that are out-of-vocabulary.
|
||||
.IP Note:
|
||||
There is also a binary format for N-gram counts.
|
||||
It can be generated using
|
||||
.nf
|
||||
ngram-count -write-binary \fICOUNTS\fP
|
||||
.fi
|
||||
and has similar advantages as binary LM files.
|
||||
.IP b)
|
||||
Start a "probability server" that loads the LM ahead of time, and
|
||||
then have "LM clients" query the server instead of computing the
|
||||
probabilities themselves.
|
||||
.br
|
||||
The server is started on a machine named
|
||||
.I HOST
|
||||
using
|
||||
.nf
|
||||
ngram \fILMOPTIONS\fP -server-port \fIP\fP &
|
||||
.fi
|
||||
where
|
||||
.I P
|
||||
is an integer < 2^16 that specifies the TCP/IP port number the
|
||||
server will listen on, and
|
||||
.I LMOPTIONS
|
||||
are whatever options necessary to define the LM to be used.
|
||||
.br
|
||||
One or more clients (programs such as
|
||||
.BR ngram (1),
|
||||
.BR disambig (1),
|
||||
.BR lattice-tool (1))
|
||||
can then query the server using the options
|
||||
.nf
|
||||
-use-server \fIP\fP@\fIHOST\fP -cache-served-ngrams
|
||||
.fi
|
||||
instead of the usual "-lm \fIFILE\fP".
|
||||
The
|
||||
.B \-cache-served-ngrams
|
||||
option is not required but often speeds up performance dramatically by
|
||||
saving the results of server lookups in the client for reuse.
|
||||
Server-based LMs may be combined with file-based LMs by interpolation;
|
||||
see
|
||||
.BR ngram (1)
|
||||
for details.
|
||||
.RE
|
||||
.TP
|
||||
.B B6) How can I use the Google Web N-gram corpus to build an LM?
|
||||
Google has made a corpus of 5-grams extracted from 1 tera-words of web data
|
||||
available via LDC.
|
||||
However, the data is too large to build a standard backoff N-gram, even
|
||||
using the techniques described above.
|
||||
Instead, we recommend a "count-based" LM smoothed with deleted interpolation.
|
||||
Such an LM computes probabilities on the fly from the counts, of which only
|
||||
the subsets needed for a given test set need to be loaded into memory.
|
||||
LM construction proceeds in the following steps:
|
||||
.RS
|
||||
.IP a)
|
||||
Make sure you have built SRI binaries either for a 64-bit machine
|
||||
(e.g., MACHINE_TYPE=i686-m64 OPTION=_c) or using 64-bit counts (OPTION=_l).
|
||||
This is necessary because the data contains N-gram counts exceeding
|
||||
the range of 32-bit integers.
|
||||
Be sure to invoke all commands below using the path to the appropriate
|
||||
binary executable directory.
|
||||
.IP b)
|
||||
Prepare mapping file for some vocabulary mismatches and call this
|
||||
.BR google.aliases :
|
||||
.nf
|
||||
<S> <s>
|
||||
</S> </s>
|
||||
<UNK> <unk>
|
||||
.fi
|
||||
.IP c)
|
||||
Prepare an initial count-LM parameter file
|
||||
.BR google.countlm.0 :
|
||||
.nf
|
||||
order 5
|
||||
vocabsize 13588391
|
||||
totalcount 1024908267229
|
||||
countmodulus 40
|
||||
mixweights 15
|
||||
0.5 0.5 0.5 0.5 0.5
|
||||
0.5 0.5 0.5 0.5 0.5
|
||||
0.5 0.5 0.5 0.5 0.5
|
||||
0.5 0.5 0.5 0.5 0.5
|
||||
0.5 0.5 0.5 0.5 0.5
|
||||
0.5 0.5 0.5 0.5 0.5
|
||||
0.5 0.5 0.5 0.5 0.5
|
||||
0.5 0.5 0.5 0.5 0.5
|
||||
0.5 0.5 0.5 0.5 0.5
|
||||
0.5 0.5 0.5 0.5 0.5
|
||||
0.5 0.5 0.5 0.5 0.5
|
||||
0.5 0.5 0.5 0.5 0.5
|
||||
0.5 0.5 0.5 0.5 0.5
|
||||
0.5 0.5 0.5 0.5 0.5
|
||||
0.5 0.5 0.5 0.5 0.5
|
||||
0.5 0.5 0.5 0.5 0.5
|
||||
google-counts \fIPATH\fP
|
||||
.fi
|
||||
where
|
||||
.I PATH
|
||||
points to the location of the Google N-grams, i.e., the directory containing
|
||||
subdirectories "1gms", "2gms", etc.
|
||||
Note that the
|
||||
.B vocabsize
|
||||
and
|
||||
.B totalcount
|
||||
were obtained from the 1gms/vocab.gz and 1gms/total files, respectively.
|
||||
(Check that they match and modify as needed.)
|
||||
For an explanation of the parameters see the
|
||||
.BR ngram (1)
|
||||
.B \-count-lm
|
||||
option.
|
||||
.IP d)
|
||||
Prepare a text file
|
||||
.B tune.text
|
||||
containing data for estimating the mixture weights.
|
||||
This data should be representative of, but different from your test data.
|
||||
Compute the vocabulary of this data using
|
||||
.nf
|
||||
ngram-count -text tune.text -write-vocab tune.vocab
|
||||
.fi
|
||||
The vocabulary size should not exceed a few thousand to keep memory
|
||||
requirements in the following steps manageable.
|
||||
.IP e)
|
||||
Estimate the mixture weights:
|
||||
.nf
|
||||
ngram-count -debug 1 -order 5 -count-lm \\
|
||||
-text tune.text -vocab tune.vocab \\
|
||||
-vocab-aliases google.aliases \\
|
||||
-limit-vocab \\
|
||||
-init-lm google.countlm.0 \\
|
||||
-em-iters 100 \\
|
||||
-lm google.countlm
|
||||
.fi
|
||||
This will write the estimated LM to
|
||||
.BR google.countlm .
|
||||
The output will be identical to the initial LM file, except for the
|
||||
updated interpolation weights.
|
||||
.IP f)
|
||||
Prepare a test data file
|
||||
.BR test.text ,
|
||||
and its vocabulary
|
||||
.B test.vocab
|
||||
as in Step d) above.
|
||||
Then apply the LM to the test data:
|
||||
.nf
|
||||
ngram -debug 2 -order 5 -count-lm \\
|
||||
-lm google.countlm \\
|
||||
-vocab test.vocab \\
|
||||
-vocab-aliases google.aliases \\
|
||||
-limit-vocab \\
|
||||
-ppl test.text > test.ppl
|
||||
.fi
|
||||
The perplexity output will appear in
|
||||
.B test.ppl.
|
||||
.IP g)
|
||||
Note that the Google data uses mixed case spellings.
|
||||
To apply the LM to lowercase data one needs to prepare a much more
|
||||
extensive vocabulary mapping table for the
|
||||
.B \-vocab-aliases
|
||||
option, namely, one that maps all
|
||||
upper- and mixed-case spellings to lowercase strings.
|
||||
This mapping file should be restricted to the words appearing in
|
||||
.B tune.text
|
||||
and
|
||||
.BR test.text ,
|
||||
respectively, to avoid defeating the effect of
|
||||
.B \-limit-vocab .
|
||||
.RE
|
||||
.SS "Smoothing issues"
|
||||
.TP 4
|
||||
.B C1) What is smoothing and discounting all about?
|
||||
.I Smoothing
|
||||
refers to methods that assign probabilities to events (N-grams) that
|
||||
do not occur in the training data.
|
||||
According to a pure maximum-likelihood estimator these events would have
|
||||
probability zero, which is plainly wrong since previously unseen events
|
||||
in general do occur in independent test data.
|
||||
Because the probability mass is redistributed away from the seen events
|
||||
toward the unseen events the resulting model is "smoother" (closer to uniform)
|
||||
than the ML model.
|
||||
.I Discounting
|
||||
refers to the approach used by many smoothing methods of adjusting the
|
||||
empirical counts of seen events downwards.
|
||||
The ML estimator (count divided by total number of events) is then applied
|
||||
to the discounted count, resulting in a smoother estimate.
|
||||
.TP
|
||||
.B C2) What smoothing methods are there?
|
||||
There are many, and SRILM implements a fairly large selection of the
most popular ones.
|
||||
A detailed discussion of these is found in a separate document,
|
||||
.BR ngram-discount (7).
|
||||
.TP
|
||||
.B C3) Why am I getting errors or warnings from the smoothing method I'm using?
|
||||
The Good-Turing and Kneser-Ney smoothing methods rely on statistics called
"count-of-counts", the number of words occurring once, twice, three times, etc.
|
||||
The formulae for these methods become undefined if the counts-of-counts
|
||||
are zero, or not strictly decreasing.
|
||||
Some conditions are fatal (such as when the count of singleton words is zero),
|
||||
others lead to less smoothing (and warnings).
|
||||
To avoid these problems, check for the following possibilities:
|
||||
.RS
|
||||
.IP a)
|
||||
The data could be very sparse, i.e., the training corpus very small.
|
||||
Try using the Witten-Bell discounting method.
|
||||
.IP b)
|
||||
The vocabulary could be very small, such as when training an LM based on
|
||||
characters or parts-of-speech.
|
||||
Smoothing is less of an issue in those cases, and the Witten-Bell method
|
||||
should work well.
|
||||
.IP c)
|
||||
The data was manipulated in some way, or artificially generated.
|
||||
For example, duplicating data eliminates the odd-numbered counts-of-counts.
|
||||
.IP d)
|
||||
The vocabulary is limited during counts collection using the
|
||||
.BR ngram-count
|
||||
.B \-vocab
|
||||
option, with the effect that many low-frequency N-grams are eliminated.
|
||||
The proper approach is to compute smoothing parameters on the full vocabulary.
|
||||
This happens automatically in the
|
||||
.B make-big-lm
|
||||
wrapper script, which is preferable to direct use of
|
||||
.BR ngram-count
|
||||
for other reasons (see issue B3-a above).
|
||||
.IP e)
|
||||
You are estimating an LM from N-gram counts that have been truncated beforehand,
|
||||
e.g., by removing singleton events.
|
||||
If you cannot go back to the original data and recompute the counts
|
||||
there is a heuristic to extrapolate low counts-of-counts from higher ones.
|
||||
The heuristic is invoked automatically (and an informational message is output)
|
||||
when
|
||||
.B make-big-lm
|
||||
is used to estimate LMs with Kneser-Ney smoothing.
|
||||
For details see the paper by W. Wang et al. in ASRU-2007, listed under
|
||||
"SEE ALSO".
|
||||
.RE
|
||||
.TP
|
||||
.B C4) How does discounting work in the case of unigrams?
|
||||
First, unigrams are discounted using the same method as higher-order
|
||||
N-grams, using the specified method.
|
||||
The probability mass freed up in this way
|
||||
is then either spread evenly over all word types
|
||||
that would otherwise have zero probability (this is essentially
|
||||
simulating a backoff to zero-grams), or
|
||||
if all unigrams already have non-zero probabilities, the
|
||||
left-over mass is added to
|
||||
.I all
|
||||
unigrams.
|
||||
In either case all unigram probabilities will sum to 1.
|
||||
An informational message from
|
||||
.B ngram-count
|
||||
will tell which case applies.
|
||||
.TP
|
||||
.B C5) Why do I get a different number of trigrams when building a 4gram model compared to just a trigram model?
|
||||
This can happen when Kneser-Ney smoothing is used and the trigram cut-off
|
||||
.RB ( \-gt3min )
|
||||
is greater than 1 (as with the default, 2).
|
||||
The count cutoffs are applied against the modified counts generated as part of KN smoothing,
|
||||
so in the case of a 4gram model the trigram counts are modified and the set of N-grams above the cutoff will change.
|
||||
.SS "Out-of-vocabulary, zeroprob, and `unknown' words"
|
||||
.TP 4
|
||||
.B D1) What is the perplexity of an OOV (out of vocabulary) word?
|
||||
By default any word not observed in the training data is considered
|
||||
OOV and OOV words are silently ignored by the
|
||||
.BR ngram (1)
|
||||
during perplexity (ppl) calculation.
|
||||
For example:
|
||||
.nf
|
||||
|
||||
$ ngram-count -text turkish.train -lm turkish.lm
|
||||
$ ngram -lm turkish.lm -ppl turkish.test
|
||||
file turkish.test: 61031 sentences, 1000015 words, 34153 OOVs
|
||||
0 zeroprobs, logprob= -3.20177e+06 ppl= 1311.97 ppl1= 2065.09
|
||||
|
||||
.fi
|
||||
The statistics printed in the last two lines have the following meanings:
|
||||
.RS
|
||||
.TP
|
||||
.B "34153 OOVs"
|
||||
This is the number of unknown word tokens, i.e. tokens
|
||||
that appear in
|
||||
.B turkish.test
|
||||
but not in
|
||||
.B turkish.train
|
||||
from which
|
||||
.B turkish.lm
|
||||
was generated.
|
||||
.TP
|
||||
.B "logprob= -3.20177e+06"
|
||||
This gives us the total logprob ignoring the 34153 unknown word tokens.
|
||||
The logprob does include the probabilities
|
||||
assigned to </s> tokens which are introduced by
|
||||
.BR ngram-count (1).
|
||||
Thus the total number of tokens which this logprob is based on is
|
||||
.nf
|
||||
words - OOVs + sentences = 1000015 - 34153 + 61031
|
||||
.fi
|
||||
.TP
|
||||
.B "ppl = 1311.97"
|
||||
This gives us the geometric average of 1/probability of
|
||||
each token, i.e., perplexity.
|
||||
The exact expression is:
|
||||
.nf
|
||||
ppl = 10^(-logprob / (words - OOVs + sentences))
|
||||
.fi
|
||||
.TP
|
||||
.B "ppl1 = 2065.09"
|
||||
This gives us the average perplexity per word excluding the </s> tokens.
|
||||
The exact expression is:
|
||||
.nf
|
||||
ppl1 = 10^(-logprob / (words - OOVs))
|
||||
.fi
|
||||
.RE
|
||||
You can verify these numbers by running the
|
||||
.B ngram
|
||||
program with the
|
||||
.B "\-debug 2"
|
||||
option, which gives the probability assigned to each token.
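These formulas can be checked numerically; a small Python sketch using
the numbers from the example output above:
.nf

# Check of the ppl formulas with the example figures shown above.
logprob, sentences, words, oovs = -3.20177e+06, 61031, 1000015, 34153

ppl = 10 ** (-logprob / (words - oovs + sentences))
ppl1 = 10 ** (-logprob / (words - oovs))

print(round(ppl, 2), round(ppl1, 2))    # close to 1311.97 and 2065.09
.fi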
|
||||
.TP
|
||||
.B D2) What happens when the OOV word is in the context of an N-gram?
|
||||
Exact details depend on the discounting algorithm used, but typically
|
||||
the backed-off probability from a lower order N-gram is used. If the
|
||||
.B \-unk
|
||||
option is used as explained below, an <unk> token is assumed to
|
||||
take the place of the OOV word and no back-off may be necessary
|
||||
if a corresponding N-gram containing <unk> is found in the LM.
|
||||
.TP
|
||||
.B D3) Isn't it wrong to assign 0 logprob to OOV words?
|
||||
That depends on the application.
|
||||
If you are comparing multiple language
|
||||
models which all consider the same set of words as OOV it may be OK to
|
||||
ignore OOV words.
|
||||
Note that perplexity comparisons are only ever meaningful
|
||||
if the vocabularies of all LMs are the same.
|
||||
Therefore, to compare LMs with different sets of OOV words
(such as when using different tokenization strategies for morphologically
complex languages) it becomes important
to take into account the true cost of the OOV words, or to model all words,
including OOVs.
|
||||
.TP
|
||||
.B D4) How do I take into account the true cost of the OOV words?
|
||||
A simple strategy is to "explode" the OOV words, i.e., split them into
|
||||
characters in the training and test data.
|
||||
Typically words that appear more than once in the training data are
|
||||
considered to be vocabulary words.
|
||||
All other words are split into their characters and the
|
||||
individual characters are considered tokens.
|
||||
Assuming that all characters occur at least once in the training data there
|
||||
will be no OOV tokens in the test data.
|
||||
Note that this strategy changes the number of tokens in the data set,
|
||||
so even though the logprob is meaningful, be careful when reporting ppl results.
|
||||
.TP
|
||||
.B D5) What if I want to model the OOV words explicitly?
|
||||
Maybe a better strategy is to have a separate "letter" model for OOV words.
|
||||
This can be easily created using SRILM by using a training
|
||||
file listing the OOV words one per line with their characters
|
||||
separated by spaces.
|
||||
The
|
||||
.B ngram-count
|
||||
options
|
||||
.B \-ukndiscount
|
||||
and
|
||||
.B "\-order 7"
|
||||
seem to work well for this purpose.
|
||||
The final logprob results are obtained in two steps.
|
||||
First do regular training and testing on your data using
|
||||
.B \-vocab
|
||||
and
|
||||
.B \-unk
|
||||
options.
|
||||
The resulting logprob will include the cost of the vocabulary words and an
|
||||
<unk> token for each OOV word.
|
||||
Then apply the letter model to each OOV word in the test set.
|
||||
Add the logprobs.
|
||||
Here is an example:
|
||||
.nf
|
||||
|
||||
# Determine vocabulary:
|
||||
ngram-count -text turkish.train -write-order 1 -write turkish.train.1cnt
|
||||
awk '$2>1' turkish.train.1cnt | cut -f1 | sort > turkish.train.vocab
|
||||
awk '$2==1' turkish.train.1cnt | cut -f1 | sort > turkish.train.oov
|
||||
|
||||
# Word model:
|
||||
ngram-count -kndiscount -interpolate -order 4 -vocab turkish.train.vocab -unk -text turkish.train -lm turkish.train.model
|
||||
ngram -order 4 -unk -lm turkish.train.model -ppl turkish.test > turkish.test.ppl
|
||||
|
||||
# Letter model:
|
||||
perl -C -lne 'print join(" ", split(""))' turkish.train.oov > turkish.train.oov.split
|
||||
ngram-count -ukndiscount -interpolate -order 7 -text turkish.train.oov.split -lm turkish.train.oov.model
|
||||
perl -pe 's/\\s+/\\n/g' turkish.test | sort > turkish.test.words
|
||||
comm -23 turkish.test.words turkish.train.vocab > turkish.test.oov
|
||||
perl -C -lne 'print join(" ", split(""))' turkish.test.oov > turkish.test.oov.split
|
||||
ngram -order 7 -ppl turkish.test.oov.split -lm turkish.train.oov.model > turkish.test.oov.ppl
|
||||
|
||||
# Add the logprobs in turkish.test.ppl and turkish.test.oov.ppl.
|
||||
|
||||
.fi
|
||||
Again, perplexities are not directly meaningful as computed by SRILM, but you
|
||||
can recompute them by hand using the combined logprob value, and the number of
|
||||
original word tokens in the test set.
|
||||
.TP
|
||||
.B D6) What are zeroprob words and when do they occur?
|
||||
In-vocabulary words that get zero probability are counted as
|
||||
"zeroprobs" in the ppl output.
|
||||
Just as OOV words, they are excluded from the perplexity
|
||||
computation since otherwise the perplexity value would be infinity.
|
||||
There are three reasons why zeroprobs could occur in a
|
||||
closed vocabulary setting (the default for SRILM):
|
||||
.RS
|
||||
.IP a)
|
||||
If the same vocabulary is used at test time as was used during
|
||||
training, and smoothing is enabled, then the occurrence of zeroprobs
|
||||
indicates an anomalous condition and, possibly, a broken language model.
|
||||
.IP b)
|
||||
If smoothing has been disabled (e.g., by using the option
|
||||
.BR "\-cdiscount 0" ),
|
||||
then the LM will use maximum likelihood estimates for
|
||||
the N-grams and then any unseen N-gram is a zeroprob.
|
||||
.IP c)
|
||||
If a different vocabulary file is specified at test time than
|
||||
the one used in training, then the definition of what counts as an OOV
|
||||
will change.
|
||||
In particular, a word that wasn't seen in the training data (but is in the
|
||||
test vocabulary) will
|
||||
.I not
|
||||
be mapped to
|
||||
.B <unk>
|
||||
and, therefore, not
|
||||
count as an OOV in the perplexity computation.
|
||||
However, it will still get zero probability and, therefore, be tallied
|
||||
as a zeroprob.
|
||||
.RE
|
||||
.TP
|
||||
.B D7) What is the point of using the \fB<unk>\fP token?
|
||||
Using
|
||||
.B <unk>
|
||||
is a practical convenience employed by SRILM.
|
||||
Words not in the specified vocabulary are mapped to
|
||||
.BR <unk> ,
|
||||
which is equivalent to performing the same mapping
|
||||
in a data pre-processing step outside of SRILM.
|
||||
Other than that,
|
||||
for both LM estimation and evaluation purposes,
|
||||
.B <unk>
|
||||
is treated like any other word.
|
||||
(In particular, in the computation of discounted probabilities
|
||||
there is no special handling of
|
||||
.BR <unk> .)
|
||||
.TP
|
||||
.B D8) So how do I train an open-vocabulary LM with \fB<unk>\fP?
|
||||
First, make sure to use the
|
||||
.B ngram-count
|
||||
.B \-unk
|
||||
option, which simply indicates that the
|
||||
.B <unk>
|
||||
word should be included in the LM vocabulary, as required for an
|
||||
open-vocabulary LM.
|
||||
Without this option, N-grams containing
|
||||
.B <unk>
|
||||
would simply be discarded.
|
||||
An "open vocabulary" LM is simply one that contains
|
||||
.BR <unk> ,
|
||||
and can therefore (by virtue of the mapping of OOVs to
|
||||
.BR <unk> )
|
||||
assign a non-zero probability to them.
|
||||
Next, we need to ensure there are actual occurrences of
|
||||
.B <unk>
|
||||
N-grams
|
||||
in the training data so we can obtain meaningful probability estimates
|
||||
for them
|
||||
(otherwise
|
||||
.B <unk>
|
||||
would only get probability via unigram discounting; see item C4).
|
||||
To get a proper estimate
|
||||
of the
|
||||
.B <unk>
|
||||
probability, we need to explicitly specify a vocabulary that is not
|
||||
a superset of the training data.
|
||||
One way to do that is to extract the vocabulary from an independent
|
||||
data set, or by only including words with some minimum count (greater than 1)
|
||||
in the training data.
|
||||
.TP
|
||||
.B D9) Doesn't ngram-count \-addsmooth deal with OOV words by adding a constant to occurrence counts?
|
||||
No, all smoothing is applied when building the LM at training time,
|
||||
so it must use the
|
||||
.B <unk>
|
||||
mechanism to assign probability to words that are first seen in the
|
||||
test data.
|
||||
Furthermore, even add-constant smoothing requires a fixed, finite
|
||||
vocabulary to compute the denominator of its estimator.
|
||||
.SH "SEE ALSO"
|
||||
ngram(1), ngram-count(1), training-scripts(1), ngram-discount(7).
|
||||
.br
|
||||
$SRILM/INSTALL
|
||||
.br
|
||||
http://www.speech.sri.com/projects/srilm/mail-archive/srilm-user/
|
||||
.br
|
||||
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13
|
||||
.br
|
||||
W. Wang, A. Stolcke, & J. Zheng,
|
||||
Reranking Machine Translation Hypotheses With Structured and Web-based Language Models. Proc. IEEE Automatic Speech Recognition and Understanding Workshop, pp. 159-164, Kyoto, 2007.
|
||||
http://www.speech.sri.com/cgi-bin/run-distill?papers/asru2007-mt-lm.ps.gz
|
||||
.SH BUGS
|
||||
This document is work in progress.
|
||||
.SH AUTHOR
|
||||
Andreas Stolcke <andreas.stolcke@microsoft.com>,
|
||||
Deniz Yuret <dyuret@ku.edu.tr>,
|
||||
Nitin Madnani <nmadnani@umiacs.umd.edu>
|
||||
.br
|
||||
Copyright (c) 2007\-2010 SRI International
|
||||
.br
|
||||
Copyright (c) 2011\-2017 Andreas Stolcke
|
||||
.br
|
||||
Copyright (c) 2011\-2017 Microsoft Corp.
|
||||