competition update

This commit is contained in:
nckcard
2025-07-02 12:18:09 -07:00
parent 9e17716a4a
commit 77dbcf868f
2615 changed files with 1648116 additions and 125 deletions


@@ -0,0 +1,15 @@
.\" $Id: TEMPLATE.7,v 1.2 2019/09/09 22:35:38 stolcke Exp $
.TH XXX 7 "$Date: 2019/09/09 22:35:38 $" "SRILM Miscellaneous"
.SH NAME
XXX \- XXX
.SH SYNOPSIS
.nf
.B XXX
.fi
.SH DESCRIPTION
.SH "SEE ALSO"
.SH BUGS
.SH AUTHOR
Andreas Stolcke <stolcke@icsi.berkeley.edu>
.br
Copyright (c) 2007 SRI International


@@ -0,0 +1,672 @@
.\" $Id: ngram-discount.7,v 1.5 2019/09/09 22:35:37 stolcke Exp $
.TH ngram-discount 7 "$Date: 2019/09/09 22:35:37 $" "SRILM Miscellaneous"
.SH NAME
ngram-discount \- notes on the N-gram smoothing implementations in SRILM
.SH NOTATION
.TP 10
.IR a _ z
An N-gram where
.I a
is the first word,
.I z
is the last word, and "_" represents 0 or more words in between.
.TP
.IR p ( a _ z )
The estimated conditional probability of the \fIn\fPth word
.I z
given the first \fIn\fP-1 words
.RI ( a _)
of an N-gram.
.TP
.IR a _
The \fIn\fP-1 word prefix of the N-gram
.IR a _ z .
.TP
.RI _ z
The \fIn\fP-1 word suffix of the N-gram
.IR a _ z .
.TP
.IR c ( a _ z )
The count of N-gram
.IR a _ z
in the training data.
.TP
.IR n (*_ z )
The number of unique N-grams that match a given pattern.
``(*)'' represents a wildcard matching a single word.
.TP
.IR n1 , n [1]
The number of unique N-grams with count = 1.
.SH DESCRIPTION
.PP
N-gram models try to estimate the probability of a word
.I z
in the context of the previous \fIn\fP-1 words
.RI ( a _),
i.e.,
.IR Pr ( z | a _).
We will
denote this conditional probability using
.IR p ( a _ z )
for convenience.
One way to estimate
.IR p ( a _ z )
is to look at the number of times word
.I z
has followed the previous \fIn\fP-1 words
.RI ( a _):
.nf
(1) \fIp\fP(\fIa\fP_\fIz\fP) = \fIc\fP(\fIa\fP_\fIz\fP)/\fIc\fP(\fIa\fP_)
.fi
This is known as the maximum likelihood (ML) estimate.
Unfortunately it does not work very well because it assigns zero probability to
N-grams that have not been observed in the training data.
To avoid the zero probabilities, we take some probability mass from the observed
N-grams and distribute it to unobserved N-grams.
Such redistribution is known as smoothing or discounting.
.PP
Most existing smoothing algorithms can be described by the following equation:
.nf
(2) \fIp\fP(\fIa\fP_\fIz\fP) = (\fIc\fP(\fIa\fP_\fIz\fP) > 0) ? \fIf\fP(\fIa\fP_\fIz\fP) : bow(\fIa\fP_) \fIp\fP(_\fIz\fP)
.fi
If the N-gram
.IR a _ z
has been observed in the training data, we use the
distribution
.IR f ( a _ z ).
Typically
.IR f ( a _ z )
is discounted to be less than
the ML estimate so we have some leftover probability for the
.I z
words unseen in the context
.RI ( a _).
Different algorithms mainly differ on how
they discount the ML estimate to get
.IR f ( a _ z ).
.PP
If the N-gram
.IR a _ z
has not been observed in the training data, we use
the lower order distribution
.IR p (_ z ).
If the context has never been
observed (\fIc\fP(\fIa\fP_) = 0),
we can use the lower order distribution directly (bow(\fIa\fP_) = 1).
Otherwise we need to compute a backoff weight (bow) to
make sure probabilities are normalized:
.nf
Sum_\fIz\fP \fIp\fP(\fIa\fP_\fIz\fP) = 1
.fi
.PP
Let
.I Z
be the set of all words in the vocabulary,
.I Z0
be the set of all words with \fIc\fP(\fIa\fP_\fIz\fP) = 0, and
.I Z1
be the set of all words with \fIc\fP(\fIa\fP_\fIz\fP) > 0.
Given
.IR f ( a _ z ),
.RI bow( a _)
can be determined as follows:
.nf
(3) Sum_\fIZ\fP \fIp\fP(\fIa\fP_\fIz\fP) = 1
Sum_\fIZ1\fP \fIf\fP(\fIa\fP_\fIz\fP) + Sum_\fIZ0\fP bow(\fIa\fP_) \fIp\fP(_\fIz\fP) = 1
bow(\fIa\fP_) = (1 - Sum_\fIZ1\fP \fIf\fP(\fIa\fP_\fIz\fP)) / Sum_\fIZ0\fP \fIp\fP(_\fIz\fP)
= (1 - Sum_\fIZ1\fP \fIf\fP(\fIa\fP_\fIz\fP)) / (1 - Sum_\fIZ1\fP \fIp\fP(_\fIz\fP))
= (1 - Sum_\fIZ1\fP \fIf\fP(\fIa\fP_\fIz\fP)) / (1 - Sum_\fIZ1\fP \fIf\fP(_\fIz\fP))
.fi
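As a purely illustrative example with made-up values, if the discounted
probabilities of the observed words sum to Sum_\fIZ1\fP \fIf\fP(\fIa\fP_\fIz\fP) = 0.7
and the lower order model assigns those same words Sum_\fIZ1\fP \fIf\fP(_\fIz\fP) = 0.4, then:
.nf
bow(\fIa\fP_) = (1 - 0.7) / (1 - 0.4) = 0.5
.fi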
.PP
Smoothing is generally done in one of two ways.
The backoff models compute
.IR p ( a _ z )
based on the N-gram counts
.IR c ( a _ z )
when \fIc\fP(\fIa\fP_\fIz\fP) > 0, and
only consider lower order counts
.IR c (_ z )
when \fIc\fP(\fIa\fP_\fIz\fP) = 0.
Interpolated models take lower order counts into account when
\fIc\fP(\fIa\fP_\fIz\fP) > 0 as well.
A common way to express an interpolated model is:
.nf
(4) \fIp\fP(\fIa\fP_\fIz\fP) = \fIg\fP(\fIa\fP_\fIz\fP) + bow(\fIa\fP_) \fIp\fP(_\fIz\fP)
.fi
Where \fIg\fP(\fIa\fP_\fIz\fP) = 0 when \fIc\fP(\fIa\fP_\fIz\fP) = 0
and it is discounted to be less than
the ML estimate when \fIc\fP(\fIa\fP_\fIz\fP) > 0
to reserve some probability mass for
the unseen
.I z
words.
Given
.IR g ( a _ z ),
.RI bow( a _)
can be determined as follows:
.nf
(5) Sum_\fIZ\fP \fIp\fP(\fIa\fP_\fIz\fP) = 1
Sum_\fIZ1\fP \fIg\fP(\fIa\fP_\fIz\fP) + Sum_\fIZ\fP bow(\fIa\fP_) \fIp\fP(_\fIz\fP) = 1
bow(\fIa\fP_) = 1 - Sum_\fIZ1\fP \fIg\fP(\fIa\fP_\fIz\fP)
.fi
.PP
An interpolated model can also be expressed in the form of equation
(2), which is the way it is represented in the ARPA format model files
in SRILM:
.nf
(6) \fIf\fP(\fIa\fP_\fIz\fP) = \fIg\fP(\fIa\fP_\fIz\fP) + bow(\fIa\fP_) \fIp\fP(_\fIz\fP)
\fIp\fP(\fIa\fP_\fIz\fP) = (\fIc\fP(\fIa\fP_\fIz\fP) > 0) ? \fIf\fP(\fIa\fP_\fIz\fP) : bow(\fIa\fP_) \fIp\fP(_\fIz\fP)
.fi
.PP
Most algorithms in SRILM have both backoff and interpolated versions.
Empirically, interpolated algorithms usually do better than the backoff
ones, and Kneser-Ney does better than others.
.SH OPTIONS
.PP
This section describes the formulation of each discounting option in
.BR ngram-count (1).
After giving the motivation for each discounting method,
we will give expressions for
.IR f ( a _ z )
and
.RI bow( a _)
of Equation 2 in terms of the counts.
Note that some counts may not be included in the model
file because of the
.B \-gtmin
options; see Warning 4 in the next section.
.PP
Backoff versions are the default but interpolated versions of most
models are available using the
.B \-interpolate
option.
In this case we will express
.IR g ( a _ z )
and
.RI bow( a _)
of Equation 4 in terms of the counts as well.
Note that the ARPA format model files store the interpolated
models and the backoff models the same way using
.IR f ( a _ z )
and
.RI bow( a _);
see Warning 3 in the next section.
The conversion between backoff and
interpolated formulations is given in Equation 6.
.PP
The discounting options may be followed by a digit (1-9) to indicate
that only specific N-gram orders be affected.
See
.BR ngram-count (1)
for more details.
.TP
.BI \-cdiscount " D"
Ney's absolute discounting using
.I D
as the constant to subtract.
.I D
should be between 0 and 1.
If
.I Z1
is the set
of all words
.I z
with \fIc\fP(\fIa\fP_\fIz\fP) > 0:
.nf
\fIf\fP(\fIa\fP_\fIz\fP) = (\fIc\fP(\fIa\fP_\fIz\fP) - \fID\fP) / \fIc\fP(\fIa\fP_)
\fIp\fP(\fIa\fP_\fIz\fP) = (\fIc\fP(\fIa\fP_\fIz\fP) > 0) ? \fIf\fP(\fIa\fP_\fIz\fP) : bow(\fIa\fP_) \fIp\fP(_\fIz\fP) ; Eqn.2
bow(\fIa\fP_) = (1 - Sum_\fIZ1\fP f(\fIa\fP_\fIz\fP)) / (1 - Sum_\fIZ1\fP \fIf\fP(_\fIz\fP)) ; Eqn.3
.fi
With the
.B \-interpolate
option we have:
.nf
\fIg\fP(\fIa\fP_\fIz\fP) = max(0, \fIc\fP(\fIa\fP_\fIz\fP) - \fID\fP) / \fIc\fP(\fIa\fP_)
\fIp\fP(\fIa\fP_\fIz\fP) = \fIg\fP(\fIa\fP_\fIz\fP) + bow(\fIa\fP_) \fIp\fP(_\fIz\fP) ; Eqn.4
bow(\fIa\fP_) = 1 - Sum_\fIZ1\fP \fIg\fP(\fIa\fP_\fIz\fP) ; Eqn.5
= \fID\fP \fIn\fP(\fIa\fP_*) / \fIc\fP(\fIa\fP_)
.fi
The suggested discount factor is:
.nf
\fID\fP = \fIn1\fP / (\fIn1\fP + 2*\fIn2\fP)
.fi
where
.I n1
and
.I n2
are the total number of N-grams with exactly one and
two counts, respectively.
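As a purely illustrative example (with made-up counts-of-counts), a corpus with
\fIn1\fP = 1000 singleton N-grams and \fIn2\fP = 400 doubleton N-grams would suggest
.nf
\fID\fP = 1000 / (1000 + 2*400) ~= 0.56
.fi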
Different discounting constants can be
specified for different N-gram orders using options
.BR \-cdiscount1 ,
.BR \-cdiscount2 ,
etc.
.TP
.BR \-kndiscount " and " \-ukndiscount
Kneser-Ney discounting.
This is similar to absolute discounting in
that the discounted probability is computed by subtracting a constant
.I D
from the N-gram count.
The options
.B \-kndiscount
and
.B \-ukndiscount
differ as to how this constant is computed.
.br
The main idea of Kneser-Ney is to use a modified probability estimate
for lower order N-grams used for backoff.
Specifically, the modified
probability for a lower order N-gram is taken to be proportional to the
number of unique words that precede it in the training data.
With discounting and normalization we get:
.nf
\fIf\fP(\fIa\fP_\fIz\fP) = (\fIc\fP(\fIa\fP_\fIz\fP) - \fID0\fP) / \fIc\fP(\fIa\fP_) ;; for highest order N-grams
\fIf\fP(_\fIz\fP) = (\fIn\fP(*_\fIz\fP) - \fID1\fP) / \fIn\fP(*_*) ;; for lower order N-grams
.fi
where the
.IR n (*_ z )
notation represents the number of unique N-grams that
match a given pattern with (*) used as a wildcard for a single word.
.I D0
and
.I D1
represent two different discounting constants, as each N-gram
order uses a different discounting constant.
The resulting
conditional probability and the backoff weight are calculated as given
in equations (2) and (3):
.nf
\fIp\fP(\fIa\fP_\fIz\fP) = (\fIc\fP(\fIa\fP_\fIz\fP) > 0) ? \fIf\fP(\fIa\fP_\fIz\fP) : bow(\fIa\fP_) \fIp\fP(_\fIz\fP) ; Eqn.2
bow(\fIa\fP_) = (1 - Sum_\fIZ1\fP f(\fIa\fP_\fIz\fP)) / (1 - Sum_\fIZ1\fP \fIf\fP(_\fIz\fP)) ; Eqn.3
.fi
The option
.B \-interpolate
is used to create the interpolated versions of
.B \-kndiscount
and
.BR \-ukndiscount .
In this case we have:
.nf
\fIp\fP(\fIa\fP_\fIz\fP) = \fIg\fP(\fIa\fP_\fIz\fP) + bow(\fIa\fP_) \fIp\fP(_\fIz\fP) ; Eqn.4
.fi
Let
.I Z1
be the set {\fIz\fP: \fIc\fP(\fIa\fP_\fIz\fP) > 0}.
For highest order N-grams we have:
.nf
\fIg\fP(\fIa\fP_\fIz\fP) = max(0, \fIc\fP(\fIa\fP_\fIz\fP) - \fID\fP) / \fIc\fP(\fIa\fP_)
bow(\fIa\fP_) = 1 - Sum_\fIZ1\fP \fIg\fP(\fIa\fP_\fIz\fP)
= 1 - Sum_\fIZ1\fP \fIc\fP(\fIa\fP_\fIz\fP) / \fIc\fP(\fIa\fP_) + Sum_\fIZ1\fP \fID\fP / \fIc\fP(\fIa\fP_)
= \fID\fP \fIn\fP(\fIa\fP_*) / \fIc\fP(\fIa\fP_)
.fi
Let
.I Z2
be the set {\fIz\fP: \fIn\fP(*_\fIz\fP) > 0}.
For lower order N-grams we have:
.nf
\fIg\fP(_\fIz\fP) = max(0, \fIn\fP(*_\fIz\fP) - \fID\fP) / \fIn\fP(*_*)
bow(_) = 1 - Sum_\fIZ2\fP \fIg\fP(_\fIz\fP)
= 1 - Sum_\fIZ2\fP \fIn\fP(*_\fIz\fP) / \fIn\fP(*_*) + Sum_\fIZ2\fP \fID\fP / \fIn\fP(*_*)
= \fID\fP \fIn\fP(_*) / \fIn\fP(*_*)
.fi
The original Kneser-Ney discounting
.RB ( \-ukndiscount )
uses one discounting constant for each N-gram order.
These constants are estimated as
.nf
\fID\fP = \fIn1\fP / (\fIn1\fP + 2*\fIn2\fP)
.fi
where
.I n1
and
.I n2
are the total number of N-grams with exactly one and
two counts, respectively.
.br
Chen and Goodman's modified Kneser-Ney discounting
.RB ( \-kndiscount )
uses three discounting constants for each N-gram order, one for one-count
N-grams, one for two-count N-grams, and one for three-plus-count N-grams:
.nf
\fIY\fP = \fIn1\fP/(\fIn1\fP+2*\fIn2\fP)
\fID1\fP = 1 - 2\fIY\fP(\fIn2\fP/\fIn1\fP)
\fID2\fP = 2 - 3\fIY\fP(\fIn3\fP/\fIn2\fP)
\fID3+\fP = 3 - 4\fIY\fP(\fIn4\fP/\fIn3\fP)
.fi
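As a purely illustrative example (with made-up counts-of-counts
\fIn1\fP = 1000, \fIn2\fP = 400, \fIn3\fP = 200, \fIn4\fP = 120):
.nf
\fIY\fP = 1000/(1000 + 2*400) ~= 0.556
\fID1\fP = 1 - 2*0.556*(400/1000) ~= 0.56
\fID2\fP = 2 - 3*0.556*(200/400) ~= 1.17
\fID3+\fP = 3 - 4*0.556*(120/200) ~= 1.67
.fi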
.TP
.B Warning:
SRILM implements Kneser-Ney discounting by actually modifying the
counts of the lower order N-grams. Thus, when the
.B \-write
option is
used to write the counts with
.B \-kndiscount
or
.BR \-ukndiscount ,
only the highest order N-grams and N-grams that start with <s> will have their
regular counts
.IR c ( a _ z );
all others will have the modified counts
.IR n (*_ z )
instead.
See Warning 2 in the next section.
.TP
.B \-wbdiscount
Witten-Bell discounting.
The intuition is that the weight given
to the lower order model should be proportional to the probability of
observing an unseen word in the current context
.RI ( a _).
Witten-Bell computes this weight as:
.nf
bow(\fIa\fP_) = \fIn\fP(\fIa\fP_*) / (\fIn\fP(\fIa\fP_*) + \fIc\fP(\fIa\fP_))
.fi
Here
.IR n ( a _*)
represents the number of unique words following the
context
.RI ( a _)
in the training data.
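As a purely illustrative example with made-up numbers, if the context has been
seen \fIc\fP(\fIa\fP_) = 100 times and is followed by \fIn\fP(\fIa\fP_*) = 10 distinct words, then:
.nf
bow(\fIa\fP_) = 10 / (10 + 100) ~= 0.09
.fi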
Witten-Bell was originally formulated as an interpolated discounting method,
so with the
.B \-interpolate
option we get:
.nf
\fIg\fP(\fIa\fP_\fIz\fP) = \fIc\fP(\fIa\fP_\fIz\fP) / (\fIn\fP(\fIa\fP_*) + \fIc\fP(\fIa\fP_))
\fIp\fP(\fIa\fP_\fIz\fP) = \fIg\fP(\fIa\fP_\fIz\fP) + bow(\fIa\fP_) \fIp\fP(_\fIz\fP) ; Eqn.4
.fi
Without the
.B \-interpolate
option we have the backoff version which is
implemented by taking
.IR f ( a _ z )
to be the same as the interpolated
.IR g ( a _ z ).
.nf
\fIf\fP(\fIa\fP_\fIz\fP) = \fIc\fP(\fIa\fP_\fIz\fP) / (\fIn\fP(\fIa\fP_*) + \fIc\fP(\fIa\fP_))
\fIp\fP(\fIa\fP_\fIz\fP) = (\fIc\fP(\fIa\fP_\fIz\fP) > 0) ? \fIf\fP(\fIa\fP_\fIz\fP) : bow(\fIa\fP_) \fIp\fP(_\fIz\fP) ; Eqn.2
bow(\fIa\fP_) = (1 - Sum_\fIZ1\fP \fIf\fP(\fIa\fP_\fIz\fP)) / (1 - Sum_\fIZ1\fP \fIf\fP(_\fIz\fP)) ; Eqn.3
.fi
.TP
.B \-ndiscount
Ristad's natural discounting law.
See Ristad's technical report "A natural law of succession"
for a justification of the discounting factor.
The
.B \-interpolate
option has no effect, only a backoff version has been implemented.
.nf
\fIf\fP(\fIa\fP_\fIz\fP) = [\fIc\fP(\fIa\fP_\fIz\fP) / \fIc\fP(\fIa\fP_)] *
    [\fIc\fP(\fIa\fP_) (\fIc\fP(\fIa\fP_) + 1) + \fIn\fP(\fIa\fP_*) (1 - \fIn\fP(\fIa\fP_*))] / [\fIc\fP(\fIa\fP_)^2 + \fIc\fP(\fIa\fP_) + 2 \fIn\fP(\fIa\fP_*)]
\fIp\fP(\fIa\fP_\fIz\fP) = (\fIc\fP(\fIa\fP_\fIz\fP) > 0) ? \fIf\fP(\fIa\fP_\fIz\fP) : bow(\fIa\fP_) \fIp\fP(_\fIz\fP) ; Eqn.2
bow(\fIa\fP_) = (1 - Sum_\fIZ1\fP f(\fIa\fP_\fIz\fP)) / (1 - Sum_\fIZ1\fP \fIf\fP(_\fIz\fP)) ; Eqn.3
.fi
.TP
.B \-count-lm
Estimate a count-based interpolated LM using Jelinek-Mercer smoothing
(Chen & Goodman, 1998), also known as "deleted interpolation."
Note that this does not produce a backoff model; instead, a
count-LM parameter file in the format described in
.BR ngram (1)
needs to be specified using
.BR \-init-lm ,
and a reestimated file in the same format is produced.
In the process, the mixture weights that interpolate the ML estimates
at all levels of N-grams are estimated using an expectation-maximization (EM)
algorithm.
The options
.B \-em-iters
and
.B \-em-delta
control termination of the EM algorithm.
Note that the N-gram counts used to estimate the maximum-likelihood
estimates are specified in the
.B \-init-lm
model file.
The counts specified with
.B \-read
or
.B \-text
are used only to estimate the interpolation weights.
\" ???What does this all mean in terms of the math???
.TP
.BI \-addsmooth " D"
Smooth by adding
.I D
to each N-gram count.
This is usually a poor smoothing method,
included mainly for instructional purposes.
.nf
\fIp\fP(\fIa\fP_\fIz\fP) = (\fIc\fP(\fIa\fP_\fIz\fP) + \fID\fP) / (\fIc\fP(\fIa\fP_) + \fID\fP \fIn\fP(*))
.fi
.TP
default
If the user does not specify any discounting options,
.B ngram-count
uses Good-Turing discounting (aka Katz smoothing) by default.
The Good-Turing estimate states that for any N-gram that occurs
.I r
times, we should pretend that it occurs
.IR r '
times where
.nf
\fIr\fP' = (\fIr\fP+1) \fIn\fP[\fIr\fP+1]/\fIn\fP[\fIr\fP]
.fi
Here
.IR n [ r ]
is the number of N-grams that occur exactly
.I r
times in the training data.
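As a purely illustrative example with made-up counts-of-counts
\fIn\fP[1] = 1000 and \fIn\fP[2] = 400, a singleton N-gram (\fIr\fP = 1) is treated
as if it had occurred
.nf
\fIr\fP' = (1+1) * 400/1000 = 0.8
.fi
times.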
.br
Large counts are taken to be reliable, thus they are not subject to
any discounting.
By default unigram counts larger than 1 and other N-gram counts larger
than 7 are taken to be reliable and maximum
likelihood estimates are used.
These limits can be modified using the
.BI \-gt n max
options.
.nf
\fIf\fP(\fIa\fP_\fIz\fP) = (\fIc\fP(\fIa\fP_\fIz\fP) / \fIc\fP(\fIa\fP_)) if \fIc\fP(\fIa\fP_\fIz\fP) > \fIgtmax\fP
.fi
The lower counts are discounted proportional to the Good-Turing
estimate with a small correction
.I A
to account for the high-count N-grams not being discounted.
If 1 <= \fIc\fP(\fIa\fP_\fIz\fP) <= \fIgtmax\fP:
.nf
\fIA\fP = (\fIgtmax\fP + 1) \fIn\fP[\fIgtmax\fP + 1] / \fIn\fP[1]
\fIc\fP'(\fIa\fP_\fIz\fP) = (\fIc\fP(\fIa\fP_\fIz\fP) + 1) \fIn\fP[\fIc\fP(\fIa\fP_\fIz\fP) + 1] / \fIn\fP[\fIc\fP(\fIa\fP_\fIz\fP)]
\fIf\fP(\fIa\fP_\fIz\fP) = (\fIc\fP(\fIa\fP_\fIz\fP) / \fIc\fP(\fIa\fP_)) * (\fIc\fP'(\fIa\fP_\fIz\fP) / \fIc\fP(\fIa\fP_\fIz\fP) - \fIA\fP) / (1 - \fIA\fP)
.fi
The
.B \-interpolate
option has no effect in this case, only a backoff
version has been implemented, thus:
.nf
\fIp\fP(\fIa\fP_\fIz\fP) = (\fIc\fP(\fIa\fP_\fIz\fP) > 0) ? \fIf\fP(\fIa\fP_\fIz\fP) : bow(\fIa\fP_) \fIp\fP(_\fIz\fP) ; Eqn.2
bow(\fIa\fP_) = (1 - Sum_\fIZ1\fP \fIf\fP(\fIa\fP_\fIz\fP)) / (1 - Sum_\fIZ1\fP \fIf\fP(_\fIz\fP)) ; Eqn.3
.fi
.SH "FILE FORMATS"
SRILM can generate simple N-gram counts from plain text files with the
following command:
.nf
ngram-count -order \fIN\fP -text \fIfile.txt\fP -write \fIfile.cnt\fP
.fi
The
.B \-order
option determines the maximum length of the N-grams.
The file
.I file.txt
should contain one sentence per line with tokens
separated by whitespace.
The output
.I file.cnt
contains the N-gram
tokens followed by a tab and a count on each line:
.nf
\fIa\fP_\fIz\fP <tab> \fIc\fP(\fIa\fP_\fIz\fP)
.fi
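For example, a bigram count file might contain lines like the following
(the words and counts are made up; the separator is a literal tab character):
.nf
in the <tab> 312
the end <tab> 97
.fi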
A couple of warnings:
.TP
.B "Warning 1"
SRILM implicitly assumes an <s> token in the beginning of each line
and an </s> token at the end of each line and counts N-grams that start
with <s> and end with </s>.
You do not need to include these tags in
.IR file.txt .
.TP
.B "Warning 2"
When
.B \-kndiscount
or
.B \-ukndiscount
options are used, the count file contains modified counts.
Specifically, all N-grams of the maximum
order, and all N-grams that start with <s> have their regular counts
.IR c ( a _ z ),
but shorter N-grams that do not start with <s> have the number
of unique words preceding them
.IR n (* a _ z )
instead.
See the description of
.B \-kndiscount
and
.B \-ukndiscount
for details.
.PP
For most smoothing methods (except
.BR \-count-lm )
SRILM generates and uses N-gram model files in the ARPA format.
A typical command to generate a model file would be:
.nf
ngram-count -order \fIN\fP -text \fIfile.txt\fP -lm \fIfile.lm\fP
.fi
The ARPA format output
.I file.lm
will contain the following information about an N-gram on each line:
.nf
log10(\fIf\fP(\fIa\fP_\fIz\fP)) <tab> \fIa\fP_\fIz\fP <tab> log10(bow(\fIa\fP_\fIz\fP))
.fi
Based on Equation 2, the first entry represents the base 10 logarithm
of the conditional probability (logprob) for the N-gram
.IR a _ z .
This is followed by the actual words in the N-gram separated by spaces.
The last and optional entry is the base-10 logarithm of the backoff weight
for (\fIn\fP+1)-grams starting with
.IR a _ z .
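For example, a purely made-up trigram entry might read
.nf
-2.2553 <tab> in the end <tab> -0.3785
.fi
meaning that the conditional probability of "end" following "in the" has
log10 value -2.2553, and the backoff weight for 4-grams starting with
"in the end" has log10 value -0.3785.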
.TP
.B "Warning 3"
Both backoff and interpolated models are represented in the same
format.
This means interpolation is done during model building and
represented in the ARPA format with logprob and backoff weight using
equation (6).
.TP
.B "Warning 4"
Not all N-grams in the count file necessarily end up in the model file.
The options
.BR \-gtmin ,
.BR \-gt1min ,
\&...,
.B \-gt9min
specify the minimum counts
for N-grams to be included in the LM (not only for Good-Turing
discounting but for the other methods as well).
By default all unigrams and bigrams
are included, but for higher order N-grams only those with count >= 2 are
included.
Some exceptions arise, because if one N-gram is included in
the model file, all its prefix N-grams have to be included as well.
This causes some higher order 1-count N-grams to be included when using
KN discounting, which uses modified counts as described in Warning 2.
.TP
.B "Warning 5"
Not all N-grams in the model file have backoff weights.
The highest order N-grams do not need a backoff weight.
For lower order N-grams
backoff weights are only recorded for those that appear as the prefix
of a longer N-gram included in the model.
For other lower order N-grams
the backoff weight is implicitly 1 (or 0, in log representation).
.SH "SEE ALSO"
ngram(1), ngram-count(1), ngram-format(5),
.br
S. F. Chen and J. Goodman, ``An Empirical Study of Smoothing Techniques for
Language Modeling,'' TR-10-98, Computer Science Group, Harvard Univ., 1998.
.SH BUGS
Work in progress.
.SH AUTHOR
Deniz Yuret <dyuret@ku.edu.tr>,
Andreas Stolcke <stolcke@icsi.berkeley.edu>
.br
Copyright (c) 2007 SRI International


@@ -0,0 +1,745 @@
.\" $Id: srilm-faq.7,v 1.13 2019/09/09 22:35:37 stolcke Exp $
.TH SRILM-FAQ 7 "$Date: 2019/09/09 22:35:37 $" "SRILM Miscellaneous"
.SH NAME
SRILM-FAQ \- Frequently asked questions about SRI LM tools
.SH SYNOPSIS
.nf
man srilm-faq
.fi
.SH DESCRIPTION
This document tries to answer some of the most frequently asked questions
about SRILM.
.SS Build issues
.TP 4
.B A1) I ran ``make World'' but the $SRILM/bin/$MACHINE_TYPE directory is empty.
Building the binaries can fail for a variety of reasons.
Check the following:
.RS
.IP a)
Make sure the SRILM environment variable is set, or specified on the
make command line, e.g.:
.nf
make SRILM=$PWD
.fi
.IP b)
Make sure the
.B $SRILM/sbin/machine-type
script returns a valid string for the platform you are trying to build on.
Known platforms have machine-specific makefiles called
.nf
$SRILM/common/Makefile.machine.$MACHINE_TYPE
.fi
If
.B machine-type
does not work for some reason, you can override its output on the command line:
.nf
make MACHINE_TYPE=xyz
.fi
If you are building for an unsupported platform, create a new machine-specific
makefile and mail it to stolcke@speech.sri.com.
.IP c)
Make sure your compiler works and is invoked correctly.
You will probably have to edit the
.B CC
and
.B CXX
variables in the platform-specific makefile.
If you have questions about compiler invocation and best options
consult a local expert; these things differ widely between sites.
.IP d)
The default is to compile with Tcl support.
This is in fact only used for some testing binaries (which are
not built by default),
so it can be turned off if Tcl is not available or presents problems.
Edit the machine-specific makefile accordingly.
To use Tcl, locate the
.B tcl.h
header file and the library itself, and set (for example)
.nf
TCL_INCLUDE = -I/path/to/include
TCL_LIBRARY = -L/path/to/lib -ltcl8.4
.fi
To disable Tcl support set
.nf
NO_TCL = X
TCL_INCLUDE =
TCL_LIBRARY =
.fi
.IP e)
Make sure you have the C-shell (/bin/csh) installed on your system.
Otherwise you will see something like
.nf
make: /sbin/machine-type: Command not found
.fi
early in the build process.
On Ubuntu Linux and Cygwin systems "csh" or "tcsh" needs to be installed
as an optional package.
.IP f)
If you cannot get SRILM to build, save the make output to a file
.nf
make World >& make.output
.fi
and look for messages indicating errors.
If you still cannot figure out what the problem is, send the error message
and immediately preceding lines to the srilm-user list.
Also include information about your operating system ("uname -a" output)
and compiler version ("gcc -v" or equivalent for other compilers).
.RE
.TP
.B A2) The regression test outputs differ for all tests. What did I do wrong?
Most likely the binaries didn't get built or aren't executable
for some reason.
Check issue A1).
.TP
.B A3) I get differing outputs for some of the regression tests. Is that OK?
It might be.
The comparison of reference to actual output allows for small numerical
differences, but
some of the algorithms make hard decisions based on floating-point computations
that can result in different outputs as a result of different compiler
optimizations, machine floating point precisions (Intel versus IEEE format),
and math libraries.
Tests of this nature include
.BR ngram-class ,
.BR disambig ,
and
.BR nbest-rover .
When encountering differences, compare the output in the
$SRILM/test/outputs/\fITEST\fP.$MACHINE_TYPE.stdout file against the corresponding
$SRILM/test/reference/\fITEST\fP.stdout, where
.I TEST
is the name of the test that failed.
Also compare the corresponding .stderr files;
differences there usually indicate operating-system related problems.
.SS Large data and memory issues
.TP 4
.B B1) I'm getting a message saying ``Assertion `body != 0' failed.''
You are running out of memory.
See subsequent questions depending on what you are trying to do.
.IP Note:
The above message means you are running
out of "virtual" memory on your computer, which could be because of
limits in swap space, administrative resource limits, or limitations of
the machine architecture (a 32-bit machine cannot address more than
4GB no matter how many resources your system has).
Another symptom of not enough memory is that your program runs, but
very, very slowly, i.e., it is "paging" or "swapping" as it tries to
use more memory than the machine has RAM installed.
.TP
.B B2) I am trying to count N-grams in a text file and running out of memory.
Don't use
.B ngram-count
directly to count N-grams.
Instead, use the
.B make-batch-counts
and
.B merge-batch-counts
scripts described in
.BR training-scripts (1).
That way you can create N-gram counts limited only by the maximum file size
on your system.
.TP
.B B3) I am trying to build an N-gram LM and ngram-count runs out of memory.
You are running out of memory either because of the size of the N-gram counts
or because of the size of the LM being built. The following are strategies for reducing the
memory requirements for training LMs.
.RS
.IP a)
Assuming you are using Good-Turing or Kneser-Ney discounting, don't use
.B ngram-count
in "raw" form.
Instead, use the
.B make-big-lm
wrapper script described in the
.BR training-scripts (1)
man page.
.IP b)
Switch to using the "_c" or "_s" versions of the SRI binaries.
For
instructions on how to build them, see the INSTALL file.
Once built, set your executable search path accordingly, and try
.B make-big-lm
again.
.IP c)
Raise the minimum counts for N-grams included in the LM, i.e.,
the values of the options
.BR \-gt2min ,
.BR \-gt3min ,
.BR \-gt4min ,
etc.
The higher order N-grams typically get higher minimum counts.
.IP d)
Get a machine with more memory.
If you are hitting the limitations of a 32-bit machine architecture,
get a 64-bit machine and recompile SRILM to take advantage of the expanded
address space.
(The MACHINE_TYPE=i686-m64 setting is for systems based on
64-bit AMD processors, as well as recent compatibles from Intel.)
Note that 64-bit pointers will require a memory overhead in
themselves, so you will need a machine with significantly, not just a
little, more memory than 4GB.
.RE
.TP
.B B4) I am trying to apply a large LM to some data and am running out of memory.
Again, there are several strategies to reduce memory requirements.
.RS
.IP a)
Use the "_c" or "_s" versions of the SRI binaries.
See 3b) above.
.IP b)
Precompute the vocabulary of your test data and use the
.B "ngram \-limit-vocab"
option to load only the N-gram parameters relevant to your data.
This approach should allow you to use arbitrarily
large LMs provided the data is divided into small enough chunks.
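For example (the file names here are only placeholders):
.nf
ngram-count -text test.txt -write-vocab test.vocab
ngram -lm big.lm -vocab test.vocab -limit-vocab -ppl test.txt
.fi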
.IP c)
If the LM can be built on a large machine, but then is to be used on
machines with limited memory, use
.B "ngram \-prune"
to remove the less important parameters of the model.
This usually gives huge size reductions with relatively modest performance
degradation.
The tradeoff is adjustable by varying the pruning parameter.
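For example (the file names and the threshold value are only placeholders):
.nf
ngram -lm big.lm -prune 1e-8 -write-lm pruned.lm
.fi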
.RE
.TP
.B B5) How can I reduce the time it takes to load large LMs into memory?
The techniques described in 4b) and 4c) above also reduce the load time
of the LM.
Additional steps to try are:
.RS
.IP a)
Convert the LM into binary format, using
.nf
ngram -order \fIN\fP -lm \fIOLDLM\fP -write-bin-lm \fINEWLM\fP
.fi
(This is currently only supported for N-gram-based LMs.)
You can also generate the LM directly in binary format, using
.nf
ngram-count ... -lm \fINEWLM\fP -write-binary-lm
.fi
The resulting
.I NEWLM
file (which should not be compressed) can be used
in place of a textual LM file with all compiled SRILM tools
(but not with
.BR lm-scripts (1)).
The format is machine-independent, i.e., it can be read on machines with
different word sizes or byte-order.
Loading binary LMs is faster because
(1) it reduces the overhead of parsing the input data, and
(2) in combination with
.B \-limit-vocab
(see 4b)
it is much faster to skip sections of the LM that are out-of-vocabulary.
.IP Note:
There is also a binary format for N-gram counts.
It can be generated using
.nf
ngram-count -write-binary \fICOUNTS\fP
.fi
and has similar advantages as binary LM files.
.IP b)
Start a "probability server" that loads the LM ahead of time, and
then have "LM clients" query the server instead of computing the
probabilities themselves.
.br
The server is started on a machine named
.I HOST
using
.nf
ngram \fILMOPTIONS\fP -server-port \fIP\fP &
.fi
where
.I P
is an integer < 2^16 that specifies the TCP/IP port number the
server will listen on, and
.I LMOPTIONS
are whatever options necessary to define the LM to be used.
.br
One or more clients (programs such as
.BR ngram (1),
.BR disambig (1),
.BR lattice-tool (1))
can then query the server using the options
.nf
-use-server \fIP\fP@\fIHOST\fP -cache-served-ngrams
.fi
instead of the usual "-lm \fIFILE\fP".
The
.B \-cache-served-ngrams
option is not required but often speeds up performance dramatically by
saving the results of server lookups in the client for reuse.
Server-based LMs may be combined with file-based LMs by interpolation;
see
.BR ngram (1)
for details.
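For example, assuming a placeholder LM file big.lm, port 2525, and server
host lmhost:
.nf
ngram -lm big.lm -server-port 2525 &
ngram -use-server 2525@lmhost -cache-served-ngrams -ppl test.txt
.fi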
.RE
.TP
.B B6) How can I use the Google Web N-gram corpus to build an LM?
Google has made a corpus of 5-grams extracted from 1 tera-words of web data
available via LDC.
However, the data is too large to build a standard backoff N-gram, even
using the techniques described above.
Instead, we recommend a "count-based" LM smoothed with deleted interpolation.
Such an LM computes probabilities on the fly from the counts, of which only
the subsets needed for a given test set need to be loaded into memory.
LM construction proceeds in the following steps:
.RS
.IP a)
Make sure you have built SRI binaries either for a 64-bit machine
(e.g., MACHINE_TYPE=i686-m64 OPTION=_c) or using 64-bit counts (OPTION=_l).
This is necessary because the data contains N-gram counts exceeding
the range of 32-bit integers.
Be sure to invoke all commands below using the path to the appropriate
binary executable directory.
.IP b)
Prepare a mapping file for some vocabulary mismatches and call it
.BR google.aliases :
.nf
<S> <s>
</S> </s>
<UNK> <unk>
.fi
.IP c)
Prepare an initial count-LM parameter file
.BR google.countlm.0 :
.nf
order 5
vocabsize 13588391
totalcount 1024908267229
countmodulus 40
mixweights 15
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
google-counts \fIPATH\fP
.fi
where
.I PATH
points to the location of the Google N-grams, i.e., the directory containing
subdirectories "1gms", "2gms", etc.
Note that the
.B vocabsize
and
.B totalcount
were obtained from the 1gms/vocab.gz and 1gms/total files, respectively.
(Check that they match and modify as needed.)
For an explanation of the parameters see the
.BR ngram (1)
.B \-count-lm
option.
.IP d)
Prepare a text file
.B tune.text
containing data for estimating the mixture weights.
This data should be representative of, but different from, your test data.
Compute the vocabulary of this data using
.nf
ngram-count -text tune.text -write-vocab tune.vocab
.fi
The vocabulary size should not exceed a few thousand to keep memory
requirements in the following steps manageable.
.IP e)
Estimate the mixture weights:
.nf
ngram-count -debug 1 -order 5 -count-lm \\
-text tune.text -vocab tune.vocab \\
-vocab-aliases google.aliases \\
-limit-vocab \\
-init-lm google.countlm.0 \\
-em-iters 100 \\
-lm google.countlm
.fi
This will write the estimated LM to
.BR google.countlm .
The output will be identical to the initial LM file, except for the
updated interpolation weights.
.IP f)
Prepare a test data file
.BR test.text ,
and its vocabulary
.B test.vocab
as in Step d) above.
Then apply the LM to the test data:
.nf
ngram -debug 2 -order 5 -count-lm \\
-lm google.countlm \\
-vocab test.vocab \\
-vocab-aliases google.aliases \\
-limit-vocab \\
-ppl test.text > test.ppl
.fi
The perplexity output will appear in
.BR test.ppl .
.IP g)
Note that the Google data uses mixed case spellings.
To apply the LM to lowercase data one needs to prepare a much more
extensive vocabulary mapping table for the
.B \-vocab-aliases
option, namely, one that maps all
upper- and mixed-case spellings to lowercase strings.
This mapping file should be restricted to the words appearing in
.B tune.text
and
.BR test.text ,
respectively, to avoid defeating the effect of
.B \-limit-vocab .
.RE
.SS "Smoothing issues"
.TP 4
.B C1) What is smoothing and discounting all about?
.I Smoothing
refers to methods that assign probabilities to events (N-grams) that
do not occur in the training data.
According to a pure maximum-likelihood estimator these events would have
probability zero, which is plainly wrong since previously unseen events
in general do occur in independent test data.
Because the probability mass is redistributed away from the seen events
toward the unseen events the resulting model is "smoother" (closer to uniform)
than the ML model.
.I Discounting
refers to the approach used by many smoothing methods of adjusting the
empirical counts of seen events downwards.
The ML estimator (count divided by total number of events) is then applied
to the discounted count, resulting in a smoother estimate.
.TP
.B C2) What smoothing methods are there?
There are many, and SRILM implements a fairly large selection of the
most popular ones.
A detailed discussion of these is found in a separate document,
.BR ngram-discount (7).
.TP
.B C3) Why am I getting errors or warnings from the smoothing method I'm using?
The Good-Turing and Kneser-Ney smoothing methods rely on statistics called
"count-of-counts", the number of words occurring one, twice, three times, etc.
The formulae for these methods become undefined if the counts-of-counts
are zero, or not strictly decreasing.
Some conditions are fatal (such as when the count of singleton words is zero),
others lead to less smoothing (and warnings).
To avoid these problems, check for the following possibilities:
.RS
.IP a)
The data could be very sparse, i.e., the training corpus very small.
Try using the Witten-Bell discounting method.
.IP b)
The vocabulary could be very small, such as when training an LM based on
characters or parts-of-speech.
Smoothing is less of an issue in those cases, and the Witten-Bell method
should work well.
.IP c)
The data was manipulated in some way, or artificially generated.
For example, duplicating data eliminates the odd-numbered counts-of-counts.
.IP d)
The vocabulary is limited during counts collection using the
.BR ngram-count
.B \-vocab
option, with the effect that many low-frequency N-grams are eliminated.
The proper approach is to compute smoothing parameters on the full vocabulary.
This happens automatically in the
.B make-big-lm
wrapper script, which is preferable to direct use of
.BR ngram-count
for other reasons (see issue B3-a above).
.IP e)
You are estimating an LM from N-gram counts that have been truncated beforehand,
e.g., by removing singleton events.
If you cannot go back to the original data and recompute the counts
there is a heuristic to extrapolate low counts-of-counts from higher ones.
The heuristic is invoked automatically (and an informational message is output)
when
.B make-big-lm
is used to estimate LMs with Kneser-Ney smoothing.
For details see the paper by W. Wang et al. in ASRU-2007, listed under
"SEE ALSO".
.RE
.TP
.B C4) How does discounting work in the case of unigrams?
First, unigrams are discounted in the same way as higher-order
N-grams, using the specified discounting method.
The probability mass freed up in this way
is then either spread evenly over all word types
that would otherwise have zero probability (this is essentially
simulating a backoff to zero-grams), or
if all unigrams already have non-zero probabilities, the
left-over mass is added to
.I all
unigrams.
In either case all unigram probabilities will sum to 1.
An informational message from
.B ngram-count
will tell which case applies.
.TP
.B C5) Why do I get a different number of trigrams when building a 4-gram model compared to just a trigram model?
This can happen when Kneser-Ney smoothing is used and the trigram cut-off
.RB ( \-gt3min )
is greater than 1 (as with the default, 2).
The count cutoffs are applied to the modified counts generated as part of KN smoothing,
so in the case of a 4-gram model the trigram counts are modified and the set of N-grams above the cutoff changes.
.SS "Out-of-vocabulary, zeroprob, and `unknown' words"
.TP 4
.B D1) What is the perplexity of an OOV (out of vocabulary) word?
By default any word not observed in the training data is considered
OOV, and OOV words are silently ignored by
.BR ngram (1)
during perplexity (ppl) calculation.
For example:
.nf
$ ngram-count -text turkish.train -lm turkish.lm
$ ngram -lm turkish.lm -ppl turkish.test
file turkish.test: 61031 sentences, 1000015 words, 34153 OOVs
0 zeroprobs, logprob= -3.20177e+06 ppl= 1311.97 ppl1= 2065.09
.fi
The statistics printed in the last two lines have the following meanings:
.RS
.TP
.B "34153 OOVs"
This is the number of unknown word tokens, i.e. tokens
that appear in
.B turkish.test
but not in
.B turkish.train
from which
.B turkish.lm
was generated.
.TP
.B "logprob= -3.20177e+06"
This gives us the total logprob ignoring the 34153 unknown word tokens.
The logprob does include the probabilities
assigned to </s> tokens which are introduced by
.BR ngram-count (1).
Thus the total number of tokens which this logprob is based on is
.nf
words - OOVs + sentences = 1000015 - 34153 + 61031
.fi
.TP
.B "ppl = 1311.97"
This gives us the geometric average of 1/probability of
each token, i.e., perplexity.
The exact expression is:
.nf
ppl = 10^(-logprob / (words - OOVs + sentences))
.fi
.TP
.B "ppl1 = 2065.09"
This gives us the average perplexity per word excluding the </s> tokens.
The exact expression is:
.nf
ppl1 = 10^(-logprob / (words - OOVs))
.fi
.RE
You can verify these numbers by running the
.B ngram
program with the
.B "\-debug 2"
option, which gives the probability assigned to each token.
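For instance, plugging the numbers from the example above into these expressions
(the reported logprob is rounded, so the results agree only approximately):
.nf
ppl  = 10^(3.20177e+06 / (1000015 - 34153 + 61031)) = 10^(3201770/1026893) ~= 1312
ppl1 = 10^(3.20177e+06 / (1000015 - 34153)) = 10^(3201770/965862) ~= 2065
.fi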
.TP
.B D2) What happens when the OOV word is in the context of an N-gram?
Exact details depend on the discounting algorithm used, but typically
the backed-off probability from a lower order N-gram is used. If the
.B \-unk
option is used as explained below, an <unk> token is assumed to
take the place of the OOV word and no back-off may be necessary
if a corresponding N-gram containing <unk> is found in the LM.
.TP
.B D3) Isn't it wrong to assign 0 logprob to OOV words?
That depends on the application.
If you are comparing multiple language
models which all consider the same set of words as OOV, it may be OK to
ignore OOV words.
Note that perplexity comparisons are only ever meaningful
if the vocabularies of all LMs are the same.
Therefore, to compare LMs with different sets of OOV words
(such as when using different tokenization strategies for morphologically
complex languages) it becomes important
to take into account the true cost of the OOV words, or to model all words,
including OOVs.
.TP
.B D4) How do I take into account the true cost of the OOV words?
A simple strategy is to "explode" the OOV words, i.e., split them into
characters in the training and test data.
Typically words that appear more than once in the training data are
considered to be vocabulary words.
All other words are split into their characters and the
individual characters are considered tokens.
Assuming that all characters occur at least once in the training data there
will be no OOV tokens in the test data.
Note that this strategy changes the number of tokens in the data set,
so even though the logprob is meaningful, be careful when reporting ppl results.
.TP
.B D5) What if I want to model the OOV words explicitly?
Maybe a better strategy is to have a separate "letter" model for OOV words.
This can be easily created using SRILM by using a training
file listing the OOV words one per line with their characters
separated by spaces.
The
.B ngram-count
options
.B \-ukndiscount
and
.B "\-order 7"
seem to work well for this purpose.
The final logprob results are obtained in two steps.
First do regular training and testing on your data using
.B \-vocab
and
.B \-unk
options.
The resulting logprob will include the cost of the vocabulary words and an
<unk> token for each OOV word.
Then apply the letter model to each OOV word in the test set.
Add the logprobs.
Here is an example:
.nf
# Determine vocabulary:
ngram-count -text turkish.train -write-order 1 -write turkish.train.1cnt
awk '$2>1' turkish.train.1cnt | cut -f1 | sort > turkish.train.vocab
awk '$2==1' turkish.train.1cnt | cut -f1 | sort > turkish.train.oov
# Word model:
ngram-count -kndiscount -interpolate -order 4 -vocab turkish.train.vocab -unk -text turkish.train -lm turkish.train.model
ngram -order 4 -unk -lm turkish.train.model -ppl turkish.test > turkish.test.ppl
# Letter model:
perl -C -lne 'print join(" ", split(""))' turkish.train.oov > turkish.train.oov.split
ngram-count -ukndiscount -interpolate -order 7 -text turkish.train.oov.split -lm turkish.train.oov.model
perl -pe 's/\\s+/\\n/g' turkish.test | sort > turkish.test.words
comm -23 turkish.test.words turkish.train.vocab > turkish.test.oov
perl -C -lne 'print join(" ", split(""))' turkish.test.oov > turkish.test.oov.split
ngram -order 7 -ppl turkish.test.oov.split -lm turkish.train.oov.model > turkish.test.oov.ppl
# Add the logprobs in turkish.test.ppl and turkish.test.oov.ppl.
.fi
Again, perplexities are not directly meaningful as computed by SRILM, but you
can recompute them by hand using the combined logprob value, and the number of
original word tokens in the test set.
.TP
.B D6) What are zeroprob words and when do they occur?
In-vocabulary words that get zero probability are counted as
"zeroprobs" in the ppl output.
As with OOV words, they are excluded from the perplexity
computation since otherwise the perplexity value would be infinity.
There are three reasons why zeroprobs could occur in a
closed vocabulary setting (the default for SRILM):
.RS
.IP a)
If the same vocabulary is used at test time as was used during
training, and smoothing is enabled, then the occurrence of zeroprobs
indicates an anomalous condition and, possibly, a broken language model.
.IP b)
If smoothing has been disabled (e.g., by using the option
.BR "\-cdiscount 0" ),
then the LM will use maximum likelihood estimates for
the N-grams and then any unseen N-gram is a zeroprob.
.IP c)
If a different vocabulary file is specified at test time than
the one used in training, then the definition of what counts as an OOV
will change.
In particular, a word that wasn't seen in the training data (but is in the
test vocabulary) will
.I not
be mapped to
.B <unk>
and, therefore, not
count as an OOV in the perplexity computation.
However, it will still get zero probability and, therefore, be tallied
as a zeroprob.
.RE
.TP
.B D7) What is the point of using the \fB<unk>\fP token?
Using
.B <unk>
is a practical convenience employed by SRILM.
Words not in the specified vocabulary are mapped to
.BR <unk> ,
which is equivalent to performing the same mapping
in a data pre-processing step outside of SRILM.
Other than that,
for both LM estimation and evaluation purposes,
.B <unk>
is treated like any other word.
(In particular, in the computation of discounted probabilities
there is no special handling of
.BR <unk> .)
.TP
.B D8) So how do I train an open-vocabulary LM with \fB<unk>\fP?
First, make sure to use the
.B ngram-count
.B \-unk
option, which simply indicates that the
.B <unk>
word should be included in the LM vocabulary, as required for an
open-vocabulary LM.
Without this option, N-grams containing
.B <unk>
would simply be discarded.
An "open vocabulary" LM is simply one that contains
.BR <unk> ,
and can therefore (by virtue of the mapping of OOVs to
.BR <unk> )
assign a non-zero probability to them.
Next, we need to ensure there are actual occurrences of
.B <unk>
N-grams
in the training data so we can obtain meaningful probability estimates
for them
(otherwise
.B <unk>
would only get probability via unigram discounting; see item C4).
To get a proper estimate
of the
.B <unk>
probability, we need to explicitly specify a vocabulary that is not
a superset of the training data.
One way to do that is to extract the vocabulary from an independent
data set, or to include only words with some minimum count (greater than 1)
in the training data.
.TP
.B D9) Doesn't ngram-count \-addsmooth deal with OOV words by adding a constant to occurrence counts?
No, all smoothing is applied when building the LM at training time,
so it must use the
.B <unk>
mechanism to assign probability to words that are first seen in the
test data.
Furthermore, even add-constant smoothing requires a fixed, finite
vocabulary to compute the denominator of its estimator.
.SH "SEE ALSO"
ngram(1), ngram-count(1), training-scripts(1), ngram-discount(7).
.br
$SRILM/INSTALL
.br
http://www.speech.sri.com/projects/srilm/mail-archive/srilm-user/
.br
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13
.br
W. Wang, A. Stolcke, & J. Zheng,
Reranking Machine Translation Hypotheses With Structured and Web-based Language Models. Proc. IEEE Automatic Speech Recognition and Understanding Workshop, pp. 159-164, Kyoto, 2007.
http://www.speech.sri.com/cgi-bin/run-distill?papers/asru2007-mt-lm.ps.gz
.SH BUGS
This document is work in progress.
.SH AUTHOR
Andreas Stolcke <andreas.stolcke@microsoft.com>,
Deniz Yuret <dyuret@ku.edu.tr>,
Nitin Madnani <nmadnani@umiacs.umd.edu>
.br
Copyright (c) 2007\-2010 SRI International
.br
Copyright (c) 2011\-2017 Andreas Stolcke
.br
Copyright (c) 2011\-2017 Microsoft Corp.