SRI Object-oriented Language Modeling Libraries and Tools

BUILD AND INSTALL

See the INSTALL file in the top-level directory.

FILES

All mixed-case .cc and .h files contain C++ code for the liboolm
library.
All lower-case .cc files in this directory correspond to executable commands.
Each program is option-driven. The -help option prints a list of
available options and their meaning.
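
For example, the options understood by ngram-count can be listed with

    ngram-count -help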

N-GRAM MODELS

Here we only cover arbitrary-order n-grams (no class-based models yet, sorry).

    ngram-count   manipulates n-gram counts and
                  estimates backoff models
    ngram-merge   merges count files (only needed for large
                  corpora/vocabularies)
    ngram         computes probabilities given a backoff model
                  (also does mixing of backoff models)

Below are some typical command lines.

NGRAM COUNT MANIPULATION

    ngram-count -order 4 -text corpus -write corpus.ngrams

Counts all n-grams up to length 4 in corpus and writes them to
corpus.ngrams. The default n-gram order (if not specified) is 3.
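
The count file is plain text: each line holds one n-gram followed by its
count, separated by whitespace. For orientation (the counts below are
invented), the file might contain lines such as

    the                  12043
    the quick            17
    the quick brown      3
    the quick brown fox  1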

    ngram-count -vocab 30k.vocab -order 4 -text corpus -write corpus.ngrams

Same, but restricts the vocabulary to what is listed in 30k.vocab
(one word per line). Other words are replaced with the <unk> token.

    ngram-count -read corpus.ngrams -vocab 20k.vocab -write-order 3 \
        -write corpus.20k.3grams

Reads the counts from corpus.ngrams, maps them to the indicated
vocabulary, and writes only the trigrams back to a file.

    ngram-count -text corpus -write1 corpus.1grams -write2 corpus.2grams \
        -write3 corpus.3grams

Writes unigrams, bigrams, and trigrams to separate files, in one
pass. Usually there is no reason to keep these separate, which is
why by default ngrams of all lengths are written out together.
The -recompute flag regenerates the lower-order counts from the
highest-order counts by summation over prefixes.
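
For example, assuming a file corpus.4grams that holds only 4-gram counts
(the file names here are purely illustrative), the lower-order counts can
be rebuilt with

    ngram-count -order 4 -read corpus.4grams -recompute -write corpus.ngrams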

The -read and -text options are additive, and can be used to merge
new counts with old. Furthermore, repeated n-grams read with -read
are also additive. Thus,

    cat counts1 counts2 ... | ngram-count -read -

will merge the counts from counts1, counts2, ...

All file reading and writing uses the zio routines, so the argument "-"
stands for stdin/stdout, and .Z and .gz files are handled correctly.
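
For instance, compressed text can be counted and compressed counts written
without any explicit decompression step (file names are illustrative):

    ngram-count -text corpus.txt.gz -write corpus.ngrams.gz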

For very large count files (due to a large corpus or vocabulary) this
method of merging counts in memory is not suitable. Alternatively,
counts can be sorted and then merged.
First, generate counts for portions of the training corpus and
save them with -sort:

    ngram-count -text part1.text -sort -write part1.ngrams.Z
    ngram-count -text part2.text -sort -write part2.ngrams.Z

Then combine these with

    ngram-merge part?.ngrams.Z > all.ngrams.Z

(Although ngram-merge can deal with any number of files >= 2,
it is most efficient to combine two parts at a time, then,
from the resulting new count files, again two at a time, following
a binary tree merging scheme.)
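
For four parts, such a pairwise scheme would look like this (the
intermediate file names are made up for illustration):

    ngram-merge part1.ngrams.Z part2.ngrams.Z > merge12.ngrams.Z
    ngram-merge part3.ngrams.Z part4.ngrams.Z > merge34.ngrams.Z
    ngram-merge merge12.ngrams.Z merge34.ngrams.Z > all.ngrams.Z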

BACKOFF MODEL ESTIMATION

    ngram-count -order 2 -read corpus.counts -lm corpus.bo

generates a bigram backoff model in ARPA (aka Doug Paul) format
from corpus.counts and writes it to corpus.bo.

If the counts fit into memory (and hence there is no need for the
merging schemes described above) it is more convenient and faster
to go directly from training text to model:

    ngram-count -text corpus -lm corpus.bo

The built-in discounting method used in building backoff models
is Good-Turing. The lower exclusion cutoffs can be set with
the options -gt1min ... -gt6min; the upper discounting cutoffs are
selected with -gt1max ... -gt6max. Reasonable defaults are
provided and can be displayed as part of the -help output.
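
For example, to drop bigrams and trigrams seen only once and to apply
Good-Turing discounting only to trigram counts of 5 or less (the cutoff
values here are purely illustrative):

    ngram-count -text corpus -gt2min 2 -gt3min 2 -gt3max 5 -lm corpus.bo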

When using a limited vocabulary it is recommended to compute the
discounting coefficients on the unlimited vocabulary (at least for
the unigrams) and then apply them to the limited vocabulary
(otherwise the vocabulary truncation would produce badly skewed
count frequencies at the low end, which would break the GT algorithm).
For this reason, discounting parameters can be saved to files and
read back in.
For example,

    ngram-count -text corpus -gt1 gt1.params -gt2 gt2.params \
        -gt3 gt3.params

saves the discounting parameters for unigrams, bigrams, and trigrams
in the files as indicated. These are short files that can be
edited, e.g., to adjust the lower and upper discounting cutoffs.
Then the limited-vocabulary backoff model is estimated using these
saved parameters:

    ngram-count -text corpus -vocab 20k.vocab \
        -gt1 gt1.params -gt2 gt2.params -gt3 gt3.params -lm corpus.20k.bo

MODEL EVALUATION

The ngram program uses a backoff model to compute probabilities and
perplexity on test data.
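
As a reminder of the standard definition (the tool's exact treatment of
sentence boundaries and OOV words is not spelled out here), the perplexity
over N test words w_1 ... w_N is

    ppl = 10 ^ ( -(1/N) * sum_i log10 P(w_i | history_i) )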

    ngram -lm some.bo -ppl test.corpus

computes the perplexity on test.corpus according to model some.bo.
The flag -debug controls the amount of information output:

    -debug 0      only overall statistics
    -debug 1      statistics for each test sentence
    -debug 2      probabilities for each word
    -debug 3      verify that word probabilities over the
                  entire vocabulary sum to 1 for each context
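
For instance, to print the probability assigned to each word of the test
set:

    ngram -lm some.bo -ppl test.corpus -debug 2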

ngram also understands the -order flag, which sets the maximum ngram
order effectively used by the model. The default is 3.
It has to be set explicitly to use ngrams of higher order, even
if the file specified with -lm contains higher-order ngrams.
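
For example, to evaluate with a 4-gram model (corpus.4bo stands for any
model file containing 4-grams):

    ngram -order 4 -lm corpus.4bo -ppl test.corpus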

The flag -skipoovs establishes compatibility with broken behavior
in some old software. It should only be used with backoff model files
produced with the old tools. It will

- let OOVs be counted as such even when the model has a probability
  for <unk>
- skip not just the OOV but the entire n-gram context in which any
  OOVs occur (instead of backing off on OOV contexts).

OTHER MODEL OPERATIONS

ngram performs a few other operations on backoff models.

    ngram -lm bo1 -mix-lm bo2 -lambda 0.2 -write-lm bo3

produces a new model in bo3 that is the interpolation of bo1 and bo2,
with a weight of 0.2 for bo1.
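
Roughly speaking, before the result is encoded again as a backoff model,
each word probability is combined as

    P_bo3(w | h) = 0.2 * P_bo1(w | h) + 0.8 * P_bo2(w | h)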

    ngram -lm bo -renorm -write-lm bo.new

recomputes the backoff weights in the model bo (thus normalizing
probabilities to 1) and leaves the result in bo.new.

API FOR LANGUAGE MODELS

These programs are just examples of how to use the object-oriented
language model library currently under construction. To use the API
one would have to read the various .h files and see how the interfaces
are used in the example programs. No comprehensive documentation is
available as yet. Sorry.

AVAILABILITY

This code is Copyright SRI International, but is available free of
charge for non-profit use. See the License file in the top-level
directory for the terms of use.

Andreas Stolcke
$Date: 1999/07/31 18:48:33 $