.\" $Id: training-scripts.1,v 1.29 2019/09/09 22:35:37 stolcke Exp $
.TH training-scripts 1 "$Date: 2019/09/09 22:35:37 $" "SRILM Tools"
.SH NAME
training-scripts, compute-oov-rate, continuous-ngram-count, get-gt-counts, make-abs-discount, make-batch-counts, make-big-lm, make-diacritic-map, make-google-ngrams, make-gt-discounts, make-kn-counts, make-kn-discounts, merge-batch-counts, replace-unk-words, replace-words-with-classes, reverse-ngram-counts, split-tagged-ngrams, reverse-text, tolower-ngram-counts, uniform-classes, uniq-ngram-counts, vp2text \- miscellaneous conveniences for language model training
.SH SYNOPSIS
.nf
\fBget-gt-counts\fP \fBmax=\fP\fIK\fP \fBout=\fP\fIname\fP [ \fIcounts\fP ... ] \fB>\fP \fIgtcounts\fP
\fBmake-abs-discount\fP \fIgtcounts\fP
\fBmake-gt-discounts\fP \fBmin=\fP\fImin\fP \fBmax=\fP\fImax\fP \fIgtcounts\fP
\fBmake-kn-counts\fP \fBorder=\fP\fIN\fP \fBmax_per_file=\fP\fIM\fP \fBoutput=\fP\fIfile\fP \\
	[ \fBno_max_order=1\fP ] \fB>\fP \fIcounts\fP
\fBmake-kn-discounts\fP \fBmin=\fP\fImin\fP \fIgtcounts\fP
\fBmake-batch-counts\fP \fIfile-list\fP \\
	[ \fIbatch-size\fP [ \fIfilter\fP [ \fIcount-dir\fP [ \fIoptions\fP ... ] ] ] ]
\fBmerge-batch-counts\fP [ \fB\-float-counts\fP ] [ \fB\-l\fP \fIN\fP ] \fIcount-dir\fP [ \fIfile-list\fP|\fIstart-iter\fP ]
\fBmake-google-ngrams\fP [ \fBdir=\fP\fIDIR\fP ] [ \fBper_file=\fP\fIN\fP ] [ \fBgzip=0\fP ] \\
	[ \fByahoo=1\fP ] [ \fIcounts-file\fP ... ]
\fBcontinuous-ngram-count\fP [ \fBorder=\fP\fIN\fP ] [ \fItextfile\fP ... ]
\fBtolower-ngram-counts\fP [ \fIcounts-file\fP ... ]
\fBuniq-ngram-counts\fP [ \fIcounts-file\fP ... ]
\fBreverse-ngram-counts\fP [ \fIcounts-file\fP ... ]
\fBreverse-text\fP [ \fItextfile\fP ... ]
\fBsplit-tagged-ngrams\fP [ \fBseparator=\fP\fIS\fP ] [ \fIcounts-file\fP ... ]
\fBmake-big-lm\fP \fB\-name\fP \fIname\fP \fB\-read\fP \fIcounts\fP \fB\-lm\fP \fInew-model\fP \\
	[ \fB\-trust-totals\fP ] [ \fB\-max-per-file\fP \fIM\fP ] [ \fB\-ngram-filter\fP \fIfilter\fP ] \\
	[ \fB\-text \fIfile\fP ] [ \fIngram-options\fP ... ]
\fBreplace-unk-words\fP \fBvocab=\fP\fIvocab\fP [ \fItextfile\fP ... ]
\fBreplace-words-with-classes\fP \fBclasses=\fP\fIclasses\fP [ \fBoutfile=\fP\fIcounts\fP ] \\
	[ \fBnormalize=0\fP|\fB1\fP ] [ \fBaddone=\fP\fIK\fP ] [ \fBhave_counts=1\fP ] [ \fBpartial=1\fP ] \\
	[ \fItextfile\fP ... ]
\fBuniform-classes\fP \fIclasses\fP \fB>\fP \fInew-classes\fP
\fBmake-diacritic-map\fP \fIvocab\fP
\fBvp2text\fP [ \fItextfile\fP ... ]
\fBcompute-oov-rate\fP \fIvocab\fP [ \fIcounts\fP ... ]
.fi
.SH DESCRIPTION
These scripts perform convenience tasks associated with the training of
language models.
They complement and extend the basic N-gram model estimator in
.BR ngram-count (1).
.PP
Because these tools are implemented as scripts, they do not read or write
compressed data files transparently, unlike the main
SRILM tools.
However, since most scripts read from standard input or write to
standard output (when the file argument is omitted or given
as ``-''), it is easy to combine them with
.BR gunzip (1)
or
.BR gzip (1)
on the command line.
.PP
Also note that many of the scripts take their options with the
.BR gawk (1)
syntax
.IB option = value
instead of the more common
.BI - option
.IR value .
.PP
.B get-gt-counts
computes the counts-of-counts statistics needed in Good-Turing smoothing.
The frequencies of counts up to
.I K
are computed (default is 10).
The results are stored in a series of files with root
.IR name ,
.BR \fIname\fP.gt1counts ,
.BR \fIname\fP.gt2counts ,
\&...,
.BR \fIname\fP.gt\fIN\fPcounts .
It is assumed that the input counts have been properly merged, i.e.,
that there are no duplicated N-grams.
.PP
.B make-gt-discounts
takes one of the output files of
.B get-gt-counts
and computes the corresponding Good-Turing discounting factors.
The output can then be passed to
.BR ngram-count (1)
via the
.BI \-gt n
options to control the smoothing during model estimation.
Precomputing the GT discounting in this fashion has the advantage that the
GT statistics are not affected by restricting N-grams to a limited vocabulary.
Also,
.BR get-gt-counts / make-gt-discounts
can process arbitrarily large count files, since they do not need to
read the counts into memory (unlike
.BR ngram-count ).
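For example, the trigram discounts of a 3-gram model might be precomputed
as follows (file names are illustrative; lower orders are handled
analogously with
.B \-gt1
and
.BR \-gt2 ):
.nf
get-gt-counts max=10 out=corpus corpus.counts
make-gt-discounts min=1 max=7 corpus.gt3counts > corpus.gt3
ngram-count -order 3 -read corpus.counts -gt3 corpus.gt3 -lm corpus.lm
.fi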
.PP
.B make-abs-discount
computes the absolute discounting constant needed for the
.B ngram-count
.BI \-cdiscount n
options.
Input is one of the files produced by
.BR get-gt-counts .
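For example (file names are illustrative), the bigram constant could be
computed and then passed to
.BR ngram-count (1):
.nf
make-abs-discount corpus.gt2counts
ngram-count -order 3 -read corpus.counts -cdiscount2 \fID\fP -lm corpus.lm
.fi
where \fID\fP stands for the constant printed by the first command.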
.PP
.B make-kn-discounts
computes the discounting constants used by the modified Kneser-Ney
smoothing method.
Input is one of the files produced by
.BR get-gt-counts .
This script also implements a method for extrapolating missing
counts of counts as described in
Wang et al. (2007).
.PP
.B make-batch-counts
performs the first stage in the construction of very large N-gram count
files.
.I file-list
is a list of input text files.
Lines starting with a `#' character are ignored.
These files will be grouped into batches of size
.I batch-size
(default 10)
that are then processed in one run of
.B ngram-count
each.
For maximum performance,
.I batch-size
should be as large as possible without triggering paging.
Optionally, a
.I filter
script or program can be given to condition the input texts.
The N-gram count files are left in directory
.I count-dir
(``counts'' by default), where they can be found by a subsequent
run of
.BR merge-batch-counts .
All following
.I options
are passed to
.BR ngram-count ,
e.g., to control N-gram order, vocabulary, etc.
(no options triggering model estimation should be included).
.PP
.B merge-batch-counts
completes the construction of large count files by merging the
batched counts left in
.I count-dir
until a single count file is produced.
Optionally, a
.I file-list
of count files to combine can be specified; otherwise all count files
in
.I count-dir
from a prior run of
.B make-batch-counts
will be merged.
A number as second argument restarts the merging process at iteration
.IR start-iter .
This is convenient if merging fails to complete for some reason
(e.g., for temporary lack of disk space).
The
.B \-float-counts
option should be specified if the counts are real-valued.
The
.B \-l
option specifies the number of files to merge in each iteration;
the default is 2.
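For example, a large 4-gram count file might be built in two steps as
follows (file and directory names are illustrative; \fB/bin/cat\fP is used
as a no-op filter):
.nf
make-batch-counts file-list 50 /bin/cat counts -order 4
merge-batch-counts counts
.fi
This leaves a single merged count file in the \fIcounts\fP directory.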
.PP
.B make-google-ngrams
takes a sorted count file as input and creates an indexed directory
structure, in a format developed by Google to store very large N-gram
collections.
The resulting directory can then be used with the
.BR ngram-count (1)
.B \-read-google
option.
Optional arguments specify the output directory
.I dir
and the size
.I N
of individual N-gram files
(default is 10 million N-grams per file).
The
.B gzip=0
option writes plain, as opposed to compressed, files.
The
.B yahoo=1
option may be used to read N-gram count files in Yahoo-GALE format.
Note that the count files have to first be sorted lexicographically
in a separate invocation of sort.
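For example (file and directory names are illustrative):
.nf
gunzip -c merged-counts.gz | sort | make-google-ngrams dir=google-counts
.fi
The resulting \fIgoogle-counts\fP directory can then be passed to
.B ngram-count \-read-google
as described above.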
.PP
.B continuous-ngram-count
generates N-grams that span line breaks (which are usually taken to
be sentence boundaries).
To count N-grams across line breaks use
.nf
continuous-ngram-count \fItextfile\fP | ngram-count -read -
.fi
The argument
.I N
controls the order of N-grams counted (default 3), and
should match the argument of
.B ngram-count
.BR \-order .
.PP
.B tolower-ngram-counts
maps an N-gram counts file to all-lowercase.
No merging of N-grams that become identical in the process is done.
.PP
.B uniq-ngram-counts
combines successive counts that pertain to the same N-gram into a single
line.
This only affects repeated N-grams that appear on successive lines, so a prior
sort command is typically used, e.g.,
.br
tolower-ngram-counts \fIINPUT\fP | sort | uniq-ngram-counts > \fIOUTPUT\fP
.br
would do much of the same thing as
.br
ngram-count -read \fIINPUT\fP -tolower -sort -write \fIOUTPUT\fP
.br
but in a more memory-efficient manner (without reading all counts into memory).
.PP
.B reverse-ngram-counts
reverses the word order of N-grams in a counts file or stream.
For example, to recompute lower-order counts from higher-order ones,
but do the summation over preceding words (rather than following words,
as in
.BR ngram-count (1)),
use
.br
reverse-ngram-counts \fIcount-file\fP | \\
.br
ngram-count -read - -recompute -write - | \\
.br
reverse-ngram-counts > \fInew-counts\fP
.br
Also, start-sentence tags are replaced with end-sentence tags, and vice-versa,
so that reverse-direction LMs can be trained from forward-direction N-gram
counts.
.PP
.B reverse-text
reverses the word order in text files, line-by-line.
Start- and end-sentence tags, if present, will be preserved.
This reversal is appropriate for preprocessing training data
for LMs that are meant to be used with the
.B ngram
.BR \-reverse
option.
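For example (file names are illustrative), a reverse-direction trigram LM
might be trained and evaluated as follows:
.nf
reverse-text train.txt | ngram-count -order 3 -text - -lm reversed.lm
ngram -lm reversed.lm -reverse -ppl test.txt
.fi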
.PP
.B split-tagged-ngrams
expands N-gram counts of word/tag pairs into mixed N-grams
of words and tags.
The optional
.BI separator= S
argument allows the delimiting character, which defaults to "/",
to be modified.
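For example, with word/tag pairs delimited by ``_'' instead of the default
``/'' (file names are illustrative):
.br
split-tagged-ngrams separator=_ tagged.counts > mixed.counts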
.PP
.B make-big-lm
constructs large N-gram models in a more memory-efficient way than
.B ngram-count
by itself.
It does so by precomputing the Good-Turing or Kneser-Ney smoothing parameters
from the full set of counts, and then instructing
.B ngram-count
to store only a subset of the counts in memory,
namely those of N-grams to be retained in the model.
The
.I name
parameter is used to name various auxiliary files.
.I counts
contains the raw N-gram counts; it may be (and usually is) a compressed file.
Unlike with
.BR ngram-count ,
the
.B \-read
option can be repeated to concatenate multiple count files, but the arguments
must be regular files; reading from stdin is not supported.
If Good-Turing smoothing is used and the file contains complete lower-order
counts corresponding to the
sums of higher-order counts, then the
.B \-trust-totals
option may be given for efficiency.
The
.B \-text
option specifies a test set to which the LM is to be applied, and
builds the LM in such a way that only N-gram contexts occurring in the
test data are included in the model, thus saving space at the expense of
generality.
All other
.I options
are passed to
.B ngram-count
(only options affecting model estimation should be given).
Smoothing methods other than Good-Turing, modified Kneser-Ney and Witten-Bell are not
supported by
.BR make-big-lm .
Kneser-Ney smoothing also requires enough disk space to compute and store the
modified lower-order counts used by the KN method.
This is done using the
.B merge-batch-counts
command, and the
.B \-max-per-file
option controls how many counts are to be stored per batch, and
should be chosen so that these batches fit in real memory.
The
.B \-ngram-filter
option allows specification of a command through which the input N-gram
counts are piped, e.g., to convert from some non-standard format.
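For example, a 4-gram model with modified Kneser-Ney smoothing might be
built as follows (file names and the estimation options passed on to
.B ngram-count
are illustrative):
.nf
make-big-lm -name biglm -read counts.gz -max-per-file 20000000 \\
    -order 4 -kndiscount -interpolate -unk -lm big.4gram.lm.gz
.fi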
.PP
.B make-kn-counts
computes the modified lower-order counts used by the KN smoothing method.
It is invoked as a helper script by
.B make-big-lm .
.PP
.B replace-unk-words
replaces words not appearing in the
.I vocab
file with the unknown word tag
.BR <unk> .
This is useful for preparing text data for LM training.
Only the first token on each line in the
.I vocab
file is significant, so both word lists and unigram count files may be used.
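For example (file names are illustrative), an open-vocabulary LM might be
trained from mapped text as follows:
.nf
replace-unk-words vocab=top20k.vocab train.txt | \\
    ngram-count -order 3 -unk -text - -lm open-vocab.lm
.fi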
.PP
.B replace-words-with-classes
replaces expansions of word classes with the corresponding class labels.
.I classes
specifies class expansions in
.BR classes-format (5).
Substitutions are performed at each word position in left-to-right order,
with the longest matching right-hand-side of any class expansion.
If several classes match, a pseudo-random choice is made.
Optionally, the file
.I counts
will receive the expansion counts resulting from the replacements.
.B normalize=0
or
.B 1
indicates whether the counts should be normalized to probabilities
(default is 1).
The
.B addone
option may be used to smooth the expansion probabilities by adding
.I K
to each count (default 1).
The option
.B have_counts=1
indicates that the input consists of N-gram counts and that replacement
should be performed on them.
Note that this will not merge counts that have been mapped to identical N-grams,
since this is done automatically when
.BR ngram-count (1)
reads count data.
The option
.B partial=1
prevents multi-word class expansions from being replaced when more than
one space character occurs between the words.
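For example (file names are illustrative):
.nf
replace-words-with-classes classes=my.classes outfile=expansion.counts \\
    train.txt > train.classes.txt
.fi
The rewritten text can then be used to train a class-based N-gram model.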
.PP
.B uniform-classes
takes a file in
.BR classes-format (5)
and adds uniform probabilities to expansions that don't have a probability
explicitly stated.
.PP
.B make-diacritic-map
constructs a map file that pairs an ASCII-fied version of the words in
.I vocab
with all the occurring non-ASCII word forms.
Such a map file can then be used with
.BR disambig (1)
and a language model
to reconstruct the non-ASCII word form with diacritics from an ASCII
text.
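For example (file names are illustrative):
.nf
make-diacritic-map words.vocab > diacritic.map
disambig -map diacritic.map -lm words.lm -text ascii-input.txt > restored.txt
.fi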
.PP
.B vp2text
is a reimplementation of the filter used in the DARPA Hub-3 and Hub-4
CSR evaluations to convert ``verbalized punctuation'' texts to
language model training data.
.PP
.B compute-oov-rate
determines the out-of-vocabulary rate of a corpus from its unigram
.I counts
and a target vocabulary list in
.IR vocab .
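For example (file names are illustrative):
.nf
ngram-count -order 1 -text heldout.txt -write heldout.1grams
compute-oov-rate vocab.txt heldout.1grams
.fi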
.SH "SEE ALSO"
ngram-count(1), ngram(1), classes-format(5), disambig(1), select-vocab(1).
.br
W. Wang, A. Stolcke, J. Zheng, ``Reranking machine translation hypotheses with structured
and web-based language models'', \fIProc. IEEE ASRU Workshop\fP, pp. 159\-164, 2007.
.SH BUGS
Some of the tools could be generalized and/or made more robust to
misuse.
.br
Several of these tools are gawk scripts and, depending on prevailing locale
settings, might require setting LC_NUMERIC=C in the environment.
.SH AUTHOR
Andreas Stolcke <stolcke@icsi.berkeley.edu>
.br
Copyright (c) 1995\-2008 SRI International
.br
Copyright (c) 2013\-2017 Andreas Stolcke
.br
Copyright (c) 2013\-2017 Microsoft Corp.