.\" $Id: lm-scripts.1,v 1.9 2019/09/09 22:35:36 stolcke Exp $
.TH lm-scripts 1 "$Date: 2019/09/09 22:35:36 $" "SRILM Tools"
.SH NAME
lm-scripts, add-dummy-bows, change-lm-vocab, empty-sentence-lm, get-unigram-probs, make-hiddens-lm, make-lm-subset, make-sub-lm, remove-lowprob-ngrams, reverse-lm, sort-lm \- manipulate N-gram language models
.SH SYNOPSIS
.nf
\fBadd-dummy-bows\fP [ \fIlm-file\fP ] \fB>\fP \fInew-lm-file\fP
\fBchange-lm-vocab\fP \fB\-vocab\fP \fIvocab\fP \fB\-lm\fP \fIlm-file\fP \fB\-write-lm\fP \fInew-lm-file\fP \\
    [ \fB\-tolower\fP ] [ \fB\-subset\fP ] [ \fIngram-options\fP ... ]
\fBempty-sentence-lm\fP \fB\-prob\fP \fIp\fP \fB\-lm\fP \fIlm-file\fP \fB\-write-lm\fP \fInew-lm-file\fP \\
    [ \fIngram-options\fP ... ]
\fBget-unigram-probs\fP [ \fBlinear=1\fP ] [ \fIlm-file\fP ]
\fBmake-hiddens-lm\fP [ \fIlm-file\fP ] \fB>\fP \fIhiddens-lm-file\fP
\fBmake-lm-subset\fP \fIcount-file\fP|\fB\-\fP [ \fIlm-file\fP|\fB\-\fP ] \fB>\fP \fInew-lm-file\fP
\fBmake-sub-lm\fP [ \fBmaxorder=\fP\fIN\fP ] [ \fIlm-file\fP ] \fB>\fP \fInew-lm-file\fP
\fBremove-lowprob-ngrams\fP [ \fIlm-file\fP ] \fB>\fP \fInew-lm-file\fP
\fBreverse-lm\fP [ \fIlm-file\fP ] \fB>\fP \fInew-lm-file\fP
\fBsort-lm\fP [ \fIlm-file\fP ] \fB>\fP \fIsorted-lm-file\fP
.fi
.SH DESCRIPTION
These scripts perform various useful manipulations on N-gram models
in their textual representation.
Most operate on backoff N-grams in ARPA
.BR ngram-format (5).
.PP
Since these tools are implemented as scripts, they do not read or write
compressed model files automatically, unlike the main SRILM tools.
However, since most scripts work with data from standard input or
to standard output (by leaving out the file argument, or specifying it
as ``-'') it is easy to combine them with
.BR gunzip (1)
or
.BR gzip (1)
on the command line.
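For example, one way to apply
.B add-dummy-bows
to a compressed model (file names here are placeholders):
.nf
    gunzip \-c old.lm.gz | add-dummy-bows | gzip \-c > new.lm.gz
.fi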
.PP
Also note that many of the scripts take their options with the
.BR gawk (1)
syntax
.IB option = value
instead of the more common
.BI \- option
.IR value .
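For example, one would write (with a placeholder model file)
.nf
    get-unigram-probs linear=1 my.lm
.fi
rather than ``get-unigram-probs \-linear 1 my.lm''.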
.PP
.B add-dummy-bows
adds dummy backoff weights to N-grams, even where they
are not required, to satisfy some broken software that expects
backoff weights on all N-grams (except those of highest order).
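In the simplest case (with placeholder file names):
.nf
    add-dummy-bows trigram.lm > trigram.bows.lm
.fi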
.PP
.B change-lm-vocab
modifies the vocabulary of an LM to be that in
.IR vocab .
Any N-grams containing out-of-vocabulary words are removed,
new words receive a unigram probability, and the model
is renormalized.
The
.B \-tolower
option causes case distinctions to be ignored.
.B \-subset
only removes words from the LM vocabulary, without adding any.
Any remaining
.I ngram-options
are passed to
.BR ngram (1),
and can be used to set debugging level, N-gram order, etc.
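For example, to limit a model to a new vocabulary, keeping only words
already in the LM and ignoring case (file names are placeholders):
.nf
    change-lm-vocab \-vocab small.vocab \-lm big.lm \\
        \-write-lm small.lm \-subset \-tolower
.fi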
.PP
.B empty-sentence-lm
modifies an LM so that it allows the empty sentence with
probability
.IR p .
This is useful to modify existing LMs that are trained on non-empty
sentences only.
.I ngram-options
are passed to
.BR ngram (1),
and can be used to set debugging level, N-gram order, etc.
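For example, to admit the empty sentence with probability 0.01
(file names are placeholders):
.nf
    empty-sentence-lm \-prob 0.01 \-lm train.lm \-write-lm train-empty.lm
.fi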
.PP
.B make-hiddens-lm
constructs an N-gram model that can be used with the
.B ngram \-hiddens
option.
The new model contains intra-utterance sentence boundary
tags ``<#s>'' with the same probability that the original model
assigned to final sentence tags </s>.
Also, utterance-initial words are not conditioned on <s> and
there is no penalty associated with utterance-final </s>.
Such a model might work better if the test corpus is segmented
at places other than proper <s> boundaries.
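A sketch of its use, assuming the
.B \-hiddens
option of
.BR ngram (1)
is combined with
.B \-ppl
as usual (file names are placeholders):
.nf
    make-hiddens-lm trigram.lm > trigram.hiddens.lm
    ngram \-hiddens \-lm trigram.hiddens.lm \-ppl test.txt
.fi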
.PP
.B make-lm-subset
forms a new LM containing only the N-grams found in the
.IR count-file ,
in
.BR ngram-count (1)
format.
The result still needs to be renormalized with
.B ngram \-renorm
(which will also adjust the N-gram counts in the header).
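A possible pipeline, using the
.B \-renorm
option of
.BR ngram (1)
(file names are placeholders):
.nf
    make-lm-subset subset.counts big.lm > subset.lm
    ngram \-order 3 \-lm subset.lm \-renorm \-write-lm subset.renorm.lm
.fi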
.PP
.B make-sub-lm
removes N-grams of order exceeding
.IR N .
This function is now redundant, since
all SRILM tools can do this implicitly (without using extra memory
and with only a small time overhead) when reading N-gram models
with the appropriate
.B \-order
parameter.
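For example, the following two commands should produce essentially the
same bigram model (file names are placeholders):
.nf
    make-sub-lm maxorder=2 trigram.lm > bigram.lm
    ngram \-order 2 \-lm trigram.lm \-write-lm bigram.lm
.fi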
.PP
.B remove-lowprob-ngrams
eliminates N-grams whose probability is lower than that which they
would receive through backoff.
This is useful when building finite-state networks for N-gram
models.
However, this function is now performed much faster by
.BR ngram (1)
with the
.B \-prune-lowprobs
option.
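For example (file names are placeholders):
.nf
    remove-lowprob-ngrams trigram.lm > pruned.lm
.fi
or, much faster:
.nf
    ngram \-lm trigram.lm \-prune-lowprobs \-write-lm pruned.lm
.fi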
.PP
.B reverse-lm
produces a new LM that assigns to each sentence the probability that the
input model gives to the reversed sentence.
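For example (file names are placeholders):
.nf
    reverse-lm bigram.lm > bigram.reversed.lm
.fi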
.PP
.B sort-lm
sorts the N-grams in an LM in lexicographic order (left-most words being
the most significant).
This is not a requirement for SRILM, but might be necessary for some
other LM software.
(The LMs output by SRILM are sorted somewhat differently, reflecting
the internal data structures used; that is also the order that should give
the best cache utilization when SRILM reads models.)
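For example (file names are placeholders):
.nf
    sort-lm trigram.lm > trigram.sorted.lm
.fi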
.PP
.B get-unigram-probs
extracts the unigram probabilities in a simple table format
from a backoff language model.
The
.B linear=1
option causes probabilities to be output on a linear (instead of log) scale.
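For example, to extract unigram probabilities on a linear scale from a
compressed model (file names are placeholders):
.nf
    gunzip \-c big.lm.gz | get-unigram-probs linear=1 > unigrams.txt
.fi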
.SH "SEE ALSO"
.BR ngram (1),
.BR ngram-format (5).
.SH BUGS
These are quick-and-dirty scripts; what do you expect?
.br
.B reverse-lm
supports only bigram LMs, and can produce improper probability estimates
as a result of inconsistent marginals in the input model.
.SH AUTHOR
Andreas Stolcke <stolcke@icsi.berkeley.edu>
.br
Copyright (c) 1995\-2006 SRI International