.\" $Id: lm-scripts.1,v 1.9 2019/09/09 22:35:36 stolcke Exp $
.TH lm-scripts 1 "$Date: 2019/09/09 22:35:36 $" "SRILM Tools"
.SH NAME
lm-scripts, add-dummy-bows, change-lm-vocab, empty-sentence-lm, get-unigram-probs, make-hiddens-lm, make-lm-subset, make-sub-lm, remove-lowprob-ngrams, reverse-lm, sort-lm \- manipulate N-gram language models
.SH SYNOPSIS
.nf
\fBadd-dummy-bows\fP [ \fIlm-file\fP ] \fB>\fP \fInew-lm-file\fP
\fBchange-lm-vocab\fP \fB\-vocab\fP \fIvocab\fP \fB\-lm\fP \fIlm-file\fP \fB\-write-lm\fP \fInew-lm-file\fP \\
[ \fB\-tolower\fP ] [ \fB\-subset\fP ] [ \fIngram-options\fP ... ]
\fBempty-sentence-lm\fP \fB\-prob\fP \fIp\fP \fB\-lm\fP \fIlm-file\fP \fB\-write-lm\fP \fInew-lm-file\fP \\
[ \fIngram-options\fP ... ]
\fBget-unigram-probs\fP [ \fBlinear=1\fP ] [ \fIlm-file\fP ]
\fBmake-hiddens-lm\fP [ \fIlm-file\fP ] \fB>\fP \fIhiddens-lm-file\fP
\fBmake-lm-subset\fP \fIcount-file\fP|\fB\-\fP [ \fIlm-file\fP|\fB\-\fP ] \fB>\fP \fInew-lm-file\fP
\fBmake-sub-lm\fP [ \fBmaxorder=\fP\fIN\fP ] [ \fIlm-file\fP ] \fB>\fP \fInew-lm-file\fP
\fBremove-lowprob-ngrams\fP [ \fIlm-file\fP ] \fB>\fP \fInew-lm-file\fP
\fBreverse-lm\fP [ \fIlm-file\fP ] \fB>\fP \fInew-lm-file\fP
\fBsort-lm\fP [ \fIlm-file\fP ] \fB>\fP \fIsorted-lm-file\fP
.fi
.SH DESCRIPTION
These scripts perform various useful manipulations on N-gram models
in their textual representation.
Most operate on backoff N-grams in ARPA
.BR ngram-format (5).
.PP
Since these tools are implemented as scripts, they do not handle
compressed model files automatically, unlike the main SRILM tools.
However, since most scripts read from standard input or write to
standard output (when the file argument is omitted, or specified as
``-''), it is easy to combine them with
.BR gunzip (1)
or
.BR gzip (1)
on the command line.
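.PP
For example, a compressed model can be processed with a pipeline like
the following (the file names are illustrative):
.nf
	# read a gzipped LM from stdin, write a gzipped result
	gunzip -c old.lm.gz | add-dummy-bows | gzip > new.lm.gz
.fi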
.PP
Also note that many of the scripts take their options with the
.BR gawk (1)
syntax
.IB option = value
instead of the more common
.BI - option
.IR value .
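.PP
For example,
.B make-sub-lm
takes its maximum N-gram order like this (file names are
illustrative):
.nf
	# gawk-style option assignment, not "-maxorder 2"
	make-sub-lm maxorder=2 trigram.lm > bigram.lm
.fi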
.PP
.B add-dummy-bows
adds dummy backoff weights to N-grams, even where they
are not required, to satisfy some broken software that expects
backoff weights on all N-grams (except those of highest order).
.PP
.B change-lm-vocab
modifies the vocabulary of an LM to be that in
.IR vocab .
Any N-grams containing out-of-vocabulary words are removed,
new words receive a unigram probability, and the model
is renormalized.
The
.B \-tolower
option causes case distinctions to be ignored.
.B \-subset
only removes words from the LM vocabulary, without adding any.
Any remaining
.I ngram-options
are passed to
.BR ngram (1),
and can be used to set debugging level, N-gram order, etc.
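.PP
For example, to restrict a model to a new vocabulary while ignoring
case (file names are illustrative):
.nf
	# removes OOV N-grams, adds new words, renormalizes
	change-lm-vocab -vocab new.vocab -lm old.lm \\
		-write-lm new.lm -tolower
.fi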
.PP
.B empty-sentence-lm
modifies an LM so that it allows the empty sentence with
probability
.IR p .
This is useful to modify existing LMs that are trained on non-empty
sentences only.
.I ngram-options
are passed to
.BR ngram (1),
and can be used to set debugging level, N-gram order, etc.
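.PP
For example, to allow the empty sentence with probability 0.01 (an
illustrative value; file names are also illustrative):
.nf
	empty-sentence-lm -prob 0.01 -lm old.lm -write-lm new.lm
.fi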
.PP
.B make-hiddens-lm
constructs an N-gram model that can be used with the
.B ngram \-hiddens
option.
The new model contains intra-utterance sentence boundary
tags ``<#s>'' with the same probability as the original model
assigned to the final sentence tag ``</s>''.
Also, utterance-initial words are not conditioned on ``<s>'', and
there is no penalty associated with utterance-final ``</s>''.
Such a model might work better if the test corpus is segmented
at places other than proper ``<s>'' boundaries.
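.PP
For example (illustrative file names):
.nf
	# build a model for later use with the ngram -hiddens option
	make-hiddens-lm old.lm > hiddens.lm
.fi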
.PP
.B make-lm-subset
forms a new LM containing only the N-grams found in the
.IR count-file ,
in
.BR ngram-count (1)
format.
The result still needs to be renormalized with
.B ngram -renorm
(which will also adjust the N-gram counts in the header).
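.PP
For example, to restrict a model to the N-grams listed in a count
file and renormalize the result (file names are illustrative):
.nf
	make-lm-subset subset.counts old.lm | \\
		ngram -lm - -renorm -write-lm new.lm
.fi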
.PP
.B make-sub-lm
removes N-grams of order exceeding
.IR N .
This function is now redundant, since
all SRILM tools can do this implicitly (without using extra memory,
and with only a small time overhead) when reading N-gram models
with the appropriate
.B \-order
parameter.
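.PP
For example, the bigram part of a trigram model can be extracted
directly while reading (illustrative file names):
.nf
	# equivalent to make-sub-lm maxorder=2 trigram.lm
	ngram -order 2 -lm trigram.lm -write-lm bigram.lm
.fi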
.PP
.B remove-lowprob-ngrams
eliminates N-grams whose probability is lower than that which they
would receive through backoff.
This is useful when building finite-state networks for N-gram
models.
However, this function is now performed much faster by
.BR ngram (1)
with the
.B \-prune-lowprobs
option.
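.PP
The faster equivalent (illustrative file names):
.nf
	ngram -lm old.lm -prune-lowprobs -write-lm new.lm
.fi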
.PP
.B reverse-lm
produces a new LM that assigns each sentence the same probability
that the input model assigns to its word-reversed version.
.PP
.B sort-lm
sorts the N-grams in an LM in lexicographic order (left-most words being
the most significant).
This is not a requirement for SRILM, but might be necessary for some
other LM software.
(The LMs output by SRILM are sorted somewhat differently, reflecting
the internal data structures used; that is also the order that should give
best cache utilization when using SRILM to read models.)
.PP
.B get-unigram-probs
extracts the unigram probabilities in a simple table format
from a backoff language model.
The
.B linear=1
option causes probabilities to be output on a linear (instead of log) scale.
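.PP
For example (with an illustrative model file name):
.nf
	# probabilities on a linear rather than log scale
	get-unigram-probs linear=1 my.lm > unigrams.txt
.fi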
.SH "SEE ALSO"
.BR ngram-format (5),
.BR ngram (1).
.SH BUGS
These are quick-and-dirty scripts; what do you expect?
.br
.B reverse-lm
supports only bigram LMs, and can produce improper probability estimates
as a result of inconsistent marginals in the input model.
.SH AUTHOR
Andreas Stolcke <stolcke@icsi.berkeley.edu>
.br
Copyright (c) 1995\-2006 SRI International