.\" $Id: multi-ngram.1,v 1.5 2019/09/09 22:35:36 stolcke Exp $
.TH multi-ngram 1 "$Date: 2019/09/09 22:35:36 $" "SRILM Tools"
.SH NAME
multi-ngram \- build multiword N-gram models
.SH SYNOPSIS
.nf
\fBmulti-ngram\fP [ \fB\-help\fP ] \fIoption\fP ...
.fi
.SH DESCRIPTION
.B multi-ngram
builds N-gram language models that contain multiwords, i.e., compound words
that are concatenations of words from a given prior model.
It will optionally generate multiword N-grams and insert them into
an existing reference N-gram model, so as to cover the multiwords occurring
in a specified vocabulary.
It will then assign probabilities to the multiword N-grams so that word
strings containing multiwords have the same probabilities as the strings
of component words in the reference model.
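.PP
As an informal illustration (a sketch of the stated goal, not taken from
the implementation), for a multiword "a_b" following a history h this
equivalence amounts to the chain rule over the component words:
.nf
	P(a_b | h) = P(a | h) * P(b | h a)
.fi
where the probabilities on the right-hand side are those assigned by the
reference model, with histories limited by its N-gram order.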
.PP
Note that the inverse operation (expanding a multiword N-gram to contain
only regular words) is subsumed by the
.B "ngram -expand-classes"
function.
.SH OPTIONS
Each filename argument can be an ASCII file, or a
compressed file (name ending in .Z or .gz), or ``-'' to indicate
stdin/stdout.
.TP
.B \-help
Print option summary.
.TP
.B \-version
Print version information.
.TP
.BI \-order " n"
Set the maximal N-gram order to be used from the reference model.
NOTE: The order of the model is not set automatically when a model
file is read, so the same file can be used at various orders.
To use models of order higher than 3 it is always necessary to specify this
option.
.TP
.BI \-multi-order " n"
The maximal N-gram order in the multiword-based model.
.TP
.BI \-debug " level"
Set the debugging output level (0 means no debugging output).
.TP
.BI \-vocab " file"
Words to be added to the model.
In particular, this should include all the multiwords to be added.
.TP
.BI \-multi-char " C"
Character used to delimit component words in multiwords
(an underscore character by default).
.TP
.BI \-lm " file"
Reference N-gram model.
.TP
.BI \-multi-lm " file"
Model containing multiwords; the N-grams in this model will be assigned
new probabilities based on the reference model.
If this option is
.I not
given then the multiword model will be generated by adding multiword
N-grams to the reference model.
.TP
.B \-prune-unseen-ngrams
This option prevents the insertion of multiword N-grams whose component
N-grams are not contained in the reference model.
For example, for a multiword bigram "a_b c_d" to be inserted, a trigram
reference model must contain the trigrams "a b c" and "b c d".
If the reference model were a bigram LM, it would have to contain
"a b", "b c", and "c d".
This option is important to control the size of the multiword LM for
large vocabularies.
.TP
.BI \-write-lm " file"
Output location of the generated multiword model.
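.PP
As a sketch of typical usage (all file names here are hypothetical),
the following command would add the multiwords listed in a vocabulary file
to a reference model, skip multiword N-grams whose component N-grams are
missing from that model, and write the result to a new file:
.nf
	multi-ngram \-lm ref.lm \-vocab multi.vocab \-prune-unseen-ngrams \-write-lm multi.lm
.fi
To instead assign new probabilities to an existing multiword model:
.nf
	multi-ngram \-lm ref.lm \-multi-lm old-multi.lm \-write-lm new-multi.lm
.fi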
.SH "SEE ALSO"
ngram(1), ngram-format(5).
.SH BUGS
This program is a hack for cases where the original training data is
not available and a multiword model has to be generated from an existing
model.
.br
The resulting model is no longer properly normalized, since the
same word string can potentially be represented with or without multiwords.
.br
The generation of multiword N-grams uses a heuristic algorithm that
works well for bigrams and trigrams, but is not exhaustive.
.SH AUTHOR
Andreas Stolcke <stolcke@icsi.berkeley.edu>
.br
Copyright (c) 2000\-2004 SRI International