.\" $Id: ngram-class.1,v 1.10 2019/09/09 22:35:37 stolcke Exp $
.TH ngram-class 1 "$Date: 2019/09/09 22:35:37 $" "SRILM Tools"
.SH NAME
ngram-class \- induce word classes from N-gram statistics
.SH SYNOPSIS
.nf
\fBngram-class\fP [ \fB\-help\fP ] \fIoption\fP ...
.fi
.SH DESCRIPTION
.B ngram-class
induces word classes from distributional statistics,
so as to minimize perplexity of a class-based N-gram model
given the provided word N-gram counts.
Presently, only bigram statistics are used, i.e., the induced classes
are best suited for a class-bigram language model.
.PP
The program generates the class N-gram counts and class expansions
needed by
.BR ngram-count (1)
and
.BR ngram (1),
respectively to train and to apply the class N-gram model.
.SH OPTIONS
.PP
Each filename argument can be an ASCII file, or a
compressed file (name ending in .Z or .gz), or ``-'' to indicate
stdin/stdout.
.TP
.B \-help
Print option summary.
.TP
.B \-version
Print version information.
.TP
.BI \-debug " level"
Set debugging output at
.IR level .
Level 0 means no debugging.
Debugging messages are written to stderr.
A useful level to trace the formation of classes is 2.
.SS Input Options
.TP
.BI \-vocab " file"
Read a vocabulary from
.IR file .
Subsequently, out-of-vocabulary words in either counts or text are
replaced with the unknown-word token.
If this option is not specified, all words found are implicitly added
to the vocabulary.
.TP
.B \-tolower
Map the vocabulary to lowercase.
.TP
.BI \-counts " file"
Read N-gram counts from a file.
Each line contains an N-gram of
words, followed by an integer count, all separated by whitespace.
Repeated counts for the same N-gram are added.
Counts collected by
.B \-text
and
.B \-counts
are additive as well.
.br
Note that the input should contain consistent lower- and higher-order
counts (i.e., unigrams and bigrams), as would be generated by
.BR ngram-count (1).
.TP
.BI \-text " textfile"
Generate N-gram counts from text file.
.I textfile
should contain one sentence unit per line.
Begin/end sentence tokens are added if not already present.
Empty lines are ignored.
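.PP
As an illustration only, the counting behavior described above can be
sketched in Python.
This is not SRILM code: the function names
.I counts_from_text
and
.I counts_from_lines
are hypothetical, and the sketch assumes whitespace-separated tokens and
collects only unigram and bigram counts.
.nf
```python
from collections import Counter

def counts_from_text(lines, counts=None):
    """Add unigram and bigram counts from sentences, one per line."""
    counts = Counter() if counts is None else counts
    for line in lines:
        words = line.split()
        if not words:
            continue                    # empty lines are ignored
        if words[0] != "<s>":
            words.insert(0, "<s>")      # add begin-sentence token if missing
        if words[-1] != "</s>":
            words.append("</s>")        # add end-sentence token if missing
        for i, w in enumerate(words):
            counts[(w,)] += 1
            if i + 1 < len(words):
                counts[(w, words[i + 1])] += 1
    return counts

def counts_from_lines(lines, counts=None):
    """Add counts from lines of the form: word ... word count."""
    counts = Counter() if counts is None else counts
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue                    # skip blank or malformed lines
        counts[tuple(fields[:-1])] += int(fields[-1])
    return counts

# Counts from text and from count lines accumulate in the same table.
counts = counts_from_text(["the cat", "", "the dog"])
counts = counts_from_lines(["the cat 2", "the cat 1"], counts)
```
.fi
.PP
In this toy run the bigram ``the cat'' ends up with a count of 4: one
occurrence from the text plus the two repeated count lines.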
.SS Class Merging
.TP
.BI \-numclasses " C"
The target number of classes to induce.
A zero argument suppresses automatic class merging altogether
(e.g., for use with
.BR \-interact ).
.TP
.B \-full
Perform full greedy merging over all classes, starting with one class per
word.
This is the O(V^3) algorithm described in Brown et al. (1992).
.TP
.B \-incremental
Perform incremental greedy merging, starting with
one class each for the
.I C
most frequent words, and then adding one word at a time.
This is the O(V*C^2) algorithm described in Brown et al. (1992);
it is the default.
.TP
.BI \-maxwordsperclass " M"
Limit the number of words in a class to
.I M
in incremental merging.
By default there is no such limit.
.TP
.B \-interact
Enter a primitive interactive interface when done with automatic class
induction, allowing manual specification of additional merging steps.
.TP
.BI \-noclass-vocab " file"
Read a list of vocabulary items from
.I file
that are to be excluded from classes.
These words or tags do not undergo class merging, but their
N-gram counts still affect the optimization of model perplexity.
.br
The default is to exclude the sentence begin/end tags (<s> and </s>)
from class merging; this can be suppressed by specifying
.BR "\-noclass-vocab /dev/null" .
.TP
.BI \-read " file"
Read initial class memberships from
.IR file .
Class memberships need to be stored in
.BR classes-format (5)
with the additional condition that probabilities are obligatory
and that each membership definition must specify exactly one word.
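.PP
The idea behind full greedy merging can be sketched in Python as follows.
This is a deliberately naive illustration, not the SRILM implementation:
it recomputes the likelihood from scratch for every candidate merge
instead of using the incremental updates of Brown et al. (1992), and it
ignores the exclusion of <s> and </s>.
The function names and the toy counts are hypothetical.
The sketch assumes the class-bigram model
P(w2 | w1) = P(c2 | c1) * P(w2 | c2) with maximum-likelihood estimates;
maximizing its log-likelihood on the counts is equivalent to minimizing
perplexity on the same counts.
.nf
```python
import math
from collections import Counter
from itertools import combinations

def class_bigram_loglik(bigrams, word2class):
    """Log-likelihood of word bigram counts under a class-bigram model."""
    pair = Counter()    # class bigram counts N(c1, c2)
    left = Counter()    # class counts in history position
    right = Counter()   # class counts in predicted position
    word = Counter()    # word counts in predicted position
    for (w1, w2), n in bigrams.items():
        c1, c2 = word2class[w1], word2class[w2]
        pair[c1, c2] += n
        left[c1] += n
        right[c2] += n
        word[w2] += n
    total = 0.0
    for (w1, w2), n in bigrams.items():
        c1, c2 = word2class[w1], word2class[w2]
        p = (pair[c1, c2] / left[c1]) * (word[w2] / right[c2])
        total += n * math.log(p)
    return total

def greedy_merge(bigrams, num_classes):
    """Start with one class per word; repeatedly apply the best merge."""
    words = sorted({w for bg in bigrams for w in bg})
    word2class = {w: w for w in words}      # each class named after a word
    classes = set(words)
    while len(classes) > num_classes:
        best = None
        for a, b in combinations(sorted(classes), 2):
            # Candidate assignment with class b folded into class a.
            trial = {w: (a if c == b else c) for w, c in word2class.items()}
            score = class_bigram_loglik(bigrams, trial)
            if best is None or score > best[0]:
                best = (score, trial, b)
        _, word2class, removed = best
        classes.discard(removed)
    return word2class

toy_counts = {
    ("the", "cat"): 3, ("the", "dog"): 3,
    ("a", "cat"): 3, ("a", "dog"): 3,
    ("cat", "the"): 1, ("cat", "a"): 1,
    ("dog", "the"): 1, ("dog", "a"): 1,
}
result = greedy_merge(toy_counts, 2)
```
.fi
.PP
On these toy counts the two merges group ``the'' with ``a'' and ``cat''
with ``dog'', since words with identical bigram distributions can be
merged without any loss in likelihood.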
.SS Output Options
.TP
.BI \-class-counts " file"
Write class N-gram counts to
.I file
when done.
The format is the same as for word N-gram counts, and can be
read by
.BR ngram-count (1)
to estimate a class-N-gram model.
.TP
.BI \-classes " file"
Write class definitions (member words and their probabilities) to
.I file
when done.
The output format is the same as required by the
.B \-classes
option of
.BR ngram (1).
.TP
.BI \-save " S"
Save the class counts and/or class definitions every
.I S
iterations during induction.
The filenames are obtained from the
.B \-class-counts
and
.B \-classes
options, respectively, by appending the iteration number.
This is convenient for producing sets of classes at different granularities
during the same run.
The saved class memberships can also be used with the
.B \-read
option to restart class merging at a later time.
.IR S =0
(the default) suppresses the saving actions.
.TP
.BI \-save-maxclasses " K"
Modifies the action of
.B \-save
so as to only start saving once the number of classes reaches
.IR K .
(The iteration numbers embedded in filenames will start at 0 from that point.)
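.PP
For post-processing saved class definitions, a small reader might look
like the Python sketch below.
It assumes the restricted layout that
.B \-read
requires, taken here to mean one membership per line with three
whitespace-separated fields (class name, probability, word); see
.BR classes-format (5)
for the authoritative description.
The function name and the class names in the sample are hypothetical.
.nf
```python
from collections import defaultdict

def read_class_memberships(lines):
    """Map each class name to a dict of member words and probabilities."""
    classes = defaultdict(dict)
    for line in lines:
        fields = line.split()
        if not fields:
            continue                    # skip blank lines
        if len(fields) != 3:
            raise ValueError("expected CLASS PROBABILITY WORD, got: " + line)
        name, prob, word = fields
        classes[name][word] = float(prob)
    return dict(classes)

# Invented sample data in the assumed three-field layout.
sample = [
    "CLASS-A 0.75 cat",
    "CLASS-A 0.25 dog",
    "CLASS-B 1 the",
]
memberships = read_class_memberships(sample)
```
.fi
.PP
Checking that the probabilities within each class sum to one is a cheap
sanity test before passing such a file back to
.BR \-read .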
.SH "SEE ALSO"
ngram-count(1), ngram(1), classes-format(5).
.br
P. F. Brown, V. J. Della Pietra, P. V. deSouza, J. C. Lai and R. L. Mercer,
``Class-Based n-gram Models of Natural Language,''
\fIComputational Linguistics\fP 18(4), 467\-479, 1992.
.SH BUGS
Classes are optimized only for bigram models at present.
.SH AUTHOR
Andreas Stolcke <stolcke@icsi.berkeley.edu>,
Seppo Enarvi <seppo.enarvi@aalto.fi>
.br
Copyright (c) 1999\-2010 SRI International
.br
Copyright (c) 2013\-2014 Seppo Enarvi
.br
Copyright (c) 2011\-2014 Andreas Stolcke
.br
Copyright (c) 2012\-2014 Microsoft Corp.