.\" $Id: segment.1,v 1.8 2019/09/09 22:35:37 stolcke Exp $
.TH segment 1 "$Date: 2019/09/09 22:35:37 $" "SRILM Tools"
.SH NAME
segment \- segment text using N-gram language model
.SH SYNOPSIS
.nf
\fBsegment\fP [ \fB\-help\fP ] \fIoption\fP ...
.fi
.SH DESCRIPTION
.B segment
infers a most likely segmentation (location of segment boundaries)
from a text, based on a segment language model.
The language model is a standard backoff N-gram model in ARPA
.BR ngram-format (5),
modeling segmentation using the boundary tags <s> and </s>.
The program reads in a word sequence, finds the most likely locations
of segment boundaries according to the language model, and
outputs the word sequence with segment boundaries marked by <s> tags.
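.PP
As an illustration, a basic invocation might look like the following sketch.
The file names
.I seg.lm
and
.I input.txt
are hypothetical, and
.I seg.lm
is assumed to be an ARPA-format model estimated from text in which
segments are delimited by <s> and </s>.
.nf
    segment \-lm seg.lm \-text input.txt
.fi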
.SH OPTIONS
.PP
Each filename argument can be an ASCII file, or a
compressed file (name ending in .Z or .gz), or ``-'' to indicate
stdin/stdout.
.TP
.B \-help
Print option summary.
.TP
.B \-version
Print version information.
.TP
.BI \-order " n"
Set the maximal N-gram order to be used, by default 3.
NOTE: The order of the model is not set automatically when a model
file is read, so the same file can be used at various orders.
.TP
.BI \-debug " level"
Set the debugging output level (0 means no debugging output).
Debugging messages are sent to stderr.
.TP
.BI \-lm " file"
Read the N-gram model from
.IR file .
.TP
.BI \-text " file"
Find the text to be segmented in
.IR file .
Default input is stdin.
.TP
.B \-continuous
Process all words in the input as one sequence of words, irrespective of
line breaks.
Normally each line is processed separately as a word sequence.
.TP
.B \-posteriors
Use a forward-backward algorithm to compute the posterior probabilities
of a segment boundary at each word transition, and hypothesize a boundary
whenever the probability exceeds 0.5.
By default a Viterbi algorithm is used that computes
the globally most likely segmentation.
.br
If
.B \-continuous
is specified as well,
then this option will produce one line of output per word, containing,
respectively, the <s> tag (if appropriate), the word itself, and the
posterior probability for a boundary preceding the word.
.TP
.B \-unk
Output the unknown word token <unk> for each input word not in the
language model vocabulary.
The default is to output the input word unchanged.
.TP
.BI \-stag " string"
Use
.I string
to mark segment boundaries in the output.
Default is the start-of-sentence symbol defined in the language model (<s>).
.TP
.BI \-bias " b"
Make a segment boundary a priori more likely by a factor of
.IR b .
This allows balancing of false detection/rejection errors.
The default is 1.
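.SH EXAMPLES
.PP
The following sketch combines several of the options above; the file names
.I seg.lm
and
.I input.txt
are hypothetical.
It treats the whole input as one word sequence, reports boundary posteriors
instead of the Viterbi segmentation, and makes boundaries a priori twice
as likely:
.nf
    segment \-lm seg.lm \-text input.txt \-continuous \-posteriors \-bias 2
.fi
Each output line then holds the <s> tag (where a boundary is hypothesized),
the word, and the posterior probability of a boundary preceding that word.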
.SH "SEE ALSO"
ngram-count(1), ngram-format(5).
.br
A. Stolcke and E. Shriberg, ``Automatic Linguistic Segmentation of
Spontaneous Speech,'' \fIProc. ICSLP\fP, 1005\-1008, 1996.
.SH BUGS
Only N-gram models up to trigram order are used accurately.
For higher-order models use the more general
.BR hidden-ngram (1).
.SH AUTHOR
Andreas Stolcke <stolcke@icsi.berkeley.edu>
.br
Copyright (c) 1997\-2004 SRI International