.\" $Id: segment.1,v 1.8 2019/09/09 22:35:37 stolcke Exp $
.TH segment 1 "$Date: 2019/09/09 22:35:37 $" "SRILM Tools"
.SH NAME
segment \- segment text using N-gram language model
.SH SYNOPSIS
.nf
\fBsegment\fP [ \fB\-help\fP ] \fIoption\fP ...
.fi
.SH DESCRIPTION
.B segment
infers a most likely segmentation (location of segment boundaries)
from a text, based on a segment language model.
The language model is a standard backoff N-gram model in ARPA
.BR ngram-format (5),
modeling segmentation using the boundary tags <s> and </s>.
The program reads in a word sequence, finds the most likely locations 
of segment boundaries according to the language model, and 
outputs the word sequence with segment boundaries marked by <s> tags.
.SH OPTIONS
.PP
Each filename argument can be an ASCII file, or a 
compressed file (name ending in .Z or .gz), or ``-'' to indicate
stdin/stdout.
.TP
.B \-help
Print option summary.
.TP
.B \-version
Print version information.
.TP
.BI \-order " n"
Set the maximal N-gram order to be used (default 3).
NOTE: The order of the model is not set automatically when a model
file is read, so the same file can be used at various orders.
.TP
.BI \-debug " level"
Set the debugging output level (0 means no debugging output).
Debugging messages are sent to stderr.
.TP
.BI \-lm " file"
Read the N-gram model from
.IR file .
.TP
.BI \-text " file"
Find the text to be segmented in 
.IR file .
Default input is stdin.
.TP
.B \-continuous
Process all words in the input as one sequence of words, irrespective of
line breaks.
Normally each line is processed separately as a word sequence.
.TP
.B \-posteriors
Use a forward-backward algorithm to compute the posterior probabilities
of a segment boundary at each word transition, and hypothesize a boundary
whenever the probability exceeds 0.5.
By default a Viterbi algorithm is used that computes
the globally most likely segmentation.
.br
If
.B \-continuous 
is specified as well,
then this option will produce one line of output per word, containing,
respectively, the <s> tag (if appropriate), the word itself, and the 
posterior probability for a boundary preceding the word.
.TP
.B \-unk
Output the unknown word token <unk> for each input word not in the 
language model vocabulary.
The default is to output the input word unchanged.
.TP
.BI \-stag " string"
Use
.I string
to mark segment boundaries in the output.
Default is the start-of-sentence symbol defined in the language model (<s>).
.TP
.BI \-bias " b"
Make a segment boundary a priori more likely by a factor of
.IR b .
This allows balancing of false detection/rejection errors.
The default is 1.
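.SH EXAMPLES
The invocations below are illustrative sketches, not tested recipes.
The file names
.I seg.lm
and
.I words.txt
are hypothetical placeholders; only options documented above are used.
.PP
Segment each input line separately, using the default Viterbi search:
.PP
.nf
segment \-lm seg.lm \-text words.txt
.fi
.PP
Treat the whole input as one word sequence, decide boundaries from
posterior probabilities, and make boundaries a priori twice as likely:
.PP
.nf
segment \-lm seg.lm \-text words.txt \-continuous \-posteriors \-bias 2
.fi
.PP
Read a compressed bigram-order model and take the text from stdin,
assuming the model file name ends in .gz as described under OPTIONS:
.PP
.nf
segment \-lm seg.lm.gz \-order 2 \-text \- < words.txt
.fi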
.SH "SEE ALSO"
ngram-count(1), ngram-format(5).
.br
A. Stolcke and E. Shriberg, ``Automatic Linguistic Segmentation of
Spontaneous Speech,'' \fIProc. ICSLP\fP, 1005\-1008, 1996.
.SH BUGS
Only N-gram models up to trigram order are used accurately.
For higher-order models, use the more general
.BR hidden-ngram (1).
.SH AUTHOR
Andreas Stolcke <stolcke@icsi.berkeley.edu>
.br
Copyright (c) 1997\-2004 SRI International