<! $Id: segment.1,v 1.8 2019/09/09 22:35:37 stolcke Exp $>
<HTML>
<HEADER>
<TITLE>segment</TITLE>
<BODY>
<H1>segment</H1>
<H2> NAME </H2>
segment - segment text using N-gram language model
<H2> SYNOPSIS </H2>
<PRE>
<B>segment</B> [ <B>-help</B> ] <I>option</I> ...
</PRE>
<H2> DESCRIPTION </H2>
<B> segment </B>
infers the most likely segmentation (location of segment boundaries)
of a text, based on a segment language model.
The language model is a standard backoff N-gram model in ARPA
<A HREF="ngram-format.5.html">ngram-format(5)</A>,
modeling segmentation using the boundary tags &lt;s&gt; and &lt;/s&gt;.
The program reads in a word sequence, finds the most likely locations
of segment boundaries according to the language model, and
outputs the word sequence with segment boundaries marked by &lt;s&gt; tags.
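<P>
The search described above can be illustrated with a minimal Python sketch.
It is not the program's actual implementation: the toy probability table,
the floor probability, the simplified context lookup (longest matching
context, no backoff weights), and the names
<B>logp</B> and <B>viterbi_segment</B>
are all assumptions made for illustration.
The strings BOS and EOS stand in for the model's &lt;s&gt; and &lt;/s&gt; tags.
Each word carries one hidden flag (boundary before it or not), which is
enough to determine a trigram history.
<PRE>
```python
import math

# Placeholder tokens standing in for the model's start/end boundary tags.
BOS, EOS = "BOS", "EOS"
FLOOR = 1e-4  # assumed probability for events missing from the toy table

# Hypothetical toy model: maps (context, word) to a probability.
TABLE = {
    ((BOS,), "hi"): 0.5, (("hi",), "there"): 0.5, (("there",), EOS): 0.5,
    ((BOS,), "how"): 0.5, (("how",), "are"): 0.5, (("are",), "you"): 0.5,
    (("you",), EOS): 0.5,
}


def logp(word, context):
    """Log probability from the longest context suffix found in TABLE.

    A crude stand-in for backoff: no backoff weights are applied.
    """
    for k in range(len(context), -1, -1):
        key = (tuple(context[len(context) - k:]), word)
        if key in TABLE:
            return math.log(TABLE[key])
    return math.log(FLOOR)


def viterbi_segment(words):
    """Return the indices of words that start a segment (index 0 always does)."""
    n = len(words)
    if n == 0:
        return []
    neg = float("-inf")
    # best[i][b]: best log score of words[0..i]; b=1 means a boundary precedes words[i].
    best = [[neg, neg] for _ in range(n)]
    back = [[0, 0] for _ in range(n)]
    best[0][1] = logp(words[0], (BOS,))
    for i in range(1, n):
        for pb in (0, 1):
            if best[i - 1][pb] == neg:
                continue
            # Trigram history depends only on the previous word's flag.
            ctx = (BOS if pb else words[i - 2], words[i - 1])
            cand = (
                logp(words[i], ctx),                      # no boundary
                logp(EOS, ctx) + logp(words[i], (BOS,)),  # boundary
            )
            for b in (0, 1):
                score = best[i - 1][pb] + cand[b]
                if score > best[i][b]:
                    best[i][b] = score
                    back[i][b] = pb
    # Close the final segment with the end tag.
    final = [
        best[n - 1][b] + logp(EOS, (BOS if b else words[n - 2], words[n - 1]))
        for b in (0, 1)
    ]
    b = 0 if final[0] > final[1] else 1
    starts = []
    for i in range(n - 1, -1, -1):
        if b:
            starts.append(i)
        b = back[i][b]
    return starts[::-1]


print(viterbi_segment("hi there how are you".split()))
```
</PRE>
With this toy table the sketch places segment starts at word positions 0 and 2,
that is, before "hi" and before "how".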
<H2> OPTIONS </H2>
<P>
Each filename argument can be an ASCII file, a
compressed file (name ending in .Z or .gz), or "-" to indicate
stdin/stdout.
<DL>
<DT><B> -help </B>
<DD>
Print option summary.
<DT><B> -version </B>
<DD>
Print version information.
<DT><B>-order</B><I> n</I>
<DD>
Set the maximal N-gram order to be used (default: 3).
NOTE: The order of the model is not set automatically when a model
file is read, so the same file can be used at various orders.
<DT><B>-debug</B><I> level</I>
<DD>
Set the debugging output level (0 means no debugging output).
Debugging messages are sent to stderr.
<DT><B>-lm</B><I> file</I>
<DD>
Read the N-gram model from
<I>file</I>.
<DT><B>-text</B><I> file</I>
<DD>
Read the text to be segmented from
<I>file</I>.
Default input is stdin.
<DT><B> -continuous </B>
<DD>
Process all words in the input as one sequence of words, irrespective of
line breaks.
By default each line is processed separately as a word sequence.
<DT><B> -posteriors </B>
<DD>
Use a forward-backward algorithm to compute the posterior probability
of a segment boundary at each word transition, and hypothesize a boundary
whenever the probability exceeds 0.5.
By default a Viterbi algorithm is used that computes
the globally most likely segmentation.
<BR>
If
<B> -continuous </B>
is specified as well,
this option produces one line of output per word, containing,
in order, the &lt;s&gt; tag (if appropriate), the word itself, and the
posterior probability of a boundary preceding the word.
<DT><B> -unk </B>
<DD>
Output the unknown word token &lt;unk&gt; for each input word not in the
language model vocabulary.
The default is to output the input word unchanged.
<DT><B>-stag</B><I> string</I>
<DD>
Use
<I> string </I>
to mark segment boundaries in the output.
Default is the start-of-sentence symbol defined in the language model (&lt;s&gt;).
<DT><B>-bias</B><I> b</I>
<DD>
Make a segment boundary a priori more likely by a factor of
<I>b</I>.
This allows false detections to be traded off against false rejections.
The default is 1.
</DD>
</DL>
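<P>
The effect of the
<B> -posteriors </B>
and
<B> -bias </B>
options described above can be sketched in Python as follows.
This is a toy illustration under stated assumptions, not the program's code:
the probability table, the floor value, the context lookup without backoff
weights, and the names
<B>boundary_posteriors</B> and <B>hypothesize</B>
are hypothetical, and the way the bias enters (as a multiplier on every
boundary hypothesis) is one plausible reading of the option text.
BOS and EOS stand in for the &lt;s&gt; and &lt;/s&gt; tags.
Plain probabilities are used for brevity; a practical implementation would
work in the log domain.
<PRE>
```python
# Placeholder tokens standing in for the model's start/end boundary tags.
BOS, EOS = "BOS", "EOS"
FLOOR = 1e-4  # assumed probability for events missing from the toy table

# Hypothetical toy model: maps (context, word) to a probability.
TABLE = {
    ((BOS,), "hi"): 0.5, (("hi",), "there"): 0.5, (("there",), EOS): 0.5,
    ((BOS,), "how"): 0.5, (("how",), "are"): 0.5, (("are",), "you"): 0.5,
    (("you",), EOS): 0.5,
}


def prob(word, context):
    """Probability from the longest context suffix in TABLE (no backoff weights)."""
    for k in range(len(context), -1, -1):
        key = (tuple(context[len(context) - k:]), word)
        if key in TABLE:
            return TABLE[key]
    return FLOOR


def boundary_posteriors(words, bias=1.0):
    """Posterior probability of a boundary before each word (forward-backward)."""
    n = len(words)
    if n == 0:
        return []

    def ctx(i, b):
        # Two-token history after words[i], given its boundary flag b.
        # ctx(0, 0) is never used with nonzero weight: word 0 always has b=1.
        return (BOS if b else words[i - 1], words[i])

    def step(i, pb, b):
        # Weight of moving from words[i-1] (flag pb) to words[i] (flag b).
        c = ctx(i - 1, pb)
        if b:
            return bias * prob(EOS, c) * prob(words[i], (BOS,))
        return prob(words[i], c)

    alpha = [[0.0, 0.0] for _ in range(n)]
    beta = [[0.0, 0.0] for _ in range(n)]
    alpha[0][1] = prob(words[0], (BOS,))
    for i in range(1, n):
        for b in (0, 1):
            alpha[i][b] = sum(alpha[i - 1][pb] * step(i, pb, b) for pb in (0, 1))
    for b in (0, 1):
        beta[n - 1][b] = prob(EOS, ctx(n - 1, b))
    for i in range(n - 2, -1, -1):
        for b in (0, 1):
            beta[i][b] = sum(step(i + 1, b, nb) * beta[i + 1][nb] for nb in (0, 1))

    posteriors = []
    for i in range(n):
        on = alpha[i][1] * beta[i][1]
        off = alpha[i][0] * beta[i][0]
        posteriors.append(on / (on + off))
    return posteriors


def hypothesize(words, bias=1.0, threshold=0.5):
    """Indices of words whose boundary posterior exceeds the threshold."""
    post = boundary_posteriors(words, bias)
    return [i for i, p in enumerate(post) if p > threshold]


words = "hi there how are you".split()
print(hypothesize(words))             # default bias
print(hypothesize(words, bias=1e-6))  # boundaries strongly penalized
```
</PRE>
With the toy table, the default bias yields boundaries before "hi" and "how",
while a very small bias suppresses the interior boundary and leaves only the
trivial one at the start of the sequence.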
<H2> SEE ALSO </H2>
<A HREF="ngram-count.1.html">ngram-count(1)</A>, <A HREF="ngram-format.5.html">ngram-format(5)</A>.
<BR>
A. Stolcke and E. Shriberg, "Automatic Linguistic Segmentation of
Spontaneous Speech," <I>Proc. ICSLP</I>, 1005-1008, 1996.
<H2> BUGS </H2>
Only N-gram models up to trigram order are used accurately.
For higher-order models use the more general
<A HREF="hidden-ngram.1.html">hidden-ngram(1)</A>.
<H2> AUTHOR </H2>
Andreas Stolcke &lt;stolcke@icsi.berkeley.edu&gt;
<BR>
Copyright (c) 1997-2004 SRI International
</BODY>
</HTML>