.\" $Id: segment.1,v 1.8 2019/09/09 22:35:37 stolcke Exp $ .TH segment 1 "$Date: 2019/09/09 22:35:37 $" "SRILM Tools" .SH NAME segment \- segment text using N-gram language model .SH SYNOPSIS .nf \fBsegment\fP [ \fB\-help\fP ] \fIoption\fP ... .fi .SH DESCRIPTION .B segment infers a most likely segmentation (location of segment boundaries) from a text, based on a segment language model. The language model is a standard backoff N-gram model in ARPA .BR ngram-format (5), modeling segmentation using the boundary tags and . The program reads in a word sequence, finds the most likely locations of segment boundaries according to the language model, and outputs the word sequence with segment boundaries marked by tags. .SH OPTIONS .PP Each filename argument can be an ASCII file, or a compressed file (name ending in .Z or .gz), or ``-'' to indicate stdin/stdout. .TP .B \-help Print option summary. .TP .B \-version Print version information. .TP .BI \-order " n" Set the maximal N-gram order to be used, by default 3. NOTE: The order of the model is not set automatically when a model file is read, so the same file can be used at various orders. .TP .BI \-debug " level" Set the debugging output level (0 means no debugging output). Debugging messages are sent to stderr. .TP .BI \-lm " file" Read the N-gram model from .IR file . .TP .BI \-text " file" Find the text to be segmented in .IR file . Default input is stdin. .TP .B \-continuous Process all words in the input as one sequence of words, irrespective of line breaks. Normally each line is processed separately as a word sequence. .TP .B \-posteriors Use a forward-backward algorithm to compute the posterior probabilities of a segment boundary at each word transition, and hypothesize a boundary whenever the probability exceeds 0.5. By default a Viterbi algorithm is used that computes the globally most likely segmentation. .br If .B \-continuous is specified as well, then this option will produce one line of output per word, containing, respectively, the tag (if appropriate), the word itself, and the posterior probability for a boundary preceding the word. .TP .B \-unk Output the unknown word token for each input word not in the language model vocabulary. The default is to output the input word unchanged. .TP .BI \-stag " string" Use .I string to mark segment boundaries in the output. Default is the start-of-sentence symbol defined in the language model (). .TP .BI \-bias " b" Make a segment boundary a priori more likely by a factor of .IR b . This allows balancing of false detection/rejection errors. The default is 1. .SH "SEE ALSO" ngram-count(1), ngram-format(5). .br A. Stolcke and E. Shriberg, ``Automatic Linguistic Segmentation of Spontaneous Speech,'' \fIProc. ICSLP\fP, 1005\-1008, 1996. .SH BUGS Only N-grams models up to trigram order are used accurately. For higher-order models use the more general .BR hidden-ngram (1). Andreas Stolcke .br Copyright (c) 1997\-2004 SRI International