competition update
12
language_model/srilm-1.7.3/lm/test/doc/FAQ
Normal file
@@ -0,0 +1,12 @@

This file has been superseded by the "srilm-faq" manual page.

View with

	man -M$SRILM/man srilm-faq

or online at

	http://www.speech.sri.com/projects/srilm/manpages/srilm-faq.html

Please send corrections and additions to the author.
21
language_model/srilm-1.7.3/lm/test/doc/FSM
Normal file
@@ -0,0 +1,21 @@

Conversion between Decipher PFSG (probabilistic finite state grammars) and
AT&T FSA (finite-state acceptor) formats

1. Convert PFSG to FSA format

	pfsg-to-fsm symbolfile=foo.syms foo.pfsg > foo.fsm

2. Compile FSA

	fsmcompile foo.fsm > foo.fsmc

3. Operate on FSA, e.g., determinize and minimize

	fsmdeterminize foo.fsmc | fsmminimize > foo.min.fsmc

4. Print and convert back to PFSG

	fsmprint -i foo.syms foo.min.fsmc | \
	fsm-to-pfsg > foo.min.pfsg
42
language_model/srilm-1.7.3/lm/test/doc/README.linux
Normal file
@@ -0,0 +1,42 @@

The default Linux Makefile (common/Makefile.machine.i686) uses /usr/bin/gawk
as the gawk interpreter.  To ensure correct operation in UTF-8 locales, run
/usr/bin/gawk --version and make sure it is at least version 3.1.5.
If not, it is recommended to install the latest version of gawk in
/usr/local/bin and update the GAWK variable accordingly.

-----------------------------------------------------------------------------
From: Alex Franz <alex@google.com>
Date: Fri, 06 Oct 2000 16:14:03 PDT

And here are the details of the malloc problem that I had with the SRI
LM toolkit:

I compiled it with gcc under Redhat Linux V. 6.2 (or thereabouts).
The malloc routine has problems allocating large numbers of
small pieces of memory.  For me, it usually refuses to allocate
any more memory once it has allocated about 600 MBytes of memory,
even though the machine has 2 GBytes of real memory.

This causes a problem when you are trying to build language models
with large vocabularies.  Even though I used -DUSE_SARRAY_TRIE -DUSE_SARRAY
to use arrays instead of hash tables, it ran out of memory when I was
trying to use very large vocabulary sizes.

The solution that worked for me was to use Wolfram Gloger's ptmalloc package
for memory management instead.  You can download it from

	http://malloc.de/en/index.html

(The page suggests that it is part of the GNU C library, but I had to
compile it myself and explicitly link it with the executables.)

One more thing you can do is call the function

	mallopt(M_MMAP_MAX, n);

with a sufficiently large n; this tells malloc to allow you to
obtain a large amount of memory.

------------------------------------------------------------------------------
50
language_model/srilm-1.7.3/lm/test/doc/README.windows-cygwin
Normal file
@@ -0,0 +1,50 @@

SRILM version 1.3 and higher has been successfully built and tested using
the CYGWIN environment (http://www.cygwin.com) as of Feb 11, 2002.
The test used CYGWIN DLL 1.3.9 and gcc 2.95.3 (but note the warning
in doc/README.x86 regarding a compiler bug), and ran successfully
on a Windows 98 and a Windows 2000 Professional system.

The following special measures were taken to ensure a successful build:

- Make sure the make, gcc, binutils, libiconv, gzip, tcltk, and gawk packages
  are installed with CYGWIN.  To run the tests you will also need the
  diffutils and time packages.

After installation, set your bash environment as follows

	export SRILM=/cygdrive/c/srilm13	# or similar
		# do NOT use backslash in path names SRILM=C:\...
	export MACHINE_TYPE=cygwin
	export PATH=$PATH:$SRILM/bin:$SRILM/bin/cygwin	# mentioned in INSTALL
	export MANPATH=$MANPATH:$SRILM/man		# mentioned in INSTALL

or the equivalent for other shells.

As of version 1.4.5, SRILM can also be built in the MinGW environment
(http://www.mingw.org).  For this the default (cygwin) has to be overridden
using

	make MACHINE_TYPE=win32

For 64-bit binaries use

	make MACHINE_TYPE=win64

Of course the corresponding versions of the MinGW development environment
(C, C++, binutils) have to be installed in Cygwin.  Make sure the Cygwin
installation is generally up-to-date.

It may be necessary to include the following directories in the PATH
environment variable for the runtime dynamic libraries to be found:

	/usr/i686-pc-mingw32/sys-root/mingw/bin		(win32)
	/usr/x86_64-w64-mingw32/sys-root/mingw/bin	(win64)

Some functionality is not supported under MinGW:

	- compressed file I/O
	- nbest-optimize and lattice-tool -max-time option

A. Stolcke
$Id: README.windows-cygwin,v 1.10 2013/01/31 18:03:33 stolcke Exp $
63
language_model/srilm-1.7.3/lm/test/doc/README.windows-msvc
Normal file
@@ -0,0 +1,63 @@

Recommendations for compiling with Microsoft Visual C++

The build procedure has been tested with the freely available
Visual C++ 8 that can be downloaded from www.microsoft.com as
"Visual C++ 2005 Express Edition".

0) Install the cygwin environment, as described in README.windows-cygwin .
   Cygwin tools are needed to run the build process and generate program
   dependencies.

1) SRILM can be set to the cygwin path of the SRILM root directory
   (e.g., /home/username/srilm)

2) Make sure environment variables are set to locate MSVC tools and files:

	PATH	should include MSVC_INSTALL_DIR/bin and
		MSVC_INSTALL_DIR/Common7/IDE (for dll search)
	MSVCDIR	should be set to MSVC_INSTALL_DIR
	INCLUDE	should be set to MSVC_INSTALL_DIR/include
	LIB	should be set to MSVC_INSTALL_DIR/lib

   Note: PATH needs to use cygwin pathname conventions, but MSVCDIR,
   INCLUDE and LIB must use Windows paths.  For example:

	PATH="/cygdrive/c/Program Files/Microsoft Visual Studio 8/VC/bin:/cygdrive/c/Program Files/Microsoft Visual Studio 8/Common7/IDE:$PATH"
	MSVCDIR="c:\\Program Files\\Microsoft Visual Studio 8\\VC"
	INCLUDE="$MSVCDIR\\include"
	LIB="$MSVCDIR\\lib"
	export PATH MSVCDIR INCLUDE LIB

   could be used in bash given the default installation location of Visual
   C++ 2005 Express Edition under c:\Program Files\Microsoft Visual Studio 8.

   Alternatively, you could use the vcvars32.bat script that comes with
   MSVC to set these environment variables.

3) Build in a cygwin shell with

	make MACHINE_TYPE=msvc

   or

	make MACHINE_TYPE=msvc64

   to generate 64-bit binaries.

   As with MinGW, some functionality is not supported:

	- compressed file I/O other than gzip files
	- nbest-optimize and lattice-tool -max-time option

   Also note that make will try to determine if certain libraries
   are installed on your system and, if so, enable the /openmp option.
   This means that binaries built with the full Visual Studio compiler
   might not run on systems that have only Visual Studio Express.
   To avoid this, disable /openmp by commenting out the corresponding
   line containing "/openmp" in common/Makefile.machine.msvc.

4) Run the test suite with

	cd test
	make MACHINE_TYPE=msvc try
71
language_model/srilm-1.7.3/lm/test/doc/README.windows-msvc-visual-studio
Executable file
@@ -0,0 +1,71 @@

Recommendations for compiling with Microsoft Visual Studio 2005.

Microsoft Visual Studio projects are available for Microsoft Visual
Studio 2005.

The advantages of building directly in Visual Studio are:
- Build is faster.
- Integrated debugging.
- Library projects can be included as dependencies of your own projects.

The disadvantages of building directly in Visual Studio are:
- Not all parts of the SRILM build process can be completed inside Visual Studio.
- More cumbersome to build from the command line.

The build procedure has been tested with Visual Studio 2005
Professional.

0) Open SRILM/visual_studio/vs2005/srilm.sln in Visual Studio 2005.

1) Select either the "Release" or "Debug" target.

2) Build the solution.  The libraries will be built into
   SRILM/lib/Release or SRILM/lib/Debug, and the executables will be built
   into SRILM/bin/Release or SRILM/bin/Debug.

This procedure will NOT produce the central include directory, release
the SRILM scripts, or build the test environment.

To produce the central include directory or release the SRILM scripts,
you will have to install the cygwin environment and use the included
Makefiles as described in (3):

(3) Follow steps (0) and (1) from README.windows-msvc:

(3a) Install the cygwin environment, as described in README.windows-cygwin.
     Cygwin tools are needed to run the build process and generate program
     dependencies.

(3b) SRILM can be set to the cygwin path of the SRILM root directory
     (e.g., /home/username/srilm, /cygdrive/c/SRILM)

(3c) To produce the central include directory and release the SRILM
     scripts, do either:

	% make MACHINE_TYPE=msvc dirs
	% make MACHINE_TYPE=msvc init
	% make MACHINE_TYPE=msvc release-headers
	% make MACHINE_TYPE=msvc release-scripts
	% cd utils/src; make MACHINE_TYPE=msvc release

     or

	% make MACHINE_TYPE=msvc msvc

If you want to run the SRILM tests, you will have to follow the full
instructions in README.windows-msvc.

If you want to compile using a different version of MSVC, I suggest
creating a new subdirectory of SRILM/visual_studio, e.g.,
"SRILM/visual_studio/vs2010", and copying the contents of the "vs2005"
directory into your new directory.  Then use MSVC to update your
projects.  If you need to downgrade to MSVC 2003, it is possible to do
this by manually editing the project files.

I would like to thank Keith Vertanen (http://www.keithv.com) for
contributing the original versions of these Microsoft Visual Studio
projects to SRILM.

Victor Abrash
victor@speech.sri.com
21
language_model/srilm-1.7.3/lm/test/doc/README.x86
Normal file
@@ -0,0 +1,21 @@

Note for Intel x86 platforms:

The function KneserNey::lowerOrderWeight() seems to trigger a compiler
bug in gcc 2.95.3 with optimization, on the i686-pc-linux-gnu,
i386-pc-solaris2, and i686-pc-cygwin platforms (and therefore probably
on all Intel targets).  The problem manifests itself by the
"ngram-count-kn-int" test in the test/ directory not terminating.

To work around this problem, compile lm/src/Discount.cc without global
optimization:

	cd $SRILM/lm/src
	rm ../obj/$MACHINE_TYPE/Discount.o
	gnumake OPTIMIZE_FLAGS=-O1 ngram-count
	gnumake release

As of Feb 2002, Cygwin ships with gcc 2.95.3 and therefore suffers from
this bug.  gcc 2.95.2 or lower and gcc 3.x versions of the compiler
don't seem to be affected, though.
9
language_model/srilm-1.7.3/lm/test/doc/TODO
Normal file
@@ -0,0 +1,9 @@

compute local entropies, stddev in perplexity output

implement TaggedNgram backoff-weight computation

fix some cases of back-to-back DFs
	FP-REP2

general mechanism to skip words in ngram contexts, handle non-event tags.
BIN
language_model/srilm-1.7.3/lm/test/doc/asru2011-srilm.pdf
Normal file
Binary file not shown.
47
language_model/srilm-1.7.3/lm/test/doc/c++porting-notes
Normal file
@@ -0,0 +1,47 @@

C++ porting notes
-----------------

I originally wrote this code using gcc/g++ as the compiler.

Below is a list of changes I had to make to accommodate SGI, Sun and Intel C++
compilers.

o  Explicitly instantiate templates only in g++ (#ifdef __GNUG__).

o  Avoid global static data referenced by template code.  These symbols
   become undefined in the instantiated template functions.
   I use local static data, or global extern data instead (the latter
   #ifndef __GNUG__).

o  Avoid non-static inline functions in templates (.cc files).  They
   don't get properly instantiated by Sun CC and result in undefined
   symbols.

o  Use the Sun CC -xar option to build libraries.  This includes needed
   template instances in the library (see the new $(ARCHIVE) macro to
   refer to the appropriate archive command).

o  To work around an internal compiler error in gcc 2.7.2, I had to
   add empty destructors in a few classes.  However, I no longer use gcc 2.7.2
   in testing, so current versions are no longer guaranteed to work with it.

o  Intel C++ doesn't support arrays of non-POD objects on the stack.  I had
   to replace those with matching calls to new/delete.

Compilers that work
-------------------

(as of last checking, meaning current versions may not work out of the box)

+ gcc/g++ 2.8.1, 2.95.3, 3.2 and 3.3
+ Sun SC4.0
+ SGI NCC
+ Intel C++ 7.1

Compilers that don't work on this code
--------------------------------------

- Sun SC3.0.1
- CenterLine (Sun version, as of 6/96)
184
language_model/srilm-1.7.3/lm/test/doc/lm-intro
Normal file
@@ -0,0 +1,184 @@

SRI Object-oriented Language Modeling Libraries and Tools

BUILD AND INSTALL

See the INSTALL file in the top-level directory.

FILES

All mixed-case .cc and .h files contain C++ code for the liboolm library.
All lower-case .cc files in this directory correspond to executable commands.
Each program is option driven.  The -help option prints a list of
available options and their meaning.

N-GRAM MODELS

Here we only cover arbitrary-order n-grams (no classes yet, sorry).

	ngram-count	manipulates n-gram counts and
			estimates backoff models
	ngram-merge	merges count files (only needed for large
			corpora/vocabularies)
	ngram		computes probabilities given a backoff model
			(also does mixing of backoff models)

Below are some typical command lines.

NGRAM COUNT MANIPULATION

	ngram-count -order 4 -text corpus -write corpus.ngrams

Counts all n-grams up to length 4 in corpus and writes them to
corpus.ngrams.  The default n-gram order (if not specified) is 3.

	ngram-count -vocab 30k.vocab -order 4 -text corpus -write corpus.ngrams

Same, but restricts the vocabulary to what is listed in 30k.vocab
(one word per line).  Other words are replaced with the <unk> token.

	ngram-count -read corpus.ngrams -vocab 20k.vocab -write-order 3 \
		-write corpus.20k.3grams

Reads the counts from corpus.ngrams, maps them to the indicated
vocabulary, and writes only the trigrams back to a file.

	ngram-count -text corpus -write1 corpus.1grams -write2 corpus.2grams \
		-write3 corpus.3grams

writes unigrams, bigrams, and trigrams to separate files, in one
pass.  Usually there is no reason to keep these separate, which is
why by default ngrams of all lengths are written out together.
The -recompute flag will regenerate the lower-order counts from the
highest-order ones by summation over prefixes.

The -read and -text options are additive, and can be used to merge
new counts with old.  Furthermore, repeated n-grams read with -read
are also additive.  Thus,

	cat counts1 counts2 ... | ngram-count -read -

will merge the counts from counts1, counts2, ...

All file reading and writing uses the zio routines, so argument "-"
stands for stdin/stdout, and .Z and .gz files are handled correctly.

For very large count files (due to a large corpus/vocabulary) this
method of merging counts in memory is not suitable.  Alternatively,
counts can be sorted and then merged.
First, generate counts for portions of the training corpus and
save them with -sort:

	ngram-count -text part1.text -sort -write part1.ngrams.Z
	ngram-count -text part2.text -sort -write part2.ngrams.Z

Then combine these with

	ngram-merge part?.ngrams.Z > all.ngrams.Z

(Although ngram-merge can deal with any number of files >= 2,
it is most efficient to combine two parts at a time, then the
resulting new count files, again two at a time, following
a binary tree merging scheme.)
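The additive merge of sorted count files can be sketched outside the toolkit. The helper below is hypothetical (not part of SRILM) and assumes the plain textual count format of one tab-separated "ngram<TAB>count" entry per line, sorted by n-gram:

```python
import heapq

def merge_counts(path_a, path_b, out_path):
    """Merge two sorted n-gram count files, summing counts of identical n-grams."""
    def entries(path):
        with open(path) as f:
            for line in f:
                ngram, count = line.rstrip("\n").rsplit("\t", 1)
                yield ngram, int(count)

    with open(out_path, "w") as out:
        prev, total = None, 0
        # heapq.merge keeps the combined stream sorted without loading it all
        for ngram, count in heapq.merge(entries(path_a), entries(path_b)):
            if ngram == prev:
                total += count          # repeated n-grams are additive
            else:
                if prev is not None:
                    out.write(f"{prev}\t{total}\n")
                prev, total = ngram, count
        if prev is not None:
            out.write(f"{prev}\t{total}\n")
```

A binary-tree merge of many parts would apply this pairwise: combine files two at a time, then the results two at a time, and so on.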
BACKOFF MODEL ESTIMATION

	ngram-count -order 2 -read corpus.counts -lm corpus.bo

generates a bigram backoff model in ARPA (aka Doug Paul) format
from corpus.counts and writes it to corpus.bo.

If counts fit into memory (and hence there is no reason for the
merging schemes described above), it is more convenient and faster
to go directly from training text to model:

	ngram-count -text corpus -lm corpus.bo

The built-in discounting method used in building backoff models
is Good-Turing.  The lower exclusion cutoffs can be set with
options (-gt1min ... -gt6min); the upper discounting cutoffs are
selected with -gt1max ... -gt6max.  Reasonable defaults are
provided that can be displayed as part of the -help output.

When using limited vocabularies it is recommended to compute the
discount coefficients on the unlimited vocabulary (at least for
the unigrams) and then apply them to the limited vocabulary
(otherwise the vocabulary truncation would produce badly skewed
count frequencies at the low end that would break the GT algorithm).
For this reason, discounting parameters can be saved to files and
read back in.
For example,

	ngram-count -text corpus -gt1 gt1.params -gt2 gt2.params \
		-gt3 gt3.params

saves the discounting parameters for unigrams, bigrams, and trigrams
in the files as indicated.  These are short files that can be
edited, e.g., to adjust the lower and upper discounting cutoffs.
Then the limited-vocabulary backoff model is estimated using these
saved parameters:

	ngram-count -text corpus -vocab 20k.vocab \
		-gt1 gt1.params -gt2 gt2.params -gt3 gt3.params -lm corpus.20k.bo
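The Good-Turing formula behind these parameters can be sketched as follows. This is an illustrative, simplified reimplementation (it omits SRILM's handling of the upper cutoff and normalization); n[r] is the count-of-counts, and the discounted count is r* = (r+1) n[r+1] / n[r]:

```python
from collections import Counter

def good_turing_discounts(counts, gtmin=1, gtmax=5):
    """Return discount factors d[r] = r*/r for 1 <= r <= gtmax."""
    n = Counter(counts)        # n[r] = number of n-grams occurring exactly r times
    d = {}
    for r in range(1, gtmax + 1):
        if r < gtmin or n[r] == 0:
            d[r] = 1.0         # below the lower cutoff (or no data): no discounting
        else:
            d[r] = (r + 1) * n[r + 1] / (r * n[r])
    return d
```

Note that with badly skewed count-of-count frequencies (e.g., after vocabulary truncation) the ratio n[r+1]/n[r] can exceed 1, which is exactly the breakage of the GT algorithm mentioned above.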
MODEL EVALUATION

The ngram program uses a backoff model to compute probabilities and
perplexity on test data.

	ngram -lm some.bo -ppl test.corpus

computes the perplexity on test.corpus according to model some.bo.
The flag -debug controls the amount of information output:

	-debug 0	only overall statistics
	-debug 1	statistics for each test sentence
	-debug 2	probabilities for each word
	-debug 3	verify that word probabilities over the
			entire vocabulary sum to 1 for each context

ngram also understands the -order flag to set the maximum ngram
order effectively used by the model.  The default is 3.
It has to be explicitly reset to use ngrams of higher order, even
if the file specified with -lm contains higher-order ngrams.

The flag -skipoovs establishes compatibility with broken behavior
in some old software.  It should only be used with bo model files
produced with the old tools.  It will

- let OOVs be counted as such even when the model has a probability
  for <unk>
- skip not just the OOV but the entire n-gram context in which any
  OOVs occur (instead of backing off on OOV contexts).
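The perplexity reported by -ppl is a simple function of the total log probability of the scored tokens; a minimal sketch (ignoring the OOV and sentence-boundary accounting that ngram itself performs):

```python
def perplexity(logprobs, num_tokens):
    """ppl = 10 ** (-sum of log10 word probabilities / number of scored tokens)."""
    return 10 ** (-sum(logprobs) / num_tokens)

# three words scored at log10 probs -1, -2, -3: average -2, so ppl = 100
```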
OTHER MODEL OPERATIONS

ngram performs a few other operations on backoff models.

	ngram -lm bo1 -mix-lm bo2 -lambda 0.2 -write-lm bo3

produces a new model in bo3 that is the interpolation of bo1 and bo2
with a weight of 0.2 (for bo1).

	ngram -lm bo -renorm -write-lm bo.new

recomputes the backoff weights in the model bo (thus normalizing
probabilities to 1) and leaves the result in bo.new.
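The interpolation computed by -mix-lm is a per-word linear combination of the two models' probabilities; a toy sketch over dict-based unigram models (not the ARPA backoff handling ngram actually performs):

```python
def mix_models(lm1, lm2, lam):
    """Interpolate two word->probability dicts with weight lam for lm1."""
    vocab = set(lm1) | set(lm2)
    return {w: lam * lm1.get(w, 0.0) + (1 - lam) * lm2.get(w, 0.0) for w in vocab}
```

If both inputs are properly normalized, the mixture again sums to 1 over the vocabulary; for backoff models it is the -renorm step that restores this property by recomputing backoff weights.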
API FOR LANGUAGE MODELS

These programs are just examples of how to use the object-oriented
language model library currently under construction.  To use the API
one would have to read the various .h files and see how the interfaces
are used in the example programs.  No comprehensive documentation is
available as yet.  Sorry.

AVAILABILITY

This code is Copyright SRI International, but is available free of
charge for non-profit use.  See the License file in the top-level
directory for the terms of use.

Andreas Stolcke
$Date: 1999/07/31 18:48:33 $
12
language_model/srilm-1.7.3/lm/test/doc/malloc-notes
Normal file
@@ -0,0 +1,12 @@

size of ngram reading Switchboard trigram

SunOS5.5.1	libc malloc			12M
		-lmalloc			12M

IRIX5.1		libc malloc			12M
		-lmalloc	MX_FAST = 24	11M
				MX_FAST = 10	12M
				MX_FAST = 32	11M
115
language_model/srilm-1.7.3/lm/test/doc/overview
Normal file
@@ -0,0 +1,115 @@

NEW SRI-LM LIBRARY AND TOOLS -- OVERVIEW

Design Goals

- coverage of state-of-the-art LM methods
- extensible vehicle for LM research
- code reusability
- tool versatility
- speed

Implementation language: C++ (GNU compiler)

LM CLASS HIERARCHY

LM
	Ngram		-- arbitrary-order N-gram backoff models
	DFNgram		-- N-grams including disfluency model
	VarNgram	-- variable-order N-grams
	TaggedNgram	-- word/tag N-grams
	CacheLM		-- unigram from recent history
	DynamicLM	-- changes as a function of external info
	BayesMix	-- mixture of LMs with contextual adaptation

OTHER CLASSES

Vocab		-- word string/index mapping
TaggedVocab	-- same for word/tag pairs

LMStats		-- statistics for LM estimation
	NgramStats		-- N-gram counts
	TaggedNgramStats	-- word/tag N-gram counts

Discount	-- backoff probability discounting
	GoodTuring	-- standard
	ConstDiscount	-- Ney's method
	NaturalDiscount	-- Ristad's method

HELPER LIBRARIES

libdstruct -- template data structures

	Array	-- self-extending arrays

	Map
		SArray	-- sorted arrays
		LHash	-- linear hash tables

	Trie	-- index trees (based on a Map type)

	MemStats -- memory usage tracking

libmisc -- convenience functions:
	option parsing,
	compressed file i/o,
	object debugging

TOOLS

ngram-count	-- N-gram counting and model estimation
ngram-merge	-- N-gram count merging
ngram		-- N-gram model scoring, perplexity,
		   sentence generation, mixing and
		   interpolation

LM INTERFACE

LogP wordProb(VocabIndex word, const VocabIndex *context)
LogP wordProb(VocabString word, const VocabString *context)

LogP sentenceProb(const VocabIndex *sentence, TextStats &stats)
LogP sentenceProb(const VocabString *sentence, TextStats &stats)

unsigned pplFile(File &file, TextStats &stats, const char *escapeString = 0)
setState(const char *state);

wordProbSum(const VocabIndex *context)

VocabIndex generateWord(const VocabIndex *context)
VocabIndex *generateSentence(unsigned maxWords, VocabIndex *sentence)
VocabString *generateSentence(unsigned maxWords, VocabString *sentence)

Boolean isNonWord(VocabIndex word)
Boolean read(File &file);
void write(File &file);
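To illustrate how the interface above fits together, here is a toy analogue in Python (the real API is C++; the unigram implementation and snake_case names below are only loosely modeled on the listing and are not SRILM code):

```python
import math
import random

class ToyUnigramLM:
    """Toy unigram model mirroring the wordProb/sentenceProb/wordProbSum interface."""

    def __init__(self, probs):
        self.probs = probs                        # word -> probability

    def word_prob(self, word, context=()):        # cf. LogP wordProb(word, context)
        return math.log10(self.probs.get(word, 1e-12))

    def sentence_prob(self, sentence):            # cf. LogP sentenceProb(sentence, stats)
        return sum(self.word_prob(w) for w in sentence)

    def word_prob_sum(self, context=()):          # cf. wordProbSum: should be ~1
        return sum(self.probs.values())

    def generate_word(self, context=()):          # cf. VocabIndex generateWord(context)
        return random.choices(list(self.probs), weights=list(self.probs.values()))[0]
```

wordProbSum corresponds to the check performed by ngram -debug 3: the word probabilities over the whole vocabulary should sum to 1 for every context.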
EXTENSIBILITY/REUSABILITY

THINGS TO DO

- Node array interface
- General interpolated LMs
- LM "shell" for interactive model manipulation and use (Tcl based)
22
language_model/srilm-1.7.3/lm/test/doc/time-space-tradeoff
Normal file
@@ -0,0 +1,22 @@

Standard, speed-optimized libraries (using hash tables)

	bin/i386-solaris/ngram -debug 1 -memuse -lm $EVAL2000/data/lms/devel2000/swbd+ch_en+h4eval2000-m-pruned.3bo.gz -write-vocab /dev/null
	reading 34610 1-grams
	reading 4826134 2-grams
	reading 9733194 3-grams
	total memory 276788424 (263.966M), used 170254952 (162.368M), wasted 106533472 (101.598M)
	Time elapsed 67.06 user 70.26 system 5.90 utilization 113.57%

"compact", space-optimized libraries (using sorted arrays)

	bin/i386-solaris_c/ngram -debug 1 -memuse -lm $EVAL2000/data/lms/devel2000/swbd+ch_en+h4eval2000-m-pruned.3bo.gz -write-vocab /dev/null
	reading 34610 1-grams
	reading 4826134 2-grams
	reading 9733194 3-grams
	total memory 175131744 (167.019M), used 170254952 (162.368M), wasted 4876792 (4.65087M)
	Time elapsed 75.17 user 81.84 system 4.47 utilization 114.83%