competition update

This commit is contained in:
nckcard
2025-07-02 12:18:09 -07:00
parent 9e17716a4a
commit 77dbcf868f
2615 changed files with 1648116 additions and 125 deletions

language_model/docs/.gitignore vendored Normal file

@@ -0,0 +1,3 @@
_gen/
_build/
build/


@@ -0,0 +1,21 @@
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SPHINXPROJ = Wenet
SOURCEDIR = .
BUILDDIR = _build
# Put it first so that "make" without argument is like "make help".
help:
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.PHONY: help Makefile
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)


@@ -0,0 +1,71 @@
# Configuration file for the Sphinx documentation builder.
#
# This file only contains a selection of the most common options. For a full
# list see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html
# -- Path setup --------------------------------------------------------------
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
# import os
# import sys
# sys.path.insert(0, os.path.abspath('.'))
# -- Project information -----------------------------------------------------
project = 'Wenet'
copyright = '2020, wenet-team'
author = 'wenet-team'
# -- General configuration ---------------------------------------------------
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
"nbsphinx",
"sphinx.ext.autodoc",
'sphinx.ext.napoleon',
'sphinx.ext.viewcode',
"sphinx.ext.mathjax",
"sphinx.ext.todo",
# "sphinxarg.ext",
"sphinx_markdown_tables",
'recommonmark',
'sphinx_rtd_theme',
]
# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
# The suffix(es) of source filenames.
# You can specify multiple suffix as a list of string:
source_suffix = {
'.rst': 'restructuredtext',
'.txt': 'markdown',
'.md': 'markdown',
}
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
# -- Options for HTML output -------------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
# html_theme = 'alabaster'
html_theme = "sphinx_rtd_theme"
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']

8 binary image files added (previews not shown); sizes: 116 KiB, 46 KiB, 139 KiB, 196 KiB, 1.5 MiB, 377 KiB, 48 KiB, 225 KiB.

@@ -0,0 +1,28 @@
.. Wenet documentation master file, created by
sphinx-quickstart on Thu Dec 3 11:43:53 2020.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
Welcome to Wenet's documentation!
=================================
WeNet is a transformer-based end-to-end ASR toolkit.
.. toctree::
:maxdepth: 1
:caption: Tutorial:
./papers.md
./tutorial.md
./lm.md
./runtime.md
./jit_in_wenet.md
Indices and tables
==================
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`


@@ -0,0 +1,31 @@
# JIT in WeNet
We want our PyTorch model to be directly exportable with the torch.jit.script method,
which is essential for deploying the model to production.
See the following resources for details on how to deploy PyTorch models in production.
- [INTRODUCTION TO TORCHSCRIPT](https://pytorch.org/tutorials/beginner/Intro_to_TorchScript_tutorial.html)
- [TORCHSCRIPT LANGUAGE REFERENCE](https://pytorch.org/docs/stable/jit_language_reference.html#language-reference)
- [LOADING A TORCHSCRIPT MODEL IN C++](https://pytorch.org/tutorials/advanced/cpp_export.html)
- [TorchScript and PyTorch JIT | Deep Dive](https://www.youtube.com/watch?v=2awmrMRf0dA&t=574s)
- [Research to Production: PyTorch JIT/TorchScript Updates](https://www.youtube.com/watch?v=St3gdHJzic0)
To ensure this, we try to export the model before the training stage.
If the export fails, we should modify the training code to satisfy the export requirements.
``` python
# See in wenet/bin/train.py
script_model = torch.jit.script(model)
script_model.save(os.path.join(args.model_dir, 'init.zip'))
```
Two principles should be taken into consideration when contributing Python code
to WeNet, especially for subclasses of torch.nn.Module and their forward functions.
1. Know what is allowed and what is disallowed.
- [Torch and Tensor Unsupported Attributes](https://pytorch.org/docs/master/jit_unsupported.html#jit-unsupported)
- [Python Language Reference Coverage](https://pytorch.org/docs/master/jit_python_reference.html#python-language-reference)
2. Use explicit typing as much as possible. You can enforce type checking
with typeguard; see https://typeguard.readthedocs.io/en/latest/userguide.html for details.
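As a hedged illustration of the second principle, here is a toy module (not WeNet code; the class name and saved path are made up) whose forward function uses explicit type annotations and an explicit Optional check, so torch.jit.script can export it:
``` python
import torch
from typing import Optional, Tuple


# Toy module (illustration only) showing TorchScript-friendly explicit typing:
# annotated arguments, an Optional handled with an explicit None check,
# and an annotated return type.
class ScaledMask(torch.nn.Module):
    def __init__(self, scale: float = 0.5):
        super().__init__()
        self.scale = scale

    def forward(self,
                x: torch.Tensor,
                mask: Optional[torch.Tensor] = None) -> Tuple[torch.Tensor, int]:
        out = x * self.scale
        if mask is not None:  # TorchScript requires the None check before use
            out = out * mask
        return out, x.size(0)


# As in wenet/bin/train.py, exporting before training catches scripting errors early.
script_model = torch.jit.script(ScaledMask())
script_model.save('init.zip')  # hypothetical output path
```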

language_model/docs/lm.md Normal file

@@ -0,0 +1,105 @@
# LM for WeNet
WeNet uses an n-gram based statistical language model and the WFST framework to support custom language models.
The LM is only supported in the WeNet runtime.
## Motivation
Why an n-gram based LM? This may be the first question many people ask.
With RNN- and Transformer-based LMs in full swing, why does WeNet go back?
The reason is simple: productivity.
The n-gram based language model has mature and complete training tools,
can be trained on any amount of corpus, trains very fast, is easy to hotfix,
and has a wide range of mature applications in real products.
Why WFST? This may be the second question many people ask.
Both industry and research have worked hard to move away from traditional speech recognition,
especially its complex decoding technology, so why does WeNet go back?
The reason is also simple: productivity.
WFST is a standard and powerful tool in traditional speech recognition.
Based on it, we have mature and complete bug-fix and product solutions;
for example, we can use the replace operation in WFST for class-based personalization such as contact recognition.
Therefore, just like WeNet's design goal "Production First and Production Ready",
the LM in WeNet also puts productivity first,
so it draws on many productive tools and solutions accumulated in traditional speech recognition.
The differences from traditional speech recognition are:
1. Training in WeNet is purely end-to-end.
2. As described below, the LM is optional in decoding; you can choose whether to use it according to your needs and application scenarios.
## System Design
The whole system is shown in the picture below. There are two ways to generate the N-best list.
![LM System Design](./images/lm_system.png)
1. Without LM, we use CTC prefix beam search to generate N-best.
2. With LM, we use CTC WFST search, which is the traditional WFST-based decoder, to generate N-best.
There are two main parts of the CTC WFST based search.
The first is building the decoding graph, which composes the modeling unit T, the lexicon L, and the language model G into one unified graph TLG, where:
1. T is the modeling unit used in E2E training, typically a char in Chinese and a char or BPE unit in English.
2. L is the lexicon, which is very simple: we just split a word into its modeling-unit sequence (see the sketch below).
For example, the word "我们" is split into the two chars "我 们", and the word "APPLE" is split into the five letters "A P P L E".
There are no phonemes, and there is no need to design pronunciations by hand.
3. G is the language model, namely the n-gram model compiled into a standard WFST representation.
The second is the decoder, which is the same as the traditional decoder and uses the standard Viterbi beam search algorithm.
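As a minimal sketch of how simple the lexicon is (illustration only, not the actual tools/fst/prepare_dict.py), each word just maps to its modeling-unit sequence, here its characters or letters:
``` python
# Illustration only: a word maps to its modeling-unit sequence.
def word_to_units(word: str):
    return list(word)


lexicon = {w: word_to_units(w) for w in ['我们', 'APPLE']}
# {'我们': ['我', '们'], 'APPLE': ['A', 'P', 'P', 'L', 'E']}
```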
## Implementation
WeNet draws on the decoder and related tools in Kaldi to support LM and WFST based decoding.
For ease of use and independence, we directly migrated the decoding-related code from Kaldi into [this directory](https://github.com/wenet-e2e/wenet/tree/main/runtime/core/kaldi) of the WeNet runtime,
and modified and organized it according to the following principles:
1. To minimize changes, the migrated code keeps the same directory structure as the original.
2. We use GLOG to replace the logging system in Kaldi.
3. We reformat the code to meet the lint requirements of the WeNet code style.
The core code is https://github.com/wenet-e2e/wenet/blob/main/runtime/core/decoder/ctc_wfst_beam_search.cc,
which wraps the LatticeFasterDecoder in Kaldi,
and we use blank frame skipping to speed up decoding.
In addition, WeNet also migrated the related tools for building the decoding graph,
such as arpa2fst, fstdeterminizestar, fsttablecompose, and fstminimizeencoded.
So all LM-related tools are built in and can be used out of the box.
## Results
We get consistent gains (3%~10%) on different datasets,
including aishell, aishell2, and librispeech;
please see the corresponding examples for details.
## How to use?
Here is an example from aishell, which shows how to prepare the dictionary, how to train the LM,
how to build the graph, and how to decode with the runtime.
``` sh
# 7.1 Prepare dict
unit_file=$dict
mkdir -p data/local/dict
cp $unit_file data/local/dict/units.txt
tools/fst/prepare_dict.py $unit_file ${data}/resource_aishell/lexicon.txt \
data/local/dict/lexicon.txt
# 7.2 Train lm
lm=data/local/lm
mkdir -p $lm
tools/filter_scp.pl data/train/text \
$data/data_aishell/transcript/aishell_transcript_v0.8.txt > $lm/text
local/aishell_train_lms.sh
# 7.3 Build decoding TLG
tools/fst/compile_lexicon_token_fst.sh \
data/local/dict data/local/tmp data/local/lang
tools/fst/make_tlg.sh data/local/lm data/local/lang data/lang_test || exit 1;
# 7.4 Decoding with runtime
./tools/decode.sh --nj 16 \
--beam 15.0 --lattice_beam 7.5 --max_active 7000 \
--blank_skip_thresh 0.98 --ctc_weight 0.5 --rescoring_weight 1.0 \
--fst_path data/lang_test/TLG.fst \
data/test/wav.scp data/test/text $dir/final.zip \
data/lang_test/words.txt $dir/lm_with_runtime
```


@@ -0,0 +1,35 @@
@ECHO OFF
pushd %~dp0
REM Command file for Sphinx documentation
if "%SPHINXBUILD%" == "" (
set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=.
set BUILDDIR=_build
if "%1" == "" goto help
%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
echo.
echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
echo.installed, then set the SPHINXBUILD environment variable to point
echo.to the full path of the 'sphinx-build' executable. Alternatively you
echo.may add the Sphinx directory to PATH.
echo.
echo.If you don't have Sphinx installed, grab it from
echo.http://sphinx-doc.org/
exit /b 1
)
%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
goto end
:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
:end
popd


@@ -0,0 +1,7 @@
## Papers
* [U2: Unified Streaming and Non-streaming Two-pass End-to-end Model for Speech Recognition](https://arxiv.org/abs/2012.05481)
* [WeNet: Production First and Production Ready End-to-End Speech Recognition Toolkit](https://arxiv.org/pdf/2102.01547v1.pdf), v1.
* [WeNet: Production Oriented Streaming and Non-streaming End-to-End Speech Recognition Toolkit](https://arxiv.org/pdf/2102.01547.pdf), v2, accepted by InterSpeech 2021.
* [U2++: Unified Two-pass Bidirectional End-to-end Model for Speech Recognition](https://arxiv.org/pdf/2106.05642.pdf)


@@ -0,0 +1,63 @@
# Runtime for WeNet
The WeNet runtime uses the [Unified Two Pass (U2)](https://arxiv.org/pdf/2012.05481.pdf) framework for inference. U2 has the following advantages:
* **Unified**: U2 unifies the streaming and non-streaming models in a simple way, and our runtime is also unified. Therefore, you can easily balance latency and accuracy by changing chunk_size (described in the following section).
* **Accurate**: U2 achieves better accuracy through joint CTC training.
* **Fast**: Our runtime uses the attention rescoring decoding method described in U2, which is much faster than traditional autoregressive beam search.
* **Other benefits**: In practice, we find U2 more stable on long-form speech than a standard transformer, which usually fails or degrades a lot on long-form speech, and we can easily get word-level timestamps from the CTC spikes in U2. Both aspects are favored for industry adoption.
## Platforms Supported
The WeNet runtime supports the following platforms.
* Server
* [x86](https://github.com/wenet-e2e/wenet/tree/main/runtime/server/x86)
* Device
* [android](https://github.com/wenet-e2e/wenet/tree/main/runtime/device/android/wenet)
## Architecture and Implementation
### Architecture
The following picture shows how U2 works.
![U2](images/u2.gif)
While the input is not finished, the input frames $x_t$ are fed into the *Shared Encoder* module frame by frame to get the encoder output $e_t$. Then $e_t$ is transformed by the *CTC Activation* module (typically just a linear transform followed by a log_softmax) to get the CTC probability $y_t$ at the current frame, and $y_t$ is further used by the *CTC prefix beam search* module to generate the n-best results at the current time $t$; the best result is used as the partial result of the U2 system.
When the input is finished at time $T$, the n-best results from the *CTC prefix beam search* module and the encoder outputs $e_1, e_2, \dots, e_T$ are fed into the *Attention Decoder* module, which computes a score for every result. The result with the best score is selected as the final result of the U2 system.
We can group $C$ consecutive frames $x_t, x_{t+1}, \dots, x_{t+C-1}$ as one chunk for the *Shared Encoder* module, where $C$ is called chunk_size in the U2 framework. The chunk_size affects the attention computation in the *Shared Encoder* module. When chunk_size is infinite, it is the non-streaming case: the system gives the best accuracy with unbounded latency. When chunk_size is limited (typically less than 1 s), it is the streaming case: the system has limited latency and still gives promising accuracy. So the developer can balance accuracy and latency by setting a proper chunk_size.
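The following conceptual sketch summarizes this two-pass flow. It is an illustration only: `encode_chunk`, `ctc_log_probs`, `update_nbest`, and `rescore` are hypothetical stand-ins, not the real exported WeNet interfaces (those are listed in the next section).
``` python
import torch


# Conceptual sketch of the U2 two-pass flow (hypothetical method names).
def u2_recognize(model, frames: torch.Tensor, chunk_size: int):
    encoder_outs = []
    nbest = []
    for start in range(0, frames.size(0), chunk_size):
        chunk = frames[start:start + chunk_size]   # C consecutive frames
        e = model.encode_chunk(chunk)              # Shared Encoder output e_t
        y = model.ctc_log_probs(e)                 # CTC Activation (linear + log_softmax)
        nbest = model.update_nbest(y, nbest)       # CTC prefix beam search
        encoder_outs.append(e)
        print('partial result:', nbest[0])         # best hypothesis so far
    # Second pass, once the input is finished: Attention Decoder rescoring.
    scores = model.rescore(nbest, torch.cat(encoder_outs, dim=0))
    return nbest[int(torch.argmax(scores))]
```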
### Interface Design
We use LibTorch to implement the U2 runtime in WeNet, and we export several interfaces from the PyTorch Python code via @torch.jit.export (see [asr_model.py](https://github.com/wenet-e2e/wenet/tree/main/wenet/transformer/asr_model.py)), which are required and used by the C++ runtime in [torch_asr_model.cc](https://github.com/wenet-e2e/wenet/tree/main/runtime/server/x86/decoder/torch_asr_model.cc) and [torch_asr_decoder.cc](https://github.com/wenet-e2e/wenet/tree/main/runtime/server/x86/decoder/torch_asr_decoder.cc). Here we just list the interfaces and give a brief introduction.
| interface | description |
|----------------------------------|-----------------------------------------|
| subsampling_rate (args) | get the subsampling rate of the model |
| right_context (args) | get the right context of the model |
| sos_symbol (args) | get the sos symbol id of the model |
| eos_symbol (args) | get the eos symbol id of the model |
| forward_encoder_chunk (args) | used for the *Shared Encoder* module |
| ctc_activation (args) | used for the *CTC Activation* module |
| forward_attention_decoder (args) | used for the *Attention Decoder* module |
### Cache in Details
For the streaming scenario, the *Shared Encoder* module works incrementally. The current chunk computation requires the inputs and outputs of all the history chunks, and we implement this incremental computation with caches. Overall, three caches are used in our runtime.
* Encoder Conformer/Transformer layer output cache: caches the output of every encoder layer.
* Conformer CNN cache: if Conformer is used, we cache the left context for the causal CNN computation in Conformer.
* Subsampling cache: caches the output of the subsampling layer, which is the input of the first encoder layer.
Please see [encoder.py:forward_chunk()](https://github.com/wenet-e2e/wenet/tree/main/wenet/transformer/encoder.py) and [torch_asr_decoder.cc](https://github.com/wenet-e2e/wenet/tree/main/runtime/server/x86/decoder/torch_asr_decoder.cc) for details of the caches.
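The cache idea can be checked with a small self-contained example (illustration only, not WeNet code): for a causal CNN with kernel size k, caching the last k-1 input frames of the previous chunks reproduces the full-sequence output chunk by chunk.
``` python
import torch
import torch.nn.functional as F

# Illustration only: a causal conv needs just the last k-1 frames as cache.
torch.manual_seed(0)
k, D, T, C = 3, 4, 20, 5                       # kernel, feat dim, total frames, chunk size
w = torch.randn(D, D, k)
x = torch.randn(1, D, T)

full = F.conv1d(F.pad(x, (k - 1, 0)), w)       # causal conv over the whole input

cache = torch.zeros(1, D, k - 1)               # left-context cache
outs = []
for s in range(0, T, C):
    inp = torch.cat([cache, x[:, :, s:s + C]], dim=2)  # prepend cached left context
    outs.append(F.conv1d(inp, w))                      # chunk output, no padding needed
    cache = inp[:, :, -(k - 1):]                       # keep last k-1 frames for next chunk

assert torch.allclose(full, torch.cat(outs, dim=2), atol=1e-5)
```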
In practice, CNN is also used in the subsampling, so we should handle the CNN cache there as well. However, since there are several CNN layers in subsampling with different left contexts, right contexts, and strides, it is tricky to implement a CNN cache in subsampling directly. In our implementation, we simply overlap the inputs to avoid a subsampling CNN cache. This is simple and straightforward, with negligible additional cost, since the subsampling CNN accounts for only a very small fraction of the whole computation. The following picture shows how it works, where the blue color marks the overlap between the current inputs and the previous inputs.
![Overlap input for Subsampling CNN](images/subsampling_overalp.gif)
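A hedged sketch of the overlap trick (illustration only; `context`, the number of re-fed left-context frames, is a hypothetical value here):
``` python
import torch


# Illustration only: re-feed a few previous frames so the subsampling CNN needs no cache.
def chunks_with_overlap(frames: torch.Tensor, chunk_size: int, context: int):
    start = 0
    while start < frames.size(0):
        left = max(0, start - context)            # the blue "overlap" part
        yield frames[left:start + chunk_size]     # overlapped input for subsampling
        start += chunk_size


for piece in chunks_with_overlap(torch.randn(16, 80), chunk_size=4, context=2):
    pass  # feed `piece` to the subsampling CNN; keep only the outputs for the new frames
```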
## References
1. [Sequence Modeling With CTC](https://distill.pub/2017/ctc/)
2. [First-Pass Large Vocabulary Continuous Speech Recognition using Bi-Directional Recurrent DNNs](https://arxiv.org/pdf/1408.2873.pdf)
3. [Unified Streaming and Non-streaming Two-pass End-to-end Model for Speech Recognition](https://arxiv.org/pdf/2012.05481.pdf)


@@ -0,0 +1,192 @@
## Tutorial
If you meet any problems while going through this tutorial, please feel free to ask in the GitHub [issues](https://github.com/mobvoi/wenet/issues). Thanks for any kind of feedback.
### Setup environment
- Clone the repo
```sh
git clone https://github.com/mobvoi/wenet.git
```
- Install Conda
https://docs.conda.io/en/latest/miniconda.html
- Create Conda env
PyTorch 1.6.0 is recommended. We met some NCCL errors when using 1.7.0 on 2080 Ti GPUs.
```
conda create -n wenet python=3.8
conda activate wenet
pip install -r requirements.txt
conda install pytorch==1.6.0 cudatoolkit=10.1 torchaudio=0.6.0 -c pytorch
```
### First Experiment
We provide a recipe `example/aishell/s0/run.sh` on aishell-1 data.
The recipe is simple, and we suggest you run each stage one by one manually and check the results to understand the whole process.
```
cd example/aishell/s0
bash run.sh --stage -1 --stop-stage -1
bash run.sh --stage 0 --stop-stage 0
bash run.sh --stage 1 --stop-stage 1
bash run.sh --stage 2 --stop-stage 2
bash run.sh --stage 3 --stop-stage 3
bash run.sh --stage 4 --stop-stage 4
bash run.sh --stage 5 --stop-stage 5
bash run.sh --stage 6 --stop-stage 6
```
You could also just run the whole script
```
bash run.sh --stage -1 --stop-stage 6
```
#### Stage -1: Download data
This stage downloads the aishell-1 data to the local path `$data`. This may take several hours. If you have already downloaded the data, please change the `$data` variable in `run.sh` and start from `--stage 0`.
#### Stage 0: Prepare Training data
In this stage, `local/aishell_data_prep.sh` organizes the original aishell-1 data into two files:
* **wav.scp**: each line records two tab-separated columns: `wav_id` and `wav_path`
* **text**: each line records two tab-separated columns: `wav_id` and `text_label`
**wav.scp**
```
BAC009S0002W0122 /export/data/asr-data/OpenSLR/33/data_aishell/wav/train/S0002/BAC009S0002W0122.wav
BAC009S0002W0123 /export/data/asr-data/OpenSLR/33/data_aishell/wav/train/S0002/BAC009S0002W0123.wav
BAC009S0002W0124 /export/data/asr-data/OpenSLR/33/data_aishell/wav/train/S0002/BAC009S0002W0124.wav
BAC009S0002W0125 /export/data/asr-data/OpenSLR/33/data_aishell/wav/train/S0002/BAC009S0002W0125.wav
...
```
**text**
```
BAC009S0002W0122 而对楼市成交抑制作用最大的限购
BAC009S0002W0123 也成为地方政府的眼中钉
BAC009S0002W0124 自六月底呼和浩特市率先宣布取消限购后
BAC009S0002W0125 各地政府便纷纷跟进
...
```
If you want to train using your customized data, just organize the data into two files `wav.scp` and `text`, and start from `stage 1`.
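As a minimal sketch (illustration only; the recipe's own scripts handle this themselves), these two files can be read in Python like this:
``` python
# Illustration only: read wav.scp or text into a {key: value} dict.
def read_scp(path: str) -> dict:
    table = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            key, value = line.strip().split(maxsplit=1)  # wav_id, then path or text
            table[key] = value
    return table


wavs = read_scp('data/train/wav.scp')   # {'BAC009S0002W0122': '/export/.../BAC009S0002W0122.wav', ...}
texts = read_scp('data/train/text')     # {'BAC009S0002W0122': '而对楼市成交抑制作用最大的限购', ...}
```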
#### Stage 1: Extract optional CMVN features
`example/aishell/s0` uses raw wav as input and [TorchAudio](https://pytorch.org/audio/stable/index.html) to extract the features just in time in the dataloader. So in this step we just copy the training wav.scp and text files into the `raw_wav/train/` dir.
`tools/compute_cmvn_stats.py` is used to extract global CMVN (cepstral mean and variance normalization) statistics. These statistics will be used to normalize the acoustic features. Setting `cmvn=false` will skip this step.
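A minimal sketch of the global CMVN idea (illustration only, not `tools/compute_cmvn_stats.py` itself):
``` python
import torch


# Illustration only: compute global mean/std over all training features,
# then normalize each utterance with the shared statistics.
def compute_cmvn(feature_list):
    frames = torch.cat(list(feature_list), dim=0)   # (total_frames, feat_dim)
    return frames.mean(dim=0), frames.std(dim=0)


def apply_cmvn(feat, mean, std, eps: float = 1e-8):
    return (feat - mean) / (std + eps)


utts = [torch.randn(100, 80), torch.randn(150, 80)]  # two fake utterances
mean, std = compute_cmvn(utts)
normalized = [apply_cmvn(u, mean, std) for u in utts]
```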
#### Stage 2: Generate label token dictionary
The dict is a map between label tokens (we use characters for Aishell-1) and
the integer indices.
An example dict is as follows
```
<blank> 0
<unk> 1
一 2
丁 3
...
龚 4230
龟 4231
<sos/eos> 4232
```
* `<blank>` denotes the blank symbol for CTC.
* `<unk>` denotes the unknown token; any out-of-vocabulary token will be mapped to it.
* `<sos/eos>` denotes the start-of-speech and end-of-speech symbols for attention-based encoder-decoder training; they share the same id.
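A minimal sketch (illustration only) of building such a dictionary from the training transcripts:
``` python
# Illustration only: characters from the transcripts plus the special symbols.
def build_dict(transcripts):
    chars = sorted({ch for text in transcripts for ch in text if not ch.isspace()})
    tokens = ['<blank>', '<unk>'] + chars + ['<sos/eos>']
    return {token: idx for idx, token in enumerate(tokens)}


vocab = build_dict(['而对楼市成交抑制作用最大的限购', '也成为地方政府的眼中钉'])
# vocab['<blank>'] == 0, vocab['<unk>'] == 1, and '<sos/eos>' gets the last id.
```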
#### Stage 3: Prepare WeNet data format
This stage generates a single WeNet format file including all the input/output information needed by neural network training/evaluation.
See the generated training feature file in `fbank_pitch/train/format.data`.
In the WeNet format file, each line records a data sample with seven tab-separated columns. For example, a line is as follows (tabs replaced with newlines here):
```
utt:BAC009S0764W0121
feat:/export/data/asr-data/OpenSLR/33/data_aishell/wav/test/S0764/BAC009S0764W0121.wav
feat_shape:4.2039375
text:甚至出现交易几乎停滞的情况
token:甚 至 出 现 交 易 几 乎 停 滞 的 情 况
tokenid:2474 3116 331 2408 82 1684 321 47 235 2199 2553 1319 307
token_shape:13,4233
```
`feat_shape` is the duration (in seconds) of the wav.
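A hedged sketch of parsing one such line (illustration only): the columns are tab-separated `key:value` pairs.
``` python
# Illustration only: parse a WeNet format line into a dict of its seven columns.
def parse_format_line(line: str) -> dict:
    sample = {}
    for column in line.rstrip('\n').split('\t'):
        key, value = column.split(':', 1)
        sample[key] = value
    return sample
# parse_format_line(line)['feat_shape'] is the wav duration in seconds, e.g. '4.2039375'.
```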
#### Stage 4: Neural Network training
The NN model is trained in this step.
- Multi-GPU mode
If using DDP mode for multi-GPU training, we suggest `dist_backend="nccl"`. If NCCL does not work, try `gloo` or use `torch==1.6.0`.
Set the GPU ids in CUDA_VISIBLE_DEVICES. For example, set `export CUDA_VISIBLE_DEVICES="0,1,2,3,6,7"` to use cards 0,1,2,3,6,7.
- Resume training
If your experiment is terminated after running several epochs for some reason (e.g., the GPU is accidentally used by other people and runs out of memory), you can continue training from a checkpoint model. Just find the last finished epoch in `exp/your_exp/`, set `checkpoint=exp/your_exp/$n.pt`, and run `run.sh --stage 4`. The training will then continue from epoch $n+1.
- Config
The configuration of the neural network structure, optimization parameters, loss parameters, and dataset can be set in a YAML file.
In `conf/`, we provide several model configs such as transformer and conformer; see `conf/train_conformer.yaml` for reference.
- Use Tensorboard
The training takes several hours. The actual time depends on the number and type of your GPU cards. On an 8-card 2080 Ti machine, it takes less than one day for 50 epochs.
You could use tensorboard to monitor the loss.
```
tensorboard --logdir tensorboard/$your_exp_name/ --port 12598 --bind_all
```
#### Stage 5: Recognize wav using the trained model
This stage shows how to recognize a set of wavs into texts. It also shows how to do the model averaging.
- Average model
If `${average_checkpoint}` is set to `true`, the best `${average_num}` models on the cross-validation set will be averaged to generate a boosted model, which is then used for recognition (see the sketch at the end of this stage).
- Decoding
Recognition is also called decoding or inference. The NN is applied to the input acoustic feature sequence to output a sequence of text.
Four decoding methods are provided in WeNet:
* `ctc_greedy_search` : encoder + CTC greedy search
* `ctc_prefix_beam_search` : encoder + CTC prefix beam search
* `attention` : encoder + attention-based decoder decoding
* `attention_rescoring` : rescoring the CTC candidates from CTC prefix beam search with the attention-based decoder, using the encoder output.
In general, attention_rescoring is the best method. Please see [U2 paper](https://arxiv.org/pdf/2012.05481.pdf) for the details of these algorithms.
`--beam_size` is a tunable parameter: a larger beam size may give better results but also incurs a higher computation cost.
`--batch_size` can be greater than 1 for the "ctc_greedy_search" and "attention" decoding modes, and must be 1 for the "ctc_prefix_beam_search" and "attention_rescoring" decoding modes.
- WER evaluation
`tools/compute-wer.py` will calculate the word (or char) error rate of the result. If you run the recipe without any change, you may get WER ~= 5%.
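A minimal sketch of the checkpoint-averaging idea mentioned above (illustration only, assuming each checkpoint file stores a plain state_dict; not the recipe's actual averaging script):
``` python
import torch


# Illustration only: element-wise average of the parameters of the selected checkpoints.
def average_checkpoints(paths):
    avg = None
    for path in paths:
        state = torch.load(path, map_location='cpu')
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in avg:
                avg[k] += state[k].float()
    return {k: v / len(paths) for k, v in avg.items()}
```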
#### Stage 6: Export the trained model
`wenet/bin/export_jit.py` will export the trained model using LibTorch. The exported model files can easily be used for inference in other programming languages such as C++.