From 6fee75015be98dfdee7d85bf6f42da421c4b0574 Mon Sep 17 00:00:00 2001 From: nckcard Date: Thu, 3 Jul 2025 13:54:46 -0700 Subject: [PATCH] better data download & unzip instructions --- README.md | 26 ++++++++++++++++++++++---- model_training/README.md | 4 ++-- model_training/evaluate_model.py | 2 +- model_training/rnn_args.yaml | 2 +- 4 files changed, 26 insertions(+), 8 deletions(-) diff --git a/README.md b/README.md index ffa10cc..ea78782 100644 --- a/README.md +++ b/README.md @@ -16,8 +16,6 @@ Sergey D. Stavisky*, and David M. Brandman*. ## Overview This repository contains the code and data necessary to reproduce the results of the paper ["*An Accurate and Rapidly Calibrating Speech Neuroprosthesis*" by Card et al. (2024), *N Eng J Med*](https://www.nejm.org/doi/full/10.1056/NEJMoa2314132). -The code is written in Python, and the data can be downloaded from Dryad, [here](https://datadryad.org/stash/dataset/doi:10.5061/dryad.dncjsxm85). Please download this data and place it in the `data` directory before running the code. Be sure to unzip `t15_copyTask_neuralData.zip` and `t15_pretrained_rnn_baseline.zip`. - The code is organized into five main directories: `utils`, `analyses`, `data`, `model_training`, and `language_model`: - The `utils` directory contains utility functions used throughout the code. - The `analyses` directory contains the code necessary to reproduce results shown in the main text and supplemental appendix. @@ -26,7 +24,8 @@ The code is organized into five main directories: `utils`, `analyses`, `data`, ` - The `language_model` directory contains the ngram language model implementation and a pretrained 1gram language model. Pretrained 3gram and 5gram language models can be downloaded [here](https://datadryad.org/dataset/doi:10.5061/dryad.x69p8czpq) (`languageModel.tar.gz` and `languageModel_5gram.tar.gz`). See [`language_model/README.md`](language_model/README.md) for more information. 
## Data -The data used in this repository consists of various datasets for recreating figures and training/evaluating the brain-to-text model: +### Data Overview +The data used in this repository, which can be downloaded from [Dryad](https://datadryad.org/stash/dataset/doi:10.5061/dryad.dncjsxm85), consists of various datasets for recreating figures and training/evaluating the brain-to-text model: - `t15_copyTask.pkl`: This file contains the online Copy Task results required for generating Figure 2. - `t15_personalUse.pkl`: This file contains the Conversation Mode data required for generating Figure 4. - `t15_copyTask_neuralData.zip`: This dataset contains the neural data for the Copy Task. @@ -48,7 +47,26 @@ The data used in this repository consists of various datasets for recreating fig - Each block of data contains sentences drawn from a range of corpuses (Switchboard, OpenWebText2, a 50-word corpus, a custom frequent-word corpus, and a corpus of random word sequences). Furthermore, the majority of the data is during attempted vocalized speaking, but some of it is during attempted silent speaking. - `t15_pretrained_rnn_baseline.zip`: This dataset contains the pretrained RNN baseline model checkpoint and args. An example of how to load this model and use it for inference is provided in the [`model_training/evaluate_model.py`](model_training/evaluate_model.py) file. -Please download these datasets from [Dryad](https://datadryad.org/stash/dataset/doi:10.5061/dryad.dncjsxm85) and place them in the `data` directory. Be sure to unzip both datasets before running the code. +### Data Directory Structure +Please download these datasets from [Dryad](https://datadryad.org/stash/dataset/doi:10.5061/dryad.dncjsxm85) and place them in the `data` directory. Be sure to unzip `t15_copyTask_neuralData.zip` and place the resulting `hdf5_data_final` folder into the `data` directory. 
Likewise, unzip `t15_pretrained_rnn_baseline.zip` and place the resulting `t15_pretrained_rnn_baseline` folder into the `data` directory. The final directory structure should look like this: +``` +data/ +├── t15_copyTask.pkl +├── t15_personalUse.pkl +├── hdf5_data_final/ +│ ├── t15.2023.08.11/ +│ │ ├── data_train.hdf5 +│ ├── t15.2023.08.13/ +│ │ ├── data_train.hdf5 +│ │ ├── data_val.hdf5 +│ │ ├── data_test.hdf5 +│ ├── ... +├── t15_pretrained_rnn_baseline/ +│ ├── checkpoint/ +│ │ ├── args.yaml +│ │ ├── best_checkpoint +│ ├── training_log +``` ## Dependencies - The code has only been tested on Ubuntu 22.04 with two NVIDIA RTX 4090 GPUs. diff --git a/model_training/README.md b/model_training/README.md index 0474085..4282c4f 100644 --- a/model_training/README.md +++ b/model_training/README.md @@ -8,7 +8,7 @@ All model training and evaluation code was tested on a computer running Ubuntu 2 ## Setup 1. Install the required `b2txt25` conda environment by following the instructions in the root `README.md` file. This will set up the necessary dependencies for running the model training and evaluation code. -2. Download the dataset from Dryad: [Dryad Dataset](https://datadryad.org/dataset/doi:10.5061/dryad.dncjsxm85). Place the downloaded data in the `data` directory. Be sure to unzip `t15_copyTask_neuralData.zip` and `t15_pretrained_rnn_baseline.zip`. +2. Download the dataset from Dryad: [Dryad Dataset](https://datadryad.org/dataset/doi:10.5061/dryad.dncjsxm85). Place the downloaded data in the `data` directory. See the main [README.md](../README.md) file for more details on the included datasets and the proper `data` directory structure. 
## Training ### Baseline RNN Model @@ -41,7 +41,7 @@ python language_model/language-model-standalone.py --lm_path language_model/pret Finally, use the `b2txt25` conda environment to run the `evaluate_model.py` script to load the pretrained baseline RNN, use it for inference on the heldout val or test sets to get phoneme logits, pass them through the language model via redis to get word predictions, and then save the predicted sentences to a .txt file in the format required for competition submission. An example output file for the val split can be found at `rnn_baseline_submission_file_valsplit.txt`. ```bash conda activate b2txt25 -python evaluate_model.py --model_path ../data/t15_pretrained_rnn_baseline --data_dir ../data/t15_copyTask_neuralData --eval_type test --gpu_number 1 +python evaluate_model.py --model_path ../data/t15_pretrained_rnn_baseline --data_dir ../data/hdf5_data_final --eval_type test --gpu_number 1 ``` ### Shutdown redis diff --git a/model_training/evaluate_model.py b/model_training/evaluate_model.py index 97879fe..65c8fe5 100644 --- a/model_training/evaluate_model.py +++ b/model_training/evaluate_model.py @@ -16,7 +16,7 @@ from evaluate_model_helpers import * parser = argparse.ArgumentParser(description='Evaluate a pretrained RNN model on the copy task dataset.') parser.add_argument('--model_path', type=str, default='../data/t15_pretrained_rnn_baseline', help='Path to the pretrained model directory (relative to the current working directory).') -parser.add_argument('--data_dir', type=str, default='../data/t15_copyTask_neuralData', +parser.add_argument('--data_dir', type=str, default='../data/hdf5_data_final', help='Path to the dataset directory (relative to the current working directory).') parser.add_argument('--eval_type', type=str, default='test', choices=['val', 'test'], help='Evaluation type: "val" for validation set, "test" for test set. 
' diff --git a/model_training/rnn_args.yaml b/model_training/rnn_args.yaml index 63f194f..2944d29 100644 --- a/model_training/rnn_args.yaml +++ b/model_training/rnn_args.yaml @@ -81,7 +81,7 @@ dataset: test_percentage: 0.1 # percentage of data to use for testing feature_subset: null # specific features to include in the dataset - dataset_dir: ../data/t15_copyTask_neuralData # directory containing the dataset + dataset_dir: ../data/hdf5_data_final # directory containing the dataset bad_trials_dict: null # dictionary of bad trials to exclude from the dataset sessions: # list of sessions to include in the dataset - t15.2023.08.11
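
The unzip instructions added in this patch imply a specific `data/` layout before training or evaluation can run. As a quick sanity check after downloading and unzipping, a small helper like the one below could confirm that layout. The entry names come from the directory tree documented in the README hunk above; the helper itself is illustrative and not part of the repository.

```python
from pathlib import Path

# Top-level entries the README's directory tree says should exist under
# `data/` after unzipping (assumed from the patch above, not exhaustive
# of the per-session hdf5 files inside `hdf5_data_final/`).
EXPECTED_ENTRIES = (
    "t15_copyTask.pkl",
    "t15_personalUse.pkl",
    "hdf5_data_final",
    "t15_pretrained_rnn_baseline",
)

def missing_data_entries(data_dir="data"):
    """Return the documented files/folders absent from data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]
```

Running `missing_data_entries()` from the repository root and getting an empty list suggests the zips were extracted into the right place; any returned names point at archives that still need unzipping or moving.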