diff --git a/TPU_ISSUES_RECORD.md b/TPU_ISSUES_RECORD.md
index f0db3a5..e5aeceb 100644
--- a/TPU_ISSUES_RECORD.md
+++ b/TPU_ISSUES_RECORD.md
@@ -95,11 +95,35 @@ TypeError: 'NoneType' object is not iterable
 - But Accelerate expects a proper batch_sampler when iterating
 - This is a fundamental incompatibility between our batching approach and Accelerate's expectations
 
-## FINAL SOLUTION ✅
+## COMPREHENSIVE SOLUTION ✅
 
-### Problem Resolution
+### Problem Resolution Status
 1. ~~even_batches Error~~ ✅ RESOLVED with DataLoaderConfiguration
 2. ~~batch_sampler None Error~~ ✅ RESOLVED with custom collate_fn
+3. ~~Data Type Mismatch Error~~ ✅ RESOLVED with bf16 conversion in dataset
+
+### Latest Error (2025-10-12 13:38)
+```
+INVALID_ARGUMENT: Call parameter must match argument; got parameter 0 shape: f32[64,7168], argument shape: bf16[64,7168].
+```
+
+**Root Cause**: Mixed precision training with `mixed_precision='bf16'` expects all tensors to be `bf16`, but our data is being loaded as `f32` (float32).
+
+**Analysis**:
+- We enabled `bf16` mixed precision in the Accelerator configuration
+- Model parameters are automatically converted to `bf16`
+- But input data remains `f32`, causing a type mismatch during the forward pass
+- The TPU XLA compiler is strict about type matching
+
+### Solution: Data Type Conversion in Dataset
+Fixed in `dataset.py:130` by converting neural data to `bf16`:
+```python
+# Before (causes type mismatch):
+input_features = torch.from_numpy(g['input_features'][:]) # defaults to f32
+
+# After (TPU compatible):
+input_features = torch.from_numpy(g['input_features'][:]).to(torch.bfloat16) # convert to bf16 for TPU compatibility
+```
 
 ### Final Implementation
 ```python
@@ -128,11 +152,24 @@ self.train_loader = DataLoader(
 
 **Key Insight**: Our dataset's `__getitem__()` returns complete batches, but Accelerate expects individual samples. The solution is to use `batch_size=1` and a custom `collate_fn` that unwraps the pre-batched data.
 
-## Next Steps
+## Complete Solution Summary
+
+### Three-Step Fix for TPU Training
+1. **DataLoaderConfiguration**: Added `even_batches=False` for batch_size=1 DataLoaders
+2. **Custom collate_fn**: Handles pre-batched data from our dataset (see the wiring sketch below)
+3. **Data Type Conversion**: Convert input data to `bf16` for mixed precision compatibility
+
+### Files Modified
+- [rnn_trainer.py:44-46](f:\BRAIN-TO-TEXT\nejm-brain-to-text.worktrees\dev2\model_training_nnn\rnn_trainer.py#L44-L46): Added DataLoaderConfiguration
+- [rnn_trainer.py:193-210](f:\BRAIN-TO-TEXT\nejm-brain-to-text.worktrees\dev2\model_training_nnn\rnn_trainer.py#L193-L210): Custom collate_fn and batch_size=1
+- [dataset.py:130](f:\BRAIN-TO-TEXT\nejm-brain-to-text.worktrees\dev2\model_training_nnn\dataset.py#L130): Convert neural data to bf16
+
+### Next Steps
 1. ~~Implement even_batches=False~~ ✅ DONE
 2. ~~Fix batch_sampler None issue~~ ✅ DONE
-3. Test TPU training with complete solution
-4. Integrate final solution into CLAUDE.md
+3. ~~Fix data type mismatch~~ ✅ DONE
+4. Test TPU training with complete solution
+5. Integrate final solution into CLAUDE.md
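+
+### Minimal Wiring Sketch
+The three steps above fit together roughly as follows. This is a hedged sketch rather than the exact code in `rnn_trainer.py`: it assumes `DataLoaderConfiguration` is imported from `accelerate.utils`, and the names `train_dataset` and `unwrap_collate` are hypothetical placeholders for illustration only.
+```python
+from accelerate import Accelerator
+from accelerate.utils import DataLoaderConfiguration
+from torch.utils.data import DataLoader
+
+# Step 1: even_batches=False lets Accelerate accept batch_size=1 DataLoaders
+accelerator = Accelerator(
+    mixed_precision='bf16',
+    dataloader_config=DataLoaderConfiguration(even_batches=False),
+)
+
+# Step 2: the dataset's __getitem__ already returns a full batch, so the
+# DataLoader uses batch_size=1 and a collate_fn that simply unwraps it
+def unwrap_collate(samples):
+    return samples[0]  # samples is a list of length 1 holding the pre-built batch
+
+train_loader = DataLoader(
+    train_dataset,  # hypothetical BrainToTextDataset instance
+    batch_size=1,
+    shuffle=False,
+    collate_fn=unwrap_collate,
+)
+
+# Step 3 (bf16 conversion) happens inside the dataset itself, so batches
+# arriving after prepare() already match the bf16 model parameters
+train_loader = accelerator.prepare(train_loader)
+```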
 
 ## Lessons Learned
 - Don't overcomplicate TPU conversion - it should be straightforward
diff --git a/model_training_nnn/dataset.py b/model_training_nnn/dataset.py
index 7b5e81d..e964676 100644
--- a/model_training_nnn/dataset.py
+++ b/model_training_nnn/dataset.py
@@ -126,8 +126,8 @@ class BrainToTextDataset(Dataset):
             try:
                 g = f[f'trial_{t:04d}']
 
-                # Remove features is neccessary
-                input_features = torch.from_numpy(g['input_features'][:]) # neural data
+                # Remove features if necessary
+                input_features = torch.from_numpy(g['input_features'][:]).to(torch.bfloat16) # neural data - convert to bf16 for TPU compatibility
 
                 if self.feature_subset:
                     input_features = input_features[:,self.feature_subset]