- But Accelerate expects a proper batch_sampler when iterating
- This is a fundamental incompatibility between our batching approach and Accelerate's expectations

## COMPREHENSIVE SOLUTION ✅

### Problem Resolution Status
1. ~~even_batches Error~~ ✅ RESOLVED with DataLoaderConfiguration
2. ~~batch_sampler None Error~~ ✅ RESOLVED with custom collate_fn
3. ~~Data Type Mismatch Error~~ ✅ RESOLVED with bf16 conversion in dataset

### Latest Error (2025-10-12 13:38)

```
INVALID_ARGUMENT: Call parameter must match argument; got parameter 0 shape: f32[64,7168], argument shape: bf16[64,7168].
```

**Root Cause**: Mixed precision training with `mixed_precision='bf16'` expects all tensors to be `bf16`, but our data is being loaded as `f32` (float32).

**Analysis**:
- We enabled `bf16` mixed precision in the Accelerator configuration (sketched below)
- Model parameters are automatically converted to `bf16`
- But the input data remains `f32`, causing a type mismatch during the forward pass
- The TPU XLA compiler is strict about type matching
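
For reference, a minimal sketch of the Accelerator setup this implies (assuming a recent `accelerate` release that exposes `DataLoaderConfiguration`; variable names are illustrative):

```python
from accelerate import Accelerator
from accelerate.utils import DataLoaderConfiguration

# even_batches=False is needed because our DataLoaders use batch_size=1
# with data that is already batched inside the dataset.
dataloader_config = DataLoaderConfiguration(even_batches=False)

accelerator = Accelerator(
    dataloader_config=dataloader_config,
    mixed_precision='bf16',  # model parameters become bf16, so inputs must be bf16 too
)
```
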
### Solution: Data Type Conversion in Dataset

Fixed in `dataset.py:130` by converting neural data to `bf16`:

```python
# Before (causes type mismatch):
input_features = torch.from_numpy(g['input_features'][:])  # defaults to f32

# After (TPU compatible):
input_features = torch.from_numpy(g['input_features'][:]).to(torch.bfloat16)  # convert to bf16 for TPU compatibility
```

### Final Implementation

**Key Insight**: Our dataset's `__getitem__()` returns complete batches, but Accelerate expects individual samples. The solution is to use `batch_size=1` and a custom `collate_fn` that unwraps the pre-batched data.
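
A minimal sketch of this DataLoader wiring (the helper names `unwrap_collate` and `build_prebatched_loader` are hypothetical; the actual code lives in `rnn_trainer.py:193-210`):

```python
from torch.utils.data import DataLoader, Dataset


def unwrap_collate(batch):
    # The dataset's __getitem__ already returns a complete batch, so the
    # DataLoader hands us a list of length 1; return its only element.
    return batch[0]


def build_prebatched_loader(dataset: Dataset, accelerator):
    """Wrap a dataset that yields whole batches so Accelerate can prepare it."""
    loader = DataLoader(
        dataset,
        batch_size=1,              # one "item" is one pre-built batch
        shuffle=False,             # ordering/batching is handled inside the dataset
        collate_fn=unwrap_collate,
    )
    return accelerator.prepare(loader)
```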

## Complete Solution Summary

### Three-Step Fix for TPU Training
1. **DataLoaderConfiguration**: Added `even_batches=False` for batch_size=1 DataLoaders
2. **Custom collate_fn**: Handles pre-batched data from our dataset
3. **Data Type Conversion**: Convert input data to `bf16` for mixed precision compatibility
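
Putting the three fixes together, a hedged sketch of the resulting training loop (`compute_loss` and any batch keys other than `input_features` are hypothetical, not the actual `rnn_trainer.py` code):

```python
def training_loop(accelerator, model, optimizer, train_loader, compute_loss, num_epochs):
    # Prepare once; even_batches=False, the unwrap collate_fn, and the bf16
    # dataset conversion are all assumed to be in place at this point.
    model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)
    for _ in range(num_epochs):
        for batch in train_loader:              # collate_fn already unwrapped the pre-built batch
            features = batch['input_features']  # bf16, matching the bf16 model parameters
            loss = compute_loss(model, features, batch)  # hypothetical loss helper
            accelerator.backward(loss)          # let Accelerate handle the mixed-precision backward
            optimizer.step()
            optimizer.zero_grad()
```
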
### Files Modified
- [rnn_trainer.py:44-46](f:\BRAIN-TO-TEXT\nejm-brain-to-text.worktrees\dev2\model_training_nnn\rnn_trainer.py#L44-L46): Added DataLoaderConfiguration
- [rnn_trainer.py:193-210](f:\BRAIN-TO-TEXT\nejm-brain-to-text.worktrees\dev2\model_training_nnn\rnn_trainer.py#L193-L210): Custom collate_fn and batch_size=1
- [dataset.py:130](f:\BRAIN-TO-TEXT\nejm-brain-to-text.worktrees\dev2\model_training_nnn\dataset.py#L130): Convert neural data to bf16

### Next Steps
1. ~~Implement even_batches=False~~ ✅ DONE
2. ~~Fix batch_sampler None issue~~ ✅ DONE
3. ~~Fix data type mismatch~~ ✅ DONE
4. Test TPU training with complete solution
5. Integrate final solution into CLAUDE.md

## Lessons Learned
- Don't overcomplicate TPU conversion - it should be straightforward

The corresponding change in `dataset.py` (the `dataset.py:130` fix referenced above):

```diff
@@ -127,7 +127,7 @@ class BrainToTextDataset(Dataset):
 g = f[f'trial_{t:04d}']

 # Remove features if necessary
-input_features = torch.from_numpy(g['input_features'][:]) # neural data
+input_features = torch.from_numpy(g['input_features'][:]).to(torch.bfloat16) # neural data - convert to bf16 for TPU compatibility
 if self.feature_subset:
     input_features = input_features[:,self.feature_subset]
```