This commit is contained in:
Zchen
2025-10-12 21:43:12 +08:00
parent 6e1d8e18f7
commit dfb3f7312c
2 changed files with 44 additions and 7 deletions


@@ -95,11 +95,35 @@ TypeError: 'NoneType' object is not iterable
- But Accelerate expects a proper batch_sampler when iterating
- This is a fundamental incompatibility between our batching approach and Accelerate's expectations
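For context, a minimal, hypothetical sketch of the pre-batched dataset pattern described above (names are illustrative, not the actual ones in `dataset.py`); with automatic batching disabled, the loader has no `batch_sampler` for Accelerate to rewrap:
```python
import torch
from torch.utils.data import Dataset, DataLoader

class PreBatchedDataset(Dataset):
    """Illustrative only: each index returns an already-assembled batch."""
    def __init__(self, batches):
        self.batches = batches  # list of dicts of stacked tensors

    def __len__(self):
        return len(self.batches)

    def __getitem__(self, idx):
        return self.batches[idx]  # a complete batch, e.g. {'input_features': (B, T, F)}

# batch_size=None disables PyTorch's automatic batching, so this loader's
# batch_sampler is None -- the attribute Accelerate tries to iterate.
loader = DataLoader(PreBatchedDataset([]), batch_size=None)
```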
## FINAL SOLUTION ✅
## COMPREHENSIVE SOLUTION ✅
### Problem Resolution
### Problem Resolution Status
1. ~~even_batches Error~~ ✅ RESOLVED with DataLoaderConfiguration
2. ~~batch_sampler None Error~~ ✅ RESOLVED with custom collate_fn
3. ~~Data Type Mismatch Error~~ ✅ RESOLVED with bf16 conversion in dataset
### Latest Error (2025-10-12 13:38)
```
INVALID_ARGUMENT: Call parameter must match argument; got parameter 0 shape: f32[64,7168], argument shape: bf16[64,7168].
```
**Root Cause**: Mixed precision training with `mixed_precision='bf16'` expects all tensors to be `bf16`, but our data is being loaded as `f32` (float32).
**Analysis**:
- We enabled `bf16` mixed precision in the Accelerator configuration
- Model parameters are automatically converted to `bf16`
- But the input data remains `f32`, causing a type mismatch during the forward pass
- The TPU XLA compiler is strict about type matching
### Solution: Data Type Conversion in Dataset
Fixed in `dataset.py:130` by converting neural data to `bf16`:
```python
# Before (causes type mismatch):
input_features = torch.from_numpy(g['input_features'][:]) # defaults to f32
# After (TPU compatible):
input_features = torch.from_numpy(g['input_features'][:]).to(torch.bfloat16) # convert to bf16 for TPU compatibility
```
### Final Implementation
```python
@@ -128,11 +152,24 @@ self.train_loader = DataLoader(
**Key Insight**: Our dataset's `__getitem__()` returns complete batches, but Accelerate expects individual samples. The solution is to use `batch_size=1` and a custom `collate_fn` that unwraps the pre-batched data.
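A minimal sketch of that fix follows; the function and variable names are illustrative rather than the exact ones in `rnn_trainer.py:193-210`, and `train_dataset` is assumed to be the pre-batched dataset instance:
```python
from torch.utils.data import DataLoader

def unwrap_collate_fn(items):
    # With batch_size=1 the DataLoader passes a list with exactly one element,
    # and that element is already a complete batch built by __getitem__(),
    # so we simply return it unchanged.
    assert len(items) == 1
    return items[0]

train_loader = DataLoader(
    train_dataset,               # pre-batched dataset (assumed to exist)
    batch_size=1,                # one "sample" == one pre-built batch
    shuffle=True,
    collate_fn=unwrap_collate_fn,
)
# This loader has a real batch_sampler, so accelerator.prepare(train_loader)
# no longer hits the "'NoneType' object is not iterable" error.
```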
## Next Steps
## Complete Solution Summary
### Three-Step Fix for TPU Training
1. **DataLoaderConfiguration**: Added `even_batches=False` for batch_size=1 DataLoaders
2. **Custom collate_fn**: Handles pre-batched data from our dataset
3. **Data Type Conversion**: Convert input data to `bf16` for mixed precision compatibility
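Putting the three steps together, a minimal sketch of the Accelerator setup might look like this (assuming the collate pattern above, a `train_dataset` that already emits `bf16` features, and illustrative placeholder names for the model and optimizer):
```python
import torch
from torch.utils.data import DataLoader
from accelerate import Accelerator
from accelerate.utils import DataLoaderConfiguration

# Step 1: even_batches=False so Accelerate accepts the batch_size=1 loaders.
accelerator = Accelerator(
    dataloader_config=DataLoaderConfiguration(even_batches=False),
    mixed_precision='bf16',  # model parameters are cast to bf16 on TPU
)

# Placeholder model/optimizer so the sketch is self-contained.
model = torch.nn.Linear(7168, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Step 2: batch_size=1 plus a collate_fn that unwraps the pre-built batch.
train_loader = DataLoader(train_dataset, batch_size=1,
                          collate_fn=lambda items: items[0])

# Step 3 lives in the dataset: input_features are already torch.bfloat16
# (dataset.py:130), so they match the bf16 parameters in the XLA graph.
model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)
```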
### Files Modified
- [rnn_trainer.py:44-46](f:\BRAIN-TO-TEXT\nejm-brain-to-text.worktrees\dev2\model_training_nnn\rnn_trainer.py#L44-L46): Added DataLoaderConfiguration
- [rnn_trainer.py:193-210](f:\BRAIN-TO-TEXT\nejm-brain-to-text.worktrees\dev2\model_training_nnn\rnn_trainer.py#L193-L210): Custom collate_fn and batch_size=1
- [dataset.py:130](f:\BRAIN-TO-TEXT\nejm-brain-to-text.worktrees\dev2\model_training_nnn\dataset.py#L130): Convert neural data to bf16
### Next Steps
1. ~~Implement even_batches=False~~ ✅ DONE
2. ~~Fix batch_sampler None issue~~ ✅ DONE
3. Test TPU training with complete solution
4. Integrate final solution into CLAUDE.md
3. ~~Fix data type mismatch~~ ✅ DONE
4. Test TPU training with complete solution
5. Integrate final solution into CLAUDE.md
## Lessons Learned
- Don't overcomplicate TPU conversion - it should be straightforward


@@ -127,7 +127,7 @@ class BrainToTextDataset(Dataset):
g = f[f'trial_{t:04d}']
# Remove features if necessary
input_features = torch.from_numpy(g['input_features'][:]) # neural data
input_features = torch.from_numpy(g['input_features'][:]).to(torch.bfloat16) # neural data - convert to bf16 for TPU compatibility
if self.feature_subset:
input_features = input_features[:,self.feature_subset]