This commit is contained in:
Zchen
2025-10-12 21:43:12 +08:00
parent 6e1d8e18f7
commit dfb3f7312c
2 changed files with 44 additions and 7 deletions


@@ -95,11 +95,35 @@ TypeError: 'NoneType' object is not iterable
- But Accelerate expects a proper batch_sampler when iterating
- This is a fundamental incompatibility between our batching approach and Accelerate's expectations
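For context, a minimal, hypothetical sketch of the pre-batched dataset pattern described above (names are illustrative, not the actual ones in `dataset.py`); with automatic batching disabled, the loader has no `batch_sampler` for Accelerate to rewrap:
```python
import torch
from torch.utils.data import Dataset, DataLoader

class PreBatchedDataset(Dataset):
    """Illustrative only: each index returns an already-assembled batch."""
    def __init__(self, batches):
        self.batches = batches  # list of dicts of stacked tensors

    def __len__(self):
        return len(self.batches)

    def __getitem__(self, idx):
        return self.batches[idx]  # a complete batch, e.g. {'input_features': (B, T, F)}

# batch_size=None disables PyTorch's automatic batching, so this loader's
# batch_sampler is None -- the attribute Accelerate tries to iterate.
loader = DataLoader(PreBatchedDataset([]), batch_size=None)
```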
## FINAL SOLUTION ✅
## COMPREHENSIVE SOLUTION ✅
### Problem Resolution
### Problem Resolution Status
1. ~~even_batches Error~~ ✅ RESOLVED with DataLoaderConfiguration
2. ~~batch_sampler None Error~~ ✅ RESOLVED with custom collate_fn
3. ~~Data Type Mismatch Error~~ ✅ RESOLVED with bf16 conversion in dataset
### Latest Error (2025-10-12 13:38)
```
INVALID_ARGUMENT: Call parameter must match argument; got parameter 0 shape: f32[64,7168], argument shape: bf16[64,7168].
```
**Root Cause**: Mixed precision training with `mixed_precision='bf16'` expects all tensors to be `bf16`, but our data is being loaded as `f32` (float32).
**Analysis**:
- We enabled `bf16` mixed precision in the Accelerator configuration
- Model parameters are automatically converted to `bf16`
- But the input data remains `f32`, causing a type mismatch during the forward pass
- The TPU XLA compiler is strict about type matching
### Solution: Data Type Conversion in Dataset
Fixed in `dataset.py:130` by converting neural data to `bf16`:
```python
# Before (causes type mismatch):
input_features = torch.from_numpy(g['input_features'][:]) # defaults to f32
# After (TPU compatible):
input_features = torch.from_numpy(g['input_features'][:]).to(torch.bfloat16) # convert to bf16 for TPU compatibility
```
### Final Implementation
```python
@@ -128,11 +152,24 @@ self.train_loader = DataLoader(
**Key Insight**: Our dataset's `__getitem__()` returns complete batches, but Accelerate expects individual samples. The solution is to use `batch_size=1` and a custom `collate_fn` that unwraps the pre-batched data.
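A minimal sketch of that fix follows; the function and variable names are illustrative rather than the exact ones in `rnn_trainer.py:193-210`, and `train_dataset` is assumed to be the pre-batched dataset instance:
```python
from torch.utils.data import DataLoader

def unwrap_collate_fn(items):
    # With batch_size=1 the DataLoader passes a list with exactly one element,
    # and that element is already a complete batch built by __getitem__(),
    # so we simply return it unchanged.
    assert len(items) == 1
    return items[0]

train_loader = DataLoader(
    train_dataset,               # pre-batched dataset (assumed to exist)
    batch_size=1,                # one "sample" == one pre-built batch
    shuffle=True,
    collate_fn=unwrap_collate_fn,
)
# This loader has a real batch_sampler, so accelerator.prepare(train_loader)
# no longer hits the "'NoneType' object is not iterable" error.
```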
## Next Steps
## Complete Solution Summary
### Three-Step Fix for TPU Training
1. **DataLoaderConfiguration**: Added `even_batches=False` for batch_size=1 DataLoaders
2. **Custom collate_fn**: Handles pre-batched data from our dataset
3. **Data Type Conversion**: Convert input data to `bf16` for mixed precision compatibility
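Putting the three steps together, a minimal sketch of the Accelerator setup might look like this (assuming the collate pattern above, a `train_dataset` that already emits `bf16` features, and illustrative placeholder names for the model and optimizer):
```python
import torch
from torch.utils.data import DataLoader
from accelerate import Accelerator
from accelerate.utils import DataLoaderConfiguration

# Step 1: even_batches=False so Accelerate accepts the batch_size=1 loaders.
accelerator = Accelerator(
    dataloader_config=DataLoaderConfiguration(even_batches=False),
    mixed_precision='bf16',  # model parameters are cast to bf16 on TPU
)

# Placeholder model/optimizer so the sketch is self-contained.
model = torch.nn.Linear(7168, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Step 2: batch_size=1 plus a collate_fn that unwraps the pre-built batch.
train_loader = DataLoader(train_dataset, batch_size=1,
                          collate_fn=lambda items: items[0])

# Step 3 lives in the dataset: input_features are already torch.bfloat16
# (dataset.py:130), so they match the bf16 parameters in the XLA graph.
model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)
```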
### Files Modified
- [rnn_trainer.py:44-46](f:\BRAIN-TO-TEXT\nejm-brain-to-text.worktrees\dev2\model_training_nnn\rnn_trainer.py#L44-L46): Added DataLoaderConfiguration
- [rnn_trainer.py:193-210](f:\BRAIN-TO-TEXT\nejm-brain-to-text.worktrees\dev2\model_training_nnn\rnn_trainer.py#L193-L210): Custom collate_fn and batch_size=1
- [dataset.py:130](f:\BRAIN-TO-TEXT\nejm-brain-to-text.worktrees\dev2\model_training_nnn\dataset.py#L130): Convert neural data to bf16
### Next Steps
1. ~~Implement even_batches=False~~ ✅ DONE
2. ~~Fix batch_sampler None issue~~ ✅ DONE
3. Test TPU training with complete solution
4. Integrate final solution into CLAUDE.md
3. ~~Fix data type mismatch~~ ✅ DONE
4. Test TPU training with complete solution
5. Integrate final solution into CLAUDE.md
## Lessons Learned
- Don't overcomplicate TPU conversion - it should be straightforward


@@ -127,7 +127,7 @@ class BrainToTextDataset(Dataset):
g = f[f'trial_{t:04d}']
# Remove features if necessary
input_features = torch.from_numpy(g['input_features'][:]) # neural data
input_features = torch.from_numpy(g['input_features'][:]).to(torch.bfloat16) # neural data - convert to bf16 for TPU compatibility
if self.feature_subset:
input_features = input_features[:,self.feature_subset]