- But Accelerate expects a proper batch_sampler when iterating
- This is a fundamental incompatibility between our batching approach and Accelerate's expectations

## COMPREHENSIVE SOLUTION ✅

### Problem Resolution Status
1. ~~even_batches Error~~ ✅ RESOLVED with DataLoaderConfiguration
2. ~~batch_sampler None Error~~ ✅ RESOLVED with custom collate_fn
3. ~~Data Type Mismatch Error~~ ✅ RESOLVED with bf16 conversion in dataset

### Latest Error (2025-10-12 13:38)

```
INVALID_ARGUMENT: Call parameter must match argument; got parameter 0 shape: f32[64,7168], argument shape: bf16[64,7168].
```

**Root Cause**: Mixed precision training with `mixed_precision='bf16'` expects all tensors to be `bf16`, but our data is being loaded as `f32` (float32).

**Analysis**:
- We enabled `bf16` mixed precision in the Accelerator configuration
- Model parameters are automatically converted to `bf16`
- But input data remains `f32`, causing a type mismatch during the forward pass
- The TPU XLA compiler is strict about type matching (sketched below)
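
A minimal sketch of how the mismatch arises, assuming the Accelerator is created with `mixed_precision='bf16'` as the analysis describes (the model and shapes are illustrative, not from the repo):

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator(mixed_precision='bf16')
model = accelerator.prepare(torch.nn.Linear(7168, 512))  # params end up bf16 on TPU

x = torch.zeros(64, 7168)  # inputs load as f32, like our HDF5 neural data
# On TPU, XLA compiles the forward pass with bf16 parameters, and the f32
# input then fails its strict type matching:
#   INVALID_ARGUMENT: ... parameter 0 shape: f32[64,7168], argument shape: bf16[64,7168]
x = x.to(torch.bfloat16)   # casting the input resolves the mismatch
```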

### Solution: Data Type Conversion in Dataset

Fixed in `dataset.py:130` by converting neural data to `bf16`:

```python
# Before (causes type mismatch):
input_features = torch.from_numpy(g['input_features'][:]) # defaults to f32

# After (TPU compatible):
input_features = torch.from_numpy(g['input_features'][:]).to(torch.bfloat16) # convert to bf16 for TPU compatibility
```
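
Note that `torch.from_numpy` preserves the NumPy dtype rather than defaulting anywhere; the tensor comes out `f32` because the HDF5 data is stored as float32. A quick check of both behaviors (shapes taken from the error above, otherwise illustrative):

```python
import numpy as np
import torch

x = torch.from_numpy(np.zeros((64, 7168), dtype=np.float32))
print(x.dtype)                     # torch.float32 (dtype carried over from NumPy)
print(x.to(torch.bfloat16).dtype)  # torch.bfloat16 (the explicit cast does the work)
```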

### Final Implementation
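
A runnable sketch of the pattern: `batch_size=1` plus a `collate_fn` that unwraps the pre-batched item. The stand-in dataset, names, and shapes are illustrative; the real version lives at rnn_trainer.py:193-210 (linked below):

```python
import torch
from torch.utils.data import DataLoader, Dataset

class PreBatchedDataset(Dataset):
    """Stand-in for BrainToTextDataset: __getitem__ returns a whole batch."""
    def __len__(self):
        return 4  # number of pre-built batches
    def __getitem__(self, idx):
        # One item is one complete batch, already bf16 (see the dataset.py fix)
        return {'input_features': torch.zeros(64, 7168, dtype=torch.bfloat16)}

def unwrap_collate(batch):
    # With batch_size=1, `batch` is a one-element list holding a full batch
    # dict; return it directly instead of letting default_collate stack a
    # spurious leading dimension.
    return batch[0]

train_loader = DataLoader(PreBatchedDataset(), batch_size=1, shuffle=False,
                          collate_fn=unwrap_collate)

for batch in train_loader:
    assert batch['input_features'].shape == (64, 7168)
```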

**Key Insight**: Our dataset's `__getitem__()` returns complete batches, but Accelerate expects individual samples. The solution is to use `batch_size=1` and a custom `collate_fn` that unwraps the pre-batched data.

## Complete Solution Summary

### Three-Step Fix for TPU Training
1. **DataLoaderConfiguration**: Added `even_batches=False` for batch_size=1 DataLoaders
2. **Custom collate_fn**: Handles pre-batched data from our dataset
3. **Data Type Conversion**: Convert input data to `bf16` for mixed precision compatibility (all three are combined in the sketch below)
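
Putting the three steps together, a condensed trainer-side sketch (names are assumed; the `dataloader_config` argument and `DataLoaderConfiguration` require a recent Accelerate version):

```python
from accelerate import Accelerator
from accelerate.utils import DataLoaderConfiguration
from torch.utils.data import DataLoader

# Step 1: even_batches=False so Accelerate accepts our batch_size=1 loaders
accelerator = Accelerator(
    mixed_precision='bf16',
    dataloader_config=DataLoaderConfiguration(even_batches=False),
)

# Step 2: unwrap the pre-batched item; step 3 (the bf16 cast) lives in dataset.py
train_loader = DataLoader(
    train_dataset,  # a BrainToTextDataset instance (assumed name)
    batch_size=1,
    shuffle=False,
    collate_fn=lambda batch: batch[0],
)
train_loader = accelerator.prepare(train_loader)
```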

### Files Modified
- [rnn_trainer.py:44-46](f:\BRAIN-TO-TEXT\nejm-brain-to-text.worktrees\dev2\model_training_nnn\rnn_trainer.py#L44-L46): Added DataLoaderConfiguration
- [rnn_trainer.py:193-210](f:\BRAIN-TO-TEXT\nejm-brain-to-text.worktrees\dev2\model_training_nnn\rnn_trainer.py#L193-L210): Custom collate_fn and batch_size=1
- [dataset.py:130](f:\BRAIN-TO-TEXT\nejm-brain-to-text.worktrees\dev2\model_training_nnn\dataset.py#L130): Convert neural data to bf16

### Next Steps
1. ~~Implement even_batches=False~~ ✅ DONE
2. ~~Fix batch_sampler None issue~~ ✅ DONE
3. ~~Fix data type mismatch~~ ✅ DONE
4. Test TPU training with complete solution
5. Integrate final solution into CLAUDE.md

## Lessons Learned
- Don't overcomplicate TPU conversion - it should be straightforward

The corresponding change in `dataset.py`:

```diff
@@ -126,8 +126,8 @@ class BrainToTextDataset(Dataset):
         try:
             g = f[f'trial_{t:04d}']

-            # Remove features if necessary
-            input_features = torch.from_numpy(g['input_features'][:]) # neural data
+            # Remove features if necessary
+            input_features = torch.from_numpy(g['input_features'][:]).to(torch.bfloat16) # neural data - convert to bf16 for TPU compatibility
             if self.feature_subset:
                 input_features = input_features[:,self.feature_subset]
```