- But Accelerate expects a proper batch_sampler when iterating
- This is a fundamental incompatibility between our batching approach and Accelerate's expectations

## COMPREHENSIVE SOLUTION ✅

### Problem Resolution Status
1. ~~even_batches Error~~ ✅ RESOLVED with DataLoaderConfiguration
2. ~~batch_sampler None Error~~ ✅ RESOLVED with custom collate_fn
3. ~~Data Type Mismatch Error~~ ✅ RESOLVED with bf16 conversion in dataset

### Latest Error (2025-10-12 13:38)

```
INVALID_ARGUMENT: Call parameter must match argument; got parameter 0 shape: f32[64,7168], argument shape: bf16[64,7168].
```

**Root Cause**: Mixed precision training with `mixed_precision='bf16'` expects all tensors to be `bf16`, but our data is being loaded as `f32` (float32).

**Analysis**:
- We enabled `bf16` mixed precision in the Accelerator configuration (sketched below)
- Model parameters are automatically converted to `bf16`
- But the input data remains `f32`, causing a type mismatch during the forward pass
- The TPU XLA compiler is strict about type matching
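
For reference, a minimal sketch of the Accelerator setup this implies (assuming a recent `accelerate` release that exposes `DataLoaderConfiguration`; variable names are illustrative):

```python
from accelerate import Accelerator
from accelerate.utils import DataLoaderConfiguration

# even_batches=False is needed because our DataLoaders use batch_size=1
# with data that is already batched inside the dataset.
dataloader_config = DataLoaderConfiguration(even_batches=False)

accelerator = Accelerator(
    dataloader_config=dataloader_config,
    mixed_precision='bf16',  # model parameters become bf16, so inputs must be bf16 too
)
```
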
### Solution: Data Type Conversion in Dataset

Fixed in `dataset.py:130` by converting neural data to `bf16`:

```python
# Before (causes type mismatch):
input_features = torch.from_numpy(g['input_features'][:])  # defaults to f32

# After (TPU compatible):
input_features = torch.from_numpy(g['input_features'][:]).to(torch.bfloat16)  # convert to bf16 for TPU compatibility
```

### Final Implementation

**Key Insight**: Our dataset's `__getitem__()` returns complete batches, but Accelerate expects individual samples. The solution is to use `batch_size=1` and a custom `collate_fn` that unwraps the pre-batched data.
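
A minimal sketch of this DataLoader wiring (the helper names `unwrap_collate` and `build_prebatched_loader` are hypothetical; the actual code lives in `rnn_trainer.py:193-210`):

```python
from torch.utils.data import DataLoader, Dataset


def unwrap_collate(batch):
    # The dataset's __getitem__ already returns a complete batch, so the
    # DataLoader hands us a list of length 1; return its only element.
    return batch[0]


def build_prebatched_loader(dataset: Dataset, accelerator):
    """Wrap a dataset that yields whole batches so Accelerate can prepare it."""
    loader = DataLoader(
        dataset,
        batch_size=1,              # one "item" is one pre-built batch
        shuffle=False,             # ordering/batching is handled inside the dataset
        collate_fn=unwrap_collate,
    )
    return accelerator.prepare(loader)
```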

## Complete Solution Summary

### Three-Step Fix for TPU Training
1. **DataLoaderConfiguration**: Added `even_batches=False` for batch_size=1 DataLoaders
2. **Custom collate_fn**: Handles pre-batched data from our dataset
3. **Data Type Conversion**: Convert input data to `bf16` for mixed precision compatibility
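
Putting the three fixes together, a hedged sketch of the resulting training loop (`compute_loss` and any batch keys other than `input_features` are hypothetical, not the actual `rnn_trainer.py` code):

```python
def training_loop(accelerator, model, optimizer, train_loader, compute_loss, num_epochs):
    # Prepare once; even_batches=False, the unwrap collate_fn, and the bf16
    # dataset conversion are all assumed to be in place at this point.
    model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)
    for _ in range(num_epochs):
        for batch in train_loader:              # collate_fn already unwrapped the pre-built batch
            features = batch['input_features']  # bf16, matching the bf16 model parameters
            loss = compute_loss(model, features, batch)  # hypothetical loss helper
            accelerator.backward(loss)          # let Accelerate handle the mixed-precision backward
            optimizer.step()
            optimizer.zero_grad()
```
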
### Files Modified
- [rnn_trainer.py:44-46](f:\BRAIN-TO-TEXT\nejm-brain-to-text.worktrees\dev2\model_training_nnn\rnn_trainer.py#L44-L46): Added DataLoaderConfiguration
- [rnn_trainer.py:193-210](f:\BRAIN-TO-TEXT\nejm-brain-to-text.worktrees\dev2\model_training_nnn\rnn_trainer.py#L193-L210): Custom collate_fn and batch_size=1
- [dataset.py:130](f:\BRAIN-TO-TEXT\nejm-brain-to-text.worktrees\dev2\model_training_nnn\dataset.py#L130): Convert neural data to bf16

### Next Steps
1. ~~Implement even_batches=False~~ ✅ DONE
2. ~~Fix batch_sampler None issue~~ ✅ DONE
3. ~~Fix data type mismatch~~ ✅ DONE
4. Test TPU training with complete solution
5. Integrate final solution into CLAUDE.md

## Lessons Learned
- Don't overcomplicate TPU conversion - it should be straightforward

The corresponding change in `dataset.py` (the `dataset.py:130` fix referenced above):

```diff
@@ -127,7 +127,7 @@ class BrainToTextDataset(Dataset):
 g = f[f'trial_{t:04d}']

 # Remove features if necessary
-input_features = torch.from_numpy(g['input_features'][:]) # neural data
+input_features = torch.from_numpy(g['input_features'][:]).to(torch.bfloat16) # neural data - convert to bf16 for TPU compatibility
 if self.feature_subset:
     input_features = input_features[:,self.feature_subset]
```