tpu
@@ -226,6 +226,31 @@ All TPU training issues have been systematically identified and fixed:
**Ready for TPU training test** with 687M parameter brain-to-text model.
---
## New Issue: TPU Memory Exhaustion (2025-10-12 15:00)
```
RuntimeError: Bad StatusOr access: RESOURCE_EXHAUSTED: Error allocating device buffer: Attempting to allocate 3.50M. That was not possible. There are 2.07M free.; (0x0x0_HBM0)
```
**Root Cause**: TPU HBM memory fragmentation with batch_size=64
- Single batch: 64 × (512 features × 14 patches) × 2 bytes ≈ 917 KB
- Combined with 687M model parameters + gradients + activations → memory exhaustion
- TPU memory allocation is stricter than on GPU and requires contiguous HBM blocks (see the sanity check below)
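As a quick sanity check, the per-batch input sizes quoted above can be reproduced in a few lines of Python (the 512 × 14 shape and the 2-byte element size are taken from the bullets above; the helper name is illustrative):

```python
FEATURES = 512        # input features per patch
PATCHES = 14          # patches per sample
BYTES_PER_ELEM = 2    # bfloat16 / float16

def batch_input_bytes(batch_size: int) -> int:
    """Bytes needed for one batch of (batch, patches, features) inputs."""
    return batch_size * PATCHES * FEATURES * BYTES_PER_ELEM

print(batch_input_bytes(64) / 1e3)  # ~917.5 KB, matching the figure above
print(batch_input_bytes(32) / 1e3)  # ~458.8 KB, the ~50% reduction used below
```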
**Solution**: Memory-optimized configuration
```yaml
# rnn_args.yaml optimizations:
batch_size: 32 # reduced from 64
gradient_accumulation_steps: 2 # maintains effective batch size of 64
num_dataloader_workers: 0 # TPU compatibility
```
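How the two settings interact in the training loop, as a minimal sketch (assumes a standard PyTorch/XLA setup; `model`, `train_loader`, `optimizer`, and `loss_fn` are placeholders, not the actual script): the loss is scaled by the number of accumulation steps and the optimizer only steps every second micro-batch, so only 32 samples live in HBM at a time while the gradient statistics match a batch of 64.

```python
import torch_xla.core.xla_model as xm

GRAD_ACCUM_STEPS = 2   # from rnn_args.yaml: 2 micro-batches of 32 ≈ one batch of 64

def train_epoch(model, train_loader, optimizer, loss_fn):
    """One epoch with gradient accumulation (names are placeholders)."""
    device = xm.xla_device()
    model.train()
    optimizer.zero_grad()

    for step, (features, targets) in enumerate(train_loader):       # batches of 32
        features, targets = features.to(device), targets.to(device)
        loss = loss_fn(model(features), targets) / GRAD_ACCUM_STEPS  # scale to match batch 64
        loss.backward()                                              # gradients accumulate in .grad

        if (step + 1) % GRAD_ACCUM_STEPS == 0:
            xm.optimizer_step(optimizer)   # all-reduce (if replicated) + optimizer.step()
            optimizer.zero_grad()
```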
**Memory Calculation**:
- New batch memory: 32 × 7168 × 2 bytes = ~458KB (50% reduction)
- Gradient accumulation maintains training stability
- Effective batch size unchanged: 2 steps × 32 = 64 samples
## Lessons Learned
- **Root Cause**: TPU XLA compiler requires strict dtype consistency across all tensors
- **Key Insight**: `torch.eye()` and `torch.zeros()` default to float32; the dtype must be specified explicitly (minimal illustration below)
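A minimal illustration of the pitfall (tensor names and shapes here are hypothetical): pass `dtype=` explicitly, or inherit the dtype from an existing tensor, so no float32 constants leak into a bfloat16 graph.

```python
import torch

hidden = torch.zeros(4, 256, dtype=torch.bfloat16)   # e.g. a bf16 hidden state

# Pitfall: these default to float32 and break XLA dtype consistency.
identity_bad = torch.eye(256)
state_bad = torch.zeros(4, 256)

# Fix: specify the dtype explicitly, or take it from an existing tensor.
identity_ok = torch.eye(256, dtype=torch.bfloat16)
state_ok = torch.zeros(4, 256, dtype=hidden.dtype)
# torch.zeros_like(hidden) also copies dtype (and device) automatically.
```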