Zchen
2025-10-12 22:14:17 +08:00
parent 0cbb83e052
commit cf1d2b0801
2 changed files with 28 additions and 4 deletions

@@ -226,6 +226,31 @@ All TPU training issues have been systematically identified and fixed:
**Ready for TPU training test** with the 687M-parameter brain-to-text model.
---
## New Issue: TPU Memory Exhaustion (2025-10-12 15:00)
```
RuntimeError: Bad StatusOr access: RESOURCE_EXHAUSTED: Error allocating device buffer: Attempting to allocate 3.50M. That was not possible. There are 2.07M free.; (0x0x0_HBM0)
```
**Root Cause**: TPU HBM memory fragmentation with batch_size=64
- Input tensor per batch: 64 samples × (512 features × 14 patches) × 2 bytes ≈ 917 KB (checked in the sketch below)
- Combined with the 687M model parameters, gradients, and activations, this exhausts HBM
- TPU memory allocation is stricter than on GPU and requires contiguous blocks
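The arithmetic above can be sanity-checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes 2-byte (bfloat16/float16) elements and covers just the raw input tensor, not parameters, gradients, or activations, which dominate actual HBM use.
```python
# Back-of-the-envelope check of the input-batch sizes quoted above.
# Assumes 2-byte (bfloat16/float16) elements; constants mirror the text.
FEATURES = 512          # features per patch
PATCHES = 14            # patches per sample
BYTES_PER_ELEMENT = 2   # bfloat16 / float16

def input_batch_bytes(batch_size: int) -> int:
    return batch_size * FEATURES * PATCHES * BYTES_PER_ELEMENT

for bs in (64, 32):
    b = input_batch_bytes(bs)
    print(f"batch_size={bs}: {b:,} bytes (~{b // 1000} KB)")
# batch_size=64: 917,504 bytes (~917 KB)
# batch_size=32: 458,752 bytes (~458 KB)
```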
**Solution**: Memory-optimized configuration
```yaml
# rnn_args.yaml optimizations:
batch_size: 32 # reduced from 64
gradient_accumulation_steps: 2 # maintains effective batch size of 64
num_dataloader_workers: 0 # TPU compatibility
```
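For reference, a minimal sketch of how gradient accumulation keeps the effective batch size at 64 with `batch_size: 32`. Here `model`, `optimizer`, and `train_loader` are placeholders for the actual training objects; this is not the repository's training loop, just the general pattern the config relies on.
```python
def train_with_accumulation(model, optimizer, train_loader, accum_steps: int = 2):
    """Accumulate gradients over `accum_steps` micro-batches of 32 samples,
    so each optimizer update still reflects 2 × 32 = 64 samples."""
    optimizer.zero_grad()
    for step, (features, targets) in enumerate(train_loader):
        loss = model(features, targets)        # forward pass on a micro-batch of 32
        (loss / accum_steps).backward()        # scale so accumulated grads match batch_size=64
        if (step + 1) % accum_steps == 0:
            optimizer.step()                   # one update per accumulated 64 samples
            optimizer.zero_grad()
```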
**Memory Calculation**:
- New batch memory: 32 × 7168 × 2 bytes = ~458KB (50% reduction)
- Gradient accumulation maintains training stability
- Effective batch size unchanged: 2 steps × 32 = 64 samples
## Lessons Learned
- **Root Cause**: The TPU XLA compiler requires strict dtype consistency across all tensors
- **Key Insight**: `torch.eye()` and `torch.zeros()` default to float32; the dtype must be specified explicitly
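As an illustration of the fix, a short sketch of creating such tensors with an explicit dtype. The sizes are arbitrary, the bfloat16 dtype is an assumption about the training setup, and `xm.xla_device()` assumes `torch_xla` is installed.
```python
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()   # TPU device via torch_xla
dtype = torch.bfloat16     # assumed training dtype; match the model's dtype

# Without an explicit dtype these default to float32 and violate the
# XLA dtype-consistency requirement noted above.
identity = torch.eye(512, dtype=dtype, device=device)
state = torch.zeros(32, 512, dtype=dtype, device=device)
```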