Zchen
2025-10-12 22:14:17 +08:00
parent 0cbb83e052
commit cf1d2b0801
2 changed files with 28 additions and 4 deletions

@@ -226,6 +226,31 @@ All TPU training issues have been systematically identified and fixed:
**Ready for TPU training test** with the 687M-parameter brain-to-text model.
---
## New Issue: TPU Memory Exhaustion (2025-10-12 15:00)
```
RuntimeError: Bad StatusOr access: RESOURCE_EXHAUSTED: Error allocating device buffer: Attempting to allocate 3.50M. That was not possible. There are 2.07M free.; (0x0x0_HBM0)
```
**Root Cause**: TPU HBM memory fragmentation with batch_size=64
- Input tensor per batch: 64 samples × (512 features × 14 patches) × 2 bytes ≈ 917 KB (checked in the sketch below)
- Combined with the 687M model parameters, gradients, and activations, this exhausts HBM
- TPU memory allocation is stricter than on GPU and requires contiguous blocks
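The arithmetic above can be sanity-checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes 2-byte (bfloat16/float16) elements and covers just the raw input tensor, not parameters, gradients, or activations, which dominate actual HBM use.
```python
# Back-of-the-envelope check of the input-batch sizes quoted above.
# Assumes 2-byte (bfloat16/float16) elements; constants mirror the text.
FEATURES = 512          # features per patch
PATCHES = 14            # patches per sample
BYTES_PER_ELEMENT = 2   # bfloat16 / float16

def input_batch_bytes(batch_size: int) -> int:
    return batch_size * FEATURES * PATCHES * BYTES_PER_ELEMENT

for bs in (64, 32):
    b = input_batch_bytes(bs)
    print(f"batch_size={bs}: {b:,} bytes (~{b // 1000} KB)")
# batch_size=64: 917,504 bytes (~917 KB)
# batch_size=32: 458,752 bytes (~458 KB)
```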
**Solution**: Memory-optimized configuration
```yaml
# rnn_args.yaml optimizations:
batch_size: 32 # reduced from 64
gradient_accumulation_steps: 2 # maintains effective batch size of 64
num_dataloader_workers: 0 # TPU compatibility
```
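For reference, a minimal sketch of how gradient accumulation keeps the effective batch size at 64 with `batch_size: 32`. Here `model`, `optimizer`, and `train_loader` are placeholders for the actual training objects; this is not the repository's training loop, just the general pattern the config relies on.
```python
def train_with_accumulation(model, optimizer, train_loader, accum_steps: int = 2):
    """Accumulate gradients over `accum_steps` micro-batches of 32 samples,
    so each optimizer update still reflects 2 × 32 = 64 samples."""
    optimizer.zero_grad()
    for step, (features, targets) in enumerate(train_loader):
        loss = model(features, targets)        # forward pass on a micro-batch of 32
        (loss / accum_steps).backward()        # scale so accumulated grads match batch_size=64
        if (step + 1) % accum_steps == 0:
            optimizer.step()                   # one update per accumulated 64 samples
            optimizer.zero_grad()
```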
**Memory Calculation**:
- New batch memory: 32 × 7168 × 2 bytes = ~458KB (50% reduction)
- Gradient accumulation maintains training stability
- Effective batch size unchanged: 2 steps × 32 = 64 samples
## Lessons Learned
- **Root Cause**: The TPU XLA compiler requires strict dtype consistency across all tensors
- **Key Insight**: `torch.eye()` and `torch.zeros()` default to float32; the dtype must be specified explicitly
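As an illustration of the fix, a short sketch of creating such tensors with an explicit dtype. The sizes are arbitrary, the bfloat16 dtype is an assumption about the training setup, and `xm.xla_device()` assumes `torch_xla` is installed.
```python
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()   # TPU device via torch_xla
dtype = torch.bfloat16     # assumed training dtype; match the model's dtype

# Without an explicit dtype these default to float32 and violate the
# XLA dtype-consistency requirement noted above.
identity = torch.eye(512, dtype=dtype, device=device)
state = torch.zeros(32, 512, dtype=dtype, device=device)
```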