# TPU Training Issues Record
## Core Problem
**Primary Error**: `ValueError: You need to use 'even_batches=False' when the batch sampler has no batch size`
This error occurs when using TPUs with the Hugging Face Accelerate framework and custom DataLoaders created with `batch_size=None`.
## Root Cause Analysis
1. Our custom dataset returns full batches (not individual samples)
2. DataLoader is created with `batch_size=None` because batching is handled by the dataset
3. TPU training with Accelerate requires `even_batches=False` for this configuration
4. The `even_batches` parameter must be supplied through a `DataLoaderConfiguration` object, not as a direct keyword argument to `Accelerator.__init__()` (see the sketch below and the Correct Solution section)
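To make the failure mode concrete, the sketch below shows the kind of setup that triggers the error. The class and variable names (`BatchReturningDataset`, `train_loader`) are illustrative placeholders, not the exact code in `dataset.py`:
```python
import torch
from torch.utils.data import Dataset, DataLoader

class BatchReturningDataset(Dataset):
    """Hypothetical stand-in: each __getitem__ already returns a full batch."""
    def __init__(self, num_batches, batch_size, feature_dim):
        self.num_batches = num_batches
        self.batch_size = batch_size
        self.feature_dim = feature_dim

    def __len__(self):
        return self.num_batches

    def __getitem__(self, idx):
        # Returns a (batch_size, feature_dim) tensor instead of a single sample
        return torch.randn(self.batch_size, self.feature_dim)

# batch_size=None tells the DataLoader not to add another batch dimension,
# because batching is already handled inside the dataset
train_loader = DataLoader(BatchReturningDataset(100, 64, 512), batch_size=None)

# On TPU, accelerator.prepare(train_loader) raises
# "ValueError: You need to use 'even_batches=False' ..." unless the
# Accelerator was built with DataLoaderConfiguration(even_batches=False)
```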
## Failed Solution Attempts
### Attempt 1: Adding even_batches to Accelerator.__init__()
```python
self.accelerator = Accelerator(
    mixed_precision='bf16',
    gradient_accumulation_steps=1,
    even_batches=False  # ❌ WRONG - This parameter doesn't exist in Accelerator.__init__()
)
```
**Error**: `TypeError: Accelerator.__init__() got an unexpected keyword argument 'even_batches'`
### Attempt 2: Complex TPU-specific DataLoader handling
- Created conditional TPU/GPU logic
- Manual data movement with `to(device)`
- Custom collate_fn modifications
- Result: overengineered solution that didn't address the root cause (a sketch of the abandoned pattern is shown below)
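For reference, the abandoned approach looked roughly like the sketch below. This is a reconstruction for illustration only; the actual removed code in `rnn_trainer.py` may have differed:
```python
import torch

def move_batch(batch, device, is_tpu):
    """Abandoned pattern: branch on TPU vs. GPU and move tensors by hand."""
    if is_tpu:
        # Manual per-tensor movement instead of letting accelerator.prepare()
        # wrap the DataLoader and place data on the XLA device
        return {k: v.to(device) for k, v in batch.items() if torch.is_tensor(v)}
    return {k: v.to(device, non_blocking=True) for k, v in batch.items() if torch.is_tensor(v)}
```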
### Attempt 3: Memory optimization
- Reduced TPU cores from 8 to 2
- Reduced batch size
- Misunderstood TPU memory allocation (fewer cores = less total memory, not more per core)
### Attempt 4: Removing all TPU-specific logic
- Let Accelerator handle everything automatically
- Result: Same even_batches error returned
## Correct Solution
The `even_batches=False` parameter should be passed using `DataLoaderConfiguration` when initializing the Accelerator:
```python
from accelerate import Accelerator, DataLoaderConfiguration

# Configure DataLoader behavior for TPU
dataloader_config = DataLoaderConfiguration(
    even_batches=False  # Required for batch_size=None DataLoaders
)

self.accelerator = Accelerator(
    mixed_precision='bf16' if args.get('use_amp', True) else 'no',
    gradient_accumulation_steps=args.get('gradient_accumulation_steps', 1),
    log_with=None,
    project_dir=args.get('output_dir', './output'),
    dataloader_config=dataloader_config  # ✅ CORRECT - Pass DataLoaderConfiguration
)
```
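With that configuration in place, the `batch_size=None` DataLoaders go through `prepare()` as usual and no TPU-specific branching is needed in the training loop. The variable names below (`model`, `optimizer`, `train_loader`, `val_loader`) are placeholders for the trainer's actual attributes:
```python
# even_batches=False applies to every DataLoader prepared by this accelerator
model, optimizer, train_loader, val_loader = self.accelerator.prepare(
    model, optimizer, train_loader, val_loader
)

for batch in train_loader:
    with self.accelerator.accumulate(model):
        loss = model(batch)  # placeholder forward pass returning a scalar loss
        self.accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
```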
## Technical Context
- **Model**: Brain-to-text RNN with 687M parameters
- **Dataset**: Custom dataset that returns full batches (batch_size=None in DataLoader)
- **TPU Config**: 8 cores × 16GB = 128GB total memory
- **Batch Size**: 64
- **Framework**: PyTorch XLA with Hugging Face Accelerate
## Key Files Modified
- `model_training_nnn/rnn_trainer.py` - Main trainer class
- `model_training_nnn/rnn_args.yaml` - Configuration file
- `model_training_nnn/dataset.py` - Custom dataset class
## Memory Allocation Facts
- TPU v5e-8: 8 cores × 16GB = 128GB total
- Fewer cores = LESS total memory (not more per core)
## Next Steps
1. Implement `even_batches=False` via `DataLoaderConfiguration` in the Accelerator initialization (as shown in the Correct Solution above)
2. Test TPU training without overengineering
3. Verify memory usage with the 8-core configuration (see the sketch below)
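For step 3, per-core memory can be checked from inside the training process with `torch_xla`. This is a minimal sketch; the exact keys returned by `get_memory_info` depend on the installed `torch_xla` version, so the dict is printed as-is:
```python
import torch_xla.core.xla_model as xm

def log_tpu_memory(step):
    """Print the XLA runtime's memory counters for the current core."""
    device = xm.xla_device()
    info = xm.get_memory_info(device)  # dict of memory counters for this device
    # Only the master ordinal prints, to avoid 8 interleaved copies of the log
    xm.master_print(f"step {step}: {info}")
```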
## Lessons Learned
- Don't overcomplicate TPU conversion - it should be straightforward
- Read Accelerate documentation carefully for parameter placement
- Document issues immediately to avoid confusion
- TPU memory allocation: fewer cores = less total memory