# TPU Training Issues Record
## Core Problem
**Primary Error**: `ValueError: You need to use 'even_batches=False' when the batch sampler has no batch size`
This error occurs when using TPUs with the Hugging Face Accelerate framework and custom DataLoaders created with `batch_size=None`.
## Root Cause Analysis
1. Our custom dataset returns full batches (not individual samples)
2. DataLoader is created with `batch_size=None` because batching is handled by the dataset
3. TPU training with Accelerate requires `even_batches=False` for this configuration
4. The `even_batches` parameter must be supplied through a `DataLoaderConfiguration` object, not as a direct keyword argument to `Accelerator.__init__()` (see the sketch below and the Correct Solution section)
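To make the failure mode concrete, the sketch below shows the kind of setup that triggers the error. The class and variable names (`BatchReturningDataset`, `train_loader`) are illustrative placeholders, not the exact code in `dataset.py`:
```python
import torch
from torch.utils.data import Dataset, DataLoader

class BatchReturningDataset(Dataset):
    """Hypothetical stand-in: each __getitem__ already returns a full batch."""
    def __init__(self, num_batches, batch_size, feature_dim):
        self.num_batches = num_batches
        self.batch_size = batch_size
        self.feature_dim = feature_dim

    def __len__(self):
        return self.num_batches

    def __getitem__(self, idx):
        # Returns a (batch_size, feature_dim) tensor instead of a single sample
        return torch.randn(self.batch_size, self.feature_dim)

# batch_size=None tells the DataLoader not to add another batch dimension,
# because batching is already handled inside the dataset
train_loader = DataLoader(BatchReturningDataset(100, 64, 512), batch_size=None)

# On TPU, accelerator.prepare(train_loader) raises
# "ValueError: You need to use 'even_batches=False' ..." unless the
# Accelerator was built with DataLoaderConfiguration(even_batches=False)
```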
## Failed Solution Attempts
### Attempt 1: Adding even_batches to Accelerator.__init__()
```python
self.accelerator = Accelerator(
    mixed_precision='bf16',
    gradient_accumulation_steps=1,
    even_batches=False  # ❌ WRONG - This parameter doesn't exist in Accelerator.__init__()
)
```
**Error**: `TypeError: Accelerator.__init__() got an unexpected keyword argument 'even_batches'`
### Attempt 2: Complex TPU-specific DataLoader handling
- Created conditional TPU/GPU logic
- Manual data movement with `to(device)`
- Custom collate_fn modifications
- Result: overengineered solution that didn't address the root cause (a sketch of the abandoned pattern is shown below)
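For reference, the abandoned approach looked roughly like the sketch below. This is a reconstruction for illustration only; the actual removed code in `rnn_trainer.py` may have differed:
```python
import torch

def move_batch(batch, device, is_tpu):
    """Abandoned pattern: branch on TPU vs. GPU and move tensors by hand."""
    if is_tpu:
        # Manual per-tensor movement instead of letting accelerator.prepare()
        # wrap the DataLoader and place data on the XLA device
        return {k: v.to(device) for k, v in batch.items() if torch.is_tensor(v)}
    return {k: v.to(device, non_blocking=True) for k, v in batch.items() if torch.is_tensor(v)}
```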
### Attempt 3: Memory optimization
- Reduced TPU cores from 8 to 2
- Reduced batch size
- Misunderstood TPU memory allocation (fewer cores = less total memory, not more per core)
### Attempt 4: Removing all TPU-specific logic
- Let Accelerator handle everything automatically
- Result: Same even_batches error returned
## Correct Solution
The `even_batches=False` parameter should be passed using `DataLoaderConfiguration` when initializing the Accelerator:
```python
from accelerate import Accelerator, DataLoaderConfiguration

# Configure DataLoader behavior for TPU
dataloader_config = DataLoaderConfiguration(
    even_batches=False  # Required for batch_size=None DataLoaders
)

self.accelerator = Accelerator(
    mixed_precision='bf16' if args.get('use_amp', True) else 'no',
    gradient_accumulation_steps=args.get('gradient_accumulation_steps', 1),
    log_with=None,
    project_dir=args.get('output_dir', './output'),
    dataloader_config=dataloader_config  # ✅ CORRECT - Pass DataLoaderConfiguration
)
```
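With that configuration in place, the `batch_size=None` DataLoaders go through `prepare()` as usual and no TPU-specific branching is needed in the training loop. The variable names below (`model`, `optimizer`, `train_loader`, `val_loader`) are placeholders for the trainer's actual attributes:
```python
# even_batches=False applies to every DataLoader prepared by this accelerator
model, optimizer, train_loader, val_loader = self.accelerator.prepare(
    model, optimizer, train_loader, val_loader
)

for batch in train_loader:
    with self.accelerator.accumulate(model):
        loss = model(batch)  # placeholder forward pass returning a scalar loss
        self.accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
```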
## Technical Context
- **Model**: Brain-to-text RNN with 687M parameters
- **Dataset**: Custom dataset that returns full batches (batch_size=None in DataLoader)
- **TPU Config**: 8 cores × 16GB = 128GB total memory
- **Batch Size**: 64
- **Framework**: PyTorch XLA with Hugging Face Accelerate
## Key Files Modified
- `model_training_nnn/rnn_trainer.py` - Main trainer class
- `model_training_nnn/rnn_args.yaml` - Configuration file
- `model_training_nnn/dataset.py` - Custom dataset class
## Memory Allocation Facts
- TPU v5e-8: 8 cores × 16GB = 128GB total
- Fewer cores = LESS total memory (not more per core)
## Next Steps
1. Implement `even_batches=False` via `DataLoaderConfiguration` in the Accelerator initialization (as shown in the Correct Solution above)
2. Test TPU training without overengineering
3. Verify memory usage with the 8-core configuration (see the sketch below)
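For step 3, per-core memory can be checked from inside the training process with `torch_xla`. This is a minimal sketch; the exact keys returned by `get_memory_info` depend on the installed `torch_xla` version, so the dict is printed as-is:
```python
import torch_xla.core.xla_model as xm

def log_tpu_memory(step):
    """Print the XLA runtime's memory counters for the current core."""
    device = xm.xla_device()
    info = xm.get_memory_info(device)  # dict of memory counters for this device
    # Only the master ordinal prints, to avoid 8 interleaved copies of the log
    xm.master_print(f"step {step}: {info}")
```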
## Lessons Learned
- Don't overcomplicate TPU conversion - it should be straightforward
- Read Accelerate documentation carefully for parameter placement
- Document issues immediately to avoid confusion
- TPU memory allocation: fewer cores = less total memory