# TPU Training Issues Record

## Core Problem

**Primary Error**: `ValueError: You need to use 'even_batches=False' when the batch sampler has no batch size`

This error occurs when using TPU with the Hugging Face Accelerate framework and custom DataLoaders that have `batch_size=None`.

## Root Cause Analysis

1. Our custom dataset returns full batches (not individual samples)
2. The DataLoader is created with `batch_size=None` because batching is handled by the dataset
3. TPU training with Accelerate requires `even_batches=False` for this configuration
4. The `even_batches` flag is not a direct `Accelerator.__init__()` keyword; it must be supplied through a `DataLoaderConfiguration`

## Failed Solution Attempts

### Attempt 1: Adding even_batches to Accelerator.__init__()

```python
self.accelerator = Accelerator(
    mixed_precision='bf16',
    gradient_accumulation_steps=1,
    even_batches=False  # ❌ WRONG - This parameter doesn't exist in Accelerator.__init__()
)
```

**Error**: `TypeError: Accelerator.__init__() got an unexpected keyword argument 'even_batches'`

### Attempt 2: Complex TPU-specific DataLoader handling

- Created conditional TPU/GPU logic
- Manual data movement with `to(device)`
- Custom collate_fn modifications
- Result: Overengineered solution that didn't address the root cause

### Attempt 3: Memory optimization

- Reduced TPU cores from 8 to 2
- Reduced batch size
- Misunderstood TPU memory allocation (fewer cores = less total memory, not more per core)

### Attempt 4: Removing all TPU-specific logic

- Let Accelerator handle everything automatically
- Result: The same even_batches error returned

## Correct Solution

The `even_batches=False` parameter should be passed via `DataLoaderConfiguration` when initializing the Accelerator:

```python
from accelerate import Accelerator, DataLoaderConfiguration

# Configure DataLoader behavior for TPU
dataloader_config = DataLoaderConfiguration(
    even_batches=False  # Required for batch_size=None DataLoaders
)

self.accelerator = Accelerator(
    mixed_precision='bf16' if args.get('use_amp', True) else 'no',
    gradient_accumulation_steps=args.get('gradient_accumulation_steps', 1),
    log_with=None,
    project_dir=args.get('output_dir', './output'),
    dataloader_config=dataloader_config  # ✅ CORRECT - Pass DataLoaderConfiguration
)
```

## Technical Context

- **Model**: Brain-to-text RNN with 687M parameters
- **Dataset**: Custom dataset that returns full batches (`batch_size=None` in the DataLoader)
- **TPU Config**: 8 cores × 16GB = 128GB total memory
- **Batch Size**: 64
- **Framework**: PyTorch XLA with Hugging Face Accelerate

## Key Files Modified

- `model_training_nnn/rnn_trainer.py` - Main trainer class
- `model_training_nnn/rnn_args.yaml` - Configuration file
- `model_training_nnn/dataset.py` - Custom dataset class

## Memory Allocation Facts

- TPU v5e-8: 8 cores × 16GB = 128GB total
- Fewer cores = LESS total memory (not more per core)

## Next Steps

1. Implement `even_batches=False` via `DataLoaderConfiguration` in the Accelerator setup
2. Test TPU training without overengineering
3. Verify memory usage with the 8-core configuration

## Lessons Learned

- Don't overcomplicate the TPU conversion - it should be straightforward
- Read the Accelerate documentation carefully for parameter placement
- Document issues immediately to avoid confusion
- TPU memory allocation: fewer cores = less total memory
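
## Minimal Configuration Sketch

For reference, a minimal, self-contained sketch of how the pieces above fit together: a dataset that yields full batches, a `batch_size=None` DataLoader, and an Accelerator configured with `DataLoaderConfiguration(even_batches=False)`. The `PreBatchedDataset` class, tensor shapes, and toy model are illustrative stand-ins, not the actual project code, and mixed precision is omitted to keep the focus on the DataLoader configuration.

```python
import torch
from torch.utils.data import DataLoader, Dataset
from accelerate import Accelerator, DataLoaderConfiguration


class PreBatchedDataset(Dataset):
    """Toy stand-in for the custom dataset: __getitem__ returns a full batch."""

    def __init__(self, num_batches: int = 10, batch_size: int = 64, feature_dim: int = 512):
        self.num_batches = num_batches
        self.batch_size = batch_size
        self.feature_dim = feature_dim

    def __len__(self) -> int:
        return self.num_batches

    def __getitem__(self, idx: int) -> torch.Tensor:
        # One item is already an entire batch, so the DataLoader must not re-batch.
        return torch.randn(self.batch_size, self.feature_dim)


# batch_size=None disables the DataLoader's automatic batching; the dataset
# alone decides the batch boundaries.
loader = DataLoader(PreBatchedDataset(), batch_size=None, shuffle=True)

# even_batches=False is needed because the prepared loader has no batch sampler
# with a fixed batch size that Accelerate could pad or duplicate across processes.
accelerator = Accelerator(
    dataloader_config=DataLoaderConfiguration(even_batches=False),
)

model = torch.nn.Linear(512, 128)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for batch in loader:
    loss = model(batch).mean()  # placeholder loss, just to exercise the loop
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()
```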