fixed: tf call cuda

Zchen
2025-10-15 23:37:24 +08:00
parent 01024678c1
commit f9d3f47d20
2 changed files with 149 additions and 1 deletions

CLAUDE.md

@@ -449,5 +449,145 @@ The deprecated APIs still work but generate warnings. For production code:
- Test thoroughly as synchronization behavior may differ slightly
- Legacy code will continue to function until removed in future versions
## TensorFlow TPU Implementation
The original PyTorch implementation has been converted to TensorFlow for optimal performance in TPU v5e-8 environments, particularly for the Brain-to-Text '25 Competition on Kaggle.
### Key TensorFlow Components (`model_training_nnn_tpu/`)
#### Core Files
- **`rnn_model_tf.py`**: TensorFlow implementation of TripleGRUDecoder architecture
- `NoiseModel`: 2-layer GRU for noise estimation with day-specific layers
- `CleanSpeechModel`: 3-layer GRU for clean speech recognition with day-specific layers
- `NoisySpeechModel`: 2-layer GRU for noisy speech recognition (no day layers)
- `TripleGRUDecoder`: Main adversarial architecture combining all three models
- `CTCLoss`: Custom CTC loss implementation for TPU compatibility
- `create_tpu_strategy()`: Enhanced TPU connection function with robust environment detection
- **`trainer_tf.py`**: TensorFlow training pipeline with distributed TPU support
- **`dataset_tf.py`**: TensorFlow data loading with augmentation pipeline optimized for TPU
- **`train_model_tf.py`**: Main training script entry point
- **`evaluate_model_tf.py`**: Evaluation pipeline for model performance analysis
### TPU v5e-8 Specific Optimizations
#### 1. Enhanced TPU Connection
The `create_tpu_strategy()` function provides robust TPU detection across different environments:
```python
import os
import tensorflow as tf

def create_tpu_strategy():
    """Create TPU strategy for distributed training on TPU v5e-8"""
    # Multi-environment TPU detection
    tpu_address = None
    tpu_name = None
    if 'COLAB_TPU_ADDR' in os.environ:
        tpu_address = os.environ['COLAB_TPU_ADDR']
    elif 'TPU_NAME' in os.environ:
        tpu_name = os.environ['TPU_NAME']
    elif 'TPU_WORKER_ID' in os.environ:
        tpu_address = 'grpc://10.0.0.2:8470'  # Default Kaggle TPU address
    try:
        resolver = tf.distribute.cluster_resolver.TPUClusterResolver(
            tpu=tpu_name or tpu_address)
        tf.config.experimental_connect_to_cluster(resolver)
        tf.tpu.experimental.initialize_tpu_system(resolver)
        return tf.distribute.TPUStrategy(resolver)
    except Exception as e:
        # Fall back to the default strategy if TPU connection fails
        print(f"Failed to initialize TPU: {e}")
        return tf.distribute.get_strategy()
```
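Typical usage (a sketch; the `TripleGRUDecoder` constructor arguments are elided):
```python
strategy = create_tpu_strategy()
print(f"Training on {strategy.num_replicas_in_sync} TPU cores")  # expect 8 on v5e-8
with strategy.scope():
    # Variables created inside the scope are replicated across all TPU cores
    model = TripleGRUDecoder(...)
```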
**Environment Variables Detected**:
- `COLAB_TPU_ADDR`: Google Colab TPU environment
- `TPU_NAME`: Generic TPU name specification
- `TPU_WORKER_ID`: Kaggle TPU environment indicator
**Troubleshooting TPU Connection Issues**:
- Error: "Failed to initialize TPU: Please provide a TPU Name to connect to."
- Solution: The function automatically detects and uses appropriate TPU addresses based on environment
- Debugging: All TPU-related environment variables are printed during initialization (see the sketch below)
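For manual debugging, the same variables can be dumped directly (a minimal sketch):
```python
import os

for key, value in sorted(os.environ.items()):
    if 'TPU' in key:
        print(f"{key}={value}")  # e.g. TPU_NAME, TPU_WORKER_ID, COLAB_TPU_ADDR
```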
#### 2. Mixed Precision Training
Configured for optimal TPU v5e-8 performance:
```python
from tensorflow import keras

def configure_mixed_precision():
    """Configure mixed precision for optimal TPU v5e-8 performance"""
    policy = keras.mixed_precision.Policy('mixed_bfloat16')
    keras.mixed_precision.set_global_policy(policy)
```
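A related convention: under a `mixed_bfloat16` policy, it is standard Keras practice to keep the final projection layer in float32 so the loss sees numerically stable logits (a sketch; `n_classes` is illustrative):
```python
from tensorflow import keras

n_classes = 41  # illustrative output size
# Override the global mixed_bfloat16 policy for the final layer only,
# so the downstream loss receives float32 logits
output_layer = keras.layers.Dense(n_classes, dtype='float32')
```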
#### 3. XLA-Optimized Operations
- **Static Tensor Operations**: Using `tf.stack()` and `tf.gather()` instead of dynamic indexing
- **Efficient Matrix Operations**: `tf.linalg.matmul()` for batch matrix multiplication
- **TPU-Friendly GRU Layers**: Disabled recurrent dropout for better TPU performance
- **Patch Processing**: TensorFlow equivalent of PyTorch's unfold using `tf.image.extract_patches()` (see the sketch below)
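A minimal sketch of that unfold equivalent, assuming input of shape `[batch, time, channels]` and illustrative `patch_size`/`patch_stride` parameters:
```python
import tensorflow as tf

def unfold_time(x, patch_size, patch_stride):
    """TF analogue of PyTorch unfold along the time axis:
    [batch, time, channels] -> [batch, num_patches, patch_size * channels]"""
    # extract_patches expects a 4D image tensor, so add a dummy spatial axis
    x4d = tf.expand_dims(x, axis=2)  # [batch, time, 1, channels]
    patches = tf.image.extract_patches(
        images=x4d,
        sizes=[1, patch_size, 1, 1],
        strides=[1, patch_stride, 1, 1],
        rates=[1, 1, 1, 1],
        padding='VALID',
    )  # [batch, num_patches, 1, patch_size * channels]
    return tf.squeeze(patches, axis=2)
```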
### Key Architecture Differences from PyTorch
#### 1. Gradient Reversal Layer (GRL)
```python
@tf.custom_gradient
def gradient_reverse(x, lambd=1.0):
"""Gradient Reversal Layer for TensorFlow"""
def grad(dy):
return -lambd * dy # Only return gradient w.r.t. x
return tf.identity(x), grad
```
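A quick behavioral check (the forward pass is the identity; the backward pass is negated and scaled by `lambd`):
```python
import tensorflow as tf

x = tf.constant([[1.0, 2.0]])
with tf.GradientTape() as tape:
    tape.watch(x)
    y = tf.reduce_sum(gradient_reverse(x, lambd=0.5))
print(tape.gradient(y, x).numpy())  # [[-0.5 -0.5]]
```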
#### 2. CTC Loss Implementation
Custom sparse tensor conversion for TPU compatibility:
```python
def dense_to_sparse(dense_tensor, sequence_lengths):
    """Convert dense tensor to sparse tensor for CTC"""
    mask = tf.not_equal(dense_tensor, 0)  # 0 is the padding id
    indices = tf.where(mask)
    values = tf.gather_nd(dense_tensor, indices)
    dense_shape = tf.shape(dense_tensor, out_type=tf.int64)  # full [batch, max_len] shape
    return tf.SparseTensor(indices=indices, values=values, dense_shape=dense_shape)
```
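A sketch of how the sparse labels might then feed `tf.nn.ctc_loss` (tensor names are illustrative; with `SparseTensor` labels, `label_length` must be `None`):
```python
sparse_labels = dense_to_sparse(labels, label_lengths)
loss = tf.nn.ctc_loss(
    labels=sparse_labels,
    logits=logits,               # [batch, time, n_classes]
    label_length=None,           # label lengths come from the sparse structure
    logit_length=logit_lengths,  # valid time steps per batch element
    logits_time_major=False,
    blank_index=0,
)
loss = tf.reduce_mean(loss)
```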
#### 3. Day-Specific Layers
Using `add_weight()` for TPU-compatible variable management:
```python
self.day_weights = []
for i in range(n_days):
    # Collect each day's weight so it can be looked up by day index later
    self.day_weights.append(self.add_weight(
        name=f'day_weight_{i}',
        shape=(neural_dim, neural_dim),
        initializer=tf.keras.initializers.Identity(),
        trainable=True,
    ))
```
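A sketch of how such a per-day transform might be applied, tying back to the `tf.stack()`/`tf.gather()` pattern above (`day_idx` is an illustrative `[batch]` tensor of day indices):
```python
day_w = tf.stack(self.day_weights)  # [n_days, neural_dim, neural_dim]
w = tf.gather(day_w, day_idx)       # [batch, neural_dim, neural_dim]
x = tf.linalg.matmul(x, w)          # [batch, time, neural_dim] per-day projection
```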
### Training on TPU v5e-8
#### Basic Training Command
```bash
# In Kaggle TPU v5e-8 environment
python train_model_tf.py
```
#### Expected Output
```
🔍 Detecting TPU environment...
📍 Kaggle TPU detected, worker ID: 0, address: grpc://10.0.0.2:8470
✅ TPU initialized successfully!
🎉 Number of TPU cores: 8
Training on 8 TPU cores # Should show 8 cores, not 1
```
### Performance Benefits
1. **Multi-Core Utilization**: Properly configured TPU strategy utilizes all 8 TPU v5e-8 cores
2. **Mixed Precision**: bfloat16 precision optimized for TPU matrix units
3. **XLA Compilation**: Static operations enable efficient XLA graph compilation
4. **Memory Efficiency**: Optimized for TPU memory constraints and batch processing
### Common Issues and Solutions
#### Issue: "Training on 1 TPU cores" instead of 8
**Cause**: TPU connection fallback to default strategy
**Solution**: Enhanced `create_tpu_strategy()` function with environment detection
**Check**: Verify TPU environment variables are properly set
#### Issue: CTC Loss dtype errors
**Cause**: Mixed precision dtype mismatches
**Solution**: Explicit dtype casting in `CTCLoss.call()`
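A minimal sketch of that cast (assuming the mixed precision policy leaves model outputs in bfloat16):
```python
# Inside CTCLoss.call(): bring tensors to the dtypes CTC expects
logits = tf.cast(logits, tf.float32)  # CTC kernels are not stable in bfloat16
labels = tf.cast(labels, tf.int32)    # labels must be integer ids
```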
#### Issue: Gradient Reversal Layer errors
**Cause**: Incorrect gradient return format
**Solution**: Return only gradient w.r.t. input tensor, not lambda parameter
## Competition Context
This codebase serves as a baseline for the Brain-to-Text '25 Competition on Kaggle, providing both PyTorch and TensorFlow reference implementations for neural signal decoding, with optimizations for TPU v5e-8 training environments.

rnn_model_tf.py

@@ -763,6 +763,14 @@ def create_tpu_strategy():
print("🔍 Detecting TPU environment...")
# Disable GPU to avoid CUDA conflicts in TPU environment
try:
print("🚫 Disabling GPU to prevent CUDA conflicts...")
tf.config.set_visible_devices([], 'GPU')
print("✅ GPU disabled successfully")
except Exception as e:
print(f"⚠️ Warning: Could not disable GPU: {e}")
# Check for various TPU environment variables
tpu_address = None
tpu_name = None