Emu3.5 Hangs On RTX 5090: A Linux/WSL Bug

Emu3.5 (34B) Generation Hangs at 0% on RTX 5090 (sm_120) on Linux/WSL

Is anyone else encountering issues with Emu3.5 on the RTX 5090 when running Linux in WSL? Let's dive into this bug and figure out a solution together!

1. Executive Summary

After extensive debugging, a critical bug has been identified: the Emu3.5 (34B) model loads successfully on an RTX 5090 (Blackwell, sm_120) in a fully configured WSL/Linux setup, but the generation step hangs indefinitely at 0% with no error output.

  • ✅ What's Working: The GPU is recognized, CUDA 12.9 is installed, the PyTorch cu128 nightly build runs, flash_attn is compiled for sm_120, and the model loads cleanly in both 4-bit and 16-bit.
  • ❌ What's Not Working: The model.generate() call never progresses. It sits at 0%, GPU utilization stays low (around 9%), and no error messages are produced to help troubleshoot.
  • Key Finding: The issue is not caused by bitsandbytes (16-bit bfloat16 fails in exactly the same way). Instead, it appears to be a core incompatibility in the generation code when running on the sm_120 architecture under Linux/WSL.
  • Workaround: The same model does work on the same machine in a native Windows environment using "eager" attention, so that can serve as a temporary fix.

2. Environment Details

This environment was built specifically to provide full sm_120 (Blackwell) support.

  • Hardware: AMD Threadripper PRO 5975WX | 64GB RAM | NVIDIA GeForce RTX 5090 32GB
  • OS: Windows 11 Pro w/ WSL 2 (Ubuntu 24.04.3 LTS)
  • Driver: 571.76
  • CUDA Toolkit (in WSL): 12.9.1 (nvcc V12.9.1)
  • Environment: Conda (Python 3.12.9)

Key Dependencies (in Conda Env):

torch==2.7.0.dev20250310+cu128
torchvision==0.22.0.dev20250310+cu128
transformers==4.48.2
flash_attn==2.8.3 (compiled with TORCH_CUDA_ARCH_LIST="12.0")
bitsandbytes==0.45.3
accelerate==1.3.0
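
A quick way to confirm that an environment matches these versions and that the GPU reports compute capability 12.0 (sm_120) is a short check like the one below (a minimal sketch; it only relies on each package exposing __version__):

    # Minimal environment check: print the versions relevant to this report
    # and confirm the GPU reports compute capability 12.0 (sm_120).
    import torch, transformers, flash_attn, bitsandbytes, accelerate

    for name, mod in [("torch", torch), ("transformers", transformers),
                      ("flash_attn", flash_attn), ("bitsandbytes", bitsandbytes),
                      ("accelerate", accelerate)]:
        print(f"{name}: {mod.__version__}")

    print("compute capability:", torch.cuda.get_device_capability(0))  # expect (12, 0)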

3. Steps to Reproduce

Want to try and reproduce this bug yourself? Here's how you can do it:

  1. Install CUDA 12.9 in WSL:

    sudo apt-get install cuda-toolkit-12-9
    
  2. Create Conda Environment & Set Variables:

    conda create -n emu_rtx5090 python=3.12 -y
    conda activate emu_rtx5090
    export CUDA_HOME=/usr/local/cuda-12.9
    export PATH=$CUDA_HOME/bin:$PATH
    export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
    
  3. Install PyTorch cu128 Nightly:

    pip cache purge
    pip install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/cu128
    
  4. Install Dependencies & Compile flash_attn for sm_120:

    pip install transformers==4.48.2 bitsandbytes accelerate omegaconf protobuf tiktoken
    export TORCH_CUDA_ARCH_LIST="12.0"
    pip install flash_attn==2.8.3 --no-build-isolation --no-cache-dir
    
  5. Clone Model & Run Inference:

    • Clone BAAI/Emu3.5-Image and BAAI/Emu3.5-VisionTokenizer.
    • Use a config file (simple_test_config.py) to run inference.py with 4-bit quantization and flash_attention_2 (see the sketch below for an illustration of these settings).
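
The repository's inference.py and simple_test_config.py are not reproduced here. As an illustration only, the following sketch shows how the 4-bit NF4 + flash_attention_2 settings from this repro map onto the standard transformers API; the prompt handling is simplified, and everything other than the model ID is an assumption:

    # Illustrative sketch only; this is not the repository's inference.py.
    # It shows a 4-bit NF4 load with flash_attention_2 via the standard transformers API.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    tokenizer = AutoTokenizer.from_pretrained("BAAI/Emu3.5-Image", trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        "BAAI/Emu3.5-Image",
        quantization_config=bnb_config,
        attn_implementation="flash_attention_2",
        device_map="auto",
        trust_remote_code=True,
    )

    # Simplified prompt; the real pipeline also involves the vision tokenizer.
    inputs = tokenizer("a photo of a cat", return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=64)  # hangs at 0% on sm_120/WSL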

4. Expected Behavior

The script should load the model, the Generating: progress bar should advance from 0% to 100%, and the generated image should be saved.

5. Actual Behavior

Here's the frustrating reality:

  • The model loads perfectly: all 15 shards are loaded in about 30 seconds. So far, so good!

  • VRAM usage is correct: ~29GB / 32GB. The model is definitely using the GPU.

  • The transformers==4.48.2 downgrade correctly fixes the AttributeError: ... no attribute 'generate'. One problem solved, another emerges.

  • The script hangs indefinitely at the generation step:

    Generating:   0%|          | 0/1 [00:00<?, ?it/s]
    
  • nvidia-smi shows low GPU utilization (9-10%) and no errors are thrown; the process simply never makes progress. (A sketch for capturing a stack trace from the hung process is shown below.)
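
One way (not part of the original report) to see where the process is actually blocked is Python's built-in faulthandler, which can periodically dump every thread's stack while generate() is stuck:

    # Optional diagnostic (not from the original repro): dump all thread stacks
    # every 120 seconds so the call blocking inside model.generate() becomes visible.
    import faulthandler
    import sys

    faulthandler.dump_traceback_later(timeout=120, repeat=True, file=sys.stderr)

    # ... load the model and call model.generate() as usual ...
    # If a CUDA kernel is stuck, the dump typically shows the main thread
    # inside a torch synchronization call.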


6. Diagnostic Tests Performed

To make sure this was not a simple setup error, multiple tests were conducted to isolate the root cause.

Test 1: 4-bit Quantization + flash_attn (as above)

  • Result: FAILED (Hangs at 0%)

Test 2: 16-bit (bfloat16) + flash_attn + CPU Offload

  • Goal: Rule out bitsandbytes as the culprit. Is it the quantization that's causing the issue?
  • Result: FAILED (Hangs at 0% in the exact same way as the 4-bit test). This is a crucial clue!
  • Conclusion: bitsandbytes is NOT the root cause; we can cross it off the list. (A sketch of this configuration follows below.)
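
For reference, here is a minimal sketch of how the Test 2 configuration (bfloat16 + flash_attn + CPU offload) is typically expressed with transformers/accelerate. The actual config file is not shown in this report, so the memory split below is an assumption:

    # Illustrative sketch of the Test 2 configuration (bfloat16 + CPU offload).
    # The memory split is an assumed example, not the values from the original run.
    import torch
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "BAAI/Emu3.5-Image",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
        device_map="auto",                         # accelerate dispatches layers
        max_memory={0: "30GiB", "cpu": "48GiB"},   # forces part of the model onto CPU
        trust_remote_code=True,
    )
    # model.generate() then hangs at 0%, exactly as in the 4-bit run.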

Test 3: 4-bit Quantization + eager Attention

  • Goal: Rule out flash_attn as the problem. Is it something specific to the attention mechanism?
  • Result: Model loading became roughly 25x slower (about 50 seconds per shard versus about 2 seconds per shard with flash_attn).
  • Conclusion: The test was interrupted because waiting for the full model to load was impractical. However, the loading-speed difference indicates that flash_attn is functioning correctly during model loading. The generation bug likely exists in both the flash_attn and eager code paths.

7. Test Results Summary

Test Configuration                  Model Load              Generation       GPU Util   VRAM       Status
4-bit NF4 + flash_attn              ✅ 30s (15 shards)       ❌ Stuck 0%      9%         29GB       FAILED
16-bit + flash_attn + CPU offload   ✅ 43 min (15 shards)    ❌ Stuck 0%      Low        Variable   FAILED
Eager attention                     ⚠️ 100s (2/15 shards)    ⏸️ Interrupted   1%         28GB       INCOMPLETE

8. Conclusion

The Emu3.5 (34B) generation code appears to have a core incompatibility with the NVIDIA Blackwell (sm_120) architecture when running on Linux/WSL. Since both quantized and non-quantized inference fail identically, the bug lies deeper than the quantization kernels. It's a fundamental issue that needs to be addressed.

Given that a native Windows environment on the same machine can produce images (using eager attention), this points to an issue specific to the Emu3.5 generation loop on Linux with sm_120.

The BAAI team is requested to investigate compatibility with the RTX 50-series (sm_120) on Linux.


9. Workaround

For users with RTX 5090 hardware, here's a potential workaround:

  • Windows + Eager Attention: Works successfully (confirmed with transformers 4.48.2, PyTorch cu124). It's not ideal, but it's a start.
  • Performance: 512x512 images in 45-60 seconds.

10. Additional Information

  • A full diagnostic report is available upon request.
  • Willing to provide additional logs or to test specific patches.
  • The system has been verified working with other CUDA workloads. It's not a hardware problem.

System verification:

import torch
print(f'PyTorch: {torch.__version__}')
print(f'CUDA: {torch.cuda.is_available()}')
print(f'GPU: {torch.cuda.get_device_name(0)}')
# Output:
# PyTorch: 2.7.0.dev20250310+cu128
# CUDA: True
# GPU: NVIDIA GeForce RTX 5090