Emu3.5 Hangs On RTX 5090: A Linux/WSL Bug
Is anyone else encountering issues with Emu3.5 on the RTX 5090 when running Linux in WSL? Let's dive into this bug and figure out a solution together!
1. Executive Summary
After some serious digging, a critical bug has been found: the Emu3.5 (34B) model loads successfully on an RTX 5090 (Blackwell, sm_120) in a fully configured WSL/Linux setup, but the generation step hangs at 0% indefinitely, with no error output to point at the cause.
- ✅ What's Working: The GPU is recognized, CUDA 12.9 is installed, the PyTorch cu128 nightly build runs, `flash_attn` is compiled for sm_120, and the model loads cleanly in both 4-bit and 16-bit.
- ❌ What's Not Working: The `model.generate()` call never progresses past 0%, GPU utilization stays low (around 9%), and no error messages are produced to help troubleshoot.
- Key Finding: The issue is not related to `bitsandbytes`, since 16-bit `bfloat16` fails in exactly the same way. Instead, there appears to be a core incompatibility in the generation code when running on the sm_120 architecture under Linux/WSL.
- Workaround: There is a glimmer of hope: the same model does work on the same machine in a native Windows environment using "eager" attention, so that can serve as a temporary fallback.
2. Environment Details
This environment was built specifically to provide full sm_120 (Blackwell) support. It's like a finely tuned race car, but something's still off.
- Hardware: AMD Threadripper PRO 5975WX | 64GB RAM | NVIDIA GeForce RTX 5090 32GB
- OS: Windows 11 Pro w/ WSL 2 (Ubuntu 24.04.3 LTS)
- Driver: 571.76
- CUDA Toolkit (in WSL): 12.9.1 (`nvcc` V12.9.1)
- Environment: Conda (Python 3.12.9)
Key Dependencies (in Conda Env):

```
torch==2.7.0.dev20250310+cu128
torchvision==0.22.0.dev20250310+cu128
transformers==4.48.2
flash_attn==2.8.3   # compiled with TORCH_CUDA_ARCH_LIST="12.0"
bitsandbytes==0.45.3
accelerate==1.3.0
```
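As a quick sanity check that this stack actually targets Blackwell, the following snippet (a minimal sketch; it only assumes the packages above are installed) prints the detected compute capability and the `flash_attn` build:

```python
import torch
import flash_attn

# Blackwell (RTX 5090) should report compute capability (12, 0), i.e. sm_120.
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: sm_{major}{minor}")
print(f"flash_attn: {flash_attn.__version__}")
print(f"bf16 supported: {torch.cuda.is_bf16_supported()}")
```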
3. Steps to Reproduce
Want to try and reproduce this bug yourself? Here's how you can do it:
- Install CUDA 12.9 in WSL:

  ```bash
  sudo apt-get install cuda-toolkit-12-9
  ```

- Create Conda environment & set variables:

  ```bash
  conda create -n emu_rtx5090 python=3.12 -y
  conda activate emu_rtx5090
  export CUDA_HOME=/usr/local/cuda-12.9
  export PATH=$CUDA_HOME/bin:$PATH
  export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
  ```

- Install PyTorch cu128 nightly:

  ```bash
  pip cache purge
  pip install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/cu128
  ```

- Install dependencies & compile `flash_attn` for sm_120:

  ```bash
  pip install transformers==4.48.2 bitsandbytes accelerate omegaconf protobuf tiktoken
  export TORCH_CUDA_ARCH_LIST="12.0"
  pip install flash_attn==2.8.3 --no-build-isolation --no-cache-dir
  ```

- Clone model & run inference:
  - Clone `BAAI/Emu3.5-Image` and `BAAI/Emu3.5-VisionTokenizer`.
  - Use a config file (`simple_test_config.py`) to run `inference.py` with 4-bit quantization and `flash_attention_2` (roughly equivalent to the sketch below).
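For context, the final step boils down to a 4-bit load followed by a `generate()` call. The sketch below is not the repository's `inference.py`; it is a rough equivalent using the generic `transformers` loading path, and the local path `./Emu3.5-Image` is an assumption:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_PATH = "./Emu3.5-Image"  # assumed local clone of BAAI/Emu3.5-Image

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",
    device_map="auto",
    trust_remote_code=True,
)

inputs = tokenizer("a photo of a cat", return_tensors="pt").to(model.device)
# On this setup, the call below is the one that never returns.
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0]))
```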
4. Expected Behavior
Ideally, the script should load the model without a hitch, the `Generating:` progress bar should advance smoothly from 0% to 100%, and a generated image should be saved at the end.
5. Actual Behavior
Here's the frustrating reality:
- The model loads perfectly: all 15 shards load in about 30 seconds. So far, so good!
- VRAM usage is correct: ~29GB / 32GB, so the model is definitely on the GPU.
- The `transformers==4.48.2` downgrade correctly fixes the `AttributeError: ... no attribute 'generate'`. One problem solved, another emerges.
- The script hangs indefinitely at the generation step: `Generating: 0%| | 0/1 [00:00<?, ?it/s]`
- `nvidia-smi` shows low GPU utilization (9-10%) and no errors are thrown; a way to capture where the script is stuck is sketched after this list.
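One way to see where the script is actually stuck (a suggested diagnostic, not something from the original run) is to arm `faulthandler` right before the call that hangs, reusing the `model` and `inputs` variables from the sketch in section 3. Running `py-spy dump --pid <PID>` from a second terminal gives similar information without editing the script.

```python
import faulthandler
import sys

# Dump the Python stack of every thread every 60 seconds while generate()
# is stuck, to see whether it is blocked in a CUDA synchronize, a host-side
# sampling loop, or a kernel launch that never completes.
faulthandler.dump_traceback_later(60, repeat=True, file=sys.stderr)

output = model.generate(**inputs, max_new_tokens=32)  # the call that hangs

faulthandler.cancel_dump_traceback_later()
```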
6. Diagnostic Tests Performed
To make sure this wasn't just a simple setup error, multiple tests were conducted to isolate the root cause of the problem.
Test 1: 4-bit Quantization + flash_attn (as above)
- Result: FAILED (Hangs at 0%)
Test 2: 16-bit (bfloat16) + flash_attn + CPU Offload
- Goal: Rule out `bitsandbytes` as the culprit; is the quantization causing the issue? (A sketch of this load configuration follows below.)
- Result: FAILED (hangs at 0% in exactly the same way as the 4-bit test). This is a crucial clue!
- Conclusion: `bitsandbytes` is NOT the root cause and can be crossed off the list.
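For reference, the 16-bit + CPU offload configuration used in Test 2 corresponds roughly to a load like the one below (a sketch only; the `max_memory` split and local path are illustrative assumptions):

```python
import torch
from transformers import AutoModelForCausalLM

MODEL_PATH = "./Emu3.5-Image"  # assumed local clone of BAAI/Emu3.5-Image

# bfloat16 weights, no bitsandbytes involved; accelerate spills whatever
# exceeds the GPU budget in max_memory to system RAM, hence the slower load.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
    max_memory={0: "30GiB", "cpu": "48GiB"},
    trust_remote_code=True,
)
```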
Test 3: 4-bit Quantization + eager Attention
- Goal: Rule out `flash_attn` as the problem; is it something specific to the attention mechanism?
- Result: Model loading became ~25x slower (about 50 seconds per shard versus about 2 seconds per shard with `flash_attn`).
- Conclusion: The test was interrupted because waiting for the full model to load was impractical. However, this shows that `flash_attn` is working correctly for model loading, and the generation bug likely exists in both the `flash_attn` and `eager` code paths.
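A possible next step (suggested here, not one of the tests above) is a small control experiment: if a tiny causal LM generates fine on the exact same torch/transformers stack, the hang is specific to Emu3.5's generation loop rather than to `generate()` on sm_120 in general. A minimal sketch, assuming `gpt2` as the control model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Control experiment in the same environment: tiny model, same generate() API.
name = "gpt2"  # any small causal LM would do
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16).cuda()

inputs = tok("Hello from Blackwell", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=20)
print(tok.decode(out[0]))
```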
7. Test Results Summary
| Test Configuration | Model Load | Generation | GPU Util | VRAM | Status |
|---|---|---|---|---|---|
| 4-bit NF4 + flash_attn | ✅ 30s (15 shards) | ❌ Stuck 0% | 9% | 29GB | FAILED |
| 16-bit + flash_attn + CPU offload | ✅ 43 min (15 shards) | ❌ Stuck 0% | Low | Variable | FAILED |
| Eager attention | ⚠️ 100s (2/15 shards) | ⏸️ Interrupted | 1% | 28GB | INCOMPLETE |
8. Conclusion
The Emu3.5 (34B) generation code appears to have a core incompatibility with the NVIDIA Blackwell (sm_120) architecture when running on Linux/WSL. Since both quantized and non-quantized inference fail identically, the bug lies deeper than the quantization kernels. It's a fundamental issue that needs to be addressed.
Given that a native Windows environment on the same machine can produce images (using eager attention), this points to a specific issue with the Emu3.5 generation loop on Linux with sm_120.
The BAAI team is requested to investigate compatibility with the RTX 50-series (sm_120) on Linux.
9. Workaround
For users with RTX 5090 hardware, here's a potential workaround:
- Windows + Eager Attention: Works successfully (confirmed with transformers 4.48.2, PyTorch cu124). It's not ideal, but it's a start.
- Performance: 512x512 images in 45-60 seconds. Patience is key!
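For anyone trying to mirror the Windows setup, the attention backend is a load-time switch. A minimal sketch (same assumption of a local `./Emu3.5-Image` clone; quantization left out for clarity):

```python
import torch
from transformers import AutoModelForCausalLM

MODEL_PATH = "./Emu3.5-Image"  # assumed local clone of BAAI/Emu3.5-Image

# "eager" selects plain PyTorch attention, so flash_attn is not involved at all.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",
    device_map="auto",
    trust_remote_code=True,
)
```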
10. Additional Information
- A full diagnostic report is available upon request.
- Willing to provide additional logs or test specific patches to help find a solution.
- The system has been verified working with other CUDA workloads. It's not a hardware problem.
System verification:
```python
import torch
print(f'PyTorch: {torch.__version__}')
print(f'CUDA: {torch.cuda.is_available()}')
print(f'GPU: {torch.cuda.get_device_name(0)}')
# Output:
# PyTorch: 2.7.0.dev20250310+cu128
# CUDA: True
# GPU: NVIDIA GeForce RTX 5090
```
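One more check worth adding (not part of the original output above) confirms that CUDA kernels actually launch and complete on the card, which helps rule out a driver-level stall as the cause of the hang:

```python
import torch

# A small matmul forced to completion: if this returns promptly, basic
# kernel launch and synchronization work in this WSL environment.
a = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
c = a @ b
torch.cuda.synchronize()
print("matmul OK:", tuple(c.shape), "| capability:", torch.cuda.get_device_capability(0))
```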