Emu3.5 Hangs On RTX 5090: A Linux/WSL Bug
Is anyone else encountering issues with Emu3.5 on the RTX 5090 when running Linux in WSL? Let's dive into this bug and figure out a solution together!
1. Executive Summary
After some serious digging, a critical bug has been found: the Emu3.5 (34B) model loads successfully on an RTX 5090 (Blackwell, sm_120) in a fully configured WSL/Linux setup, but the generation step hangs at 0% indefinitely, with no error output to point at the cause.
- ✅ What's Working: The GPU is recognized, CUDA 12.9 is installed, the PyTorch cu128 nightly build runs, `flash_attn` is compiled for sm_120, and the model loads cleanly in both 4-bit and 16-bit.
- ❌ What's Not Working: The `model.generate()` call never progresses past 0%, GPU utilization stays low (around 9%), and no error messages are produced to help troubleshoot.
- Key Finding: The issue is not related to `bitsandbytes`, since 16-bit `bfloat16` fails in exactly the same way. Instead, there appears to be a core incompatibility in the generation code when running on the sm_120 architecture under Linux/WSL.
- Workaround: There is a glimmer of hope: the same model does work on the same machine in a native Windows environment using "eager" attention, so that can serve as a temporary fallback.
2. Environment Details
This environment was built specifically to provide full sm_120 (Blackwell) support. It's like a finely tuned race car, but something's still off.
- Hardware: AMD Threadripper PRO 5975WX | 64GB RAM | NVIDIA GeForce RTX 5090 32GB
- OS: Windows 11 Pro w/ WSL 2 (Ubuntu 24.04.3 LTS)
- Driver: 571.76
- CUDA Toolkit (in WSL): 12.9.1 (`nvcc` V12.9.1)
- Environment: Conda (Python 3.12.9)
Key Dependencies (in Conda Env):

```
torch==2.7.0.dev20250310+cu128
torchvision==0.22.0.dev20250310+cu128
transformers==4.48.2
flash_attn==2.8.3   # compiled with TORCH_CUDA_ARCH_LIST="12.0"
bitsandbytes==0.45.3
accelerate==1.3.0
```
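As a quick sanity check that this stack actually targets Blackwell, the following snippet (a minimal sketch; it only assumes the packages above are installed) prints the detected compute capability and the `flash_attn` build:

```python
import torch
import flash_attn

# Blackwell (RTX 5090) should report compute capability (12, 0), i.e. sm_120.
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: sm_{major}{minor}")
print(f"flash_attn: {flash_attn.__version__}")
print(f"bf16 supported: {torch.cuda.is_bf16_supported()}")
```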
3. Steps to Reproduce
Want to try and reproduce this bug yourself? Here's how you can do it:
- Install CUDA 12.9 in WSL:

  ```bash
  sudo apt-get install cuda-toolkit-12-9
  ```

- Create Conda environment & set variables:

  ```bash
  conda create -n emu_rtx5090 python=3.12 -y
  conda activate emu_rtx5090
  export CUDA_HOME=/usr/local/cuda-12.9
  export PATH=$CUDA_HOME/bin:$PATH
  export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
  ```

- Install PyTorch cu128 nightly:

  ```bash
  pip cache purge
  pip install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/cu128
  ```

- Install dependencies & compile `flash_attn` for sm_120:

  ```bash
  pip install transformers==4.48.2 bitsandbytes accelerate omegaconf protobuf tiktoken
  export TORCH_CUDA_ARCH_LIST="12.0"
  pip install flash_attn==2.8.3 --no-build-isolation --no-cache-dir
  ```

- Clone model & run inference:
  - Clone `BAAI/Emu3.5-Image` and `BAAI/Emu3.5-VisionTokenizer`.
  - Use a config file (`simple_test_config.py`) to run `inference.py` with 4-bit quantization and `flash_attention_2` (roughly equivalent to the sketch below).
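For context, the final step boils down to a 4-bit load followed by a `generate()` call. The sketch below is not the repository's `inference.py`; it is a rough equivalent using the generic `transformers` loading path, and the local path `./Emu3.5-Image` is an assumption:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_PATH = "./Emu3.5-Image"  # assumed local clone of BAAI/Emu3.5-Image

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",
    device_map="auto",
    trust_remote_code=True,
)

inputs = tokenizer("a photo of a cat", return_tensors="pt").to(model.device)
# On this setup, the call below is the one that never returns.
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0]))
```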
4. Expected Behavior
Ideally, the script should load the model without a hitch, the `Generating:` progress bar should advance smoothly from 0% to 100%, and a generated image should be saved at the end.
5. Actual Behavior
Here's the frustrating reality:
- The model loads perfectly: all 15 shards load in about 30 seconds. So far, so good!
- VRAM usage is correct: ~29GB / 32GB, so the model is definitely on the GPU.
- The `transformers==4.48.2` downgrade correctly fixes the `AttributeError: ... no attribute 'generate'`. One problem solved, another emerges.
- The script hangs indefinitely at the generation step: `Generating: 0%| | 0/1 [00:00<?, ?it/s]`
- `nvidia-smi` shows low GPU utilization (9-10%) and no errors are thrown; a way to capture where the script is stuck is sketched after this list.
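One way to see where the script is actually stuck (a suggested diagnostic, not something from the original run) is to arm `faulthandler` right before the call that hangs, reusing the `model` and `inputs` variables from the sketch in section 3. Running `py-spy dump --pid <PID>` from a second terminal gives similar information without editing the script.

```python
import faulthandler
import sys

# Dump the Python stack of every thread every 60 seconds while generate()
# is stuck, to see whether it is blocked in a CUDA synchronize, a host-side
# sampling loop, or a kernel launch that never completes.
faulthandler.dump_traceback_later(60, repeat=True, file=sys.stderr)

output = model.generate(**inputs, max_new_tokens=32)  # the call that hangs

faulthandler.cancel_dump_traceback_later()
```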
6. Diagnostic Tests Performed
To make sure this wasn't just a simple setup error, multiple tests were conducted to isolate the root cause of the problem.
Test 1: 4-bit Quantization + flash_attn (as above)
- Result: FAILED (Hangs at 0%)
Test 2: 16-bit (bfloat16) + flash_attn + CPU Offload
- Goal: Rule out `bitsandbytes` as the culprit; is the quantization causing the issue? (A sketch of this load configuration follows below.)
- Result: FAILED (hangs at 0% in exactly the same way as the 4-bit test). This is a crucial clue!
- Conclusion: `bitsandbytes` is NOT the root cause and can be crossed off the list.
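For reference, the 16-bit + CPU offload configuration used in Test 2 corresponds roughly to a load like the one below (a sketch only; the `max_memory` split and local path are illustrative assumptions):

```python
import torch
from transformers import AutoModelForCausalLM

MODEL_PATH = "./Emu3.5-Image"  # assumed local clone of BAAI/Emu3.5-Image

# bfloat16 weights, no bitsandbytes involved; accelerate spills whatever
# exceeds the GPU budget in max_memory to system RAM, hence the slower load.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
    max_memory={0: "30GiB", "cpu": "48GiB"},
    trust_remote_code=True,
)
```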
Test 3: 4-bit Quantization + eager Attention
- Goal: Rule out `flash_attn` as the problem; is it something specific to the attention mechanism?
- Result: Model loading became ~25x slower (about 50 seconds per shard versus about 2 seconds per shard with `flash_attn`).
- Conclusion: The test was interrupted because waiting for the full model to load was impractical. However, this shows that `flash_attn` is working correctly for model loading, and the generation bug likely exists in both the `flash_attn` and `eager` code paths.
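A possible next step (suggested here, not one of the tests above) is a small control experiment: if a tiny causal LM generates fine on the exact same torch/transformers stack, the hang is specific to Emu3.5's generation loop rather than to `generate()` on sm_120 in general. A minimal sketch, assuming `gpt2` as the control model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Control experiment in the same environment: tiny model, same generate() API.
name = "gpt2"  # any small causal LM would do
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16).cuda()

inputs = tok("Hello from Blackwell", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=20)
print(tok.decode(out[0]))
```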
7. Test Results Summary
| Test Configuration | Model Load | Generation | GPU Util | VRAM | Status |
|---|---|---|---|---|---|
| 4-bit NF4 + flash_attn | ✅ 30s (15 shards) | ❌ Stuck 0% | 9% | 29GB | FAILED |
| 16-bit + flash_attn + CPU offload | ✅ 43 min (15 shards) | ❌ Stuck 0% | Low | Variable | FAILED |
| Eager attention | ⚠️ 100s (2/15 shards) | ⏸️ Interrupted | 1% | 28GB | INCOMPLETE |
8. Conclusion
The Emu3.5 (34B) generation code appears to have a core incompatibility with the NVIDIA Blackwell (sm_120) architecture when running on Linux/WSL. Since both quantized and non-quantized inference fail identically, the bug lies deeper than the quantization kernels. It's a fundamental issue that needs to be addressed.
Given that a native Windows environment on the same machine can produce images (using eager attention), this points to a specific issue with the Emu3.5 generation loop on Linux with sm_120.
The BAAI team is requested to investigate compatibility with the RTX 50-series (sm_120) on Linux.
9. Workaround
For users with RTX 5090 hardware, here's a potential workaround:
- Windows + Eager Attention: Works successfully (confirmed with transformers 4.48.2, PyTorch cu124). It's not ideal, but it's a start.
- Performance: 512x512 images in 45-60 seconds. Patience is key!
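For anyone trying to mirror the Windows setup, the attention backend is a load-time switch. A minimal sketch (same assumption of a local `./Emu3.5-Image` clone; quantization left out for clarity):

```python
import torch
from transformers import AutoModelForCausalLM

MODEL_PATH = "./Emu3.5-Image"  # assumed local clone of BAAI/Emu3.5-Image

# "eager" selects plain PyTorch attention, so flash_attn is not involved at all.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",
    device_map="auto",
    trust_remote_code=True,
)
```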
10. Additional Information
- A full diagnostic report is available upon request.
- Willing to provide additional logs or test specific patches to help find a solution.
- The system has been verified working with other CUDA workloads. It's not a hardware problem.
System verification:
```python
import torch
print(f'PyTorch: {torch.__version__}')
print(f'CUDA: {torch.cuda.is_available()}')
print(f'GPU: {torch.cuda.get_device_name(0)}')
# Output:
# PyTorch: 2.7.0.dev20250310+cu128
# CUDA: True
# GPU: NVIDIA GeForce RTX 5090
```
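One more check worth adding (not part of the original output above) confirms that CUDA kernels actually launch and complete on the card, which helps rule out a driver-level stall as the cause of the hang:

```python
import torch

# A small matmul forced to completion: if this returns promptly, basic
# kernel launch and synchronization work in this WSL environment.
a = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
c = a @ b
torch.cuda.synchronize()
print("matmul OK:", tuple(c.shape), "| capability:", torch.cuda.get_device_capability(0))
```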