
Fused MoE + LoRA: Fitting Heterogeneous Adapters in a Single Forward Pass

This is a design note on a kernel I’ve been working on — fusing LoRA application into the MoE forward pass so you don’t pay for a separate matmul per adapter.

The problem

Standard MoE inference: tokens get routed to experts, each expert runs its FFN, and the results are combined. If you want per-expert LoRA adapters, the naive approach is to run the base expert forward, then apply the LoRA delta as a separate low-rank matmul. That’s an extra kernel launch per expert, and the second pass re-reads activations the expert GEMM just produced, so the memory access pattern is poor.
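To make the baseline concrete, here is a minimal numpy sketch of the naive path for a single expert: a full-rank GEMM for the base weight, then a second low-rank pass for the LoRA delta. All shapes and names here are illustrative, not the actual kernel's.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, rank, n_tokens = 64, 256, 8, 4

# Base expert weight and its per-expert LoRA factors (shapes illustrative).
W = rng.standard_normal((d_ff, d_model))
A = rng.standard_normal((d_ff, rank)) * 0.01    # LoRA "up" factor
B = rng.standard_normal((rank, d_model)) * 0.01  # LoRA "down" factor
x = rng.standard_normal((n_tokens, d_model))

# Naive: one full-rank GEMM for the base expert...
base = x @ W.T
# ...then a separate low-rank pass for the delta. This is the extra
# launch per expert, and it re-reads x and the output tile.
delta = (x @ B.T) @ A.T
y_naive = base + delta
```

The two-pass structure is exactly what the fused kernel is trying to eliminate.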

The approach

The idea is to fuse the LoRA application into the existing FusedMoE kernel. During the dispatch phase, while tokens are being scattered to experts, we also load the corresponding LoRA A and B matrices. The expert computation becomes (W + AB)x, evaluated as Wx + A(Bx): the small Bx projection is staged in shared memory before the main GEMM.
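A toy host-side model of that fusion, assuming top-1 routing and one adapter per expert (the real kernel obviously works on tiles, not whole matrices, and the shapes here are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
n_tokens, d_model, d_ff, n_experts, rank = 8, 32, 64, 4, 4

x = rng.standard_normal((n_tokens, d_model))
top1 = rng.integers(0, n_experts, size=n_tokens)   # router choice per token

W = rng.standard_normal((n_experts, d_ff, d_model))
A = rng.standard_normal((n_experts, d_ff, rank)) * 0.01
B = rng.standard_normal((n_experts, rank, d_model)) * 0.01

y = np.zeros((n_tokens, d_ff))
for e in range(n_experts):
    idx = np.nonzero(top1 == e)[0]
    if idx.size == 0:
        continue
    xe = x[idx]                # tokens dispatched to expert e
    bx = xe @ B[e].T           # tiny rank-r projection, staged first
                               # (the shared-memory step in the kernel)
    y[idx] = xe @ W[e].T + bx @ A[e].T   # one pass: Wx + A(Bx)
```

Evaluating (W + AB)x as Wx + A(Bx) is what keeps the cost low: the extra work per token is O(r·(d_model + d_ff)) instead of a second full-rank pass.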

Parameter counts

For context on why this matters: Qwen3-30B-A3B at LoRA rank 16 across all experts is roughly 50M trainable parameters. Qwen3.5-397B-A17B is closer to 800M. These are small enough to fit in a single checkpoint shard, which makes the training side much simpler.
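The general accounting is simple enough to write down. A sketch of the count, assuming every adapted projection of shape (d_out, d_in) gets its own rank-r pair (the exact totals in the post depend on the real model configs and on which projections are adapted):

```python
def lora_param_count(n_layers, n_experts, rank, proj_shapes):
    """Trainable params when each expert projection gets its own rank-r adapter.

    A projection of shape (d_out, d_in) contributes rank * (d_in + d_out)
    parameters: A is (d_out, rank), B is (rank, d_in).
    proj_shapes is a list of (d_out, d_in) tuples for the adapted projections.
    """
    per_expert = sum(rank * (d_in + d_out) for (d_out, d_in) in proj_shapes)
    return n_layers * n_experts * per_expert
```

Because the count scales with rank rather than with d_in × d_out, the totals stay in the tens-to-hundreds-of-millions range even for very large MoE models.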

Construction ownership

One design question was where the LoRA weights live in the vLLM layer hierarchy. We settled on having the FusedMoE layer own construction of the combined weight tensor, with the MoERunner handling execution. This inverts the previous ownership model but makes runtime LoRA updates cleaner.
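A rough sketch of that split, with the layer owning state and the runner staying stateless. The class and method names below are hypothetical shorthand for the shapes involved, not vLLM's actual API:

```python
class MoERunner:
    """Stateless executor: takes weights as arguments, owns nothing."""

    def run(self, hidden_states, router_logits, base_weights, lora):
        # Placeholder for the fused dispatch + GEMM; in the sketch we just
        # show what the runner is handed, not how it computes.
        raise NotImplementedError


class FusedMoELayer:
    """Owns the base weights and LoRA factors; delegates execution.

    Hypothetical sketch of the ownership split described in the post.
    """

    def __init__(self, base_weights, runner):
        self.base_weights = base_weights   # per-expert W
        self.lora = None                   # per-expert (A, B), set at runtime
        self.runner = runner

    def set_lora(self, a, b):
        # A runtime adapter swap touches only this layer's state. Because
        # the runner holds no weights, nothing else needs invalidating.
        self.lora = (a, b)

    def forward(self, hidden_states, router_logits):
        return self.runner.run(hidden_states, router_logits,
                               self.base_weights, self.lora)
```

The payoff of this inversion is the `set_lora` path: adapter updates become a pure state change on the layer, with no runner re-construction.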

Still iterating on the kernel. More to come.