Shipping INTELLECT-3: Lessons from Async RL on 512 H200s
We shipped INTELLECT-3 a few weeks ago. This is a dump of things I wish I’d known going in, mostly around the distributed training side.
The disaggregated architecture
prime-rl separates the actor (inference) and learner (training) into independent services. This sounds obvious in hindsight, but the coordination problems are real. The actor generates rollouts, ships them to a buffer, and the learner pulls batches asynchronously. No lockstep.
The win is utilization — GPUs on the learner side aren’t sitting idle waiting for rollout generation, and vice versa. The cost is complexity in the data pipeline and making sure you’re not training on stale data.
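To make the shape of this concrete, here's a minimal sketch of the actor/buffer/learner handoff with a staleness bound. This is illustrative only, not prime-rl's actual API: the names `RolloutBuffer`, `push`, `pull_batch`, and `max_staleness` are all made up for this example. The key idea is tagging each rollout with the policy version that generated it, so the learner can refuse data that is too far behind.

```python
import threading
from collections import deque

class RolloutBuffer:
    """Illustrative async rollout buffer (not prime-rl's real interface).
    The actor pushes rollouts tagged with the policy version that generated
    them; the learner pulls a batch and drops anything more than
    `max_staleness` versions behind the current policy."""

    def __init__(self, max_staleness=2):
        self.max_staleness = max_staleness
        self._items = deque()
        self._lock = threading.Lock()

    def push(self, rollout, policy_version):
        # Called from the actor side; no lockstep with the learner.
        with self._lock:
            self._items.append((policy_version, rollout))

    def pull_batch(self, batch_size, current_version):
        # Called from the learner side.
        with self._lock:
            # Discard rollouts generated by a policy too many versions behind.
            self._items = deque(
                (v, r) for v, r in self._items
                if current_version - v <= self.max_staleness
            )
            batch = [r for _, r in list(self._items)[:batch_size]]
            for _ in range(len(batch)):
                self._items.popleft()
            return batch
```

The staleness cutoff is the knob that trades throughput against on-policy-ness: loosen it and the learner never starves, tighten it and you throw away more actor work.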
Things that bit us
Round-robin sampling in the RL packer had a subtle bug where keys weren't being populated correctly across data sources. It cost us a day of debugging what looked like a training instability but was actually a data issue.
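For flavor, here's a toy round-robin packer over named sources. This is not the prime-rl code and I'm not reproducing the actual bug, but it shows the invariant that was violated: every packed sample has to carry the key of the source it actually came from, or anything downstream that groups by source silently mixes data.

```python
from itertools import cycle

def round_robin(sources):
    """Toy round-robin packer over a dict of {source_name: samples}.
    Yields dicts tagged with the originating source key. Illustrative
    only; names and structure are made up for this sketch."""
    iterators = {name: iter(items) for name, items in sources.items()}
    order = cycle(list(iterators))
    exhausted = set()
    while len(exhausted) < len(iterators):
        name = next(order)
        if name in exhausted:
            continue
        try:
            sample = next(iterators[name])
        except StopIteration:
            exhausted.add(name)
            continue
        # The invariant: the tag must match the source actually drawn from.
        yield {"source": name, "sample": sample}
```

A bug of this class produces loss curves that look like optimization trouble, which is exactly why it ate a day: nothing crashes, the data is just subtly wrong.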
Microbatch step time variance was larger than expected. Parsing the training logs and looking at step time distributions helped us find a few nodes that were consistently slow — turned out to be NFS latency on checkpoint writes.
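The analysis itself is simple once you group step times by node. A sketch of the kind of script we're talking about, with a hypothetical log format (the `node=... step_time=...s` regex is made up; adapt it to whatever your trainer actually emits):

```python
import re
import statistics
from collections import defaultdict

# Hypothetical log line format for illustration; adjust to your logs.
LINE_RE = re.compile(r"node=(\S+) .* step_time=([\d.]+)s")

def slow_nodes(log_lines, threshold=1.5):
    """Group microbatch step times by node and flag nodes whose median
    step time exceeds `threshold` x the cluster-wide median."""
    times = defaultdict(list)
    for line in log_lines:
        m = LINE_RE.search(line)
        if m:
            times[m.group(1)].append(float(m.group(2)))
    all_times = [t for ts in times.values() for t in ts]
    cluster_median = statistics.median(all_times)
    return sorted(node for node, ts in times.items()
                  if statistics.median(ts) > threshold * cluster_median)
```

Using the median per node rather than the mean matters here: one slow checkpoint write per node is noise, a consistently shifted distribution is a sick node.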
Our multi-node torchrun setup with FSDP2 had issues with tied embedding weights that took a while to diagnose. The error messages were not helpful.
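The sanity check that would have saved us time is cheap and worth running before any sharding wrapper touches the model. This is a generic sketch, not our actual model code: the point is that weight tying must leave one shared parameter, done before the wrapper sees the module, and you can assert that at the storage level.

```python
import torch.nn as nn

class TinyLM(nn.Module):
    """Minimal model with tied input/output embeddings, for illustration."""
    def __init__(self, vocab=32, dim=8):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.lm_head = nn.Linear(dim, vocab, bias=False)
        # Tie AFTER both modules exist. A sharding wrapper must see a
        # single shared parameter, not two copies that later drift apart.
        self.lm_head.weight = self.embed.weight

def assert_tied(model):
    # Storage-level check: the two attributes are literally one tensor.
    assert model.lm_head.weight.data_ptr() == model.embed.weight.data_ptr()
    # And parameters() deduplicates, so the shared weight counts once.
    assert sum(p.numel() for p in model.parameters()) == 32 * 8

model = TinyLM()
assert_tied(model)
```

Run a check like this on the unwrapped model at startup; if tying is broken, you want a clear assertion failure on rank 0, not a cryptic collective error three layers deep in the wrapper.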
What worked well
The async architecture itself held up. Once the data pipeline was solid, scaling from 128 to 512 GPUs was mostly a cluster operations problem, not a training algorithm problem. That’s the right kind of problem to have.