Why we rewrote our event router in Rust
Our event router sits in front of everything. Every request, every webhook, every agent callback passes through it before it reaches a service. For three years it was a JVM application, and for most of those three years that was fine.
Then our p99 started telling a story our averages hid. Median latency was a clean 4ms. The 99th percentile wandered between 30ms and 200ms with no obvious cause — no traffic spike, no slow downstream. The cause, it turned out, was the garbage collector.
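To see how a clean median can coexist with an ugly tail, here is a minimal sketch of a nearest-rank percentile over latency samples. The function name `percentile` and the sample values in the usage note are illustrative assumptions, not the router's actual metrics code.

```rust
/// Nearest-rank percentile over latency samples in milliseconds.
/// Hypothetical helper for illustration; returns None for an empty
/// slice or a percentile outside 1..=100.
fn percentile(samples_ms: &[u64], pct: usize) -> Option<u64> {
    if samples_ms.is_empty() || pct == 0 || pct > 100 {
        return None;
    }
    let mut sorted = samples_ms.to_vec();
    sorted.sort_unstable();
    // rank = ceil(pct * n / 100), computed in integers to avoid float rounding.
    let rank = (pct * sorted.len() + 99) / 100;
    Some(sorted[rank - 1])
}
```

With 98 requests at 4ms and 2 requests at 200ms, `percentile(&samples, 50)` is 4 while `percentile(&samples, 99)` is 200. The median does not move at all, which is why a dashboard built on medians or averages can look healthy while the tail is not.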
Measuring before rewriting
The first rule of a rewrite is to earn it. We instrumented the router with detailed GC tracing and correlated pause times against the latency histogram. The correlation was almost perfect: every tail-latency excursion lined up with a young-generation collection.
// p99 latency vs GC pause, 24h window
correlation(p99_ms, gc_pause_ms) = 0.94
A 0.94 correlation is not a subtle hint. We weren’t going to tune our way out of it: a router allocates millions of short-lived envelope objects, and at that allocation rate the young generation fills and collects constantly, with each collection showing up as a pause in the tail.
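For readers who want to run the same check on their own data, here is a minimal sketch of a Pearson correlation between two aligned series, such as per-window p99 latency and per-window GC pause time. The post does not say which correlation measure or tooling was used, so treat Pearson, the function name `pearson`, and the pairing of the inputs as assumptions.

```rust
/// Pearson correlation coefficient between two equal-length series.
/// Hypothetical helper for illustration. Returns None when the lengths
/// differ, there are fewer than two points, or either series is constant.
fn pearson(xs: &[f64], ys: &[f64]) -> Option<f64> {
    if xs.len() != ys.len() || xs.len() < 2 {
        return None;
    }
    let n = xs.len() as f64;
    let mean_x = xs.iter().sum::<f64>() / n;
    let mean_y = ys.iter().sum::<f64>() / n;

    // Accumulate covariance and both variances in one pass over the pairs.
    let (mut cov, mut var_x, mut var_y) = (0.0, 0.0, 0.0);
    for (x, y) in xs.iter().zip(ys) {
        let (dx, dy) = (x - mean_x, y - mean_y);
        cov += dx * dy;
        var_x += dx * dx;
        var_y += dy * dy;
    }
    if var_x == 0.0 || var_y == 0.0 {
        return None; // correlation is undefined for a constant series
    }
    Some(cov / (var_x * var_y).sqrt())
}
```

Both slices need to be bucketed over the same time windows before calling it. A high coefficient shows that the two series move together; it was the timing match between individual excursions and individual collections that pointed at the collector as the cause.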
What Rust bought us
We rewrote the hot path — parse, route, dispatch — in Rust, leaving the control plane in Java. The results, after two weeks in production behind a canary:
- p99 dropped from a noisy 30–200ms to a flat 12ms.
- Memory footprint fell by 60%, because there is no heap headroom to reserve for a collector.
- The service stopped having “moods.” Latency became boring, which is the highest compliment we pay a system.
The goal was never “Rust is faster.” It was “make the tail predictable.” Predictability is the feature.
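The shape of a parse, route, dispatch hot path can be sketched in a few lines. Everything here is hypothetical: the `<topic>\n<payload>` wire format, the `Envelope` and `Router` types, and the `send` callback standing in for the real transport are assumptions made for illustration, not the production design. The point of the sketch is that the envelope borrows from the input buffer, so handling an event creates no per-event heap objects for a collector to chase.

```rust
use std::collections::HashMap;

/// Borrowed view of one event. It points into the input buffer,
/// so parsing allocates nothing on the heap.
#[derive(Debug, PartialEq)]
struct Envelope<'a> {
    topic: &'a str,
    payload: &'a [u8],
}

/// Parse the hypothetical wire format `<topic>\n<payload>`.
fn parse(raw: &[u8]) -> Option<Envelope<'_>> {
    let split = raw.iter().position(|&b| b == b'\n')?;
    let topic = std::str::from_utf8(&raw[..split]).ok()?;
    if topic.is_empty() {
        return None;
    }
    Some(Envelope { topic, payload: &raw[split + 1..] })
}

/// Static routing table from topic to service name (assumed shape).
struct Router {
    routes: HashMap<String, String>,
}

impl Router {
    fn new(routes: &[(&str, &str)]) -> Self {
        let routes = routes
            .iter()
            .map(|(topic, service)| (topic.to_string(), service.to_string()))
            .collect();
        Router { routes }
    }

    /// Look up the target service for an envelope's topic.
    fn route(&self, env: &Envelope<'_>) -> Option<&str> {
        self.routes.get(env.topic).map(String::as_str)
    }
}

/// Parse, route, dispatch. Returns false when the event is malformed
/// or has no route; `send` is a stand-in for the real transport.
fn handle<F: FnMut(&str, &[u8])>(router: &Router, raw: &[u8], mut send: F) -> bool {
    let env = match parse(raw) {
        Some(env) => env,
        None => return false,
    };
    let service = match router.route(&env) {
        Some(service) => service,
        None => return false,
    };
    send(service, env.payload);
    true
}
```

Given a router built with `Router::new(&[("orders", "order-service")])`, calling `handle` on the bytes `orders\n{"id":1}` invokes `send` with `order-service` and the payload bytes, while an unknown topic or a buffer with no newline returns `false` without calling `send`. The only allocations happen once, when the routing table is built.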
What it didn’t buy us
Rust did not make us faster on the median — the JVM was already excellent there. It cost us in iteration speed for the first month while the team built fluency, and it added a build-time dependency story we had to invest in. We would make the same call again, but only for the hot path, and only because we had the histogram to justify it.