company overview
Zed Industries (Zed, for short) is building the world’s fastest code editor for hyper-responsive and collaborative coding experiences between humans and agents. Built from scratch in Rust by the creators of Atom, Zed is designed for performance and control.
The Zed editor is open source, and so is Zeta, the open-source (and open-data) LLM powering Zed's Edit Prediction feature. Edit Prediction returns single-line or multi-line code suggestions that can be quickly accepted by pressing tab.
problems
When Zed first reached out to us, their Edit Prediction launch was one week away. Zeta—the model powering Edit Prediction—was built to anticipate coders’ next moves with a user experience that felt instantaneous. Powering that experience on launch day would require fast, high-throughput inference.
Zed was already using another well-known inference provider during Edit Prediction’s closed beta period, but they weren’t completely satisfied on a few fronts:
Latency wasn’t low enough. For launch, Zed was targeting a P90 under 500 ms and a P50 under 200 ms (i.e., 90% of requests completing in under 500 ms, and half in under 200 ms). Neither target was being met.
Compute was limited in both capacity and regions. Zed was being pushed into a fixed hardware commitment with no guarantee of being able to scale from 1 to n replicas on launch day. They also needed multi-region deployments to keep latency low for their international user base, but were only given access to a single region.
Their inference was a black box. With open source as one of their core values, Zed wanted more visibility into what was driving their model’s performance so their team could grow their own expertise and find iterative improvements.
Support was limited. Zed was seeking a more hands-on, engaged partner with the expertise to optimize their code generation use case and the ability to communicate what was being done and why.
solutions
We quickly paired Zed with forward-deployed and model performance engineers who optimized their code generation workload.
Our engineers are experts in inference infrastructure and modality-specific runtimes. Within days, our team had tried over 75 different performance optimizations, leading to a massive reduction in latency and increase in throughput.
“The ‘let’s get this thing going and then figure out the contract details later’ attitude is so much better than some of your competitors who wanted to tie us down first. We’re really impressed with the level of commitment and support from your engineers.”
For the final deployment, some of the biggest gains were achieved through:
TensorRT-LLM (vs. vLLM) as the inference framework, KV caching, and custom-tuned speculative decoding (see the sketch after this list) to massively reduce latency.
Lookahead decoding to increase throughput.
Multi-cloud capacity management for unlimited scale across clouds and regions.
Globally distributed GPUs, with geo-aware routing to keep latency low for users anywhere in the world.
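Speculative decoding drove some of the largest latency gains. The production configuration isn’t published here, but the core idea fits in a few lines: a small draft model proposes tokens cheaply, and the target model verifies them in a single batched pass. A minimal greedy sketch, where draft_model and target_model are hypothetical stand-in callables rather than real APIs:

```python
# Minimal sketch of greedy speculative decoding (illustrative only).
# draft_model(tokens) -> the next token from a small, fast model.
# target_model(tokens, draft) -> the large model's greedy next token after
# each prefix tokens, tokens + draft[:1], ..., tokens + draft[:k], computed
# in one batched forward pass (len(draft) + 1 predictions).

def speculative_decode(target_model, draft_model, prompt, max_new_tokens, k=4):
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. Draft: the cheap model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_model(tokens + draft))
        # 2. Verify: one target-model pass scores every draft position at once.
        verified = target_model(tokens, draft)
        # 3. Accept the longest prefix where the two models agree, plus one
        #    "free" target token at the first disagreement (or at the end).
        n_accepted = 0
        for d, t in zip(draft, verified):
            if d != t:
                break
            n_accepted += 1
        tokens += draft[:n_accepted] + [verified[n_accepted]]
    return tokens[len(prompt):]
```

Each loop iteration costs one target-model pass but can emit up to k + 1 tokens; much of the production tuning work is in choosing k and the draft model so acceptance rates stay high on real code.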
Our engineers custom-tuned the parameters of their speculative decoding algorithm and autoscaling settings to ensure optimal resource utilization at low latency. Our optimized autoscaling and multi-cloud capacity management enabled Zed to scale seamlessly from 1 to n GPUs across regions and clouds, powering low latency and unlimited capacity for their launch.
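The specific policy we tuned for Zed isn’t published, but as a loose illustration of concurrency-based autoscaling, the replica math can be as simple as the sketch below (target_concurrency and the replica bounds are made-up parameters, not our production values):

```python
import math

def desired_replicas(in_flight_requests: int,
                     target_concurrency: int = 32,
                     min_replicas: int = 1,
                     max_replicas: int = 64) -> int:
    """Scale so each replica stays near its target concurrent load."""
    needed = math.ceil(in_flight_requests / target_concurrency)
    return max(min_replicas, min(max_replicas, needed))

# A launch-day spike of 1,000 concurrent requests would call for 32 replicas.
print(desired_replicas(1000))  # -> 32
```

In practice, the hard parts are everything around that division: cold-start times, multi-region placement, and finding GPU capacity across clouds fast enough to honor it.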
“Our engineering team just wants to work with you guys. They told me really straightforwardly. And that goes a long way.”
results
Because we’re OpenAI compatible, the transition from Zed’s previous inference provider to the Baseten platform was painless. After thorough testing, Zed moved all of their traffic over to their deployments on the Baseten platform within a day—no code changes necessary.
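“OpenAI compatible” here means the standard OpenAI client works unchanged against a Baseten endpoint; switching providers is a base-URL and API-key swap. A sketch with the OpenAI Python SDK (the endpoint URL and model name below are placeholders, not Zed’s actual configuration):

```python
from openai import OpenAI

# The application code stays identical; only the endpoint and key change.
client = OpenAI(
    base_url="https://model-xxxxxx.api.baseten.co/v1",  # placeholder endpoint
    api_key="YOUR_BASETEN_API_KEY",                     # placeholder key
)

response = client.chat.completions.create(
    model="zeta",  # placeholder model name
    messages=[{"role": "user", "content": "Suggest the next edit."}],
)
print(response.choices[0].message.content)
```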
“We’re really appreciative of the support and just moving so quickly on this. The latency improvements have been so impressive, and on such a short timeline.”
As a result, we’ve exceeded their initial performance goals, powering:
100% uptime
45% lower P90 latency
3.6x higher throughput
Unlimited autoscaling across multiple regions
With additional performance improvements after launch, Zed now delivers over 2x faster Edit Prediction with Zeta on Baseten compared to their previous inference provider. And everything we’ve done—all of the performance optimizations—belongs to Zed. No black boxes.
"I want the best possible experience for our users, but also for our company. Baseten has hands down provided both. We really appreciate the level of commitment and support from your entire team."
what’s next
Our model performance team has continued to iterate on Zed’s code generation use case, shipping a custom version of lookahead decoding (“Baseten Lookahead”) that has shaved hundreds of milliseconds of additional latency off Zeta’s predictions, with more optimizations to come.
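Baseten Lookahead’s internals aren’t published here, but for background: vanilla lookahead decoding needs no draft model. It maintains a pool of n-grams harvested from the text generated so far and verifies matching candidates with the same accept-the-agreeing-prefix rule as speculative decoding. A toy sketch of just the n-gram pool (the full algorithm also generates fresh n-grams in parallel via Jacobi iteration):

```python
from collections import defaultdict

class NgramPool:
    """Toy n-gram pool for draft-model-free drafting: map a token to the
    (n - 1)-token continuations that have previously followed it."""

    def __init__(self, n: int = 4):
        self.n = n
        self.pool = defaultdict(list)

    def update(self, tokens):
        # Harvest every n-gram in the generated text so far.
        for i in range(len(tokens) - self.n + 1):
            key = tokens[i]
            cont = tuple(tokens[i + 1 : i + self.n])
            if cont not in self.pool[key]:
                self.pool[key].append(cont)

    def candidates(self, last_token):
        # Draft continuations to verify in the target model's next pass.
        return self.pool.get(last_token, [])
```

Code is unusually repetitive (identifiers, imports, boilerplate), so pool hit rates tend to be high, which is part of why lookahead-style methods suit edit prediction.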
Check out Zed’s blog or product roadmap to learn more about their editor, Agentic Editing, and Edit Prediction. (Customer or not—our engineers have always been huge Zed fans.)