
February 2026

Learnings from a 2M QPS Ads Endpoint

Notes from working on Core Ads Delivery at Meta — the retrieval phase of ad serving for Audience Network (third-party apps).

The ads retrieval system on Audience Network serves ~2M queries per second. At that scale, small inefficiencies compound fast. A 1% CPU regression means hundreds of machines. A bad throttling decision means either dropped revenue or overloaded infrastructure.

Throttling Based on Expected Serving Cost

The original throttling system made binary decisions — let a request through or drop it — based on simple load signals. The problem: not all requests cost the same to serve. A request with 500 candidate ads costs significantly more than one with 10, but the throttler treated them identically.

We rebuilt the throttling model to use expected serving cost as the decision input. Instead of "is the system busy?" the question became "can the system afford this specific request?" This meant cheaper requests could still flow during partial overload, while expensive requests got shed first.

The result was more accurate throttling — fewer false drops during normal traffic, faster shedding during genuine overload.
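The idea can be sketched in a few lines. This is an illustrative toy, not Meta's implementation: the class name, the linear cost model, and the per-interval budget are all assumptions made up for the example.

```python
class CostAwareThrottler:
    """Toy sketch of cost-based admission control: admit a request only
    if its expected serving cost fits in the remaining capacity budget."""

    def __init__(self, capacity_per_tick: float):
        self.capacity = capacity_per_tick  # abstract cost units per interval
        self.spent = 0.0

    def expected_cost(self, num_candidates: int) -> float:
        # Hypothetical cost model: fixed overhead plus per-candidate cost.
        return 1.0 + 0.05 * num_candidates

    def admit(self, num_candidates: int) -> bool:
        cost = self.expected_cost(num_candidates)
        if self.spent + cost > self.capacity:
            return False  # shed this request; cheaper ones may still fit
        self.spent += cost
        return True

    def tick(self) -> None:
        self.spent = 0.0  # new interval, fresh budget
```

Under partial overload this naturally sheds the expensive requests first: with a budget of 20 units, a 500-candidate request (cost 26) is dropped while 10- and 100-candidate requests still get through.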

Moving Filtering Before Ranking

The retrieval pipeline had a classic ordering problem: we were running the full ranking stage on ads that would later be filtered out for eligibility reasons (targeting mismatch, budget exhaustion, policy violations). The ranking stage is expensive — it involves ML model inference for each candidate.

We moved the eligibility filters upstream, before ranking. Ads that can't possibly win get dropped before we spend compute scoring them. This saved 7% CPU across the retrieval system — a significant reduction at this scale.
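The reordering is simple to illustrate. In this sketch the `Ad` fields, the eligibility checks, and `rank_score` standing in for model inference are all invented for the example; the point is only that cheap predicates run before the expensive scoring step:

```python
from dataclasses import dataclass

@dataclass
class Ad:
    id: int
    targeting_ok: bool   # stand-in for targeting-match checks
    budget_left: float   # stand-in for budget/policy checks
    score_hint: float    # stand-in for features fed to the ranking model

def is_eligible(ad: Ad) -> bool:
    # Cheap eligibility checks, run before any model inference.
    return ad.targeting_ok and ad.budget_left > 0

def rank_score(ad: Ad) -> float:
    # Stand-in for expensive per-candidate ML inference.
    return ad.score_hint

def retrieve(candidates: list[Ad]) -> list[Ad]:
    eligible = [ad for ad in candidates if is_eligible(ad)]  # filter first
    return sorted(eligible, key=rank_score, reverse=True)    # rank only survivors
```

Ads that fail targeting or have no budget never reach `rank_score`, which is where the CPU savings come from: the expensive stage now runs only on candidates that could actually win.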

What I Took Away

Working on a 2M QPS system teaches you to think in distributions, not averages. P50 latency can look fine while P99 is on fire. A throttling model that works at 1M QPS can fall apart at 2M because the cost distribution shifts. And the biggest wins often come from reordering operations — not from making individual operations faster.
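The P50-versus-P99 point is easy to see with a synthetic heavy-tailed latency sample (exponentially distributed here, purely for illustration):

```python
import random

# Toy latency sample: mostly fast requests with a long tail.
random.seed(0)
latencies = sorted(random.expovariate(1 / 10) for _ in range(10_000))

def percentile(sorted_vals: list[float], p: float) -> float:
    # Nearest-rank percentile; fine for illustration.
    idx = min(len(sorted_vals) - 1, int(p / 100 * len(sorted_vals)))
    return sorted_vals[idx]

p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)
# For this exponential tail, P99 is roughly 6-7x P50: a median that
# "looks fine" says almost nothing about the tail your users hit.
```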