Uber processes trillions of Kafka messages per day across the company. Our team owned the 15 pipelines that covered all trip and driver event data — every ride request, driver location update, trip state change, and earnings event. These fed an internal tool used by customer support agents and engineers to debug issues in real time.
The Architecture
The system had three layers:
- UI layer — internal tool for querying and debugging trip/driver data live
- Backend service layer — API serving queries against real-time and historical data
- Data pipeline layer — Apache Flink stream processing, ~300K messages/sec across 15 Kafka topics
Flink jobs wrote to both a real-time data store (for live queries) and a persistent Hive data store. We maintained ~1.5 PB of driver data and ~500–600 TB of trip data.
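To make the dual-write pattern concrete, here is a minimal conceptual sketch (not Uber's actual code; the store objects and record shape are invented) of a single processing step fanning each event out to both destinations:

```python
def process(events, realtime_store, hive_store):
    """Fan each incoming event out to two sinks: a low-latency store
    that serves live debugging queries, and a persistent archive for
    historical queries. Both sinks see every event exactly once here."""
    for event in events:
        realtime_store.append(event)  # live-query path
        hive_store.append(event)      # historical/archive path

# Toy usage: plain lists stand in for the two stores.
realtime, hive = [], []
process([{"trip_id": 1, "state": "en_route"}], realtime, hive)
```

In a real Flink job the equivalent would be one `DataStream` with two sinks attached, rather than an explicit loop, but the data-flow shape is the same.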
The $900K Optimization
The original architecture had two separate Flink streams for each data type — one writing to the real-time store, one writing to Hive for historical queries. Same data, processed twice, stored in different formats.
I identified this as unnecessary duplication and executed a two-part optimization:
- Migrated the serialization format from JSON to Protobuf — reduced storage footprint and improved serialization/deserialization performance
- Merged the two Flink streams into one — a single stream that writes to both stores, eliminating redundant processing
Combined savings: ~$900K/year across compute (from consolidating the Flink jobs) and storage.
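The storage side of the savings comes from the serialization change. Schema-based formats like Protobuf omit per-record field names and encode numbers compactly, while JSON repeats field names as text in every record. The sketch below is illustrative only (the event fields are hypothetical, and Python's `struct` stands in for a schema-based binary encoding; real Protobuf output differs in detail but saves space for the same reasons):

```python
import json
import struct

# A hypothetical trip event. In JSON, the field names travel with
# every record and the integers are written out as decimal text.
event = {"trip_id": 123456789, "driver_id": 987654321, "state": 3}
json_bytes = json.dumps(event).encode("utf-8")

# With a shared schema, only the values need to be written:
# here, two 8-byte integers and a 1-byte state code.
binary_bytes = struct.pack(
    "<qqB", event["trip_id"], event["driver_id"], event["state"]
)

print(len(json_bytes), len(binary_bytes))  # the binary form is several times smaller
```

Multiplied across hundreds of thousands of messages per second and petabytes at rest, that per-record difference is where the storage portion of the savings comes from.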
Reliability at Scale
These pipelines had to stay up through every peak period — holidays, major events, surge periods. A pipeline outage meant customer support agents couldn't debug live issues and engineers couldn't investigate production problems. The reliability bar was high because the blast radius of a failure was operational, not just technical.