Uber processes trillions of Kafka messages per day across the company. Our team owned the 15 pipelines that covered all trip and driver event data — every ride request, driver location update, trip state change, and earnings event. These fed an internal tool used by customer support agents and engineers to debug issues in real time.
The Architecture
The system had three layers:
- UI layer — internal tool for querying and debugging trip/driver data live
- Backend service layer — API serving queries against real-time and historical data
- Data pipeline layer — Apache Flink stream processing, ~300K messages/sec across 15 Kafka topics
Flink jobs wrote to both a real-time data store (for live queries) and a persistent Hive data store. We maintained ~1.5 PB of driver data and ~500–600 TB of trip data.
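To make the dual-write pattern concrete, here is a minimal conceptual sketch (not Uber's actual code; the store objects and record shape are invented) of a single processing step fanning each event out to both destinations:

```python
def process(events, realtime_store, hive_store):
    """Fan each incoming event out to two sinks: a low-latency store
    that serves live debugging queries, and a persistent archive for
    historical queries. Both sinks see every event exactly once here."""
    for event in events:
        realtime_store.append(event)  # live-query path
        hive_store.append(event)      # historical/archive path

# Toy usage: plain lists stand in for the two stores.
realtime, hive = [], []
process([{"trip_id": 1, "state": "en_route"}], realtime, hive)
```

In a real Flink job the equivalent would be one `DataStream` with two sinks attached, rather than an explicit loop, but the data-flow shape is the same.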
The $900K Optimization
The original architecture had two separate Flink streams for each data type — one writing to the real-time store, one writing to Hive for historical queries. Same data, processed twice, stored in different formats.
I identified this as unnecessary duplication and executed a two-part optimization:
- Migrated the serialization format from JSON to Protobuf — reduced storage footprint and improved serialization/deserialization performance
- Merged the two Flink streams into one — a single stream that writes to both stores, eliminating redundant processing
Combined savings: ~$900K/year across compute (from consolidating the Flink jobs) and storage.
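The storage side of the savings comes from the serialization change. Schema-based formats like Protobuf omit per-record field names and encode numbers compactly, while JSON repeats field names as text in every record. The sketch below is illustrative only (the event fields are hypothetical, and Python's `struct` stands in for a schema-based binary encoding; real Protobuf output differs in detail but saves space for the same reasons):

```python
import json
import struct

# A hypothetical trip event. In JSON, the field names travel with
# every record and the integers are written out as decimal text.
event = {"trip_id": 123456789, "driver_id": 987654321, "state": 3}
json_bytes = json.dumps(event).encode("utf-8")

# With a shared schema, only the values need to be written:
# here, two 8-byte integers and a 1-byte state code.
binary_bytes = struct.pack(
    "<qqB", event["trip_id"], event["driver_id"], event["state"]
)

print(len(json_bytes), len(binary_bytes))  # the binary form is several times smaller
```

Multiplied across hundreds of thousands of messages per second and petabytes at rest, that per-record difference is where the storage portion of the savings comes from.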
Reliability at Scale
These pipelines had to stay up through every peak period — holidays, major events, surge periods. A pipeline outage meant customer support agents couldn't debug live issues and engineers couldn't investigate production problems. The reliability bar was high because the blast radius of a failure was operational, not just technical.