
30% More Performance Across a 5,000-Server Fleet

At Index Exchange I built an ML-enhanced rate-limiting model that lifted server performance 30% across a 5,000-server global fleet while holding 95% of revenue, and identified a $10,000-a-month cost reduction in the fleet's infrastructure telemetry.

Status

Enterprise Employment

Domain

Infrastructure

Headline result

30% performance gain at 95% revenue retention across 5,000 servers; $10,000/month cost reduction identified and acted on

Demonstrates

ML systems at scale, Performance engineering, Infrastructure cost analysis

Representative stack

Go, Python, Perl, Prometheus, InfluxDB

Fleet

  • 5,000 global servers
  • Real-time ad auction traffic

Model

  • ML-enhanced rate limiting (Go, Perl, Python)
  • Decision-tree tuning
  • 6-engineer ML pipeline

Measurement

  • Prometheus + InfluxDB analysis
  • $10K/month cost reduction surfaced
An ML rate-limiting model that gave up 5% of revenue for a 30% performance gain

Situation

This is enterprise work at large scale. At Index Exchange, a real-time advertising exchange, I worked on the performance of a fleet of roughly 5,000 servers handling auction traffic, where latency and throughput translate directly into revenue. At that scale, small percentage changes add up to large absolute numbers.

Problem

The fleet was carrying more load than it comfortably should, and the simple answer (add more servers) meant adding cost linearly, indefinitely. The harder but cheaper answer was to make the existing fleet do more. The constraint was that the same traffic carried the revenue: any rate limiting that shed load risked shedding income with it.

Approach

I built an ML-enhanced rate-limiting model in Go, Perl, and Python that decided which requests to shed and which to keep based on each request's revenue value rather than on a flat threshold. The model was tuned continuously through a scalable ML pipeline I built with a six-engineer team, using decision-tree modeling to keep the trade-off calibrated as traffic patterns shifted. In parallel, I analyzed the fleet’s Prometheus and InfluxDB telemetry in Jupyter to find where resources were being spent without return.
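To make "shed by revenue value, not by a flat threshold" concrete, here is a minimal Python sketch. It assumes each request already carries a model-estimated revenue value and that a server can admit a fixed number of requests per scheduling tick; `Request`, `expected_value`, `shed_by_value`, and `revenue_retained` are hypothetical names for illustration, not the production implementation.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    # Hypothetical: revenue the model expects this request to earn if served.
    expected_value: float


def shed_by_value(batch: list[Request], capacity: int) -> tuple[list[Request], list[Request]]:
    """Admit the `capacity` highest-value requests in a batch and shed the rest.

    A flat rate limiter drops whatever arrives after the limit is hit;
    ranking by expected value keeps the requests worth the most instead.
    """
    ranked = sorted(batch, key=lambda r: r.expected_value, reverse=True)
    keep = max(0, capacity)
    return ranked[:keep], ranked[keep:]


def revenue_retained(admitted: list[Request], batch: list[Request]) -> float:
    """Fraction of the batch's expected revenue carried by the admitted requests."""
    total = sum(r.expected_value for r in batch)
    if total == 0:
        return 1.0
    return sum(r.expected_value for r in admitted) / total


if __name__ == "__main__":
    # Illustrative numbers only.
    batch = [
        Request("a", 5.0),
        Request("b", 0.1),
        Request("c", 3.0),
        Request("d", 0.0),
        Request("e", 2.0),
    ]
    admitted, shed = shed_by_value(batch, capacity=3)
    print([r.request_id for r in admitted])  # ['a', 'c', 'e']
    print(f"{revenue_retained(admitted, batch):.3f}")  # 0.990
```

In this toy batch, shedding two of five requests keeps about 99% of expected revenue because the shed requests were the ones predicted to earn little. The whole approach rests on the value estimate staying accurate.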

Architecture and key decisions

  • Rate limiting by revenue value, not by volume. The model protected the requests that paid and shed the ones that did not, which is how a 30% performance gain cost only 5% of revenue.
  • Continuous tuning over a one-time model. Traffic on an exchange changes constantly, so the decision-tree model was retuned through the pipeline rather than trained once and left to drift.
  • Telemetry read for cost. Prometheus and InfluxDB data usually get watched for outages. Reading the same data for cost surfaced a $10,000-a-month reduction that informed how resources were allocated.
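The continuous-tuning idea can be sketched with the simplest possible decision tree, a depth-1 stump, refit on a sliding window of recent traffic. This is a stand-in, not the model described above: it assumes one hypothetical numeric feature per request and a logged label for whether the request went on to earn revenue, and the names `Sample`, `fit_stump`, and `RollingRetuner` are illustrative. A production model would use deeper trees and more features.

```python
from __future__ import annotations

import math
from collections import deque
from dataclasses import dataclass


@dataclass
class Sample:
    feature: float  # hypothetical per-request signal (e.g. a predicted bid)
    paid: bool      # logged outcome: did serving this request earn revenue?


def fit_stump(samples) -> float:
    """Fit a depth-1 decision tree of the form "serve if feature >= t".

    Tries every observed feature value (plus +inf, i.e. serve nothing) as the
    threshold and returns the one with the fewest misclassified samples.
    Ties keep the lower threshold, so the stump errs toward serving more.
    """
    if not samples:
        raise ValueError("need at least one sample to fit")
    candidates = sorted({s.feature for s in samples}) + [math.inf]
    best_t, best_err = candidates[0], math.inf
    for t in candidates:
        err = sum((s.feature >= t) != s.paid for s in samples)
        if err < best_err:
            best_t, best_err = t, err
    return best_t


class RollingRetuner:
    """Refit the stump every `refit_every` samples on the last `window` samples,
    so the threshold follows traffic instead of drifting away from it."""

    def __init__(self, window: int = 1000, refit_every: int = 100) -> None:
        self.samples: deque[Sample] = deque(maxlen=window)
        self.refit_every = refit_every
        self.seen = 0
        self.threshold: float | None = None

    def observe(self, sample: Sample) -> None:
        self.samples.append(sample)
        self.seen += 1
        if self.seen % self.refit_every == 0:
            self.threshold = fit_stump(self.samples)

    def should_serve(self, feature: float) -> bool:
        # Until the first refit there is no threshold, so serve everything.
        return self.threshold is None or feature >= self.threshold
```

The window size and refit interval are the two knobs that trade responsiveness against noise: a short window reacts quickly to shifting traffic but refits on fewer samples.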

What shipped

The production rate-limiting model running across the 5,000-server fleet, the ML pipeline that kept it tuned, and an infrastructure-cost analysis that turned monitoring data into a concrete monthly saving.
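Reading monitoring data for cost can start as simply as a right-sizing estimate per server pool. The sketch below is an illustration under stated assumptions, not the analysis behind the $10,000 figure: it assumes per-server busiest-hour CPU utilization has already been exported from Prometheus or InfluxDB into plain records, that load can be moved freely between servers in the same pool, and that `ServerStats`, `target_util`, and `surplus_by_pool` are hypothetical names.

```python
from __future__ import annotations

import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ServerStats:
    host: str
    pool: str
    peak_cpu: float      # hypothetical: CPU utilization in the pool's busiest hour, 0..1
    monthly_cost: float  # hypothetical: fully loaded monthly cost in USD


def surplus_by_pool(stats: list[ServerStats], target_util: float = 0.6) -> dict[str, tuple[int, float]]:
    """For each pool, estimate how many servers could be removed if the pool's
    peak load were packed onto servers running at `target_util`, and what the
    removed servers cost per month. Pools with no surplus are omitted."""
    if not 0 < target_util <= 1:
        raise ValueError("target_util must be in (0, 1]")
    pools: dict[str, list[ServerStats]] = defaultdict(list)
    for s in stats:
        pools[s.pool].append(s)

    report: dict[str, tuple[int, float]] = {}
    for pool, members in pools.items():
        demand = sum(m.peak_cpu for m in members)  # peak load in server-equivalents
        needed = max(1, math.ceil(demand / target_util))
        surplus = len(members) - needed
        if surplus > 0:
            avg_cost = sum(m.monthly_cost for m in members) / len(members)
            report[pool] = (surplus, surplus * avg_cost)
    return report
```

The `target_util` headroom is the judgment call: set too high, the estimate recommends removing capacity a pool needs for bursts.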

Outcome

Server performance rose 30% while 95% of revenue was retained across the 5,000-server fleet, and the telemetry analysis surfaced a $10,000-a-month cost reduction that was acted on. The same hardware did substantially more work.

What this demonstrates

Getting more out of infrastructure you already pay for is one of the cleanest forms of margin there is, and it is mostly a measurement-and-modeling problem. When I tell an operator their systems are leaving capacity or money on the table, the method is the one used here: instrument it, model the trade-off that matters, and tune it as reality moves.

The playbooks behind this work