CuprBot Labs: Double Your Capacity. Same Payroll.

Situation

This is enterprise work, presented as exactly that. At Yelp I worked on the data infrastructure that the company’s storage, machine learning, backups, and customer reporting all depended on. The pipeline moved more than 30 petabytes a month. At that scale, the cost of running it and the cost of it breaking are both large numbers, and both land on the same small team.

Problem

A pipeline of that size running on AWS carries two pressures at once. The monthly bill grows with usage whether or not the work gets more valuable, and an outage means lost data and broken customer-facing reports. The team needed to lower the cost structure and raise reliability at the same time, without pausing a system that never stops moving data.

Approach

I co-designed a scalable infrastructure service that migrated the AWS-hosted pipeline onto Kubernetes, working with a team of five senior engineers. The migration was structured so the running system stayed running: the same 30+ PB/month kept flowing while the cost base underneath it changed. Alongside the migration I built a Python workflow state manager on Yelp’s PaaSTA platform, with node discovery, isolation, load balancing, authorization, and integrated alerting and monitoring.

Architecture and key decisions

Cost re-architected at the scheduling layer. Moving to Kubernetes changed how compute was scheduled and packed, which is where the several-thousand-dollars-a-day saving came from. Right-sizing a bill this large is an architecture decision, one that a line-by-line spreadsheet review never reaches.
A state manager so failures stop early. The workflow state manager prevented several data-loss incidents by tracking what each stage had done and isolating failures before they spread downstream.
Hours given back to the team. The same automation returned hundreds of engineering hours a month that had been spent babysitting jobs and recovering from partial failures.
Built for on-call reality. Integrated alerting and monitoring meant high-severity incidents on the 24/7 on-call rotation were diagnosed and root-caused, not just cleared by restarting the affected jobs.
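
To make the state-manager idea concrete, here is a minimal Python sketch of tracking per-stage state and isolating a failure so that dependent stages do not run. It is an illustration only, not the Yelp implementation: the class names (WorkflowStateManager, Stage, StageState), the stage names, and the in-memory state are all assumptions made for the example, and it leaves out PaaSTA integration, node discovery, authorization, and alerting.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List


class StageState(Enum):
    PENDING = "pending"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    SKIPPED = "skipped"  # isolated because an upstream stage did not succeed


@dataclass
class Stage:
    """Hypothetical stage: a name, a callable, and upstream dependencies."""
    name: str
    run: Callable[[], None]
    depends_on: List[str] = field(default_factory=list)


class WorkflowStateManager:
    """Illustrative only. Assumes stages are listed in dependency order
    and that every name in depends_on refers to an earlier stage."""

    def __init__(self, stages: List[Stage]) -> None:
        self.stages = list(stages)
        self.state: Dict[str, StageState] = {
            s.name: StageState.PENDING for s in self.stages
        }
        self.errors: Dict[str, str] = {}

    def run(self) -> Dict[str, StageState]:
        for stage in self.stages:
            # A stage that already succeeded is not repeated on a rerun.
            if self.state[stage.name] is StageState.SUCCEEDED:
                continue
            # Isolate the failure: skip anything whose upstream did not succeed.
            if any(self.state[d] is not StageState.SUCCEEDED
                   for d in stage.depends_on):
                self.state[stage.name] = StageState.SKIPPED
                continue
            try:
                stage.run()
                self.state[stage.name] = StageState.SUCCEEDED
            except Exception as exc:
                self.state[stage.name] = StageState.FAILED
                self.errors[stage.name] = str(exc)
        return self.state


def _fail() -> None:
    raise RuntimeError("corrupt partition")


# Hypothetical four-stage workflow: "transform" fails, so "load" is skipped,
# while "audit" (which only needs "extract") still runs.
manager = WorkflowStateManager([
    Stage("extract", lambda: None),
    Stage("transform", _fail, depends_on=["extract"]),
    Stage("load", lambda: None, depends_on=["transform"]),
    Stage("audit", lambda: None, depends_on=["extract"]),
])
manager.run()
for name, state in manager.state.items():
    print(name, state.value)
```

In this sketch a rerun of manager.run() retries only the failed and skipped stages, since stages already marked as succeeded are passed over. A production version would need durable state storage and alert hooks, both omitted here.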

What shipped

A Kubernetes-based infrastructure service carrying the full 30+ PB/month workload, a Python workflow state manager running on PaaSTA, and the monitoring and alerting around both. The cost base dropped by several thousand dollars a day and the recurrence rate of data-infrastructure incidents fell.

Outcome

The pipeline kept processing 30+ petabytes a month through the migration, the daily cost fell by several thousand dollars, and the workflow state manager returned hundreds of engineering hours a month while preventing data-loss incidents. The reliability work showed up as fewer repeat high-severity pages.

What this demonstrates

Cost and reliability at this scale are the same skill: understanding where the money and the risk actually live in a distributed system, then changing the architecture so both improve together. When I tell an operator that their cloud bill or their data platform can be made leaner without breaking what runs on it, the judgment comes from doing it at petabyte scale.