A Centralized LLM Platform for a 100M-User Product

On contract at a consumer social platform with 100M+ users, I architected the centralized LLM orchestration that put premium AI features into production, migrated models to clear a 40M-event backlog, and chartered the internal AI Guild.

Status

Enterprise Employment

Domain

AI Engineering

Headline result

Premium AI features in production for 100M+ users; a 40M-event backlog cleared on a model migration; an internal AI Guild chartered

Demonstrates

LLM platform architecture, AI evaluation and governance, AI org leadership

Representative stack

LLM orchestration, AWS Bedrock, OpenAI, Braintrust, BAML, Redis, WebSockets

Platform

  • Centralized LLM orchestration
  • Premium AI features
  • Decoupled from core systems

Reliability

  • Bedrock to OpenAI migration
  • Rate limiting + circuit breakers
  • 40M-event backlog cleared

Quality

  • Braintrust + BAML eval pipeline
  • Golden datasets
  • Prompt-drift gates before release

AI delivery separated from core systems, with evaluation and governance built in

Situation

This is contract enterprise work, with the client anonymized. The product was a consumer social platform with more than 100 million users that wanted to ship premium AI features. I served as the engineering organization’s go-to AI authority and owned the backend and AI delivery for its premium subscription line.

Problem

AI features at that scale fail in a specific way. Each one gets wired directly into core systems by a different team, the cost and latency are nobody’s job, and quality is judged by whoever wrote the prompt. The platform needed AI delivery that was centralized, observable, and safe to ship to 100 million people, and it needed an organization that knew how to operate it.

Approach

I architected a centralized LLM orchestration platform that productionized the premium AI features and separated AI delivery from the core systems, so new features shipped against a shared platform instead of being rebuilt each time. I drove a multi-model migration from AWS Bedrock to OpenAI, guarded by rate limiting and circuit breakers, that cleared a 40-million-event processing backlog while lowering operating cost and latency. To keep quality measurable, I built an automated evaluation pipeline on Braintrust and BAML, with golden datasets that caught prompt drift and gated summary quality before release.
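To make the rate limiting and circuit breaking concrete, here is a minimal sketch of how the two controls can work together while draining a backlog. It is an illustration under my own assumptions, not the client's code: the names `TokenBucket`, `CircuitBreaker`, and `drain` are hypothetical, the thresholds are placeholders, and `process` stands in for whatever call reaches the model provider.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is rejecting calls during its cooldown."""


class TokenBucket:
    """Permits `rate` calls per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate, self.capacity, self.clock = rate, capacity, clock
        self.tokens = float(capacity)
        self.last = clock()

    def try_acquire(self):
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures, then allows one
    trial call once `reset_after` seconds have passed."""

    def __init__(self, max_failures, reset_after, clock=time.monotonic):
        self.max_failures, self.reset_after, self.clock = max_failures, reset_after, clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args):
        if self.opened_at is not None and self.clock() - self.opened_at < self.reset_after:
            raise CircuitOpenError("provider circuit is open")
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            # A failed trial call after cooldown re-opens the circuit immediately.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.opened_at = None
        return result


def drain(events, process, bucket, breaker):
    """Process queued events until the rate budget runs out or a call fails.

    Returns (results, events still queued)."""
    done, remaining = [], list(events)
    while remaining and bucket.try_acquire():
        try:
            done.append(breaker.call(process, remaining[0]))
        except Exception:
            break  # failed or circuit open: keep the event queued for a later pass
        remaining.pop(0)
    return done, remaining
```

In a sketch like this, `drain` would run on a scheduler tick. The bucket caps how hard the backlog pushes on the provider, and the breaker stops the worker from burning its budget on a provider that is already failing, so unprocessed events stay queued rather than being lost.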

Architecture and key decisions

  • A platform, so AI delivery is decoupled from core systems. Features shipped against one orchestration layer, which raised feature velocity and made cost and latency observable in one place.
  • Migration protected by rate limiting and circuit breakers. Moving models under live 100M-user load needs guardrails. The same controls that made the migration safe let the pipeline work through the 40M-event backlog instead of drowning in it.
  • Evaluation as a release gate. Golden datasets and an automated eval pipeline meant a prompt change that quietly degraded quality was caught before users saw it, the same way a test suite gates code.
  • Real-time delivery where it mattered. I designed a decoupled real-time notification system on Redis and WebSockets, shipped across the premium product line and a new freemium tier.
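The evaluation-as-release-gate decision above can be sketched in a few lines. This is a simplified stand-in, not the Braintrust or BAML pipeline itself: `GoldenCase`, `score_case`, and `release_gate` are hypothetical names, the term-coverage scorer is a deliberately crude proxy for summary quality, and the `min_score` and `max_drop` thresholds are illustrative values.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class GoldenCase:
    input_text: str
    required_terms: Tuple[str, ...]  # facts a good summary must mention


def score_case(summary: str, case: GoldenCase) -> float:
    """Fraction of required terms present in the summary (a stand-in scorer)."""
    text = summary.lower()
    hits = sum(term.lower() in text for term in case.required_terms)
    return hits / len(case.required_terms)


def release_gate(
    summarize: Callable[[str], str],
    golden: Sequence[GoldenCase],
    baseline: float,
    min_score: float = 0.8,
    max_drop: float = 0.05,
) -> Tuple[bool, float]:
    """Run the candidate prompt over the golden set and decide whether it ships.

    Fails if the mean score is below an absolute floor, or if it has
    drifted more than `max_drop` below the last released baseline."""
    score = mean([score_case(summarize(c.input_text), c) for c in golden])
    passed = score >= min_score and (baseline - score) <= max_drop
    return passed, score


GOLDEN = [
    GoldenCase("Alice posted a photo from Lisbon and tagged Bob.",
               ("Alice", "Lisbon", "Bob")),
    GoldenCase("Your friend Chen started a live video about hiking.",
               ("Chen", "live video", "hiking")),
]
```

Wired into CI, a gate like this would block a deploy when `passed` is false, which is the sense in which an eval suite gates prompts the way a test suite gates code. The drift check against `baseline` matters as much as the floor: a prompt can stay above the minimum and still be noticeably worse than what users already have.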

What shipped

A centralized LLM orchestration platform serving premium AI features in production, a completed Bedrock-to-OpenAI migration with the backlog cleared, an automated evaluation pipeline gating quality, a real-time notification system, and the backend and AI for a premium subscription with roughly 20,000 subscribers at about $60 a membership and near-99% gross margin. Alongside the systems, I chartered the company’s internal AI Guild and drove AI upskilling across multiple teams.

Outcome

Premium AI features ran in production for a 100M+ user product on a shared, observable platform; the model migration cleared a 40-million-event backlog while reducing cost and latency; and the organization gained an AI Guild and a working evaluation discipline it did not have before. The technical work and the organizational work shipped together.

What this demonstrates

Putting AI into production at scale is two jobs at once: building the platform (orchestration, migration safety, and evaluation gates) and building the organization that owns it. I have done both on the same engagement, for a product with more than 100 million users. That is the pattern I bring to companies trying to get past their first stalled AI pilot.
