170-Page PDFs Into Structured Product Data

Genetic testing labs ship results as massive PDFs. I built the extraction pipeline that turns them into structured, queryable product data, live against three lab formats.

Status

Pilot

Domain

Data & Revenue

Headline result

238 markers extracted from a 170-page report; 3 live lab-format integrations; 21,297-item reference database

Demonstrates

AI document ETL · Data productization · Partner integration

Representative stack

Python · LLM extraction · PostgreSQL · REST API · PDF parsing

Input

  • 170-page genetic PDFs
  • 3 lab formats (dnaPower, Biomune, MapMyBiome)

Extraction

  • Marker parsing
  • Normalization + validation

Product

  • 21,297-food reference DB
  • Production API backend (15 routes)

Unstructured lab reports in, queryable product data out

Situation

Genetic testing companies deliver results as enormous PDF reports: 170+ pages of markers, findings, and recommendations locked in a format that only a human can read, and only slowly. A partnership discussion with a real lab (dnaPower) needed those reports turned into something software could act on. This is a pilot: real partner, deployed production backend, honestly pre-revenue.

Problem

The PDF is where the value goes to die. A 243-page report containing 68 clinically relevant findings is useless to any downstream product (personalization, meal planning, recommendations) until every marker is extracted, normalized across labs that each format their reports differently, and validated. Manual re-keying at this density is error-prone enough to be dangerous.

Approach

Build the extraction pipeline against real documents from day one, not idealized samples. Each lab format gets a parsing profile; extracted markers are normalized into a common schema and validated before they touch the database. The structured output feeds a production API backend and a 21,297-item food reference database that turns raw markers into usable product logic.
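
A minimal sketch of the profile-per-format idea, under loud assumptions: the two line layouts, the `lab_a`/`lab_b` names, the `Marker` fields, and the `extract` helper are all hypothetical and do not reflect any real lab's report format or the production schema. It only illustrates how separate parsers can feed one common record type.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Marker:
    """Common schema shared by every lab format (fields are illustrative)."""
    lab: str
    marker_id: str
    result: str


def parse_lab_a(text: str) -> List[Marker]:
    # Hypothetical layout: one marker per line, e.g. "rs1801133  CT".
    pattern = re.compile(r"^(rs\d+)[ \t]+([ACGT]{2})$", re.MULTILINE)
    return [Marker("lab_a", m.group(1), m.group(2)) for m in pattern.finditer(text)]


def parse_lab_b(text: str) -> List[Marker]:
    # Hypothetical layout: "Marker: RS1801133 | Genotype: c/t".
    pattern = re.compile(r"Marker:\s*(\w+)\s*\|\s*Genotype:\s*([A-Za-z])/([A-Za-z])")
    return [
        # Normalize case and drop the slash so both labs emit the same shape.
        Marker("lab_b", m.group(1).lower(), (m.group(2) + m.group(3)).upper())
        for m in pattern.finditer(text)
    ]


# One parsing profile per lab format; there is deliberately no universal fallback.
PROFILES: Dict[str, Callable[[str], List[Marker]]] = {
    "lab_a": parse_lab_a,
    "lab_b": parse_lab_b,
}


def extract(lab: str, text: str) -> List[Marker]:
    """Dispatch to the profile for this lab, failing loudly on unknown formats."""
    try:
        parser = PROFILES[lab]
    except KeyError:
        raise ValueError(f"no parsing profile for lab format {lab!r}") from None
    return parser(text)
```

Calling `extract("lab_b", "Marker: RS1801133 | Genotype: c/t")` yields a `Marker` in the same shape as the `lab_a` output, which is the point: downstream code sees one schema regardless of which lab produced the PDF. An unknown format raises instead of guessing.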

Architecture and key decisions

  • Format profiles over a universal parser. Three labs, three layouts; pretending one parser handles all of them is how silent extraction errors happen. Each format earned its own validated profile: dnaPower at 238 markers, Biomune at 68 findings across 243 pages, MapMyBiome at 150+.
  • Validation before storage. Extraction confidence is checked before data enters the product path. In a health-adjacent domain, a wrong marker is worse than a missing one.
  • A reference database as the value multiplier. Markers alone are trivia; joined against 21,297 foods they become a product.
  • Deployed, not demoed. The backend runs in production with 15 API routes. The partner conversation happens against a working system.
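
The validation gate and the reference join from the list above can be sketched roughly as follows. Everything named here is an assumption for illustration: the `Extracted` record, the 0.0 to 1.0 confidence score, the `MIN_CONFIDENCE` threshold, and the placeholder `FOOD_RULES` table (which stands in for the food reference database and carries no real nutritional meaning).

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Extracted:
    marker_id: str
    result: str
    confidence: float  # assumed 0.0-1.0 score attached by the extraction step


MIN_CONFIDENCE = 0.9  # illustrative threshold, not the production value


def gate(rows: List[Extracted]) -> Tuple[List[Extracted], List[Extracted]]:
    """Split rows into (accepted, rejected); only accepted rows reach storage."""
    accepted: List[Extracted] = []
    rejected: List[Extracted] = []
    for row in rows:
        complete = bool(row.marker_id) and bool(row.result)
        if complete and row.confidence >= MIN_CONFIDENCE:
            accepted.append(row)
        else:
            # A missing marker is preferred over a wrong one, so drop it.
            rejected.append(row)
    return accepted, rejected


# Placeholder stand-in for the reference layer: (marker, result) -> food IDs.
FOOD_RULES: Dict[Tuple[str, str], List[str]] = {
    ("rs1801133", "TT"): ["food-001", "food-002"],
}


def foods_for(accepted: List[Extracted]) -> Dict[str, List[str]]:
    """Join validated markers against the reference table."""
    return {
        row.marker_id: FOOD_RULES.get((row.marker_id, row.result), [])
        for row in accepted
    }
```

In a sketch like this, rejected rows would typically be routed to manual review rather than discarded silently, and the join would be a SQL query against PostgreSQL rather than an in-memory dictionary.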

What shipped

The multi-format extraction pipeline, normalization and validation layer, the food reference database, and a deployed production backend serving 15 API routes.

Outcome

238 markers extracted from a 170-page report in the flagship format; three lab formats integrated and live; a working production system anchoring a real partner discussion. Status stated plainly: this is a pilot, pre-revenue, and I cite no projections.

What this demonstrates

Every industry has its version of the 170-page PDF: invoices, lab reports, inspection documents, contracts. This is the pattern for turning document piles into structured data products: format-aware extraction, validation gates, and a reference layer that makes the data worth money.
