How to Extract Structured Data from Lab Results, Pathology Reports, and Patient Intake Forms with an API That Deploys in 60 Seconds
A healthtech developer's guide to building a medical document extraction pipeline without training an ML model. Covers CBC panel extraction, pathology reports, intake forms, HIPAA audit trails, BYOK, and EU AI Act compliance.
Healthcare runs on documents. Lab results, pathology reports, faxed intake forms: the data your application needs is trapped inside PDFs that look different depending on whether they came from Quest Diagnostics, LabCorp, a local hospital system, or a handwritten intake packet from a rural clinic. If you've ever tried to extract a patient's hemoglobin level or INR value reliably across all of those formats, you already know the problem.
The good news: you don't need to train a model. You don't need a data science team. You don't need to call a vendor's sales team and wait two weeks for a demo. This guide walks through how to stand up a working lab result data extraction API in under 60 seconds using Fabrx, and covers the compliance and architectural questions your team will inevitably face.
Why Medical Document Extraction Is Still Broken for Healthtech Builders
The tools that have existed for years (AWS Textract, Google Document AI, traditional OCR libraries) were built for structured, consistent forms. Insurance claim PDFs. Bank statements. W-2s. These tools work reasonably well when the document format is known in advance and doesn't change.
Medical documents break those assumptions on every axis:
- Variable layouts. Quest Diagnostics uses a completely different page layout than LabCorp, which uses a different layout than a hospital LIS export. The same test, a Complete Blood Count, will appear in different positions, with different column orders, different reference range formats, and sometimes different abbreviations (Hgb vs. HGB vs. Hemoglobin).
- Fax-to-PDF degradation. A substantial portion of US lab results still travel by fax. The resulting scan quality is inconsistent. Textract's key-value extraction fails frequently on fax artifacts, column misalignment, and low-contrast text.
- Multi-page complexity. A single pathology report may be eight pages long, with a narrative summary on page one and structured findings embedded in pages three through six. A traditional extraction pipeline that processes pages independently loses cross-page context.
- Compliance overhead. PHI is involved. Your compliance team wants audit logs. Your healthcare customers want to know what AI provider is touching their data, whether it's being used for model training, and where data is stored. Traditional extraction tools rarely answer these questions clearly.
The result: healthtech engineering teams spend weeks building brittle template-matching pipelines that break every time a lab partner updates their report format, then spend more weeks updating the templates. It's an endless maintenance tax on core product velocity.
The Three Document Types That Kill Healthtech Pipelines
Most medical document extraction challenges cluster around three document categories, each with its own failure modes:
Lab Results
Lab results (Complete Blood Count panels, metabolic panels, lipid panels, thyroid function tests) are highly structured in principle but wildly variable in practice. The core extraction targets are consistent: test name, value, unit, reference range, and abnormal flag. But the way these fields are laid out across different lab vendors means template-based extraction has a failure rate that compounds with each new lab partner you onboard. LOINC code mapping (translating "Hgb" at Quest to the canonical LOINC code 718-7) adds another layer of complexity that most OCR tools don't address at all.
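To make the normalization problem concrete, here is a minimal sketch of alias-to-LOINC resolution using a small hand-maintained lookup table. The alias list and the function name `resolve_loinc` are illustrative assumptions; only the two LOINC codes shown (718-7 for hemoglobin, 6690-2 for WBC) appear in this article, and a production table would be far larger.

```python
from typing import Optional

# Illustrative alias table: normalized label -> LOINC code.
# Only these two codes appear in this article; the aliases are assumptions.
LOINC_ALIASES = {
    "hgb": "718-7",
    "hemoglobin": "718-7",
    "wbc": "6690-2",
    "white blood cell count": "6690-2",
}


def resolve_loinc(test_name: str) -> Optional[str]:
    """Return the LOINC code for a vendor-specific test label, or None if unknown."""
    # Collapse case and repeated whitespace so "HGB", "Hgb", and " hgb " match.
    key = " ".join(test_name.lower().split())
    return LOINC_ALIASES.get(key)
```

Returning `None` for unknown labels, rather than guessing, lets the caller route unmapped tests to review instead of silently mislabeling them.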
Pathology Reports
Pathology reports mix narrative prose with structured diagnostic codes. A breast biopsy pathology report might include a free-text clinical impression section followed by structured fields for tumor grade, ER/PR/HER2 receptor status, and staging. Extracting both the narrative summary and the structured codes requires a model that understands document-level context, not just optical character recognition.
Patient Intake Forms
Patient intake forms are the hardest category. They involve checkboxes (sometimes filled in pen, sometimes digital), handwritten fields, multi-column layouts, and multi-page structures. A patient may have checked "yes" to diabetes but the checkbox is a hand-drawn X rather than a filled circle. Traditional OCR pipelines classify these as empty. The clinical impact of getting this wrong is obvious.
What "Structured Extraction" Actually Means for Medical Documents
When a healthtech team says they want "structured extraction," they usually mean: give me a predictable JSON object I can write application code against. Specifically:
- Fields the application cares about, not everything on the document
- Consistent field names across different source document layouts
- Field-level confidence scores so the application can flag low-confidence results for human review
- A stable schema contract that downstream EHR connectors, analytics pipelines, and alerting systems depend on
This is different from what most OCR tools produce, which is a flat key-value dump of every text element on the page. You get everything, labeled inconsistently, with no guarantee that field names will be stable across different source formats. That output needs significant post-processing before it's usable, which is where most healthtech data engineering time actually goes.
The better approach: define the schema you want once, in plain English, and let the extraction layer handle mapping arbitrary document formats to your schema. This is what Fabrx's conversational schema builder does.
A Working Example: Extracting a CBC Panel from a Lab Result PDF
Here's a concrete walkthrough of how to build a working CBC panel extraction API using Fabrx, from schema definition to live API endpoint.
Step 1: Define your schema conversationally
In the Fabrx dashboard, describe what you want to extract. You don't need to specify XPath selectors, regex patterns, or bounding box coordinates. You type something like:
Extract the following fields from each test result row:
- patient_name (string)
- patient_mrn (string)
- collection_date (date, ISO 8601)
- test_name (string)
- loinc_code (string, if present)
- value (string)
- unit (string)
- reference_range_low (number, nullable)
- reference_range_high (number, nullable)
- abnormal_flag (enum: "H", "L", "HH", "LL", "N", null)
- ordering_provider (string)
Fabrx generates the extraction schema and validates it against a sample document you upload.
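On the application side, it can help to mirror the schema in typed code so a contract violation fails loudly at the boundary. The following is a sketch under assumptions: the class name `LabResultRow` and the helper `parse_collection_date` are hypothetical, and the field names and flag values are taken from the schema description above.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Enum values copied from the schema description; None means "no flag reported".
ALLOWED_FLAGS = {"H", "L", "HH", "LL", "N", None}


@dataclass
class LabResultRow:
    """One test result row, mirroring the per-row fields of the schema."""
    test_name: str
    value: str
    unit: str
    loinc_code: Optional[str] = None
    reference_range_low: Optional[float] = None
    reference_range_high: Optional[float] = None
    abnormal_flag: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject flags outside the enum instead of passing them downstream.
        if self.abnormal_flag not in ALLOWED_FLAGS:
            raise ValueError(f"unexpected abnormal_flag: {self.abnormal_flag!r}")


def parse_collection_date(raw: str) -> date:
    """Parse an ISO 8601 calendar date such as '2026-06-14'."""
    return date.fromisoformat(raw)
```

Keeping `value` as a string matches the schema and avoids losing results such as "<0.5" that do not parse as numbers.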
Step 2: Deploy the API endpoint
Click Deploy. Your extraction endpoint is live. You get a URL and an API key. No infrastructure to provision, no container to build, no model to host.
Step 3: Call the API
curl -X POST https://api.fabrx.ai/v1/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@cbc_panel_quest.pdf" \
  -F "endpoint_id=ep_cbc_panel_v1"
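For teams calling the endpoint from application code rather than a shell, the same request can be sketched in Python with only the standard library. The URL, the bearer-token header, and the form field names `file` and `endpoint_id` are taken from the curl call above; the function name `build_extract_request` is hypothetical, and the function only builds the request without sending it.

```python
import urllib.request
import uuid

API_URL = "https://api.fabrx.ai/v1/extract"  # taken from the curl example


def build_extract_request(pdf_bytes: bytes, filename: str,
                          endpoint_id: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data POST equivalent to the curl call."""
    boundary = uuid.uuid4().hex
    # Each part: boundary line, part headers, blank line, content.
    body = b"\r\n".join([
        f"--{boundary}".encode(),
        b'Content-Disposition: form-data; name="endpoint_id"',
        b"",
        endpoint_id.encode(),
        f"--{boundary}".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"'.encode(),
        b"Content-Type: application/pdf",
        b"",
        pdf_bytes,
        f"--{boundary}--".encode(),  # closing boundary
        b"",
    ])
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

To send it, pass the returned object to `urllib.request.urlopen` and decode the response body with `json.load`. Separating construction from sending keeps the request easy to unit test without network access.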
Step 4: Receive structured JSON
{
"patient_name": "Jane Smith",
"patient_mrn": "MRN-004821",
"collection_date": "2026-06-14",
"results": [
{
"test_name": "Hemoglobin",
"loinc_code": "718-7",
"value": "11.2",
"unit": "g/dL",
"reference_range_low": 12.0,
"reference_range_high": 16.0,
"abnormal_flag": "L",
"confidence": 0.97
},
{
"test_name": "WBC",
"loinc_code": "6690-2",
"value": "8.4",
"unit": "K/uL",
"reference_range_low": 4.5,
"reference_range_high": 11.0,
"abnormal_flag": "N",
"confidence": 0.99
}
],
"ordering_provider": "Dr. Michael Torres, MD",
"extraction_id": "ext_7Qm9kzXp3",
"processing_time_ms": 1240
}
The same endpoint, with the same JSON schema, handles the LabCorp version of a CBC panel without any configuration changes. The extraction layer normalizes the different source layouts to your schema.
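Once the JSON arrives, a cheap sanity check is to recompute the abnormal flag from the value and reference range and compare it with the extracted flag. This is a sketch that assumes the response shape shown in the sample above; the function names are hypothetical, and critical flags (HH/LL) are only checked for direction because their thresholds are lab-specific.

```python
from typing import Optional


def expected_flag(value: str, low: Optional[float],
                  high: Optional[float]) -> Optional[str]:
    """Recompute a coarse H/L/N flag; None if the value or range is unusable."""
    try:
        number = float(value)
    except ValueError:
        return None  # e.g. "<0.5" or "positive"
    if low is None or high is None:
        return None
    if number < low:
        return "L"
    if number > high:
        return "H"
    return "N"


def flag_mismatches(response: dict) -> list:
    """List test names whose extracted flag disagrees with the recomputed one."""
    mismatches = []
    for row in response.get("results", []):
        derived = expected_flag(row["value"],
                                row.get("reference_range_low"),
                                row.get("reference_range_high"))
        extracted = row.get("abnormal_flag")
        # Compare only the first letter so HH/LL count as H/L in this coarse check.
        if derived is not None and extracted is not None and extracted[0] != derived:
            mismatches.append(row["test_name"])
    return mismatches
```

For the sample response, hemoglobin at 11.2 g/dL falls below the 12.0 lower bound and WBC at 8.4 K/uL sits inside its range, so both extracted flags agree with the recomputed ones.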
Include loinc_code as a target field and Fabrx will populate it from the document if present, or attempt to resolve it from the test name if not, reducing the need for a separate LOINC normalization step in your pipeline.
Handling Pathology Reports: Narrative Fields vs. Structured Codes
Pathology reports present a specific challenge: they contain both narrative clinical prose and structured diagnostic data, often interleaved across multiple pages. A breast biopsy report might open with a clinical history summary, then include an intraoperative findings section, followed by a structured microscopic description, and conclude with a diagnostic impression that blends free text with ICD-10 codes.
A pure OCR approach extracts all of this as undifferentiated text. A template-based approach works for one lab's pathology format and breaks immediately when you onboard a second lab partner.
The extraction schema for pathology reports typically needs to handle two parallel structures:
- Narrative fields β clinical impression, gross description, microscopic findings. These are extracted as strings, often multi-sentence. Some teams run a secondary LLM summarization step to condense these for downstream display.
- Structured codes β diagnosis codes (ICD-10), procedure codes (CPT), tumor characteristics (grade, stage, receptor status). These need to be extracted as typed fields with validation, not as raw strings.
In Fabrx, you define both in the same schema. Narrative fields are typed as string. Structured codes are typed as string with enum validation or number with range constraints. The extraction model handles the distinction automatically β it understands that "Grade 2/3" in the body of the report should map to your tumor_grade field, not your clinical_impression field.
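Whatever the extraction layer guarantees, downstream code can still enforce the typed fields defensively. The following is a minimal sketch; the field names `er_status`, `pr_status`, `her2_status`, and the set of allowed receptor values are assumptions for illustration, as is the 1 to 3 grade range.

```python
# Hypothetical enum values; a real schema would use whatever the team defines.
RECEPTOR_STATUS = {"positive", "negative", "equivocal", "not_reported"}


def validate_pathology(record: dict) -> list:
    """Return a list of human-readable validation problems (empty if clean)."""
    problems = []
    # Structured codes: enum validation.
    for field in ("er_status", "pr_status", "her2_status"):
        value = record.get(field)
        if value is not None and value not in RECEPTOR_STATUS:
            problems.append(f"{field}: unexpected value {value!r}")
    # Structured codes: numeric range constraint (assumed 1-3 grading scale).
    grade = record.get("tumor_grade")
    if grade is not None and not (isinstance(grade, int) and 1 <= grade <= 3):
        problems.append(f"tumor_grade: expected integer 1-3, got {grade!r}")
    # Narrative field: must be a non-empty string.
    impression = record.get("clinical_impression")
    if not isinstance(impression, str) or not impression.strip():
        problems.append("clinical_impression: expected non-empty string")
    return problems
```

Returning a list of problems instead of raising on the first one makes it easier to show a reviewer everything wrong with a record in a single pass.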
For teams building EHR connectors, the ability to extract both narrative and structured fields in one pass β without a separate NLP pipeline β significantly reduces the number of moving parts in the integration architecture.
Patient Intake Form Extraction: Checkboxes, Handwriting, and Multi-Page Complexity
Patient intake forms are where most medical document extraction pipelines fail quietly. A checkbox indicating a positive family history of cardiac disease, missed because the patient used an X instead of a check mark, can affect clinical decision-making downstream. This is not a theoretical concern; it's a documented failure mode of OCR-only extraction pipelines in clinical settings.
Fabrx's extraction model handles intake forms with several capabilities that template-based tools lack:
- Checkbox detection across handwriting styles. The model recognizes filled circles, X marks, check marks, and diagonal lines as positive checkbox states, not as noise.
- Multi-page context. A patient may complete a medical history section on page two that references a condition disclosed on page four. Cross-page references are resolved within a single extraction call.
- Handwritten field extraction. Emergency contact names, physician names, and insurance policy numbers written by hand are extracted with confidence scores. Low-confidence handwritten fields are flagged for human review rather than silently passed through.
- Structured output for conditional fields. Many intake forms use conditional logic β "if yes, specify." Fabrx extracts both the checkbox state and the conditional text field as linked fields in the output JSON.
For teams building patient intake pipelines, the practical outcome is a reduction in the manual review queue. Instead of routing every form to a staff member for data entry, only forms with low-confidence fields (typically 5–15% of the total volume) require human verification.
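The review-queue routing described above can be sketched as a simple threshold check. The per-form shape used here (a `fields` mapping of name to `value` and `confidence`) and the 0.85 threshold are assumptions for illustration; tune the cut-off against your own review data.

```python
REVIEW_THRESHOLD = 0.85  # assumed cut-off, not a documented default


def low_confidence_fields(form: dict, threshold: float = REVIEW_THRESHOLD) -> list:
    """Return names of fields whose confidence falls below the threshold."""
    return [name for name, field in form["fields"].items()
            if field["confidence"] < threshold]


def split_review_queue(forms: list, threshold: float = REVIEW_THRESHOLD):
    """Partition forms into (auto_accepted, needs_review)."""
    auto, review = [], []
    for form in forms:
        # A single low-confidence field is enough to send the form to a human.
        if low_confidence_fields(form, threshold):
            review.append(form)
        else:
            auto.append(form)
    return auto, review
```

Passing the list of low-confidence field names to the reviewer, rather than the whole form, keeps the human step focused on the fields that actually need attention.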
See also: how Fabrx handles scanned document OCR at scale.
Compliance You Don't Have to Ask Sales For: Audit Trails, PII Detection, and EU AI Act Readiness
Healthcare compliance conversations with enterprise customers follow a predictable script. "What data do you retain? How long? Where is the audit log? What's your PHI handling policy? Are you HIPAA-compliant?" If your document extraction layer can't answer these questions with documentation rather than a sales call, you're adding weeks to every enterprise deal.
Fabrx is built so that compliance answers are available at signup, not gated behind an enterprise tier.
Field-level audit trails
Every extraction call produces an audit record: timestamp, document hash, extraction endpoint version, field-level confidence scores, and the AI model used. This record is immutable and queryable. For HIPAA audit readiness, you can demonstrate exactly what was extracted from a specific document, when, and with what confidence, without reconstructing events from application logs.
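If you also want a local record to reconcile against, the same fields can be captured on your side at call time. This sketch is not Fabrx's audit API: the function name and record layout are hypothetical, and it assumes the response shape from the CBC example (an `extraction_id` plus per-result `confidence`).

```python
import hashlib
from datetime import datetime, timezone


def build_audit_record(pdf_bytes: bytes, response: dict, endpoint_version: str) -> dict:
    """Assemble a local audit entry for one extraction call."""
    return {
        # UTC timestamp in ISO 8601 format.
        "timestamp": datetime.now(timezone.utc).isoformat(),
        # Hash of the exact bytes submitted, so the document can be matched later.
        "document_sha256": hashlib.sha256(pdf_bytes).hexdigest(),
        "endpoint_version": endpoint_version,
        "extraction_id": response.get("extraction_id"),
        "field_confidence": {
            row["test_name"]: row["confidence"]
            for row in response.get("results", [])
        },
    }
```

Storing a hash instead of the document itself keeps PHI out of the audit log while still letting you prove which file an extraction came from.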
PII detection and flagging
Fabrx runs automated detection for the 18 PHI identifiers defined in the HIPAA Safe Harbor standard: names, dates (other than year), geographic data below state level, phone numbers, fax numbers, email addresses, social security numbers, medical record numbers, and others. Detected PHI is flagged in the extraction output, enabling downstream de-identification workflows without a separate NLP pass.
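A downstream de-identification step might then look like the sketch below. How PHI flags appear in the extraction output is not specified in this article, so the function simply takes the flagged field names as arguments; the function name and the `[REDACTED]` marker are hypothetical.

```python
import copy

REDACTED = "[REDACTED]"


def deidentify(record: dict, phi_fields: list, date_fields: tuple = ()) -> dict:
    """Return a deep copy with PHI fields masked and dates reduced to the year."""
    cleaned = copy.deepcopy(record)  # never mutate the original extraction
    for name in phi_fields:
        if cleaned.get(name) is not None:
            cleaned[name] = REDACTED
    for name in date_fields:
        if cleaned.get(name):
            # Keep the year only; assumes ISO 8601 strings such as "2026-06-14".
            cleaned[name] = cleaned[name][:4]
    return cleaned
```

Reducing dates to the year follows the Safe Harbor treatment of dates mentioned above, while the clinical values themselves pass through untouched.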
EU AI Act readiness
This is the compliance dimension that no competitor in the medical document extraction space has addressed. Under the EU AI Act, AI systems used in health and life sciences contexts can be classified as high-risk, for example when they are a safety component of a regulated medical device or fall under one of the Act's listed high-risk use cases. High-risk systems require conformity assessments, human oversight mechanisms, transparency documentation, and audit logging. If you're building a healthtech product for European markets, or for US customers who handle data from European patients, EU AI Act compliance is not optional.
Fabrx's field-level data lineage, confidence scoring, and immutable audit trails are designed to satisfy EU AI Act high-risk system requirements out of the box. You don't need to build a separate compliance layer on top of the extraction pipeline.
Read more: GDPR and EU AI Act compliant document processing with Fabrx.
BYOK: Why Your Healthtech Customers Will Ask About Your AI Provider
Enterprise healthcare customers increasingly require contractual control over which AI models process their data. The questions arrive in every vendor evaluation: "What model are you using? Does it train on our data? Where is inference happening? Can we use our own API keys?"
For most document extraction vendors, the honest answer is: "We use a shared model, we can't tell you exactly which one, and data residency control requires a custom enterprise agreement." That answer fails vendor security reviews at most health systems.
Fabrx supports Bring Your Own Key (BYOK) across 100+ model providers. You can configure extraction endpoints to use your organization's Azure OpenAI deployment, your own Anthropic API key, or a self-hosted model running in your VPC β without leaving the Fabrx interface. Model training opt-out is enforced at the provider level when you supply your own credentials.
For teams building for EU markets, BYOK also addresses data residency requirements under GDPR: you can route extraction through a provider with EU data center guarantees rather than relying on a third-party vendor's data processing agreement.
See: how Fabrx's no-code API builder integrates with your existing infrastructure.
Schema Versioning: What Happens When Quest Changes Their Format
This is the question no competitor addresses, but it's the one that actually determines long-term engineering maintenance cost.
Lab vendors update their report formats. Quest Diagnostics redesigns their PDF layout. A hospital system migrates their LIS to a new vendor. A pathology lab switches from one reporting template to another. When this happens with a template-based extraction system, the extraction breaks. Silently, if you're unlucky. With loud failures, if you're fortunate enough to have the right monitoring in place.
Schema versioning in Fabrx addresses this at the API contract level. Each extraction endpoint has a version. When a document format change causes extraction drift (confidence scores drop, field coverage decreases), Fabrx surfaces this in the observability dashboard before it causes production failures.
You can:
- Pin a specific schema version for downstream systems that depend on a stable field contract, while testing an updated schema against new document samples in staging.
- Run A/B extraction across two schema versions to measure field coverage and confidence score differences before migrating downstream consumers.
- Review extraction history to identify exactly when a format change occurred and which document batch first triggered the drift.
For healthtech teams managing relationships with multiple lab partners, this means format changes become an observable, manageable event rather than a production incident that your on-call engineer discovers at 2am because a downstream alert fired on bad data.
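The two drift signals named above, falling confidence and shrinking field coverage, are simple enough to compute yourself when comparing batches or schema versions. This is a sketch under assumptions: each extraction is modeled as a mapping of field name to `value` and `confidence`, and the 0.05 tolerances are arbitrary placeholders.

```python
from statistics import mean


def batch_metrics(extractions: list, expected_fields: list) -> dict:
    """Compute mean confidence and field coverage for a batch of extractions."""
    confidences = []
    filled = 0
    for doc in extractions:
        for name in expected_fields:
            field = doc.get(name)
            # A field counts as covered only if it is present with a non-null value.
            if field is not None and field.get("value") is not None:
                filled += 1
                confidences.append(field["confidence"])
    total = len(extractions) * len(expected_fields)
    return {
        "mean_confidence": mean(confidences) if confidences else 0.0,
        "coverage": filled / total if total else 0.0,
    }


def has_drifted(baseline: dict, current: dict,
                max_conf_drop: float = 0.05, max_cov_drop: float = 0.05) -> bool:
    """True if confidence or coverage fell by more than the allowed tolerance."""
    return (baseline["mean_confidence"] - current["mean_confidence"] > max_conf_drop
            or baseline["coverage"] - current["coverage"] > max_cov_drop)
```

The same `batch_metrics` output works for an A/B comparison: run both schema versions over one sample set and compare the two dictionaries before migrating downstream consumers.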
Getting to Your First Working Medical Extraction API in Under 60 Seconds
The 60-second claim is specific: from account creation to a live, callable extraction endpoint. Here's the actual sequence:
1. Create a free account at app.fabrx.ai. No credit card required.
2. Create a new extraction endpoint. Name it (e.g., "CBC Panel - Quest Diagnostics").
3. Describe your schema in plain English. Type what fields you want. The schema builder generates the configuration.
4. Upload a sample document to validate extraction. Review the output JSON. Adjust any field descriptions if needed.
5. Click Deploy. Copy your endpoint URL and API key.
6. Make your first API call. POST a PDF. Receive structured JSON.
Steps 1–5 typically take less than 60 seconds for a developer who has a sample document on hand. Steps 1–6, including the first successful API response, typically take under two minutes.
Compare this to AWS Textract: provisioning IAM roles, configuring S3 buckets for document storage, calling the Textract API, post-processing the key-value output, writing normalization logic, and building the schema mapping layer on top. That's a multi-day engineering project before you have a usable extraction endpoint for a single lab format.
Fabrx vs. Other Medical Document Extraction Tools
The competitive landscape for medical document extraction is fragmented. Here's an honest assessment of the main alternatives and where they fall short for healthtech builders:
Affinda
Affinda positions itself as enterprise-grade and offers a clean interface. The gap: no developer-first tutorial, no schema versioning, no BYOK, no EU AI Act readiness, and the path to a working integration goes through a sales conversation. For a healthtech startup that needs to ship in days, not weeks, the sales-gated onboarding is a blocker.
DocuPipe
DocuPipe offers a visual demo and claims HIPAA/SOC2/ISO compliance. The gap: no schema versioning, no BYOK, no EU AI Act coverage, and no time-to-deploy benchmarks. There's also no tutorial content that walks a developer through a medical document extraction use case end-to-end.
AWS Textract
Textract is the default choice for teams already in the AWS ecosystem. The gap: it produces raw key-value output that requires significant post-processing for medical documents, it's brittle on fax-quality scans, and it has no concept of schema definition, versioning, or audit trails. Every time a lab format changes, you're back to updating normalization code.
Landing.ai
Landing.ai's agentic document extraction is technically impressive. The gap: it requires data scientist involvement to configure effectively, there's no no-code path for the developer who wants to define a schema in plain English and get a live endpoint, and there's no compliance or audit trail coverage in their published materials.
Spike API
Spike is purpose-built for blood test and lab report OCR with LOINC mapping, the narrowest offering in this space. The gap: it handles blood tests only, offers no schema customization for other medical document types, and has no compliance depth beyond basic SOC 2.
Fabrx
Fabrx is the only option that combines: conversational schema building (no ML required), sub-60-second deployment, field-level audit trails on every plan including free, BYOK with 100+ providers, schema versioning with drift detection, and EU AI Act readiness. It handles lab results, pathology reports, and patient intake forms from the same interface.
For the healthtech developer who needs to ship a compliant, maintainable medical document extraction pipeline without a data science team and without a sales conversation, Fabrx is the only tool in this space that removes all of those blockers simultaneously.
Whether you're building an EHR integration that ingests faxed lab results, a clinical workflow tool that needs structured pathology data, or a patient portal that processes scanned intake forms, the fastest path from PDF to structured JSON is a 60-second deploy, not a six-week vendor evaluation.