Healthtech · 11 min read

How to Extract Structured Data from Lab Results, Pathology Reports, and Patient Intake Forms with an API That Deploys in 60 Seconds

A healthtech developer's guide to building a medical document extraction pipeline without training an ML model. Covers CBC panel extraction, pathology reports, intake forms, HIPAA audit trails, BYOK, and EU AI Act compliance.

Healthcare runs on documents. Lab results, pathology reports, faxed intake forms: the data your application needs is trapped inside PDFs that look different depending on whether they came from Quest Diagnostics, LabCorp, a local hospital system, or a handwritten intake packet from a rural clinic. If you've ever tried to extract a patient's hemoglobin level or INR value reliably across all of those formats, you already know the problem.

The good news: you don't need to train a model. You don't need a data science team. You don't need to call a vendor's sales team and wait two weeks for a demo. This guide walks through how to stand up a working lab result data extraction API in under 60 seconds using Fabrx, and covers the compliance and architectural questions your team will inevitably face.

Why Medical Document Extraction Is Still Broken for Healthtech Builders

The tools that have existed for years (AWS Textract, Google Document AI, traditional OCR libraries) were built for structured, consistent forms. Insurance claim PDFs. Bank statements. W-2s. These tools work reasonably well when the document format is known in advance and doesn't change.

Medical documents break those assumptions on every axis:

  • Variable layouts. Quest Diagnostics uses a completely different page layout than LabCorp, which uses a different layout than a hospital LIS export. The same test, a Complete Blood Count, will appear in different positions, with different column orders, different reference range formats, and sometimes different abbreviations (Hgb vs. HGB vs. Hemoglobin).
  • Fax-to-PDF degradation. A substantial portion of US lab results still travel by fax. The resulting scan quality is inconsistent. Textract's key-value extraction fails frequently on fax artifacts, column misalignment, and low-contrast text.
  • Multi-page complexity. A single pathology report may be eight pages long, with a narrative summary on page one and structured findings embedded in pages three through six. A traditional extraction pipeline that processes pages independently loses cross-page context.
  • Compliance overhead. PHI is involved. Your compliance team wants audit logs. Your healthcare customers want to know what AI provider is touching their data, whether it's being used for model training, and where data is stored. Traditional extraction tools rarely answer these questions clearly.

The result: healthtech engineering teams spend weeks building brittle template-matching pipelines that break every time a lab partner updates their report format, then spend more weeks updating the templates. It's an endless maintenance tax on core product velocity.

The Three Document Types That Kill Healthtech Pipelines

Most medical document extraction challenges cluster around three document categories, each with its own failure modes:

Lab Results

Lab results (Complete Blood Count panels, metabolic panels, lipid panels, thyroid function tests) are highly structured in principle but wildly variable in practice. The core extraction targets are consistent: test name, value, unit, reference range, and abnormal flag. But the way these fields are laid out across different lab vendors means template-based extraction has a failure rate that compounds with each new lab partner you onboard. LOINC code mapping (translating "Hgb" at Quest to the canonical LOINC code 718-7) adds another layer of complexity that most OCR tools don't address at all.
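
To make the abbreviation problem concrete, here is a minimal sketch of the normalization step a pipeline would otherwise have to maintain by hand. The alias table and the `resolve_loinc` helper are hypothetical names for illustration; the table holds only the two LOINC codes that appear in this article, and a real table would be far larger.

```python
from typing import Optional

# Hypothetical alias table: only the two LOINC codes used in this article.
LOINC_ALIASES = {
    "718-7": {"hgb", "hemoglobin"},
    "6690-2": {"wbc", "white blood cell count"},
}


def resolve_loinc(test_name: str) -> Optional[str]:
    """Return the LOINC code for a vendor-specific test label, or None if unknown."""
    key = test_name.strip().lower()
    for code, aliases in LOINC_ALIASES.items():
        if key in aliases:
            return code
    return None


# The three spellings of hemoglobin resolve to one code; an unknown label does not.
for label in ("Hgb", "HGB", "Hemoglobin", "Ferritin"):
    print(label, "->", resolve_loinc(label))
```

Every new lab partner adds rows to a table like this, which is the maintenance cost the rest of this guide is about.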

Pathology Reports

Pathology reports mix narrative prose with structured diagnostic codes. A breast biopsy pathology report might include a free-text clinical impression section followed by structured fields for tumor grade, ER/PR/HER2 receptor status, and staging. Extracting both the narrative summary and the structured codes requires a model that understands document-level context, not just optical character recognition.

Patient Intake Forms

Patient intake forms are the hardest category. They involve checkboxes (sometimes filled in pen, sometimes digital), handwritten fields, multi-column layouts, and multi-page structures. A patient may have checked "yes" to diabetes but the checkbox is a hand-drawn X rather than a filled circle. Traditional OCR pipelines classify these as empty. The clinical impact of getting this wrong is obvious.

What "Structured Extraction" Actually Means for Medical Documents

When a healthtech team says they want "structured extraction," they usually mean: give me a predictable JSON object I can write application code against. Specifically:

  • Fields the application cares about, not everything on the document
  • Consistent field names across different source document layouts
  • Field-level confidence scores so the application can flag low-confidence results for human review
  • A stable schema contract that downstream EHR connectors, analytics pipelines, and alerting systems depend on

This is different from what most OCR tools produce, which is a flat key-value dump of every text element on the page. You get everything, labeled inconsistently, with no guarantee that field names will be stable across different source formats. That output needs significant post-processing before it's usable, which is where most healthtech data engineering time actually goes.
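
As a rough illustration of what "a predictable JSON object I can write application code against" buys you, the sketch below models one result row as a typed record and routes low-confidence rows to human review. The `LabResult` class, the `needs_review` helper, and the 0.90 threshold are assumptions made for this example, not part of any Fabrx SDK.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LabResult:
    """One row of the stable schema contract that downstream code depends on."""
    test_name: str
    value: str
    unit: str
    reference_range_low: Optional[float]
    reference_range_high: Optional[float]
    abnormal_flag: Optional[str]
    confidence: float


REVIEW_THRESHOLD = 0.90  # assumed cutoff; tune it against your own review data


def needs_review(results: List[LabResult], threshold: float = REVIEW_THRESHOLD) -> List[str]:
    """Return the test names whose confidence falls below the threshold."""
    return [r.test_name for r in results if r.confidence < threshold]


rows = [
    LabResult("Hemoglobin", "11.2", "g/dL", 12.0, 16.0, "L", 0.97),
    LabResult("Platelets", "250", "K/uL", 150.0, 400.0, "N", 0.71),  # made-up row
]
print(needs_review(rows))
```

Because the field names are fixed, this code does not change when a new source layout arrives; only the extraction layer has to adapt.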

The better approach: define the schema you want once, in plain English, and let the extraction layer handle mapping arbitrary document formats to your schema. This is what Fabrx's conversational schema builder does.

Fabrx advantage: Schema-first extraction means you define the output shape you want (patient MRN, test name, value, unit, reference range, flag) and Fabrx maps every document variant to that schema automatically. No templates. No per-vendor configuration files. No maintenance burden when a lab partner updates their report layout.

A Working Example: Extracting a CBC Panel from a Lab Result PDF

Here's a concrete walkthrough of how to build a working CBC panel extraction API using Fabrx, from schema definition to live API endpoint.

Step 1: Define your schema conversationally

In the Fabrx dashboard, describe what you want to extract. You don't need to specify XPath selectors, regex patterns, or bounding box coordinates. You type something like:

Extract the following fields from each test result row:
- patient_name (string)
- patient_mrn (string)
- collection_date (date, ISO 8601)
- test_name (string)
- loinc_code (string, if present)
- value (string)
- unit (string)
- reference_range_low (number, nullable)
- reference_range_high (number, nullable)
- abnormal_flag (enum: "H", "L", "HH", "LL", "N", null)
- ordering_provider (string)

Fabrx generates the extraction schema and validates it against a sample document you upload.

Step 2: Deploy the API endpoint

Click Deploy. Your extraction endpoint is live. You get a URL and an API key. No infrastructure to provision, no container to build, no model to host.

Step 3: Call the API

curl -X POST https://api.fabrx.ai/v1/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@cbc_panel_quest.pdf" \
  -F "endpoint_id=ep_cbc_panel_v1"
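
If you would rather call the endpoint from application code than from a shell, the same request can be sketched with the Python standard library alone. The URL, header, and form field names are taken from the curl command above; the `build_extract_request` helper is a hypothetical name, and the sketch only builds the request without sending it.

```python
import urllib.request
import uuid

API_URL = "https://api.fabrx.ai/v1/extract"  # URL from the curl example


def build_extract_request(pdf_bytes: bytes, filename: str,
                          endpoint_id: str, api_key: str) -> urllib.request.Request:
    """Build the multipart/form-data POST that mirrors the curl call."""
    boundary = uuid.uuid4().hex
    # First part: the PDF file, sent under the form field name "file".
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode()
    # Second part: the endpoint_id field, followed by the closing boundary.
    tail = (
        f"\r\n--{boundary}\r\n"
        'Content-Disposition: form-data; name="endpoint_id"\r\n\r\n'
        f"{endpoint_id}\r\n"
        f"--{boundary}--\r\n"
    ).encode()
    return urllib.request.Request(
        API_URL,
        data=head + pdf_bytes + tail,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


req = build_extract_request(b"%PDF-1.4 placeholder", "cbc_panel_quest.pdf",
                            "ep_cbc_panel_v1", "YOUR_API_KEY")
print(req.get_method(), req.full_url)
# Sending is left out on purpose: urllib.request.urlopen(req) would perform the call.
```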

Step 4: Receive structured JSON

{
  "patient_name": "Jane Smith",
  "patient_mrn": "MRN-004821",
  "collection_date": "2026-06-14",
  "results": [
    {
      "test_name": "Hemoglobin",
      "loinc_code": "718-7",
      "value": "11.2",
      "unit": "g/dL",
      "reference_range_low": 12.0,
      "reference_range_high": 16.0,
      "abnormal_flag": "L",
      "confidence": 0.97
    },
    {
      "test_name": "WBC",
      "loinc_code": "6690-2",
      "value": "8.4",
      "unit": "K/uL",
      "reference_range_low": 4.5,
      "reference_range_high": 11.0,
      "abnormal_flag": "N",
      "confidence": 0.99
    }
  ],
  "ordering_provider": "Dr. Michael Torres, MD",
  "extraction_id": "ext_7Qm9kzXp3",
  "processing_time_ms": 1240
}
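
Once the JSON is in hand, a cheap sanity check is to confirm that each reported abnormal flag agrees with the value and reference range extracted alongside it. The sketch below does this on an abridged copy of the response above; `expected_flag` and `flag_mismatches` are hypothetical helpers, and treating the range bounds as inclusive is an assumption.

```python
import json

# Abridged copy of the response shown above.
SAMPLE_RESPONSE = """
{
  "results": [
    {"test_name": "Hemoglobin", "value": "11.2", "reference_range_low": 12.0,
     "reference_range_high": 16.0, "abnormal_flag": "L"},
    {"test_name": "WBC", "value": "8.4", "reference_range_low": 4.5,
     "reference_range_high": 11.0, "abnormal_flag": "N"}
  ]
}
"""


def expected_flag(value, low, high):
    """Derive H/L/N from a numeric value and a (possibly open-ended) range."""
    if low is not None and value < low:
        return "L"
    if high is not None and value > high:
        return "H"
    return "N"


def flag_mismatches(payload):
    """Return test names whose reported flag disagrees with value and range."""
    mismatches = []
    for row in payload["results"]:
        try:
            value = float(row["value"])
        except (TypeError, ValueError):
            continue  # non-numeric values such as "Negative" are skipped
        expected = expected_flag(value, row["reference_range_low"],
                                 row["reference_range_high"])
        # "HH"/"LL" share a direction with "H"/"L", so compare the first letter only.
        reported = (row.get("abnormal_flag") or "N")[0]
        if reported != expected:
            mismatches.append(row["test_name"])
    return mismatches


print(flag_mismatches(json.loads(SAMPLE_RESPONSE)))  # prints []
```

A non-empty list is a good reason to send the document to review, because either the flag or one of the numbers was probably misread.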

The same endpoint, with the same JSON schema, handles the LabCorp version of a CBC panel without any configuration changes. The extraction layer normalizes the different source layouts to your schema.

Fabrx advantage: LOINC code mapping is built into the extraction schema. You can specify loinc_code as a target field and Fabrx will populate it from the document if present, or attempt to resolve it from the test name if not, reducing the need for a separate LOINC normalization step in your pipeline.

Your medical document extraction API, live in under 60 seconds.

No templates. No training data. EU AI Act compliant on the free plan.

Get started free →

Handling Pathology Reports: Narrative Fields vs. Structured Codes

Pathology reports present a specific challenge: they contain both narrative clinical prose and structured diagnostic data, often interleaved across multiple pages. A breast biopsy report might open with a clinical history summary, then include an intraoperative findings section, followed by a structured microscopic description, and conclude with a diagnostic impression that blends free text with ICD-10 codes.

A pure OCR approach extracts all of this as undifferentiated text. A template-based approach works for one lab's pathology format and breaks immediately when you onboard a second lab partner.

The extraction schema for pathology reports typically needs to handle two parallel structures:

  • Narrative fields: clinical impression, gross description, microscopic findings. These are extracted as strings, often multi-sentence. Some teams run a secondary LLM summarization step to condense these for downstream display.
  • Structured codes: diagnosis codes (ICD-10), procedure codes (CPT), tumor characteristics (grade, stage, receptor status). These need to be extracted as typed fields with validation, not as raw strings.

In Fabrx, you define both in the same schema. Narrative fields are typed as string. Structured codes are typed as string with enum validation or number with range constraints. The extraction model handles the distinction automatically: it understands that "Grade 2/3" in the body of the report should map to your tumor_grade field, not your clinical_impression field.
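
Whatever the extraction layer does, it is worth re-checking typed fields on your side before writing them to an EHR. The sketch below shows one way to do that. The field names (`er_status`, `icd10_code`, and so on), the allowed value sets, and the regular expression are all assumptions for illustration; the regex is only a loose shape check for an ICD-10 code and does not verify that the code exists.

```python
import re

# Assumed enumerations for this example; align them with your own schema.
ALLOWED_GRADES = {"1", "2", "3"}
ALLOWED_RECEPTOR_STATUS = {"positive", "negative", "equivocal", "not_tested"}
# Loose shape check only: letter, digit, alphanumeric, optional dotted suffix.
ICD10_SHAPE = re.compile(r"^[A-Z][0-9][0-9A-Z](\.[0-9A-Z]{1,4})?$")


def validate_pathology(record):
    """Return the names of fields that fail validation (empty list means clean)."""
    errors = []
    if record.get("tumor_grade") not in ALLOWED_GRADES:
        errors.append("tumor_grade")
    for marker in ("er_status", "pr_status", "her2_status"):
        if record.get(marker) not in ALLOWED_RECEPTOR_STATUS:
            errors.append(marker)
    if not ICD10_SHAPE.match(str(record.get("icd10_code") or "")):
        errors.append("icd10_code")
    # Narrative fields are free text, so the only check is that they are present.
    if not str(record.get("clinical_impression") or "").strip():
        errors.append("clinical_impression")
    return errors


good = {
    "tumor_grade": "2",
    "er_status": "positive",
    "pr_status": "positive",
    "her2_status": "negative",
    "icd10_code": "C50.911",
    "clinical_impression": "Invasive ductal carcinoma, grade 2.",
}
bad = dict(good, tumor_grade="Grade 2/3", icd10_code="breast")

print(validate_pathology(good))
print(validate_pathology(bad))
```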

For teams building EHR connectors, the ability to extract both narrative and structured fields in one pass, without a separate NLP pipeline, significantly reduces the number of moving parts in the integration architecture.

Patient Intake Form Extraction: Checkboxes, Handwriting, and Multi-Page Complexity

Patient intake forms are where most medical document extraction pipelines fail quietly. A checkbox indicating a positive family history of cardiac disease, missed because the patient used an X instead of a check mark, can affect clinical decision-making downstream. This is not a theoretical concern; it's a documented failure mode of OCR-only extraction pipelines in clinical settings.

Fabrx's extraction model handles intake forms with several capabilities that template-based tools lack:

  • Checkbox detection across handwriting styles. The model recognizes filled circles, X marks, check marks, and diagonal lines as positive checkbox states, not as noise.
  • Multi-page context. A patient may complete a medical history section on page two that references a condition disclosed on page four. Cross-page references are resolved within a single extraction call.
  • Handwritten field extraction. Emergency contact names, physician names, and insurance policy numbers written by hand are extracted with confidence scores. Low-confidence handwritten fields are flagged for human review rather than failing silently.
  • Structured output for conditional fields. Many intake forms use conditional logic ("if yes, specify"). Fabrx extracts both the checkbox state and the conditional text field as linked fields in the output JSON.

For teams building patient intake pipelines, the practical outcome is a reduction in the manual review queue. Instead of routing every form to a staff member for data entry, only forms with low-confidence fields (typically 5–15% of the total volume) require human verification.
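
Here is a minimal sketch of how that review routing might look for linked checkbox fields. The `CheckboxField` shape and the 0.85 threshold are assumptions for illustration and do not reflect the exact JSON Fabrx returns. Beyond low confidence, the sketch also flags a checkbox and its "if yes, specify" text that contradict each other.

```python
from typing import List, NamedTuple, Optional


class CheckboxField(NamedTuple):
    """A checkbox linked to its optional 'if yes, specify' text (assumed shape)."""
    name: str
    checked: bool
    confidence: float
    detail: Optional[str] = None
    detail_confidence: Optional[float] = None


def review_reasons(field: CheckboxField, threshold: float = 0.85) -> List[str]:
    """Return the reasons a field should go to human review (empty means none)."""
    reasons = []
    has_detail = bool((field.detail or "").strip())
    if field.confidence < threshold:
        reasons.append("low-confidence checkbox")
    if field.checked and not has_detail:
        reasons.append("checked but detail is empty")
    if not field.checked and has_detail:
        reasons.append("detail present but checkbox unchecked")
    if field.detail_confidence is not None and field.detail_confidence < threshold:
        reasons.append("low-confidence handwriting")
    return reasons


diabetes = CheckboxField("diabetes", True, 0.98, "Type 2, since 2015", 0.91)
cardiac = CheckboxField("family_cardiac_history", False, 0.62, "father, MI at 54", 0.88)

for f in (diabetes, cardiac):
    print(f.name, review_reasons(f))
```

The second field is the hand-drawn X case: the box was read as empty, but the patient wrote a detail next to it, so the form goes to a person instead of being stored as a "no".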

See also: how Fabrx handles scanned document OCR at scale.

Compliance You Don't Have to Ask Sales For: Audit Trails, PII Detection, and EU AI Act Readiness

Healthcare compliance conversations with enterprise customers follow a predictable script. "What data do you retain? How long? Where is the audit log? What's your PHI handling policy? Are you HIPAA-compliant?" If your document extraction layer can't answer these questions with documentation rather than a sales call, you're adding weeks to every enterprise deal.

Fabrx is built so that compliance answers are available at signup, not gated behind an enterprise tier.

Field-level audit trails

Every extraction call produces an audit record: timestamp, document hash, extraction endpoint version, field-level confidence scores, and the AI model used. This record is immutable and queryable. For HIPAA audit readiness, you can demonstrate exactly what was extracted from a specific document, when, and with what confidence, without reconstructing events from application logs.
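
If you also keep a copy of the audit data in your own system, one common way to make it tamper-evident is a hash chain. The sketch below assumes that approach. It is not a description of how Fabrx stores its records, and the record layout simply mirrors the fields listed above.

```python
import hashlib
import json
from datetime import datetime, timezone


def make_audit_record(document, endpoint_version, model, field_confidence, prev_hash=""):
    """Build one audit record and chain it to the previous record's hash."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "document_sha256": hashlib.sha256(document).hexdigest(),
        "endpoint_version": endpoint_version,
        "model": model,
        "field_confidence": field_confidence,
        "prev_hash": prev_hash,
    }
    # Hash a canonical serialization so any later edit changes the digest.
    canonical = json.dumps(record, sort_keys=True).encode()
    record["record_hash"] = hashlib.sha256(canonical).hexdigest()
    return record


def verify_chain(records):
    """Return True if every record is unmodified and linked to its predecessor."""
    prev = ""
    for rec in records:
        body = {k: v for k, v in rec.items() if k != "record_hash"}
        canonical = json.dumps(body, sort_keys=True).encode()
        digest = hashlib.sha256(canonical).hexdigest()
        if rec["prev_hash"] != prev or digest != rec["record_hash"]:
            return False
        prev = rec["record_hash"]
    return True


first = make_audit_record(b"pdf bytes 1", "v1", "example-model", {"value": 0.97})
second = make_audit_record(b"pdf bytes 2", "v1", "example-model", {"value": 0.99},
                           prev_hash=first["record_hash"])
print(verify_chain([first, second]))
```

Storing the document hash instead of the document keeps PHI out of the audit log while still proving which file was processed.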

PII detection and flagging

Fabrx runs automated detection for the 18 PHI identifiers defined in the HIPAA Safe Harbor standard: names, dates (other than year), geographic data below state level, phone numbers, fax numbers, email addresses, social security numbers, medical record numbers, and others. Detected PHI is flagged in the extraction output, enabling downstream de-identification workflows without a separate NLP pass.
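
For a sense of what pattern-based flagging involves, here is a deliberately naive sketch covering three of the identifier types. The regular expressions are assumptions that only handle common US formats. Real PHI detection needs far more than this, which is the argument for not hand-rolling it.

```python
import re

# Naive patterns for three identifier types; illustrative only.
PHI_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\(?\b\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def find_phi(text):
    """Return {identifier_type: [matches]} for each pattern that matches."""
    hits = {}
    for kind, pattern in PHI_PATTERNS.items():
        found = pattern.findall(text)
        if found:
            hits[kind] = found
    return hits


note = "Call Jane at (555) 123-4567 or jane.smith@example.com. SSN 123-45-6789."
print(find_phi(note))
```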

EU AI Act readiness

This is the compliance dimension that no competitor in the medical document extraction space has addressed. Under the EU AI Act, AI systems used in health and life sciences contexts can be classified as high-risk, for example when the system is a safety component of a regulated medical device. High-risk systems require conformity assessments, human oversight mechanisms, transparency documentation, and audit logging. If you're building a healthtech product for European markets, or for US customers who handle data from European patients, you need to determine whether your system falls into that category and plan for compliance if it does.

Fabrx's field-level data lineage, confidence scoring, and immutable audit trails are designed to satisfy EU AI Act high-risk system requirements out of the box. You don't need to build a separate compliance layer on top of the extraction pipeline.

Read more: GDPR and EU AI Act compliant document processing with Fabrx.

Compliance: Audit trails are available on every Fabrx plan, including the free tier. Bootstrapped healthtech startups can demonstrate field-level extraction provenance to early healthcare customers, a capability that competitors gate behind enterprise contracts.

BYOK: Why Your Healthtech Customers Will Ask About Your AI Provider

Enterprise healthcare customers increasingly require contractual control over which AI models process their data. The questions arrive in every vendor evaluation: "What model are you using? Does it train on our data? Where is inference happening? Can we use our own API keys?"

For most document extraction vendors, the honest answer is: "We use a shared model, we can't tell you exactly which one, and data residency control requires a custom enterprise agreement." That answer fails vendor security reviews at most health systems.

Fabrx supports Bring Your Own Key (BYOK) across 100+ model providers. You can configure extraction endpoints to use your organization's Azure OpenAI deployment, your own Anthropic API key, or a self-hosted model running in your VPC, without leaving the Fabrx interface. Model training opt-out is enforced at the provider level when you supply your own credentials.

Fabrx advantage: BYOK with 100+ provider options means you can answer your healthcare customer's "what AI are you using?" question with specifics: your own Azure OpenAI deployment, your own API key, your own data residency configuration. No vendor-level negotiation required.

For teams building for EU markets, BYOK also addresses data residency requirements under GDPR: you can route extraction through a provider with EU data center guarantees rather than relying on a third-party vendor's data processing agreement.

See: how Fabrx's no-code API builder integrates with your existing infrastructure.

Schema Versioning: What Happens When Quest Changes Their Format

This is the question no competitor addresses, but it's the one that actually determines long-term engineering maintenance cost.

Lab vendors update their report formats. Quest Diagnostics redesigns their PDF layout. A hospital system migrates their LIS to a new vendor. A pathology lab switches from one reporting template to another. When this happens with a template-based extraction system, the extraction breaks. Silently, if you're unlucky. With loud failures, if you're fortunate enough to have the right monitoring in place.

Schema versioning in Fabrx addresses this at the API contract level. Each extraction endpoint has a version. When a document format change causes extraction drift (confidence scores drop, field coverage decreases), Fabrx surfaces this in the observability dashboard before it causes production failures.

You can:

  • Pin a specific schema version for downstream systems that depend on a stable field contract, while testing an updated schema against new document samples in staging.
  • Run A/B extraction across two schema versions to measure field coverage and confidence score differences before migrating downstream consumers.
  • Review extraction history to identify exactly when a format change occurred and which document batch first triggered the drift.

For healthtech teams managing relationships with multiple lab partners, this means format changes become an observable, manageable event rather than a production incident that your on-call engineer discovers at 2am because a downstream alert fired on bad data.
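
The two drift signals named above, confidence and field coverage, are straightforward to compute over a batch of extractions if you want your own alerting in addition to the dashboard. The sketch below assumes each extraction is a dict of field name to `{"value", "confidence"}`. That shape, the `batch_stats` and `has_drifted` helpers, and the 0.05 tolerances are all illustrative assumptions.

```python
from statistics import mean


def batch_stats(batch, required_fields):
    """Compute field coverage and mean confidence over a batch of extractions."""
    populated, confidences = 0, []
    for doc in batch:
        for name in required_fields:
            field = doc.get(name)
            if field is not None and field["value"] is not None:
                populated += 1
                confidences.append(field["confidence"])
    total = len(batch) * len(required_fields)
    return {
        "coverage": populated / total if total else 0.0,
        "mean_confidence": mean(confidences) if confidences else 0.0,
    }


def has_drifted(baseline, current, max_coverage_drop=0.05, max_confidence_drop=0.05):
    """Flag drift when either metric falls by more than its tolerance."""
    return (baseline["coverage"] - current["coverage"] > max_coverage_drop
            or baseline["mean_confidence"] - current["mean_confidence"] > max_confidence_drop)


FIELDS = ["value", "unit"]
last_week = [
    {"value": {"value": "11.2", "confidence": 0.98}, "unit": {"value": "g/dL", "confidence": 0.96}},
    {"value": {"value": "8.4", "confidence": 0.98}, "unit": {"value": "K/uL", "confidence": 0.96}},
]
# Made-up batch after a layout change: one unit is missing and confidence is lower.
this_week = [
    {"value": {"value": "11.0", "confidence": 0.80}, "unit": {"value": None, "confidence": 0.0}},
    {"value": {"value": "7.9", "confidence": 0.82}, "unit": {"value": "K/uL", "confidence": 0.78}},
]

baseline = batch_stats(last_week, FIELDS)
current = batch_stats(this_week, FIELDS)
print(baseline, current, has_drifted(baseline, current))
```

Running a comparison like this per lab partner is what turns a format change into an observable event instead of a bad-data alert.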

Fabrx advantage: Schema versioning with drift detection means you find out about lab vendor format changes before they become production failures, not after. No competitor in the medical document extraction space offers schema versioning with built-in observability.

Getting to Your First Working Medical Extraction API in Under 60 Seconds

The 60-second claim is specific: from account creation to a live, callable extraction endpoint. Here's the actual sequence:

  1. Create a free account at app.fabrx.ai. No credit card required.
  2. Create a new extraction endpoint. Name it (e.g., "CBC Panel - Quest Diagnostics").
  3. Describe your schema in plain English. Type what fields you want. The schema builder generates the configuration.
  4. Upload a sample document to validate extraction. Review the output JSON. Adjust any field descriptions if needed.
  5. Click Deploy. Copy your endpoint URL and API key.
  6. Make your first API call. POST a PDF. Receive structured JSON.

Steps 1–5 typically take less than 60 seconds for a developer who has a sample document on hand. Steps 1–6, including the first successful API response, typically take under two minutes.

Compare this to AWS Textract: provisioning IAM roles, configuring S3 buckets for document storage, calling the Textract API, post-processing the key-value output, writing normalization logic, and building the schema mapping layer on top. That's a multi-day engineering project before you have a usable extraction endpoint for a single lab format.

Fabrx vs. Other Medical Document Extraction Tools

The competitive landscape for medical document extraction is fragmented. Here's an honest assessment of the main alternatives and where they fall short for healthtech builders:

Affinda

Affinda positions itself as enterprise-grade and offers a clean interface. The gap: no developer-first tutorial, no schema versioning, no BYOK, no EU AI Act readiness, and the path to a working integration goes through a sales conversation. For a healthtech startup that needs to ship in days, not weeks, the sales-gated onboarding is a blocker.

DocuPipe

DocuPipe offers a visual demo and claims HIPAA/SOC2/ISO compliance. The gap: no schema versioning, no BYOK, no EU AI Act coverage, and no time-to-deploy benchmarks. There's also no tutorial content that walks a developer through a medical document extraction use case end-to-end.

AWS Textract

Textract is the default choice for teams already in the AWS ecosystem. The gap: it produces raw key-value output that requires significant post-processing for medical documents, it's brittle on fax-quality scans, and it has no concept of schema definition, versioning, or audit trails. Every time a lab format changes, you're back to updating normalization code.

Landing.ai

Landing.ai's agentic document extraction is technically impressive. The gap: it requires data scientist involvement to configure effectively, there's no no-code path for the developer who wants to define a schema in plain English and get a live endpoint, and there's no compliance or audit trail coverage in their published materials.

Spike API

Spike is purpose-built for blood test and lab report OCR with LOINC mapping, the narrowest offering in this space. The gap: it handles blood tests only, offers no schema customization for other medical document types, and has no compliance depth beyond basic SOC 2.

Fabrx

Fabrx is the only option that combines: conversational schema building (no ML required), sub-60-second deployment, field-level audit trails on every plan including free, BYOK with 100+ providers, schema versioning with drift detection, and EU AI Act readiness. It handles lab results, pathology reports, and patient intake forms from the same interface.

For the healthtech developer who needs to ship a compliant, maintainable medical document extraction pipeline without a data science team and without a sales conversation, Fabrx is the only tool in this space that removes all of those blockers simultaneously.

Whether you're building an EHR integration that ingests faxed lab results, a clinical workflow tool that needs structured pathology data, or a patient portal that processes scanned intake forms, the fastest path from PDF to structured JSON is a 60-second deploy, not a six-week vendor evaluation.
