
Extraction Pipeline

Turn insurance PDFs into structured data with the multi-pass extraction pipeline

Cell's extraction pipeline processes insurance documents in multiple passes, producing structured data with page-level provenance.

Policy extraction

extractFromPdf runs the full pipeline (passes 1-3) for policy documents:

import { extractFromPdf, applyExtracted } from "@claritylabs-inc/cell";

const { rawText, extracted } = await extractFromPdf(pdfBase64, {
  log: async (msg) => console.log(msg),
  onMetadata: async (raw) => {
    // Save metadata immediately — survives if pass 2 fails
    await db.saveMetadata(docId, raw);
  },
});

const fields = applyExtracted(extracted);

What gets extracted

Pass 1 — Metadata:

  • Carrier, security, underwriter, MGA, broker
  • Policy number, effective/expiration dates, policy year
  • Premium, insured name, policy types
  • Coverage table (name, limit, deductible, page number)

Pass 2 — Sections:

  • Structured sections with title, page range, type, content
  • Subsections with section numbers
  • Coverage type classification per section

Pass 3 — Enrichment:

  • Regulatory context (structured)
  • Complaint contacts
  • Costs and fees
  • Claims contacts
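Putting the three passes together, the extraction result looks roughly like the type below. This is a hedged sketch — the field names are inferred from the lists above, not the package's published types, so treat it as an orientation aid rather than a reference:

```typescript
// Hypothetical shape of the extracted result. Field names are assumptions
// based on the pass descriptions above, not the real package types.
interface CoverageRow {
  name: string;
  limit?: string;
  deductible?: string;
  page?: number; // page-level provenance
}

interface ExtractedPolicy {
  // Pass 1 — metadata
  carrier?: string;
  policyNumber?: string;
  effectiveDate?: string;
  expirationDate?: string;
  premium?: string;
  insuredName?: string;
  coverages?: CoverageRow[];
  // Pass 2 — sections
  sections?: { title: string; pages: [number, number]; type?: string; content: string }[];
  // Pass 3 — enrichment
  claimsContacts?: string[];
}

const example: ExtractedPolicy = {
  carrier: "Acme Insurance",
  policyNumber: "POL-123",
  coverages: [{ name: "General Liability", limit: "$1,000,000", page: 4 }],
};
```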

Quote extraction

extractQuoteFromPdf runs a quote-specific pipeline (passes 1-2):

import { extractQuoteFromPdf, applyExtractedQuote } from "@claritylabs-inc/cell";

const { extracted } = await extractQuoteFromPdf(pdfBase64);
const fields = applyExtractedQuote(extracted);

In addition to standard metadata, quote extraction captures:

  • Premium breakdown — per-line amounts, e.g. [{ line: "GL", amount: "$5,200" }]
  • Subjectivities — conditions for binding
  • Underwriting conditions — carrier requirements
  • Proposed dates — effective, expiration, quote expiration
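Note that amounts in the premium breakdown arrive as display strings, so totaling them means stripping the currency formatting first. A minimal sketch, assuming the entry shape shown in the example above:

```typescript
// Each breakdown entry pairs a line of business with a display amount,
// e.g. { line: "GL", amount: "$5,200" } as shown above.
type PremiumLine = { line: string; amount: string };

// Strip "$" and "," before parsing; returns NaN for unparseable input.
function parseAmount(amount: string): number {
  return Number(amount.replace(/[$,]/g, ""));
}

function totalPremium(breakdown: PremiumLine[]): number {
  return breakdown.reduce((sum, l) => sum + parseAmount(l.amount), 0);
}

const total = totalPremium([
  { line: "GL", amount: "$5,200" },
  { line: "Property", amount: "$12,300" },
]);
// total === 17500
```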

Retrying sections

If metadata succeeded but sections failed, retry just pass 2 without re-extracting metadata:

import { extractSectionsOnly } from "@claritylabs-inc/cell";

const { extracted } = await extractSectionsOnly(pdfBase64, savedMetadataRaw, {
  log: async (msg) => console.log(msg),
});
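One way to wire this into an orchestrator: run the full pipeline when no saved metadata exists, and take the sections-only path when pass 1 already succeeded on a prior attempt. The sketch below stubs the library calls behind a `deps` parameter so the control flow is testable — the stub signatures mirror the real exports only loosely:

```typescript
// Control-flow sketch. The deps shape is a simplification, not the
// package's real API surface.
type ExtractResult = { extracted: { sections?: unknown[] } };

async function runExtraction(
  pdfBase64: string,
  savedMetadataRaw: string | null,
  deps: {
    extractFromPdf: (
      pdf: string,
      opts: { onMetadata: (raw: string) => Promise<void> }
    ) => Promise<ExtractResult>;
    extractSectionsOnly: (pdf: string, metadataRaw: string) => Promise<ExtractResult>;
    saveMetadata: (raw: string) => Promise<void>;
  }
): Promise<ExtractResult> {
  if (savedMetadataRaw) {
    // Metadata from an earlier run survived — retry only pass 2.
    return deps.extractSectionsOnly(pdfBase64, savedMetadataRaw);
  }
  // Fresh document — run all passes, persisting metadata before pass 2.
  return deps.extractFromPdf(pdfBase64, {
    onMetadata: deps.saveMetadata,
  });
}
```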

Chunking strategy

Documents are split into page chunks for section extraction. Cell uses an adaptive strategy:

  1. Start with 15-page chunks
  2. On JSON parse failure (output truncation), re-split to 10-page chunks
  3. If still failing, re-split to 5-page chunks
  4. If all sizes fail, escalate to the sectionsFallback model with higher token limits

getPageChunks computes the page ranges for a given page count and chunk size:

import { getPageChunks } from "@claritylabs-inc/cell";

const chunks = getPageChunks(45, 15);
// [[1, 15], [16, 30], [31, 45]]
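The adaptive strategy amounts to a loop over decreasing chunk sizes. The sketch below reimplements the page-range arithmetic locally so the escalation loop is self-contained; it is an illustration of the steps above, not the package's internal code:

```typescript
// Local reimplementation of the page-chunking arithmetic for illustration;
// the real getPageChunks lives in @claritylabs-inc/cell.
function pageChunks(totalPages: number, size: number): [number, number][] {
  const chunks: [number, number][] = [];
  for (let start = 1; start <= totalPages; start += size) {
    chunks.push([start, Math.min(start + size - 1, totalPages)]);
  }
  return chunks;
}

// Try progressively smaller chunks; escalate to a fallback on exhaustion.
async function extractWithAdaptiveChunks(
  totalPages: number,
  extractChunk: (range: [number, number]) => Promise<string>,
  fallback: () => Promise<unknown[]>
): Promise<unknown[]> {
  for (const size of [15, 10, 5]) {
    try {
      const results: unknown[] = [];
      for (const range of pageChunks(totalPages, size)) {
        // JSON.parse throws on truncated output, triggering a re-split.
        results.push(JSON.parse(await extractChunk(range)));
      }
      return results;
    } catch {
      // Output truncated at this chunk size — re-split smaller.
    }
  }
  return fallback(); // e.g. the sectionsFallback model with higher limits
}
```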

Merging results

After chunked extraction, results are merged:

import { mergeChunkedSections, mergeChunkedQuoteSections } from "@claritylabs-inc/cell";

// Policies
const merged = mergeChunkedSections(metadataResult, sectionChunks);

// Quotes (also merges subjectivities + underwriting conditions)
const quoteMerged = mergeChunkedQuoteSections(metadataResult, sectionChunks);
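Conceptually, the merge concatenates per-chunk section lists (which arrive in page order) onto the metadata result. The sketch below is a hypothetical simplification — field names and merge semantics are illustrative, not the package's actual behavior:

```typescript
// Hypothetical sketch of what chunk merging amounts to: sections from
// each page chunk, already in page order, are concatenated onto the
// metadata result. Illustrative field names only.
type Section = { title: string; pages: [number, number] };

function mergeSections<M extends object>(
  metadata: M,
  chunks: Section[][]
): M & { sections: Section[] } {
  return { ...metadata, sections: chunks.flat() };
}
```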

Early persistence

The onMetadata callback fires after pass 1 completes, before sections extraction begins. This ensures metadata is persisted even if pass 2 fails:

const { extracted } = await extractFromPdf(pdfBase64, {
  onMetadata: async (raw) => {
    const parsed = JSON.parse(raw);
    await db.patch(docId, {
      carrier: parsed.metadata.carrier,
      extractionStatus: "metadata_complete",
    });
  },
});

Utility functions

stripFences(text)

Removes markdown code fences from AI responses before JSON parsing:

import { stripFences } from "@claritylabs-inc/cell";

stripFences('```json\n{"key": "value"}\n```');
// '{"key": "value"}'
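The behavior amounts to trimming a leading fence line (with an optional language tag) and a trailing fence. A minimal sketch of an equivalent function — not the package's actual implementation:

```typescript
// Equivalent-behavior sketch: strips one leading fence line (with an
// optional language tag) and one trailing fence, then trims whitespace.
// Input without fences passes through unchanged.
function stripFencesSketch(text: string): string {
  return text
    .replace(/^\s*```[a-zA-Z]*\s*\n?/, "")
    .replace(/\n?```\s*$/, "")
    .trim();
}
```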

sanitizeNulls(obj)

Recursively converts null values to undefined. Required for frameworks like Convex that reject null for optional fields:

import { sanitizeNulls } from "@claritylabs-inc/cell";

sanitizeNulls({ a: null, b: [null, 1] });
// { a: undefined, b: [undefined, 1] }
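The recursion is straightforward — a sketch of an equivalent transform (illustrative, not the package source):

```typescript
// Recursively replace null with undefined. Arrays keep their length and
// objects keep their keys, now holding undefined where null was.
function sanitizeNullsSketch(value: unknown): unknown {
  if (value === null) return undefined;
  if (Array.isArray(value)) return value.map(sanitizeNullsSketch);
  if (typeof value === "object") {
    return Object.fromEntries(
      Object.entries(value as Record<string, unknown>).map(([k, v]) => [
        k,
        sanitizeNullsSketch(v),
      ])
    );
  }
  return value;
}
```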
