# Extraction Pipeline

Turn insurance PDFs into structured data with the multi-pass extraction pipeline.
Cell's extraction pipeline processes insurance documents in multiple passes, producing structured data with page-level provenance.
## Policy extraction

`extractFromPdf` runs the full pipeline (passes 1-3) for policy documents:
```typescript
import { extractFromPdf, applyExtracted } from "@claritylabs-inc/cell";

const { rawText, extracted } = await extractFromPdf(pdfBase64, {
  log: async (msg) => console.log(msg),
  onMetadata: async (raw) => {
    // Save metadata immediately — survives if pass 2 fails
    await db.saveMetadata(docId, raw);
  },
});

const fields = applyExtracted(extracted);
```
### What gets extracted
**Pass 1 — Metadata:**
- Carrier, security, underwriter, MGA, broker
- Policy number, effective/expiration dates, policy year
- Premium, insured name, policy types
- Coverage table (name, limit, deductible, page number)
**Pass 2 — Sections:**
- Structured sections with title, page range, type, content
- Subsections with section numbers
- Coverage type classification per section
**Pass 3 — Enrichment:**
- Regulatory context (structured)
- Complaint contacts
- Costs and fees
- Claims contacts
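The pass-1 fields listed above might be modeled roughly as follows. This is an illustrative sketch; the field names are assumptions for readability, not the library's actual type definitions:

```typescript
// Illustrative shape for one row of the pass-1 coverage table.
interface CoverageRow {
  name: string;
  limit: string;
  deductible: string;
  page: number; // page-level provenance
}

// Illustrative shape for pass-1 metadata (security, underwriter, MGA,
// broker, etc. omitted for brevity).
interface PolicyMetadata {
  carrier: string;
  policyNumber: string;
  effectiveDate: string;
  expirationDate: string;
  premium: string;
  insuredName: string;
  policyTypes: string[];
  coverages: CoverageRow[];
}

// Example value conforming to the sketch above.
const example: PolicyMetadata = {
  carrier: "Acme Insurance Co.",
  policyNumber: "POL-123456",
  effectiveDate: "2024-01-01",
  expirationDate: "2025-01-01",
  premium: "$12,000",
  insuredName: "Example LLC",
  policyTypes: ["GL"],
  coverages: [
    { name: "General Liability", limit: "$1M", deductible: "$10k", page: 4 },
  ],
};
```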
## Quote extraction

`extractQuoteFromPdf` runs a quote-specific pipeline (passes 1-2):
```typescript
import { extractQuoteFromPdf, applyExtractedQuote } from "@claritylabs-inc/cell";

const { extracted } = await extractQuoteFromPdf(pdfBase64);
const fields = applyExtractedQuote(extracted);
```
In addition to standard metadata, quotes extract:
- Premium breakdown — `[{ line: "GL", amount: "$5,200" }]`
- Subjectivities — conditions for binding
- Underwriting conditions — carrier requirements
- Proposed dates — effective, expiration, quote expiration
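As a small usage sketch, assuming premium breakdown entries look like the `[{ line: "GL", amount: "$5,200" }]` example above, a quote total could be computed like this (`totalPremium` is illustrative, not part of the library):

```typescript
// Sum a quote's premium breakdown; amounts are currency strings like "$5,200".
function totalPremium(breakdown: { line: string; amount: string }[]): number {
  return breakdown.reduce(
    // Strip everything but digits and the decimal point before converting
    (sum, { amount }) => sum + Number(amount.replace(/[^0-9.]/g, "")),
    0,
  );
}

totalPremium([
  { line: "GL", amount: "$5,200" },
  { line: "Auto", amount: "$1,300" },
]);
// 6500
```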
## Retrying sections

If metadata succeeded but sections failed, retry just pass 2 without re-extracting metadata:
```typescript
import { extractSectionsOnly } from "@claritylabs-inc/cell";

const { extracted } = await extractSectionsOnly(pdfBase64, savedMetadataRaw, {
  log: async (msg) => console.log(msg),
});
```
## Chunking strategy
Documents are split into page chunks for section extraction. Cell uses an adaptive strategy:
- Start with 15-page chunks
- On JSON parse failure (output truncation), re-split to 10-page chunks
- If still failing, re-split to 5-page chunks
- If all sizes fail, escalate to the `sectionsFallback` model with higher token limits
```typescript
import { getPageChunks } from "@claritylabs-inc/cell";

const chunks = getPageChunks(45, 15);
// [[1, 15], [16, 30], [31, 45]]
```
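The adaptive re-split loop can be sketched as follows. The local `pageChunks` helper mirrors the `getPageChunks` output shown above, and `extractChunk` is a hypothetical stand-in for the per-chunk model call; neither is the library's actual implementation:

```typescript
// Split `totalPages` into inclusive [start, end] ranges of at most `size`
// pages, mirroring the getPageChunks output shown above.
function pageChunks(totalPages: number, size: number): [number, number][] {
  const chunks: [number, number][] = [];
  for (let start = 1; start <= totalPages; start += size) {
    chunks.push([start, Math.min(start + size - 1, totalPages)]);
  }
  return chunks;
}

// Hypothetical driver for the 15 -> 10 -> 5 re-split strategy; `extractChunk`
// stands in for the per-chunk model call returning raw JSON text.
async function extractWithResplit(
  totalPages: number,
  extractChunk: (range: [number, number]) => Promise<string>,
): Promise<unknown[]> {
  for (const size of [15, 10, 5]) {
    try {
      const parsed: unknown[] = [];
      for (const range of pageChunks(totalPages, size)) {
        // JSON.parse throws on truncated output, triggering a re-split
        parsed.push(JSON.parse(await extractChunk(range)));
      }
      return parsed;
    } catch {
      // Fall through to the next, smaller chunk size
    }
  }
  throw new Error("All chunk sizes failed; escalate to the sectionsFallback model");
}
```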
## Merging results
After chunked extraction, results are merged:
```typescript
import { mergeChunkedSections, mergeChunkedQuoteSections } from "@claritylabs-inc/cell";

// Policies
const merged = mergeChunkedSections(metadataResult, sectionChunks);

// Quotes (also merges subjectivities + underwriting conditions)
const quoteMerged = mergeChunkedQuoteSections(metadataResult, sectionChunks);
```
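Conceptually, the merge attaches each chunk's sections to the pass-1 metadata result in page order. A simplified sketch, with illustrative field names rather than the library's actual shapes:

```typescript
// Illustrative shape of one chunk's section-extraction result.
interface SectionChunk {
  sections: { title: string; pageStart: number; pageEnd: number }[];
}

// Simplified merge: flatten the chunk sections, order them by starting page,
// and attach them to the metadata result.
function mergeSectionsSketch<M extends object>(
  metadata: M,
  chunks: SectionChunk[],
) {
  const sections = chunks
    .flatMap((chunk) => chunk.sections)
    .sort((a, b) => a.pageStart - b.pageStart);
  return { ...metadata, sections };
}
```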
## Early persistence

The `onMetadata` callback fires after pass 1 completes, before section extraction begins. This ensures metadata is persisted even if pass 2 fails:
```typescript
const { extracted } = await extractFromPdf(pdfBase64, {
  onMetadata: async (raw) => {
    const parsed = JSON.parse(raw);
    await db.patch(docId, {
      carrier: parsed.metadata.carrier,
      extractionStatus: "metadata_complete",
    });
  },
});
```
## Utility functions

### `stripFences(text)`

Removes markdown code fences from AI responses before JSON parsing:
```typescript
import { stripFences } from "@claritylabs-inc/cell";

stripFences('```json\n{"key": "value"}\n```');
// '{"key": "value"}'
```
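For context, a fence-stripping helper of this kind can be written in a few lines. This is an illustration of the technique, not the library's implementation:

```typescript
// Minimal fence stripper: drop a leading fence line (with optional language
// tag) and a trailing fence line, leaving the payload intact.
function stripFencesSketch(text: string): string {
  return text
    .replace(/^\s*```[\w-]*\s*\n?/, "") // opening fence, e.g. "```json\n"
    .replace(/\n?```\s*$/, "") // closing fence
    .trim();
}
```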
### `sanitizeNulls(obj)`

Recursively converts `null` values to `undefined`. Required for frameworks like Convex that reject `null` for optional fields:
```typescript
import { sanitizeNulls } from "@claritylabs-inc/cell";

sanitizeNulls({ a: null, b: [null, 1] });
// { a: undefined, b: [undefined, 1] }
```
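A recursive null-to-undefined walk can be sketched as follows for plain objects and arrays. Again, this is an illustration, not the library's implementation:

```typescript
// Minimal null sanitizer: walk the value recursively and replace every null
// with undefined. Handles plain objects and arrays only (not Dates, Maps, etc.).
function sanitizeNullsSketch(value: unknown): unknown {
  if (value === null) return undefined;
  if (Array.isArray(value)) return value.map(sanitizeNullsSketch);
  if (typeof value === "object") {
    return Object.fromEntries(
      Object.entries(value as Record<string, unknown>).map(([key, val]) => [
        key,
        sanitizeNullsSketch(val),
      ]),
    );
  }
  return value;
}
```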