Deterministic extraction before AI
DocAssessment runs seven deterministic steps before the AI layer sees any content. Each step is grounded in an open-source library or published standard, cited below.
Most AI document tools send your raw document directly to a language model and ask it to summarize or interpret. That approach produces confident-sounding output that can invent facts, a failure mode known as hallucination. DocAssessment takes the opposite approach: the AI layer only ever sees already-extracted structured data, never the raw document. The seven steps below describe how that structured data is produced and validated.
Last reviewed: April 21, 2026
1. Text Extraction
The first step converts the uploaded file into plain text. The extractor selected depends on the file type:
- PDF: the MuPDF library handles native text PDFs and preserves positional data that later feeds table parsing[1].
- Word documents: Mammoth converts .docx files to plain text while stripping styling artifacts[2].
- Scanned images and image-only PDFs: Tesseract.js runs OCR in WebAssembly and reports a confidence score that later gates the qualification step[3].
- Pasted text: the input is already plain text, so step 1 is skipped and the pipeline jumps to type detection.
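The routing above can be sketched as a small dispatch function. The function name, the extractor labels, and the `hasTextLayer` flag are illustrative assumptions, not DocAssessment's actual API:

```javascript
// Sketch of step 1 routing: pick an extractor by input type.
// Labels ("mupdf", "mammoth", "tesseract") are illustrative,
// not the pipeline's real identifiers.
function selectExtractor(input) {
  if (input.pastedText) return "none"; // already plain text; skip to type detection
  const ext = input.filename.split(".").pop().toLowerCase();
  if (ext === "pdf") return input.hasTextLayer ? "mupdf" : "tesseract";
  if (ext === "docx") return "mammoth";
  if (["png", "jpg", "jpeg", "tiff"].includes(ext)) return "tesseract";
  throw new Error(`Unsupported file type: .${ext}`);
}
```

The key property is that the branch taken is a pure function of the file, so the same upload always reaches the same extractor.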
No AI model is involved in text extraction. Every character that continues into the pipeline came directly from the file or image the user uploaded.
2. Document Type Detection
Extracted text is classified as one of nine supported types — residential lease, employment agreement, insurance policy, contract, invoice, quote, and several related variants — using keyword routing against a curated lexicon. The classifier is a short, auditable JavaScript module, not a model. If no category scores high enough, the type returns unknown and the qualification gate refuses the document before payment is requested.
Keyword-based classification was chosen over statistical classifiers because it is deterministic: the same text always produces the same label, and each label is traceable to the token set that triggered it.
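A minimal sketch of that keyword routing follows. The lexicon entries and the score threshold are invented for illustration; the real curated lexicon is larger and tuned per type:

```javascript
// Sketch of step 2: deterministic keyword routing against a lexicon.
// Keywords and threshold are illustrative, not the production lexicon.
const LEXICON = {
  residential_lease: ["lease", "tenant", "landlord", "premises", "rent"],
  employment_agreement: ["employee", "employer", "salary", "duties"],
  invoice: ["invoice", "amount due", "bill to", "payment terms"],
};

function classify(text, threshold = 2) {
  const lower = text.toLowerCase();
  let best = { type: "unknown", score: 0 };
  for (const [type, keywords] of Object.entries(LEXICON)) {
    const score = keywords.filter((k) => lower.includes(k)).length;
    if (score > best.score) best = { type, score };
  }
  // Below threshold → unknown, and the qualification gate refuses the document.
  return best.score >= threshold ? best.type : "unknown";
}
```

Because the label is just the highest-scoring token set, each classification is traceable: logging `best` alongside the label shows exactly which keywords fired.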
3. Content Segmentation
The document is split into labelled sections — header block, parties, term, payment, obligations, termination, references, signatures — using regex patterns tuned per document type. Section boundaries matter because downstream fact extraction weights the same token differently depending on which section it appears in. A dollar amount in a "payment" section, for example, is treated as rent or purchase price; the same amount in a "damages" section is treated as a risk signal.
When the structural pass finds fewer than three sections in certain agreement types, the segmenter falls back to paragraph-level splits so the downstream steps still have enough granularity to produce useful facts.
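The structural pass and its paragraph-level fallback can be sketched like this. The heading patterns and section labels are simplified stand-ins for the per-type tuned rules:

```javascript
// Sketch of step 3: label paragraphs by heading regexes, falling back to
// plain paragraph splits when fewer than three sections are recognised.
// Patterns are illustrative, not the tuned per-document-type rules.
const SECTION_PATTERNS = [
  { label: "parties", re: /^(parties|between)\b/im },
  { label: "term", re: /^term\b/im },
  { label: "payment", re: /^(rent|payment)\b/im },
  { label: "termination", re: /^termination\b/im },
];

function segment(text) {
  const paragraphs = text.split(/\n\s*\n/).map((p) => p.trim()).filter(Boolean);
  const sections = paragraphs.map((p) => {
    const match = SECTION_PATTERNS.find(({ re }) => re.test(p));
    return { label: match ? match.label : "unlabelled", text: p };
  });
  const labelled = sections.filter((s) => s.label !== "unlabelled");
  // Too few recognised sections → keep paragraph-level granularity instead.
  if (labelled.length < 3) {
    return paragraphs.map((p) => ({ label: "paragraph", text: p }));
  }
  return sections;
}
```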
4. Fact Extraction
Parties, dates, money amounts, and places are extracted using a combination of regular expressions and the Compromise natural-language library[4]. Compromise runs entirely in-process, with no network calls and no model. It recognises entity suffixes like LLC and Inc., normalises date strings, and tags money tokens with their currency.
Each candidate fact is scored for context validity — is it inside a plausible section, does it have the expected neighbours, is it a false positive such as an invoice number that looks like a dollar amount — and capped per category to prevent a noisy document from flooding the downstream explanation.
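A dependency-free sketch of the money-extraction path, without Compromise, shows how the section label drives interpretation and how the per-category cap works. The `role` names and the cap of five are illustrative:

```javascript
// Sketch of step 4: regex-extract money amounts, interpret each by the
// section it appears in (per step 3), and cap results per category.
// Role names and the cap are illustrative assumptions.
function extractMoney(sections, cap = 5) {
  const facts = [];
  for (const { label, text } of sections) {
    for (const m of text.matchAll(/\$[\d,]+(?:\.\d{2})?/g)) {
      facts.push({
        section: label,
        amount: m[0],
        // Same token, different meaning by section: payment → price,
        // damages → risk signal.
        role: label === "payment" ? "price"
            : label === "damages" ? "risk_signal"
            : "other",
      });
      if (facts.length >= cap) return facts; // prevent flooding downstream
    }
  }
  return facts;
}
```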
5. Risk Scoring
A heuristic scorer inspects the extracted clauses and facts for risk signals: automatic renewal without notice, non-compete clauses that exceed the enforceability range in the governing state, one-directional indemnification, security deposits above the state statutory cap, and similar patterns. Each signal contributes a point weight, the document's risk score is the sum, and the contributing tokens are stored so the risk flag can be traced back to source text in the report.
The scorer is a pure function of the extracted data, not a model. Calibrating the weights is a configuration exercise — there is a published changelog of weight adjustments so anyone reviewing the output can audit why a given flag fired.
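The shape of that pure function can be sketched as follows. The signal patterns and weights here are invented for illustration; the real weights live in the published configuration changelog:

```javascript
// Sketch of step 5: sum configured weights for each matched risk signal,
// keeping the triggering token so every flag is traceable to source text.
// Signal ids, patterns, and weights are illustrative.
const RISK_SIGNALS = [
  { id: "auto_renewal", re: /automatic(ally)?\s+renew/i, weight: 3 },
  { id: "non_compete", re: /non-?compete/i, weight: 2 },
  { id: "one_way_indemnity", re: /indemnify\s+and\s+hold\s+harmless/i, weight: 4 },
];

function scoreRisk(text) {
  const flags = [];
  for (const { id, re, weight } of RISK_SIGNALS) {
    const m = text.match(re);
    if (m) flags.push({ id, weight, token: m[0] }); // token surfaces in the report
  }
  return { score: flags.reduce((sum, f) => sum + f.weight, 0), flags };
}
```

Same input, same score: there is no sampling and no model, so the only way the score changes is a weight change in configuration.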
6. AI Explanation
Only after the first five steps produce structured data is an AI model invoked. The OpenRouter gateway[5] routes the request to a model selected for the job (typically an Anthropic Claude variant). The prompt contains the extracted facts, segment labels, and risk tokens — never the raw document body.
The model's job is narrow: rewrite the already-extracted facts and risks in plain language, grouping related points and flagging where context would help a non-specialist reader. It cannot add facts that were not in the structured input; the output validator in step 7 rejects any output that contains numbers, dates, or party names that do not appear in the extraction.
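The "no new facts" guard can be sketched as a check that every number-like token in the model output already exists in the extraction. The matching rules here are deliberately simplified; a real check would cover dates and party names too:

```javascript
// Sketch of the no-novel-facts guard: flag model output containing
// number-like tokens absent from the extracted facts.
// Matching rules are simplified for illustration.
function containsNovelFacts(output, extractedFacts) {
  const known = new Set(extractedFacts.map((f) => String(f).toLowerCase()));
  // Pull amount/number-like tokens from the model output.
  const tokens = output.match(/\$?[\d,]+(?:\.\d+)?/g) || [];
  return tokens.some((t) => !known.has(t.toLowerCase()));
}
```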
7. Output Validation
Every AI response is parsed against a Zod schema[6] that enforces the shape of the explanation — required fields, length bounds, a forbidden-word list, and cardinality limits on the number of key points and risk flags. Output that fails validation is either retried with a stricter prompt or dropped in favour of the deterministic extraction alone; the user sees the extraction either way, so a validation failure does not leave them without a report.
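The kinds of constraints that schema enforces can be sketched without the library. The field names (`summary`, `keyPoints`, `riskFlags`) and the bounds are illustrative, not the real schema:

```javascript
// Hand-rolled sketch of the shape checks described above: required
// fields, length bounds, and cardinality limits on key points and flags.
// Field names and limits are illustrative assumptions.
function validateExplanation(out) {
  const errors = [];
  if (typeof out.summary !== "string") errors.push("summary missing");
  else if (out.summary.length > 2000) errors.push("summary too long");
  if (!Array.isArray(out.keyPoints) || out.keyPoints.length > 10)
    errors.push("keyPoints missing or over cardinality limit");
  if (!Array.isArray(out.riskFlags) || out.riskFlags.length > 8)
    errors.push("riskFlags missing or over cardinality limit");
  return { valid: errors.length === 0, errors };
}
```

With Zod itself this collapses to a single `z.object({...})` declaration; the hand-rolled version is shown only to keep the sketch dependency-free.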
The structured-data markup for this page, and for every published article, follows the schema.org Article specification[7], with the DocAssessment organization as the declared author. There are no author-identity or personal-bio fields — the analysis is produced by the pipeline described above, not by a named individual.
What this means for you
Because the AI never sees raw document text, it cannot hallucinate facts that are not in your document. Because the extraction is deterministic, running the same document through the pipeline twice produces the same facts and the same risk score. Because every risk flag carries the source tokens that triggered it, you can verify each flag against the original text before acting on it.
If you want to see the pipeline in action, head to the upload page. If you want to read more background on a specific document type, the articles library covers lease law, employment contracts, insurance, and related topics by jurisdiction.
References
1. MuPDF — lightweight PDF, XPS, and eBook viewer/toolkit (accessed April 2026).
2. Mammoth — convert Word documents to HTML / plain text (accessed April 2026).
3. Tesseract.js — pure JavaScript OCR for 100+ languages (accessed April 2026).
4. Compromise — modest natural language processing for JavaScript (accessed April 2026).
5. OpenRouter Documentation — unified LLM API gateway (accessed April 2026).
6. Zod — TypeScript-first schema validation with static type inference (accessed April 2026).
7. schema.org — Article type specification (accessed April 2026).