Route every PDF conversion through a short analysis step before choosing tools or CLI flags.
The goal is not "extract the most text". The goal is:
.md, .html, .txt, .json, .docx, or structured notes.
Never start with one fixed default pipeline.
Always:
Heuristics are starting points, not guarantees.
Do not promote one flag combination into a universal default just because it worked well on one PDF. Prefer document-specific evidence over habit.
Use opendataloader-pdf as the primary conversion engine for every PDF conversion task by default.
This skill should assume:
opendataloader-pdf is always the first conversion attempt
Use other tools only for one of these reasons:
opendataloader-pdf cannot produce a usable result
Identify the document class as quickly as possible:
Useful fast checks:
pdfinfo input.pdf
pdftotext -layout input.pdf -
If text is missing or very poor, treat OCR as required.
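The fast check above can be turned into a rough decision rule. The sketch below is a minimal illustration, not a fixed policy: the helper names and the `min_chars_per_page` threshold of 200 are assumptions chosen for the example, and the threshold should be tuned per document set. It relies only on the `Pages:` line that `pdfinfo` prints and on the text that `pdftotext -layout input.pdf -` writes to stdout.

```python
import subprocess


def page_count_from_pdfinfo(pdfinfo_output: str) -> int:
    """Read the page count from the 'Pages:' line of pdfinfo output."""
    for line in pdfinfo_output.splitlines():
        if line.startswith("Pages:"):
            return int(line.split(":", 1)[1].strip())
    raise ValueError("no 'Pages:' line found in pdfinfo output")


def needs_ocr(extracted_text: str, page_count: int,
              min_chars_per_page: int = 200) -> bool:
    """Return True when the text layer looks missing or too thin to trust.

    min_chars_per_page is a hypothetical threshold, not a measured value.
    """
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    # Count only non-whitespace characters so -layout padding does not
    # make an almost empty page look full.
    visible_chars = sum(1 for ch in extracted_text if not ch.isspace())
    return visible_chars / page_count < min_chars_per_page


def extract_text_layer(pdf_path: str) -> str:
    """Run `pdftotext -layout <pdf> -` and return what it prints."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

A low character count is only a hint. A PDF with a dense but garbled text layer can pass this check and still need OCR, so read a sample of the extracted text as well.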
Use these as default starting points:
medical / lab report
markdown-with-html + --table-method cluster + --image-output off
slide deck / PowerPoint export
markdown-with-html + --image-output off
add --table-method cluster only if the default route under-structures important tabular content
if tables are visually obvious but missing or badly fused, treat this as a detection problem, not a Markdown formatting problem
if the selected route already reconstructs a real table but clips leading characters at column boundaries, treat that as a boundary-splitting defect, not a missing-table failure
narrative / article / letter
start with markdown or text
use markdown-with-html only if structure clearly matters
table-heavy business / finance PDF
start with markdown-with-html
add --table-method cluster when rows or columns flatten
scanned / image-heavy PDF
OCR first, then convert with opendataloader-pdf
mixed-layout PDF
prefer markdown-with-html
validate one easy section and one hard section before accepting output
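The starting points above can be kept as a small lookup table so the choice stays explicit and easy to revise. This is a sketch under assumptions: the class labels (`medical`, `slides`, and so on) and the `Route` type are hypothetical names, and passing `markdown` to `-f` assumes the plain formats are selected the same way as `markdown-with-html`. No format is listed for scanned PDFs, so none is set for them here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    flags: tuple[str, ...]      # opendataloader-pdf flags to start with
    ocr_first: bool = False     # True when OCR must run before conversion


TABLE_CLUSTER = ("--table-method", "cluster")
NO_IMAGES = ("--image-output", "off")

# Starting points only; each one is a first attempt, not a guarantee.
STARTING_ROUTES = {
    "medical": Route(("-f", "markdown-with-html") + TABLE_CLUSTER + NO_IMAGES),
    "slides": Route(("-f", "markdown-with-html") + NO_IMAGES),
    "narrative": Route(("-f", "markdown")),
    "table_heavy": Route(("-f", "markdown-with-html")),
    "scanned": Route((), ocr_first=True),
    "mixed": Route(("-f", "markdown-with-html")),
}


def starting_route(doc_class: str) -> Route:
    """Return the first route to try for a document class label."""
    try:
        return STARTING_ROUTES[doc_class]
    except KeyError:
        raise ValueError(f"unknown document class: {doc_class!r}") from None
```

Raising on an unknown class is deliberate: it forces the analysis step to happen instead of silently falling back to one fixed pipeline.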
Pick the output that best matches the document and the user's goal.
markdown-with-html
Use by default when the user wants Markdown and fidelity matters.
Prefer this for tables, medical reports, slides, mixed-layout PDFs, and anything likely to break in pure Markdown.
markdown
Use only when clean plain Markdown matters more than layout fidelity.
html
Use when visual structure matters more than LLM readability.
text
Use for quick linear extraction, narrative documents, or when structure is unimportant.
json
Use when downstream machine processing matters more than human readability.
docx
Use when the user wants editable office output and layout reconstruction matters.
Use OpenDataLoader as the default route.
Preferred defaults:
For Markdown output with fidelity priority:
-f markdown-with-html
For medical PDFs:
add --table-method cluster
For table-heavy PDFs:
add --table-method cluster
For slide decks:
start without --table-method cluster
add it only after a structure check shows meaningful improvement
if a pseudo-table is already collapsed inside one detected row, changing only the Markdown flavor usually will not fix it
if the active engine build recovers the pseudo-table structure, prefer fixing residual boundary artifacts before escalating to hybrid/full mode
For conversions where images are not requested:
add --image-output off
For slide decks, medical reports, and structure-sensitive PDFs: prefer validating both the command success and the actual rendered structure
For reports where exact values matter: validate key sections after conversion instead of trusting the first pass
Default route:
opendataloader-pdf -f markdown-with-html --table-method cluster --image-output off
Then verify:
If a clinical table is flattened, compare against pdftotext -layout before accepting output.
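One way to make the comparison against `pdftotext -layout` concrete is to check that every numeric value in the reference text also appears in the converted output. The sketch below is an assumption about how such a check could look; the function name and the number pattern are illustrative. It only detects values that are missing or altered. It says nothing about whether a value is attached to the right row.

```python
import re

# Integers and decimals with either '.' or ',' as the decimal separator.
NUMBER = re.compile(r"\d+(?:[.,]\d+)?")


def missing_values(reference_text: str, converted_text: str) -> list[str]:
    """Numeric tokens found in the reference but absent from the conversion.

    Tokens are compared whole, so a clipped '3.5' does not satisfy '13.5'.
    """
    converted = set(NUMBER.findall(converted_text))
    seen = set()
    missing = []
    for token in NUMBER.findall(reference_text):
        if token not in converted and token not in seen:
            seen.add(token)
            missing.append(token)
    return missing
```

Because tokens are matched whole, this also surfaces values whose leading characters were clipped at a column boundary.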
Prefer:
opendataloader-pdf -f markdown-with-html --image-output off
Then check for:
If CLI output is still poor, do a cleanup pass tuned for slides instead of assuming the raw extract is final. If the slide contains obvious table-like blocks that are not detected as tables at all, prefer a same-engine retry with a stronger route such as hybrid/full mode before jumping to unrelated extractors. If the slide now produces a real table, validate the first column and header boundaries before assuming the table is fully correct.
If the text layer is poor or absent:
opendataloader-pdf
Prefer conservative reconstruction over aggressive guessing.
Before claiming success, inspect the output for the patterns most likely to break.
For medical PDFs:
For slides:
For table-heavy documents:
For every document class:
Treat these as signals that the current output is not ready:
markdown to markdown-with-html improves wrapping but does not restore missing row boundaries
Do not accept a conversion just because the top of the file looks good.
Always validate:
For medical PDFs, this means checking a real lab table, not just the heading block.
For slide decks, this means checking at least one dense diagram or pseudo-table, not just the title slides.
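To avoid validating only the top of the file, it helps to jump straight to the densest table in the output. The sketch below is one possible helper, with assumed names: it treats a line as table-like when it starts with `|` or contains a `<tr` or `<td` tag, which matches Markdown pipe tables and inline HTML tables but is a heuristic, not a parser.

```python
def is_table_line(line: str) -> bool:
    """Heuristic: Markdown pipe row or a line carrying HTML table tags."""
    stripped = line.strip().lower()
    return stripped.startswith("|") or "<tr" in stripped or "<td" in stripped


def longest_table_block(text: str) -> tuple[int, int]:
    """Return (start_line, length) of the longest run of table-like lines.

    Line numbers are 1-indexed. Returns (0, 0) when no table-like line exists.
    """
    best_start, best_len = 0, 0
    run_start, run_len = 0, 0
    for number, line in enumerate(text.splitlines(), start=1):
        if is_table_line(line):
            if run_len == 0:
                run_start = number
            run_len += 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_len = 0
    return best_start, best_len
```

A result of `(0, 0)` on a document that visibly contains tables is itself a signal that the route did not detect them.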
Conversion is not finished just because a file was generated.
If the output is structurally correct but still noisy or hard to read, perform a cleanup pass before delivering it.
Use three buckets:
cleanup
For noise reduction without changing meaning.
Examples:
Important: do not collapse a table just because it is sparse, narrow, or mostly empty. Preserve legitimate single-column and sparse tables if they still carry table meaning.
structural correction
For repairing attachment and readability when the extractor found the right content but the wrong structure.
Examples:
route retry
For cases where the problem comes from the wrong extraction path, not from output cleanup.
Always prefer the least invasive repair that produces a faithful, readable result.
Do not leave raw noisy output untouched if it is clearly improvable.
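As an illustration of the cleanup bucket, the sketch below removes one kind of noise that cannot change meaning: runs of blank lines. The function name is an assumption, and this is only one example of a cleanup step, not a complete pass. It leaves trailing spaces on content lines alone, because two trailing spaces are a hard line break in Markdown, and it never touches table rows, including sparse or mostly empty ones.

```python
def collapse_blank_runs(text: str) -> str:
    """Collapse consecutive blank lines into one and trim blank edges.

    Whitespace-only lines count as blank. Content lines are kept verbatim.
    """
    cleaned = []
    for line in text.splitlines():
        is_blank = line.strip() == ""
        if is_blank and cleaned and cleaned[-1] == "":
            continue  # skip the second and later blank lines of a run
        cleaned.append("" if is_blank else line)
    return "\n".join(cleaned).strip("\n") + "\n"
```

If the output contains fenced code or preformatted blocks where blank lines are significant, skip this step for those regions.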
Do one targeted retry if the first route is wrong.
Examples:
markdown-with-html
--table-method cluster
Do not keep blindly retrying many variants. Choose the next attempt based on the failure mode.
Prefer this retry order:
For --table-method cluster, treat it as a targeted retry or document-specific default, not a universal default.
It is often the best choice for medical PDFs, but not automatically for every slide deck or every business document.
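Choosing the next attempt from the failure mode can be sketched as a single function that changes exactly one thing. The failure-mode labels `flattened_table` and `lost_structure` are hypothetical names for this example, and the function covers only the two retries named above: switching to `markdown-with-html` and adding `--table-method cluster`.

```python
def targeted_retry(flags: list[str], failure_mode: str) -> list[str]:
    """Return flags for one follow-up attempt, chosen from the failure mode.

    The input list is not modified. If the returned flags equal the input,
    the targeted change was already applied and another route is needed.
    """
    retry = list(flags)
    if failure_mode == "flattened_table":
        # Rows or columns flattened: add the cluster table method once.
        if "--table-method" not in retry:
            retry += ["--table-method", "cluster"]
    elif failure_mode == "lost_structure":
        # Pure Markdown or text lost layout: switch the output format.
        if "-f" in retry and retry.index("-f") + 1 < len(retry):
            retry[retry.index("-f") + 1] = "markdown-with-html"
        else:
            retry = ["-f", "markdown-with-html"] + retry
    else:
        raise ValueError(f"no targeted retry for {failure_mode!r}")
    return retry
```

Returning unchanged flags when the fix is already in place gives the caller a clear stop condition instead of another blind variant.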
When the user does not specify otherwise:
markdown-with-html over pure markdown
--table-method cluster for medical PDFs
--table-method cluster for table-heavy PDFs when rows or columns flatten
Do not assume:
--table-method cluster is the best default for slide decks
markdown-with-html alone fixes fused table rows if the underlying table structure is already wrong
If the work involves changing opendataloader-pdf behavior itself, not just running a conversion:
Wins on one PDF are useful, but they do not justify turning a heuristic into a global default without broader validation.
Do not assume that opendataloader-pdf, OCR tools, or PDF utilities are installed in every environment.
Before finishing, make sure you can state:
opendataloader-pdf route was chosen
Distinguish between:
document fidelity
correct content, correct attachment, correct section structure
visual fidelity
preserving the original visual layout as closely as possible
Optimize first for document fidelity.
Do not sacrifice semantic correctness just to imitate the original page visually.
For most conversions, a structurally correct and readable output is better than a visually similar but semantically broken one.
When reporting back, prefer saying:
Do not deliver raw extractor output without a cleanup and validation pass when fidelity matters.
If the document is complex, say which route was chosen and why.