Local Layout-Aware Document Parser

v20260611

liteparse

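LiteParse is a locally deployed, high-performance document parsing tool. It extracts layout-aware text from PDFs, Office files, and images, and outputs structured JSON with precise coordinates (bounding boxes). It suits building advanced RAG systems, tracing citations back to their source, or low-level data preprocessing for multimodal agents.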

Tags: PDF, parsing, OCR, RAG, local, structured data, text extraction

LiteParse — Local Document Parsing

Overview

LiteParse is a fast, open-source document parser (Rust core, Python/Node bindings) focused on local, layout-aware text extraction with bounding boxes. It does not produce Markdown and does not call cloud LLMs. Outputs are plain text (layout-preserved) or structured JSON with per-page text_items (position, font metadata, optional confidence).

Version note: Examples target liteparse 2.0.0 (PyPI, May 2026). The upstream V1 branch is legacy; this skill documents V2 / main only.

For parser selection vs MarkItDown, the pdf skill, or LlamaParse, see references/choosing_a_parser.md.

When to Use This Skill

Use LiteParse when you need:

Fast local parsing of PDFs or converted Office/image files without cloud dependencies
Spatial text with bounding boxes for layout-aware RAG, citation grounding, or figure/table region logic
OCR on scanned PDFs or images (bundled Tesseract, or a user-run HTTP OCR server)
Page screenshots (PNG) for multimodal agents that must see charts, figures, or handwriting
Batch ingestion of literature folders, supplementary PDFs, or protocol libraries
Page subsets or password-protected PDFs

When Not to Use

Task	Use instead
Markdown for LLM ingestion (EPUB, audio, YouTube, HTML)	`markitdown` skill
Merge/split PDFs, forms, watermarks, rotation	`pdf` skill
Dense tables, handwriting, production cloud pipelines	LlamaParse (cloud; sign up separately)

Installation

uv pip install "liteparse==2.0.0"

This installs the Python bindings and the lit CLI. Verify:

lit --help
python -c "import liteparse; print(liteparse.__version__)"

Optional system tools (for non-PDF inputs):

LibreOffice — Word, Excel, PowerPoint, OpenDocument, CSV/TSV
ImageMagick — PNG, JPEG, TIFF, WebP, SVG, etc.

Install commands are in references/ocr_and_formats.md.
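
Before parsing non-PDF inputs, it can help to confirm which optional tools are visible on PATH. The sketch below is a convenience check, not part of liteparse: `TOOLS` and `find_tools` are hypothetical names, and the executable names (`lit`, `soffice`, `magick`, `convert`) are the ones this document mentions.

```python
import shutil

# Executable names taken from this document: `lit` is the CLI, `soffice` is
# LibreOffice, and ImageMagick is invoked as `magick` or `convert`.
# Note: on Windows, `convert` can resolve to an unrelated system utility.
TOOLS = {
    "lit": ["lit"],
    "LibreOffice": ["soffice"],
    "ImageMagick": ["magick", "convert"],
}


def find_tools(tools=TOOLS):
    """Return {tool name: resolved path or None} by searching PATH."""
    found = {}
    for name, candidates in tools.items():
        # Keep the first candidate executable that shutil.which can resolve.
        found[name] = next(
            (path for path in map(shutil.which, candidates) if path), None
        )
    return found


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name}: {path or 'not found on PATH'}")
```

A `None` entry only means the executable was not found on PATH; the tool may still be installed elsewhere.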

Node.js / TypeScript (optional): npm i @llamaindex/liteparse — see references/api_reference.md.

Quick Start

Python

from liteparse import LiteParse

parser = LiteParse(quiet=True)
result = parser.parse("paper.pdf")
print(result.text)

for page in result.pages:
    print(f"Page {page.page_num}: {len(page.text_items)} items")

CLI

# Layout-preserved text (default)
lit parse paper.pdf

# Structured JSON with bounding boxes
lit parse paper.pdf --format json -o paper.json

# Disable OCR on text-native PDFs (faster)
lit parse paper.pdf --no-ocr

Core Workflows

1. Parse to layout-preserved text

Best for quick full-document text or feeding chunkers that do not need coordinates.

parser = LiteParse(ocr_enabled=True, quiet=True)
result = parser.parse("document.pdf")
full_text = result.text

lit parse document.pdf -o output.txt

2. Parse to structured JSON (bounding boxes)

Use when building layout-aware RAG, highlighting source regions, or joining text with screenshots.

from liteparse import LiteParse

parser = LiteParse(output_format="json", quiet=True)
result = parser.parse("document.pdf")

# Programmatic access: each text item carries its own bounding box
for page in result.pages:
    for item in page.text_items:
        bbox = (item.x, item.y, item.width, item.height)
        # Also available: item.confidence, item.font_name, item.font_size
        print(page.page_num, bbox, item.text)

lit parse document.pdf --format json -o document.json

JSON field layout: references/output_formats.md.
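
As a rough illustration of what the per-item boxes enable, the sketch below converts `(x, y, width, height)` to corner coordinates and groups items into visual lines. `Item` is a hypothetical stand-in whose field names mirror the loop above, and the code assumes a top-left origin with y increasing downward; check references/output_formats.md before relying on that.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """Hypothetical stand-in for a text item (not the liteparse class)."""
    text: str
    x: float
    y: float
    width: float
    height: float


def corners(item):
    """Convert (x, y, width, height) to (x0, y0, x1, y1)."""
    return (item.x, item.y, item.x + item.width, item.y + item.height)


def group_into_lines(items, tolerance=2.0):
    """Group items whose y lies within `tolerance` of the line's first item.

    Assumes y grows downward, so sorting by y gives top-to-bottom order.
    """
    lines = []
    for item in sorted(items, key=lambda i: (i.y, i.x)):
        if lines and abs(item.y - lines[-1][0].y) <= tolerance:
            lines[-1].append(item)
        else:
            lines.append([item])
    # Order each line left to right.
    return [sorted(line, key=lambda i: i.x) for line in lines]


items = [
    Item("world", 60, 100.5, 30, 10),
    Item("Hello", 10, 100, 40, 10),
    Item("Next", 10, 120, 30, 10),
]
for line in group_into_lines(items):
    print(" ".join(i.text for i in line))
```

The tolerance is in the same units as the coordinates; a fixed value is a simplification, and scaling it by font size may work better on mixed layouts.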

3. Parse specific pages

parser = LiteParse(target_pages="1-5,10,15-20", quiet=True)
result = parser.parse("long_paper.pdf")

lit parse long_paper.pdf --target-pages "1-5,10"
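
To sanity-check a page specification before a long run, a small helper can expand it into explicit page numbers. This is an illustrative sketch, not liteparse's own parser: `expand_page_spec` is a hypothetical name, and it assumes 1-based page numbers with inclusive ranges.

```python
def expand_page_spec(spec):
    """Expand a spec such as "1-5,10" into a sorted list of unique pages.

    Assumption: pages are 1-based and ranges are inclusive on both ends.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    if any(page < 1 for page in pages):
        raise ValueError("page numbers are assumed to start at 1")
    return sorted(pages)


print(expand_page_spec("1-5,10"))
```

Counting the expanded list is a quick way to estimate OCR cost before committing to a large range.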

4. Parse from bytes or stdin

Useful for uploads, S3 downloads, or piping remote PDFs.

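parser = LiteParse(quiet=True)  # assumes `from liteparse import LiteParse`, as in Quick Start
with open("document.pdf", "rb") as f:
    result = parser.parse(f.read())  # raw PDF bytes instead of a file path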

curl -sL https://example.com/report.pdf | lit parse -
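
Because byte input has to be a valid PDF (see Troubleshooting), it may be worth sniffing the payload before handing it to the parser. The sketch below checks well-known file signatures only. `looks_like_pdf` and `sniff_format` are hypothetical helpers, and liteparse's own validation may differ.

```python
def looks_like_pdf(data: bytes) -> bool:
    """True if the `%PDF-` header marker appears in the first 1024 bytes.

    Many readers tolerate leading junk before the header, so this searches
    a short prefix instead of requiring the marker at offset 0.
    """
    return b"%PDF-" in data[:1024]


def sniff_format(data: bytes) -> str:
    """Classify a payload by signature: pdf, zip-container, png, jpeg, unknown."""
    if looks_like_pdf(data):
        return "pdf"
    if data.startswith(b"PK\x03\x04"):
        # .docx, .xlsx, .pptx and .odt are ZIP archives; these need a file
        # path plus conversion rather than raw bytes.
        return "zip-container"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    return "unknown"


print(sniff_format(b"%PDF-1.7\n..."))
```

A signature match does not prove the file is well formed; it only catches the common case of passing Office or image bytes by mistake.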

5. Page screenshots for multimodal agents

Screenshots capture visual content that text extraction alone misses (figures, complex tables, handwriting).

from pathlib import Path

parser = LiteParse(dpi=150, quiet=True)
shots = parser.screenshot("document.pdf", page_numbers=[1, 2, 3])
out = Path("screenshots")
out.mkdir(exist_ok=True)
for s in shots:
    (out / f"page_{s.page_num}.png").write_bytes(s.image_bytes)

lit screenshot document.pdf --target-pages "1,3,5" -o ./screenshots
lit screenshot document.pdf --dpi 300 -o ./screenshots

Combine JSON parse + screenshots when an agent needs both coordinates and pixels for the same pages.
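
Joining coordinates with pixels usually means rescaling boxes to the screenshot resolution. The sketch below assumes that text-item coordinates are in PDF points (72 per inch) with the same origin as the rendered image, and that the screenshot was rendered at the given `dpi`. None of this is confirmed here, so verify against references/output_formats.md. `bbox_to_pixels` is a hypothetical helper.

```python
def bbox_to_pixels(bbox, dpi=150, points_per_inch=72.0):
    """Scale an (x, y, width, height) box from points to pixels at `dpi`.

    Assumption: input units are PDF points and the image shares the origin.
    """
    scale = dpi / points_per_inch
    x, y, width, height = bbox
    return tuple(round(value * scale) for value in (x, y, width, height))


# A 2 x 0.5 inch box placed one inch from the top-left corner.
print(bbox_to_pixels((72, 72, 144, 36), dpi=150))
```

If overlays look shifted or mirrored vertically, the origin assumption is the first thing to check.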

6. Batch-parse a directory

For large corpora, prefer the CLI (parallel OCR workers) or the bundled script.

lit batch-parse ./papers ./parsed --format json --recursive
lit batch-parse ./papers ./parsed --extension .pdf --no-ocr

python scripts/batch_parse_dir.py ./papers ./parsed --format json --recursive

See scripts/batch_parse_dir.py for a Python batch wrapper without network calls.
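
For custom pipelines, the directory walk itself can be planned in plain Python before any parsing happens. The sketch below only computes (source, destination) pairs that mirror the input tree. It is not the bundled scripts/batch_parse_dir.py, and `plan_batch` and its parameters are hypothetical.

```python
from pathlib import Path


def plan_batch(input_dir, output_dir, extension=".pdf", recursive=True,
               out_suffix=".json"):
    """Return sorted (source, destination) pairs mirroring the input tree.

    Matching is by file-name pattern and is case-sensitive on most
    POSIX systems, so `.PDF` files are skipped unless handled separately.
    """
    src_root, dst_root = Path(input_dir), Path(output_dir)
    pattern = f"*{extension}"
    sources = src_root.rglob(pattern) if recursive else src_root.glob(pattern)
    jobs = []
    for src in sorted(sources):
        if src.is_file():
            relative = src.relative_to(src_root)
            jobs.append((src, (dst_root / relative).with_suffix(out_suffix)))
    return jobs


if __name__ == "__main__":
    for src, dst in plan_batch("./papers", "./parsed"):
        print(f"{src} -> {dst}")
```

Each pair can then be handed to whichever parse call fits the job; creating `dst.parent` before writing is left to the caller.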

7. OCR configuration

OCR is on by default. Tesseract is bundled; no extra install for basic English OCR.

parser = LiteParse(
    ocr_enabled=True,
    ocr_language="eng",       # Tesseract codes: fra, deu, etc.
    num_workers=4,            # parallel OCR (default: CPU cores - 1)
    dpi=150,                  # higher DPI → better OCR, slower
)

lit parse scan.pdf --ocr-language fra
lit parse scan.pdf --no-ocr
lit parse scan.pdf --ocr-server-url http://localhost:8080/ocr

Offline / air-gapped: set TESSDATA_PREFIX to a directory of .traineddata files, or pass --tessdata-path. Details: references/ocr_and_formats.md.
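
One way to decide whether a page deserves a higher-DPI rerun is to look at the share of low-confidence items. The sketch below assumes `confidence` is a float between 0 and 1 and is `None` for text that did not come from OCR. Both are assumptions, and the 0.6 threshold is arbitrary. The items here are `SimpleNamespace` stand-ins, not liteparse objects.

```python
from types import SimpleNamespace


def split_by_confidence(items, threshold=0.6):
    """Partition items into (trusted, review) lists.

    Items with no confidence value are treated as trusted (assumption:
    a missing value means the text was read directly, not via OCR).
    """
    trusted, review = [], []
    for item in items:
        confidence = getattr(item, "confidence", None)
        if confidence is None or confidence >= threshold:
            trusted.append(item)
        else:
            review.append(item)
    return trusted, review


def low_confidence_ratio(items, threshold=0.6):
    """Fraction of items below the threshold; 0.0 for an empty list."""
    items = list(items)
    if not items:
        return 0.0
    _, review = split_by_confidence(items, threshold)
    return len(review) / len(items)


page_items = [
    SimpleNamespace(text="Abstract", confidence=0.95),
    SimpleNamespace(text="rn0dern", confidence=0.30),
    SimpleNamespace(text="Introduction", confidence=None),
]
print(low_confidence_ratio(page_items))
```

A page with a high ratio is a candidate for a rerun at higher `dpi` or with a different `ocr_language`.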

8. Encrypted PDFs

parser = LiteParse(password="secret", quiet=True)
result = parser.parse("protected.pdf")

lit parse protected.pdf --password secret

9. Search text items by phrase

Merge adjacent items and return combined bounding boxes for a phrase (e.g. section titles).

from liteparse import LiteParse, search_items

result = LiteParse(quiet=True).parse("document.pdf")
page = result.get_page(1)
matches = search_items(page.text_items, "Materials and Methods", case_sensitive=False)
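
To make "combined bounding boxes" concrete, the sketch below computes the smallest box covering several `(x, y, width, height)` boxes. It illustrates the geometry only and is not how `search_items` is implemented; `union_bbox` is a hypothetical helper.

```python
def union_bbox(boxes):
    """Smallest (x, y, width, height) box that covers every input box."""
    boxes = list(boxes)
    if not boxes:
        raise ValueError("need at least one box")
    x0 = min(x for x, _, _, _ in boxes)
    y0 = min(y for _, y, _, _ in boxes)
    x1 = max(x + w for x, _, w, _ in boxes)
    y1 = max(y + h for _, y, _, h in boxes)
    return (x0, y0, x1 - x0, y1 - y0)


# Two adjacent words on roughly the same baseline.
print(union_bbox([(10, 100, 50, 12), (65, 101, 40, 12)]))
```

The result is independent of coordinate orientation, as long as all boxes use the same convention.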

Multi-Format Inputs

Category	Extensions (examples)	Requirement
PDF	`.pdf`	Native
Office	`.docx`, `.xlsx`, `.pptx`, `.doc`, `.odt`, …	LibreOffice
Images	`.png`, `.jpg`, `.tiff`, `.webp`, `.svg`, …	ImageMagick

Files are converted to PDF internally, then parsed. If conversion tools are missing, parsing fails with an actionable error — install the dependency and retry.
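
The table above can be mirrored in a small lookup that reports which converter a file would need, which is handy for pre-flight checks on mixed folders. The extension sets below contain only the examples listed in the table and are deliberately incomplete; `required_converter` is a hypothetical helper, not a liteparse function.

```python
from pathlib import Path

# Only the example extensions from the table; the real lists are longer.
OFFICE_EXTENSIONS = {".docx", ".xlsx", ".pptx", ".doc", ".odt"}
IMAGE_EXTENSIONS = {".png", ".jpg", ".tiff", ".webp", ".svg"}


def required_converter(path):
    """Return None for PDFs, else the name of the converter the file needs.

    Raises ValueError for extensions this sketch does not know about.
    """
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        return None
    if suffix in OFFICE_EXTENSIONS:
        return "LibreOffice"
    if suffix in IMAGE_EXTENSIONS:
        return "ImageMagick"
    raise ValueError(f"extension not covered by this sketch: {suffix!r}")


for name in ["paper.pdf", "slides.PPTX", "scan.tiff"]:
    print(name, "->", required_converter(name) or "native")
```

An unknown extension raises rather than guessing, since the supported set is broader than the examples shown.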

Performance Tips

--no-ocr on born-digital PDFs — largest speedup
target_pages — parse only methods/supplement sections
num_workers — scale OCR across CPU cores
max_pages — cap very large files (default 1000)
lit batch-parse — directory-scale jobs with --recursive and --extension
Lower dpi (e.g. 100) when OCR quality is already sufficient

Reference Files

File	Read when
`references/choosing_a_parser.md`	Unsure whether to use LiteParse, MarkItDown, pdf, or LlamaParse
`references/api_reference.md`	Python/TypeScript API, types, `search_items`
`references/cli_reference.md`	Full `lit` command flags
`references/output_formats.md`	JSON schema, bboxes, confidence scores
`references/ocr_and_formats.md`	Tesseract, HTTP OCR, LibreOffice, ImageMagick

Troubleshooting

Issue	Fix
Office file fails	Install LibreOffice; ensure `soffice` is on PATH (Windows: add LibreOffice `program` dir)
Image fails	Install ImageMagick; verify `convert` or `magick` works
OCR poor quality	Increase `--dpi`; set the matching `--ocr-language`; or send pages to an HTTP OCR server with `--ocr-server-url`
OCR slow	`--no-ocr` if not needed; reduce pages; increase `num_workers`
Air-gapped OCR	`export TESSDATA_PREFIX=/path/to/tessdata` or `--tessdata-path`
`ParseError` on bytes	Ensure input is valid PDF bytes (Office bytes need a file path + conversion)

Resources

GitHub: https://github.com/run-llama/liteparse
Docs: https://developers.llamaindex.ai/liteparse/
PyPI: https://pypi.org/project/liteparse/2.0.0/
npm: https://www.npmjs.com/package/@llamaindex/liteparse
OCR API spec: https://github.com/run-llama/liteparse/blob/main/OCR_API_SPEC.md

Info

Category Data Science

Name liteparse

Version v20260611

Size 13.94 KB

Source K-Dense-AI/scientific-agent-skills

Updated 2026-06-13