A comprehensive, layout-aware PDF parser for extracting text, images, and tables from digitally-born PDF documents. Supports multiple extraction libraries for quality comparison and optimal results.
- Metadata Extraction: Extracts complete PDF metadata (title, author, dates, page info, etc.)
- Layout-Aware Text Extraction: Preserves document structure, font information, and reading order
- Advanced Column Detection:
  - Intelligently detects multi-column layouts (research papers, newspapers, magazines)
  - Fast mode: 50-100x faster than detailed analysis
  - Detailed mode: Handles text with different background colors and images
  - Respects column boundaries and proper reading order
  - Configurable header/footer margins for improved accuracy
- Column Visualization: Debug tool to visualize detected column boundaries
- Formula Detection & LaTeX Conversion:
  - Detects mathematical formulas using heuristic analysis
  - Strict Mode: Reduces false positives by filtering non-formula text
  - External OCR Support: Interface for high-quality LaTeX conversion (e.g., Mathpix)
- Image Extraction: Extracts embedded images with position and metadata
- Table Extraction: Advanced table detection and extraction
- Multiple Extraction Methods: Compare different libraries for optimal results
- Token-Efficient Export:
  - TOON format (default): 10-60% fewer tokens vs JSON - ideal for LLM input
  - JSON format: Standard JSON export when needed
  - Built-in token comparison to measure savings
- Modular Architecture: Clean, extensible design with separate modules for text, image, table, and formula extraction
TOON (Token-Oriented Object Notation) is a space-efficient data format designed specifically for LLM input. It reduces token counts by 10-60% compared to JSON while remaining human-readable and structurally similar. TOON Github: https://github.com/toon-format/toon
Key Benefits:
- Token Efficiency: 30-60% fewer tokens than JSON for typical PDF data
- Cost Savings: Reduced API costs when sending data to LLMs
- Better Context Usage: More content fits in the same context window
- Easy to Use: Drop-in replacement for JSON export via the `format="toon"` parameter
When to use TOON vs JSON:
- Use TOON (default) when preparing data for LLM input, API calls, or when token efficiency matters
- Use JSON when you need standard JSON for interoperability with existing tools or APIs that require JSON format
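For instance, you can switch between the two at export time with the parser's export() API (described in detail later in this README); a minimal sketch:

```python
from metadata_document_parser import PDFMetadataParser

parser = PDFMetadataParser("document.pdf")
result = parser.parse()

# TOON (default): token-efficient output intended for LLM prompts
llm_payload = parser.export(result)

# JSON: standard output for tools and APIs that require JSON
json_payload = parser.export(result, format="json")
```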
- PyMuPDF (fitz): Fast, comprehensive extraction with layout awareness
- pdfplumber: Excellent text extraction with precise positioning
- Camelot: Advanced table extraction with lattice and stream modes
- Tabula: Java-based table extraction
- PyMuPDF (fitz): Complete image extraction with metadata
- Python 3.8+
- Java Runtime Environment (required for Tabula)

  ```bash
  # Ubuntu/Debian
  sudo apt-get install default-jre

  # macOS
  brew install openjdk

  # Windows
  # Download from https://www.java.com/
  ```

- Ghostscript (required for Camelot)

  ```bash
  # Ubuntu/Debian
  sudo apt-get install ghostscript

  # macOS
  brew install ghostscript

  # Windows
  # Download from https://www.ghostscript.com/
  ```
It's recommended to use a virtual environment to isolate dependencies:

Using venv (built-in):

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

Using uv (creates a venv automatically):

```bash
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
```

You can install dependencies using either classic pip or the faster uv package manager:

```bash
pip install -r requirements.txt
```

uv is a fast Python package installer written in Rust. It's 10-100x faster than pip.

```bash
# Install uv first (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install dependencies with uv
uv pip install -r requirements.txt
```

Or on Windows (PowerShell):

```powershell
# Install uv
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"

# Install dependencies
uv pip install -r requirements.txt
```

Benefits of using uv:
- 10-100x faster installation than pip
- Drop-in replacement for pip commands (`uv pip ...`); see the comparison table below
```python
from metadata_document_parser import PDFMetadataParser

# Initialize parser
parser = PDFMetadataParser("document.pdf")

# Parse with all features
result = parser.parse(
    extract_text=True,
    extract_images=True,
    extract_tables=True,
    extract_formulas=True,  # NEW: Extract mathematical formulas
    layout_aware=True,
    column_aware=True       # NEW: Fix reading order for multi-column layouts
)

# Access results
print(f"Title: {result.metadata.title}")
print(f"Pages: {result.metadata.num_pages}")
print(f"Text blocks: {len(result.text_blocks)}")
print(f"Images: {len(result.images)}")
print(f"Tables: {len(result.tables)}")
```

```bash
# Run all examples
python example_usage.py your_document.pdf
# Test multi-column extraction specifically
python example_multi_column.py research_paper.pdf
# Debug and visualize column detection
python test_column_detection.py research_paper.pdf --header-margin 50 --footer-margin 50
# Test TOON export with token comparison
python example_toon_export.py research_paper.pdf
```

```python
parser = PDFMetadataParser("document.pdf")
result = parser.parse(
    extract_text=True,
    text_method="pymupdf",  # or "pdfplumber"
    layout_aware=True
)

# Access text blocks with position and formatting
for block in result.text_blocks:
    print(f"Page {block.page_num}: {block.block_type}")
    print(f"Position: {block.bbox}")
    print(f"Font: {block.font_name} (Size: {block.font_size})")
    print(f"Text: {block.text}\n")
```

```python
parser = PDFMetadataParser("document.pdf")
# Using Camelot (best for bordered tables)
result = parser.parse(
    extract_tables=True,
    table_method="camelot"
)

# Using Tabula (good for borderless tables)
result = parser.parse(
    extract_tables=True,
    table_method="tabula"
)

# Access table data
for table in result.tables:
    print(f"Table {table.table_index} on page {table.page_num}")
    print(f"Rows: {len(table.data)}")
    print(table.data)  # List of lists
```

```python
parser = PDFMetadataParser("document.pdf")
result = parser.parse(extract_images=True)

# Save images to disk
saved_paths = parser.save_images(result, "output_images/")
print(f"Saved {len(saved_paths)} images")
```

```python
parser = PDFMetadataParser("document.pdf")
comparison = parser.compare_extraction_methods()

# Results include performance metrics for each method
print(comparison)
```

TOON format (default) - 10-60% fewer tokens vs JSON, ideal for LLM input:
```python
parser = PDFMetadataParser("document.pdf")
result = parser.parse()

# Export to TOON format (default - token-efficient for LLMs)
toon_output = parser.export(result)  # format="toon" is default
print(toon_output)

# Save to file
with open("parsed_document.toon", "w") as f:
    f.write(toon_output)
```

JSON format (explicit) - use when you need standard JSON:
```python
# Export to JSON
json_output = parser.export(result, format="json")

# Or use dedicated method
json_output = parser.export_to_json(result, indent=2)

# Save to file
with open("parsed_document.json", "w") as f:
    f.write(json_output)
```

Compare token counts between formats:
```python
comparison = parser.compare_export_formats(result)
print(f"JSON tokens: {comparison['json_tokens']:,}")
print(f"TOON tokens: {comparison['toon_comma_tokens']:,}")
print(f"Savings: {comparison['toon_comma_savings_percent']}%")
```

```python
# Basic usage with default margins
parser = PDFMetadataParser("document.pdf")

# Enable column-aware reading order
result = parser.parse(
    extract_text=True,
    layout_aware=True,
    column_aware=True  # Automatically detects and fixes column order
)

# Check detected layout
print(f"Detected layout: {result.column_layout}")  # 'single', 'double', or 'multi'

# Text blocks are now in correct reading order (left column, then right column)
for block in result.text_blocks:
    print(block.text)
```

Advanced: Custom Header/Footer Margins
For PDFs with large headers or footers, adjust the margins to improve column detection:
```python
# Initialize with custom margins (in points, 72 points = 1 inch)
parser = PDFMetadataParser(
    "document.pdf",
    header_margin=100,  # Ignore top 100 points (large header)
    footer_margin=80    # Ignore bottom 80 points (large footer)
)

result = parser.parse(
    extract_text=True,
    layout_aware=True,
    column_aware=True
)
```

Visualizing Column Detection
Debug and verify column detection by creating an annotated PDF:
```python
parser = PDFMetadataParser("document.pdf")

# Creates a PDF with red borders around detected columns
output_path = parser.visualize_columns()
print(f"Annotated PDF saved to: {output_path}")
```

Basic Usage (Heuristic):
```python
parser = PDFMetadataParser("document.pdf")

# Extract formulas with heuristic detection (no GPU/DL required)
# Use strict_mode=True to reduce false positives
result = parser.parse(
    extract_text=True,
    extract_formulas=True,
    strict_mode=True
)
```

Advanced Usage (External OCR): For production-quality LaTeX, you can use an external OCR service like Mathpix.
```python
from metadata_document_parser.extractors.ocr import MathpixOCR

# Initialize OCR strategy
ocr = MathpixOCR(app_id="YOUR_APP_ID", app_key="YOUR_APP_KEY")

result = parser.parse(
    extract_text=True,
    extract_formulas=True,
    strict_mode=True,
    ocr_strategy=ocr
)

# Access detected formulas
for formula in result.formulas:
    print(f"Formula: {formula.formula_text}")
    print(f"LaTeX: {formula.latex}")  # High-quality LaTeX from OCR
    print(f"Confidence: {formula.confidence:.2f}")
```

```python
parser = PDFMetadataParser("document.pdf")
result = parser.parse(
    layout_aware=True,
    column_aware=False  # Disable column detection
)

# Sort blocks by page and vertical position (simple top-to-bottom)
sorted_blocks = sorted(
    result.text_blocks,
    key=lambda b: (b.page_num, b.bbox[1])
)

# Print in reading order
for block in sorted_blocks:
    if block.block_type == "title":
        print(f"\nTITLE: {block.text}\n")
    elif block.block_type == "heading":
        print(f"\nHEADING: {block.text}\n")
    else:
        print(block.text)
```
print(block.text)__init__(pdf_path: str, footer_margin: int = 50, header_margin: int = 50, fast_column_detection: bool = True)
Initialize the parser with a PDF file path.
Parameters:
- `pdf_path` (str): Path to the PDF file
- `footer_margin` (int): Height in points of the bottom stripe to ignore for column detection (default: 50)
- `header_margin` (int): Height in points of the top stripe to ignore for column detection (default: 50)
- `fast_column_detection` (bool): Use the fast column detection algorithm, 50-100x faster than detailed mode (default: True)
`parse(extract_text=True, extract_images=True, extract_tables=True, extract_formulas=False, text_method="pymupdf", table_method="camelot", layout_aware=True, column_aware=True, strict_mode=False, ocr_strategy=None) -> ParsedDocument`
Parse the PDF document.
Parameters:
- `extract_text` (bool): Extract text content
- `extract_images` (bool): Extract images
- `extract_tables` (bool): Extract tables
- `extract_formulas` (bool): Extract and detect mathematical formulas (NEW)
- `text_method` (str): Method for text extraction ("pymupdf" or "pdfplumber")
- `table_method` (str): Method for table extraction ("camelot" or "tabula")
- `layout_aware` (bool): Preserve layout information
- `column_aware` (bool): Detect columns and fix reading order (NEW)
- `strict_mode` (bool): Enable stricter formula detection to reduce false positives (default: False)
- `ocr_strategy` (ExternalOCR): Optional OCR strategy for high-quality LaTeX conversion (e.g., MathpixOCR)
Returns: ParsedDocument object containing all extracted data
Compare different extraction methods and return performance metrics.
Export parsed document to dictionary format.
`export(parsed_doc: ParsedDocument, format: str = "toon", delimiter: str = ",", indent: int = 2) -> str`
Export parsed document to specified format (default: TOON).
Parameters:
- `parsed_doc` (ParsedDocument): Parsed document to export
- `format` (str): Output format - "toon" (default, 30-60% fewer tokens) or "json"
- `delimiter` (str): TOON delimiter - ',' (comma), '\t' (tab), or '|' (pipe). Tab often provides the best token efficiency
- `indent` (int): JSON indentation (only used if format="json")
Returns: Formatted string in requested format
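For example, based on the parameters above, you can request the tab delimiter (often the most token-efficient TOON variant) or fall back to indented JSON:

```python
from metadata_document_parser import PDFMetadataParser

parser = PDFMetadataParser("document.pdf")
result = parser.parse()

# TOON with a tab delimiter - frequently the smallest token count
toon_tab = parser.export(result, format="toon", delimiter="\t")

# Standard JSON with 2-space indentation
json_output = parser.export(result, format="json", indent=2)
```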
Export parsed document to JSON format.
Parameters:
- `parsed_doc` (ParsedDocument): Parsed document to export
- `indent` (int): JSON indentation (default: 2)
Returns: JSON string representation
Export parsed document to TOON format for token-efficient LLM input.
Parameters:
- `parsed_doc` (ParsedDocument): Parsed document to export
- `delimiter` (str): Array delimiter - ',' (comma), '\t' (tab), or '|' (pipe)
Returns: TOON formatted string
Note: Requires toon-format package. Install with: pip install toon-format
Compare token counts between JSON and TOON export formats.
Parameters:
- `parsed_doc` (ParsedDocument): Parsed document to compare
Returns: Dictionary containing:
- `json_tokens`: Token count for JSON format
- `json_size_bytes`: Size in bytes for JSON
- `toon_comma_tokens`: Token count for TOON with comma delimiter
- `toon_comma_size_bytes`: Size in bytes for TOON with comma delimiter
- `toon_comma_savings_percent`: Percentage savings vs JSON
- `toon_tab_tokens`: Token count for TOON with tab delimiter
- `toon_tab_size_bytes`: Size in bytes for TOON with tab delimiter
- `toon_tab_savings_percent`: Percentage savings vs JSON
- `best_format`: Best performing TOON format
- `best_savings_percent`: Best savings percentage achieved
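For instance, a small sketch that uses the `best_format` and `best_savings_percent` keys to report the most efficient variant:

```python
from metadata_document_parser import PDFMetadataParser

parser = PDFMetadataParser("document.pdf")
result = parser.parse()

comparison = parser.compare_export_formats(result)
print(f"Best TOON variant: {comparison['best_format']}")
print(f"Savings vs JSON: {comparison['best_savings_percent']}%")
print(f"Tab-delimited TOON tokens: {comparison['toon_tab_tokens']:,}")
```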
Save extracted images to disk.
Create a visual representation of detected columns by drawing red borders around detected column bboxes and numbering them. Useful for debugging and understanding column detection.
Parameters:
- `output_path` (str, optional): Output path for the annotated PDF. If not provided, uses `<original_name>-columns.pdf`
Returns: Path to the annotated PDF file
Contains PDF metadata:
- `title`, `author`, `subject`, `creator`, `producer`
- `creation_date`, `modification_date`
- `num_pages`, `file_size`, `page_sizes`
Represents a text block with layout info:
- `text`: The text content
- `bbox`: Bounding box (x0, y0, x1, y1)
- `page_num`: Page number
- `font_size`, `font_name`: Font information
- `block_type`: "text", "title", "heading", "header", "footer"
Represents an extracted image:
- `image_index`, `page_num`: Position info
- `bbox`: Bounding box
- `width`, `height`: Dimensions
- `colorspace`: Color space
- `image_bytes`: Raw image data
- `ext`: File extension
Represents an extracted table:
- `table_index`, `page_num`: Position info
- `bbox`: Bounding box (if available)
- `data`: Table data as list of lists
- `extraction_method`: Method used
Represents a detected mathematical formula:
- `formula_index`: Index of the formula
- `page_num`: Page number
- `bbox`: Bounding box
- `formula_text`: Original text representation
- `latex`: LaTeX conversion (heuristic-based)
- `confidence`: Detection confidence score (0.0-1.0)
- `image_bytes`: Formula as image (optional)
Complete parsed document data:
- `metadata`: DocumentMetadata
- `text_blocks`: List of TextBlock
- `images`: List of ImageData
- `tables`: List of TableData
- `formulas`: List of FormulaData (NEW)
- `extraction_method`: Method used
- `parsing_time`: Time taken to parse
- `column_layout`: Detected layout ('single', 'double', 'multi') (NEW)
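As a quick illustration of how these pieces fit together, here is a small sketch that summarizes a ParsedDocument per page using only the attributes listed above (it assumes `page_num` is 0-based; adjust if the library numbers pages from 1):

```python
from collections import Counter

from metadata_document_parser import PDFMetadataParser

parser = PDFMetadataParser("document.pdf")
result = parser.parse()

# Tally extracted items per page via their page_num attributes
blocks_per_page = Counter(block.page_num for block in result.text_blocks)
images_per_page = Counter(image.page_num for image in result.images)
tables_per_page = Counter(table.page_num for table in result.tables)

print(f"Layout: {result.column_layout}, parsed in {result.parsing_time:.2f}s")
for page in range(result.metadata.num_pages):
    print(f"Page {page}: {blocks_per_page.get(page, 0)} text blocks, "
          f"{images_per_page.get(page, 0)} images, {tables_per_page.get(page, 0)} tables")
```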
| Library | Speed | Layout Awareness | Font Info | Best For |
|---|---|---|---|---|
| PyMuPDF | ⚡⚡⚡ Fast | ✅ Excellent | ✅ Yes | All-purpose, comprehensive extraction |
| pdfplumber | ⚡⚡ Moderate | ✅ Excellent | ❌ Limited | Precise text positioning |
Recommendation: Use PyMuPDF for most cases. Use pdfplumber when you need very precise character-level positioning.
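If you are unsure which text backend suits a particular document, one option (similar in spirit to `compare_extraction_methods()`) is to run both and compare block counts and timing, as in this sketch:

```python
from metadata_document_parser import PDFMetadataParser

parser = PDFMetadataParser("document.pdf")

# Run both text backends and compare block counts and parsing time
for method in ("pymupdf", "pdfplumber"):
    result = parser.parse(
        extract_text=True,
        extract_images=False,
        extract_tables=False,
        text_method=method,
    )
    print(f"{method}: {len(result.text_blocks)} blocks in {result.parsing_time:.2f}s")
```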
| Library | Speed | Bordered Tables | Borderless Tables | Best For |
|---|---|---|---|---|
| Camelot | ⚡⚡ Moderate | ✅ Excellent | ⚡ Good (stream mode) | Complex, well-structured tables |
| Tabula | ⚡⚡⚡ Fast | ✅ Good | ✅ Better | Simple tables, quick extraction |
Recommendation: Use Camelot with lattice mode for bordered tables. Use Camelot stream mode or Tabula for borderless tables.
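The parser itself calls Camelot's lattice flavor by default (see Troubleshooting below). To experiment with stream mode without modifying the parser, you can call Camelot directly; a minimal sketch using Camelot's own `read_pdf` API:

```python
import camelot

# Lattice flavor relies on ruling lines - best for bordered tables
lattice_tables = camelot.read_pdf("document.pdf", pages="1-3", flavor="lattice")

# Stream flavor infers columns from whitespace - better for borderless tables
stream_tables = camelot.read_pdf("document.pdf", pages="1-3", flavor="stream")

print(f"lattice: {lattice_tables.n} tables, stream: {stream_tables.n} tables")
if lattice_tables.n:
    print(lattice_tables[0].df)  # each detected table exposes a pandas DataFrame
```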
- Fast column detection (enabled by default): Optimized algorithm that's 50-100x faster than detailed mode

  ```python
  # Fast mode (default)
  parser = PDFMetadataParser("paper.pdf", fast_column_detection=True)

  # Detailed mode - only needed for PDFs with text on images or colored backgrounds
  parser = PDFMetadataParser("paper.pdf", fast_column_detection=False)
  ```

- Text only: Disable image and table extraction for faster processing

  ```python
  result = parser.parse(extract_text=True, extract_images=False, extract_tables=False)
  ```

- Layout-aware off: Set `layout_aware=False` for simple text extraction (fastest)

  ```python
  result = parser.parse(layout_aware=False, column_aware=False)
  ```

- Choose the right method: PyMuPDF is generally fastest for text
- Batch processing: Process multiple PDFs in parallel using multiprocessing (see the sketch after this list)
- Use uv for faster installs: Install dependencies with `uv pip install` for 10-100x faster installation
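Here is a minimal sketch of the batch-processing tip, assuming each worker process builds its own PDFMetadataParser (the file list and the summary fields are illustrative):

```python
from multiprocessing import Pool

from metadata_document_parser import PDFMetadataParser

def parse_one(pdf_path: str) -> dict:
    # Each worker opens and parses a single PDF
    parser = PDFMetadataParser(pdf_path)
    result = parser.parse(extract_text=True, extract_images=False, extract_tables=False)
    return {
        "path": pdf_path,
        "pages": result.metadata.num_pages,
        "text_blocks": len(result.text_blocks),
    }

if __name__ == "__main__":
    pdf_paths = ["paper1.pdf", "paper2.pdf", "paper3.pdf"]  # illustrative file list
    with Pool(processes=4) as pool:
        for summary in pool.map(parse_one, pdf_paths):
            print(summary)
```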
Here's a quick comparison of common commands between pip and uv:
| Task | pip | uv |
|---|---|---|
| Install from requirements.txt | `pip install -r requirements.txt` | `uv pip install -r requirements.txt` |
| Install single package | `pip install package-name` | `uv pip install package-name` |
| Install with version | `pip install package==1.0.0` | `uv pip install package==1.0.0` |
| Upgrade package | `pip install --upgrade package` | `uv pip install --upgrade package` |
| Uninstall package | `pip uninstall package` | `uv pip uninstall package` |
| List installed packages | `pip list` | `uv pip list` |
| Freeze requirements | `pip freeze > requirements.txt` | `uv pip freeze > requirements.txt` |
| Create virtual env | `python -m venv venv` | `uv venv` |
Note: uv commands work exactly like pip commands but are significantly faster. You can simply replace pip with uv pip in most cases.
- Scanned PDFs: This parser is designed for digitally-born PDFs. For scanned PDFs, you'll need OCR (e.g., Tesseract)
- Complex layouts: Very complex multi-column layouts may require manual tuning
- Encrypted PDFs: Password-protected PDFs need to be decrypted first
- Large files: Very large PDFs (100+ MB) may require significant memory
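If you do need to handle scanned PDFs, a common approach outside this parser is to rasterize each page and run Tesseract on the images; a rough sketch assuming the third-party pdf2image and pytesseract packages (not part of this project's requirements):

```python
from pdf2image import convert_from_path  # needs the poppler utilities installed
import pytesseract                       # needs the Tesseract binary installed

# Rasterize each page, then OCR it to plain text
pages = convert_from_path("scanned.pdf", dpi=300)
for page_number, page_image in enumerate(pages, start=1):
    text = pytesseract.image_to_string(page_image)
    print(f"--- Page {page_number} ---\n{text}")
```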
Try both flavors:

```python
# For tables with borders
result = parser.parse(table_method="camelot")  # Uses 'lattice' by default

# For tables without borders, modify pdf_parser.py line 350:
# flavor='stream' instead of flavor='lattice'
```

Ensure Java is installed and in your PATH:

```bash
java -version
```

Ensure Ghostscript is installed:

```bash
gs --version
```

Contributions are welcome! Please feel free to submit pull requests or open issues.
MIT License
This parser leverages several excellent open-source libraries, including PyMuPDF, pdfplumber, Camelot, and Tabula.
For issues, questions, or contributions, please open an issue on GitHub.