PDF Metadata Document Parser

A comprehensive, layout-aware PDF parser for extracting text, images, and tables from digitally-born PDF documents. Supports multiple extraction libraries for quality comparison and optimal results.

Features

Metadata Extraction: Extracts complete PDF metadata (title, author, dates, page info, etc.)
Layout-Aware Text Extraction: Preserves document structure, font information, and reading order
Advanced Column Detection:
- Intelligently detects multi-column layouts (research papers, newspapers, magazines)
- Fast mode: 50-100x faster than detailed analysis
- Detailed mode: Handles text with different background colors and images
- Respects column boundaries and proper reading order
- Configurable header/footer margins for improved accuracy
Column Visualization: Debug tool to visualize detected column boundaries
Formula Detection & LaTeX Conversion:
- Detects mathematical formulas using heuristic analysis
- Strict Mode: Reduces false positives by filtering non-formula text
- External OCR Support: Interface for high-quality LaTeX conversion (e.g., Mathpix)
Image Extraction: Extracts embedded images with position and metadata
Table Extraction: Advanced table detection and extraction
Multiple Extraction Methods: Compare different libraries for optimal results
Token-Efficient Export:
- TOON format (default): 10-60% fewer tokens vs JSON - ideal for LLM input
- JSON format: Standard JSON export when needed
- Built-in token comparison to measure savings
Modular Architecture: Clean, extensible design with separate modules for text, image, table, and formula extraction

What is TOON?

TOON (Token-Oriented Object Notation) is a space-efficient data format designed specifically for LLM input. It reduces token counts by 10-60% compared to JSON while remaining human-readable and structurally similar. TOON Github: https://github.com/toon-format/toon

Key Benefits:

Token Efficiency: 30-60% fewer tokens than JSON for typical PDF data
Cost Savings: Reduced API costs when sending data to LLMs
Better Context Usage: More content fits in the same context window
Easy to Use: Drop-in replacement for JSON export with format="toon" parameter

When to use TOON vs JSON:

Use TOON (default) when preparing data for LLM input, API calls, or when token efficiency matters
Use JSON when you need standard JSON for interoperability with existing tools or APIs that require JSON format
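
To make the difference concrete, here is a minimal sketch of the tabular layout TOON uses for uniform arrays of objects, where field names are written once in a header instead of being repeated per row. The helper `to_toon_table` and the sample `blocks` data are hypothetical and written only for illustration. It is not the toon-format encoder and skips quoting, nesting, and alternative delimiters. Character counts are used here only as a rough proxy for token counts.

```python
import json

def to_toon_table(key, rows):
    """Encode a uniform list of flat dicts in a simplified TOON-style table.

    Hypothetical helper for illustration: no quoting or nesting is handled.
    """
    fields = list(rows[0].keys())
    # Header: key, row count in brackets, field names in braces
    lines = [f"{key}[{len(rows)}]{{{','.join(fields)}}}:"]
    for row in rows:
        # One indented, comma-separated line per object
        lines.append("  " + ",".join(str(row[f]) for f in fields))
    return "\n".join(lines)

# Made-up sample data resembling a list of text blocks
blocks = [
    {"page": 1, "type": "title", "chars": 42},
    {"page": 1, "type": "text", "chars": 310},
    {"page": 2, "type": "heading", "chars": 27},
]

as_json = json.dumps({"blocks": blocks})
as_toon = to_toon_table("blocks", blocks)

print(as_toon)
print(len(as_json), "characters as JSON vs", len(as_toon), "as TOON-style text")
```

For real exports, use the parser's export method or the toon-format package rather than a hand-rolled encoder like this one.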

Supported Libraries

Text Extraction

PyMuPDF (fitz): Fast, comprehensive extraction with layout awareness
pdfplumber: Excellent text extraction with precise positioning

Table Extraction

Camelot: Advanced table extraction with lattice and stream modes
Tabula: Java-based table extraction

Image Extraction

PyMuPDF (fitz): Complete image extraction with metadata

Installation

Prerequisites

Python 3.8+

Java Runtime Environment (required for Tabula)

# Ubuntu/Debian
sudo apt-get install default-jre

# macOS
brew install openjdk

# Windows
# Download from https://www.java.com/

Ghostscript (required for Camelot)

# Ubuntu/Debian
sudo apt-get install ghostscript

# macOS
brew install ghostscript

# Windows
# Download from https://www.ghostscript.com/

Create Virtual Environment (Recommended)

It's recommended to use a virtual environment to isolate dependencies:

Using venv (built-in):

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Using uv (creates venv automatically):

uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

Install Dependencies

You can install dependencies using either classic pip or the faster uv package manager:

Option 1: Using pip (Classic)

pip install -r requirements.txt

Option 2: Using uv (Faster Alternative)

uv is a fast Python package installer written in Rust. It's 10-100x faster than pip.

# Install uv first (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install dependencies with uv
uv pip install -r requirements.txt

Or on Windows (PowerShell):

# Install uv
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"

# Install dependencies
uv pip install -r requirements.txt

Benefits of using uv:

10-100x faster installation than pip, with a pip-compatible command-line interface (uv pip ...)

Quick Start

Basic Usage

from metadata_document_parser import PDFMetadataParser

# Initialize parser
parser = PDFMetadataParser("document.pdf")

# Parse with all features
result = parser.parse(
    extract_text=True,
    extract_images=True,
    extract_tables=True,
    extract_formulas=True,  # NEW: Extract mathematical formulas
    layout_aware=True,
    column_aware=True  # NEW: Fix reading order for multi-column layouts
)

# Access results
print(f"Title: {result.metadata.title}")
print(f"Pages: {result.metadata.num_pages}")
print(f"Text blocks: {len(result.text_blocks)}")
print(f"Images: {len(result.images)}")
print(f"Tables: {len(result.tables)}")

Run Examples

# Run all examples
python example_usage.py your_document.pdf

# Test multi-column extraction specifically
python example_multi_column.py research_paper.pdf

# Debug and visualize column detection
python test_column_detection.py research_paper.pdf --header-margin 50 --footer-margin 50

# Test TOON export with token comparison
python example_toon_export.py research_paper.pdf

Usage Examples

1. Extract Text with Layout Awareness

parser = PDFMetadataParser("document.pdf")

result = parser.parse(
    extract_text=True,
    text_method="pymupdf",  # or "pdfplumber"
    layout_aware=True
)

# Access text blocks with position and formatting
for block in result.text_blocks:
    print(f"Page {block.page_num}: {block.block_type}")
    print(f"Position: {block.bbox}")
    print(f"Font: {block.font_name} (Size: {block.font_size})")
    print(f"Text: {block.text}\n")

2. Extract Tables

parser = PDFMetadataParser("document.pdf")

# Using Camelot (best for bordered tables)
result = parser.parse(
    extract_tables=True,
    table_method="camelot"
)

# Alternatively, using Tabula (good for borderless tables); this overwrites the Camelot result above
result = parser.parse(
    extract_tables=True,
    table_method="tabula"
)

# Access table data
for table in result.tables:
    print(f"Table {table.table_index} on page {table.page_num}")
    print(f"Rows: {len(table.data)}")
    print(table.data)  # List of lists

3. Extract and Save Images

parser = PDFMetadataParser("document.pdf")

result = parser.parse(extract_images=True)

# Save images to disk
saved_paths = parser.save_images(result, "output_images/")
print(f"Saved {len(saved_paths)} images")

4. Compare Extraction Methods

parser = PDFMetadataParser("document.pdf")

comparison = parser.compare_extraction_methods()

# Results include performance metrics for each method
print(comparison)

5. Export Parsed Data (TOON/JSON)

TOON format (default) - 10-60% fewer tokens vs JSON, ideal for LLM input:

parser = PDFMetadataParser("document.pdf")
result = parser.parse()

# Export to TOON format (default - token-efficient for LLMs)
toon_output = parser.export(result)  # format="toon" is default
print(toon_output)

# Save to file
with open("parsed_document.toon", "w") as f:
    f.write(toon_output)

JSON format (explicit) - use when you need standard JSON:

# Export to JSON
json_output = parser.export(result, format="json")

# Or use dedicated method
json_output = parser.export_to_json(result, indent=2)

# Save to file
with open("parsed_document.json", "w") as f:
    f.write(json_output)

Compare token counts between formats:

comparison = parser.compare_export_formats(result)

print(f"JSON tokens: {comparison['json_tokens']:,}")
print(f"TOON tokens: {comparison['toon_comma_tokens']:,}")
print(f"Savings: {comparison['toon_comma_savings_percent']}%")

6. Column-Aware Reading Order (for Research Papers, Newspapers)

# Basic usage with default margins
parser = PDFMetadataParser("document.pdf")

# Enable column-aware reading order
result = parser.parse(
    extract_text=True,
    layout_aware=True,
    column_aware=True  # Automatically detects and fixes column order
)

# Check detected layout
print(f"Detected layout: {result.column_layout}")  # 'single', 'double', or 'multi'

# Text blocks are now in correct reading order (left column, then right column)
for block in result.text_blocks:
    print(block.text)
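
The idea behind column-aware ordering can be sketched in a few lines. This is a simplified illustration, not the parser's actual algorithm: the function `column_reading_order` is hypothetical, it groups boxes into columns purely by horizontal overlap, and it does not handle full-width elements such as titles that span both columns.

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)

def column_reading_order(bboxes: List[BBox]) -> List[BBox]:
    """Order boxes column by column (left to right), top to bottom within each."""
    columns: List[List[BBox]] = []
    for box in sorted(bboxes, key=lambda b: b[0]):
        # A box that starts left of the current column's right edge joins that column
        if columns and box[0] < max(b[2] for b in columns[-1]):
            columns[-1].append(box)
        else:
            columns.append([box])
    ordered: List[BBox] = []
    for column in columns:
        ordered.extend(sorted(column, key=lambda b: b[1]))
    return ordered

# Made-up two-column page: left column spans x=50..290, right column x=320..560
right_top = (320, 100, 560, 200)
left_bottom = (50, 300, 290, 400)
left_top = (50, 100, 290, 200)
right_bottom = (320, 300, 560, 400)

ordered = column_reading_order([right_top, left_bottom, left_top, right_bottom])
for box in ordered:
    print(box)
```

A plain top-to-bottom sort would interleave the two columns (left top, right top, left bottom, right bottom). Grouping by column first yields the whole left column before the right one.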

Advanced: Custom Header/Footer Margins

For PDFs with large headers or footers, adjust the margins to improve column detection:

# Initialize with custom margins (in points, 72 points = 1 inch)
parser = PDFMetadataParser(
    "document.pdf",
    header_margin=100,  # Ignore top 100 points (large header)
    footer_margin=80    # Ignore bottom 80 points (large footer)
)

result = parser.parse(
    extract_text=True,
    layout_aware=True,
    column_aware=True
)
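
As a rough sketch of what the margins mean, the snippet below converts inches to points and checks whether a bounding box falls inside the region left after removing the header and footer stripes. The helpers `inches_to_points` and `in_body_region` are hypothetical, and the check assumes a top-left origin with y increasing downward (the convention PyMuPDF uses). How the parser applies the margins internally may differ.

```python
POINTS_PER_INCH = 72

def inches_to_points(inches: float) -> float:
    """Convert a length in inches to PDF points."""
    return inches * POINTS_PER_INCH

def in_body_region(bbox, page_height, header_margin=50, footer_margin=50):
    """Return True if the box lies entirely between the header and footer stripes."""
    _, y0, _, y1 = bbox
    return y0 >= header_margin and y1 <= page_height - footer_margin

page_height = 792                       # US Letter height in points (11 inches)
header_margin = inches_to_points(1.25)  # 90.0 points

print(in_body_region((50, 40, 560, 70), page_height, header_margin))    # running header
print(in_body_region((50, 100, 290, 200), page_height, header_margin))  # body text
print(in_body_region((50, 750, 560, 780), page_height, header_margin))  # page footer
```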

Visualizing Column Detection

Debug and verify column detection by creating an annotated PDF:

parser = PDFMetadataParser("document.pdf")

# Creates a PDF with red borders around detected columns
output_path = parser.visualize_columns()
print(f"Annotated PDF saved to: {output_path}")

7. Formula Detection and LaTeX Conversion

Basic Usage (Heuristic):

parser = PDFMetadataParser("document.pdf")

# Extract formulas with heuristic detection (no GPU/DL required)
# Use strict_mode=True to reduce false positives
result = parser.parse(
    extract_text=True,
    extract_formulas=True,
    strict_mode=True
)
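
To give a feel for what heuristic detection and a strict mode can look like, here is a small, self-contained sketch. Everything in it is an assumption made for illustration: the symbol set, the thresholds, and the function names `formula_score` and `looks_like_formula` are invented, and the parser's real rules may be quite different.

```python
# Characters treated as "math-like" in this sketch (an arbitrary, illustrative set)
MATH_CHARS = set("=+-*/^_<>≤≥≠±×÷∑∏∫√∞∂∇")
RELATION_CHARS = set("=<>≤≥≠")

def formula_score(text: str) -> float:
    """Fraction of non-space characters that are math-like symbols (0.0-1.0)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c in MATH_CHARS for c in chars) / len(chars)

def looks_like_formula(text: str, strict: bool = False) -> bool:
    """Flag text as a formula; strict mode raises the bar to cut false positives."""
    if strict:
        # Strict: higher symbol density AND at least one relation symbol
        has_relation = any(c in RELATION_CHARS for c in text)
        return formula_score(text) >= 0.30 and has_relation
    return formula_score(text) >= 0.15

for sample in ["E = mc^2", "A/B test - v2", "state-of-the-art results"]:
    print(sample, looks_like_formula(sample), looks_like_formula(sample, strict=True))
```

In this sketch, "A/B test - v2" passes the lenient check because of its slash and hyphen but is rejected in strict mode, which is the kind of false positive a stricter filter is meant to remove.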

Advanced Usage (External OCR): For production-quality LaTeX, you can use an external OCR service like Mathpix.

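from metadata_document_parser import PDFMetadataParser
from metadata_document_parser.extractors.ocr import MathpixOCR

# Initialize the parser and the OCR strategy
parser = PDFMetadataParser("document.pdf")
ocr = MathpixOCR(app_id="YOUR_APP_ID", app_key="YOUR_APP_KEY")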

result = parser.parse(
    extract_text=True,
    extract_formulas=True,
    strict_mode=True,
    ocr_strategy=ocr
)

# Access detected formulas
for formula in result.formulas:
    print(f"Formula: {formula.formula_text}")
    print(f"LaTeX:   {formula.latex}")  # High-quality LaTeX from OCR
    print(f"Confidence: {formula.confidence:.2f}")

8. Simple Reading Order (No Column Detection)

parser = PDFMetadataParser("document.pdf")
result = parser.parse(
    layout_aware=True,
    column_aware=False  # Disable column detection
)

# Sort blocks by page and vertical position (simple top-to-bottom)
sorted_blocks = sorted(
    result.text_blocks,
    key=lambda b: (b.page_num, b.bbox[1])
)

# Print in reading order
for block in sorted_blocks:
    if block.block_type == "title":
        print(f"\nTITLE: {block.text}\n")
    elif block.block_type == "heading":
        print(f"\nHEADING: {block.text}\n")
    else:
        print(block.text)

API Reference

PDFMetadataParser

`__init__(pdf_path: str, footer_margin: int = 50, header_margin: int = 50, fast_column_detection: bool = True)`

Initialize the parser with a PDF file path.

Parameters:

pdf_path (str): Path to the PDF file
footer_margin (int): Height in points of bottom stripe to ignore for column detection (default: 50)
header_margin (int): Height in points of top stripe to ignore for column detection (default: 50)
fast_column_detection (bool): Use fast column detection algorithm—50-100x faster than detailed mode (default: True)

`parse(extract_text=True, extract_images=True, extract_tables=True, extract_formulas=False, text_method="pymupdf", table_method="camelot", layout_aware=True, column_aware=True) -> ParsedDocument`

Parse the PDF document.

Parameters:

extract_text (bool): Extract text content
extract_images (bool): Extract images
extract_tables (bool): Extract tables
extract_formulas (bool): Extract and detect mathematical formulas (NEW)
text_method (str): Method for text extraction ("pymupdf" or "pdfplumber")
table_method (str): Method for table extraction ("camelot" or "tabula")
layout_aware (bool): Preserve layout information
column_aware (bool): Detect columns and fix reading order (NEW)
strict_mode (bool): Enable stricter formula detection to reduce false positives (default: False)
ocr_strategy (ExternalOCR): Optional OCR strategy for high-quality LaTeX conversion (e.g., MathpixOCR)

Returns: ParsedDocument object containing all extracted data

`compare_extraction_methods() -> Dict`

Compare different extraction methods and return performance metrics.

`export_to_dict(parsed_doc: ParsedDocument) -> Dict`

Export parsed document to dictionary format.

`export(parsed_doc: ParsedDocument, format: str = "toon", delimiter: str = ",", indent: int = 2) -> str`

Export parsed document to specified format (default: TOON).

Parameters:

parsed_doc (ParsedDocument): Parsed document to export
format (str): Output format - "toon" (default, 10-60% fewer tokens than JSON) or "json"
delimiter (str): TOON delimiter - ',' (comma), '\t' (tab), or '|' (pipe). Tab often provides best token efficiency
indent (int): JSON indentation (only used if format="json")

Returns: Formatted string in requested format

`export_to_json(parsed_doc: ParsedDocument, indent: int = 2) -> str`

Export parsed document to JSON format.

Parameters:

parsed_doc (ParsedDocument): Parsed document to export
indent (int): JSON indentation (default: 2)

Returns: JSON string representation

`export_to_toon(parsed_doc: ParsedDocument, delimiter: str = ",") -> str`

Export parsed document to TOON format for token-efficient LLM input.

Parameters:

parsed_doc (ParsedDocument): Parsed document to export
delimiter (str): Array delimiter - ',' (comma), '\t' (tab), or '|' (pipe)

Returns: TOON formatted string

Note: Requires toon-format package. Install with: pip install toon-format

`compare_export_formats(parsed_doc: ParsedDocument) -> Dict[str, Any]`

Compare token counts between JSON and TOON export formats.

Parameters:

parsed_doc (ParsedDocument): Parsed document to compare

Returns: Dictionary containing:

json_tokens: Token count for JSON format
json_size_bytes: Size in bytes for JSON
toon_comma_tokens: Token count for TOON with comma delimiter
toon_comma_size_bytes: Size in bytes for TOON with comma delimiter
toon_comma_savings_percent: Percentage savings vs JSON
toon_tab_tokens: Token count for TOON with tab delimiter
toon_tab_size_bytes: Size in bytes for TOON with tab delimiter
toon_tab_savings_percent: Percentage savings vs JSON
best_format: Best performing TOON format
best_savings_percent: Best savings percentage achieved
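
The savings figures are simple percentages relative to the JSON token count. The sketch below shows that arithmetic with made-up token counts. The helper `savings_percent` is hypothetical, and rounding to one decimal place is an assumption rather than documented behavior.

```python
def savings_percent(json_tokens: int, toon_tokens: int) -> float:
    """Percentage of tokens saved relative to JSON, rounded to one decimal place."""
    if json_tokens <= 0:
        raise ValueError("json_tokens must be positive")
    return round((json_tokens - toon_tokens) / json_tokens * 100, 1)

# Made-up token counts for illustration only
counts = {"json": 12000, "toon_comma": 7800, "toon_tab": 7500}

savings = {
    name: savings_percent(counts["json"], tokens)
    for name, tokens in counts.items()
    if name != "json"
}
best = max(savings, key=savings.get)

print(savings)
print("Best format:", best)
```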

`save_images(parsed_doc: ParsedDocument, output_dir: str) -> List[str]`

Save extracted images to disk.

`visualize_columns(output_path: Optional[str] = None) -> str`

Create a visual representation of detected columns by drawing red borders around detected column bboxes and numbering them. Useful for debugging and understanding column detection.

Parameters:

output_path (str, optional): Output path for annotated PDF. If not provided, uses <original_name>-columns.pdf

Returns: Path to the annotated PDF file

Data Classes

`DocumentMetadata`

Contains PDF metadata:

title, author, subject, creator, producer
creation_date, modification_date
num_pages, file_size, page_sizes

`TextBlock`

Represents a text block with layout info:

text: The text content
bbox: Bounding box (x0, y0, x1, y1)
page_num: Page number
font_size, font_name: Font information
block_type: "text", "title", "heading", "header", "footer"
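
As a mental model, a TextBlock can be pictured as a small dataclass. The sketch below mirrors the fields listed above, but the class name `TextBlockSketch`, the field types, and the defaults are assumptions rather than the library's actual definition. The helper `body_text` is likewise hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class TextBlockSketch:
    """Illustrative stand-in for TextBlock; types and defaults are assumptions."""
    text: str
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1)
    page_num: int
    font_size: Optional[float] = None
    font_name: Optional[str] = None
    block_type: str = "text"  # "text", "title", "heading", "header", or "footer"

def body_text(blocks: List[TextBlockSketch]) -> str:
    """Join block text, skipping running headers and footers."""
    return "\n".join(b.text for b in blocks if b.block_type not in ("header", "footer"))

blocks = [
    TextBlockSketch("Journal of Examples", (50, 20, 560, 35), 1, block_type="header"),
    TextBlockSketch("A Study of Columns", (50, 80, 560, 110), 1, 18.0, "Helvetica-Bold", "title"),
    TextBlockSketch("Columns are detected first.", (50, 130, 290, 160), 1, 10.0, "Times-Roman"),
]

print(body_text(blocks))
```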

`ImageData`

Represents an extracted image:

image_index, page_num: Position info
bbox: Bounding box
width, height: Dimensions
colorspace: Color space
image_bytes: Raw image data
ext: File extension

`TableData`

Represents an extracted table:

table_index, page_num: Position info
bbox: Bounding box (if available)
data: Table data as list of lists
extraction_method: Method used
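
Because data is a plain list of lists, it can be handed straight to the standard library. A minimal sketch, assuming each inner list is one table row. The helper `table_to_csv` and the sample rows are illustrative and not part of the parser:

```python
import csv
import io

def table_to_csv(rows) -> str:
    """Serialize a list-of-lists table to CSV text, quoting cells where needed."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()

# Made-up rows standing in for table.data
data = [
    ["Library", "Speed"],
    ["Camelot", "Moderate"],
    ["Tabula", "Fast, Java-based"],
]

csv_text = table_to_csv(data)
print(csv_text)
```

The csv module quotes the cell containing a comma, so the output stays loadable by spreadsheet tools.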

`FormulaData` (NEW)

Represents a detected mathematical formula:

formula_index: Index of the formula
page_num: Page number
bbox: Bounding box
formula_text: Original text representation
latex: LaTeX conversion (heuristic-based, or produced by the external OCR strategy when one is supplied)
confidence: Detection confidence score (0.0-1.0)
image_bytes: Formula as image (optional)

`ParsedDocument`

Complete parsed document data:

metadata: DocumentMetadata
text_blocks: List of TextBlock
images: List of ImageData
tables: List of TableData
formulas: List of FormulaData (NEW)
extraction_method: Method used
parsing_time: Time taken to parse
column_layout: Detected layout ('single', 'double', 'multi') (NEW)

Library Comparison

Text Extraction

| Library | Speed | Layout Awareness | Font Info | Best For |
|---|---|---|---|---|
| PyMuPDF | ⚡⚡⚡ Fast | ✅ Excellent | ✅ Yes | All-purpose, comprehensive extraction |
| pdfplumber | ⚡⚡ Moderate | ✅ Excellent | ❌ Limited | Precise text positioning |

Recommendation: Use PyMuPDF for most cases. Use pdfplumber when you need very precise character-level positioning.

Table Extraction

| Library | Speed | Bordered Tables | Borderless Tables | Best For |
|---|---|---|---|---|
| Camelot | ⚡⚡ Moderate | ✅ Excellent | ⚡ Good (stream mode) | Complex, well-structured tables |
| Tabula | ⚡⚡⚡ Fast | ✅ Good | ✅ Better | Simple tables, quick extraction |

Recommendation: Use Camelot with lattice mode for bordered tables. Use Camelot stream mode or Tabula for borderless tables.

Performance Tips

Fast column detection (enabled by default): Optimized algorithm that's 50-100x faster than detailed mode

# Fast mode (default)
parser = PDFMetadataParser("paper.pdf", fast_column_detection=True)

# Detailed mode - only needed for PDFs with text on images or colored backgrounds
parser = PDFMetadataParser("paper.pdf", fast_column_detection=False)

Text only: Disable image and table extraction for faster processing

result = parser.parse(extract_text=True, extract_images=False, extract_tables=False)

Layout-aware off: Set layout_aware=False for simple text extraction (fastest)
```
result = parser.parse(layout_aware=False, column_aware=False)
```
Choose the right method: PyMuPDF is generally fastest for text
Batch processing: Process multiple PDFs in parallel using multiprocessing
Use uv for faster installs: Install dependencies with uv pip install for 10-100x faster installation
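
One possible shape for batch processing is sketched below using concurrent.futures from the standard library. The function names `parse_one` and `parse_many`, the summary dictionary, and the `worker`/`executor_cls` parameters are hypothetical choices made for this example. Only PDFMetadataParser and its parse arguments come from this README.

```python
from concurrent.futures import ProcessPoolExecutor

def parse_one(pdf_path):
    """Parse one PDF in a worker process and return a small, picklable summary."""
    # Imported here so the package is only needed when a PDF is actually parsed
    from metadata_document_parser import PDFMetadataParser

    parser = PDFMetadataParser(pdf_path)
    result = parser.parse(extract_text=True, extract_images=False, extract_tables=False)
    return {
        "path": pdf_path,
        "pages": result.metadata.num_pages,
        "text_blocks": len(result.text_blocks),
    }

def parse_many(pdf_paths, worker=parse_one, max_workers=None, executor_cls=ProcessPoolExecutor):
    """Run `worker` over all paths in parallel, preserving input order."""
    paths = list(pdf_paths)
    if not paths:
        return []
    with executor_cls(max_workers=max_workers) as pool:
        return list(pool.map(worker, paths))

# The __main__ guard is required for process pools on platforms that spawn workers
if __name__ == "__main__":
    import sys

    for summary in parse_many(sys.argv[1:]):
        print(summary)
```

Returning a compact summary instead of the full parsed document keeps the data sent back from each worker small. Whether a full ParsedDocument can be pickled across processes is not stated here, so that is left untested.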

Package Manager Quick Reference

Here's a quick comparison of common commands between pip and uv:

| Task | pip | uv |
|---|---|---|
| Install from requirements.txt | `pip install -r requirements.txt` | `uv pip install -r requirements.txt` |
| Install single package | `pip install package-name` | `uv pip install package-name` |
| Install with version | `pip install package==1.0.0` | `uv pip install package==1.0.0` |
| Upgrade package | `pip install --upgrade package` | `uv pip install --upgrade package` |
| Uninstall package | `pip uninstall package` | `uv pip uninstall package` |
| List installed packages | `pip list` | `uv pip list` |
| Freeze requirements | `pip freeze > requirements.txt` | `uv pip freeze > requirements.txt` |
| Create virtual env | `python -m venv venv` | `uv venv` |

Note: uv pip mirrors pip's command-line interface for the tasks above while running significantly faster, so in most cases you can simply replace pip with uv pip.

Limitations

Scanned PDFs: This parser is designed for digitally-born PDFs. For scanned PDFs, you'll need OCR (e.g., Tesseract)
Complex layouts: Very complex multi-column layouts may require manual tuning
Encrypted PDFs: Password-protected PDFs need to be decrypted first
Large files: Very large PDFs (100+ MB) may require significant memory

Troubleshooting

Camelot not finding tables

Try both flavors:

# For tables with borders
result = parser.parse(table_method="camelot")  # Uses 'lattice' by default

# For tables without borders, modify pdf_parser.py line 350:
# flavor='stream' instead of flavor='lattice'
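
To check which flavor works for a given file without editing the source, one option is to call Camelot directly and fall back from lattice to stream. This is a sketch under assumptions: the helper `read_tables_with_fallback` and its `reader` parameter are hypothetical, and it bypasses the parser entirely. `camelot.read_pdf` with the `pages` and `flavor` arguments is Camelot's own API.

```python
def read_tables_with_fallback(pdf_path, pages="1", reader=None):
    """Try Camelot's lattice flavor first, then stream; return (flavor, tables)."""
    if reader is None:
        import camelot  # imported lazily so the sketch loads without Camelot installed

        reader = camelot.read_pdf
    for flavor in ("lattice", "stream"):
        tables = reader(pdf_path, pages=pages, flavor=flavor)
        if len(tables) > 0:
            return flavor, tables
    return None, []

# Example usage (requires Camelot, Ghostscript, and a real PDF):
# flavor, tables = read_tables_with_fallback("document.pdf", pages="1-3")
# print(flavor, len(tables))
# print(tables[0].df.head())
```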

Java errors with Tabula

Ensure Java is installed and in your PATH:

java -version

Ghostscript errors

Ensure Ghostscript is installed:

gs --version

Contributing

Contributions are welcome! Please feel free to submit pull requests or open issues.

License

MIT License

Acknowledgments

This parser builds on several excellent open-source libraries, including PyMuPDF, pdfplumber, Camelot, Tabula, and toon-format (see Supported Libraries).

Support

For issues, questions, or contributions, please open an issue on GitHub.


generalMG/MetadataDocumentParser
