forked from t-redactyl/ocr-llm-agent
-
Notifications
You must be signed in to change notification settings - Fork 0
Document Processing Tutorial
Udit Asopa edited this page Oct 16, 2025
·
1 revision
Master document processing with Vision Text Extractor! This comprehensive tutorial covers business documents, forms, contracts, and more.
- Process various business document types
- Extract structured data with custom prompts
- Handle different file formats and quality levels
- Automate document workflows
- Compare accuracy across different providers
# Extract key invoice information
python main.py invoice.pdf \
--prompt "Extract: invoice number, date, vendor, total amount, due date"
# Structured JSON output
python main.py invoice.jpg \
--prompt "Extract invoice data as JSON with fields: invoice_id, date, vendor_name, line_items, subtotal, tax, total"# Process multiple invoices
for invoice in invoices/*.pdf; do
echo "Processing: $invoice"
python main.py "$invoice" \
--prompt "Extract: invoice number, date, vendor, total" \
> "processed/$(basename "$invoice" .pdf).txt"
doneExpected Output:
Invoice Number: INV-2024-001
Date: March 15, 2024
Vendor: ABC Supply Company
Total Amount: $1,247.50
Due Date: April 15, 2024
# Extract all transactions
python main.py bank-statement.pdf \
--prompt "Extract all transactions with: date, description, amount, balance"
# Focus on specific transaction types
python main.py statement.pdf \
--prompt "Extract only deposits and their amounts with dates"# Get spending summary
python main.py cc-statement.pdf \
--prompt "Extract: statement period, previous balance, payments, new charges, current balance, minimum payment"
# Transaction details
python main.py cc-statement.pdf \
--prompt "List all transactions with date, merchant, amount, and category"# Extract form data
python main.py insurance-form.pdf \
--prompt "Extract all filled form fields and their values, including: name, policy number, claim details, date of incident"
# Medical claim forms
python main.py medical-claim.pdf \
--prompt "Extract: patient info, provider details, diagnosis codes, procedure codes, amounts"# Extract applicant information
python main.py job-application.pdf \
--prompt "Extract: applicant name, contact info, work experience, education, skills"
# Government forms
python main.py tax-form.pdf \
--prompt "Extract all numerical values and their corresponding field labels"# Extract key contract terms
python main.py contract.pdf \
--prompt "Extract: parties involved, effective date, termination date, key obligations, payment terms"
# Lease agreements
python main.py lease.pdf \
--prompt "Extract: property address, lease term, monthly rent, security deposit, special conditions"# Identify important clauses
python main.py legal-doc.pdf \
--prompt "Identify and extract: liability clauses, termination conditions, dispute resolution procedures"# Extract tables as structured data
python main.py financial-report.pdf \
--prompt "Extract the financial table preserving rows and columns. Format as: Item | Q1 | Q2 | Q3 | Q4"
# Price lists
python main.py price-list.jpg \
--prompt "Extract pricing table with: Product Name | Description | Unit Price | Bulk Price"# Process specific pages
python main.py multi-page-contract.pdf \
--prompt "Focus on signature page - extract: signatory names, titles, dates, witness information"
# Summary extraction
python main.py annual-report.pdf \
--prompt "Extract executive summary key points and financial highlights only"# Detailed extraction for clear documents
python main.py clear-scan.pdf \
--prompt "Perform detailed extraction including: all text, formatting, table structure, footnotes"# Focus on key information for unclear scans
python main.py blurry-document.jpg \
--provider openai \
--prompt "This is a poor quality scan. Extract the most important information: document type, date, key numbers"# Handwriting-specific prompts
python main.py handwritten-form.jpg \
--prompt "This contains handwriting. Carefully transcribe: name, address, phone, signature date"# Use OpenAI for best accuracy
python main.py complex-layout.pdf \
--provider openai \
--prompt "Extract this complex multi-column document preserving layout structure"# Use local processing for privacy
python main.py medical-record.pdf \
--provider huggingface \
--prompt "Extract patient information while maintaining confidentiality"# Use SmolVLM for free bulk processing
python main.py invoice-batch/*.pdf \
--provider huggingface \
--prompt "Extract invoice number and total amount only"# Business card to JSON
python main.py business-card.jpg \
--prompt "Extract as JSON: {\"name\": \"\", \"title\": \"\", \"company\": \"\", \"email\": \"\", \"phone\": \"\", \"address\": \"\"}"
# Invoice to JSON
python main.py invoice.pdf \
--prompt "Extract as JSON with fields: invoice_number, date, vendor, line_items (array), subtotal, tax, total"# Expense report processing
python main.py expense-receipts.jpg \
--prompt "Extract as CSV format: Date,Vendor,Category,Amount,Tax,Total"
# Contact list extraction
python main.py contact-sheet.pdf \
--prompt "Extract as CSV: Name,Title,Company,Email,Phone"# Document summary
python main.py quarterly-report.pdf \
--prompt "Provide a 3-sentence summary of key findings and recommendations"
# Contract summary
python main.py service-agreement.pdf \
--prompt "Summarize: what services, duration, cost, key responsibilities of each party"#!/bin/bash
# Process all PDFs in a directory
INPUT_DIR="documents"
OUTPUT_DIR="processed"
mkdir -p "$OUTPUT_DIR"
for doc in "$INPUT_DIR"/*.pdf; do
filename=$(basename "$doc" .pdf)
echo "Processing: $filename"
python main.py "$doc" \
--prompt "Extract: document type, date, key information, summary" \
> "$OUTPUT_DIR/${filename}_extracted.txt"
done
echo "Batch processing complete!"# Auto-classify document types
python main.py unknown-document.pdf \
--prompt "Classify this document type (invoice, contract, report, form, etc.) and extract the 3 most important pieces of information"# Extract and validate
python main.py form.pdf \
--prompt "Extract all dates and verify they are in valid format. Extract all phone numbers and verify format. List any formatting issues found."# Be specific about data format
python main.py document.pdf \
--prompt "Extract dates in YYYY-MM-DD format, amounts with currency symbol, phone numbers with country code"
# Request error handling
python main.py unclear-scan.jpg \
--prompt "If any text is unclear, mark with [UNCLEAR] and provide best guess in parentheses"# Multilingual documents
python main.py multilingual-contract.pdf \
--prompt "This document contains English and Spanish. Extract key terms and identify the language for each section"# Document information
python main.py signed-contract.pdf \
--prompt "Extract: document title, creation date, last modified date, author, number of pages, signature status"# Extract with confidence scoring
python main.py important-document.pdf \
--prompt "Extract key information and indicate confidence level (high/medium/low) for each piece of data"
# Cross-verification with multiple providers
python main.py critical-contract.pdf --provider huggingface > hf_result.txt
python main.py critical-contract.pdf --provider openai > openai_result.txt
diff hf_result.txt openai_result.txt# Flag potential issues
python main.py document.pdf \
--prompt "Extract information and flag any: missing signatures, blank required fields, inconsistent dates, unclear amounts"# Monthly invoice processing
python main.py march-invoices/*.pdf \
--prompt "Extract for accounting: vendor, invoice#, date, net amount, tax amount, total, GL account suggestions"# Resume screening
python main.py resume.pdf \
--prompt "Extract: years of experience, key skills, education level, previous companies, contact info"
# Employee forms
python main.py employee-forms/*.pdf \
--prompt "Extract: employee ID, name, department, start date, salary, benefits selections"# Contract comparison
python main.py contract-v1.pdf \
--prompt "Extract key terms for comparison: parties, duration, payment terms, termination clauses"
python main.py contract-v2.pdf \
--prompt "Extract key terms for comparison: parties, duration, payment terms, termination clauses"- Use SmolVLM for sensitive documents (HIPAA, GDPR compliance)
- Never use OpenAI for confidential business data
- Always review extracted data before using in business processes
- Use OpenAI for complex layouts and critical accuracy
- Try multiple providers for important documents
- Validate extracted data against original documents
- Use specific prompts to reduce processing time
- Batch process similar document types together
- Cache results to avoid reprocessing same documents
Master these document processing techniques to transform your business document workflows with AI-powered text extraction!