Convert Excel files to PDF format using Databricks structured streaming for downstream OCR and document processing.
This solution processes Microsoft Excel files by:
- Reading Excel files from a Databricks volume using structured streaming
- Converting specific cell ranges to HTML tables
- Generating PDF documents from the HTML content
- Storing results in Databricks volumes for further processing
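At its core the conversion is a small, self-contained step: openpyxl reads the configured cell ranges, the values are rendered as HTML tables, and xhtml2pdf writes the PDF. The sketch below illustrates that flow; the function names, HTML styling, and default ranges are simplified stand-ins, not the notebook's exact implementation (see `excel2pdf.py` for that).

```python
# Minimal sketch of the per-file conversion (simplified; see excel2pdf.py for the real logic)
from openpyxl import load_workbook
from xhtml2pdf import pisa


def range_to_html_table(ws, cell_range: str) -> str:
    """Render a worksheet range such as 'C8:I50' as an HTML <table>."""
    rows = []
    for row in ws[cell_range]:
        cells = "".join(f"<td>{'' if c.value is None else c.value}</td>" for c in row)
        rows.append(f"<tr>{cells}</tr>")
    return "<table border='1'>" + "".join(rows) + "</table>"


def excel_to_pdf(src_path: str, dest_path: str, worksheet_name: str = "Database") -> bool:
    """Convert one Excel file to PDF; returns True on success."""
    wb = load_workbook(src_path, data_only=True)  # data_only=True reads cached formula values
    ws = wb[worksheet_name]
    html = (
        "<html><body>"
        + range_to_html_table(ws, "A1:B18")  # basic information block
        + range_to_html_table(ws, "C8:I50")  # main data block
        + "</body></html>"
    )
    with open(dest_path, "wb") as f:
        status = pisa.CreatePDF(html, dest=f)
    return not status.err
```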
Use Case: R&D and experiment files with complex structured content, multiple tables, or variable formats that are difficult to parse directly. Converting to PDF enables consistent OCR processing and information extraction without custom parsing logic for each file variant.

Why not ingest the binaries into a Delta table? Because these files are likely destined for downstream PDF processing by AgentBricks, writing them out as PDF files is the more useful starting point.
- Databricks workspace with Unity Catalog enabled
- Source and destination volumes configured
- Python packages:
`openpyxl==3.1.5`, `xhtml2pdf==0.2.17` (see the install example below)
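On a Databricks cluster the packages can be installed as notebook-scoped libraries; the cell below is one typical way to do it (cluster libraries or an environment specification work just as well).

```python
# Notebook-scoped install of the two conversion libraries
%pip install openpyxl==3.1.5 xhtml2pdf==0.2.17
dbutils.library.restartPython()
```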
- Import `excel2pdf.py` into your Databricks workspace
- Configure the required parameters using the notebook widgets
- Run the notebook to start processing
| Parameter | Description | Example |
|---|---|---|
| `catalog_name` | Unity Catalog name | `my_catalog` |
| `source_schema` | Schema containing source volume | `raw_data` |
| `source_volume` | Volume with Excel files | `excel_files` |
| `dest_schema` | Schema for destination volume | `processed_data` |
| `dest_volume` | Volume for PDF output | `pdf_files` |
| `dest_subfolder` | Subfolder for organization | `converted_pdfs` |
| `dest_metadata_table` | Table for processing metadata | `conversion_log` |
| `worksheet_name` | Excel worksheet to process | `Database` |
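The notebook defines these parameters as widgets; the snippet below is only an illustration of how the values typically map onto the source and destination volume paths. The defaults and path layout shown here are examples, not the notebook's exact code.

```python
# Illustrative widget wiring -- the notebook creates its own widgets;
# this only shows how the parameters translate into volume paths.
dbutils.widgets.text("catalog_name", "my_catalog")
dbutils.widgets.text("source_schema", "raw_data")
dbutils.widgets.text("source_volume", "excel_files")
dbutils.widgets.text("dest_schema", "processed_data")
dbutils.widgets.text("dest_volume", "pdf_files")
dbutils.widgets.text("dest_subfolder", "converted_pdfs")
dbutils.widgets.text("dest_metadata_table", "conversion_log")
dbutils.widgets.text("worksheet_name", "Database")

catalog = dbutils.widgets.get("catalog_name")
source_path = f"/Volumes/{catalog}/{dbutils.widgets.get('source_schema')}/{dbutils.widgets.get('source_volume')}"
dest_path = (
    f"/Volumes/{catalog}/{dbutils.widgets.get('dest_schema')}/"
    f"{dbutils.widgets.get('dest_volume')}/{dbutils.widgets.get('dest_subfolder')}"
)
```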
- Set up volumes and permissions. The source schema and volume are only needed if the landing zone does not already exist:

  ```sql
  CREATE SCHEMA IF NOT EXISTS my_catalog.raw_data;
  CREATE VOLUME IF NOT EXISTS my_catalog.raw_data.excel_files;

  CREATE SCHEMA IF NOT EXISTS my_catalog.processed_data;
  CREATE VOLUME IF NOT EXISTS my_catalog.processed_data.pdf_files;
  ```
- Upload Excel files to the source volume. If the source volume already contains your Excel files, you can skip this step.
- Configure and run the notebook with your parameters
- Monitor processing through the metadata table (see the example below)
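The metadata table's exact schema is defined in the notebook; assuming it records a source path, a destination path (null when a conversion fails, see Error Handling below), and a timestamp, monitoring looks roughly like this. The column names here are hypothetical.

```python
# Hypothetical column names -- check excel2pdf.py for the actual metadata schema.
log = spark.table("my_catalog.processed_data.conversion_log")
display(log.orderBy("modificationTime", ascending=False))  # recent activity
display(log.where("dest_path IS NULL"))                    # conversions that failed
```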
- Streaming: Uses Databricks Auto Loader for incremental file processing
- Scalability: Pandas UDF enables distributed processing across cluster nodes
- Error Handling: Failed conversions are logged with null destination paths
- Checkpointing: Ensures exactly-once processing and recovery
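Put together, the pipeline has roughly the shape sketched below, reusing the `excel_to_pdf()` helper and the `source_path`/`dest_path` variables from the earlier sketches. The column names, glob pattern, and checkpoint location are illustrative, not the notebook's exact code.

```python
# Rough shape of the streaming pipeline (illustrative, not the notebook's exact code)
import pandas as pd
from pyspark.sql import functions as F


@F.pandas_udf("string")
def convert_udf(paths: pd.Series) -> pd.Series:
    # Auto Loader's binaryFile source reports paths like 'dbfs:/Volumes/...';
    # strip the scheme before opening the file locally. Returning None marks a failure.
    def convert_one(path: str):
        local = path.replace("dbfs:", "")
        dest = dest_path + "/" + local.rsplit("/", 1)[-1].rsplit(".", 1)[0] + ".pdf"
        return dest if excel_to_pdf(local, dest) else None
    return paths.apply(convert_one)


(
    spark.readStream
    .format("cloudFiles")                                 # Auto Loader
    .option("cloudFiles.format", "binaryFile")
    .option("pathGlobFilter", "*.xls[xm]")                # only pick up Excel files
    .load(source_path)
    .select("path", "modificationTime")
    .withColumn("dest_path", convert_udf(F.col("path")))  # null dest_path on failure
    .writeStream
    .option("checkpointLocation", f"{dest_path}/_checkpoint")
    .trigger(availableNow=True)                           # process the backlog, then stop
    .toTable("my_catalog.processed_data.conversion_log")
)
```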
The solution expects Excel files with a configurable worksheet (default: "Database") containing:
- Basic information in range A1:B18
- Main data in range C8:I50
These ranges can easily be adjusted in the HTML-building step.
Set the `worksheet_name` parameter to target different worksheets. Modify the `xlsm_to_html()` function to adjust cell ranges for different Excel structures.