validating PDF structure tree #88

bdoubrov · 2025-10-30T14:19:00Z

bdoubrov
Oct 30, 2025
Maintainer

There are several options for verifying whether a PDF document is properly tagged. The strongest would be to validate the structure tree against all requirements of PDF/UA-1 (for PDF 1.7 or below) or PDF/UA-2 (for PDF 2.0). In practice, this might be too strict. We might want to check a smaller set of rules, such as:

- basic containment rules in ISO 32000-1 (PDF 1.7) such as the structure of tables and lists
- regularity of tables
- no P inside P or, more generally, block or grouping level structure elements inside other block level elements (P, H, Hn, Caption)

These would be most critical for reliable conversion of Tagged PDF to Markdown based on the structure tree.
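To make the proposal concrete, here is a rough sketch of two of the checks listed above: nesting inside paragraph-like elements and table regularity. It is only an illustration. `Elem`, `nesting_errors` and `is_regular` are hypothetical names rather than part of any existing library, the tag sets follow the list in this post, and rowspans are ignored for simplicity.

```python
import re
from dataclasses import dataclass, field

# Elements that should not contain block or grouping elements,
# following the list in the post: P, H, Hn, Caption.
PARAGRAPH_LIKE = re.compile(r"^(P|H|H\d+|Caption)$")

# Assumption: a partial set of block and grouping structure types.
BLOCK_OR_GROUPING = {
    "P", "H", "Caption", "Div", "Sect", "Art", "Part",
    "BlockQuote", "L", "Table", "TOC", "Index",
}


@dataclass
class Elem:
    """Hypothetical in-memory structure element."""
    tag: str
    kids: list = field(default_factory=list)
    colspan: int = 1


def is_block(tag):
    return tag in BLOCK_OR_GROUPING or re.fullmatch(r"H\d+", tag) is not None


def nesting_errors(elem, inside=None, path=""):
    """Report block/grouping elements found under a paragraph-like ancestor."""
    here = f"{path}/{elem.tag}"
    errors = []
    if inside is not None and is_block(elem.tag):
        errors.append(f"{here}: {elem.tag} nested inside {inside}")
    if inside is None and PARAGRAPH_LIKE.match(elem.tag):
        inside = elem.tag
    for kid in elem.kids:
        errors.extend(nesting_errors(kid, inside, here))
    return errors


def table_rows(table):
    """Collect TR elements directly under Table or under THead/TBody/TFoot."""
    rows = []
    for kid in table.kids:
        if kid.tag == "TR":
            rows.append(kid)
        elif kid.tag in {"THead", "TBody", "TFoot"}:
            rows.extend(k for k in kid.kids if k.tag == "TR")
    return rows


def is_regular(table):
    """True if every row spans the same number of columns (rowspan ignored)."""
    widths = {
        sum(cell.colspan for cell in row.kids if cell.tag in {"TH", "TD"})
        for row in table_rows(table)
    }
    return len(widths) <= 1
```

For example, `nesting_errors(Elem("Document", [Elem("P", [Elem("P")])]))` reports the inner P, and a table whose rows contain one and two cells fails `is_regular`.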

bdoubrov · 2025-11-10T08:32:42Z

bdoubrov
Nov 10, 2025
Maintainer Author

As the primary focus of PDF OpenDataLoader is Markdown / HTML generation, we can use the HTML schema to define the minimal requirements for an acceptable Tagged PDF structure.

Alternatively, we could be more flexible and accept "bad" structure, but follow HTML-like rules when processing invalid structures.
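As a rough illustration of the second option, the sketch below is loosely modelled on how HTML parsers implicitly close an open `p` when a block-level start tag arrives. A P that contains a block element is split, and the block is hoisted to become a sibling. `Node` and `normalize` are hypothetical names, the `BLOCK` set is an assumption, and wrapping trailing content in a new P is a choice made for this sketch rather than exact HTML behaviour.

```python
from dataclasses import dataclass, field

# Assumption: tags treated as block-level for the purpose of this sketch.
BLOCK = {"P", "H", "Caption", "Div", "Sect", "Art", "Part",
         "BlockQuote", "L", "Table"} | {f"H{i}" for i in range(1, 7)}


@dataclass
class Node:
    """Hypothetical structure element; kids are Node objects or text strings."""
    tag: str
    kids: list = field(default_factory=list)


def normalize(node):
    """Return the list of nodes that replaces `node` after repair."""
    if isinstance(node, str):
        return [node]
    # Repair children first, so nested problems are resolved bottom-up.
    kids = [k for kid in node.kids for k in normalize(kid)]
    if node.tag != "P":
        return [Node(node.tag, kids)]
    out, run = [], []
    for kid in kids:
        if isinstance(kid, Node) and kid.tag in BLOCK:
            # A block child implicitly closes the current P.
            if run:
                out.append(Node("P", run))
                run = []
            out.append(kid)
        else:
            run.append(kid)
    # Keep trailing inline content, or the original P if nothing was hoisted.
    if run or not out:
        out.append(Node("P", run))
    return out
```

With this sketch, `normalize(Node("P", ["a", Node("P", ["b"]), "c"]))` yields three sibling P nodes containing "a", "b" and "c".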


denisbialy · 2025-11-11T21:28:03Z

denisbialy
Nov 11, 2025
Maintainer

From what I understand, 100% validation of PDF/UA-1/2 without human input is not really possible (artefacts, reading order for complex layouts, etc.).

We may as well treat the tagged structure as another source of truth, i.e. we have our PDF-based results, we have AI results, and we will also have the tagged tree. Once we figure out the best way to combine AI data with PDF layout data, we can use the exact same method for tagged tree data. This way we can still make use of not-so-well-tagged PDFs.
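A minimal sketch of what "another source of truth" could look like in code is below. Each source emits hints in a shared shape, and the combiner does not care where a hint came from. The combining method itself is still undecided, so the weighted vote here is only a placeholder. `Hint`, `combine`, the region ids and the source names are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Hint:
    """One source's opinion about one page region (hypothetical shape)."""
    source: str        # e.g. "layout", "ai", "tagged"
    region: int        # hypothetical id of a page region
    label: str         # e.g. "heading", "paragraph"
    confidence: float  # in [0, 1]


def combine(hints, weights):
    """Pick, per region, the label with the highest weighted confidence.

    Placeholder strategy: sources missing from `weights` default to 1.0.
    """
    scores = defaultdict(lambda: defaultdict(float))
    for h in hints:
        scores[h.region][h.label] += weights.get(h.source, 1.0) * h.confidence
    return {region: max(labels, key=labels.get)
            for region, labels in scores.items()}
```

A poorly tagged PDF would simply get a lower weight for its `"tagged"` hints instead of being rejected outright.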

