Replies: 2 comments
Since the primary focus of PDF OpenDataLoader is Markdown / HTML generation, we could use the HTML schema to define the minimal requirements for an acceptable Tagged PDF structure. Alternatively, we could be more flexible and accept "bad" structure, handling it with HTML-like rules for processing invalid structures.
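To make the second option more concrete, here is a minimal sketch of one HTML-like recovery rule: a block element found inside a `P` implicitly ends that paragraph, much as an HTML parser closes an open `<p>` when it meets another block element. The `Node` class, the `BLOCK_TAGS` set, and `normalize` are hypothetical names for illustration only and are not part of OpenDataLoader.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Union


@dataclass
class Node:
    """Hypothetical stand-in for a structure element (not an OpenDataLoader type)."""
    tag: str
    kids: list[Union[Node, str]] = field(default_factory=list)


# Assumed set of tags treated as block level for this sketch.
BLOCK_TAGS = {"P", "H", "H1", "H2", "H3", "H4", "H5", "H6", "L", "Table", "BlockQuote"}


def normalize(node: Node) -> list[Node]:
    """Return the node(s) that replace `node` after hoisting blocks out of P."""
    if node.tag != "P":
        # Not a paragraph: keep the element and normalize its children.
        kids: list[Union[Node, str]] = []
        for kid in node.kids:
            if isinstance(kid, Node):
                kids.extend(normalize(kid))
            else:
                kids.append(kid)
        return [Node(node.tag, kids)]

    # Paragraph: split its content at every nested block element.
    out: list[Node] = []
    run: list[Union[Node, str]] = []
    for kid in node.kids:
        if isinstance(kid, Node) and kid.tag in BLOCK_TAGS:
            if run:
                out.append(Node("P", run))  # implicitly close the open paragraph
                run = []
            out.extend(normalize(kid))      # the block becomes a sibling
        else:
            run.append(kid)                 # inline content stays in the paragraph
    if run:
        out.append(Node("P", run))
    return out


doc = Node("Document", [Node("P", ["a", Node("P", ["b"]), "c"])])
fixed = normalize(doc)[0]
print([kid.tag for kid in fixed.kids])
```

This sketch does not look for blocks hidden inside inline children such as `Span`, and it drops paragraphs that end up empty; a real implementation would need to decide on both cases.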
From what I understand, 100% validation of PDF/UA-1/2 without human input is not really possible (artefacts, reading order for complex layouts, etc.). We may as well treat the tagged structure as another source of truth, i.e. we have our PDF-based results, we have AI results, and we will also have the tagged tree. Once we figure out the best way to combine AI data with PDF layout data, we can use the exact same method for the tagged tree data. This way we can still make use of not-so-well-tagged PDFs.
There are several options for verifying whether a PDF document is properly tagged. The strictest would be to validate the structure tree against all requirements of PDF/UA-1 (for PDF 1.7 or below) or PDF/UA-2 (for PDF 2.0). In practice, this might be too strict. We might want to check a smaller set of rules, such as:
- P inside P or, more generally, block or grouping level structure elements inside other block level elements (P, H, Hn, Caption)

These would be most critical for reliable conversion of Tagged PDF to Markdown based on the structure tree.
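The nesting rule above could be checked with a simple tree walk. The following is only a sketch under stated assumptions: `StructElem` is a hypothetical in-memory model of a structure element (not an OpenDataLoader or PDF library type), and the tag sets are one possible reading of "block or grouping level" based on the standard structure types.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class StructElem:
    """Hypothetical model of a structure tree node: a tag plus child elements."""
    tag: str
    kids: list[StructElem] = field(default_factory=list)


def is_heading(tag: str) -> bool:
    # Matches H as well as numbered headings such as H1, H2, ...
    return tag == "H" or (tag.startswith("H") and tag[1:].isdigit())


def is_text_block(tag: str) -> bool:
    # Containers that should hold only inline content (P, H, Hn, Caption).
    return tag in {"P", "Caption"} or is_heading(tag)


# Assumed set of further block or grouping level tags for this sketch.
OTHER_BLOCKS = {"L", "Table", "Div", "Sect", "Part", "Art", "BlockQuote", "TOC", "Index"}


def is_block_or_grouping(tag: str) -> bool:
    return is_text_block(tag) or tag in OTHER_BLOCKS


def find_nesting_violations(root: StructElem) -> list[str]:
    """Return paths of block/grouping elements nested inside P, H, Hn or Caption."""
    violations: list[str] = []

    def walk(elem: StructElem, path: list[str], inside_text_block: bool) -> None:
        here = path + [elem.tag]
        if inside_text_block and is_block_or_grouping(elem.tag):
            violations.append("/".join(here))
        # Inline wrappers such as Span do not reset the enclosing text block.
        inside = inside_text_block or is_text_block(elem.tag)
        for kid in elem.kids:
            walk(kid, here, inside)

    walk(root, [], False)
    return violations


tree = StructElem("Document", [
    StructElem("P", [StructElem("Span"), StructElem("P")]),
    StructElem("H1", [StructElem("L")]),
    StructElem("Sect", [StructElem("P")]),
])
print(find_nesting_violations(tree))
```

A converter could use the returned paths either to reject the structure tree or to decide where a more lenient recovery strategy is needed.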