docs/pymupdf-layout/index.rst
Lines changed: 9 additions & 79 deletions
@@ -103,109 +103,39 @@ So in this case we can adjust our API calls to ignore these elements as follows:
103
103
Extending Capability
104
104
----------------------------------
105
105
106
-
107
106
Using with Pro
108
107
~~~~~~~~~~~~~~~~~
109
108
110
-
We are able to extend |PyMuPDF Layout| to work with |PyMuPDF Pro| and thus increase our capability by allowing Office documents to be provided as input files. In this case all we have to do is to include the import for |PyMuPDF Pro| and unlock it before we import & activate |PyMuPDF Layout|::
109
+
|PyMuPDF Layout| can be extended to work with |PyMuPDF Pro|, which allows Office documents to be provided as input files. All we have to do is add the import for |PyMuPDF Pro| and unlock it::
111
110
112
111
import pymupdf.layout
113
-
import pymupdf.pro
114
112
import pymupdf4llm
113
+
import pymupdf.pro
115
114
pymupdf.pro.unlock()
116
115
117
116
Now we can happily load Office files and convert them as follows::
118
117
119
118
md = pymupdf4llm.to_markdown("sample.docx")
120
119
121
120
122
-
123
121
OCR support
124
122
~~~~~~~~~~~~~~~~~
125
123
126
124
The new layout-sensitive PyMuPDF4LLM version also evaluates whether a page would benefit from applying OCR to it. If its heuristics come to this conclusion, the built-in Tesseract-OCR module is automatically invoked. Its results are then handled like normal page content.
125
+
126
+
If a page contains no text at all but is covered by an image or many vector graphics, `OpenCV <https://pypi.org/project/opencv-python/>`_ is used to check whether text is *probably* detectable on the page. This distinguishes image-based text from ordinary pictures (such as photographs, which we don't want to OCR).
127
127
128
-
If Tesseract is not installed on your platform, no OCR is attempted.
129
-
128
+
If the page does contain text but contains too many unreadable characters (like "�����"), OCR is also executed, but **for the affected text areas only** -- not the full page. This way, we avoid losing already existing text and other content like images and vectors.
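The idea of "too many unreadable characters" can be illustrated with a minimal sketch. This is not the heuristic used by the library; the function names ``unreadable_ratio`` and ``needs_ocr`` and the 10% threshold are assumptions made up for this illustration. It simply measures the share of Unicode replacement characters (U+FFFD) among the non-whitespace characters of a text:

```python
# Illustrative only: NOT the library's actual heuristic.
# The names and the default threshold below are hypothetical.

REPLACEMENT_CHAR = "\ufffd"  # what an undecodable character is usually shown as


def unreadable_ratio(text: str) -> float:
    """Return the share of replacement characters among non-whitespace characters."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    bad = sum(1 for ch in visible if ch == REPLACEMENT_CHAR)
    return bad / len(visible)


def needs_ocr(text: str, threshold: float = 0.1) -> bool:
    """Flag a text area for OCR if its unreadable share exceeds the (assumed) threshold."""
    return unreadable_ratio(text) > threshold
```

Applied per text area rather than per page, a check of this kind would leave readable text untouched and only flag the damaged areas.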
130
129
130
+
These heuristics require both an existing Tesseract installation and OpenCV in the Python environment. If either is missing, no OCR is attempted at all.
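To find out up front whether both prerequisites are likely present, a rough check using only the Python standard library could look like the sketch below. It assumes that Tesseract is discoverable as a ``tesseract`` executable on the ``PATH`` and that OpenCV is importable as ``cv2``; the library may locate Tesseract differently, so treat a negative result as a hint, not as proof. The helper name ``ocr_prerequisites`` is made up for this example:

```python
import importlib.util
import shutil


def ocr_prerequisites() -> dict:
    """Report whether Tesseract and OpenCV appear to be available.

    Assumptions: a 'tesseract' executable on PATH indicates a Tesseract
    installation, and OpenCV is importable under the module name 'cv2'.
    """
    return {
        "tesseract": shutil.which("tesseract") is not None,
        "opencv": importlib.util.find_spec("cv2") is not None,
    }


status = ocr_prerequisites()
if all(status.values()):
    print("Both prerequisites found.")
else:
    missing = [name for name, found in status.items() if not found]
    print("Possibly missing:", ", ".join(missing))
```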
131
131
132
132
----
133
133
134
134
.. _pymupdf_layout_and_pymupdf4llm_api:
135
135
136
-
PyMuPDF Layout and parameter caveats
137
-
--------------------------------------
138
-
139
-
140
-
|PyMuPDF Layout| uses |PyMuPDF4LLM| for its interface. However, if you have imported ``Layout`` then the following caveats apply to the method parameters:
If you have imported ``pymupdf.layout``, |PyMuPDF4LLM| changes its behavior in several areas: new methods become available and some features are no longer supported. Please visit `this discussion page <https://github.com/pymupdf/pymupdf4llm/discussions/327>`_ for a detailed description of the changes. That page is kept up to date while we continue to work on improvements.