Commit 8901c4e

JorjMcKie authored and jamie-lemon committed
Changes for PyMuPDF-Layout
1 parent f5a3487 commit 8901c4e

File tree

3 files changed

+85
-148
lines changed

docs/pymupdf-layout/index.rst

Lines changed: 9 additions & 79 deletions
@@ -103,109 +103,39 @@ So in this case we can adjust our API calls to ignore these elements as follows:
103103
Extending Capability
104104
----------------------------------
105105

106-
107106
Using with Pro
108107
~~~~~~~~~~~~~~~~~
109108

110-
We are able to extend |PyMuPDF Layout| to work with |PyMuPDF Pro| and thus increase our capability by allowing Office documents to be provided as input files. In this case all we have to do is to include the import for |PyMuPDF Pro| and unlock it before we import & activate |PyMuPDF Layout|::
109+
We are able to extend |PyMuPDF Layout| to work with |PyMuPDF Pro| and thus increase our capability by allowing Office documents to be provided as input files. In this case all we have to do is to add the import for |PyMuPDF Pro| and unlock it::
111110

112111
import pymupdf.layout
113-
import pymupdf.pro
114112
import pymupdf4llm
113+
import pymupdf.pro
115114
pymupdf.pro.unlock()
116115

117116
Now we can happily load Office files and convert them as follows::
118117

119118
md = pymupdf4llm.to_markdown("sample.docx")
120119

121120

122-
123121
OCR support
124122
~~~~~~~~~~~~~~~~~
125123

126124
The new layout-sensitive PyMuPDF4LLM version also evaluates whether a page would benefit from applying OCR to it. If its heuristics come to this conclusion, the built-in Tesseract-OCR module is automatically invoked. Its results are then handled like normal page content.
125+
126+
If a page contains no text at all, but is covered with an image or many vectors, a check is made using `OpenCV <https://pypi.org/project/opencv-python/>`_ whether text is *probably* detectable on the page at all. This is done to tell apart ordinary pictures (like photographs, which we don't want to OCR) from image-based text.
127127

128-
If Tesseract is not installed on your platform, no OCR is attempted.
129-
128+
If the page does contain text but contains too many unreadable characters (like "�����"), OCR is also executed, but **for the affected text areas only** -- not the full page. This way, we avoid losing already existing text and other content like images and vectors.
130129

130+
For these heuristics to work, we need both an existing Tesseract installation and OpenCV available in the Python environment. If either is missing, no OCR is attempted at all.
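
To see in advance whether OCR can be attempted in a given environment, a check along the following lines may help. This is only a minimal sketch and not part of the library: ``ocr_prerequisites`` and ``ocr_possible`` are hypothetical helper names, and looking for a ``tesseract`` executable on ``PATH`` is an assumption. PyMuPDF may locate Tesseract differently (for example through its language data folder), so treat the result as a hint rather than a guarantee.

```python
import importlib.util
import shutil


def ocr_prerequisites() -> dict:
    """Report whether the two OCR prerequisites appear to be present.

    Hypothetical helper: approximates the requirement described above,
    it does not reproduce the library's own detection logic.
    """
    return {
        # Assumption: a `tesseract` executable on PATH indicates an installation.
        "tesseract": shutil.which("tesseract") is not None,
        # The OpenCV Python bindings are imported under the name `cv2`.
        "opencv": importlib.util.find_spec("cv2") is not None,
    }


def ocr_possible() -> bool:
    """Return True only if both prerequisites were found."""
    return all(ocr_prerequisites().values())


if __name__ == "__main__":
    for name, found in ocr_prerequisites().items():
        print(f"{name}: {'found' if found else 'missing'}")
    print("OCR can be attempted:", ocr_possible())
```

If either entry is reported as missing, the statement above applies: no OCR is attempted at all, and image-only pages will yield no text.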
131131

132132
----
133133

134134
.. _pymupdf_layout_and_pymupdf4llm_api:
135135

136-
PyMuPDF Layout and parameter caveats
137-
--------------------------------------
138-
139-
140-
|PyMuPDF Layout| uses |PyMuPDF4LLM| for its interface. However, if you have imported ``Layout`` then the following caveats apply to the method parameters:
141-
142-
143-
+-------------------+-------------+---------+---------+----------------------------------+
144-
| Parameter | to_markdown | to_text | to_json | Comments |
145-
+===================+=============+=========+=========+==================================+
146-
| doc | ✔️ | ✔️ | ✔️ | |
147-
+-------------------+-------------+---------+---------+----------------------------------+
148-
| header | ✔️ | ✔️ | ignored | **new:** replaces ``margins`` |
149-
+-------------------+-------------+---------+---------+----------------------------------+
150-
| footer | ✔️ | ✔️ | ignored | **new:** replaces ``margins`` |
151-
+-------------------+-------------+---------+---------+----------------------------------+
152-
| detect_bg_color |||| |
153-
+-------------------+-------------+---------+---------+----------------------------------+
154-
| dpi | ✔️ | ✔️ | ✔️ | |
155-
+-------------------+-------------+---------+---------+----------------------------------+
156-
| embed_images | ✔️ | ✔️ | ✔️ | |
157-
+-------------------+-------------+---------+---------+----------------------------------+
158-
| extract_words | later | later | later | postponed |
159-
+-------------------+-------------+---------+---------+----------------------------------+
160-
| filename | ✔️ | ✔️ | ✔️ | |
161-
+-------------------+-------------+---------+---------+----------------------------------+
162-
| fontsize_limit |||| obsolete |
163-
+-------------------+-------------+---------+---------+----------------------------------+
164-
| force_text |||| text in pictures is always |
165-
| | | | | ignored |
166-
+-------------------+-------------+---------+---------+----------------------------------+
167-
| graphics_limit |||| obsolete |
168-
+-------------------+-------------+---------+---------+----------------------------------+
169-
| hdr_info |||| obsolete |
170-
+-------------------+-------------+---------+---------+----------------------------------+
171-
| ignore_alpha |||| |
172-
+-------------------+-------------+---------+---------+----------------------------------+
173-
| ignore_code | ✔️ | ✔️ | ✔️ | |
174-
+-------------------+-------------+---------+---------+----------------------------------+
175-
| ignore_graphics |||| obsolete |
176-
+-------------------+-------------+---------+---------+----------------------------------+
177-
| ignore_images |||| obsolete |
178-
+-------------------+-------------+---------+---------+----------------------------------+
179-
| image_format | ✔️ | ✔️ | ✔️ | |
180-
+-------------------+-------------+---------+---------+----------------------------------+
181-
| image_path | ✔️ | ✔️ | ✔️ | |
182-
+-------------------+-------------+---------+---------+----------------------------------+
183-
| image_size_limit |||| obsolete |
184-
+-------------------+-------------+---------+---------+----------------------------------+
185-
| margins |||| obsolete |
186-
+-------------------+-------------+---------+---------+----------------------------------+
187-
| page_chunks | later | later | later | postponed |
188-
+-------------------+-------------+---------+---------+----------------------------------+
189-
| page_height | later | later | later | postponed |
190-
+-------------------+-------------+---------+---------+----------------------------------+
191-
| page_separators | later | later | later | postponed |
192-
+-------------------+-------------+---------+---------+----------------------------------+
193-
| page_width | later | later | later | postponed |
194-
+-------------------+-------------+---------+---------+----------------------------------+
195-
| pages | ✔️ | ✔️ | ✔️ | |
196-
+-------------------+-------------+---------+---------+----------------------------------+
197-
| show_progress | later | later | later | postponed |
198-
+-------------------+-------------+---------+---------+----------------------------------+
199-
| table_strategy |||| obsolete |
200-
+-------------------+-------------+---------+---------+----------------------------------+
201-
| use_glyphs |||| always show &#xfffd; |
202-
+-------------------+-------------+---------+---------+----------------------------------+
203-
| write_images | ✔️ | ✔️ | ✔️ | |
204-
+-------------------+-------------+---------+---------+----------------------------------+
205-
206-
207-
208-
136+
|PyMuPDF Layout| and |PyMuPDF4LLM| parameter caveats
137+
-----------------------------------------------------
209138

139+
If you have imported ``pymupdf.layout``, |PyMuPDF4LLM| changes its behavior in various areas. New methods become available and some features are no longer supported. Please visit `this site <https://github.com/pymupdf/pymupdf4llm/discussions/327>`_ for a detailed description of the changes. This web site is being kept up to date while we continue to work on improvements.
210140

211141
.. include:: ../footer.rst
