Reduce complexity #210

kba · 2025-11-26T19:04:43Z

Starting to decouple the functionality:

remove OCR from eynollah layout (which has a dedicated eynollah ocr command)
remove multi-model binarization
set the latest hybrid model as the default for binarization
separate model dists:
- all: contains everything
- layout: everything needed for the - arguably - most used functionality: layout/readingorder/binarization/enhancement (~3.8GB)
- ocr: TrOCR and CNN/RNN models (~1.8GB)
- extra: everything else, alternative/older/niche models (~400MB)
Make light_mode the default
Move non-light_mode specific models to extra.
Factor extract_only_images into separate class
Factor Ground Truth textline-image pair extraction out of eynollah ocr
Refactor 1000 SLOC run method of eynollah ocr into manageable chunks
~~Refactor eynollah ocr to use the PAGE API instead of etree~~ (out of scope: it's not difficult, but error-prone and better done in a separate PR based on this one)


bertsky

I am surprised you did not only make -light and -tll defaults, but removed all the other modes entirely.

I was under the impression that perhaps in certain cases these other modes could still serve a purpose. But even if they don't – removing them now (instead of just the extraction and OCR parts) creates much larger diffs, making merge of #206 and my jdeskew branch much more difficult...

bertsky · 2025-12-02T10:27:19Z

src/eynollah/cli/cli_enhance.py

+    help="upper limit of columns in document image",
+)
+@click.option(
+    "--save_org_scale/--no_save_org_scale",


I suggest we get rid of these redundant negative forms.

It is confusing to users, as they don't know what's the default, so to be sure, they must list all things they don't want, making calls quite long.
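To illustrate the suggestion above, here is a minimal sketch of a single positive flag whose default is spelled out in the help text. It uses `argparse` from the standard library as a stand-in so that it runs without dependencies; with click, the analogous declaration would be `is_flag=True` instead of the paired `--x/--no_x` form. The `build_parser` helper, the program name and the help string are assumptions for illustration, not the project's actual CLI.

```python
import argparse


def build_parser():
    """Build a parser with one positive boolean flag (hypothetical sketch)."""
    parser = argparse.ArgumentParser(prog="enhance-sketch")
    # Single positive form: if the flag is omitted, the value is False,
    # so users never need to spell out what they do not want.
    parser.add_argument(
        "--save_org_scale",
        action="store_true",
        help="illustrative help text (default: off)",
    )
    return parser


# Parse explicit argument lists instead of sys.argv for demonstration
default_args = build_parser().parse_args([])
enabled_args = build_parser().parse_args(["--save_org_scale"])
```

With this shape, a call only lists the options the user wants to turn on, and `--help` states the default once.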

bertsky · 2025-12-02T10:28:52Z

src/eynollah/cli/cli_binarize.py

+    help="directory of input images (instead of --image)",
+    type=click.Path(exists=True, file_okay=False),
+)
+@click.option(


In #206, I added an --overwrite option here for consistency (and internally restructured the binarizer to use the same run vs. run_single pattern). See 086c188.

src/eynollah/cli/cli.py

bertsky · 2025-12-02T10:45:55Z

src/eynollah/sbb_binarize.py

                raise ValueError("Must pass either a opencv2 image or an image_path")
            if image_path is not None:
                image = cv2.imread(image_path)
            img_last = 0


again, I suggest merging in 086c188 to get separate run and run_single
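As a rough illustration of the run vs. run_single split mentioned above, the following sketch separates per-image work from batch iteration and honours an overwrite switch. The class name `BinarizerSketch`, its method signatures and the placeholder processing step are assumptions made for illustration; they are not taken from 086c188 or from sbb_binarize.py.

```python
from pathlib import Path


class BinarizerSketch:
    """Hypothetical stand-in showing the run / run_single split."""

    def run_single(self, image_path, output_path, overwrite=False):
        """Process one file; return True if an output was written."""
        output_path = Path(output_path)
        if output_path.exists() and not overwrite:
            # Skip existing results unless the caller asked to overwrite
            return False
        data = Path(image_path).read_bytes()
        # Placeholder for the real binarization step (assumption)
        output_path.write_bytes(data)
        return True

    def run(self, dir_in, dir_out, overwrite=False):
        """Process every file in dir_in by delegating to run_single."""
        dir_out = Path(dir_out)
        dir_out.mkdir(parents=True, exist_ok=True)
        written = []
        for image_path in sorted(Path(dir_in).iterdir()):
            if not image_path.is_file():
                continue
            target = dir_out / image_path.name
            if self.run_single(image_path, target, overwrite=overwrite):
                written.append(target)
        return written
```

Keeping the overwrite check inside `run_single` means the single-image and directory code paths behave the same way.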

bertsky · 2025-12-02T11:00:11Z

src/eynollah/eynollah.py

@@ -585,7 +479,7 @@ def resize_image_with_column_classifier(self, is_image_enhanced, img_bin):

        return img, img_new, is_image_enhanced

-    def resize_and_enhance_image_with_column_classifier(self, light_version):
+    def resize_and_enhance_image_with_column_classifier(self):


IMO this function should now be renamed resize_image_with_column_classifier, and the one already existing under that name should be renamed resize_and_enhance_image_with_column_classifier. Otherwise, as a result of these changes, we would have the paradoxical situation that predict_enhancement only gets called from the other one.

bertsky · 2025-12-02T11:01:52Z

src/eynollah/eynollah.py

-
-        return img_scaled_padded#, label_scaled_padded
-
-    def do_prediction_new_concept_scatter_nd(


I highly recommend against removing this function. It is still experimental, but can further speed up prediction immensely (because patch processing will be on GPU only instead of CPU-GPU back and forth).

bertsky · 2025-12-02T11:14:17Z

src/eynollah/eynollah.py

        prediction_regions = self.do_prediction(patches, img, model_region, marginal_of_patch_percent=0.1)
        prediction_regions = resize_image(prediction_regions, img_height_h, img_width_h)
        self.logger.debug("exit extract_text_regions")
-        return prediction_regions, prediction_regions2
+        return prediction_regions, None


Something went wrong here. Did you want to remove that entire (unused) function extract_text_regions? Better to remove it cleanly instead of crippling it...

bertsky · 2025-12-02T11:21:11Z

src/eynollah/eynollah.py

+                    rotation_not_90_func(image_page, textline_mask_tot, text_regions_p,
+                                         table_prediction, slope_deskew)
+
+                text_regions_p_1_n = resize_image(text_regions_p_1_n,


note: this section will yield lots of conflicts with #206 ...

bertsky · 2025-12-02T11:33:10Z

src/eynollah/writer.py

    def build_pagexml_no_full_layout(
-            self, found_polygons_text_region,
-            page_coord, order_of_texts, id_of_texts,
-            all_found_textline_polygons,
-            all_box_coord,
-            found_polygons_text_region_img,
-            found_polygons_marginals_left, found_polygons_marginals_right,
-            all_found_textline_polygons_marginals_left, all_found_textline_polygons_marginals_right,
-            all_box_coord_marginals_left, all_box_coord_marginals_right,
-            slopes, slopes_marginals_left, slopes_marginals_right,
-            cont_page, polygons_seplines,
-            found_polygons_tables,
-            **kwargs):
+        self,
+        *,
+        found_polygons_text_region,
+        page_coord,
+        order_of_texts,
+        all_found_textline_polygons,
+        all_box_coord,
+        found_polygons_text_region_img,
+        found_polygons_marginals_left,
+        found_polygons_marginals_right,
+        all_found_textline_polygons_marginals_left,
+        all_found_textline_polygons_marginals_right,
+        all_box_coord_marginals_left,
+        all_box_coord_marginals_right,
+        slopes,
+        slopes_marginals_left,
+        slopes_marginals_right,
+        cont_page,
+        polygons_seplines,
+        found_polygons_tables,
+    ):
        return self.build_pagexml_full_layout(
-            found_polygons_text_region, [],
-            page_coord, order_of_texts, id_of_texts,
-            all_found_textline_polygons, [],
-            all_box_coord, [],
-            found_polygons_text_region_img, found_polygons_tables, [],
-            found_polygons_marginals_left, found_polygons_marginals_right,
-            all_found_textline_polygons_marginals_left, all_found_textline_polygons_marginals_right,
-            all_box_coord_marginals_left, all_box_coord_marginals_right,
-            slopes, [], slopes_marginals_left, slopes_marginals_right,
-            cont_page, polygons_seplines,
-            **kwargs)
+            found_polygons_text_region=found_polygons_text_region,
+            found_polygons_text_region_h=[],
+            page_coord=page_coord,
+            order_of_texts=order_of_texts,
+            all_found_textline_polygons=all_found_textline_polygons,
+            all_found_textline_polygons_h=[],
+            all_box_coord=all_box_coord,
+            all_box_coord_h=[],
+            found_polygons_text_region_img=found_polygons_text_region_img,
+            found_polygons_tables=found_polygons_tables,
+            found_polygons_drop_capitals=[],
+            found_polygons_marginals_left=found_polygons_marginals_left,
+            found_polygons_marginals_right=found_polygons_marginals_right,
+            all_found_textline_polygons_marginals_left=all_found_textline_polygons_marginals_left,
+            all_found_textline_polygons_marginals_right=all_found_textline_polygons_marginals_right,
+            all_box_coord_marginals_left=all_box_coord_marginals_left,
+            all_box_coord_marginals_right=all_box_coord_marginals_right,
+            slopes=slopes,
+            slopes_h=[],
+            slopes_marginals_left=slopes_marginals_left,
+            slopes_marginals_right=slopes_marginals_right,
+            cont_page=cont_page,
+            polygons_seplines=polygons_seplines,
+        )



I am not really in favour of turning these identifiers into proper kwargs. But if you must, then why still repeat all of them (in the delegation pattern), when you could just

    def build_pagexml_no_full_layout(self, **kwargs):
        return self.build_pagexml_full_layout(
            found_polygons_text_region_h=[],
            all_found_textline_polygons_h=[],
            all_box_coord_h=[],
            found_polygons_drop_capitals=[],
            slopes_h=[],
            **kwargs)

?

...or (as we discussed recently) even better: provide None as the default everywhere, so the caller does not need to pass empty lists. Then there would also no longer be any need to differentiate between no/full layout calls; there would only be build_pagexml().

I converted this to kwargs only because that method now has so many arguments that it was hard to find the errors resulting from the refactoring. I am open to simplifying this again once the dust has settled.
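The idea from the thread above of defaulting the optional arguments to None, so that one builder serves both cases, could look roughly like the following. This is a minimal sketch under assumptions: `build_pagexml` here is a hypothetical free function that takes only a handful of the parameters from the diff and returns a plain dict instead of a PAGE-XML object.

```python
def build_pagexml(
    *,
    found_polygons_text_region,
    found_polygons_text_region_h=None,
    found_polygons_drop_capitals=None,
    slopes,
    slopes_h=None,
):
    """Hypothetical single builder; optional inputs default to None."""
    return {
        "regions": list(found_polygons_text_region),
        # None means "not provided" and is normalized to an empty list
        "headings": list(found_polygons_text_region_h or []),
        "drop_capitals": list(found_polygons_drop_capitals or []),
        "slopes": list(slopes),
        "slopes_h": list(slopes_h or []),
    }


# A "no full layout" caller simply omits the optional arguments
simple = build_pagexml(found_polygons_text_region=["r1"], slopes=[0.0])

# A "full layout" caller passes them explicitly
full = build_pagexml(
    found_polygons_text_region=["r1"],
    found_polygons_text_region_h=["h1"],
    slopes=[0.0],
    slopes_h=[0.5],
)
```

Because the parameters are keyword-only, required and optional ones can be declared in any order, and a misspelled argument name fails immediately with a TypeError instead of silently shifting positions.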

kba · 2025-12-02T13:58:31Z

> I was under the impression that perhaps in certain cases these other modes could still serve a purpose. But even if they don't – removing them now (instead of just the extraction and OCR parts) creates much larger diffs, making merge of #206 and my jdeskew branch much more difficult...

If it's just about reducing conflict, I can reinstate those code branches temporarily - @vahidrezanezhad knows best under what circumstances the original "heavy" approach might still be useful - but we have agreed that the burden of maintaining both is too high to keep "heavy".

Let's discuss in a call how I can make your life easier with the git conflicts.


kba · 2025-12-04T14:10:10Z

src/eynollah/eynollah.py

@@ -1615,52 +1370,10 @@ def extract_text_regions(self, img, patches, cols):
        img_width_h = img.shape[1]
        model_region = self.model_zoo.get("region_fl") if patches else self.model_zoo.get("region_fl_np")

-        if not patches:


TODO for self: Why were only these if-clauses deleted? Were they deleted from the wrong part of the code?

kba and others added 8 commits November 26, 2025 18:12

🔥 remove OCR option from eynollah layout

5a1900e

reorganize cli

82266f8

drop obsolete multi-model binarization

e503c1a

🔥 remove torch pinning

000af16

models: split into layout, extra and ocr

095b36c

layout: everything not OCR or extra; ocr: trocr/cnnrnn models; extra: obsolete or niche models

fix imports from src/cli/cli_*/*_cli

ca83cf9

🔥 drop light_version/textline_light (now default and implied)

83e8b28

factor out extract_only_images as eynollah extract-images

177d555

kba force-pushed the reduce-complexity branch from 61fab65 to 177d555 Compare November 26, 2025 20:37

remove more branches after textline_light default true

4aa9543

kba force-pushed the reduce-complexity branch 2 times, most recently from 1d3ca0d to 766ed50 Compare November 28, 2025 11:50

kba and others added 3 commits November 28, 2025 12:52

enforce kwargs for writer.build_...

c24cf94

eynollah.py: fix kwargs to writer

5171e09

💀 remove dead code from eynollah.py

9bcfeab

kba force-pushed the reduce-complexity branch 2 times, most recently from 55e7d7a to 4277bc2 Compare November 28, 2025 13:58

kba added 3 commits November 28, 2025 15:12

CI: do not upgrade (now-unpinned) torch

951bd2f

move line-gt extraction out of ocr to eynollah-training

30f9c69

🔥 refactor eynollah ocr

b161e33


kba force-pushed the reduce-complexity branch from 72dc885 to b161e33 Compare November 28, 2025 14:45

kba marked this pull request as ready for review November 28, 2025 15:54

kba requested a review from vahidrezanezhad November 28, 2025 15:54

bertsky reviewed Dec 2, 2025


kba and others added 4 commits December 2, 2025 15:00

log to STDERR not STDOUT

51abe96

Restored correct functionality of the extract_only_images mode and cleaned up the argument handling

d687d86

Fix eynollah ocr --help so it works again

6ac37af

Restore correct execution of export_textline_images_and_text

7bf5e07

vahidrezanezhad approved these changes Dec 3, 2025


kba commented Dec 4, 2025
