Skip to content

Conversation

@kba
Copy link
Contributor

@kba kba commented Nov 26, 2025

Starting to decouple the functionality:

  • remove OCR from eynollah layout (which has a dedicated eynollah ocr command)
  • remove multi-model binarization
  • set the latest hybrid model as the default for binarization
  • separate model dists:
    • all: contains everything
    • layout: everything needed for the - arguably - most used functionality: layout/readingorder/binarization/enhancement (~3.8GB)
    • ocr: TrOCR and CNN/RNN models (~1.8GB)
    • extra everything else, alternative/older/niche models (~400MB)
  • Make light_mode the default
  • Move non-light_mode specific models to extra.
  • Factor extract_only_images into separate class
  • Factor Ground Truth textline-image pair extraction out of eynollah ocr
  • Refactor 1000 SLOC run method of eynollah ocr into manageable chunks
    - [ ] Refactor eynollah ocr to use the PAGE API instead of etree out of scope It's not difficult but error-prone and better done in a separate PR based on this

@kba kba force-pushed the reduce-complexity branch from 61fab65 to 177d555 Compare November 26, 2025 20:37
@kba kba force-pushed the reduce-complexity branch 2 times, most recently from 1d3ca0d to 766ed50 Compare November 28, 2025 11:50
@kba kba force-pushed the reduce-complexity branch 2 times, most recently from 55e7d7a to 4277bc2 Compare November 28, 2025 13:58
@kba kba force-pushed the reduce-complexity branch from 72dc885 to b161e33 Compare November 28, 2025 14:45
@kba kba marked this pull request as ready for review November 28, 2025 15:54
@kba kba requested a review from vahidrezanezhad November 28, 2025 15:54
Copy link
Contributor

@bertsky bertsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am surprised you did not only make -light and -tll defaults, but removed all the other modes entirely.

I was under the impression that perhaps in certain cases these other modes could still serve a purpose. But even if they don't – removing them now (instead of just the extraction and OCR parts) creates much larger diffs, making merge of #206 and my jdeskew branch much more difficult...

help="upper limit of columns in document image",
)
@click.option(
"--save_org_scale/--no_save_org_scale",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest we get rid of these redundant negative forms.

It is confusing to users, as they don't know what's the default, so to be sure, they must list all things they don't want, making calls quite long.

help="directory of input images (instead of --image)",
type=click.Path(exists=True, file_okay=False),
)
@click.option(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In #206, I added an --overwrite option here for consistency (and internally restructed the binarizer to use the same run vs. run_single pattern). See 086c188

raise ValueError("Must pass either a opencv2 image or an image_path")
if image_path is not None:
image = cv2.imread(image_path)
img_last = 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again, I suggest merging in 086c188 to get separate run and run_single

@@ -585,7 +479,7 @@ def resize_image_with_column_classifier(self, is_image_enhanced, img_bin):

return img, img_new, is_image_enhanced

def resize_and_enhance_image_with_column_classifier(self, light_version):
def resize_and_enhance_image_with_column_classifier(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO this function should now be renamed resize_image_with_column_classifier and the one already existing under that name should be renamed resize_and_enhance_image_with_column_classifier, because in effect of these changes we would have the paradoxical effect that predict_enhancement only gets called from the other one.


return img_scaled_padded#, label_scaled_padded

def do_prediction_new_concept_scatter_nd(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I highly recommend against removing this function. It is still experimental, but can further speed up prediction immensely (because patch processing will be on GPU only instead of CPU-GPU back and forth).

prediction_regions = self.do_prediction(patches, img, model_region, marginal_of_patch_percent=0.1)
prediction_regions = resize_image(prediction_regions, img_height_h, img_width_h)
self.logger.debug("exit extract_text_regions")
return prediction_regions, prediction_regions2
return prediction_regions, None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something went wrong here. Did you want to remove that entire (unused) function extract_text_regions? Better remove cleanly instead of cripple it...

Comment on lines +2321 to +2324
rotation_not_90_func(image_page, textline_mask_tot, text_regions_p,
table_prediction, slope_deskew)

text_regions_p_1_n = resize_image(text_regions_p_1_n,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: this section will yield lots of conflicts with #206 ...

Comment on lines 83 to 130
def build_pagexml_no_full_layout(
self, found_polygons_text_region,
page_coord, order_of_texts, id_of_texts,
all_found_textline_polygons,
all_box_coord,
found_polygons_text_region_img,
found_polygons_marginals_left, found_polygons_marginals_right,
all_found_textline_polygons_marginals_left, all_found_textline_polygons_marginals_right,
all_box_coord_marginals_left, all_box_coord_marginals_right,
slopes, slopes_marginals_left, slopes_marginals_right,
cont_page, polygons_seplines,
found_polygons_tables,
**kwargs):
self,
*,
found_polygons_text_region,
page_coord,
order_of_texts,
all_found_textline_polygons,
all_box_coord,
found_polygons_text_region_img,
found_polygons_marginals_left,
found_polygons_marginals_right,
all_found_textline_polygons_marginals_left,
all_found_textline_polygons_marginals_right,
all_box_coord_marginals_left,
all_box_coord_marginals_right,
slopes,
slopes_marginals_left,
slopes_marginals_right,
cont_page,
polygons_seplines,
found_polygons_tables,
):
return self.build_pagexml_full_layout(
found_polygons_text_region, [],
page_coord, order_of_texts, id_of_texts,
all_found_textline_polygons, [],
all_box_coord, [],
found_polygons_text_region_img, found_polygons_tables, [],
found_polygons_marginals_left, found_polygons_marginals_right,
all_found_textline_polygons_marginals_left, all_found_textline_polygons_marginals_right,
all_box_coord_marginals_left, all_box_coord_marginals_right,
slopes, [], slopes_marginals_left, slopes_marginals_right,
cont_page, polygons_seplines,
**kwargs)
found_polygons_text_region=found_polygons_text_region,
found_polygons_text_region_h=[],
page_coord=page_coord,
order_of_texts=order_of_texts,
all_found_textline_polygons=all_found_textline_polygons,
all_found_textline_polygons_h=[],
all_box_coord=all_box_coord,
all_box_coord_h=[],
found_polygons_text_region_img=found_polygons_text_region_img,
found_polygons_tables=found_polygons_tables,
found_polygons_drop_capitals=[],
found_polygons_marginals_left=found_polygons_marginals_left,
found_polygons_marginals_right=found_polygons_marginals_right,
all_found_textline_polygons_marginals_left=all_found_textline_polygons_marginals_left,
all_found_textline_polygons_marginals_right=all_found_textline_polygons_marginals_right,
all_box_coord_marginals_left=all_box_coord_marginals_left,
all_box_coord_marginals_right=all_box_coord_marginals_right,
slopes=slopes,
slopes_h=[],
slopes_marginals_left=slopes_marginals_left,
slopes_marginals_right=slopes_marginals_right,
cont_page=cont_page,
polygons_seplines=polygons_seplines,
)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not really in favour of turning these identifiers into proper kwargs. But if you must, then why still repeat all of them (in the delegation pattern), when you could just

def build_pagexml_no_full_layout(**kwargs):
    return self.build_pagexml_full_layout(
        found_polygons_text_region_h=[],
        all_found_textline_polygons_h=[],
        all_box_coord_h=[],
        found_polygons_drop_capitals=[],
        slopes_h=[],
        **kwargs)

?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...or (as discussed at last) even better provide None as default everywhere, so the caller does not need to pass empty lists. Then there will also no longer be any need to differentiate no/full layout calls, only build_pagexml().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I converted this to kwargs only because there are so many arguments to that method now, it was hard to find the errors resulting from the refactoring. I am open to simplifying this again once the dust has settled.

@kba
Copy link
Contributor Author

kba commented Dec 2, 2025

I was under the impression that perhaps in certain cases these other modes could still serve a purpose. But even if they don't – removing them now (instead of just the extraction and OCR parts) creates much larger diffs, making merge of #206 and my jdeskew branch much more difficult...

If it's just about reducing conflict, I can reinstate those code branches temporarily - @vahidrezanezhad knows best under what circumstances the original "heavy" approach might still be useful - but we have agreed that the burden of maintaining both is too high to keep "heavy".

Let's discuss in a call how I can make your life easier with the git conflicts.

@@ -1615,52 +1370,10 @@ def extract_text_regions(self, img, patches, cols):
img_width_h = img.shape[1]
model_region = self.model_zoo.get("region_fl") if patches else self.model_zoo.get("region_fl_np")

if not patches:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO for self: Why only delete these if-clauses, deleted from wrong part of the code?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants