Skip to content

Conversation

@grider-transwithai
Copy link

@grider-transwithai grider-transwithai commented Nov 25, 2025

Purpose

Fix two critical bugs in HunyuanVL implementation that cause incorrect spatial localization (e.g., "squeezed" Y coordinates in detection masks):

  • XD-RoPE height index zeroing: In get_xdrope_input_positions, the h_index was incorrectly zeroed after being computed, removing vertical spatial awareness for image tokens. Fixed by setting t_index = 0 instead (correct for static images since temporal dimension is always 1).
  • smart_resize argument swap: The function signature is smart_resize(height, width, ...) returning (h_bar, w_bar), but it was called with (width, height) and assigned to (resized_width, resized_height). This caused incorrect resizing for non-square aspect ratios.

Test Plan

Tested with HunyuanOCR model on images with various aspect ratios. Before the fix, detected text bounding boxes were vertically compressed. After the fix, bounding boxes correctly match text positions.

Test Result

  • Before: Detection mask Y coordinates were "squeezed" - vertical positions incorrect
  • After: Detection mask correctly covers full image height with accurate text localization

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Fix two critical bugs in HunyuanVL implementation:

1. XD-RoPE height index zeroing: The h_index was incorrectly being
   zeroed instead of t_index, removing vertical spatial awareness
   for image tokens. Fixed by setting t_index to 0 (correct for
   static images) instead of h_index.

2. smart_resize argument swap: The function was called with (width,
   height) but expects (height, width), and return values were
   assigned incorrectly. Fixed argument order and assignment to
   match function signature.

Signed-off-by: grider-transwithai <grider@transwith.ai>
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses two critical bugs in the HunyuanVL implementation. The first fix correctly handles the temporal index (t_index) for XD-RoPE, which resolves an issue with vertical spatial awareness for image tokens. The second fix corrects a swapped argument issue in the smart_resize function call, ensuring correct image resizing for non-square aspect ratios. Both fixes are correct and well-justified. I've added one suggestion to improve code robustness by using keyword arguments in the smart_resize call, which would have prevented the original bug.

Copy link
Member

@ywang96 ywang96 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for catching this bug! I'v checked versus the official impl and these look reasonable to me. cc @sergey-tc @ManaEstras

@ywang96 ywang96 added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 25, 2025
@ywang96 ywang96 enabled auto-merge (squash) November 25, 2025 18:58
@ywang96 ywang96 disabled auto-merge November 27, 2025 07:58
@ywang96
Copy link
Member

ywang96 commented Nov 27, 2025

Hey @grider-transwithai! We have communicated with the Tencent HunyuanOCR team and there's another bug we need to fix on top of this PR, but unfortunately I don't have access to modify your PR so I'm going to open another one that carries over your change and make you co-author. Thanks!

@ywang96 ywang96 mentioned this pull request Nov 27, 2025
5 tasks
@ywang96
Copy link
Member

ywang96 commented Nov 27, 2025

Closing as superseded by #29593

@ywang96 ywang96 closed this Nov 27, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready ONLY add when PR is ready to merge/full CI is needed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants