[BUG] BERTUnfactoredDisambiguator.pretrained() clips text when special character � exists #159

@ahmadabousetta

Describe the bug
The disambiguator returned by BERTUnfactoredDisambiguator.pretrained() clips the last token of the input when the input contains the special character �.
The disambiguator returned by MLEDisambiguator.pretrained() works fine on the same input and doesn't clip any tokens.

To Reproduce

from camel_tools.disambig.mle import MLEDisambiguator
from camel_tools.disambig.bert import BERTUnfactoredDisambiguator

mle = MLEDisambiguator.pretrained()
brt = BERTUnfactoredDisambiguator.pretrained()

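# Same input for both disambiguators; the third token is the special character �.
tokens = "أعضاء مجلس � الإدارة المحترمون يتفضلون".split()

mle.disambiguate(tokens)  # works fine: no tokens are clipped
brt.disambiguate(tokens)  # bug: the last token is clipped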

Expected behavior
To provide disambiguation for all tokens.
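
One way to check this expectation is to compare the number of results with the number of input tokens. The following is a minimal sketch, not part of CAMeL Tools: `count_missing` is a hypothetical helper, and it assumes that the clipping shows up as a result list shorter than the input. It uses a stand-in list instead of real model output, so it runs without the pretrained models.

```python
def count_missing(tokens, results):
    """Return how many input tokens have no corresponding result.

    Hypothetical helper, not part of CAMeL Tools. Assumes the
    disambiguator returns one result per input token, in order.
    """
    return len(tokens) - len(results)


tokens = "أعضاء مجلس � الإدارة المحترمون يتفضلون".split()

# Stand-in for the clipped output described in this report: the last
# result is absent. In a real run this would be brt.disambiguate(tokens).
clipped_results = tokens[:-1]

print(count_missing(tokens, tokens))           # 0
print(count_missing(tokens, clipped_results))  # 1
```

With the real disambiguators, `count_missing(tokens, brt.disambiguate(tokens))` would be expected to return 0 once the bug is fixed.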

Desktop:

  • WSL2 under Windows 11 Home
  • WSL version: 2.5.9.0
  • Kernel version: 6.6.87.2-1
  • WSLg version: 1.0.66
  • MSRDC version: 1.2.6074
  • Direct3D version: 1.611.1-81528511
  • DXCore version: 10.0.26100.1-240331-1435.ge-release
  • Windows version: 10.0.26100.4349
  • Python version: 3.12
  • CAMeL Tools version: 1.5.6 (from camel_tools.version), installed with pip

Additional context
I understand that � is not an expected character in Arabic text, but it occasionally appears in automated workflows. Thank you.
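
Until this is fixed, a possible workaround in automated workflows is to skip tokens containing the character before disambiguation and keep the output aligned with the input. The sketch below rests on assumptions: the character is taken to be U+FFFD (the Unicode replacement character), `disambiguate_skipping` is a hypothetical helper that is not part of CAMeL Tools, and this report does not confirm that removing the character avoids the clipping. A stand-in function replaces the real disambiguator so the example runs on its own.

```python
REPLACEMENT_CHAR = "\ufffd"  # assumption: the character shown as � in this report


def disambiguate_skipping(tokens, disambiguate_fn, bad_char=REPLACEMENT_CHAR):
    """Run disambiguate_fn only on tokens that do not contain bad_char.

    Returns a list aligned with `tokens`; skipped positions hold None.
    Hypothetical helper, not part of CAMeL Tools.
    """
    kept_positions = [i for i, tok in enumerate(tokens) if bad_char not in tok]
    kept_tokens = [tokens[i] for i in kept_positions]
    results = disambiguate_fn(kept_tokens) if kept_tokens else []

    aligned = [None] * len(tokens)
    # zip stops at the shorter sequence, so any missing results stay None.
    for pos, res in zip(kept_positions, results):
        aligned[pos] = res
    return aligned


def fake_disambiguate(tokens):
    """Stand-in for a real disambiguator: returns one result per token."""
    return [tok.upper() for tok in tokens]


example = ["a", REPLACEMENT_CHAR, "b"]
print(disambiguate_skipping(example, fake_disambiguate))  # ['A', None, 'B']
```

With CAMeL Tools, `brt.disambiguate` would be passed as `disambiguate_fn`. Returning None for skipped positions keeps the result list the same length as the input, so downstream code can still index results by token position.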