Describe the bug
BERTUnfactoredDisambiguator.pretrained() clips the last token of the text when the text contains the special character �.
MLEDisambiguator.pretrained() works fine and doesn't clip tokens.
To Reproduce
from camel_tools.disambig.mle import MLEDisambiguator
from camel_tools.disambig.bert import BERTUnfactoredDisambiguator
mle = MLEDisambiguator.pretrained()
brt = BERTUnfactoredDisambiguator.pretrained()
mle.disambiguate("أعضاء مجلس � الإدارة المحترمون يتفضلون".split())
brt.disambiguate("أعضاء مجلس � الإدارة المحترمون يتفضلون".split())
Expected behavior
Disambiguation should be provided for all tokens, including the last one.
Screenshots
Desktop (please complete the following information):
- WSL2 under Windows 11 home.
- WSL version: 2.5.9.0
- Kernel version: 6.6.87.2-1
- WSLg version: 1.0.66
- MSRDC version: 1.2.6074
- Direct3D version: 1.611.1-81528511
- DXCore version: 10.0.26100.1-240331-1435.ge-release
- Windows version: 10.0.26100.4349
- Python version: 3.12
- CAMeL Tools version and installation source: 1.5.6, installed with pip
Additional context
I understand it's not an expected character in Arabic text. But it happens occasionally in automated workflows. Thank you.
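A possible stopgap for automated workflows is to strip the character before calling the disambiguator. The sketch below assumes the character is U+FFFD (the Unicode replacement character). The helper name `drop_replacement_tokens` is hypothetical and not part of CAMeL Tools. It has not been confirmed that this avoids the clipping in BERTUnfactoredDisambiguator; it only removes the input that triggers it in the reproduction above.

```python
# Assumption: the problematic character is U+FFFD (Unicode replacement character).
REPLACEMENT_CHAR = "\ufffd"


def drop_replacement_tokens(tokens):
    """Strip U+FFFD from every token and drop tokens that become empty.

    Hypothetical helper for illustration; not part of CAMeL Tools.
    """
    cleaned = (tok.replace(REPLACEMENT_CHAR, "") for tok in tokens)
    return [tok for tok in cleaned if tok]


# Same sentence as in the reproduction, with the character written as an escape.
tokens = "أعضاء مجلس \ufffd الإدارة المحترمون يتفضلون".split()
safe_tokens = drop_replacement_tokens(tokens)

# safe_tokens can then be passed to brt.disambiguate(safe_tokens).
```

Note that dropping tokens changes the token indices, so any downstream alignment with the original text has to account for the removed positions.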
