fix: binary file check #377
Conversation
@NicolasIRAGNE I've fixed the encoding issue on Windows: we now use UTF-16 LE instead of UTF-16. I'll let you decide what we should do from here.
I'll take a look whenever I have a bit of time. I'm not sure what this does, but it seems better than the initial check. I am curious what the actual error is, though. Do we read too many characters? What happens if the file is just a huge number of ASCII characters? Also, regarding #375: does this fix that problem as well?
The current problem:
Yes. This is the main purpose of this PR, along with another one: the context should now ignore many more binaries (if not all of them), so it should be much more usable for LLMs on various repositories.
Hi there! We haven’t seen activity on this pull request for 45 days, so I’m marking it as stale.
Hi there! We haven’t heard anything for 10 days, so I’m closing this pull request. Feel free to reopen if you’d like to continue working on it. Thanks!
Closes #375
Topic
tiktoken sometimes crashes on binary files. After digging, I found that we already had a check to ignore binary files, but it was not strong enough.
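For illustration only, here is a minimal Python sketch of what a stricter binary check could look like. It is not the code from this PR: the names `looks_binary`, `is_binary_file`, and `SAMPLE_SIZE`, the 10% control-character threshold, and the choice of Python are all assumptions. The idea is to inspect only the first few kilobytes, accept UTF-16 text that carries a byte-order mark, reject anything else containing NUL bytes, and reject samples that are not valid UTF-8.

```python
import codecs

SAMPLE_SIZE = 8192  # assumption: how many leading bytes to inspect


def looks_binary(sample: bytes) -> bool:
    """Heuristically guess whether a byte sample comes from a binary file."""
    if not sample:
        return False  # treat empty files as text

    # UTF-16 text legitimately contains NUL bytes, so accept a BOM first.
    if sample.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return False

    # NUL bytes almost never appear in UTF-8 or ASCII text.
    if b"\x00" in sample:
        return True

    # An incremental decoder with final=False tolerates a multi-byte
    # character that was cut off at the end of the sample.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        text = decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return True

    # Assumed threshold: more than 10% control characters means binary.
    control = sum(1 for ch in text if ord(ch) < 32 and ch not in "\t\n\r\f")
    return control / max(len(text), 1) > 0.1


def is_binary_file(path: str) -> bool:
    """Apply the heuristic to the first SAMPLE_SIZE bytes of a file."""
    with open(path, "rb") as handle:
        return looks_binary(handle.read(SAMPLE_SIZE))
```

Because only a bounded sample is read, a very large file made of plain ASCII characters is classified as text without being loaded in full. A caller would skip any file for which `is_binary_file` returns `True` before handing its contents to the tokenizer.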