Commit a7ba4a2
committed
"Fix" sequence breaker tokenization
Most tokenizers encode punctuation tokens differently depending on where they occur in the input, and which tokens surround them. With the default sequence breakers, the appropriate encoding usually corresponds to the encoding produced when the token occurs after a word, rather than by itself. To emulate this, prefix the token with "a" before encoding, and extract the final token of the result.
See LostRuins/koboldcpp#982 for a correct solution to this problem.1 parent 8177528 commit a7ba4a2
1 file changed
+7
-2
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
116 | 116 | | |
117 | 117 | | |
118 | 118 | | |
119 | | - | |
| 119 | + | |
| 120 | + | |
| 121 | + | |
| 122 | + | |
| 123 | + | |
| 124 | + | |
120 | 125 | | |
121 | 126 | | |
122 | 127 | | |
123 | 128 | | |
124 | 129 | | |
125 | 130 | | |
126 | | - | |
| 131 | + | |
127 | 132 | | |
128 | 133 | | |
129 | 134 | | |
| |||
0 commit comments