Commit a7ba4a2

"Fix" sequence breaker tokenization
Most tokenizers encode punctuation tokens differently depending on where they occur in the input, and which tokens surround them. With the default sequence breakers, the appropriate encoding usually corresponds to the encoding produced when the token occurs after a word, rather than by itself. To emulate this, prefix the token with "a" before encoding, and extract the final token of the result. See LostRuins/koboldcpp#982 for a correct solution to this problem.
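The effect described above can be reproduced with a toy model. The sketch below is an illustration only: `ToyTokenizer`, its vocabulary, and `breaker_token` are hypothetical and not part of mistral.rs or the `tokenizers` crate. It assumes a SentencePiece-style tokenizer that marks the start of the text with `▁`, so `:` on its own and `:` after a word map to different ids.

```rust
use std::collections::HashMap;

/// Toy stand-in for a real tokenizer (hypothetical, for illustration only).
/// It prepends '▁' to the input, so punctuation at the very start of the
/// text is encoded differently from punctuation that follows a word.
struct ToyTokenizer {
    vocab: HashMap<&'static str, u32>,
}

impl ToyTokenizer {
    fn new() -> Self {
        let vocab = HashMap::from([
            ("▁a", 1),
            ("▁:", 2), // ":" at the start of the text
            (":", 3),  // ":" after a word
        ]);
        Self { vocab }
    }

    /// Greedy longest-match encoding over the '▁'-prefixed text.
    fn encode(&self, text: &str) -> Vec<u32> {
        let marked = format!("▁{text}");
        let chars: Vec<char> = marked.chars().collect();
        let mut ids = Vec::new();
        let mut start = 0;
        while start < chars.len() {
            let mut end = chars.len();
            let mut matched = false;
            // Try the longest candidate piece first, then shrink.
            while end > start {
                let piece: String = chars[start..end].iter().collect();
                if let Some(&id) = self.vocab.get(piece.as_str()) {
                    ids.push(id);
                    start = end;
                    matched = true;
                    break;
                }
                end -= 1;
            }
            if !matched {
                start += 1; // skip characters the toy vocabulary does not cover
            }
        }
        ids
    }
}

/// The workaround from the commit message: encode "a" + breaker and keep
/// only the final token id. Returns None if the encoding is empty.
fn breaker_token(tok: &ToyTokenizer, breaker: &str) -> Option<u32> {
    tok.encode(&format!("a{breaker}")).last().copied()
}

fn main() {
    let tok = ToyTokenizer::new();
    println!("{:?}", tok.encode(":")); // prints [2]
    println!("{:?}", breaker_token(&tok, ":")); // prints Some(3)
}
```

Encoding `:` by itself yields the start-of-text variant (id 2), while the prefixed encoding yields the after-a-word variant (id 3), which is the one that actually appears in generated text. As the commit message notes, this only handles breakers that end in a single unambiguous token.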
1 parent 8177528 commit a7ba4a2

1 file changed, +7 -2 lines changed

mistralrs-core/src/sampler.rs

Lines changed: 7 additions & 2 deletions
@@ -116,14 +116,19 @@ impl DrySamplingParamsInner {
                 .into_iter()
                 .map(|breaker| {
                     tokenizer
-                        .encode(breaker.clone(), true)
+                        // Prefix with 'a' to get the correct encoding of the token at the end of a text.
+                        //
+                        // FIXME: This is a hack. See https://github.com/LostRuins/koboldcpp/pull/982
+                        //        for the correct solution which covers multi-token sequence breakers
+                        //        and ambiguous encodings.
+                        .encode(format!("a{breaker}"), true)
                         .map_err(anyhow::Error::msg)
                         .map(|enc| {
                             let ids = enc.get_ids();
                             if !ids.is_empty() {
                                 None
                             } else {
-                                Some(ids[0])
+                                Some(ids[ids.len() - 1])
                             }
                         })
                 })
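The token-selection step in this hunk can also be written without manual indexing. The following is a minimal sketch, assuming the ids arrive as a `&[u32]` slice (which is what `enc.get_ids()` returns in the `tokenizers` crate); `last_token` is a hypothetical helper name, not something defined in mistral.rs. Note that the context lines in the hunk return `None` when `ids` is non-empty and index into `ids` otherwise, which reads as inverted; `slice::last` avoids the question by returning `None` only for an empty slice.

```rust
/// Hypothetical helper: return the final token id of an encoding,
/// or None when the encoding produced no tokens.
fn last_token(ids: &[u32]) -> Option<u32> {
    // `last()` yields Option<&u32>; `copied()` turns it into Option<u32>.
    ids.last().copied()
}

fn main() {
    println!("{:?}", last_token(&[1, 3])); // prints Some(3)
    println!("{:?}", last_token(&[])); // prints None
}
```

Because it never indexes directly, this form cannot panic on an empty encoding.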