src/whisper.cpp (32 changes: 15 additions & 17 deletions)
@@ -453,7 +453,7 @@ struct whisper_vocab {
     }
 
     int num_languages() const {
-        return n_vocab - 51765 - (is_multilingual() ? 1 : 0);
+        return n_vocab - token_to_id.size() - 1509;
     }
Comment on lines 455 to 457 (Author):
Modified num_languages() Function

The num_languages() function has been redesigned to calculate the number of supported languages dynamically, replacing the original hardcoded logic that relied on fixed vocabulary-size constants (e.g., the 51865-token threshold for multilingual models).

Rationale (Aligned with OpenAI Whisper's Tokenizer Design)

Per OpenAI’s official Whisper tokenizer implementation (tokenizer.py#L340-L351):

  • Language-specific special tokens (e.g., <|ja|>, <|en|>) are arranged consecutively between the <|startoftranscript|> and <|translate|> tokens in the vocabulary.
  • The total number of non-language special tokens is fixed at 1509: 1501 timestamp tokens plus 8 core functional tokens (<|endoftext|>, <|startoftranscript|>, <|translate|>, <|transcribe|>, <|startoflm|>, <|startofprev|>, <|nospeech|>, <|notimestamps|>). A standalone sketch of the resulting calculation follows below.
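
As an illustration of that arithmetic, here is a minimal standalone sketch (not code from the PR). The base-vocabulary size of 50257 and the large-v3 figures are assumptions about the standard OpenAI checkpoints, used only to sanity-check the formula:

#include <cstdio>

// 1501 timestamp tokens + 8 core functional tokens, per the list above
constexpr int kNonLanguageSpecials = 1509;

// Every id that is neither a base BPE token nor one of the 1509
// non-language specials must be a language token.
int num_languages(int n_vocab, int base_vocab_size) {
    return n_vocab - base_vocab_size - kNonLanguageSpecials;
}

int main() {
    // assumed: standard multilingual Whisper has 51865 ids over 50257 base tokens
    printf("multilingual: %d languages\n", num_languages(51865, 50257)); // 99
    // assumed: large-v3 adds one more language token (51866 ids)
    printf("large-v3:     %d languages\n", num_languages(51866, 50257)); // 100
    return 0;
}

The old hardcoded expression, n_vocab - 51765 - (is_multilingual() ? 1 : 0), yields the same 99 and 100 for these two models, but breaks for any vocabulary that deviates from those constants.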

Author:
I think the method that calculates the token ids between <|startoftranscript|> and <|translate|> is better, because the sizes of token_to_id and id_to_token change after the special tokens are loaded in whisper.cpp#L1641-L1672; a sketch of this alternative follows below.

 };
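
A minimal sketch of the alternative the author suggests above (illustrative only, not code from the PR), deriving the count from the fixed special-token ids rather than the mutable map size:

// Ids between <|startoftranscript|> and <|translate|> are exactly the
// language tokens: sot, <lang_1>, ..., <lang_N>, translate
int num_languages() const {
    return token_translate - token_sot - 1;
}

Unlike token_to_id.size(), token_sot and token_translate are assigned once during model loading, so the result stays correct even after the special tokens are appended to the token_to_id and id_to_token maps.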

@@ -1621,22 +1621,20 @@ static bool whisper_model_load(struct whisper_model_loader * loader, whisper_con
             //printf("%s: vocab[%d] = '%s'\n", __func__, i, word.c_str());
         }
 
-        vocab.n_vocab = model.hparams.n_vocab;
-        if (vocab.is_multilingual()) {
-            vocab.token_eot++;
-            vocab.token_sot++;
-
-            // account for variable number of language tokens
-            const int dt = vocab.num_languages() - 98;
-
-            vocab.token_translate  += dt;
-            vocab.token_transcribe += dt;
-            vocab.token_solm       += dt;
-            vocab.token_prev       += dt;
-            vocab.token_nosp       += dt;
-            vocab.token_not        += dt;
-            vocab.token_beg        += dt;
-        }
+        size_t common_vocab_size = vocab.token_to_id.size(); // common vocab size, excluding special tokens
+        vocab.n_vocab = model.hparams.n_vocab;               // all tokens, including special tokens
+
+        vocab.token_eot = common_vocab_size;     // <|endoftext|>
+        vocab.token_sot = common_vocab_size + 1; // <|startoftranscript|>
+        // [common_vocab_size + 2, vocab.n_vocab - 1507) are language tokens
+        // num_languages = vocab.token_translate - vocab.token_sot - 1 = vocab.n_vocab - vocab.token_to_id.size() - 1509
+        vocab.token_translate  = vocab.n_vocab - 1507; // <|translate|>
+        vocab.token_transcribe = vocab.n_vocab - 1506; // <|transcribe|>
+        vocab.token_solm       = vocab.n_vocab - 1505; // <|startoflm|>
+        vocab.token_prev       = vocab.n_vocab - 1504; // <|startofprev|>
+        vocab.token_nosp       = vocab.n_vocab - 1503; // <|nospeech|>
+        vocab.token_not        = vocab.n_vocab - 1502; // <|notimestamps|>
+        vocab.token_beg        = vocab.n_vocab - 1501; // timestamps from <|0.00|> to <|30.00|>, 1501 tokens
 
         if (n_vocab < model.hparams.n_vocab) {
             WHISPER_LOG_INFO("%s: adding %d extra tokens\n", __func__, model.hparams.n_vocab - n_vocab);
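
As a sanity check on the new constants, a worked example (not part of the PR) assuming the standard multilingual vocabulary, i.e., n_vocab = 51865 with 50257 base tokens present in token_to_id at this point of loading:

#include <cassert>

int main() {
    const int n_vocab           = 51865; // assumed: standard multilingual model
    const int common_vocab_size = 50257; // assumed: base BPE tokens loaded so far

    const int token_eot       = common_vocab_size;     // <|endoftext|>
    const int token_sot       = common_vocab_size + 1; // <|startoftranscript|>
    const int token_translate = n_vocab - 1507;        // <|translate|>
    const int token_beg       = n_vocab - 1501;        // <|0.00|>

    // well-known ids of the multilingual tokenizer
    assert(token_eot       == 50257);
    assert(token_sot       == 50258);
    assert(token_translate == 50358); // language tokens occupy 50259..50357
    assert(token_beg       == 50364);

    // 99 languages, by both counting methods
    assert(token_translate - token_sot - 1    == 99);
    assert(n_vocab - common_vocab_size - 1509 == 99);
    return 0;
}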