
Conversation


@Jaffe2718 Jaffe2718 commented Dec 4, 2025

Description

This PR addresses the tokenizer index out-of-bounds issue when using custom Whisper models with modified vocabulary sizes, as reported in #3392.

The problem occurs when converting models like efwkjn/whisper-ja-anime-v0.1, efwkjn/whisper-ja-anime-v0.2, and efwkjn/whisper-ja-anime-v0.3 to ggml format using convert-h5-to-ggml.py and running them with whisper.cpp. These models have a vocabulary size of 20480 (including special tokens), which differs from the official Whisper models (51864 for monolingual, 51865 for multilingual). The hardcoded special token IDs in whisper.cpp cause index out-of-bounds errors when using these custom models.
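To make the failure mode concrete (the exact ids below follow OpenAI's standard multilingual layout, which whisper.cpp assumes by default): <|endoftext|> and <|startoftranscript|> are expected at ids around 50257–50258 and the first timestamp token near 50364, all of which lie far beyond a 20480-entry vocabulary, so looking these ids up in id_to_token or the token embedding table reads past the end of the arrays and can crash with a segmentation fault.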

Solution

The solution dynamically calculates special token IDs based on the actual vocabulary size and structure, instead of using hardcoded values:

  1. After loading the vocabulary and establishing id-to-token mappings, we determine:

    • vocab.n_vocab: The total size of the token embedding table (referred to as emb_size below)
    • common_vocab_size: The number of regular (non-special) tokens (size_t common_vocab_size = vocab.token_to_id.size())
  2. Following OpenAI's Whisper token arrangement principles (special tokens are placed consecutively after regular tokens), we calculate the ranges:

    • [0, common_vocab_size): Regular tokens
    • common_vocab_size: <|endoftext|>
    • common_vocab_size + 1: <|startoftranscript|>
    • [common_vocab_size + 2, emb_size - 1507): Language tokens (e.g. <|en|>, <|ja|>)
    • emb_size - 1507: <|translate|>
    • emb_size - 1506: <|transcribe|>
    • emb_size - 1505: <|startoflm|>
    • emb_size - 1504: <|startofprev|>
    • emb_size - 1503: <|nospeech|>
    • emb_size - 1502: <|notimestamps|>
    • [emb_size - 1501, emb_size): Timestamp tokens (1501 tokens from <|0.00|> to <|30.00|>)
  3. The total number of non-language special tokens is 1509 (1501 timestamps + 8 other special tokens).

  4. The number of language tokens is calculated as: vocab.n_vocab - common_vocab_size - 1509

This approach dynamically adapts to different vocabulary sizes and maintains compatibility with both official and custom Whisper models.
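To make the arithmetic concrete, here is a minimal standalone sketch of the layout above (illustrative only, not the exact diff in this PR; the struct, field, and function names are invented for the example). With the official multilingual values (n_vocab = 51865, 50257 regular tokens) it reproduces the well-known ids: eot = 50257, sot = 50258, translate = 50358, first timestamp = 50364, and 99 languages.

// Sketch: derive the special token ids from the vocabulary layout described above.
// n_vocab is the embedding size; common_vocab_size is vocab.token_to_id.size().
#include <cstdio>

struct special_ids {
    int eot;         // <|endoftext|>
    int sot;         // <|startoftranscript|>
    int lang_begin;  // first language token, e.g. <|en|>
    int lang_end;    // one past the last language token
    int translate;   // <|translate|>
    int transcribe;  // <|transcribe|>
    int solm;        // <|startoflm|>
    int prev;        // <|startofprev|>
    int nosp;        // <|nospeech|>
    int not_ts;      // <|notimestamps|>
    int beg;         // <|0.00|>, first of the 1501 timestamp tokens
};

static special_ids compute_special_ids(int n_vocab, int common_vocab_size) {
    special_ids t;
    t.eot        = common_vocab_size;      // regular tokens occupy [0, common_vocab_size)
    t.sot        = common_vocab_size + 1;
    t.lang_begin = common_vocab_size + 2;
    t.lang_end   = n_vocab - 1507;         // language tokens end right before <|translate|>
    t.translate  = n_vocab - 1507;
    t.transcribe = n_vocab - 1506;
    t.solm       = n_vocab - 1505;
    t.prev       = n_vocab - 1504;
    t.nosp       = n_vocab - 1503;
    t.not_ts     = n_vocab - 1502;
    t.beg        = n_vocab - 1501;         // timestamps fill [n_vocab - 1501, n_vocab)
    return t;
}

int main() {
    // Official multilingual vocabulary: 50257 regular tokens, n_vocab = 51865.
    const special_ids t = compute_special_ids(51865, 50257);
    std::printf("eot=%d sot=%d translate=%d beg=%d languages=%d\n",
                t.eot, t.sot, t.translate, t.beg, t.lang_end - t.lang_begin);
    // Prints: eot=50257 sot=50258 translate=50358 beg=50364 languages=99
    return 0;
}

Called with a custom model's values, e.g. n_vocab = 20480 and its corresponding common_vocab_size, the same formulas yield ids that stay inside the embedding table, which is what eliminates the out-of-bounds accesses.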

Fix #3392

Comment on lines 455 to 457
    int num_languages() const {
-       return n_vocab - 51765 - (is_multilingual() ? 1 : 0);
+       return n_vocab - token_to_id.size() - 1509;
    }
@Jaffe2718 (Author)

Modified num_languages() Function

The num_languages() function has been redesigned to dynamically calculate the number of supported languages, replacing the original hardcoded logic that relied on fixed vocabulary size thresholds (e.g., 51865 for multilingual models).

Rationale (Aligned with OpenAI Whisper's Tokenizer Design)

Per OpenAI’s official Whisper tokenizer implementation (tokenizer.py#L340-L351):

  • Language-specific special tokens (e.g., <|ja|>, <|en|>) are consecutively arranged between the <|startoftranscript|> and <|translate|> tokens in the vocabulary.
  • The total number of non-language special tokens is fixed at 1509 (1501 timestamp tokens + 8 core functional tokens: <|endoftext|>, <|startoftranscript|>, <|translate|>, <|transcribe|>, <|startoflm|>, <|startofprev|>, <|nospeech|>, <|notimestamps|>).
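As a sanity check (assuming the official multilingual vocabulary, where n_vocab = 51865 and the regular BPE vocabulary holds 50257 entries): 51865 − 50257 − 1509 = 99 languages, which matches the result of the old hardcoded expression, 51865 − 51765 − 1 = 99.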

@Jaffe2718 (Author)

I think the method that calculates the token ids between <|startoftranscript|> and <|translate|> is better, because the sizes of token_to_id and id_to_token change after the special tokens are loaded in whisper.cpp#L1641-L1672.

@Jaffe2718 changed the title from "Avoid hard-coding definition vocabulary to be compatible with costom model" to "Avoid hard-coding definition vocabulary to be compatible with custom model" on Dec 4, 2025
@Jaffe2718 changed the title from "Avoid hard-coding definition vocabulary to be compatible with custom model" to "fix: eliminate hard-coded vocab definitions to make the Whisper model compatible with custom vocabularies and embedding layer lengths #3392" on Dec 5, 2025
@Jaffe2718 changed the title from "fix: eliminate hard-coded vocab definitions to make the Whisper model compatible with custom vocabularies and embedding layer lengths #3392" to "fix: eliminate hard-coded vocab definitions to make the Whisper model compatible with custom vocabularies and embedding layer lengths" on Dec 5, 2025
@Jaffe2718 Jaffe2718 marked this pull request as draft December 6, 2025 10:44
Comment on lines +110 to +111
if "<|endoftext|>" in tokens:
del tokens["<|endoftext|>"]
@Jaffe2718 (Author)

When I run recognition with ggml-tiny.en.bin, the result is also empty; this is the same reason the tests fail.

@Jaffe2718 (Author)

If the last commit to this script was "Migrate from HG dataset into HG model", any models that were generated with that version need to be reconverted; otherwise, <|endoftext|> will have been written into the common tokens.

@Jaffe2718 Jaffe2718 marked this pull request as ready for review December 6, 2025 12:56


Development

Successfully merging this pull request may close these issues.

Segment fault on custom model
