### Description
When using the `client.speech_to_text.convert` API with diarization enabled, the returned word-level timestamps occasionally become "stuck": multiple consecutive words are assigned identical start and end timestamps, which breaks the continuity of the transcription timeline.
Example of the problem:

```json
{
  "text": "休",
  "start": 452.3,
  "end": 452.3
},
{
  "text": "息",
  "start": 452.3,
  "end": 452.3
}
```
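For reference, entries like the excerpt above can be dumped from the SDK response roughly as follows. This is only a sketch: it assumes a `result` returned by the `convert` call shown under Steps to Reproduce, with a `words` list whose items expose `text`, `start`, and `end` fields.

```python
import json

# Collect the word-level timing fields into plain dicts for inspection.
excerpt = [
    {"text": w.text, "start": w.start, "end": w.end}
    for w in result.words
]
print(json.dumps(excerpt, ensure_ascii=False, indent=2))
```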
**Impact:**
- **Timeline Accuracy:** renders the timestamp information useless for applications that require precise timing
- **Speaker Diarization:** prevents accurate speaker attribution when timestamps do not progress
- **Audio Alignment:** makes it impossible to sync the transcription with the original audio
- **Data Processing:** requires complex workarounds to handle the corrupted timing data (see the sketch after this list)
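One such workaround is a post-processing pass that spreads each run of stuck words evenly between the last good timestamp and the next distinct one. This is a minimal sketch only, assuming the words are plain dicts with `"start"` and `"end"` keys like the excerpt above; `interpolate_stuck_timestamps` is a hypothetical helper, not part of the SDK:

```python
def interpolate_stuck_timestamps(words):
    """Spread runs of identical start timestamps evenly up to the next distinct start.

    `words` is a list of dicts with "start" and "end" keys, as in the excerpt
    above. Returns a new list; the input is left untouched.
    """
    fixed = [dict(w) for w in words]
    i = 0
    while i < len(fixed):
        # Extend j to the end of the run of words sharing word i's start time.
        j = i
        while j + 1 < len(fixed) and fixed[j + 1]["start"] == fixed[i]["start"]:
            j += 1
        if j > i:
            run = fixed[i:j + 1]
            base = run[0]["start"]
            # Upper bound for the run: the next distinct start, or the run's own end.
            upper = fixed[j + 1]["start"] if j + 1 < len(fixed) else run[-1]["end"]
            step = (upper - base) / len(run)
            for k, w in enumerate(run):
                w["start"] = base + k * step
                w["end"] = w["start"] + step
        i = j + 1
    return fixed
```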
### Steps to Reproduce

1. Use a long audio file (>10 minutes) with multiple speaker changes.
2. Call the API with the following parameters:

   ```python
   from elevenlabs.client import ElevenLabs

   client = ElevenLabs(api_key="...")
   result = client.speech_to_text.convert(
       file=audio_data,
       model_id="scribe_v1",
       diarize=True,
       language_code="zh",  # also reproducible with other languages
       tag_audio_events=True,
   )
   ```

3. Inspect the word-level timestamps in the response (see the inspection sketch after this list).
4. Observe duplicate timestamps for consecutive words, especially after speaker changes.
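For step 3, a minimal inspection sketch. It assumes the `result` from step 2 exposes a `words` list whose entries carry `text`, `start`, and `end` fields matching the JSON excerpt above; `find_stuck_words` is a hypothetical helper:

```python
def find_stuck_words(words):
    """Yield consecutive word pairs whose timestamps do not advance."""
    for prev, cur in zip(words, words[1:]):
        # "Stuck": the current word starts and ends exactly where the
        # previous one did, so the timeline never moves forward.
        if cur.start == prev.start and cur.end == prev.end:
            yield prev, cur

for prev, cur in find_stuck_words(result.words):
    print(f"stuck at {cur.start:.2f}s: {prev.text!r} -> {cur.text!r}")
```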
### Expected Behavior

- Each word should have a unique timestamp range (start < end).
- Timestamps should increase monotonically throughout the transcription.
- Speaker changes should not cause timestamp stagnation.
- Consecutive words should have increasing end timestamps.
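A small check that captures these expectations (again only a sketch; `validate_timeline` is a hypothetical helper and assumes the same word objects and `result` as above):

```python
def validate_timeline(words):
    """Return human-readable violations of the expected timestamp behavior."""
    problems = []
    for i, w in enumerate(words):
        if not w.start < w.end:
            problems.append(f"word {i} ({w.text!r}): start {w.start} is not < end {w.end}")
        if i > 0 and w.start < words[i - 1].start:
            problems.append(f"word {i} ({w.text!r}): start {w.start} moves backwards")
        if i > 0 and w.end <= words[i - 1].end:
            problems.append(f"word {i} ({w.text!r}): end {w.end} does not advance")
    return problems

assert not validate_timeline(result.words), "timestamp expectations violated"
```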
### Code example

_No response_

### Additional context

_No response_