text_to_dialogue.convert mixes all speakers into one voice if there are any bracket instructions present

### Description

I have a simple test case: an audiobook extract in Finnish with a narrator and two characters. When I run it, every line is rendered in the same default voice.

Debugging suggests that if **any** bracket instructions are present, the voice does not change between speakers. I have tried the following:
- Removing all emotion brackets (`[excited]`): no help, every line still uses one voice.
- Removing all language brackets (`[English] Amouranth [Finnish] A S M R`): no help, every line still uses one voice.
- Removing all brackets, including `[pause 0.3 s]`: this works, and the voices change as expected.
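The three experiments above can be repeated mechanically instead of by hand-editing the strings. The following is only a sketch: the helper names (`classify_tag`, `count_tags`, `strip_categories`), the tag categories, and the `LANGUAGES` set are my own illustrative assumptions, not part of the elevenlabs SDK. It treats any text in square brackets as a tag.

```python
import re
from collections import Counter
from typing import Set

# Assumption: a "bracket instruction" is any text enclosed in square brackets.
TAG_RE = re.compile(r"\[([^\[\]]+)\]")
# Matches "pause 0.3s" as well as "pause 0.3 s".
PAUSE_RE = re.compile(r"pause\s+\d+(?:\.\d+)?\s*s", re.IGNORECASE)
# Assumption: illustrative list, only the languages used in this report.
LANGUAGES = {"english", "finnish"}


def classify_tag(tag_body: str) -> str:
    """Sort a tag into 'pause', 'language' or 'emotion' (everything else)."""
    body = tag_body.strip()
    if PAUSE_RE.fullmatch(body):
        return "pause"
    if body.lower() in LANGUAGES:
        return "language"
    return "emotion"


def count_tags(text: str) -> Counter:
    """Count how many tags of each category a text contains."""
    return Counter(classify_tag(m.group(1)) for m in TAG_RE.finditer(text))


def strip_categories(text: str, categories: Set[str]) -> str:
    """Remove only the tags whose category is in `categories`."""
    def repl(match: re.Match) -> str:
        return "" if classify_tag(match.group(1)) in categories else match.group(0)
    return TAG_RE.sub(repl, text)


if __name__ == "__main__":
    sample = "Tajuatko? [pause 0.3s] [excited]Sun [English] Amouranth [Finnish] A S M R"
    print(count_tags(sample))
    print(strip_categories(sample, {"emotion"}))
    print(strip_categories(sample, {"emotion", "language", "pause"}))
```

Applying `strip_categories` to each input text with one category at a time reproduces the experiments in the list; only stripping all three categories gave me distinct voices.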

I am running this on Ubuntu under WSL and have updated to the latest version of `elevenlabs` from GitHub.

ossi@TUF23:/mnt/c/temp$ pip show elevenlabs
Name: elevenlabs
Version: 2.22.1
Summary:
Home-page:
Author:
Author-email:
License: MIT
Location: /home/ossi/.local/lib/python3.10/site-packages
Requires: httpx, pydantic, pydantic-core, requests, typing_extensions, websockets
Required-by:
ossi@TUF23:/mnt/c/temp$

Environment: Python 3.10 on WSL2 (Ubuntu), model `eleven_v3`.

Usage: first set `ELEVENLABS_API_KEY` in the environment, then run the script.

The script has two test switches, `-t` and `--smoke`, plus a `--debug` flag:
- `-t` has each voice say its own character name. This works: each character sounds different, as it should.
- `--smoke` has each character say one simple phrase in its own voice. This also works: each character sounds different, as it should.
- `--debug` prints the voice ID used for each input.

However, the actual content is spoken entirely in one voice, even though the inputs specify different voice IDs:

```
ossi@TUF23:/mnt/c/temp$ python3 lue.py --debug -o sample5b.mp3
[DEBUG] 1: voice_id=Dkbbg7k9Ir9TNzn5GYLp text='Vastapäätä istuva Miksu löi kätensä yhteen ja Elias havahtui mietteistään. Miksu'...
[DEBUG] 2: voice_id=6n4YmXLiuP4C7cZqYOJl text='Tajuatko sä Elias minkälainen kohu tästä on syntymässä? [pause 0.3s] [excited]Su'...
[DEBUG] 3: voice_id=IKne3meq5aSn9XLyUdCD text='[surprised]Ai mitä?'...
[DEBUG] 4: voice_id=6n4YmXLiuP4C7cZqYOJl text='Aistivastevideota. [pause 0.2s] Kuvankauniit naiset pulikoivat kahluualtaissa ja'...
[DEBUG] 5: voice_id=Dkbbg7k9Ir9TNzn5GYLp text='Elias ei vastannut heti. Hän katsoi Miksua suoraan silmiin. [pause 0.6s]'...
[DEBUG] 6: voice_id=IKne3meq5aSn9XLyUdCD text='[calm]Oli. [pause 1.0s]'...
```

This problem is 100% reproducible.

### Code example

```python
#!/usr/bin/env python3
import os
import sys
import argparse
from pathlib import Path
from typing import List, Dict, Tuple

from elevenlabs.client import ElevenLabs
from elevenlabs.core.api_error import ApiError

# ---------------------------
# 1) Readers in variables
# ---------------------------
VOICES: Dict[str, str] = {
    "Kertoja": "Dkbbg7k9Ir9TNzn5GYLp",  # Rachel
    "Miksu":   "6n4YmXLiuP4C7cZqYOJl",  # Callum
    "Elias":   "IKne3meq5aSn9XLyUdCD",  # Charlie
    # "CherryKitten": "<voice_id>",
}

# ---------------------------
# 2) Sample read (name, phrase)
# ---------------------------
SCENE: List[Tuple[str, str]] = [
    ("Kertoja",
     "Vastapäätä istuva Miksu löi kätensä yhteen ja Elias havahtui mietteistään. "
     "Miksu oli esittänyt hänelle liikeideansa: Miksu alkaisi hänen managerikseen "
     "ja kaupallistajakseen. [pause 1.0s]"),
    ("Miksu",
     "Tajuatko sä Elias minkälainen kohu tästä on syntymässä? [pause 0.3s] [excited]"
     "Sun kanavan tilaajaluvut ovat nousseet jo lähes kolmeensataan[normal] ja Vuosaari-keikan "
     "lopusta on klippailtu jo ainakin kymmenelle YouTube-julkaisulle oma osuutensa. "
     "Streamaajaskenessä tämä on juuri nyt kaikkein kuuminta hottia. "
     "Jengi odottaa sulta uutta jaksoa kuin Amouranthin seuraavaa A S M R:ää."),
    ("Elias", "[surprised]Ai mitä?"),
    ("Miksu",
     "Aistivastevideota. [pause 0.2s] Kuvankauniit naiset pulikoivat kahluualtaissa ja kuiskailevat "
     "mikrofoniin lempeällä äänellä helliä sanoja. Yleisö kokee kehossaan kihelmöintiä. "
     "Mutta älä siitä välitä, keskitytään nyt suhun. Oliko siinä Vuosaari-videolla ihan oikeasti "
     "[fearful]ruumis[normal], vai [happy]trollasitko[normal] vain?"),
    ("Kertoja", "Elias ei vastannut heti. Hän katsoi Miksua suoraan silmiin. [pause 0.6s]"),
    ("Elias", "[calm]Oli. [pause 1.0s]"),
]

# ---------------------------
# Utilities
# ---------------------------
def unique_output_path(base: Path, allow_increment: bool = True) -> Path:
    """If base is free -> return as is. If occupied and allow_increment=True -> return _###-version."""
    if not base.exists() or not allow_increment:
        return base
    stem, suffix, parent = base.stem, base.suffix, base.parent
    i = 1
    while True:
        candidate = parent / f"{stem}_{i:03d}{suffix}"
        if not candidate.exists():
            return candidate
        i += 1

def build_inputs_from_scene(scene: List[Tuple[str, str]], voices: Dict[str, str]):
    inputs = []
    for name, text in scene:
        try:
            voice_id = voices[name]
        except KeyError:
            raise KeyError(f"Character '{name}' not found in VOICES dictionary.")
        inputs.append({"voice_id": voice_id, "text": text})
    return inputs

def build_inputs_for_test(voices: Dict[str, str]):
    # Each voice says its name -> quick voice test
    return [{"voice_id": vid, "text": name} for name, vid in voices.items()]

def write_audio_chunks_to_file(audio_iter, out_path: Path):
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with open(out_path, "wb") as f:
        for chunk in audio_iter:
            if chunk:
                f.write(chunk)

# ---------------------------
# Main program   
# ---------------------------
def main(argv=None):
    parser = argparse.ArgumentParser(
        description="Multi-voice dialogue generation with the ElevenLabs Text-to-Dialogue endpoint."
    )
    parser.add_argument("-o", "--output", default="sample.mp3",
                        help="Output file path (default: sample.mp3).")
    parser.add_argument("-t", "--test", action="store_true",
                        help="Voice test: each defined voice says its own name.")
    parser.add_argument("--overwrite", action="store_true",
                        help="Overwrite existing file (do not add running suffix).")
    parser.add_argument("--debug", action="store_true",
                        help="Print built inputs objects before generation.")
    parser.add_argument("--smoke", action="store_true",
                        help="Short three-line scene without tags (quick dialogue smoke test).")
    args = parser.parse_args(argv)

    api_key = os.environ.get("ELEVENLABS_API_KEY")
    if not api_key:
        print("WARNING: ELEVENLABS_API_KEY is not set in the environment.", file=sys.stderr)

    client = ElevenLabs(api_key=api_key)

    # Build inputs in ONE place (no overwrites later)
    if args.smoke:
        scene = [("Kertoja", "Tämä on kertoja."), ("Elias", "Tämä on Elias."), ("Miksu", "Tämä on Miksu.")]
        inputs = build_inputs_from_scene(scene, VOICES)
    elif args.test:
        inputs = build_inputs_for_test(VOICES)
    else:
        inputs = build_inputs_from_scene(SCENE, VOICES)

    if args.debug:
        for i, it in enumerate(inputs, 1):
            print(f"[DEBUG] {i}: voice_id={it['voice_id']} text={it['text'][:80]!r}...")

    # Generate audio
    try:
        audio = client.text_to_dialogue.convert(
            inputs=inputs,
            output_format="mp3_44100_128",
            seed=42,
        )
    except ApiError as e:
        # Show common reasons neatly
        print("ElevenLabs API error:", file=sys.stderr)
        print(f"  status_code: {getattr(e, 'status_code', '?')}", file=sys.stderr)
        body = getattr(e, 'body', None)
        if body:
            print(f"  body: {body}", file=sys.stderr)
        print("Tips: check ELEVENLABS_API_KEY, voice_id (do they belong to your account), and SDK version.", file=sys.stderr)
        sys.exit(1)

    # Output file + automatic numbering
    out_base = Path(args.output)
    out_path = out_base if args.overwrite else unique_output_path(out_base, allow_increment=True)
    write_audio_chunks_to_file(audio, out_path)
    print(f"OK -> {out_path}")

if __name__ == "__main__":
    main()

```


However, if the `SCENE` definition is replaced with the following bracket-free version, it works and each speaker gets the correct voice:

SCENE: List[Tuple[str, str]] = [
    ("Kertoja",
     "Vastapäätä istuva Miksu löi kätensä yhteen ja Elias havahtui mietteistään. "
     "Miksu oli esittänyt hänelle liikeideansa: Miksu alkaisi hänen managerikseen "
     "ja kaupallistajakseen. "),
    ("Miksu",
     "Tajuatko sä Elias minkälainen kohu tästä on syntymässä? "
     "Sun kanavan tilaajaluvut ovat nousseet jo lähes kolmeensataan ja Vuosaari-keikan "
     "lopusta on klippailtu jo ainakin kymmenelle YouTube-julkaisulle oma osuutensa. "
     "Streamaajaskenessä tämä on juuri nyt kaikkein kuuminta hottia. "
     "Jengi odottaa sulta uutta jaksoa kuin Amouranthin seuraavaa A S M R:ää."),
    ("Elias", "Ai mitä?"),
    ("Miksu",
     "Aistivastevideota.  Kuvankauniit naiset pulikoivat kahluualtaissa ja kuiskailevat "
     "mikrofoniin lempeällä äänellä helliä sanoja. Yleisö kokee kehossaan kihelmöintiä. "
     "Mutta älä siitä välitä, keskitytään nyt suhun. Oliko siinä Vuosaari-videolla ihan oikeasti "
     "ruumis, vai trollasitko vain?"),
    ("Kertoja", "Elias ei vastannut heti. Hän katsoi Miksua suoraan silmiin. "),
    ("Elias", "Oli. "),
]
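
The bracket-free variant above was edited by hand. To be sure that the tags are the only difference between the failing and the working input, the same result can be derived from the tagged `SCENE`. This is a sketch under my own assumptions: `strip_all_tags` is a hypothetical helper, not an SDK function, and it also collapses the double spaces left behind by removed tags and trims trailing whitespace, which the hand-edited version does not do.

```python
import re
from typing import List, Tuple

# Assumption: every bracketed span, e.g. [excited] or [pause 0.3s], is a tag.
TAG_RE = re.compile(r"\[[^\[\]]*\]")


def strip_all_tags(scene: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return a copy of the scene with all bracket tags removed.

    Speaker names and line order are left untouched, so the voice
    assignment is identical to the tagged scene.
    """
    cleaned = []
    for name, text in scene:
        plain = TAG_RE.sub("", text)
        # Collapse whitespace runs left behind by removed tags.
        plain = re.sub(r"\s{2,}", " ", plain).strip()
        cleaned.append((name, plain))
    return cleaned


if __name__ == "__main__":
    tagged = [
        ("Miksu", "Aistivastevideota. [pause 0.2s] Kuvankauniit naiset"),
        ("Elias", "[calm]Oli. [pause 1.0s]"),
    ]
    for name, text in strip_all_tags(tagged):
        print(f"{name}: {text!r}")
```

In the script, `build_inputs_from_scene(strip_all_tags(SCENE), VOICES)` would then produce the bracket-free inputs without touching the speaker order.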

### Additional context

[sample_004_smoketest.mp3](https://github.com/user-attachments/files/23502292/sample_004_smoketest.mp3)
[sample004_actual_output.mp3](https://github.com/user-attachments/files/23502293/sample004_actual_output.mp3)
[sample006_without_any_brackets.mp3](https://github.com/user-attachments/files/23502294/sample006_without_any_brackets.mp3)
