From 9275bd307f48536d661f1b7b8c5f6626fc302744 Mon Sep 17 00:00:00 2001 From: minh-hoque Date: Mon, 24 Nov 2025 23:05:47 -0500 Subject: [PATCH 1/5] Enhance Realtime out-of-band transcription example with detailed explanations, cost estimates, and improved formatting. Update execution counts and fix minor typos for clarity. --- .../Realtime_out_of_band_transcription.ipynb | 327 ++++++++++++++++-- 1 file changed, 295 insertions(+), 32 deletions(-) diff --git a/examples/Realtime_out_of_band_transcription.ipynb b/examples/Realtime_out_of_band_transcription.ipynb index 2ae3c26e5c..6abbf940a4 100644 --- a/examples/Realtime_out_of_band_transcription.ipynb +++ b/examples/Realtime_out_of_band_transcription.ipynb @@ -8,11 +8,11 @@ "\n", "**Purpose**: This notebook demonstrates how to use the Realtime model itself to accurately transcribe user audio `out-of-band` using the same websocket session connection, avoiding errors and inconsistencies common when relying on a separate transcription model (gpt-4o-transcribe/whisper-1).\n", "\n", - "> We call this out-of-band transcription using the realtime model. It refers to running a separate realtime model request to transcribe the user’s audio outside the live Realtime conversation.\n", + "We call this [out-of-band](https://platform.openai.com/docs/guides/realtime-conversations#create-responses-outside-the-default-conversation) transcription using the Realtime model. It’s simply a second response.create request issued on the same Realtime WebSocket, but tagged so it doesn’t write to the active conversation state. It runs the model again with different instructions (a transcription prompt). 
This triggers a new inference pass by the Realtime model, separate from the assistant’s main speech turn.\n", "\n", "It covers how to build a server-to-server client that:\n", "\n", - "- Streams microphone audio to an OpenAI Realtime voice agent.\n", + "- Streams microphone audio to an OpenAI realtime voice agent.\n", "- Plays back the agent's spoken replies.\n", "- After each user turn, generates a high-quality text-only transcript using the **same Realtime model**.\n", "\n", @@ -92,9 +92,9 @@ "\n", "- Other Considerations:\n", "\n", - " - Implementing transcription via the realtime model might be slightly more complex compared to using the built-in GPT-4o transcription option through the Realtime API.\n", + " - Implementing transcription via the Realtime model might be slightly more complex compared to using the built-in GPT-4o transcription option through the Realtime API.\n", "\n", - "> Note: Ouf-of-band responses using the realtime model can be used for other use cases beyond user turn transcription. Examples include generating structured summaries, triggering background actions, or performing validation tasks without affecting the main conversation.\n", + "**Note**: Out-of-band responses using the Realtime model can be used for other use cases beyond user turn transcription. Examples include generating structured summaries, triggering background actions, or performing validation tasks without affecting the main conversation.\n", "\n", "\"drawing\"\n" ] }, @@ -126,14 +126,12 @@ "\n", " ```bash\n", " export OPENAI_API_KEY=sk-...\n", - " ```\n", - "\n", - "```\n" + " ```" ] }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 11, "id": "c399f440", "metadata": {}, "outputs": [], @@ -150,15 +148,15 @@ "\n", "We use **two distinct prompts**:\n", "\n", - "1. **Voice Agent Prompt** (`REALTIME_MODEL_PROMPT`): This is an example prompt used with the realtime model for the Speech 2 Speech interactions.\n", + "1. 
**Voice Agent Prompt** (`REALTIME_MODEL_PROMPT`): This is an example prompt used with the Realtime model for the Speech 2 Speech interactions.\n", "2. **Transcription Prompt** (`REALTIME_MODEL_TRANSCRIPTION_PROMPT`): Silently returns a precise, verbatim transcript of the user's most recent speech turn. You can modify this prompt to iterate in transcription quality.\n", "\n", - "> For the `REALTIME_MODEL_TRANSCRIPTION_PROMPT`, you can start from this base prompt, but the goal would be for you to iterate on the prompt to tailor it to your use case. Just remember to remove the Policy Number formatting rules since it might not apply to your use case!" + "For the `REALTIME_MODEL_TRANSCRIPTION_PROMPT`, you can start from this base prompt, but the goal would be for you to iterate on the prompt to tailor it to your use case. Just remember to remove the Policy Number formatting rules since it might not apply to your use case!" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 12, "id": "ac3afaab", "metadata": {}, "outputs": [], @@ -214,7 +212,7 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 13, "id": "4b952a29", "metadata": {}, "outputs": [ @@ -222,7 +220,7 @@ "name": "stderr", "output_type": "stream", "text": [ - "/var/folders/vd/l97lv64j3678b905tff4bc0h0000gp/T/ipykernel_91319/2514869342.py:10: DeprecationWarning: websockets.client.WebSocketClientProtocol is deprecated\n", + "/var/folders/cn/p1ryy08146b7vvvhbh24j9b00000gn/T/ipykernel_48882/2514869342.py:10: DeprecationWarning: websockets.client.WebSocketClientProtocol is deprecated\n", " from websockets.client import WebSocketClientProtocol\n" ] } @@ -251,7 +249,7 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 14, "id": "7254080a", "metadata": {}, "outputs": [], @@ -300,19 +298,19 @@ "- Audio input/output\n", "- Server‑side VAD\n", "- Set built‑in transcription (`input_audio_transcription_model`)\n", - " + We set this so that we can compare to 
the realtime model transcription\n", + " + We set this so that we can compare to the Realtime model transcription\n", "\n", - "The out‑of‑band transcription is a `response.create` trigerred after user input audio is committed `input_audio_buffer.committed`:\n", + "The out‑of‑band transcription is a `response.create` triggered after user input audio is committed `input_audio_buffer.committed`:\n", "\n", - "- `conversation: \"none\"` – use session state but don’t write to the main conversation session state\n", - "- `output_modalities: [\"text\"]` – get a text transcript only\n", + "- [`conversation: \"none\"`](https://platform.openai.com/docs/api-reference/realtime-client-events/response/create#realtime_client_events-response-create-response-conversation) – use session state but don’t write to the main conversation session state\n", + "- [`output_modalities: [\"text\"]`](https://platform.openai.com/docs/api-reference/realtime-client-events/response/create#realtime_client_events-response-create-response-output_modalities) – get a text transcript only\n", "\n", - "> Note: The REALTIME_MODEL_TRANSCRIPTION_PROMPT is not passed to the gpt-4o-transcribe model because the Realtime API enforces a 1024 token maximum for prompts.\n" + "**Note**: The REALTIME_MODEL_TRANSCRIPTION_PROMPT is not passed to the gpt-4o-transcribe model because the Realtime API enforces a 1024 token maximum for prompts.\n" ] }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 15, "id": "4baf1870", "metadata": {}, "outputs": [], @@ -408,7 +406,7 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 16, "id": "11218bbb", "metadata": {}, "outputs": [], @@ -527,7 +525,7 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": null, "id": "cb6acbf0", "metadata": {}, "outputs": [], @@ -537,6 +535,7 @@ "\n", " pending_prints: deque | None = shared_state.get(\"pending_transcription_prints\")\n", " input_transcripts: deque | None = 
shared_state.get(\"input_transcripts\")\n", + " transcription_model_costs: deque | None = shared_state.get(\"transcription_model_costs\")\n", "\n", " if not pending_prints or not input_transcripts:\n", " return\n", @@ -547,10 +546,35 @@ " print(\"=== User turn (Transcription model) ===\")\n", " if comparison_text:\n", " print(comparison_text, flush=True)\n", - " print()\n", " else:\n", " print(\"\", flush=True)\n", - " print()\n" + "\n", + " # After printing the transcription text, print any stored granular cost.\n", + " cost_info = None\n", + " if transcription_model_costs:\n", + " cost_info = transcription_model_costs.popleft()\n", + "\n", + " if cost_info:\n", + " audio_input_cost = cost_info.get(\"audio_input_cost\", 0.0)\n", + " text_input_cost = cost_info.get(\"text_input_cost\", 0.0)\n", + " text_output_cost = cost_info.get(\"text_output_cost\", 0.0)\n", + " total_cost = cost_info.get(\"total_cost\", 0.0)\n", + "\n", + " usage = cost_info.get(\"usage\")\n", + " if usage:\n", + " print(\"[Transcription model usage]\")\n", + " print(json.dumps(usage, indent=2))\n", + "\n", + " print(\n", + " \"[Transcription model cost estimate] \"\n", + " f\"audio_in=${audio_input_cost:.6f}, \"\n", + " f\"text_in=${text_input_cost:.6f}, \"\n", + " f\"text_out=${text_output_cost:.6f}, \"\n", + " f\"total=${total_cost:.6f}\",\n", + " flush=True,\n", + " )\n", + "\n", + " print()\n" ] }, { @@ -570,7 +594,7 @@ }, { "cell_type": "code", - "execution_count": 9, + "execution_count": null, "id": "d099babd", "metadata": {}, "outputs": [], @@ -594,6 +618,22 @@ " pending_transcription_prints = shared_state.setdefault(\n", " \"pending_transcription_prints\", deque()\n", " )\n", + " transcription_model_costs = shared_state.setdefault(\n", + " \"transcription_model_costs\", deque()\n", + " )\n", + "\n", + " # Pricing constants (USD per 1M tokens). 
See https://platform.openai.com/pricing.\n", + " # gpt-4o-transcribe\n", + " GPT4O_TRANSCRIBE_AUDIO_INPUT_PRICE_PER_1M = 6.00\n", + " GPT4O_TRANSCRIBE_TEXT_INPUT_PRICE_PER_1M = 2.50\n", + " GPT4O_TRANSCRIBE_TEXT_OUTPUT_PRICE_PER_1M = 10.00\n", + "\n", + " REALTIME_TEXT_INPUT_PRICE_PER_1M = 4\n", + " REALTIME_TEXT_CACHED_INPUT_PRICE_PER_1M = 0.4\n", + " REALTIME_TEXT_OUTPUT_PRICE_PER_1M = 16.00\n", + " REALTIME_AUDIO_INPUT_PRICE_PER_1M = 32.00\n", + " REALTIME_AUDIO_CACHED_INPUT_PRICE_PER_1M = 0.4\n", + " REALTIME_AUDIO_OUTPUT_PRICE_PER_1M = 64.00\n", "\n", " async for raw in ws:\n", " if stop_event.is_set():\n", @@ -648,7 +688,39 @@ " final_text = item.get(\"transcription\")\n", " final_text = final_text or \"\"\n", "\n", - " final_text = final_text.strip()\n", + " # Compute and store cost estimate for the transcription model (e.g., gpt-4o-transcribe).\n", + " usage = message.get(\"usage\") or {}\n", + " cost_info: dict | None = None\n", + " if usage:\n", + " input_details = usage.get(\"input_token_details\") or {}\n", + " audio_input_tokens = input_details.get(\"audio_tokens\") or 0\n", + " text_input_tokens = input_details.get(\"text_tokens\") or 0\n", + " output_tokens = usage.get(\"output_tokens\") or 0\n", + "\n", + " audio_input_cost = (\n", + " audio_input_tokens * GPT4O_TRANSCRIBE_AUDIO_INPUT_PRICE_PER_1M\n", + " / 1_000_000\n", + " )\n", + " text_input_cost = (\n", + " text_input_tokens * GPT4O_TRANSCRIBE_TEXT_INPUT_PRICE_PER_1M\n", + " / 1_000_000\n", + " )\n", + " text_output_cost = (\n", + " output_tokens * GPT4O_TRANSCRIBE_TEXT_OUTPUT_PRICE_PER_1M\n", + " / 1_000_000\n", + " )\n", + " total_cost = audio_input_cost + text_input_cost + text_output_cost\n", + "\n", + " cost_info = {\n", + " \"audio_input_cost\": audio_input_cost,\n", + " \"text_input_cost\": text_input_cost,\n", + " \"text_output_cost\": text_output_cost,\n", + " \"total_cost\": total_cost,\n", + " }\n", + "\n", + " transcription_model_costs.append(cost_info)\n", + "\n", + " final_text = 
(final_text or \"\").strip()\n", " if final_text:\n", " input_transcripts.append(final_text)\n", " flush_pending_transcription_prints(shared_state)\n", @@ -706,11 +778,86 @@ " responses[response_id][\"done\"] = True\n", "\n", " is_transcription = responses[response_id][\"is_transcription\"]\n", + "\n", + " # For out-of-band transcription responses, compute usage-based cost estimates.\n", + " usage = response.get(\"usage\") or {}\n", + " oob_cost_info: dict | None = None\n", + " if usage and is_transcription:\n", + " input_details = usage.get(\"input_token_details\") or {}\n", + " output_details = usage.get(\"output_token_details\") or {}\n", + " cached_details = input_details.get(\"cached_tokens_details\") or {}\n", + "\n", + " text_input_tokens = input_details.get(\"text_tokens\") or 0\n", + " cached_text_tokens = (\n", + " cached_details.get(\"text_tokens\")\n", + " or input_details.get(\"cached_tokens\")\n", + " or 0\n", + " )\n", + " non_cached_text_input_tokens = max(\n", + " text_input_tokens - cached_text_tokens, 0\n", + " )\n", + " audio_input_tokens = input_details.get(\"audio_tokens\") or 0\n", + "\n", + " text_output_tokens = output_details.get(\"text_tokens\") or 0\n", + " audio_output_tokens = output_details.get(\"audio_tokens\") or 0\n", + "\n", + " text_input_cost = (\n", + " non_cached_text_input_tokens * REALTIME_TEXT_INPUT_PRICE_PER_1M\n", + " / 1_000_000\n", + " )\n", + " cached_text_input_cost = (\n", + " cached_text_tokens * REALTIME_TEXT_CACHED_INPUT_PRICE_PER_1M\n", + " / 1_000_000\n", + " )\n", + " audio_input_cost = (\n", + " audio_input_tokens * REALTIME_AUDIO_INPUT_PRICE_PER_1M\n", + " / 1_000_000\n", + " )\n", + " text_output_cost = (\n", + " text_output_tokens * REALTIME_TEXT_OUTPUT_PRICE_PER_1M\n", + " / 1_000_000\n", + " )\n", + " audio_output_cost = (\n", + " audio_output_tokens * REALTIME_AUDIO_OUTPUT_PRICE_PER_1M\n", + " / 1_000_000\n", + " )\n", + "\n", + " total_cost = (\n", + " text_input_cost\n", + " + 
cached_text_input_cost\n", " + audio_input_cost\n", " + text_output_cost\n", " + audio_output_cost\n", " )\n", "\n", " oob_cost_info = {\n", " \"text_input_cost\": text_input_cost,\n", " \"cached_text_input_cost\": cached_text_input_cost,\n", " \"audio_input_cost\": audio_input_cost,\n", " \"text_output_cost\": text_output_cost,\n", " \"audio_output_cost\": audio_output_cost,\n", " \"total_cost\": total_cost,\n", " }\n", "\n", " text = buffers.get(response_id, \"\").strip()\n", " if text:\n", " if is_transcription:\n", " print(\"\\n=== User turn (Realtime transcript) ===\")\n", " print(text, flush=True)\n", " if usage:\n", " print(\"[Realtime out-of-band transcription usage]\")\n", " print(json.dumps(usage, indent=2))\n", " if oob_cost_info:\n", " print(\n", " \"[Realtime out-of-band transcription cost estimate] \"\n", " f\"text_in=${oob_cost_info['text_input_cost']:.6f}, \"\n", " f\"text_in_cached=${oob_cost_info['cached_text_input_cost']:.6f}, \"\n", " f\"audio_in=${oob_cost_info['audio_input_cost']:.6f}, \"\n", " f\"text_out=${oob_cost_info['text_output_cost']:.6f}, \"\n", " f\"audio_out=${oob_cost_info['audio_output_cost']:.6f}, \"\n", " f\"total=${oob_cost_info['total_cost']:.6f}\",\n", " flush=True,\n", " )\n", " print()\n", " pending_transcription_prints.append(object())\n", " flush_pending_transcription_prints(shared_state)\n", @@ -743,7 +890,7 @@ "source": [ "# 9. Run Script\n", "\n", "In this step, we run the code which will allow us to view the Realtime model transcription vs transcription model transcriptions. 
The code does the following:\n", "\n", "- Loads configuration and prompts\n", "- Establishes a WebSocket connection\n", @@ -774,7 +921,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 19, "id": "35c4d7b5", "metadata": {}, "outputs": [], @@ -856,6 +1003,117 @@ " )" ] }, + { + "cell_type": "code", + "execution_count": 20, + "id": "32401814", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Streaming microphone audio at 24000 Hz (mono). Speak naturally; server VAD will stop listening when you pause.\n", + "\n", + "[client] Speech detected; streaming...\n", + "[client] Detected silence; preparing transcript...\n", + "\n", + "=== User turn (Realtime transcript) ===\n", + "Hello.\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.001192, text_in_cached=$0.000000, audio_in=$0.000000, text_out=$0.000064, audio_out=$0.000000, total=$0.001256\n", + "\n", + "=== User turn (Transcription model) ===\n", + "Hello\n", + "[Transcription model cost estimate] audio_in=$0.000096, text_in=$0.000000, text_out=$0.000030, total=$0.000126\n", + "\n", + "\n", + "=== Assistant response ===\n", + "Hello! I’m here to help you file an insurance claim. Let’s start with the basics. Could you please tell me your full name?\n", + "\n", + "\n", + "[client] Speech detected; streaming...\n", + "[client] Detected silence; preparing transcript...\n", + "\n", + "=== User turn (Realtime transcript) ===\n", + "My full name is M I N H A J U L H O Q U E\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.000132, text_in_cached=$0.000128, audio_in=$0.000864, text_out=$0.000304, audio_out=$0.000000, total=$0.001428\n", + "\n", + "=== User turn (Transcription model) ===\n", + "My full name is Minhajul Hoque.\n", + "[Transcription model cost estimate] audio_in=$0.000384, text_in=$0.000000, text_out=$0.000120, total=$0.000504\n", + "\n", + "\n", + "=== Assistant response ===\n", + "Thank you. 
I heard your full name is Minhajul Hoque. Could you please confirm that I got it right?\n", + "\n", + "\n", + "[client] Speech detected; streaming...\n", + "[client] Detected silence; preparing transcript...\n", + "\n", + "=== User turn (Realtime transcript) ===\n", + "Yup, you got it right exactly.\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.001552, text_in_cached=$0.000000, audio_in=$0.001600, text_out=$0.000176, audio_out=$0.000000, total=$0.003328\n", + "\n", + "=== User turn (Transcription model) ===\n", + "Yep, you got it right exactly.\n", + "[Transcription model cost estimate] audio_in=$0.000240, text_in=$0.000000, text_out=$0.000100, total=$0.000340\n", + "\n", + "\n", + "=== Assistant response ===\n", + "Great, thank you for confirming. Now, could you provide your policy number, please?\n", + "\n", + "\n", + "[client] Speech detected; streaming...\n", + "[client] Detected silence; preparing transcript...\n", + "\n", + "[client] Speech detected; streaming...\n", + "\n", + "=== User turn (Realtime transcript) ===\n", + "D-\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.001672, text_in_cached=$0.000000, audio_in=$0.001120, text_out=$0.000064, audio_out=$0.000000, total=$0.002856\n", + "\n", + "=== User turn (Transcription model) ===\n", + "Доб.\n", + "[Transcription model cost estimate] audio_in=$0.000066, text_in=$0.000000, text_out=$0.000040, total=$0.000106\n", + "\n", + "[client] Detected silence; preparing transcript...\n", + "\n", + "=== User turn (Realtime transcript) ===\n", + "P022-4567\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.000168, text_in_cached=$0.000154, audio_in=$0.004192, text_out=$0.000112, audio_out=$0.000000, total=$0.004626\n", + "\n", + "=== User turn (Transcription model) ===\n", + "policy number is something like P022456.\n", + "[Transcription model cost estimate] audio_in=$0.000576, text_in=$0.000000, text_out=$0.000110, total=$0.000686\n", + "\n", + "\n", + 
"=== Assistant response ===\n", + "Thank you. I heard your policy number is P-0-2-2-4-5-6. Could you confirm if that’s correct?\n", + "\n", + "\n", + "[client] Speech detected; streaming...\n", + "[client] Detected silence; preparing transcript...\n", + "\n", + "=== User turn (Realtime transcript) ===\n", + "it...\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.001876, text_in_cached=$0.000000, audio_in=$0.006112, text_out=$0.000064, audio_out=$0.000000, total=$0.008052\n", + "\n", + "=== User turn (Transcription model) ===\n", + "It\n", + "[Transcription model cost estimate] audio_in=$0.000072, text_in=$0.000000, text_out=$0.000030, total=$0.000102\n", + "\n", + "\n", + "=== Assistant response ===\n", + "Thank you for confirming. Now, could you tell me the type of accident? For example, is it auto, home, or something else?\n", + "\n", + "Session cancelled; closing.\n" + ] + } + ], + "source": [ + "await run_realtime_session()" + ] + }, { "cell_type": "code", "execution_count": 29, @@ -976,8 +1234,8 @@ "source": [ "From the above example, we can notice:\n", "- The Realtime Model Transcription quality matches or surpasses that of the transcription model in various turns. In one of the turns, the transcription model misses \"this is important.\" while the realtime transcription gets it correctly.\n", - "- The realtime model correctly applies rules for Policy Number formatting (XXXX-XXXX).\n", - "- With context from the entire session, including previous turns where I spelled out my name, the realtime model accurately transcribes my name when the assistant asked my name again while the transcription model makes errors (e.g., \"Minhaj ul Haq\")." 
+ "- The Realtime model correctly applies rules for Policy Number formatting (XXXX-XXXX).\n", + "- With context from the entire session, including previous turns where I spelled out my name, the Realtime model accurately transcribes my name when the assistant asked my name again while the transcription model makes errors (e.g., \"Minhaj ul Haq\")." ] }, { @@ -996,7 +1254,12 @@ "If you decide to pursue this method, make sure you:\n", "\n", "* Set up the transcription trigger correctly, ensuring it activates after the audio commit.\n", - "* Carefully iterate and refine the prompt to align closely with your specific use case and needs.\n" + "* Carefully iterate and refine the prompt to align closely with your specific use case and needs.\n", + "\n", + "## Documentation:\n", + "- https://platform.openai.com/docs/guides/realtime-conversations#create-responses-outside-the-default-conversation\n", + "- https://platform.openai.com/docs/api-reference/realtime-client-events/response/create#realtime_client_events-response-create-response-conversation\n", + "- https://platform.openai.com/docs/api-reference/realtime-client-events/response/create#realtime_client_events-response-create-response-output_modalities" ] } ], @@ -1016,7 +1279,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.8" + "version": "3.12.9" } }, "nbformat": 4, From 8e5f7a0ef97ad0f8d8df8a1cfd7a264307cef2c8 Mon Sep 17 00:00:00 2001 From: minh-hoque Date: Tue, 25 Nov 2025 10:17:52 -0500 Subject: [PATCH 2/5] Update Realtime out-of-band transcription example with enhanced instructions, improved structure, and detailed cost estimation functions. Adjust execution counts and refine formatting for clarity. 
--- .../Realtime_out_of_band_transcription.ipynb | 2541 +++++++++++++++-- 1 file changed, 2357 insertions(+), 184 deletions(-) diff --git a/examples/Realtime_out_of_band_transcription.ipynb b/examples/Realtime_out_of_band_transcription.ipynb index 6abbf940a4..0b212a8c83 100644 --- a/examples/Realtime_out_of_band_transcription.ipynb +++ b/examples/Realtime_out_of_band_transcription.ipynb @@ -131,7 +131,7 @@ }, { "cell_type": "code", - "execution_count": 11, + "execution_count": 21, "id": "c399f440", "metadata": {}, "outputs": [], @@ -156,44 +156,285 @@ }, { "cell_type": "code", - "execution_count": 12, + "execution_count": 76, "id": "ac3afaab", "metadata": {}, "outputs": [], "source": [ - "REALTIME_MODEL_PROMPT = \"\"\"You are a calm insurance claims intake voice agent. Follow this script strictly:\n", + "REALTIME_MODEL_PROMPT = \"\"\"\n", + "You are a calm, professional, and empathetic insurance claims intake voice agent working for OpenAI Insurance Solutions. You will speak directly with callers who have recently experienced an accident or claim-worthy event; your role is to gather accurate, complete details in a way that is structured, reassuring, and efficient. Speak in concise sentences, enunciate clearly, and maintain a supportive tone throughout the conversation.\n", "\n", - "## Phase 1 – Basics\n", - "Collect the caller's full name, policy number, and type of accident (for example: auto, home, or other). Ask for each item clearly and then repeat the values back to confirm.\n", + "## OVERVIEW\n", "\n", - "## Phase 2 – Yes/No questions\n", - "Ask 2–3 simple yes/no questions, such as whether anyone was injured, whether the vehicle is still drivable, and whether a police report was filed. 
Confirm each yes/no answer in your own words.\n", + "Your job is to walk every caller methodically through three main phases:\n", "\n", - "## Phase 3 – Submit claim\n", - "Once you have the basics and yes/no answers, briefly summarize the key facts in one or two sentences.\n", + "1. **Phase 1: Basics Collection**\n", + "2. **Phase 2: Incident Clarification and Yes/No Questions**\n", + "3. **Phase 3: Summary, Confirmation, and Submission**\n", + "\n", + "You should strictly adhere to this structure, make no guesses, never skip required fields, and always confirm critical facts directly with the caller.\n", + "\n", + "---\n", + "\n", + "## PHASE 1: BASICS COLLECTION\n", + "\n", + "- **Greet the caller**: Briefly introduce yourself (“Thank you for calling OpenAI Insurance Claims. My name is [Assistant Name], and I’ll help you file your claim today.”).\n", + "- **Gather the following details:**\n", + " - Full legal name of the policyholder (“May I please have your full legal name as it appears on your policy?”).\n", + " - Policy number (ask for and repeat back, following the `XXXX-XXXX` format, and clarify spelling or numbers if uncertain).\n", + " - Type of accident (auto, home, or other; if ‘other’, ask for brief clarification, e.g., “Can you tell me what type of claim you’d like to file?”).\n", + " - Preferred phone number for follow-up.\n", + " - Date and time of the incident.\n", + "- **Repeat and confirm all collected details at the end of this phase** (“Just to confirm, I have... [summarize each field]. Is that correct?”).\n", + "\n", + "---\n", + "\n", + "## PHASE 2: INCIDENT CLARIFICATION AND YES/NO QUESTIONS\n", + "\n", + "- **Ask YES/NO questions tailored to the incident type:**\n", + " - Was anyone injured?\n", + " - For vehicle claims: Is the vehicle still drivable?\n", + " - For home claims: Is the property currently safe to occupy?\n", + " - Was a police or official report filed? 
If yes, request report/reference number if available.\n", + " - Are there any witnesses to the incident?\n", + "- **For each YES/NO answer:** Restate the caller’s response in your own words to confirm understanding.\n", + "- **If a caller is unsure or does not have information:** Note it politely and move on without pressing (“That’s okay, we can always collect it later if needed.”).\n", + "\n", + "---\n", + "\n", + "## PHASE 3: SUMMARY, CONFIRMATION & CLAIM SUBMISSION\n", + "\n", + "- **Concise Recap**: Summarize all key facts in a single, clear paragraph (“To quickly review, you, [caller’s name], experienced [incident description] on [date] and provided the following answers... Is that all correct?”).\n", + "- **Final Confirmation**: Ask if there is any other relevant information they wish to add about the incident.\n", + "- **Submission**: Inform the caller you will submit the claim and briefly outline next steps (“I’ll now submit your claim. Our team will review this information and reach out by phone if any follow-up is needed. 
You'll receive an initial update within [X] business days.”).\n", + "- **Thank the caller**: Express appreciation for their patience.\n", + "\n", + "---\n", + "\n", + "## GENERAL GUIDELINES\n", + "\n", + "- Always state the purpose of each question before asking it.\n", + "- Be patient: Adjust your pacing if the caller seems upset or confused.\n", + "- Provide reassurance but do not make guarantees about claim approvals.\n", + "- If the caller asks a question outside your scope, politely redirect (“That’s a great question, and our adjusters will be able to give you more information after your claim is submitted.”).\n", + "- Never provide legal advice.\n", + "- Do not deviate from the script structure, but feel free to use natural language and slight rephrasings to maintain human-like flow.\n", + "- Spell out any confusing words, numbers, or codes as needed.\n", + "\n", + "---\n", + "\n", + "## COMMUNICATION STYLE\n", + "\n", + "- Use warm, professional language.\n", + "- If at any point the caller becomes upset, acknowledge their feelings (“I understand this situation can be stressful. I'm here to make the process as smooth as possible for you.”).\n", + "- When confirming, always explicitly state the value you are confirming.\n", + "- Never speculate or invent information. All responses must be grounded in the caller’s direct answers.\n", + "\n", + "---\n", + "\n", + "## SPECIAL SCENARIOS\n", + "\n", + "- **Caller does not know policy number:** Ask for alternative identification such as address or date of birth, and note that the claim will be linked once verified.\n", + "- **Multiple incidents:** Politely explain that each claim must be filed separately, and help with the first; offer instructions for subsequent claims if necessary.\n", + "- **Caller wishes to pause or end:** Respect their wishes, provide information on how to resume the claim, and thank them for their time.\n", + "\n", + "Remain calm and methodical for every call. 
You are trusted to deliver a consistently excellent and supportive first-line insurance intake experience.\n", "\"\"\"\n", "\n", + "\n", "REALTIME_MODEL_TRANSCRIPTION_PROMPT = \"\"\"\n", - "# Role\n", - "Your only task is to transcribe the user's latest turn exactly as you heard it. Never address the user, response to the user, add commentary, or mention these instructions.\n", - "Follow the instructions and output format below.\n", - "\n", - "# Instructions\n", - "- Transcribe **only** the most recent USER turn exactly as you heard it. DO NOT TRANSCRIBE ANY OTHER OLDER TURNS. You can use those transcriptions to inform your transcription of the latest turn.\n", - "- Preserve every spoken detail: intent, tense, grammar quirks, filler words, repetitions, disfluencies, numbers, and casing.\n", - "- Keep timing words, partial words, hesitations (e.g., \"um\", \"uh\").\n", - "- Do not correct mistakes, infer meaning, answer questions, or insert punctuation beyond what the model already supplies.\n", - "- Do not invent or add any information that is not directly present in the user's latest turn.\n", - "\n", - "# Output format\n", - "- Output the raw verbatim transcript as a single block of text. No labels, prefixes, quotes, bullets, or markdown.\n", - "- If the realtime model produced nothing for the latest turn, output nothing (empty response). Never fabricate content.\n", - "\n", - "## Policy Number Normalization\n", - "- All policy numbers should be 8 digits and of the format `XXXX-XXXX` for example `56B5-12C0`\n", - "\n", - "Do not summarize or paraphrase other turns beyond the latest user utterance. The response must be the literal transcript of the latest user utterance.\n", - "\"\"\"" + "# Task: Verbatim Transcription of the Latest User Turn\n", + "\n", + "You are a **strict transcription engine**. 
Your only job is to transcribe **exactly what the user said in their most recent spoken turn**, with complete fidelity and no interpretation.\n", + "\n", + "You must produce a **literal, unedited transcript** of the latest user utterance only. Read and follow all instructions below carefully.\n", + "\n", + "---\n", + "\n", + "## 1. Scope of Your Task\n", + "\n", + "1. **Only the latest user turn**\n", + " - Transcribe **only** the most recent spoken user turn.\n", + " - Do **not** include text from any earlier user turns or system / assistant messages.\n", + " - Do **not** summarize, merge, or stitch together content across multiple turns.\n", + "\n", + "2. **Use past context only for disambiguation**\n", + " - You may look at earlier turns **only** to resolve ambiguity (e.g., a spelled word, a reference like “that thing I mentioned before”).\n", + " - Even when using context, the actual transcript must still contain **only the words spoken in the latest turn**.\n", + "\n", + "3. **No conversation management**\n", + " - You are **not** a dialogue agent.\n", + " - You do **not** answer questions, give advice, or continue the conversation.\n", + " - You only output the text of what the user just said.\n", + "\n", + "---\n", + "\n", + "## 2. Core Transcription Principles\n", + "\n", + "Your goal is to create a **perfectly faithful** transcript of the latest user turn.\n", + "\n", + "1. **Verbatim fidelity**\n", + " - Capture the user’s speech **exactly as spoken**.\n", + " - Preserve:\n", + " - All words (including incomplete or cut-off words)\n", + " - Mispronunciations\n", + " - Grammatical mistakes\n", + " - Slang and informal language\n", + " - Filler words (“um”, “uh”, “like”, “you know”, etc.)\n", + " - Self-corrections and restarts\n", + " - Repetitions and stutters\n", + "\n", + "2. 
**No rewriting or cleaning**\n", + " - Do **not**:\n", + " - Fix grammar or spelling\n", + " - Replace slang with formal language\n", + " - Reorder words\n", + " - Simplify or rewrite sentences\n", + " - “Smooth out” repetitions or disfluencies\n", + " - If the user says something awkward, incorrect, or incomplete, your transcript must **match that awkwardness or incompleteness exactly**.\n", + "\n", + "3. **Spelling and letter sequences**\n", + " - If the user spells a word (e.g., “That’s M-A-R-I-A.”), transcribe it exactly as spoken.\n", + " - If they spell something unclearly, still reflect what you received, even if it seems wrong.\n", + " - Do **not** infer the “intended” spelling; transcribe the letters as they were given.\n", + "\n", + "4. **Numerals and formatting**\n", + " - If the user says a number in words (e.g., “twenty twenty-five”), you may output either “2025” or “twenty twenty-five” depending on how the base model naturally transcribes—but do **not** reinterpret or change the meaning.\n", + " - Do **not**:\n", + " - Convert numbers into different units or formats.\n", + " - Expand abbreviations or acronyms beyond what was spoken.\n", + "\n", + "5. **Language and code-switching**\n", + " - If the user switches languages mid-sentence, reflect that in the transcript.\n", + " - Transcribe non-English content as accurately as possible.\n", + " - Do **not** translate; keep everything in the language(s) spoken.\n", + "\n", + "---\n", + "\n", + "## 3. Disfluencies, Non-Speech Sounds, and Ambiguity\n", + "\n", + "1. **Disfluencies**\n", + " - Always include:\n", + " - “Um”, “uh”, “er”\n", + " - Repeated words (“I I I think…”)\n", + " - False starts (“I went to the— I mean, I stayed home.”)\n", + " - Do not remove or compress them.\n", + "\n", + "2. 
**Non-speech vocalizations**\n", + " - If the model’s transcription capabilities represent non-speech sounds (e.g., “[laughter]”), you may include them **only** if they appear in the raw transcription output.\n", + " - Do **not** invent labels like “[cough]”, “[sigh]”, or “[laughs]” on your own.\n", + " - If the model does not explicitly provide such tokens, **omit them** rather than inventing them.\n", + "\n", + "3. **Unclear or ambiguous audio**\n", + " - If parts of the audio are unclear and the base transcription gives partial or uncertain tokens, you must **not** guess or fill in missing material.\n", + " - Do **not** replace unclear fragments with what you “think” the user meant.\n", + " - Your duty is to preserve exactly what the transcription model produced, even if it looks incomplete or strange.\n", + "\n", + "---\n", + "\n", + "## 4. Special Case: Policy Numbers\n", + "\n", + "The user may sometimes mention **policy numbers**. These must be handled with extra care.\n", + "\n", + "1. **General rule**\n", + " - Always transcribe the policy number exactly as it was spoken.\n", + "\n", + "2. **Expected pattern**\n", + " - When the policy number fits the pattern `XXXX-XXXX`:\n", + " - `X` can be any letter (A–Z) or digit (0–9).\n", + " - Example: `56B5-12C0`\n", + " - If the user clearly speaks this pattern, preserve it exactly.\n", + "\n", + "3. **Do not “fix” policy numbers**\n", + " - If the spoken policy number does **not** match `XXXX-XXXX` (e.g., different length or missing hyphen), **do not**:\n", + " - Invent missing characters\n", + " - Add or remove hyphens\n", + " - Correct perceived mistakes\n", + " - Transcribe **exactly what was said**, even if it seems malformed.\n", + "\n", + "---\n", + "\n", + "## 5. Punctuation and Casing\n", + "\n", + "1. 
**Punctuation**\n", + " - Use the punctuation that the underlying transcription model naturally produces.\n", + " - Do **not**:\n", + " - Add extra punctuation for clarity or style.\n", + " - Re-punctuate sentences to “improve” them.\n", + " - If the transcription model emits text with **no punctuation**, leave it that way.\n", + "\n", + "2. **Casing**\n", + " - Preserve the casing (uppercase/lowercase) as the model output provides.\n", + " - Do not change “i” to “I” or adjust capitalization at sentence boundaries unless the model already did so.\n", + "\n", + "---\n", + "\n", + "## 6. Output Format Requirements\n", + "\n", + "Your final output must be a **single, plain-text transcript** of the latest user turn.\n", + "\n", + "1. **Single block of text**\n", + " - Output only the transcript content.\n", + " - Do **not** include:\n", + " - Labels (e.g., “Transcript:”, “User said:”)\n", + " - Section headers\n", + " - Bullet points or numbering\n", + " - Markdown formatting or code fences\n", + " - Quotes or extra brackets\n", + "\n", + "2. **No additional commentary**\n", + " - Do not output:\n", + " - Explanations\n", + " - Apologies\n", + " - Notes about uncertainty\n", + " - References to these instructions\n", + " - The output must **only** be the words of the user’s last turn, as transcribed.\n", + "\n", + "3. **Empty turns**\n", + " - If the latest user turn contains **no transcribable content** (e.g., silence, noise, or the transcription model produces an empty string), you must:\n", + " - Return an **empty output** (no text at all).\n", + " - Do **not** insert placeholders like “[silence]”, “[no audio]”, or “(no transcript)”.\n", + "\n", + "---\n", + "\n", + "## 7. What You Must Never Do\n", + "\n", + "1. **No responses or conversation**\n", + " - Do **not**:\n", + " - Address the user.\n", + " - Answer questions.\n", + " - Provide suggestions.\n", + " - Continue or extend the conversation.\n", + "\n", + "2. 
**No mention of rules or prompts**\n",
+    "   - Do **not** refer to:\n",
+    "     - These instructions\n",
+    "     - The system prompt\n",
+    "     - Internal reasoning or process\n",
+    "   - The user should see **only** the transcript of their own speech.\n",
+    "\n",
+    "3. **No multi-turn aggregation**\n",
+    "   - Do not combine the latest user turn with any previous turns.\n",
+    "   - Do not produce summaries or overviews across turns.\n",
+    "\n",
+    "4. **No rewriting or “helpfulness”**\n",
+    "   - Even if the user’s statement appears:\n",
+    "     - Incorrect\n",
+    "     - Confusing\n",
+    "     - Impolite\n",
+    "     - Incomplete\n",
+    "   - Your job is **not** to fix or improve it. Your only job is to **transcribe** it exactly.\n",
+    "\n",
+    "---\n",
+    "\n",
+    "## 8. Summary of Your Role\n",
+    "\n",
+    "- You are **not** a chat assistant.\n",
+    "- You are **not** an editor, summarizer, or interpreter.\n",
+    "- You **are** a **verbatim transcription tool** for the latest user turn.\n",
+    "\n",
+    "Your output must be the **precise, literal, and complete transcript of the most recent user utterance**—with no additional content, no corrections, and no commentary.\n",
+    "\"\"\"\n"
   ]
  },
  {
@@ -212,7 +453,7 @@
 },
 {
  "cell_type": "code",
-   "execution_count": 13,
+   "execution_count": 77,
   "id": "4b952a29",
   "metadata": {},
   "outputs": [
@@ -249,7 +490,7 @@
 },
 {
  "cell_type": "code",
-   "execution_count": 14,
+   "execution_count": 78,
   "id": "7254080a",
   "metadata": {},
   "outputs": [],
@@ -310,7 +551,7 @@
 },
 {
  "cell_type": "code",
-   "execution_count": 15,
+   "execution_count": 91,
   "id": "4baf1870",
   "metadata": {},
   "outputs": [],
@@ -384,7 +625,7 @@
     "        \"conversation\": \"none\",  # <--- out-of-band\n",
     "        \"output_modalities\": [\"text\"],\n",
     "        \"metadata\": {\"purpose\": TRANSCRIPTION_PURPOSE},  # <--- we add metadata so it is easier to identify the event in the logs\n",
-    "        \"instructions\": transcription_instructions,\n",
+    "        \"instructions\": transcription_instructions,  # <--- the dedicated transcription prompt defined above\n",
     "    },\n",
     "    }\n"
    ]
   },
@@ 
-406,7 +647,7 @@ }, { "cell_type": "code", - "execution_count": 16, + "execution_count": 92, "id": "11218bbb", "metadata": {}, "outputs": [], @@ -525,7 +766,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 93, "id": "cb6acbf0", "metadata": {}, "outputs": [], @@ -536,6 +777,7 @@ " pending_prints: deque | None = shared_state.get(\"pending_transcription_prints\")\n", " input_transcripts: deque | None = shared_state.get(\"input_transcripts\")\n", " transcription_model_costs: deque | None = shared_state.get(\"transcription_model_costs\")\n", + " debug_usage_and_cost: bool = bool(shared_state.get(\"debug_usage_and_cost\", False))\n", "\n", " if not pending_prints or not input_transcripts:\n", " return\n", @@ -554,7 +796,7 @@ " if transcription_model_costs:\n", " cost_info = transcription_model_costs.popleft()\n", "\n", - " if cost_info:\n", + " if cost_info and debug_usage_and_cost:\n", " audio_input_cost = cost_info.get(\"audio_input_cost\", 0.0)\n", " text_input_cost = cost_info.get(\"text_input_cost\", 0.0)\n", " text_output_cost = cost_info.get(\"text_output_cost\", 0.0)\n", @@ -594,7 +836,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 94, "id": "d099babd", "metadata": {}, "outputs": [], @@ -622,19 +864,121 @@ " \"transcription_model_costs\", deque()\n", " )\n", "\n", + " debug_usage_and_cost: bool = bool(shared_state.get(\"debug_usage_and_cost\", False))\n", + "\n", " # Pricing constants (USD per 1M tokens). 
See https://platform.openai.com/pricing.\n", " # gpt-4o-transcribe\n", " GPT4O_TRANSCRIBE_AUDIO_INPUT_PRICE_PER_1M = 6.00\n", " GPT4O_TRANSCRIBE_TEXT_INPUT_PRICE_PER_1M = 2.50\n", " GPT4O_TRANSCRIBE_TEXT_OUTPUT_PRICE_PER_1M = 10.00\n", "\n", + " # gpt-realtime\n", " REALTIME_TEXT_INPUT_PRICE_PER_1M = 4\n", " REALTIME_TEXT_CACHED_INPUT_PRICE_PER_1M = 0.4\n", " REALTIME_TEXT_OUTPUT_PRICE_PER_1M = 16.00\n", " REALTIME_AUDIO_INPUT_PRICE_PER_1M = 32.00\n", - " REALTIME_AUDIO_CACHED_INPUT_PRICE_PER_1M = 0.4\n", + " REALTIME_AUDIO_CACHED_INPUT_PRICE_PER_1M = 0.40\n", " REALTIME_AUDIO_OUTPUT_PRICE_PER_1M = 64.00\n", "\n", + " def _compute_transcription_model_cost(usage: dict | None) -> dict | None:\n", + " if not usage:\n", + " return None\n", + "\n", + " input_details = usage.get(\"input_token_details\") or {}\n", + " audio_input_tokens = input_details.get(\"audio_tokens\") or 0\n", + " text_input_tokens = input_details.get(\"text_tokens\") or 0\n", + " output_tokens = usage.get(\"output_tokens\") or 0\n", + "\n", + " audio_input_cost = (\n", + " audio_input_tokens * GPT4O_TRANSCRIBE_AUDIO_INPUT_PRICE_PER_1M\n", + " / 1_000_000\n", + " )\n", + " text_input_cost = (\n", + " text_input_tokens * GPT4O_TRANSCRIBE_TEXT_INPUT_PRICE_PER_1M\n", + " / 1_000_000\n", + " )\n", + " text_output_cost = (\n", + " output_tokens * GPT4O_TRANSCRIBE_TEXT_OUTPUT_PRICE_PER_1M\n", + " / 1_000_000\n", + " )\n", + " total_cost = audio_input_cost + text_input_cost + text_output_cost\n", + "\n", + " return {\n", + " \"audio_input_cost\": audio_input_cost,\n", + " \"text_input_cost\": text_input_cost,\n", + " \"text_output_cost\": text_output_cost,\n", + " \"total_cost\": total_cost,\n", + " \"usage\": usage,\n", + " }\n", + "\n", + " def _compute_realtime_oob_cost(usage: dict | None) -> dict | None:\n", + " if not usage:\n", + " return None\n", + "\n", + " input_details = usage.get(\"input_token_details\") or {}\n", + " output_details = usage.get(\"output_token_details\") or {}\n", + " 
cached_details = input_details.get(\"cached_tokens_details\") or {}\n", + "\n", + " text_input_tokens = input_details.get(\"text_tokens\") or 0\n", + " cached_text_tokens = (\n", + " cached_details.get(\"text_tokens\")\n", + " or input_details.get(\"cached_tokens\")\n", + " or 0\n", + " )\n", + " non_cached_text_input_tokens = max(text_input_tokens - cached_text_tokens, 0)\n", + "\n", + " audio_input_tokens = input_details.get(\"audio_tokens\") or 0\n", + " cached_audio_tokens = cached_details.get(\"audio_tokens\") or 0\n", + " non_cached_audio_input_tokens = max(audio_input_tokens - cached_audio_tokens, 0)\n", + "\n", + " text_output_tokens = output_details.get(\"text_tokens\") or 0\n", + " audio_output_tokens = output_details.get(\"audio_tokens\") or 0\n", + "\n", + " text_input_cost = (\n", + " non_cached_text_input_tokens * REALTIME_TEXT_INPUT_PRICE_PER_1M\n", + " / 1_000_000\n", + " )\n", + " cached_text_input_cost = (\n", + " cached_text_tokens * REALTIME_TEXT_CACHED_INPUT_PRICE_PER_1M\n", + " / 1_000_000\n", + " )\n", + " audio_input_cost = (\n", + " non_cached_audio_input_tokens * REALTIME_AUDIO_INPUT_PRICE_PER_1M\n", + " / 1_000_000\n", + " )\n", + " cached_audio_input_cost = (\n", + " cached_audio_tokens * REALTIME_AUDIO_CACHED_INPUT_PRICE_PER_1M\n", + " / 1_000_000\n", + " )\n", + " text_output_cost = (\n", + " text_output_tokens * REALTIME_TEXT_OUTPUT_PRICE_PER_1M\n", + " / 1_000_000\n", + " )\n", + " audio_output_cost = (\n", + " audio_output_tokens * REALTIME_AUDIO_OUTPUT_PRICE_PER_1M\n", + " / 1_000_000\n", + " )\n", + "\n", + " total_cost = (\n", + " text_input_cost\n", + " + cached_text_input_cost\n", + " + audio_input_cost\n", + " + cached_audio_input_cost\n", + " + text_output_cost\n", + " + audio_output_cost\n", + " )\n", + "\n", + " return {\n", + " \"text_input_cost\": text_input_cost,\n", + " \"cached_text_input_cost\": cached_text_input_cost,\n", + " \"audio_input_cost\": audio_input_cost,\n", + " \"cached_audio_input_cost\": 
cached_audio_input_cost,\n", + " \"text_output_cost\": text_output_cost,\n", + " \"audio_output_cost\": audio_output_cost,\n", + " \"total_cost\": total_cost,\n", + " \"usage\": usage,\n", + " }\n", + "\n", " async for raw in ws:\n", " if stop_event.is_set():\n", " break\n", @@ -690,34 +1034,7 @@ "\n", " # Compute and store cost estimate for the transcription model (e.g., gpt-4o-transcribe).\n", " usage = message.get(\"usage\") or {}\n", - " cost_info: dict | None = None\n", - " if usage:\n", - " input_details = usage.get(\"input_token_details\") or {}\n", - " audio_input_tokens = input_details.get(\"audio_tokens\") or 0\n", - " text_input_tokens = input_details.get(\"text_tokens\") or 0\n", - " output_tokens = usage.get(\"output_tokens\") or 0\n", - "\n", - " audio_input_cost = (\n", - " audio_input_tokens * GPT4O_TRANSCRIBE_AUDIO_INPUT_PRICE_PER_1M\n", - " / 1_000_000\n", - " )\n", - " text_input_cost = (\n", - " text_input_tokens * GPT4O_TRANSCRIBE_TEXT_INPUT_PRICE_PER_1M\n", - " / 1_000_000\n", - " )\n", - " text_output_cost = (\n", - " output_tokens * GPT4O_TRANSCRIBE_TEXT_OUTPUT_PRICE_PER_1M\n", - " / 1_000_000\n", - " )\n", - " total_cost = audio_input_cost + text_input_cost + text_output_cost\n", - "\n", - " cost_info = {\n", - " \"audio_input_cost\": audio_input_cost,\n", - " \"text_input_cost\": text_input_cost,\n", - " \"text_output_cost\": text_output_cost,\n", - " \"total_cost\": total_cost,\n", - " }\n", - "\n", + " cost_info = _compute_transcription_model_cost(usage)\n", " transcription_model_costs.append(cost_info)\n", "\n", " final_text = (final_text or \"\").strip()\n", @@ -783,76 +1100,24 @@ " usage = response.get(\"usage\") or {}\n", " oob_cost_info: dict | None = None\n", " if usage and is_transcription:\n", - " input_details = usage.get(\"input_token_details\") or {}\n", - " output_details = usage.get(\"output_token_details\") or {}\n", - " cached_details = input_details.get(\"cached_tokens_details\") or {}\n", - "\n", - " text_input_tokens = 
input_details.get(\"text_tokens\") or 0\n", - " cached_text_tokens = (\n", - " cached_details.get(\"text_tokens\")\n", - " or input_details.get(\"cached_tokens\")\n", - " or 0\n", - " )\n", - " non_cached_text_input_tokens = max(\n", - " text_input_tokens - cached_text_tokens, 0\n", - " )\n", - " audio_input_tokens = input_details.get(\"audio_tokens\") or 0\n", - "\n", - " text_output_tokens = output_details.get(\"text_tokens\") or 0\n", - " audio_output_tokens = output_details.get(\"audio_tokens\") or 0\n", - "\n", - " text_input_cost = (\n", - " non_cached_text_input_tokens * REALTIME_TEXT_INPUT_PRICE_PER_1M\n", - " / 1_000_000\n", - " )\n", - " cached_text_input_cost = (\n", - " cached_text_tokens * REALTIME_TEXT_CACHED_INPUT_PRICE_PER_1M\n", - " / 1_000_000\n", - " )\n", - " audio_input_cost = (\n", - " audio_input_tokens * REALTIME_AUDIO_INPUT_PRICE_PER_1M\n", - " / 1_000_000\n", - " )\n", - " text_output_cost = (\n", - " text_output_tokens * REALTIME_TEXT_OUTPUT_PRICE_PER_1M\n", - " / 1_000_000\n", - " )\n", - " audio_output_cost = (\n", - " audio_output_tokens * REALTIME_AUDIO_OUTPUT_PRICE_PER_1M\n", - " / 1_000_000\n", - " )\n", - "\n", - " total_cost = (\n", - " text_input_cost\n", - " + cached_text_input_cost\n", - " + audio_input_cost\n", - " + text_output_cost\n", - " + audio_output_cost\n", - " )\n", - "\n", - " oob_cost_info = {\n", - " \"text_input_cost\": text_input_cost,\n", - " \"cached_text_input_cost\": cached_text_input_cost,\n", - " \"audio_input_cost\": audio_input_cost,\n", - " \"text_output_cost\": text_output_cost,\n", - " \"audio_output_cost\": audio_output_cost,\n", - " \"total_cost\": total_cost,\n", - " }\n", + " oob_cost_info = _compute_realtime_oob_cost(usage)\n", "\n", " text = buffers.get(response_id, \"\").strip()\n", " if text:\n", " if is_transcription:\n", " print(\"\\n=== User turn (Realtime transcript) ===\")\n", " print(text, flush=True)\n", - " if usage:\n", - " print(\"[Realtime out-of-band transcription usage]\")\n", - " 
print(json.dumps(usage, indent=2))\n", - " if oob_cost_info:\n", + " if debug_usage_and_cost and oob_cost_info:\n", + " usage_for_print = oob_cost_info.get(\"usage\")\n", + " if usage_for_print:\n", + " print(\"[Realtime out-of-band transcription usage]\")\n", + " print(json.dumps(usage_for_print, indent=2))\n", " print(\n", " \"[Realtime out-of-band transcription cost estimate] \"\n", " f\"text_in=${oob_cost_info['text_input_cost']:.6f}, \"\n", " f\"text_in_cached=${oob_cost_info['cached_text_input_cost']:.6f}, \"\n", " f\"audio_in=${oob_cost_info['audio_input_cost']:.6f}, \"\n", + " f\"audio_in_cached=${oob_cost_info['cached_audio_input_cost']:.6f}, \"\n", " f\"text_out=${oob_cost_info['text_output_cost']:.6f}, \"\n", " f\"audio_out=${oob_cost_info['audio_output_cost']:.6f}, \"\n", " f\"total=${oob_cost_info['total_cost']:.6f}\",\n", @@ -862,6 +1127,9 @@ " pending_transcription_prints.append(object())\n", " flush_pending_transcription_prints(shared_state)\n", " else:\n", + " if debug_usage_and_cost and usage:\n", + " print(\"[Realtime usage]\")\n", + " print(json.dumps(usage, indent=2))\n", " print(\"\\n=== Assistant response ===\")\n", " print(text, flush=True)\n", " print()\n", @@ -921,7 +1189,7 @@ }, { "cell_type": "code", - "execution_count": 19, + "execution_count": 95, "id": "35c4d7b5", "metadata": {}, "outputs": [], @@ -940,6 +1208,7 @@ " idle_timeout_ms: int | None = None,\n", " max_turns: int | None = None,\n", " timeout_seconds: int = 0,\n", + " debug_usage_and_cost: bool = True,\n", ") -> None:\n", " \"\"\"Connect to the Realtime API, stream audio both ways, and print transcripts.\"\"\"\n", " api_key = api_key or os.environ.get(\"OPENAI_API_KEY\")\n", @@ -963,6 +1232,7 @@ " \"mute_mic\": False,\n", " \"input_transcripts\": deque(),\n", " \"pending_transcription_prints\": deque(),\n", + " \"debug_usage_and_cost\": debug_usage_and_cost,\n", " }\n", "\n", " async with websockets.connect(\n", @@ -1005,8 +1275,96 @@ }, { "cell_type": "code", - 
"execution_count": 20, - "id": "32401814", + "execution_count": 96, + "id": "b397b67e", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Streaming microphone audio at 24000 Hz (mono). Speak naturally; server VAD will stop listening when you pause.\n", + "\n", + "[client] Speech detected; streaming...\n", + "[client] Detected silence; preparing transcript...\n", + "\n", + "=== User turn (Realtime transcript) ===\n", + "Hello!\n", + "[Realtime out-of-band transcription usage]\n", + "{\n", + " \"total_tokens\": 10,\n", + " \"input_tokens\": 6,\n", + " \"output_tokens\": 4,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 6,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 0,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 4,\n", + " \"audio_tokens\": 0\n", + " }\n", + "}\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.000024, text_in_cached=$0.000000, audio_in=$0.000000, audio_in_cached=$0.000000, text_out=$0.000064, audio_out=$0.000000, total=$0.000088\n", + "\n", + "=== User turn (Transcription model) ===\n", + "Hello\n", + "[Transcription model usage]\n", + "{\n", + " \"type\": \"tokens\",\n", + " \"total_tokens\": 22,\n", + " \"input_tokens\": 19,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 19\n", + " },\n", + " \"output_tokens\": 3\n", + "}\n", + "[Transcription model cost estimate] audio_in=$0.000114, text_in=$0.000000, text_out=$0.000030, total=$0.000144\n", + "\n", + "[Realtime usage]\n", + "{\n", + " \"total_tokens\": 1305,\n", + " \"input_tokens\": 1051,\n", + " \"output_tokens\": 254,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 1032,\n", + " \"audio_tokens\": 19,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 0,\n", + " 
\"cached_tokens_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 55,\n", + " \"audio_tokens\": 199\n", + " }\n", + "}\n", + "\n", + "=== Assistant response ===\n", + "Thank you for calling OpenAI Insurance Claims. My name is Alex, and I’ll help you file your claim today. May I please have your full legal name as it appears on your policy?\n", + "\n", + "Session cancelled; closing.\n" + ] + } + ], + "source": [ + "await run_realtime_session(debug_usage_and_cost=True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c9a2a33b", "metadata": {}, "outputs": [ { @@ -1020,43 +1378,37 @@ "\n", "=== User turn (Realtime transcript) ===\n", "Hello.\n", - "[Realtime out-of-band transcription cost estimate] text_in=$0.001192, text_in_cached=$0.000000, audio_in=$0.000000, text_out=$0.000064, audio_out=$0.000000, total=$0.001256\n", "\n", "=== User turn (Transcription model) ===\n", "Hello\n", - "[Transcription model cost estimate] audio_in=$0.000096, text_in=$0.000000, text_out=$0.000030, total=$0.000126\n", "\n", "\n", "=== Assistant response ===\n", - "Hello! I’m here to help you file an insurance claim. Let’s start with the basics. Could you please tell me your full name?\n", + "Hello! Let's get started with your claim. 
Can you tell me your full name, please?\n", "\n", "\n", "[client] Speech detected; streaming...\n", "[client] Detected silence; preparing transcript...\n", "\n", "=== User turn (Realtime transcript) ===\n", - "My full name is M I N H A J U L H O Q U E\n", - "[Realtime out-of-band transcription cost estimate] text_in=$0.000132, text_in_cached=$0.000128, audio_in=$0.000864, text_out=$0.000304, audio_out=$0.000000, total=$0.001428\n", + "My name is M I N H A J U L H O Q U E\n", "\n", "=== User turn (Transcription model) ===\n", - "My full name is Minhajul Hoque.\n", - "[Transcription model cost estimate] audio_in=$0.000384, text_in=$0.000000, text_out=$0.000120, total=$0.000504\n", + "My name is Minhajul Hoque.\n", "\n", "\n", "=== Assistant response ===\n", - "Thank you. I heard your full name is Minhajul Hoque. Could you please confirm that I got it right?\n", + "Thank you. Just to confirm, I heard your full name as Minhajul Hoque. Is that correct?\n", "\n", "\n", "[client] Speech detected; streaming...\n", "[client] Detected silence; preparing transcript...\n", "\n", "=== User turn (Realtime transcript) ===\n", - "Yup, you got it right exactly.\n", - "[Realtime out-of-band transcription cost estimate] text_in=$0.001552, text_in_cached=$0.000000, audio_in=$0.001600, text_out=$0.000176, audio_out=$0.000000, total=$0.003328\n", + "Yep.\n", "\n", "=== User turn (Transcription model) ===\n", - "Yep, you got it right exactly.\n", - "[Transcription model cost estimate] audio_in=$0.000240, text_in=$0.000000, text_out=$0.000100, total=$0.000340\n", + "Yep.\n", "\n", "\n", "=== Assistant response ===\n", @@ -1066,58 +1418,85 @@ "[client] Speech detected; streaming...\n", "[client] Detected silence; preparing transcript...\n", "\n", + "=== User turn (Realtime transcript) ===\n", + "My policy number is X077-B025.\n", + "\n", + "=== User turn (Transcription model) ===\n", + "My policy number is X077B025.\n", + "\n", + "\n", + "=== Assistant response ===\n", + "Thank you. 
Let me confirm: I have your policy number as X077B025. Is that correct?\n", + "\n", + "\n", "[client] Speech detected; streaming...\n", + "[client] Detected silence; preparing transcript...\n", + "\n", + "=== Assistant response ===\n", + "Of course. Your full name is Minhajul Hoque. Now, let’s move on. What type of accident are you reporting—auto, home, or something else?\n", + "\n", "\n", "=== User turn (Realtime transcript) ===\n", - "D-\n", - "[Realtime out-of-band transcription cost estimate] text_in=$0.001672, text_in_cached=$0.000000, audio_in=$0.001120, text_out=$0.000064, audio_out=$0.000000, total=$0.002856\n", + "Yeah, can you ask me my name again?\n", "\n", "=== User turn (Transcription model) ===\n", - "Доб.\n", - "[Transcription model cost estimate] audio_in=$0.000066, text_in=$0.000000, text_out=$0.000040, total=$0.000106\n", + "Can you ask me my name again?\n", "\n", + "\n", + "[client] Speech detected; streaming...\n", "[client] Detected silence; preparing transcript...\n", "\n", "=== User turn (Realtime transcript) ===\n", - "P022-4567\n", - "[Realtime out-of-band transcription cost estimate] text_in=$0.000168, text_in_cached=$0.000154, audio_in=$0.004192, text_out=$0.000112, audio_out=$0.000000, total=$0.004626\n", + "No, can you ask me my name again, this is important.\n", "\n", "=== User turn (Transcription model) ===\n", - "policy number is something like P022456.\n", - "[Transcription model cost estimate] audio_in=$0.000576, text_in=$0.000000, text_out=$0.000110, total=$0.000686\n", + "No, can you ask me by name again?\n", "\n", "\n", "=== Assistant response ===\n", - "Thank you. I heard your policy number is P-0-2-2-4-5-6. Could you confirm if that’s correct?\n", + "Understood. Let me repeat your full name again to confirm. Your name is Minhajul Hoque. 
Is that correct?\n",
     "\n",
     "\n",
     "[client] Speech detected; streaming...\n",
     "[client] Detected silence; preparing transcript...\n",
     "\n",
     "=== User turn (Realtime transcript) ===\n",
-    "it...\n",
-    "[Realtime out-of-band transcription cost estimate] text_in=$0.001876, text_in_cached=$0.000000, audio_in=$0.006112, text_out=$0.000064, audio_out=$0.000000, total=$0.008052\n",
+    "My name is Minhajul Hoque.\n",
     "\n",
     "=== User turn (Transcription model) ===\n",
-    "It\n",
-    "[Transcription model cost estimate] audio_in=$0.000072, text_in=$0.000000, text_out=$0.000030, total=$0.000102\n",
-    "\n",
-    "\n",
-    "=== Assistant response ===\n",
-    "Thank you for confirming. Now, could you tell me the type of accident? For example, is it auto, home, or something else?\n",
+    "My name is Minhaj ul Haq.\n",
     "\n",
     "Session cancelled; closing.\n"
    ]
   }
  ],
  "source": [
-    "await run_realtime_session()"
+    "await run_realtime_session(debug_usage_and_cost=False)"
+  ]
+ },
+ {
+  "cell_type": "markdown",
+  "id": "efabdbf5",
+  "metadata": {},
+  "source": [
+    "From the example above, we can observe:\n",
+    "- The Realtime model's transcription quality matches or surpasses that of the dedicated transcription model across several turns. In one turn, the transcription model drops \"this is important.\" while the Realtime transcription captures it correctly.\n",
+    "- The Realtime model correctly applies the policy number formatting rule (XXXX-XXXX).\n",
+    "- With context from the entire session, including earlier turns where I spelled out my name, the Realtime model transcribes my name accurately when the assistant asks for it again, while the transcription model makes errors (e.g., \"Minhaj ul Haq\")."
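+    "\n",
+    "As a rough sketch (not part of this notebook's pipeline), differences like these can also be surfaced automatically by scoring the two transcripts of the same turn against each other with the standard library; the example strings below are taken from the session printed above:\n",
+    "\n",
+    "```python\n",
+    "import difflib\n",
+    "\n",
+    "# Transcripts of the same user turn from the two sources.\n",
+    "realtime_text = \"No, can you ask me my name again, this is important.\"\n",
+    "transcription_model_text = \"No, can you ask me by name again?\"\n",
+    "\n",
+    "# Similarity in [0, 1]; low scores flag turns worth reviewing by hand.\n",
+    "matcher = difflib.SequenceMatcher(None, realtime_text, transcription_model_text)\n",
+    "print(round(matcher.ratio(), 2))\n",
+    "```\n",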
+ ] + }, + { + "cell_type": "markdown", + "id": "bd1f343b", + "metadata": {}, + "source": [ + "## Example with Cost Calculations" ] }, { "cell_type": "code", - "execution_count": 29, - "id": "c9a2a33b", + "execution_count": 57, + "id": "32401814", "metadata": {}, "outputs": [ { @@ -1131,83 +1510,917 @@ "\n", "=== User turn (Realtime transcript) ===\n", "Hello.\n", + "[Realtime out-of-band transcription usage]\n", + "{\n", + " \"total_tokens\": 307,\n", + " \"input_tokens\": 303,\n", + " \"output_tokens\": 4,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 303,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 0,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 4,\n", + " \"audio_tokens\": 0\n", + " }\n", + "}\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.001212, text_in_cached=$0.000000, audio_in=$0.000000, audio_in_cached=$0.000000, text_out=$0.000064, audio_out=$0.000000, total=$0.001276\n", "\n", "=== User turn (Transcription model) ===\n", - "Hello\n", + "Hello.\n", + "[Transcription model usage]\n", + "{\n", + " \"type\": \"tokens\",\n", + " \"total_tokens\": 20,\n", + " \"input_tokens\": 16,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 16\n", + " },\n", + " \"output_tokens\": 4\n", + "}\n", + "[Transcription model cost estimate] audio_in=$0.000096, text_in=$0.000000, text_out=$0.000040, total=$0.000136\n", + "\n", + "[Realtime usage]\n", + "{\n", + " \"total_tokens\": 295,\n", + " \"input_tokens\": 167,\n", + " \"output_tokens\": 128,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 151,\n", + " \"audio_tokens\": 16,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 0,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 0,\n", + " 
\"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 31,\n", + " \"audio_tokens\": 97\n", + " }\n", + "}\n", + "\n", + "=== Assistant response ===\n", + "Hello! Let’s get started with your claim. Could you please tell me your full name?\n", + "\n", "\n", + "[client] Speech detected; streaming...\n", + "[client] Detected silence; preparing transcript...\n", + "\n", + "=== User turn (Realtime transcript) ===\n", + "My full name is Minhajul Haq.\n", + "[Realtime out-of-band transcription usage]\n", + "{\n", + " \"total_tokens\": 356,\n", + " \"input_tokens\": 344,\n", + " \"output_tokens\": 12,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 344,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 320,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 320,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 12,\n", + " \"audio_tokens\": 0\n", + " }\n", + "}\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.000096, text_in_cached=$0.000128, audio_in=$0.000000, audio_in_cached=$0.000000, text_out=$0.000192, audio_out=$0.000000, total=$0.000416\n", + "\n", + "=== User turn (Transcription model) ===\n", + "My full name is Minhajul Haq.\n", + "[Transcription model usage]\n", + "{\n", + " \"type\": \"tokens\",\n", + " \"total_tokens\": 49,\n", + " \"input_tokens\": 37,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 37\n", + " },\n", + " \"output_tokens\": 12\n", + "}\n", + "[Transcription model cost estimate] audio_in=$0.000222, text_in=$0.000000, text_out=$0.000120, total=$0.000342\n", + "\n", + "[Realtime usage]\n", + "{\n", + " \"total_tokens\": 503,\n", + " \"input_tokens\": 342,\n", + " \"output_tokens\": 161,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 192,\n", + " \"audio_tokens\": 150,\n", + " 
\"image_tokens\": 0,\n", + " \"cached_tokens\": 320,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 192,\n", + " \"audio_tokens\": 128,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 37,\n", + " \"audio_tokens\": 124\n", + " }\n", + "}\n", "\n", "=== Assistant response ===\n", - "Hello! Let's get started with your claim. Can you tell me your full name, please?\n", + "Thank you. I’ve got Minhajul Haq. Now, could you provide your policy number, please?\n", "\n", "\n", "[client] Speech detected; streaming...\n", "[client] Detected silence; preparing transcript...\n", "\n", "=== User turn (Realtime transcript) ===\n", - "My name is M I N H A J U L H O Q U E\n", + "My policy number is 00X7-B725.\n", + "[Realtime out-of-band transcription usage]\n", + "{\n", + " \"total_tokens\": 446,\n", + " \"input_tokens\": 433,\n", + " \"output_tokens\": 13,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 381,\n", + " \"audio_tokens\": 52,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 0,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 13,\n", + " \"audio_tokens\": 0\n", + " }\n", + "}\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.001524, text_in_cached=$0.000000, audio_in=$0.001664, audio_in_cached=$0.000000, text_out=$0.000208, audio_out=$0.000000, total=$0.003396\n", "\n", "=== User turn (Transcription model) ===\n", - "My name is Minhajul Hoque.\n", + "My policy number is 00X7B725.\n", + "[Transcription model usage]\n", + "{\n", + " \"type\": \"tokens\",\n", + " \"total_tokens\": 85,\n", + " \"input_tokens\": 72,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 72\n", + " },\n", + " \"output_tokens\": 13\n", + "}\n", + "[Transcription model cost estimate] 
audio_in=$0.000432, text_in=$0.000000, text_out=$0.000130, total=$0.000562\n", + "\n", + "[Realtime usage]\n", + "{\n", + " \"total_tokens\": 755,\n", + " \"input_tokens\": 478,\n", + " \"output_tokens\": 277,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 229,\n", + " \"audio_tokens\": 249,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 0,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 51,\n", + " \"audio_tokens\": 226\n", + " }\n", + "}\n", + "\n", + "=== Assistant response ===\n", + "Thank you. I have the policy number as 00X7-B725. Let’s confirm: 00X7-B725. Is that correct?\n", "\n", "\n", + "[client] Speech detected; streaming...\n", + "[client] Detected silence; preparing transcript...\n", + "\n", + "=== User turn (Realtime transcript) ===\n", + "My full name is Minhajul Haq.\n", + "[Realtime out-of-band transcription usage]\n", + "{\n", + " \"total_tokens\": 677,\n", + " \"input_tokens\": 665,\n", + " \"output_tokens\": 12,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 430,\n", + " \"audio_tokens\": 235,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 0,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 12,\n", + " \"audio_tokens\": 0\n", + " }\n", + "}\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.001720, text_in_cached=$0.000000, audio_in=$0.007520, audio_in_cached=$0.000000, text_out=$0.000192, audio_out=$0.000000, total=$0.009432\n", + "\n", + "=== User turn (Transcription model) ===\n", + "Yeah, that's pretty much correct. That's pretty good, but I think you got my name wrong. 
Can you ask me again?\n", + "[Transcription model usage]\n", + "{\n", + " \"type\": \"tokens\",\n", + " \"total_tokens\": 109,\n", + " \"input_tokens\": 81,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 81\n", + " },\n", + " \"output_tokens\": 28\n", + "}\n", + "[Transcription model cost estimate] audio_in=$0.000486, text_in=$0.000000, text_out=$0.000280, total=$0.000766\n", + "\n", + "[Realtime usage]\n", + "{\n", + " \"total_tokens\": 857,\n", + " \"input_tokens\": 710,\n", + " \"output_tokens\": 147,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 278,\n", + " \"audio_tokens\": 432,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 0,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 37,\n", + " \"audio_tokens\": 110\n", + " }\n", + "}\n", + "\n", "=== Assistant response ===\n", - "Thank you. Just to confirm, I heard your full name as Minhajul Hoque. Is that correct?\n", + "Of course, let’s make sure we get it right. 
Could you please repeat your full name clearly for me?\n", "\n", "\n", "[client] Speech detected; streaming...\n", "[client] Detected silence; preparing transcript...\n", "\n", "=== User turn (Realtime transcript) ===\n", - "Yep.\n", + "Yeah, sure, M-I-N-H-A-J-U-L H-O-Q-U-E.\n", + "[Realtime out-of-band transcription usage]\n", + "{\n", + " \"total_tokens\": 645,\n", + " \"input_tokens\": 625,\n", + " \"output_tokens\": 20,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 461,\n", + " \"audio_tokens\": 164,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 0,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 20,\n", + " \"audio_tokens\": 0\n", + " }\n", + "}\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.001844, text_in_cached=$0.000000, audio_in=$0.005248, audio_in_cached=$0.000000, text_out=$0.000320, audio_out=$0.000000, total=$0.007412\n", "\n", "=== User turn (Transcription model) ===\n", - "Yep.\n", + "Niajer Miahjul Hoque\n", + "[Transcription model usage]\n", + "{\n", + " \"type\": \"tokens\",\n", + " \"total_tokens\": 55,\n", + " \"input_tokens\": 45,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 45\n", + " },\n", + " \"output_tokens\": 10\n", + "}\n", + "[Transcription model cost estimate] audio_in=$0.000270, text_in=$0.000000, text_out=$0.000100, total=$0.000370\n", + "\n", + "[Realtime usage]\n", + "{\n", + " \"total_tokens\": 865,\n", + " \"input_tokens\": 670,\n", + " \"output_tokens\": 195,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 309,\n", + " \"audio_tokens\": 361,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 0,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": 
{\n", + " \"text_tokens\": 46,\n", + " \"audio_tokens\": 149\n", + " }\n", + "}\n", + "\n", + "=== Assistant response ===\n", + "Thank you for spelling that out. I have it now as Minhajul Hoque. Let’s confirm: Minhajul Hoque. Is that correct?\n", + "\n", + "\n", + "[client] Speech detected; streaming...\n", + "[client] Detected silence; preparing transcript...\n", "\n", + "=== User turn (Realtime transcript) ===\n", + "Yep, that's correct.\n", + "[Realtime out-of-band transcription usage]\n", + "{\n", + " \"total_tokens\": 736,\n", + " \"input_tokens\": 729,\n", + " \"output_tokens\": 7,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 505,\n", + " \"audio_tokens\": 224,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 0,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 7,\n", + " \"audio_tokens\": 0\n", + " }\n", + "}\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.002020, text_in_cached=$0.000000, audio_in=$0.007168, audio_in_cached=$0.000000, text_out=$0.000112, audio_out=$0.000000, total=$0.009300\n", + "\n", + "=== User turn (Transcription model) ===\n", + "Yep, that's correct.\n", + "[Transcription model usage]\n", + "{\n", + " \"type\": \"tokens\",\n", + " \"total_tokens\": 28,\n", + " \"input_tokens\": 21,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 21\n", + " },\n", + " \"output_tokens\": 7\n", + "}\n", + "[Transcription model cost estimate] audio_in=$0.000126, text_in=$0.000000, text_out=$0.000070, total=$0.000196\n", + "\n", + "[Realtime usage]\n", + "{\n", + " \"total_tokens\": 1011,\n", + " \"input_tokens\": 774,\n", + " \"output_tokens\": 237,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 353,\n", + " \"audio_tokens\": 421,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 0,\n", + " 
\"cached_tokens_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 57,\n", + " \"audio_tokens\": 180\n", + " }\n", + "}\n", "\n", "=== Assistant response ===\n", - "Great, thank you for confirming. Now, could you provide your policy number, please?\n", + "Great, we’ve got your name as Minhajul Hoque. Now, let’s move on. What’s the type of accident? For example, is it auto, home, or something else?\n", "\n", "\n", "[client] Speech detected; streaming...\n", "[client] Detected silence; preparing transcript...\n", "\n", "=== User turn (Realtime transcript) ===\n", - "My policy number is X077-B025.\n", + "It’s like… auto, for sure.\n", + "[Realtime out-of-band transcription usage]\n", + "{\n", + " \"total_tokens\": 867,\n", + " \"input_tokens\": 856,\n", + " \"output_tokens\": 11,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 560,\n", + " \"audio_tokens\": 296,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 0,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 11,\n", + " \"audio_tokens\": 0\n", + " }\n", + "}\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.002240, text_in_cached=$0.000000, audio_in=$0.009472, audio_in_cached=$0.000000, text_out=$0.000176, audio_out=$0.000000, total=$0.011888\n", "\n", "=== User turn (Transcription model) ===\n", - "My policy number is X077B025.\n", + "It's like auto for sure.\n", + "[Transcription model usage]\n", + "{\n", + " \"type\": \"tokens\",\n", + " \"total_tokens\": 49,\n", + " \"input_tokens\": 41,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 41\n", + " },\n", + " \"output_tokens\": 8\n", + "}\n", + "[Transcription model cost estimate] audio_in=$0.000246, 
text_in=$0.000000, text_out=$0.000080, total=$0.000326\n", + "\n", + "[Realtime usage]\n", + "{\n", + " \"total_tokens\": 1324,\n", + " \"input_tokens\": 901,\n", + " \"output_tokens\": 423,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 408,\n", + " \"audio_tokens\": 493,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 0,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 88,\n", + " \"audio_tokens\": 335\n", + " }\n", + "}\n", + "\n", + "=== Assistant response ===\n", + "Thank you. So the type of accident is auto. Let’s confirm: we have your full name as Minhajul Hoque, your policy number as 00X7-B725, and the accident type is auto. Let’s move on to a few yes/no questions. \n", + "\n", + "First, was anyone injured?\n", + "\n", + "Session cancelled; closing.\n" + ] + } + ], + "source": [ + "await run_realtime_session(debug_usage_and_cost=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 75, + "id": "26fa9399", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Streaming microphone audio at 24000 Hz (mono). 
Speak naturally; server VAD will stop listening when you pause.\n", + "\n", + "[client] Speech detected; streaming...\n", + "[client] Detected silence; preparing transcript...\n", "\n", + "=== User turn (Realtime transcript) ===\n", + "Hello, I'm trying to do an example.\n", + "[Realtime out-of-band transcription usage]\n", + "{\n", + " \"total_tokens\": 732,\n", + " \"input_tokens\": 721,\n", + " \"output_tokens\": 11,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 721,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 0,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 11,\n", + " \"audio_tokens\": 0\n", + " }\n", + "}\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.002884, text_in_cached=$0.000000, audio_in=$0.000000, audio_in_cached=$0.000000, text_out=$0.000176, audio_out=$0.000000, total=$0.003060\n", + "\n", + "=== User turn (Transcription model) ===\n", + "Hello, I'm trying to do an example.\n", + "[Transcription model usage]\n", + "{\n", + " \"type\": \"tokens\",\n", + " \"total_tokens\": 42,\n", + " \"input_tokens\": 31,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 31\n", + " },\n", + " \"output_tokens\": 11\n", + "}\n", + "[Transcription model cost estimate] audio_in=$0.000186, text_in=$0.000000, text_out=$0.000110, total=$0.000296\n", + "\n", + "[Realtime usage]\n", + "{\n", + " \"total_tokens\": 1308,\n", + " \"input_tokens\": 1063,\n", + " \"output_tokens\": 245,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 1032,\n", + " \"audio_tokens\": 31,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 0,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": 
{\n", + " \"text_tokens\": 56,\n", + " \"audio_tokens\": 189\n", + " }\n", + "}\n", "\n", "=== Assistant response ===\n", - "Thank you. Let me confirm: I have your policy number as X077B025. Is that correct?\n", + "Thank you for calling OpenAI Insurance Claims. My name is Alex, and I’ll help you file your claim today. Let’s start by getting your full legal name as it appears on your policy.\n", "\n", "\n", "[client] Speech detected; streaming...\n", "[client] Detected silence; preparing transcript...\n", "\n", + "=== User turn (Realtime transcript) ===\n", + "Sounds good, my full legal name would be M I N H H A J U L H O Q U E\n", + "[Realtime out-of-band transcription usage]\n", + "{\n", + " \"total_tokens\": 892,\n", + " \"input_tokens\": 867,\n", + " \"output_tokens\": 25,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 787,\n", + " \"audio_tokens\": 80,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 704,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 704,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 25,\n", + " \"audio_tokens\": 0\n", + " }\n", + "}\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.000332, text_in_cached=$0.000282, audio_in=$0.002560, audio_in_cached=$0.000000, text_out=$0.000400, audio_out=$0.000000, total=$0.003574\n", + "\n", + "=== User turn (Transcription model) ===\n", + "Sounds good. 
My full legal name would be Minhajul Hoque.\n", + "[Transcription model usage]\n", + "{\n", + " \"type\": \"tokens\",\n", + " \"total_tokens\": 74,\n", + " \"input_tokens\": 57,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 57\n", + " },\n", + " \"output_tokens\": 17\n", + "}\n", + "[Transcription model cost estimate] audio_in=$0.000342, text_in=$0.000000, text_out=$0.000170, total=$0.000512\n", + "\n", + "[Realtime usage]\n", + "{\n", + " \"total_tokens\": 1533,\n", + " \"input_tokens\": 1375,\n", + " \"output_tokens\": 158,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 1098,\n", + " \"audio_tokens\": 277,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 1280,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 1088,\n", + " \"audio_tokens\": 192,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 42,\n", + " \"audio_tokens\": 116\n", + " }\n", + "}\n", + "\n", "=== Assistant response ===\n", - "Of course. Your full name is Minhajul Hoque. Now, let’s move on. What type of accident are you reporting—auto, home, or something else?\n", + "Thank you, Minhajul Hoque. 
Could you please spell that out for me, just to make sure I have it exactly right?\n", "\n", "\n", + "[client] Speech detected; streaming...\n", + "[client] Detected silence; preparing transcript...\n", + "\n", "=== User turn (Realtime transcript) ===\n", - "Yeah, can you ask me my name again?\n", + "Yeah, sure, it would be M I N H A J U L H O Q U E\n", + "[Realtime out-of-band transcription usage]\n", + "{\n", + " \"total_tokens\": 939,\n", + " \"input_tokens\": 917,\n", + " \"output_tokens\": 22,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 825,\n", + " \"audio_tokens\": 92,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 704,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 704,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 22,\n", + " \"audio_tokens\": 0\n", + " }\n", + "}\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.000484, text_in_cached=$0.000282, audio_in=$0.002944, audio_in_cached=$0.000000, text_out=$0.000352, audio_out=$0.000000, total=$0.004062\n", "\n", "=== User turn (Transcription model) ===\n", - "Can you ask me my name again?\n", + "Yeah sure, it would be m i n h a j a u l h o q u e.\n", + "[Transcription model usage]\n", + "{\n", + " \"type\": \"tokens\",\n", + " \"total_tokens\": 108,\n", + " \"input_tokens\": 85,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 85\n", + " },\n", + " \"output_tokens\": 23\n", + "}\n", + "[Transcription model cost estimate] audio_in=$0.000510, text_in=$0.000000, text_out=$0.000230, total=$0.000740\n", + "\n", + "[Realtime usage]\n", + "{\n", + " \"total_tokens\": 1681,\n", + " \"input_tokens\": 1425,\n", + " \"output_tokens\": 256,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 1136,\n", + " \"audio_tokens\": 289,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 832,\n", + " \"cached_tokens_details\": 
{\n", + " \"text_tokens\": 832,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 54,\n", + " \"audio_tokens\": 202\n", + " }\n", + "}\n", + "\n", + "=== Assistant response ===\n", + "Thank you for spelling that. Just to confirm, I have M-I-N-H-A-J-U-L, and the last name is H-O-Q-U-E. Is that correct?\n", "\n", "\n", "[client] Speech detected; streaming...\n", "[client] Detected silence; preparing transcript...\n", "\n", "=== User turn (Realtime transcript) ===\n", - "No, can you ask me my name again, this is important.\n", + "Yep, that's correct, let's continue.\n", + "[Realtime out-of-band transcription usage]\n", + "{\n", + " \"total_tokens\": 1091,\n", + " \"input_tokens\": 1081,\n", + " \"output_tokens\": 10,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 877,\n", + " \"audio_tokens\": 204,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 704,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 704,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 10,\n", + " \"audio_tokens\": 0\n", + " }\n", + "}\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.000692, text_in_cached=$0.000282, audio_in=$0.006528, audio_in_cached=$0.000000, text_out=$0.000160, audio_out=$0.000000, total=$0.007662\n", "\n", "=== User turn (Transcription model) ===\n", - "No, can you ask me by name again?\n", + "Yep, that's correct, let's continue.\n", + "[Transcription model usage]\n", + "{\n", + " \"type\": \"tokens\",\n", + " \"total_tokens\": 36,\n", + " \"input_tokens\": 26,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 26\n", + " },\n", + " \"output_tokens\": 10\n", + "}\n", + "[Transcription model cost estimate] audio_in=$0.000156, text_in=$0.000000, text_out=$0.000100, total=$0.000256\n", + "\n", + "[Realtime usage]\n", 
+ "{\n", + " \"total_tokens\": 1715,\n", + " \"input_tokens\": 1589,\n", + " \"output_tokens\": 126,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 1188,\n", + " \"audio_tokens\": 401,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 832,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 832,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 31,\n", + " \"audio_tokens\": 95\n", + " }\n", + "}\n", "\n", + "=== Assistant response ===\n", + "Great. Now let’s gather your policy number. Could you provide that for me, please?\n", + "\n", + "\n", + "[client] Speech detected; streaming...\n", + "[client] Detected silence; preparing transcript...\n", + "\n", + "=== User turn (Realtime transcript) ===\n", + "Yeah, my policy number is P075-BB72\n", + "[Realtime out-of-band transcription usage]\n", + "{\n", + " \"total_tokens\": 1079,\n", + " \"input_tokens\": 1066,\n", + " \"output_tokens\": 13,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 902,\n", + " \"audio_tokens\": 164,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 704,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 704,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 13,\n", + " \"audio_tokens\": 0\n", + " }\n", + "}\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.000792, text_in_cached=$0.000282, audio_in=$0.005248, audio_in_cached=$0.000000, text_out=$0.000208, audio_out=$0.000000, total=$0.006530\n", + "\n", + "=== User turn (Transcription model) ===\n", + "Yeah, my policy number is P075-BB72.\n", + "[Transcription model usage]\n", + "{\n", + " \"type\": \"tokens\",\n", + " \"total_tokens\": 81,\n", + " \"input_tokens\": 67,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 67\n", + " },\n", + " 
\"output_tokens\": 14\n", + "}\n", + "[Transcription model cost estimate] audio_in=$0.000402, text_in=$0.000000, text_out=$0.000140, total=$0.000542\n", + "\n", + "[Realtime usage]\n", + "{\n", + " \"total_tokens\": 1757,\n", + " \"input_tokens\": 1574,\n", + " \"output_tokens\": 183,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 1213,\n", + " \"audio_tokens\": 361,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 832,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 832,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 35,\n", + " \"audio_tokens\": 148\n", + " }\n", + "}\n", "\n", "=== Assistant response ===\n", - "Understood. Let me repeat your full name again to confirm. Your name is Minhajul Hoque. Is that correct?\n", + "Thank you. Let me confirm: your policy number is P075-BB72. Is that correct?\n", + "\n", + "\n", + "[client] Speech detected; streaming...\n", + "[client] Detected silence; preparing transcript...\n", + "\n", + "=== User turn (Realtime transcript) ===\n", + "Yeah, but I think you got my name wrong, uh, can you ask it again?\n", + "[Realtime out-of-band transcription usage]\n", + "{\n", + " \"total_tokens\": 1232,\n", + " \"input_tokens\": 1211,\n", + " \"output_tokens\": 21,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 937,\n", + " \"audio_tokens\": 274,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 704,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 704,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 21,\n", + " \"audio_tokens\": 0\n", + " }\n", + "}\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.000932, text_in_cached=$0.000282, audio_in=$0.008768, audio_in_cached=$0.000000, text_out=$0.000336, audio_out=$0.000000, total=$0.010318\n", + "\n", + "=== User turn 
(Transcription model) ===\n", + "Yeah, but I think you got my name wrong. Can you ask it again?\n", + "[Transcription model usage]\n", + "{\n", + " \"type\": \"tokens\",\n", + " \"total_tokens\": 76,\n", + " \"input_tokens\": 57,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 57\n", + " },\n", + " \"output_tokens\": 19\n", + "}\n", + "[Transcription model cost estimate] audio_in=$0.000342, text_in=$0.000000, text_out=$0.000190, total=$0.000532\n", + "\n", + "[Realtime usage]\n", + "{\n", + " \"total_tokens\": 1882,\n", + " \"input_tokens\": 1719,\n", + " \"output_tokens\": 163,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 1248,\n", + " \"audio_tokens\": 471,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 832,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 832,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 36,\n", + " \"audio_tokens\": 127\n", + " }\n", + "}\n", + "\n", + "=== Assistant response ===\n", + "Of course, let's correct that together. 
Could you please spell your full legal name for me again, carefully?\n", "\n", "\n", "[client] Speech detected; streaming...\n", @@ -1215,27 +2428,987 @@ "\n", "=== User turn (Realtime transcript) ===\n", "My name is Minhajul Hoque.\n", + "[Realtime out-of-band transcription usage]\n", + "{\n", + " \"total_tokens\": 1261,\n", + " \"input_tokens\": 1250,\n", + " \"output_tokens\": 11,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 971,\n", + " \"audio_tokens\": 279,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 704,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 704,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 11,\n", + " \"audio_tokens\": 0\n", + " }\n", + "}\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.001068, text_in_cached=$0.000282, audio_in=$0.008928, audio_in_cached=$0.000000, text_out=$0.000176, audio_out=$0.000000, total=$0.010454\n", "\n", "=== User turn (Transcription model) ===\n", - "My name is Minhaj ul Haq.\n", + "My name is Minhajul Haque\n", + "[Transcription model usage]\n", + "{\n", + " \"type\": \"tokens\",\n", + " \"total_tokens\": 36,\n", + " \"input_tokens\": 26,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 26\n", + " },\n", + " \"output_tokens\": 10\n", + "}\n", + "[Transcription model cost estimate] audio_in=$0.000156, text_in=$0.000000, text_out=$0.000100, total=$0.000256\n", + "\n", + "[Realtime usage]\n", + "{\n", + " \"total_tokens\": 1981,\n", + " \"input_tokens\": 1758,\n", + " \"output_tokens\": 223,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 1282,\n", + " \"audio_tokens\": 476,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 832,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 832,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", 
+ " \"text_tokens\": 51,\n", + " \"audio_tokens\": 172\n", + " }\n", + "}\n", + "\n", + "=== Assistant response ===\n", + "Thank you. Let’s confirm the spelling. Is it M-I-N-H-A-J-U-L for the first name, and H-O-Q-U-E for the last name?\n", + "\n", + "\n", + "[client] Speech detected; streaming...\n", + "[client] Detected silence; preparing transcript...\n", + "\n", + "=== User turn (Realtime transcript) ===\n", + "Yep, you got it right, thank you.\n", + "[Realtime out-of-band transcription usage]\n", + "{\n", + " \"total_tokens\": 1383,\n", + " \"input_tokens\": 1371,\n", + " \"output_tokens\": 12,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 1020,\n", + " \"audio_tokens\": 351,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 704,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 704,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 12,\n", + " \"audio_tokens\": 0\n", + " }\n", + "}\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.001264, text_in_cached=$0.000282, audio_in=$0.011232, audio_in_cached=$0.000000, text_out=$0.000192, audio_out=$0.000000, total=$0.012970\n", + "\n", + "=== User turn (Transcription model) ===\n", + "Yep, you got it right, thank you.\n", + "[Transcription model usage]\n", + "{\n", + " \"type\": \"tokens\",\n", + " \"total_tokens\": 39,\n", + " \"input_tokens\": 27,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 27\n", + " },\n", + " \"output_tokens\": 12\n", + "}\n", + "[Transcription model cost estimate] audio_in=$0.000162, text_in=$0.000000, text_out=$0.000120, total=$0.000282\n", + "\n", + "[Realtime usage]\n", + "{\n", + " \"total_tokens\": 2102,\n", + " \"input_tokens\": 1879,\n", + " \"output_tokens\": 223,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 1331,\n", + " \"audio_tokens\": 548,\n", + " \"image_tokens\": 0,\n", + " 
\"cached_tokens\": 832,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 832,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 53,\n", + " \"audio_tokens\": 170\n", + " }\n", + "}\n", + "\n", + "=== Assistant response ===\n", + "You're welcome. Now that we have your name and policy number, let’s move on. Could you tell me the type of accident—was it auto, home, or something else?\n", + "\n", + "\n", + "[client] Speech detected; streaming...\n", + "[client] Detected silence; preparing transcript...\n", + "\n", + "=== User turn (Realtime transcript) ===\n", + "uh it was auto\n", + "[Realtime out-of-band transcription usage]\n", + "{\n", + " \"total_tokens\": 1452,\n", + " \"input_tokens\": 1446,\n", + " \"output_tokens\": 6,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 1069,\n", + " \"audio_tokens\": 377,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 704,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 704,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 6,\n", + " \"audio_tokens\": 0\n", + " }\n", + "}\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.001460, text_in_cached=$0.000282, audio_in=$0.012064, audio_in_cached=$0.000000, text_out=$0.000096, audio_out=$0.000000, total=$0.013902\n", + "\n", + "=== User turn (Transcription model) ===\n", + "It was Otto.\n", + "[Transcription model usage]\n", + "{\n", + " \"type\": \"tokens\",\n", + " \"total_tokens\": 34,\n", + " \"input_tokens\": 28,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 28\n", + " },\n", + " \"output_tokens\": 6\n", + "}\n", + "[Transcription model cost estimate] audio_in=$0.000168, text_in=$0.000000, text_out=$0.000060, total=$0.000228\n", + "\n", + "[Realtime usage]\n", + "{\n", + " \"total_tokens\": 
2065,\n",
+    "    \"input_tokens\": 1954,\n",
+    "    \"output_tokens\": 111,\n",
+    "    \"input_token_details\": {\n",
+    "      \"text_tokens\": 1380,\n",
+    "      \"audio_tokens\": 574,\n",
+    "      \"image_tokens\": 0,\n",
+    "      \"cached_tokens\": 1600,\n",
+    "      \"cached_tokens_details\": {\n",
+    "        \"text_tokens\": 1280,\n",
+    "        \"audio_tokens\": 320,\n",
+    "        \"image_tokens\": 0\n",
+    "      }\n",
+    "    },\n",
+    "    \"output_token_details\": {\n",
+    "      \"text_tokens\": 27,\n",
+    "      \"audio_tokens\": 84\n",
+    "    }\n",
+    "}\n",
+    "\n",
+    "=== Assistant response ===\n",
+    "Thank you. Now could you provide the preferred phone number for follow-up?\n",
     "\n",
     "Session cancelled; closing.\n"
    ]
   }
  ],
  "source": [
   "await run_realtime_session(debug_usage_and_cost=True)"
  ]
 },
 {
  "cell_type": "markdown",
  "id": "0be41e7c",
  "metadata": {},
  "source": [
   "In this example, out-of-band transcription using the Realtime model costs **$0.0725** versus **$0.00364** for the dedicated transcription model across 9 turns — about **$0.0689 more total (~19.9× higher)**. This cost is driven up by repeatedly passing the **full, growing session context** and by **uncached audio input** each turn. 
The dedicated transcription model remains far cheaper and more stable because it processes **only the new audio turn with a minimal prompt**, so the per-turn token load doesn’t accumulate.\n",
+    "\n",
+    "---\n",
+    "\n",
+    "## What’s happening in this run\n",
+    "\n",
+    "* **Caching kicks in early, but only for the stable prompt head.**\n",
+    "  After the large assistant turn (~1k+ tokens), the out-of-band (OOB) calls reuse a **704-token cached text prefix**. That’s why later OOB `text_in` is tiny while `text_in_cached` is non-zero.\n",
+    "\n",
+    "* **Audio dominates OOB cost here.**\n",
+    "  Each OOB request includes the new user audio (and sometimes additional audio context), and those **audio tokens are mostly uncached**, so `audio_in` grows and becomes the main cost component.\n",
+    "\n",
+    "* **The first OOB turn has 0 audio tokens.**\n",
+    "  The first OOB usage report shows `audio_tokens: 0`, meaning the OOB call likely fired before the audio item was fully committed to the session. It still produced the right transcript because the text context already contained (or implied) the utterance.\n",
+    "\n",
+    "* **Dedicated transcription stays cheap because its context doesn’t grow.**\n",
+    "  It is effectively turn-local: a small (or empty) prompt plus only the latest audio keeps the per-turn cost stable and low.\n",
+    "\n",
+    "* **Cost-control option.**\n",
+    "  If OOB cost is a concern, you can transcribe **only the most recent turn** (or the last N turns) instead of the whole session, keeping the OOB prompt short and preventing audio/text accumulation.\n",
+    "\n",
+    "## A note on prompt caching\n",
+    "\n",
+    "Setting `instructions` on `response.create` replaces the rendered system/instructions text for that single inference pass.\n",
+    "\n",
+    "Prompt caching only helps when the new request begins with 
the same long prefix that was previously computed. It caches the longest previously-computed prefix, and caching eligibility starts once the prompt is large enough (commonly described as 1,024+ tokens, then growing in chunks with exact-prefix matching). \n", + "OpenAI\n", + "+1\n", + "\n", + "So if OOB uses a different system/instructions block than the main assistant, the two request families will not share cache, because they diverge right at the top.\n", + "\n", + "What that means for your setup (main response vs OOB transcription)\n", + "1) Main Realtime responses can still hit caching within themselves\n", + "\n", + "If the main assistant requests are using a stable session prompt (or a stable per-response instructions override), then main responses can cache across turns.\n", + "\n", + "They only stop caching well if you’re truly changing the instructions string every turn (even small differences early in the prompt bust the prefix match). \n", + "OpenAI\n", + "+1\n", + "\n", + "2) OOB transcription can still hit caching within itself\n", + "\n", + "If your OOB transcription instructions are identical each time, then OOB requests can build their own cache lineage and reuse it on later OOB turns.\n", + "\n", + "3) But main and OOB will not share cache if their system/instructions differ\n", + "\n", + "This is the key implication of the “instructions_override replaces system message” behavior:\n", + "\n", + "Main assistant request prefix begins with: SYSTEM = assistant prompt\n", + "\n", + "OOB request prefix begins with: SYSTEM = transcription prompt\n", + "\n", + "Those prefixes differ at token 1 → the cached prefix from one cannot apply to the other (because caching is prefix-based). \n", + "OpenAI\n", + "+1\n" + ] + }, + { + "cell_type": "code", + "execution_count": 84, + "id": "482e82c2", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Streaming microphone audio at 24000 Hz (mono). 
Speak naturally; server VAD will stop listening when you pause.\n", + "\n", + "[client] Speech detected; streaming...\n", + "[client] Detected silence; preparing transcript...\n", + "\n", + "=== User turn (Realtime transcript) ===\n", + "Hello.\n", + "[Realtime out-of-band transcription usage]\n", + "{\n", + " \"total_tokens\": 1824,\n", + " \"input_tokens\": 1820,\n", + " \"output_tokens\": 4,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 1820,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 0,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 4,\n", + " \"audio_tokens\": 0\n", + " }\n", + "}\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.007280, text_in_cached=$0.000000, audio_in=$0.000000, audio_in_cached=$0.000000, text_out=$0.000064, audio_out=$0.000000, total=$0.007344\n", + "\n", + "=== User turn (Transcription model) ===\n", + "Hello\n", + "[Transcription model usage]\n", + "{\n", + " \"type\": \"tokens\",\n", + " \"total_tokens\": 19,\n", + " \"input_tokens\": 16,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 16\n", + " },\n", + " \"output_tokens\": 3\n", + "}\n", + "[Transcription model cost estimate] audio_in=$0.000096, text_in=$0.000000, text_out=$0.000030, total=$0.000126\n", + "\n", + "[Realtime usage]\n", + "{\n", + " \"total_tokens\": 1294,\n", + " \"input_tokens\": 1048,\n", + " \"output_tokens\": 246,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 1032,\n", + " \"audio_tokens\": 16,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 0,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 57,\n", + " \"audio_tokens\": 
189\n", + " }\n", + "}\n", + "\n", + "=== Assistant response ===\n", + "Thank you for calling OpenAI Insurance Claims. My name is Alex, and I’ll help you file your claim today. May I please have your full legal name as it appears on your policy?\n", + "\n", + "\n", + "[client] Speech detected; streaming...\n", + "[client] Detected silence; preparing transcript...\n", + "\n", + "=== User turn (Realtime transcript) ===\n", + "My legal name is M I N H A J U L H O Q U E\n", + "[Realtime out-of-band transcription usage]\n", + "{\n", + " \"total_tokens\": 1962,\n", + " \"input_tokens\": 1942,\n", + " \"output_tokens\": 20,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 1887,\n", + " \"audio_tokens\": 55,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 1792,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 1792,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 20,\n", + " \"audio_tokens\": 0\n", + " }\n", + "}\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.000380, text_in_cached=$0.000717, audio_in=$0.001760, audio_in_cached=$0.000000, text_out=$0.000320, audio_out=$0.000000, total=$0.003177\n", + "\n", + "=== User turn (Transcription model) ===\n", + "My legal name is M-I-N-H-A-J-U-L H-O-Q-U-E.\n", + "[Transcription model usage]\n", + "{\n", + " \"type\": \"tokens\",\n", + " \"total_tokens\": 67,\n", + " \"input_tokens\": 47,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 47\n", + " },\n", + " \"output_tokens\": 20\n", + "}\n", + "[Transcription model cost estimate] audio_in=$0.000282, text_in=$0.000000, text_out=$0.000200, total=$0.000482\n", + "\n", + "[Realtime usage]\n", + "{\n", + " \"total_tokens\": 1518,\n", + " \"input_tokens\": 1351,\n", + " \"output_tokens\": 167,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 1099,\n", + " \"audio_tokens\": 252,\n", + " 
\"image_tokens\": 0,\n", + " \"cached_tokens\": 1280,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 1088,\n", + " \"audio_tokens\": 192,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 40,\n", + " \"audio_tokens\": 127\n", + " }\n", + "}\n", + "\n", + "=== Assistant response ===\n", + "Thank you. Let me confirm I got that right. Your full legal name is Minhajul Hoque. Is that correct?\n", + "\n", + "\n", + "[client] Speech detected; streaming...\n", + "[client] Detected silence; preparing transcript...\n", + "\n", + "=== User turn (Realtime transcript) ===\n", + "Yes, that is my name.\n", + "[Realtime out-of-band transcription usage]\n", + "{\n", + " \"total_tokens\": 1944,\n", + " \"input_tokens\": 1935,\n", + " \"output_tokens\": 9,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 1921,\n", + " \"audio_tokens\": 14,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 1792,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 1792,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 9,\n", + " \"audio_tokens\": 0\n", + " }\n", + "}\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.000516, text_in_cached=$0.000717, audio_in=$0.000448, audio_in_cached=$0.000000, text_out=$0.000144, audio_out=$0.000000, total=$0.001825\n", + "\n", + "=== User turn (Transcription model) ===\n", + "Yes, that is my name.\n", + "[Transcription model usage]\n", + "{\n", + " \"type\": \"tokens\",\n", + " \"total_tokens\": 30,\n", + " \"input_tokens\": 21,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 21\n", + " },\n", + " \"output_tokens\": 9\n", + "}\n", + "[Transcription model cost estimate] audio_in=$0.000126, text_in=$0.000000, text_out=$0.000090, total=$0.000216\n", + "\n", + "[Realtime usage]\n", + "{\n", + " \"total_tokens\": 1575,\n", 
+ " \"input_tokens\": 1344,\n", + " \"output_tokens\": 231,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 1133,\n", + " \"audio_tokens\": 211,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 832,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 832,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 52,\n", + " \"audio_tokens\": 179\n", + " }\n", + "}\n", + "\n", + "=== Assistant response ===\n", + "Great, thank you for confirming. Now, could you please provide your policy number? It should be in the format of four digits, a dash, and then four more digits.\n", + "\n", + "\n", + "[client] Speech detected; streaming...\n", + "[client] Detected silence; preparing transcript...\n", + "\n", + "=== User turn (Realtime transcript) ===\n", + "My policy number would be P 0 X 7 6 5 2 0.\n", + "[Realtime out-of-band transcription usage]\n", + "{\n", + " \"total_tokens\": 2120,\n", + " \"input_tokens\": 2098,\n", + " \"output_tokens\": 22,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 1971,\n", + " \"audio_tokens\": 127,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 1856,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 1856,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 22,\n", + " \"audio_tokens\": 0\n", + " }\n", + "}\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.000460, text_in_cached=$0.000742, audio_in=$0.004064, audio_in_cached=$0.000000, text_out=$0.000352, audio_out=$0.000000, total=$0.005618\n", + "\n", + "=== User turn (Transcription model) ===\n", + "My policy number would be P0X7 6520.\n", + "[Transcription model usage]\n", + "{\n", + " \"type\": \"tokens\",\n", + " \"total_tokens\": 76,\n", + " \"input_tokens\": 61,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 0,\n", + " 
\"audio_tokens\": 61\n", + " },\n", + " \"output_tokens\": 15\n", + "}\n", + "[Transcription model cost estimate] audio_in=$0.000366, text_in=$0.000000, text_out=$0.000150, total=$0.000516\n", + "\n", + "[Realtime usage]\n", + "{\n", + " \"total_tokens\": 1704,\n", + " \"input_tokens\": 1507,\n", + " \"output_tokens\": 197,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 1183,\n", + " \"audio_tokens\": 324,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 832,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 832,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 38,\n", + " \"audio_tokens\": 159\n", + " }\n", + "}\n", + "\n", + "=== Assistant response ===\n", + "Thank you. Let me confirm your policy number. You said P0X7-6520. Is that correct?\n", + "\n", + "\n", + "[client] Speech detected; streaming...\n", + "[client] Detected silence; preparing transcript...\n", + "\n", + "=== User turn (Realtime transcript) ===\n", + "Yep, that's indeed correct. Do you have that?\n", + "[Realtime out-of-band transcription usage]\n", + "{\n", + " \"total_tokens\": 2163,\n", + " \"input_tokens\": 2150,\n", + " \"output_tokens\": 13,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 2005,\n", + " \"audio_tokens\": 145,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 1792,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 1792,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 13,\n", + " \"audio_tokens\": 0\n", + " }\n", + "}\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.000852, text_in_cached=$0.000717, audio_in=$0.004640, audio_in_cached=$0.000000, text_out=$0.000208, audio_out=$0.000000, total=$0.006417\n", + "\n", + "=== User turn (Transcription model) ===\n", + "Yep, that's indeed correct. 
Do you have that?\n", + "[Transcription model usage]\n", + "{\n", + " \"type\": \"tokens\",\n", + " \"total_tokens\": 51,\n", + " \"input_tokens\": 38,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 38\n", + " },\n", + " \"output_tokens\": 13\n", + "}\n", + "[Transcription model cost estimate] audio_in=$0.000228, text_in=$0.000000, text_out=$0.000130, total=$0.000358\n", + "\n", + "[Realtime usage]\n", + "{\n", + " \"total_tokens\": 1790,\n", + " \"input_tokens\": 1559,\n", + " \"output_tokens\": 231,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 1217,\n", + " \"audio_tokens\": 342,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 832,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 832,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 53,\n", + " \"audio_tokens\": 178\n", + " }\n", + "}\n", + "\n", + "=== Assistant response ===\n", + "Perfect, thank you. Now, could you tell me what type of accident this claim is related to? 
For example, is it an auto, home, or another type of incident?\n", + "\n", + "\n", + "[client] Speech detected; streaming...\n", + "[client] Detected silence; preparing transcript...\n", + "\n", + "=== User turn (Realtime transcript) ===\n", + "It's a auto incident.\n", + "[Realtime out-of-band transcription usage]\n", + "{\n", + " \"total_tokens\": 2256,\n", + " \"input_tokens\": 2249,\n", + " \"output_tokens\": 7,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 2056,\n", + " \"audio_tokens\": 193,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 1792,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 1792,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 7,\n", + " \"audio_tokens\": 0\n", + " }\n", + "}\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.001056, text_in_cached=$0.000717, audio_in=$0.006176, audio_in_cached=$0.000000, text_out=$0.000112, audio_out=$0.000000, total=$0.008061\n", + "\n", + "=== User turn (Transcription model) ===\n", + "It's a auto incident.\n", + "[Transcription model usage]\n", + "{\n", + " \"type\": \"tokens\",\n", + " \"total_tokens\": 36,\n", + " \"input_tokens\": 29,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 29\n", + " },\n", + " \"output_tokens\": 7\n", + "}\n", + "[Transcription model cost estimate] audio_in=$0.000174, text_in=$0.000000, text_out=$0.000070, total=$0.000244\n", + "\n", + "[Realtime usage]\n", + "{\n", + " \"total_tokens\": 1812,\n", + " \"input_tokens\": 1658,\n", + " \"output_tokens\": 154,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 1268,\n", + " \"audio_tokens\": 390,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 832,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 832,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", 
+ " \"text_tokens\": 38,\n", + " \"audio_tokens\": 116\n", + " }\n", + "}\n", + "\n", + "=== Assistant response ===\n", + "Thank you. Now I’d like to get your preferred phone number for follow-up. Could you please provide that number?\n", + "\n", + "\n", + "[client] Speech detected; streaming...\n", + "[client] Detected silence; preparing transcript...\n", + "\n", + "=== User turn (Realtime transcript) ===\n", + "For sure, I can provide that—wait, can you say my name again? I think you got it wrong.\n", + "[Realtime out-of-band transcription usage]\n", + "{\n", + " \"total_tokens\": 2303,\n", + " \"input_tokens\": 2277,\n", + " \"output_tokens\": 26,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 2090,\n", + " \"audio_tokens\": 187,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 1792,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 1792,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 26,\n", + " \"audio_tokens\": 0\n", + " }\n", + "}\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.001192, text_in_cached=$0.000717, audio_in=$0.005984, audio_in_cached=$0.000000, text_out=$0.000416, audio_out=$0.000000, total=$0.008309\n", + "\n", + "=== User turn (Transcription model) ===\n", + "For sure I can provide that but can you say my name again? 
I think you got it wrong.\n", + "[Transcription model usage]\n", + "{\n", + " \"type\": \"tokens\",\n", + " \"total_tokens\": 79,\n", + " \"input_tokens\": 56,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 56\n", + " },\n", + " \"output_tokens\": 23\n", + "}\n", + "[Transcription model cost estimate] audio_in=$0.000336, text_in=$0.000000, text_out=$0.000230, total=$0.000566\n", + "\n", + "[Realtime usage]\n", + "{\n", + " \"total_tokens\": 1852,\n", + " \"input_tokens\": 1686,\n", + " \"output_tokens\": 166,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 1302,\n", + " \"audio_tokens\": 384,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 832,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 832,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 39,\n", + " \"audio_tokens\": 127\n", + " }\n", + "}\n", + "\n", + "=== Assistant response ===\n", + "Of course, I want to make sure I have it exactly right. 
Could you please spell your full legal name for me?\n", + "\n", + "\n", + "[client] Speech detected; streaming...\n", + "[client] Detected silence; preparing transcript...\n", + "\n", + "=== User turn (Realtime transcript) ===\n", + "For sure, it's Minhajul Hoque.\n", + "[Realtime out-of-band transcription usage]\n", + "{\n", + " \"total_tokens\": 2363,\n", + " \"input_tokens\": 2351,\n", + " \"output_tokens\": 12,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 2127,\n", + " \"audio_tokens\": 224,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 1792,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 1792,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 12,\n", + " \"audio_tokens\": 0\n", + " }\n", + "}\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.001340, text_in_cached=$0.000717, audio_in=$0.007168, audio_in_cached=$0.000000, text_out=$0.000192, audio_out=$0.000000, total=$0.009417\n", + "\n", + "=== User turn (Transcription model) ===\n", + "Өршөөөр, энэ минь хажуу хог.\n", + "[Transcription model usage]\n", + "{\n", + " \"type\": \"tokens\",\n", + " \"total_tokens\": 42,\n", + " \"input_tokens\": 26,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 26\n", + " },\n", + " \"output_tokens\": 16\n", + "}\n", + "[Transcription model cost estimate] audio_in=$0.000156, text_in=$0.000000, text_out=$0.000160, total=$0.000316\n", + "\n", + "[Realtime usage]\n", + "{\n", + " \"total_tokens\": 1920,\n", + " \"input_tokens\": 1760,\n", + " \"output_tokens\": 160,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 1339,\n", + " \"audio_tokens\": 421,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 832,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 832,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " 
\"output_token_details\": {\n", + " \"text_tokens\": 43,\n", + " \"audio_tokens\": 117\n", + " }\n", + "}\n", + "\n", + "=== Assistant response ===\n", + "Thank you for clarifying. Could you spell it out for me letter by letter, so I can make sure it’s 100% correct?\n", + "\n", + "\n", + "[client] Speech detected; streaming...\n", + "[client] Detected silence; preparing transcript...\n", + "\n", + "=== User turn (Realtime transcript) ===\n", + "Yes, M I N H A J U L H O Q U E\n", + "[Realtime out-of-band transcription usage]\n", + "{\n", + " \"total_tokens\": 2434,\n", + " \"input_tokens\": 2417,\n", + " \"output_tokens\": 17,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 2168,\n", + " \"audio_tokens\": 249,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 1792,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 1792,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 17,\n", + " \"audio_tokens\": 0\n", + " }\n", + "}\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.001504, text_in_cached=$0.000717, audio_in=$0.007968, audio_in_cached=$0.000000, text_out=$0.000272, audio_out=$0.000000, total=$0.010461\n", + "\n", + "=== User turn (Transcription model) ===\n", + "Yes, M-N-S-H-A-J-U-L-H-O-Q-U-E.\n", + "[Transcription model usage]\n", + "{\n", + " \"type\": \"tokens\",\n", + " \"total_tokens\": 53,\n", + " \"input_tokens\": 35,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 35\n", + " },\n", + " \"output_tokens\": 18\n", + "}\n", + "[Transcription model cost estimate] audio_in=$0.000210, text_in=$0.000000, text_out=$0.000180, total=$0.000390\n", + "\n", + "[Realtime usage]\n", + "{\n", + " \"total_tokens\": 2027,\n", + " \"input_tokens\": 1826,\n", + " \"output_tokens\": 201,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 1380,\n", + " \"audio_tokens\": 446,\n", + " 
\"image_tokens\": 0,\n", + " \"cached_tokens\": 832,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 832,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 44,\n", + " \"audio_tokens\": 157\n", + " }\n", + "}\n", + "\n", + "=== Assistant response ===\n", + "Thank you for that clarification. I have it as M-I-N-H-A-J-U-L H-O-Q-U-E. Is that correct?\n", + "\n", + "\n", + "[client] Speech detected; streaming...\n", + "[client] Detected silence; preparing transcript...\n", + "\n", + "=== User turn (Realtime transcript) ===\n", + "Yep, that's correct, thank you.\n", + "[Realtime out-of-band transcription usage]\n", + "{\n", + " \"total_tokens\": 2536,\n", + " \"input_tokens\": 2526,\n", + " \"output_tokens\": 10,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 2210,\n", + " \"audio_tokens\": 316,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 1792,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 1792,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 10,\n", + " \"audio_tokens\": 0\n", + " }\n", + "}\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.001672, text_in_cached=$0.000717, audio_in=$0.010112, audio_in_cached=$0.000000, text_out=$0.000160, audio_out=$0.000000, total=$0.012661\n", + "\n", + "=== User turn (Transcription model) ===\n", + "Yep, that's correct, thank you.\n", + "[Transcription model usage]\n", + "{\n", + " \"type\": \"tokens\",\n", + " \"total_tokens\": 37,\n", + " \"input_tokens\": 27,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 27\n", + " },\n", + " \"output_tokens\": 10\n", + "}\n", + "[Transcription model cost estimate] audio_in=$0.000162, text_in=$0.000000, text_out=$0.000100, total=$0.000262\n", + "\n", + "[Realtime usage]\n", + "{\n", + " \"total_tokens\": 
2087,\n", + " \"input_tokens\": 1935,\n", + " \"output_tokens\": 152,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 1422,\n", + " \"audio_tokens\": 513,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 1600,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 1344,\n", + " \"audio_tokens\": 256,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 35,\n", + " \"audio_tokens\": 117\n", + " }\n", + "}\n", + "\n", + "=== Assistant response ===\n", + "Great, thank you for confirming. Now, could you please provide your preferred phone number for follow-up?\n", + "\n", + "Session cancelled; closing.\n" + ] + } + ], + "source": [ + "await run_realtime_session(debug_usage_and_cost=True)" ] }, { From 7016b1d7a50541539436dc9c533f435c6570b97c Mon Sep 17 00:00:00 2001 From: minh-hoque Date: Tue, 25 Nov 2025 12:04:10 -0500 Subject: [PATCH 3/5] Refactor Realtime out-of-band transcription example to include new parameters for selective transcription requests, update execution counts, and enhance cost analysis summary. Improve overall structure and clarity of instructions. --- .../Realtime_out_of_band_transcription.ipynb | 2038 ++++------------- 1 file changed, 439 insertions(+), 1599 deletions(-) diff --git a/examples/Realtime_out_of_band_transcription.ipynb b/examples/Realtime_out_of_band_transcription.ipynb index 0b212a8c83..ab7356a1e1 100644 --- a/examples/Realtime_out_of_band_transcription.ipynb +++ b/examples/Realtime_out_of_band_transcription.ipynb @@ -131,7 +131,7 @@ }, { "cell_type": "code", - "execution_count": 21, + "execution_count": 100, "id": "c399f440", "metadata": {}, "outputs": [], @@ -156,7 +156,7 @@ }, { "cell_type": "code", - "execution_count": 76, + "execution_count": 125, "id": "ac3afaab", "metadata": {}, "outputs": [], @@ -165,7 +165,6 @@ "You are a calm, professional, and empathetic insurance claims intake voice agent working for OpenAI Insurance Solutions. 
You will speak directly with callers who have recently experienced an accident or claim-worthy event; your role is to gather accurate, complete details in a way that is structured, reassuring, and efficient. Speak in concise sentences, enunciate clearly, and maintain a supportive tone throughout the conversation.\n", "\n", "## OVERVIEW\n", - "\n", "Your job is to walk every caller methodically through three main phases:\n", "\n", "1. **Phase 1: Basics Collection**\n", @@ -174,10 +173,7 @@ "\n", "You should strictly adhere to this structure, make no guesses, never skip required fields, and always confirm critical facts directly with the caller.\n", "\n", - "---\n", - "\n", "## PHASE 1: BASICS COLLECTION\n", - "\n", "- **Greet the caller**: Briefly introduce yourself (“Thank you for calling OpenAI Insurance Claims. My name is [Assistant Name], and I’ll help you file your claim today.”).\n", "- **Gather the following details:**\n", " - Full legal name of the policyholder (“May I please have your full legal name as it appears on your policy?”).\n", @@ -187,10 +183,7 @@ " - Date and time of the incident.\n", "- **Repeat and confirm all collected details at the end of this phase** (“Just to confirm, I have... [summarize each field]. 
Is that correct?”).\n", "\n", - "---\n", - "\n", "## PHASE 2: INCIDENT CLARIFICATION AND YES/NO QUESTIONS\n", - "\n", "- **Ask YES/NO questions tailored to the incident type:**\n", " - Was anyone injured?\n", " - For vehicle claims: Is the vehicle still drivable?\n", @@ -200,19 +193,13 @@ "- **For each YES/NO answer:** Restate the caller’s response in your own words to confirm understanding.\n", "- **If a caller is unsure or does not have information:** Note it politely and move on without pressing (“That’s okay, we can always collect it later if needed.”).\n", "\n", - "---\n", - "\n", "## PHASE 3: SUMMARY, CONFIRMATION & CLAIM SUBMISSION\n", - "\n", "- **Concise Recap**: Summarize all key facts in a single, clear paragraph (“To quickly review, you, [caller’s name], experienced [incident description] on [date] and provided the following answers... Is that all correct?”).\n", "- **Final Confirmation**: Ask if there is any other relevant information they wish to add about the incident.\n", "- **Submission**: Inform the caller you will submit the claim and briefly outline next steps (“I’ll now submit your claim. Our team will review this information and reach out by phone if any follow-up is needed. 
You'll receive an initial update within [X] business days.”).\n", "- **Thank the caller**: Express appreciation for their patience.\n", "\n", - "---\n", - "\n", "## GENERAL GUIDELINES\n", - "\n", "- Always state the purpose of each question before asking it.\n", "- Be patient: Adjust your pacing if the caller seems upset or confused.\n", "- Provide reassurance but do not make guarantees about claim approvals.\n", @@ -221,19 +208,13 @@ "- Do not deviate from the script structure, but feel free to use natural language and slight rephrasings to maintain human-like flow.\n", "- Spell out any confusing words, numbers, or codes as needed.\n", "\n", - "---\n", - "\n", "## COMMUNICATION STYLE\n", - "\n", "- Use warm, professional language.\n", "- If at any point the caller becomes upset, acknowledge their feelings (“I understand this situation can be stressful. I'm here to make the process as smooth as possible for you.”).\n", "- When confirming, always explicitly state the value you are confirming.\n", "- Never speculate or invent information. All responses must be grounded in the caller’s direct answers.\n", "\n", - "---\n", - "\n", "## SPECIAL SCENARIOS\n", - "\n", "- **Caller does not know policy number:** Ask for alternative identification such as address or date of birth, and note that the claim will be linked once verified.\n", "- **Multiple incidents:** Politely explain that each claim must be filed separately, and help with the first; offer instructions for subsequent claims if necessary.\n", "- **Caller wishes to pause or end:** Respect their wishes, provide information on how to resume the claim, and thank them for their time.\n", @@ -249,7 +230,6 @@ "\n", "You must produce a **literal, unedited transcript** of the latest user utterance only. Read and follow all instructions below carefully.\n", "\n", - "---\n", "\n", "## 1. 
Scope of Your Task\n", "\n", @@ -267,7 +247,6 @@ " - You do **not** answer questions, give advice, or continue the conversation.\n", " - You only output the text of what the user just said.\n", "\n", - "---\n", "\n", "## 2. Core Transcription Principles\n", "\n", @@ -309,7 +288,6 @@ " - Transcribe non-English content as accurately as possible.\n", " - Do **not** translate; keep everything in the language(s) spoken.\n", "\n", - "---\n", "\n", "## 3. Disfluencies, Non-Speech Sounds, and Ambiguity\n", "\n", @@ -330,9 +308,8 @@ " - Do **not** replace unclear fragments with what you “think” the user meant.\n", " - Your duty is to preserve exactly what the transcription model produced, even if it looks incomplete or strange.\n", "\n", - "---\n", "\n", - "## 4. Special Case: Policy Numbers\n", + "## 4. Policy Numbers Format\n", "\n", "The user may sometimes mention **policy numbers**. These must be handled with extra care.\n", "\n", @@ -352,7 +329,6 @@ " - Correct perceived mistakes\n", " - Transcribe **exactly what was said**, even if it seems malformed.\n", "\n", - "---\n", "\n", "## 5. Punctuation and Casing\n", "\n", @@ -367,10 +343,8 @@ " - Preserve the casing (uppercase/lowercase) as the model output provides.\n", " - Do not change “i” to “I” or adjust capitalization at sentence boundaries unless the model already did so.\n", "\n", - "---\n", "\n", "## 6. Output Format Requirements\n", - "\n", "Your final output must be a **single, plain-text transcript** of the latest user turn.\n", "\n", "1. **Single block of text**\n", @@ -395,8 +369,6 @@ " - Return an **empty output** (no text at all).\n", " - Do **not** insert placeholders like “[silence]”, “[no audio]”, or “(no transcript)”.\n", "\n", - "---\n", - "\n", "## 7. What You Must Never Do\n", "\n", "1. **No responses or conversation**\n", @@ -425,9 +397,8 @@ " - Incomplete\n", " - Your job is **not** to fix or improve it. Your only job is to **transcribe** it exactly.\n", "\n", - "---\n", "\n", - "## 8. 
Summary of Your Role\n", + "## 8. IMPORTANT REMINDER\n", "\n", "- You are **not** a chat assistant.\n", "- You are **not** an editor, summarizer, or interpreter.\n", @@ -453,7 +424,7 @@ }, { "cell_type": "code", - "execution_count": 77, + "execution_count": 126, "id": "4b952a29", "metadata": {}, "outputs": [ @@ -490,7 +461,7 @@ }, { "cell_type": "code", - "execution_count": 78, + "execution_count": 127, "id": "7254080a", "metadata": {}, "outputs": [], @@ -551,7 +522,7 @@ }, { "cell_type": "code", - "execution_count": 91, + "execution_count": 138, "id": "4baf1870", "metadata": {}, "outputs": [], @@ -616,17 +587,29 @@ " }\n", "\n", "\n", - "def build_transcription_request(transcription_instructions: str) -> dict[str, object]:\n", - " \"\"\"Ask the SAME Realtime model for an out-of-band transcript of the latest user turn.\"\"\"\n", + "def build_transcription_request(\n", + " transcription_instructions: str,\n", + " item_ids: list[str] | None = None,\n", + ") -> dict[str, object]:\n", + " \"\"\"Ask the SAME Realtime model for an out-of-band transcript of selected user turns.\n", + " If item_ids is provided, the model will only consider the turns with the given IDs. 
You can use this to limit the session context window.\n", + " \"\"\"\n", + "\n", + " response: dict[str, object] = {\n", + " \"conversation\": \"none\", # <--- out-of-band\n", + " \"output_modalities\": [\"text\"],\n", + " \"metadata\": {\"purpose\": TRANSCRIPTION_PURPOSE}, # easier to identify in the logs\n", + " \"instructions\": transcription_instructions,\n", + " }\n", + "\n", + " if item_ids:\n", + " response[\"input\"] = [\n", + " {\"type\": \"item_reference\", \"id\": item_id} for item_id in item_ids\n", + " ]\n", "\n", " return {\n", " \"type\": \"response.create\",\n", - " \"response\": {\n", - " \"conversation\": \"none\", # <--- out-of-band\n", - " \"output_modalities\": [\"text\"],\n", - " \"metadata\": {\"purpose\": TRANSCRIPTION_PURPOSE}, # <--- we add metadata so it is easier to identify the event in the logs\n", - " \"instructions\": \"repeat the user's last turn\"#transcription_instructions,\n", - " },\n", + " \"response\": response,\n", " }\n" ] }, @@ -647,7 +630,7 @@ }, { "cell_type": "code", - "execution_count": 92, + "execution_count": 139, "id": "11218bbb", "metadata": {}, "outputs": [], @@ -766,7 +749,7 @@ }, { "cell_type": "code", - "execution_count": 93, + "execution_count": 140, "id": "cb6acbf0", "metadata": {}, "outputs": [], @@ -836,7 +819,7 @@ }, { "cell_type": "code", - "execution_count": 94, + "execution_count": 141, "id": "d099babd", "metadata": {}, "outputs": [], @@ -863,8 +846,9 @@ " transcription_model_costs = shared_state.setdefault(\n", " \"transcription_model_costs\", deque()\n", " )\n", - "\n", " debug_usage_and_cost: bool = bool(shared_state.get(\"debug_usage_and_cost\", False))\n", + " only_last_user_turn: bool = bool(shared_state.get(\"only_last_user_turn\", False))\n", + " last_user_audio_item_id: str | None = None\n", "\n", " # Pricing constants (USD per 1M tokens). 
See https://platform.openai.com/pricing.\n", " # gpt-4o-transcribe\n", @@ -995,14 +979,42 @@ " if message_type == \"input_audio_buffer.speech_stopped\":\n", " print(\"[client] Detected silence; preparing transcript...\", flush=True)\n", "\n", - " # This is where the out-of-band transcription request is sent. <-------\n", - " if awaiting_transcription_prompt:\n", + " # Default behavior: trigger immediately after audio commit unless\n", + " # only_last_user_turn requires waiting for conversation.item.added.\n", + " if awaiting_transcription_prompt and not only_last_user_turn:\n", " request_payload = build_transcription_request(\n", - " transcription_instructions\n", + " transcription_instructions,\n", + " item_ids=None,\n", " )\n", " await ws.send(json.dumps(request_payload))\n", " awaiting_transcription_prompt = False\n", "\n", + " elif message_type == \"conversation.item.added\":\n", + " item = message.get(\"item\") or {}\n", + " item_id = item.get(\"id\")\n", + " role = item.get(\"role\")\n", + " status = item.get(\"status\")\n", + " content_blocks = item.get(\"content\") or []\n", + " has_user_audio = any(\n", + " block.get(\"type\") == \"input_audio\" for block in content_blocks\n", + " )\n", + "\n", + " if (\n", + " role == \"user\"\n", + " and status == \"completed\"\n", + " and has_user_audio\n", + " and item_id\n", + " ):\n", + " last_user_audio_item_id = item_id\n", + "\n", + " if only_last_user_turn and awaiting_transcription_prompt:\n", + " request_payload = build_transcription_request(\n", + " transcription_instructions,\n", + " item_ids=[item_id],\n", + " )\n", + " await ws.send(json.dumps(request_payload))\n", + " awaiting_transcription_prompt = False\n", + "\n", " # --- Built-in transcription model stream -------------------------------\n", " elif message_type in TRANSCRIPTION_DELTA_TYPES:\n", " buffer_id = message.get(\"buffer_id\") or message.get(\"item_id\") or \"default\"\n", @@ -1127,9 +1139,9 @@ " 
pending_transcription_prints.append(object())\n", " flush_pending_transcription_prints(shared_state)\n", " else:\n", - " if debug_usage_and_cost and usage:\n", - " print(\"[Realtime usage]\")\n", - " print(json.dumps(usage, indent=2))\n", + " # if debug_usage_and_cost and usage:\n", + " # print(\"[Realtime usage]\")\n", + " # print(json.dumps(usage, indent=2))\n", " print(\"\\n=== Assistant response ===\")\n", " print(text, flush=True)\n", " print()\n", @@ -1189,7 +1201,7 @@ }, { "cell_type": "code", - "execution_count": 95, + "execution_count": 142, "id": "35c4d7b5", "metadata": {}, "outputs": [], @@ -1209,6 +1221,7 @@ " max_turns: int | None = None,\n", " timeout_seconds: int = 0,\n", " debug_usage_and_cost: bool = True,\n", + " only_last_user_turn: bool = False,\n", ") -> None:\n", " \"\"\"Connect to the Realtime API, stream audio both ways, and print transcripts.\"\"\"\n", " api_key = api_key or os.environ.get(\"OPENAI_API_KEY\")\n", @@ -1233,6 +1246,7 @@ " \"input_transcripts\": deque(),\n", " \"pending_transcription_prints\": deque(),\n", " \"debug_usage_and_cost\": debug_usage_and_cost,\n", + " \"only_last_user_turn\": only_last_user_turn,\n", " }\n", "\n", " async with websockets.connect(\n", @@ -1273,94 +1287,6 @@ " )" ] }, - { - "cell_type": "code", - "execution_count": 96, - "id": "b397b67e", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Streaming microphone audio at 24000 Hz (mono). 
Speak naturally; server VAD will stop listening when you pause.\n", - "\n", - "[client] Speech detected; streaming...\n", - "[client] Detected silence; preparing transcript...\n", - "\n", - "=== User turn (Realtime transcript) ===\n", - "Hello!\n", - "[Realtime out-of-band transcription usage]\n", - "{\n", - " \"total_tokens\": 10,\n", - " \"input_tokens\": 6,\n", - " \"output_tokens\": 4,\n", - " \"input_token_details\": {\n", - " \"text_tokens\": 6,\n", - " \"audio_tokens\": 0,\n", - " \"image_tokens\": 0,\n", - " \"cached_tokens\": 0,\n", - " \"cached_tokens_details\": {\n", - " \"text_tokens\": 0,\n", - " \"audio_tokens\": 0,\n", - " \"image_tokens\": 0\n", - " }\n", - " },\n", - " \"output_token_details\": {\n", - " \"text_tokens\": 4,\n", - " \"audio_tokens\": 0\n", - " }\n", - "}\n", - "[Realtime out-of-band transcription cost estimate] text_in=$0.000024, text_in_cached=$0.000000, audio_in=$0.000000, audio_in_cached=$0.000000, text_out=$0.000064, audio_out=$0.000000, total=$0.000088\n", - "\n", - "=== User turn (Transcription model) ===\n", - "Hello\n", - "[Transcription model usage]\n", - "{\n", - " \"type\": \"tokens\",\n", - " \"total_tokens\": 22,\n", - " \"input_tokens\": 19,\n", - " \"input_token_details\": {\n", - " \"text_tokens\": 0,\n", - " \"audio_tokens\": 19\n", - " },\n", - " \"output_tokens\": 3\n", - "}\n", - "[Transcription model cost estimate] audio_in=$0.000114, text_in=$0.000000, text_out=$0.000030, total=$0.000144\n", - "\n", - "[Realtime usage]\n", - "{\n", - " \"total_tokens\": 1305,\n", - " \"input_tokens\": 1051,\n", - " \"output_tokens\": 254,\n", - " \"input_token_details\": {\n", - " \"text_tokens\": 1032,\n", - " \"audio_tokens\": 19,\n", - " \"image_tokens\": 0,\n", - " \"cached_tokens\": 0,\n", - " \"cached_tokens_details\": {\n", - " \"text_tokens\": 0,\n", - " \"audio_tokens\": 0,\n", - " \"image_tokens\": 0\n", - " }\n", - " },\n", - " \"output_token_details\": {\n", - " \"text_tokens\": 55,\n", - " \"audio_tokens\": 199\n", 
- " }\n", - "}\n", - "\n", - "=== Assistant response ===\n", - "Thank you for calling OpenAI Insurance Claims. My name is Alex, and I’ll help you file your claim today. May I please have your full legal name as it appears on your policy?\n", - "\n", - "Session cancelled; closing.\n" - ] - } - ], - "source": [ - "await run_realtime_session(debug_usage_and_cost=True)" - ] - }, { "cell_type": "code", "execution_count": null, @@ -1495,8 +1421,8 @@ }, { "cell_type": "code", - "execution_count": 57, - "id": "32401814", + "execution_count": 111, + "id": "4a0f9911", "metadata": {}, "outputs": [ { @@ -1507,16 +1433,18 @@ "\n", "[client] Speech detected; streaming...\n", "[client] Detected silence; preparing transcript...\n", + "conversation.item.added: {'id': 'item_Cfpt8RCQdpsNsz2OZ4rxQ', 'type': 'message', 'status': 'completed', 'role': 'user', 'content': [{'type': 'input_audio', 'transcript': None}]}\n", + "conversation.item.added: {'id': 'item_Cfpt9JS3PCvlCxoO15mLt', 'type': 'message', 'status': 'in_progress', 'role': 'assistant', 'content': []}\n", "\n", "=== User turn (Realtime transcript) ===\n", - "Hello.\n", + "Hello. 
How can I help you today?\n", "[Realtime out-of-band transcription usage]\n", "{\n", - " \"total_tokens\": 307,\n", - " \"input_tokens\": 303,\n", - " \"output_tokens\": 4,\n", + " \"total_tokens\": 1841,\n", + " \"input_tokens\": 1830,\n", + " \"output_tokens\": 11,\n", " \"input_token_details\": {\n", - " \"text_tokens\": 303,\n", + " \"text_tokens\": 1830,\n", " \"audio_tokens\": 0,\n", " \"image_tokens\": 0,\n", " \"cached_tokens\": 0,\n", @@ -1527,34 +1455,34 @@ " }\n", " },\n", " \"output_token_details\": {\n", - " \"text_tokens\": 4,\n", + " \"text_tokens\": 11,\n", " \"audio_tokens\": 0\n", " }\n", "}\n", - "[Realtime out-of-band transcription cost estimate] text_in=$0.001212, text_in_cached=$0.000000, audio_in=$0.000000, audio_in_cached=$0.000000, text_out=$0.000064, audio_out=$0.000000, total=$0.001276\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.007320, text_in_cached=$0.000000, audio_in=$0.000000, audio_in_cached=$0.000000, text_out=$0.000176, audio_out=$0.000000, total=$0.007496\n", "\n", "=== User turn (Transcription model) ===\n", - "Hello.\n", + "Hello\n", "[Transcription model usage]\n", "{\n", " \"type\": \"tokens\",\n", - " \"total_tokens\": 20,\n", + " \"total_tokens\": 19,\n", " \"input_tokens\": 16,\n", " \"input_token_details\": {\n", " \"text_tokens\": 0,\n", " \"audio_tokens\": 16\n", " },\n", - " \"output_tokens\": 4\n", + " \"output_tokens\": 3\n", "}\n", - "[Transcription model cost estimate] audio_in=$0.000096, text_in=$0.000000, text_out=$0.000040, total=$0.000136\n", + "[Transcription model cost estimate] audio_in=$0.000096, text_in=$0.000000, text_out=$0.000030, total=$0.000126\n", "\n", "[Realtime usage]\n", "{\n", - " \"total_tokens\": 295,\n", - " \"input_tokens\": 167,\n", - " \"output_tokens\": 128,\n", + " \"total_tokens\": 1327,\n", + " \"input_tokens\": 1042,\n", + " \"output_tokens\": 285,\n", " \"input_token_details\": {\n", - " \"text_tokens\": 151,\n", + " \"text_tokens\": 1026,\n", " 
\"audio_tokens\": 16,\n", " \"image_tokens\": 0,\n", " \"cached_tokens\": 0,\n", @@ -1565,373 +1493,452 @@ " }\n", " },\n", " \"output_token_details\": {\n", - " \"text_tokens\": 31,\n", - " \"audio_tokens\": 97\n", + " \"text_tokens\": 66,\n", + " \"audio_tokens\": 219\n", " }\n", "}\n", "\n", "=== Assistant response ===\n", - "Hello! Let’s get started with your claim. Could you please tell me your full name?\n", + "Thank you for calling OpenAI Insurance Claims. My name is Ava, and I’ll help you file your claim today. Let’s start with your full legal name as it appears on your policy. Could you share that with me, please?\n", "\n", "\n", "[client] Speech detected; streaming...\n", "[client] Detected silence; preparing transcript...\n", + "conversation.item.added: {'id': 'item_CfptNPygis1UcQYQMDh1f', 'type': 'message', 'status': 'completed', 'role': 'user', 'content': [{'type': 'input_audio', 'transcript': None}]}\n", + "conversation.item.added: {'id': 'item_CfptSg4tU6WnRkdiPvR3D', 'type': 'message', 'status': 'in_progress', 'role': 'assistant', 'content': []}\n", "\n", "=== User turn (Realtime transcript) ===\n", - "My full name is Minhajul Haq.\n", + "My full legal name would be M-I-N-H, H-O-Q-U-E.\n", "[Realtime out-of-band transcription usage]\n", "{\n", - " \"total_tokens\": 356,\n", - " \"input_tokens\": 344,\n", - " \"output_tokens\": 12,\n", + " \"total_tokens\": 2020,\n", + " \"input_tokens\": 2001,\n", + " \"output_tokens\": 19,\n", " \"input_token_details\": {\n", - " \"text_tokens\": 344,\n", - " \"audio_tokens\": 0,\n", + " \"text_tokens\": 1906,\n", + " \"audio_tokens\": 95,\n", " \"image_tokens\": 0,\n", - " \"cached_tokens\": 320,\n", + " \"cached_tokens\": 1856,\n", " \"cached_tokens_details\": {\n", - " \"text_tokens\": 320,\n", + " \"text_tokens\": 1856,\n", " \"audio_tokens\": 0,\n", " \"image_tokens\": 0\n", " }\n", " },\n", " \"output_token_details\": {\n", - " \"text_tokens\": 12,\n", + " \"text_tokens\": 19,\n", " \"audio_tokens\": 0\n", " 
}\n", "}\n", - "[Realtime out-of-band transcription cost estimate] text_in=$0.000096, text_in_cached=$0.000128, audio_in=$0.000000, audio_in_cached=$0.000000, text_out=$0.000192, audio_out=$0.000000, total=$0.000416\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.000200, text_in_cached=$0.000742, audio_in=$0.003040, audio_in_cached=$0.000000, text_out=$0.000304, audio_out=$0.000000, total=$0.004286\n", "\n", "=== User turn (Transcription model) ===\n", - "My full name is Minhajul Haq.\n", + "My full legal name would be Minhajul Hoque.\n", "[Transcription model usage]\n", "{\n", " \"type\": \"tokens\",\n", - " \"total_tokens\": 49,\n", - " \"input_tokens\": 37,\n", + " \"total_tokens\": 71,\n", + " \"input_tokens\": 57,\n", " \"input_token_details\": {\n", " \"text_tokens\": 0,\n", - " \"audio_tokens\": 37\n", + " \"audio_tokens\": 57\n", " },\n", - " \"output_tokens\": 12\n", + " \"output_tokens\": 14\n", "}\n", - "[Transcription model cost estimate] audio_in=$0.000222, text_in=$0.000000, text_out=$0.000120, total=$0.000342\n", + "[Transcription model cost estimate] audio_in=$0.000342, text_in=$0.000000, text_out=$0.000140, total=$0.000482\n", "\n", "[Realtime usage]\n", "{\n", - " \"total_tokens\": 503,\n", - " \"input_tokens\": 342,\n", - " \"output_tokens\": 161,\n", + " \"total_tokens\": 1675,\n", + " \"input_tokens\": 1394,\n", + " \"output_tokens\": 281,\n", " \"input_token_details\": {\n", - " \"text_tokens\": 192,\n", - " \"audio_tokens\": 150,\n", + " \"text_tokens\": 1102,\n", + " \"audio_tokens\": 292,\n", " \"image_tokens\": 0,\n", - " \"cached_tokens\": 320,\n", + " \"cached_tokens\": 1344,\n", " \"cached_tokens_details\": {\n", - " \"text_tokens\": 192,\n", - " \"audio_tokens\": 128,\n", + " \"text_tokens\": 1088,\n", + " \"audio_tokens\": 256,\n", " \"image_tokens\": 0\n", " }\n", " },\n", " \"output_token_details\": {\n", - " \"text_tokens\": 37,\n", - " \"audio_tokens\": 124\n", + " \"text_tokens\": 63,\n", + " 
\"audio_tokens\": 218\n", " }\n", "}\n", "\n", "=== Assistant response ===\n", - "Thank you. I’ve got Minhajul Haq. Now, could you provide your policy number, please?\n", + "Thank you, Minhajul Hoque. I’ve got your full name noted. Next, may I have your policy number? Please share it in the format of four digits, a dash, and then four more digits.\n", "\n", "\n", "[client] Speech detected; streaming...\n", "[client] Detected silence; preparing transcript...\n", + "conversation.item.added: {'id': 'item_CfpthEQKfNqaoD86Iolvf', 'type': 'message', 'status': 'completed', 'role': 'user', 'content': [{'type': 'input_audio', 'transcript': None}]}\n", + "conversation.item.added: {'id': 'item_CfptnqCGAdlEXuAxGUvvK', 'type': 'message', 'status': 'in_progress', 'role': 'assistant', 'content': []}\n", "\n", "=== User turn (Realtime transcript) ===\n", - "My policy number is 00X7-B725.\n", + "My policy number is P-0-0-2-X-0-7-5.\n", "[Realtime out-of-band transcription usage]\n", "{\n", - " \"total_tokens\": 446,\n", - " \"input_tokens\": 433,\n", - " \"output_tokens\": 13,\n", + " \"total_tokens\": 2137,\n", + " \"input_tokens\": 2116,\n", + " \"output_tokens\": 21,\n", " \"input_token_details\": {\n", - " \"text_tokens\": 381,\n", - " \"audio_tokens\": 52,\n", + " \"text_tokens\": 1963,\n", + " \"audio_tokens\": 153,\n", " \"image_tokens\": 0,\n", - " \"cached_tokens\": 0,\n", + " \"cached_tokens\": 1856,\n", " \"cached_tokens_details\": {\n", - " \"text_tokens\": 0,\n", + " \"text_tokens\": 1856,\n", " \"audio_tokens\": 0,\n", " \"image_tokens\": 0\n", " }\n", " },\n", " \"output_token_details\": {\n", - " \"text_tokens\": 13,\n", + " \"text_tokens\": 21,\n", " \"audio_tokens\": 0\n", " }\n", "}\n", - "[Realtime out-of-band transcription cost estimate] text_in=$0.001524, text_in_cached=$0.000000, audio_in=$0.001664, audio_in_cached=$0.000000, text_out=$0.000208, audio_out=$0.000000, total=$0.003396\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.000428, 
text_in_cached=$0.000742, audio_in=$0.004896, audio_in_cached=$0.000000, text_out=$0.000336, audio_out=$0.000000, total=$0.006402\n", "\n", "=== User turn (Transcription model) ===\n", - "My policy number is 00X7B725.\n", + "My policy number is P002X075.\n", "[Transcription model usage]\n", "{\n", " \"type\": \"tokens\",\n", - " \"total_tokens\": 85,\n", - " \"input_tokens\": 72,\n", + " \"total_tokens\": 70,\n", + " \"input_tokens\": 59,\n", " \"input_token_details\": {\n", " \"text_tokens\": 0,\n", - " \"audio_tokens\": 72\n", + " \"audio_tokens\": 59\n", " },\n", - " \"output_tokens\": 13\n", + " \"output_tokens\": 11\n", "}\n", - "[Transcription model cost estimate] audio_in=$0.000432, text_in=$0.000000, text_out=$0.000130, total=$0.000562\n", + "[Transcription model cost estimate] audio_in=$0.000354, text_in=$0.000000, text_out=$0.000110, total=$0.000464\n", "\n", "[Realtime usage]\n", "{\n", - " \"total_tokens\": 755,\n", - " \"input_tokens\": 478,\n", - " \"output_tokens\": 277,\n", + " \"total_tokens\": 1811,\n", + " \"input_tokens\": 1509,\n", + " \"output_tokens\": 302,\n", " \"input_token_details\": {\n", - " \"text_tokens\": 229,\n", - " \"audio_tokens\": 249,\n", + " \"text_tokens\": 1159,\n", + " \"audio_tokens\": 350,\n", " \"image_tokens\": 0,\n", - " \"cached_tokens\": 0,\n", + " \"cached_tokens\": 832,\n", " \"cached_tokens_details\": {\n", - " \"text_tokens\": 0,\n", + " \"text_tokens\": 832,\n", " \"audio_tokens\": 0,\n", " \"image_tokens\": 0\n", " }\n", " },\n", " \"output_token_details\": {\n", - " \"text_tokens\": 51,\n", - " \"audio_tokens\": 226\n", + " \"text_tokens\": 57,\n", + " \"audio_tokens\": 245\n", " }\n", "}\n", "\n", "=== Assistant response ===\n", - "Thank you. I have the policy number as 00X7-B725. Let’s confirm: 00X7-B725. Is that correct?\n", + "I want to confirm I heard that correctly. It sounded like your policy number is P002-X075. 
Could you please confirm if that’s correct, or provide any clarification if needed?\n", "\n", "\n", "[client] Speech detected; streaming...\n", "[client] Detected silence; preparing transcript...\n", + "conversation.item.added: {'id': 'item_Cfpu59HqXhBMHvHmW0SvX', 'type': 'message', 'status': 'completed', 'role': 'user', 'content': [{'type': 'input_audio', 'transcript': None}]}\n", + "conversation.item.added: {'id': 'item_Cfpu8juH7cCWuQAxCsYUT', 'type': 'message', 'status': 'in_progress', 'role': 'assistant', 'content': []}\n", "\n", "=== User turn (Realtime transcript) ===\n", - "My full name is Minhajul Haq.\n", + "That is indeed correct.\n", "[Realtime out-of-band transcription usage]\n", "{\n", - " \"total_tokens\": 677,\n", - " \"input_tokens\": 665,\n", - " \"output_tokens\": 12,\n", + " \"total_tokens\": 2233,\n", + " \"input_tokens\": 2226,\n", + " \"output_tokens\": 7,\n", " \"input_token_details\": {\n", - " \"text_tokens\": 430,\n", - " \"audio_tokens\": 235,\n", + " \"text_tokens\": 2014,\n", + " \"audio_tokens\": 212,\n", " \"image_tokens\": 0,\n", - " \"cached_tokens\": 0,\n", + " \"cached_tokens\": 1856,\n", " \"cached_tokens_details\": {\n", - " \"text_tokens\": 0,\n", + " \"text_tokens\": 1856,\n", " \"audio_tokens\": 0,\n", " \"image_tokens\": 0\n", " }\n", " },\n", " \"output_token_details\": {\n", - " \"text_tokens\": 12,\n", + " \"text_tokens\": 7,\n", " \"audio_tokens\": 0\n", " }\n", "}\n", - "[Realtime out-of-band transcription cost estimate] text_in=$0.001720, text_in_cached=$0.000000, audio_in=$0.007520, audio_in_cached=$0.000000, text_out=$0.000192, audio_out=$0.000000, total=$0.009432\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.000632, text_in_cached=$0.000742, audio_in=$0.006784, audio_in_cached=$0.000000, text_out=$0.000112, audio_out=$0.000000, total=$0.008270\n", "\n", "=== User turn (Transcription model) ===\n", - "Yeah, that's pretty much correct. That's pretty good, but I think you got my name wrong. 
Can you ask me again?\n", + "That is indeed correct.\n", "[Transcription model usage]\n", "{\n", " \"type\": \"tokens\",\n", - " \"total_tokens\": 109,\n", - " \"input_tokens\": 81,\n", + " \"total_tokens\": 39,\n", + " \"input_tokens\": 32,\n", " \"input_token_details\": {\n", " \"text_tokens\": 0,\n", - " \"audio_tokens\": 81\n", + " \"audio_tokens\": 32\n", " },\n", - " \"output_tokens\": 28\n", + " \"output_tokens\": 7\n", "}\n", - "[Transcription model cost estimate] audio_in=$0.000486, text_in=$0.000000, text_out=$0.000280, total=$0.000766\n", + "[Transcription model cost estimate] audio_in=$0.000192, text_in=$0.000000, text_out=$0.000070, total=$0.000262\n", "\n", "[Realtime usage]\n", "{\n", - " \"total_tokens\": 857,\n", - " \"input_tokens\": 710,\n", - " \"output_tokens\": 147,\n", + " \"total_tokens\": 1818,\n", + " \"input_tokens\": 1619,\n", + " \"output_tokens\": 199,\n", " \"input_token_details\": {\n", - " \"text_tokens\": 278,\n", - " \"audio_tokens\": 432,\n", + " \"text_tokens\": 1210,\n", + " \"audio_tokens\": 409,\n", " \"image_tokens\": 0,\n", - " \"cached_tokens\": 0,\n", + " \"cached_tokens\": 832,\n", " \"cached_tokens_details\": {\n", - " \"text_tokens\": 0,\n", + " \"text_tokens\": 832,\n", " \"audio_tokens\": 0,\n", " \"image_tokens\": 0\n", " }\n", " },\n", " \"output_token_details\": {\n", - " \"text_tokens\": 37,\n", - " \"audio_tokens\": 110\n", + " \"text_tokens\": 49,\n", + " \"audio_tokens\": 150\n", " }\n", "}\n", "\n", "=== Assistant response ===\n", - "Of course, let’s make sure we get it right. Could you please repeat your full name clearly for me?\n", + "Thank you for confirming. 
Now, could you tell me the type of accident you’re filing this claim for—whether it’s auto, home, or something else?\n", "\n", "\n", "[client] Speech detected; streaming...\n", "[client] Detected silence; preparing transcript...\n", + "conversation.item.added: {'id': 'item_CfpuJcnmWJEzfxS2MgHv0', 'type': 'message', 'status': 'completed', 'role': 'user', 'content': [{'type': 'input_audio', 'transcript': None}]}\n", + "conversation.item.added: {'id': 'item_CfpuPtFYTrlz1uQJBKMVF', 'type': 'message', 'status': 'in_progress', 'role': 'assistant', 'content': []}\n", "\n", "=== User turn (Realtime transcript) ===\n", - "Yeah, sure, M-I-N-H-A-J-U-L H-O-Q-U-E.\n", + "It's an auto one, but I think you got my name wrong. Can you ask my name again?\n", "[Realtime out-of-band transcription usage]\n", "{\n", - " \"total_tokens\": 645,\n", - " \"input_tokens\": 625,\n", - " \"output_tokens\": 20,\n", + " \"total_tokens\": 2255,\n", + " \"input_tokens\": 2232,\n", + " \"output_tokens\": 23,\n", " \"input_token_details\": {\n", - " \"text_tokens\": 461,\n", - " \"audio_tokens\": 164,\n", + " \"text_tokens\": 2055,\n", + " \"audio_tokens\": 177,\n", " \"image_tokens\": 0,\n", - " \"cached_tokens\": 0,\n", + " \"cached_tokens\": 1856,\n", " \"cached_tokens_details\": {\n", - " \"text_tokens\": 0,\n", + " \"text_tokens\": 1856,\n", " \"audio_tokens\": 0,\n", " \"image_tokens\": 0\n", " }\n", " },\n", " \"output_token_details\": {\n", - " \"text_tokens\": 20,\n", + " \"text_tokens\": 23,\n", " \"audio_tokens\": 0\n", " }\n", "}\n", - "[Realtime out-of-band transcription cost estimate] text_in=$0.001844, text_in_cached=$0.000000, audio_in=$0.005248, audio_in_cached=$0.000000, text_out=$0.000320, audio_out=$0.000000, total=$0.007412\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.000796, text_in_cached=$0.000742, audio_in=$0.005664, audio_in_cached=$0.000000, text_out=$0.000368, audio_out=$0.000000, total=$0.007570\n", "\n", "=== User turn (Transcription model) 
===\n", - "Niajer Miahjul Hoque\n", + "It's a auto one, but I think you got my name wrong, can you ask my name again?\n", "[Transcription model usage]\n", "{\n", " \"type\": \"tokens\",\n", - " \"total_tokens\": 55,\n", - " \"input_tokens\": 45,\n", + " \"total_tokens\": 83,\n", + " \"input_tokens\": 60,\n", " \"input_token_details\": {\n", " \"text_tokens\": 0,\n", - " \"audio_tokens\": 45\n", + " \"audio_tokens\": 60\n", " },\n", - " \"output_tokens\": 10\n", + " \"output_tokens\": 23\n", "}\n", - "[Transcription model cost estimate] audio_in=$0.000270, text_in=$0.000000, text_out=$0.000100, total=$0.000370\n", + "[Transcription model cost estimate] audio_in=$0.000360, text_in=$0.000000, text_out=$0.000230, total=$0.000590\n", "\n", "[Realtime usage]\n", "{\n", - " \"total_tokens\": 865,\n", - " \"input_tokens\": 670,\n", - " \"output_tokens\": 195,\n", + " \"total_tokens\": 1779,\n", + " \"input_tokens\": 1625,\n", + " \"output_tokens\": 154,\n", " \"input_token_details\": {\n", - " \"text_tokens\": 309,\n", - " \"audio_tokens\": 361,\n", + " \"text_tokens\": 1251,\n", + " \"audio_tokens\": 374,\n", " \"image_tokens\": 0,\n", - " \"cached_tokens\": 0,\n", + " \"cached_tokens\": 832,\n", " \"cached_tokens_details\": {\n", - " \"text_tokens\": 0,\n", + " \"text_tokens\": 832,\n", " \"audio_tokens\": 0,\n", " \"image_tokens\": 0\n", " }\n", " },\n", " \"output_token_details\": {\n", - " \"text_tokens\": 46,\n", - " \"audio_tokens\": 149\n", + " \"text_tokens\": 41,\n", + " \"audio_tokens\": 113\n", " }\n", "}\n", "\n", "=== Assistant response ===\n", - "Thank you for spelling that out. I have it now as Minhajul Hoque. Let’s confirm: Minhajul Hoque. Is that correct?\n", + "Of course, let’s make sure I have it correct. 
Could you please spell out your full legal name for me again, carefully?\n", "\n", "\n", "[client] Speech detected; streaming...\n", "[client] Detected silence; preparing transcript...\n", + "conversation.item.added: {'id': 'item_CfpuYJBwNQubeb7uuHqQQ', 'type': 'message', 'status': 'completed', 'role': 'user', 'content': [{'type': 'input_audio', 'transcript': None}]}\n", + "conversation.item.added: {'id': 'item_CfpuaI6ZvKBwZG6yXxE1l', 'type': 'message', 'status': 'in_progress', 'role': 'assistant', 'content': []}\n", "\n", "=== User turn (Realtime transcript) ===\n", - "Yep, that's correct.\n", + "Minhajul Hoque.\n", "[Realtime out-of-band transcription usage]\n", "{\n", - " \"total_tokens\": 736,\n", - " \"input_tokens\": 729,\n", - " \"output_tokens\": 7,\n", + " \"total_tokens\": 2261,\n", + " \"input_tokens\": 2252,\n", + " \"output_tokens\": 9,\n", " \"input_token_details\": {\n", - " \"text_tokens\": 505,\n", - " \"audio_tokens\": 224,\n", + " \"text_tokens\": 2092,\n", + " \"audio_tokens\": 160,\n", " \"image_tokens\": 0,\n", - " \"cached_tokens\": 0,\n", + " \"cached_tokens\": 1856,\n", " \"cached_tokens_details\": {\n", - " \"text_tokens\": 0,\n", + " \"text_tokens\": 1856,\n", " \"audio_tokens\": 0,\n", " \"image_tokens\": 0\n", " }\n", " },\n", " \"output_token_details\": {\n", - " \"text_tokens\": 7,\n", + " \"text_tokens\": 9,\n", " \"audio_tokens\": 0\n", " }\n", "}\n", - "[Realtime out-of-band transcription cost estimate] text_in=$0.002020, text_in_cached=$0.000000, audio_in=$0.007168, audio_in_cached=$0.000000, text_out=$0.000112, audio_out=$0.000000, total=$0.009300\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.000944, text_in_cached=$0.000742, audio_in=$0.005120, audio_in_cached=$0.000000, text_out=$0.000144, audio_out=$0.000000, total=$0.006950\n", "\n", "=== User turn (Transcription model) ===\n", - "Yep, that's correct.\n", + "مينهاجو حق.\n", "[Transcription model usage]\n", "{\n", " \"type\": \"tokens\",\n", - " 
\"total_tokens\": 28,\n", - " \"input_tokens\": 21,\n", + " \"total_tokens\": 27,\n", + " \"input_tokens\": 20,\n", " \"input_token_details\": {\n", " \"text_tokens\": 0,\n", - " \"audio_tokens\": 21\n", + " \"audio_tokens\": 20\n", " },\n", " \"output_tokens\": 7\n", "}\n", - "[Transcription model cost estimate] audio_in=$0.000126, text_in=$0.000000, text_out=$0.000070, total=$0.000196\n", + "[Transcription model cost estimate] audio_in=$0.000120, text_in=$0.000000, text_out=$0.000070, total=$0.000190\n", "\n", "[Realtime usage]\n", "{\n", - " \"total_tokens\": 1011,\n", - " \"input_tokens\": 774,\n", - " \"output_tokens\": 237,\n", + " \"total_tokens\": 1902,\n", + " \"input_tokens\": 1645,\n", + " \"output_tokens\": 257,\n", " \"input_token_details\": {\n", - " \"text_tokens\": 353,\n", - " \"audio_tokens\": 421,\n", + " \"text_tokens\": 1288,\n", + " \"audio_tokens\": 357,\n", " \"image_tokens\": 0,\n", - " \"cached_tokens\": 0,\n", + " \"cached_tokens\": 832,\n", " \"cached_tokens_details\": {\n", - " \"text_tokens\": 0,\n", + " \"text_tokens\": 832,\n", " \"audio_tokens\": 0,\n", " \"image_tokens\": 0\n", " }\n", " },\n", " \"output_token_details\": {\n", - " \"text_tokens\": 57,\n", - " \"audio_tokens\": 180\n", + " \"text_tokens\": 54,\n", + " \"audio_tokens\": 203\n", " }\n", "}\n", "\n", "=== Assistant response ===\n", - "Great, we’ve got your name as Minhajul Hoque. Now, let’s move on. What’s the type of accident? For example, is it auto, home, or something else?\n", + "Thank you. Let me confirm: your full legal name is spelled M-I-N-H-A-J-U-L, and the last name H-O-Q-U-E. 
Is that correct?\n",
    "\n",
    "Session cancelled; closing.\n"
   ]
  }
 ],
 "source": [
  "await run_realtime_session(debug_usage_and_cost=True)"
 ]
},
{
 "cell_type": "markdown",
 "id": "7567b84c",
 "metadata": {},
 "source": [
  "### Transcription Cost Comparison\n",
  "\n",
  "#### Costs Summary\n",
  "\n",
  "* **Realtime Out-of-Band (OOB):** $0.040974 total (~$0.006829 per turn)\n",
  "* **Dedicated Transcription:** $0.002114 total (~$0.000352 per turn)\n",
  "* **OOB is ~19× more expensive when given the full session context**\n",
  "\n",
  "#### Considerations\n",
  "\n",
  "* **Caching:** Because these conversations are short, you benefit little from caching beyond the initial system prompt.\n",
  "* **Transcription System Prompt:** The transcription model uses a minimal system prompt, so input costs are not significantly increased.\n",
  "\n",
  "#### Recommended Cost-Saving Strategy\n",
  "\n",
  "* **Limit transcription to recent turns:** Minimizing audio/text context significantly reduces OOB transcription costs.\n",
  "\n",
  "#### Understanding Cache Behavior\n",
  "\n",
  "* Effective caching requires stable prompt instructions (usually 1,024+ tokens).\n",
  "* Different instruction prompts between OOB and main assistant sessions result in separate caches.\n"
 ]
},
{
 "cell_type": "markdown",
 "id": "59f508c4",
 "metadata": {},
 "source": [
  "### Cost for Transcribing Only the Latest Turn\n",
  "You can transcribe only the latest user turn by passing `item_reference` entries in the response's `input` field:\n",
  "```python\n",
  "    if item_ids:\n",
  "        response[\"input\"] = [\n",
  "            {\"type\": \"item_reference\", \"id\": item_id} for item_id in item_ids\n",
  "        ]\n",
  "\n",
  "    return {\n",
  "        \"type\": \"response.create\",\n",
  "        \"response\": response,\n",
  "    }\n",
  "```\n",
  "\n",
  "Transcribing only the latest user turn reduces costs by limiting session context. 
However, the model then loses the prior conversation context that could help it resolve simple ambiguities (e.g., a name the user spelled out earlier), and because the request prefix changes on every turn, you also lose most of the prompt-caching benefit."
  ]
 },
 {
  "cell_type": "code",
  "execution_count": 136,
  "id": "7d42ceb8",
  "metadata": {},
  "outputs": [
   {
    "name": "stdout",
    "output_type": "stream",
    "text": [
     "Streaming microphone audio at 24000 Hz (mono). Speak naturally; server VAD will stop listening when you pause.\n",
     "\n",
     "[client] Speech detected; streaming...\n",
     "[client] Detected silence; preparing transcript...\n",
     "\n",
     "=== User turn (Realtime transcript) ===\n",
     "Hello.\n",
     "[Realtime out-of-band transcription usage]\n",
     "{\n",
     "  \"total_tokens\": 1813,\n",
     "  \"input_tokens\": 1809,\n",
     "  \"output_tokens\": 4,\n",
     "  \"input_token_details\": {\n",
     "    \"text_tokens\": 1809,\n",
     "    \"audio_tokens\": 0,\n",
     "    \"image_tokens\": 0,\n",
     "    \"cached_tokens\": 0,\n",
     "    \"cached_tokens_details\": {\n",
@@ -1941,1206 +1948,93 @@
     "    }\n",
     "  },\n",
     "  \"output_token_details\": {\n",
     "    \"text_tokens\": 4,\n",
     "    \"audio_tokens\": 0\n",
     "  }\n",
     "}\n",
     "[Realtime out-of-band transcription cost estimate] text_in=$0.007236, text_in_cached=$0.000000, audio_in=$0.000000, audio_in_cached=$0.000000, text_out=$0.000064, audio_out=$0.000000, total=$0.007300\n",
     "\n",
     "=== User turn (Transcription model) ===\n",
     "Hello\n",
     "[Transcription model usage]\n",
     "{\n",
     "  \"type\": \"tokens\",\n",
-     "  
\"total_tokens\": 49,\n", - " \"input_tokens\": 41,\n", + " \"total_tokens\": 17,\n", + " \"input_tokens\": 14,\n", " \"input_token_details\": {\n", " \"text_tokens\": 0,\n", - " \"audio_tokens\": 41\n", + " \"audio_tokens\": 14\n", " },\n", - " \"output_tokens\": 8\n", + " \"output_tokens\": 3\n", "}\n", - "[Transcription model cost estimate] audio_in=$0.000246, text_in=$0.000000, text_out=$0.000080, total=$0.000326\n", + "[Transcription model cost estimate] audio_in=$0.000084, text_in=$0.000000, text_out=$0.000030, total=$0.000114\n", "\n", - "[Realtime usage]\n", + "\n", + "=== Assistant response ===\n", + "Thank you for calling OpenAI Insurance Claims. My name is Alex, and I’ll help you file your claim today. May I please have your full legal name as it appears on your policy?\n", + "\n", + "\n", + "[client] Speech detected; streaming...\n", + "[client] Detected silence; preparing transcript...\n", + "\n", + "=== User turn (Realtime transcript) ===\n", + "My full legal name is M-I-N-H A-J-U-L H-O-Q-U-E\n", + "[Realtime out-of-band transcription usage]\n", "{\n", - " \"total_tokens\": 1324,\n", - " \"input_tokens\": 901,\n", - " \"output_tokens\": 423,\n", + " \"total_tokens\": 1829,\n", + " \"input_tokens\": 1809,\n", + " \"output_tokens\": 20,\n", " \"input_token_details\": {\n", - " \"text_tokens\": 408,\n", - " \"audio_tokens\": 493,\n", + " \"text_tokens\": 1809,\n", + " \"audio_tokens\": 0,\n", " \"image_tokens\": 0,\n", - " \"cached_tokens\": 0,\n", + " \"cached_tokens\": 1792,\n", " \"cached_tokens_details\": {\n", - " \"text_tokens\": 0,\n", + " \"text_tokens\": 1792,\n", " \"audio_tokens\": 0,\n", " \"image_tokens\": 0\n", " }\n", " },\n", " \"output_token_details\": {\n", - " \"text_tokens\": 88,\n", - " \"audio_tokens\": 335\n", + " \"text_tokens\": 20,\n", + " \"audio_tokens\": 0\n", " }\n", "}\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.000068, text_in_cached=$0.000717, audio_in=$0.000000, audio_in_cached=$0.000000, 
text_out=$0.000320, audio_out=$0.000000, total=$0.001105\n", "\n", - "=== Assistant response ===\n", - "Thank you. So the type of accident is auto. Let’s confirm: we have your full name as Minhajul Hoque, your policy number as 00X7-B725, and the accident type is auto. Let’s move on to a few yes/no questions. \n", + "=== User turn (Transcription model) ===\n", + "My full legal name is Minhajul Hoque.\n", + "[Transcription model usage]\n", + "{\n", + " \"type\": \"tokens\",\n", + " \"total_tokens\": 87,\n", + " \"input_tokens\": 74,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 74\n", + " },\n", + " \"output_tokens\": 13\n", + "}\n", + "[Transcription model cost estimate] audio_in=$0.000444, text_in=$0.000000, text_out=$0.000130, total=$0.000574\n", "\n", - "First, was anyone injured?\n", - "\n", - "Session cancelled; closing.\n" - ] - } - ], - "source": [ - "await run_realtime_session(debug_usage_and_cost=True)" - ] - }, - { - "cell_type": "code", - "execution_count": 75, - "id": "26fa9399", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Streaming microphone audio at 24000 Hz (mono). 
Speak naturally; server VAD will stop listening when you pause.\n", - "\n", - "[client] Speech detected; streaming...\n", - "[client] Detected silence; preparing transcript...\n", - "\n", - "=== User turn (Realtime transcript) ===\n", - "Hello, I'm trying to do an example.\n", - "[Realtime out-of-band transcription usage]\n", - "{\n", - " \"total_tokens\": 732,\n", - " \"input_tokens\": 721,\n", - " \"output_tokens\": 11,\n", - " \"input_token_details\": {\n", - " \"text_tokens\": 721,\n", - " \"audio_tokens\": 0,\n", - " \"image_tokens\": 0,\n", - " \"cached_tokens\": 0,\n", - " \"cached_tokens_details\": {\n", - " \"text_tokens\": 0,\n", - " \"audio_tokens\": 0,\n", - " \"image_tokens\": 0\n", - " }\n", - " },\n", - " \"output_token_details\": {\n", - " \"text_tokens\": 11,\n", - " \"audio_tokens\": 0\n", - " }\n", - "}\n", - "[Realtime out-of-band transcription cost estimate] text_in=$0.002884, text_in_cached=$0.000000, audio_in=$0.000000, audio_in_cached=$0.000000, text_out=$0.000176, audio_out=$0.000000, total=$0.003060\n", - "\n", - "=== User turn (Transcription model) ===\n", - "Hello, I'm trying to do an example.\n", - "[Transcription model usage]\n", - "{\n", - " \"type\": \"tokens\",\n", - " \"total_tokens\": 42,\n", - " \"input_tokens\": 31,\n", - " \"input_token_details\": {\n", - " \"text_tokens\": 0,\n", - " \"audio_tokens\": 31\n", - " },\n", - " \"output_tokens\": 11\n", - "}\n", - "[Transcription model cost estimate] audio_in=$0.000186, text_in=$0.000000, text_out=$0.000110, total=$0.000296\n", - "\n", - "[Realtime usage]\n", - "{\n", - " \"total_tokens\": 1308,\n", - " \"input_tokens\": 1063,\n", - " \"output_tokens\": 245,\n", - " \"input_token_details\": {\n", - " \"text_tokens\": 1032,\n", - " \"audio_tokens\": 31,\n", - " \"image_tokens\": 0,\n", - " \"cached_tokens\": 0,\n", - " \"cached_tokens_details\": {\n", - " \"text_tokens\": 0,\n", - " \"audio_tokens\": 0,\n", - " \"image_tokens\": 0\n", - " }\n", - " },\n", - " 
\"output_token_details\": {\n", - " \"text_tokens\": 56,\n", - " \"audio_tokens\": 189\n", - " }\n", - "}\n", - "\n", - "=== Assistant response ===\n", - "Thank you for calling OpenAI Insurance Claims. My name is Alex, and I’ll help you file your claim today. Let’s start by getting your full legal name as it appears on your policy.\n", - "\n", - "\n", - "[client] Speech detected; streaming...\n", - "[client] Detected silence; preparing transcript...\n", - "\n", - "=== User turn (Realtime transcript) ===\n", - "Sounds good, my full legal name would be M I N H H A J U L H O Q U E\n", - "[Realtime out-of-band transcription usage]\n", - "{\n", - " \"total_tokens\": 892,\n", - " \"input_tokens\": 867,\n", - " \"output_tokens\": 25,\n", - " \"input_token_details\": {\n", - " \"text_tokens\": 787,\n", - " \"audio_tokens\": 80,\n", - " \"image_tokens\": 0,\n", - " \"cached_tokens\": 704,\n", - " \"cached_tokens_details\": {\n", - " \"text_tokens\": 704,\n", - " \"audio_tokens\": 0,\n", - " \"image_tokens\": 0\n", - " }\n", - " },\n", - " \"output_token_details\": {\n", - " \"text_tokens\": 25,\n", - " \"audio_tokens\": 0\n", - " }\n", - "}\n", - "[Realtime out-of-band transcription cost estimate] text_in=$0.000332, text_in_cached=$0.000282, audio_in=$0.002560, audio_in_cached=$0.000000, text_out=$0.000400, audio_out=$0.000000, total=$0.003574\n", - "\n", - "=== User turn (Transcription model) ===\n", - "Sounds good. 
My full legal name would be Minhajul Hoque.\n", - "[Transcription model usage]\n", - "{\n", - " \"type\": \"tokens\",\n", - " \"total_tokens\": 74,\n", - " \"input_tokens\": 57,\n", - " \"input_token_details\": {\n", - " \"text_tokens\": 0,\n", - " \"audio_tokens\": 57\n", - " },\n", - " \"output_tokens\": 17\n", - "}\n", - "[Transcription model cost estimate] audio_in=$0.000342, text_in=$0.000000, text_out=$0.000170, total=$0.000512\n", - "\n", - "[Realtime usage]\n", - "{\n", - " \"total_tokens\": 1533,\n", - " \"input_tokens\": 1375,\n", - " \"output_tokens\": 158,\n", - " \"input_token_details\": {\n", - " \"text_tokens\": 1098,\n", - " \"audio_tokens\": 277,\n", - " \"image_tokens\": 0,\n", - " \"cached_tokens\": 1280,\n", - " \"cached_tokens_details\": {\n", - " \"text_tokens\": 1088,\n", - " \"audio_tokens\": 192,\n", - " \"image_tokens\": 0\n", - " }\n", - " },\n", - " \"output_token_details\": {\n", - " \"text_tokens\": 42,\n", - " \"audio_tokens\": 116\n", - " }\n", - "}\n", - "\n", - "=== Assistant response ===\n", - "Thank you, Minhajul Hoque. 
Could you please spell that out for me, just to make sure I have it exactly right?\n", - "\n", - "\n", - "[client] Speech detected; streaming...\n", - "[client] Detected silence; preparing transcript...\n", - "\n", - "=== User turn (Realtime transcript) ===\n", - "Yeah, sure, it would be M I N H A J U L H O Q U E\n", - "[Realtime out-of-band transcription usage]\n", - "{\n", - " \"total_tokens\": 939,\n", - " \"input_tokens\": 917,\n", - " \"output_tokens\": 22,\n", - " \"input_token_details\": {\n", - " \"text_tokens\": 825,\n", - " \"audio_tokens\": 92,\n", - " \"image_tokens\": 0,\n", - " \"cached_tokens\": 704,\n", - " \"cached_tokens_details\": {\n", - " \"text_tokens\": 704,\n", - " \"audio_tokens\": 0,\n", - " \"image_tokens\": 0\n", - " }\n", - " },\n", - " \"output_token_details\": {\n", - " \"text_tokens\": 22,\n", - " \"audio_tokens\": 0\n", - " }\n", - "}\n", - "[Realtime out-of-band transcription cost estimate] text_in=$0.000484, text_in_cached=$0.000282, audio_in=$0.002944, audio_in_cached=$0.000000, text_out=$0.000352, audio_out=$0.000000, total=$0.004062\n", - "\n", - "=== User turn (Transcription model) ===\n", - "Yeah sure, it would be m i n h a j a u l h o q u e.\n", - "[Transcription model usage]\n", - "{\n", - " \"type\": \"tokens\",\n", - " \"total_tokens\": 108,\n", - " \"input_tokens\": 85,\n", - " \"input_token_details\": {\n", - " \"text_tokens\": 0,\n", - " \"audio_tokens\": 85\n", - " },\n", - " \"output_tokens\": 23\n", - "}\n", - "[Transcription model cost estimate] audio_in=$0.000510, text_in=$0.000000, text_out=$0.000230, total=$0.000740\n", - "\n", - "[Realtime usage]\n", - "{\n", - " \"total_tokens\": 1681,\n", - " \"input_tokens\": 1425,\n", - " \"output_tokens\": 256,\n", - " \"input_token_details\": {\n", - " \"text_tokens\": 1136,\n", - " \"audio_tokens\": 289,\n", - " \"image_tokens\": 0,\n", - " \"cached_tokens\": 832,\n", - " \"cached_tokens_details\": {\n", - " \"text_tokens\": 832,\n", - " \"audio_tokens\": 0,\n", - " 
\"image_tokens\": 0\n", - " }\n", - " },\n", - " \"output_token_details\": {\n", - " \"text_tokens\": 54,\n", - " \"audio_tokens\": 202\n", - " }\n", - "}\n", - "\n", - "=== Assistant response ===\n", - "Thank you for spelling that. Just to confirm, I have M-I-N-H-A-J-U-L, and the last name is H-O-Q-U-E. Is that correct?\n", - "\n", - "\n", - "[client] Speech detected; streaming...\n", - "[client] Detected silence; preparing transcript...\n", - "\n", - "=== User turn (Realtime transcript) ===\n", - "Yep, that's correct, let's continue.\n", - "[Realtime out-of-band transcription usage]\n", - "{\n", - " \"total_tokens\": 1091,\n", - " \"input_tokens\": 1081,\n", - " \"output_tokens\": 10,\n", - " \"input_token_details\": {\n", - " \"text_tokens\": 877,\n", - " \"audio_tokens\": 204,\n", - " \"image_tokens\": 0,\n", - " \"cached_tokens\": 704,\n", - " \"cached_tokens_details\": {\n", - " \"text_tokens\": 704,\n", - " \"audio_tokens\": 0,\n", - " \"image_tokens\": 0\n", - " }\n", - " },\n", - " \"output_token_details\": {\n", - " \"text_tokens\": 10,\n", - " \"audio_tokens\": 0\n", - " }\n", - "}\n", - "[Realtime out-of-band transcription cost estimate] text_in=$0.000692, text_in_cached=$0.000282, audio_in=$0.006528, audio_in_cached=$0.000000, text_out=$0.000160, audio_out=$0.000000, total=$0.007662\n", - "\n", - "=== User turn (Transcription model) ===\n", - "Yep, that's correct, let's continue.\n", - "[Transcription model usage]\n", - "{\n", - " \"type\": \"tokens\",\n", - " \"total_tokens\": 36,\n", - " \"input_tokens\": 26,\n", - " \"input_token_details\": {\n", - " \"text_tokens\": 0,\n", - " \"audio_tokens\": 26\n", - " },\n", - " \"output_tokens\": 10\n", - "}\n", - "[Transcription model cost estimate] audio_in=$0.000156, text_in=$0.000000, text_out=$0.000100, total=$0.000256\n", - "\n", - "[Realtime usage]\n", - "{\n", - " \"total_tokens\": 1715,\n", - " \"input_tokens\": 1589,\n", - " \"output_tokens\": 126,\n", - " \"input_token_details\": {\n", - " 
\"text_tokens\": 1188,\n", - " \"audio_tokens\": 401,\n", - " \"image_tokens\": 0,\n", - " \"cached_tokens\": 832,\n", - " \"cached_tokens_details\": {\n", - " \"text_tokens\": 832,\n", - " \"audio_tokens\": 0,\n", - " \"image_tokens\": 0\n", - " }\n", - " },\n", - " \"output_token_details\": {\n", - " \"text_tokens\": 31,\n", - " \"audio_tokens\": 95\n", - " }\n", - "}\n", - "\n", - "=== Assistant response ===\n", - "Great. Now let’s gather your policy number. Could you provide that for me, please?\n", - "\n", - "\n", - "[client] Speech detected; streaming...\n", - "[client] Detected silence; preparing transcript...\n", - "\n", - "=== User turn (Realtime transcript) ===\n", - "Yeah, my policy number is P075-BB72\n", - "[Realtime out-of-band transcription usage]\n", - "{\n", - " \"total_tokens\": 1079,\n", - " \"input_tokens\": 1066,\n", - " \"output_tokens\": 13,\n", - " \"input_token_details\": {\n", - " \"text_tokens\": 902,\n", - " \"audio_tokens\": 164,\n", - " \"image_tokens\": 0,\n", - " \"cached_tokens\": 704,\n", - " \"cached_tokens_details\": {\n", - " \"text_tokens\": 704,\n", - " \"audio_tokens\": 0,\n", - " \"image_tokens\": 0\n", - " }\n", - " },\n", - " \"output_token_details\": {\n", - " \"text_tokens\": 13,\n", - " \"audio_tokens\": 0\n", - " }\n", - "}\n", - "[Realtime out-of-band transcription cost estimate] text_in=$0.000792, text_in_cached=$0.000282, audio_in=$0.005248, audio_in_cached=$0.000000, text_out=$0.000208, audio_out=$0.000000, total=$0.006530\n", - "\n", - "=== User turn (Transcription model) ===\n", - "Yeah, my policy number is P075-BB72.\n", - "[Transcription model usage]\n", - "{\n", - " \"type\": \"tokens\",\n", - " \"total_tokens\": 81,\n", - " \"input_tokens\": 67,\n", - " \"input_token_details\": {\n", - " \"text_tokens\": 0,\n", - " \"audio_tokens\": 67\n", - " },\n", - " \"output_tokens\": 14\n", - "}\n", - "[Transcription model cost estimate] audio_in=$0.000402, text_in=$0.000000, text_out=$0.000140, total=$0.000542\n", - 
"\n", - "[Realtime usage]\n", - "{\n", - " \"total_tokens\": 1757,\n", - " \"input_tokens\": 1574,\n", - " \"output_tokens\": 183,\n", - " \"input_token_details\": {\n", - " \"text_tokens\": 1213,\n", - " \"audio_tokens\": 361,\n", - " \"image_tokens\": 0,\n", - " \"cached_tokens\": 832,\n", - " \"cached_tokens_details\": {\n", - " \"text_tokens\": 832,\n", - " \"audio_tokens\": 0,\n", - " \"image_tokens\": 0\n", - " }\n", - " },\n", - " \"output_token_details\": {\n", - " \"text_tokens\": 35,\n", - " \"audio_tokens\": 148\n", - " }\n", - "}\n", - "\n", - "=== Assistant response ===\n", - "Thank you. Let me confirm: your policy number is P075-BB72. Is that correct?\n", - "\n", - "\n", - "[client] Speech detected; streaming...\n", - "[client] Detected silence; preparing transcript...\n", - "\n", - "=== User turn (Realtime transcript) ===\n", - "Yeah, but I think you got my name wrong, uh, can you ask it again?\n", - "[Realtime out-of-band transcription usage]\n", - "{\n", - " \"total_tokens\": 1232,\n", - " \"input_tokens\": 1211,\n", - " \"output_tokens\": 21,\n", - " \"input_token_details\": {\n", - " \"text_tokens\": 937,\n", - " \"audio_tokens\": 274,\n", - " \"image_tokens\": 0,\n", - " \"cached_tokens\": 704,\n", - " \"cached_tokens_details\": {\n", - " \"text_tokens\": 704,\n", - " \"audio_tokens\": 0,\n", - " \"image_tokens\": 0\n", - " }\n", - " },\n", - " \"output_token_details\": {\n", - " \"text_tokens\": 21,\n", - " \"audio_tokens\": 0\n", - " }\n", - "}\n", - "[Realtime out-of-band transcription cost estimate] text_in=$0.000932, text_in_cached=$0.000282, audio_in=$0.008768, audio_in_cached=$0.000000, text_out=$0.000336, audio_out=$0.000000, total=$0.010318\n", - "\n", - "=== User turn (Transcription model) ===\n", - "Yeah, but I think you got my name wrong. 
Can you ask it again?\n", - "[Transcription model usage]\n", - "{\n", - " \"type\": \"tokens\",\n", - " \"total_tokens\": 76,\n", - " \"input_tokens\": 57,\n", - " \"input_token_details\": {\n", - " \"text_tokens\": 0,\n", - " \"audio_tokens\": 57\n", - " },\n", - " \"output_tokens\": 19\n", - "}\n", - "[Transcription model cost estimate] audio_in=$0.000342, text_in=$0.000000, text_out=$0.000190, total=$0.000532\n", - "\n", - "[Realtime usage]\n", - "{\n", - " \"total_tokens\": 1882,\n", - " \"input_tokens\": 1719,\n", - " \"output_tokens\": 163,\n", - " \"input_token_details\": {\n", - " \"text_tokens\": 1248,\n", - " \"audio_tokens\": 471,\n", - " \"image_tokens\": 0,\n", - " \"cached_tokens\": 832,\n", - " \"cached_tokens_details\": {\n", - " \"text_tokens\": 832,\n", - " \"audio_tokens\": 0,\n", - " \"image_tokens\": 0\n", - " }\n", - " },\n", - " \"output_token_details\": {\n", - " \"text_tokens\": 36,\n", - " \"audio_tokens\": 127\n", - " }\n", - "}\n", - "\n", - "=== Assistant response ===\n", - "Of course, let's correct that together. 
Could you please spell your full legal name for me again, carefully?\n", - "\n", - "\n", - "[client] Speech detected; streaming...\n", - "[client] Detected silence; preparing transcript...\n", - "\n", - "=== User turn (Realtime transcript) ===\n", - "My name is Minhajul Hoque.\n", - "[Realtime out-of-band transcription usage]\n", - "{\n", - " \"total_tokens\": 1261,\n", - " \"input_tokens\": 1250,\n", - " \"output_tokens\": 11,\n", - " \"input_token_details\": {\n", - " \"text_tokens\": 971,\n", - " \"audio_tokens\": 279,\n", - " \"image_tokens\": 0,\n", - " \"cached_tokens\": 704,\n", - " \"cached_tokens_details\": {\n", - " \"text_tokens\": 704,\n", - " \"audio_tokens\": 0,\n", - " \"image_tokens\": 0\n", - " }\n", - " },\n", - " \"output_token_details\": {\n", - " \"text_tokens\": 11,\n", - " \"audio_tokens\": 0\n", - " }\n", - "}\n", - "[Realtime out-of-band transcription cost estimate] text_in=$0.001068, text_in_cached=$0.000282, audio_in=$0.008928, audio_in_cached=$0.000000, text_out=$0.000176, audio_out=$0.000000, total=$0.010454\n", - "\n", - "=== User turn (Transcription model) ===\n", - "My name is Minhajul Haque\n", - "[Transcription model usage]\n", - "{\n", - " \"type\": \"tokens\",\n", - " \"total_tokens\": 36,\n", - " \"input_tokens\": 26,\n", - " \"input_token_details\": {\n", - " \"text_tokens\": 0,\n", - " \"audio_tokens\": 26\n", - " },\n", - " \"output_tokens\": 10\n", - "}\n", - "[Transcription model cost estimate] audio_in=$0.000156, text_in=$0.000000, text_out=$0.000100, total=$0.000256\n", - "\n", - "[Realtime usage]\n", - "{\n", - " \"total_tokens\": 1981,\n", - " \"input_tokens\": 1758,\n", - " \"output_tokens\": 223,\n", - " \"input_token_details\": {\n", - " \"text_tokens\": 1282,\n", - " \"audio_tokens\": 476,\n", - " \"image_tokens\": 0,\n", - " \"cached_tokens\": 832,\n", - " \"cached_tokens_details\": {\n", - " \"text_tokens\": 832,\n", - " \"audio_tokens\": 0,\n", - " \"image_tokens\": 0\n", - " }\n", - " },\n", - " 
\"output_token_details\": {\n", - " \"text_tokens\": 51,\n", - " \"audio_tokens\": 172\n", - " }\n", - "}\n", - "\n", - "=== Assistant response ===\n", - "Thank you. Let’s confirm the spelling. Is it M-I-N-H-A-J-U-L for the first name, and H-O-Q-U-E for the last name?\n", - "\n", - "\n", - "[client] Speech detected; streaming...\n", - "[client] Detected silence; preparing transcript...\n", - "\n", - "=== User turn (Realtime transcript) ===\n", - "Yep, you got it right, thank you.\n", - "[Realtime out-of-band transcription usage]\n", - "{\n", - " \"total_tokens\": 1383,\n", - " \"input_tokens\": 1371,\n", - " \"output_tokens\": 12,\n", - " \"input_token_details\": {\n", - " \"text_tokens\": 1020,\n", - " \"audio_tokens\": 351,\n", - " \"image_tokens\": 0,\n", - " \"cached_tokens\": 704,\n", - " \"cached_tokens_details\": {\n", - " \"text_tokens\": 704,\n", - " \"audio_tokens\": 0,\n", - " \"image_tokens\": 0\n", - " }\n", - " },\n", - " \"output_token_details\": {\n", - " \"text_tokens\": 12,\n", - " \"audio_tokens\": 0\n", - " }\n", - "}\n", - "[Realtime out-of-band transcription cost estimate] text_in=$0.001264, text_in_cached=$0.000282, audio_in=$0.011232, audio_in_cached=$0.000000, text_out=$0.000192, audio_out=$0.000000, total=$0.012970\n", - "\n", - "=== User turn (Transcription model) ===\n", - "Yep, you got it right, thank you.\n", - "[Transcription model usage]\n", - "{\n", - " \"type\": \"tokens\",\n", - " \"total_tokens\": 39,\n", - " \"input_tokens\": 27,\n", - " \"input_token_details\": {\n", - " \"text_tokens\": 0,\n", - " \"audio_tokens\": 27\n", - " },\n", - " \"output_tokens\": 12\n", - "}\n", - "[Transcription model cost estimate] audio_in=$0.000162, text_in=$0.000000, text_out=$0.000120, total=$0.000282\n", - "\n", - "[Realtime usage]\n", - "{\n", - " \"total_tokens\": 2102,\n", - " \"input_tokens\": 1879,\n", - " \"output_tokens\": 223,\n", - " \"input_token_details\": {\n", - " \"text_tokens\": 1331,\n", - " \"audio_tokens\": 548,\n", - " 
\"image_tokens\": 0,\n", - " \"cached_tokens\": 832,\n", - " \"cached_tokens_details\": {\n", - " \"text_tokens\": 832,\n", - " \"audio_tokens\": 0,\n", - " \"image_tokens\": 0\n", - " }\n", - " },\n", - " \"output_token_details\": {\n", - " \"text_tokens\": 53,\n", - " \"audio_tokens\": 170\n", - " }\n", - "}\n", - "\n", - "=== Assistant response ===\n", - "You're welcome. Now that we have your name and policy number, let’s move on. Could you tell me the type of accident—was it auto, home, or something else?\n", - "\n", - "\n", - "[client] Speech detected; streaming...\n", - "[client] Detected silence; preparing transcript...\n", - "\n", - "=== User turn (Realtime transcript) ===\n", - "uh it was auto\n", - "[Realtime out-of-band transcription usage]\n", - "{\n", - " \"total_tokens\": 1452,\n", - " \"input_tokens\": 1446,\n", - " \"output_tokens\": 6,\n", - " \"input_token_details\": {\n", - " \"text_tokens\": 1069,\n", - " \"audio_tokens\": 377,\n", - " \"image_tokens\": 0,\n", - " \"cached_tokens\": 704,\n", - " \"cached_tokens_details\": {\n", - " \"text_tokens\": 704,\n", - " \"audio_tokens\": 0,\n", - " \"image_tokens\": 0\n", - " }\n", - " },\n", - " \"output_token_details\": {\n", - " \"text_tokens\": 6,\n", - " \"audio_tokens\": 0\n", - " }\n", - "}\n", - "[Realtime out-of-band transcription cost estimate] text_in=$0.001460, text_in_cached=$0.000282, audio_in=$0.012064, audio_in_cached=$0.000000, text_out=$0.000096, audio_out=$0.000000, total=$0.013902\n", - "\n", - "=== User turn (Transcription model) ===\n", - "It was Otto.\n", - "[Transcription model usage]\n", - "{\n", - " \"type\": \"tokens\",\n", - " \"total_tokens\": 34,\n", - " \"input_tokens\": 28,\n", - " \"input_token_details\": {\n", - " \"text_tokens\": 0,\n", - " \"audio_tokens\": 28\n", - " },\n", - " \"output_tokens\": 6\n", - "}\n", - "[Transcription model cost estimate] audio_in=$0.000168, text_in=$0.000000, text_out=$0.000060, total=$0.000228\n", - "\n", - "[Realtime usage]\n", - "{\n", 
- "    \"total_tokens\": 2065,\n",
      "    \"input_tokens\": 1954,\n",
      "    \"output_tokens\": 111,\n",
      "    \"input_token_details\": {\n",
      "      \"text_tokens\": 1380,\n",
      "      \"audio_tokens\": 574,\n",
      "      \"image_tokens\": 0,\n",
      "      \"cached_tokens\": 1600,\n",
      "      \"cached_tokens_details\": {\n",
      "        \"text_tokens\": 1280,\n",
      "        \"audio_tokens\": 320,\n",
      "        \"image_tokens\": 0\n",
      "      }\n",
      "    },\n",
      "    \"output_token_details\": {\n",
      "      \"text_tokens\": 27,\n",
      "      \"audio_tokens\": 84\n",
      "    }\n",
      "}\n",
      "\n",
      "=== Assistant response ===\n",
      "Thank you. Now could you provide the preferred phone number for follow-up?\n",
      "\n",
      "Session cancelled; closing.\n"
     ]
    }
   ],
   "source": [
    "await run_realtime_session(debug_usage_and_cost=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0be41e7c",
   "metadata": {},
   "source": [
    "In this example, out-of-band transcription using the Realtime model costs **$0.0725** versus **$0.00364** for the dedicated transcription model across 9 turns — about **$0.0689 more total (~19.9× higher)**. This cost is driven up by repeatedly passing the **full, growing session context** and by **uncached audio input** each turn. The dedicated transcription model remains far cheaper and more stable because it processes **only the new audio turn with a minimal prompt**, so the per-turn token load doesn’t accumulate.\n",
    "\n",
    "---\n",
    "\n",
    "## What’s happening in *this* run (key notes)\n",
    "\n",
    "* **Caching kicks in early but only for the stable prompt head.**\n",
    "  After the large assistant turn (~1k+ tokens), your OOB calls reuse a **704-token cached text prefix**. 
That’s why later OOB `text_in` is tiny while `text_in_cached` is non-zero.\n",
    "\n",
    "* **Audio dominates OOB cost here.**\n",
    "  Each OOB request includes the new user audio (and sometimes more audio context), and those **audio tokens are mostly uncached**, so `audio_in` grows and becomes the main cost component.\n",
    "\n",
    "* **First OOB turn has 0 audio tokens.**\n",
    "  That first OOB usage shows `audio_tokens: 0`, meaning the OOB call likely fired before the audio item was fully committed to the session. It still produced the right transcript because the text context already contained (or implied) the utterance.\n",
    "\n",
    "* **Dedicated transcription stays cheap because context doesn’t grow.**\n",
    "  It’s effectively “turn-local”: small/no prompt + only the latest audio → stable low cost.\n",
    "\n",
    "* **Cost-control option:**\n",
    "  If OOB cost is a concern, you can transcribe **only the most recent turn** (or last N turns) instead of the whole session, keeping the OOB prompt short and preventing audio/text accumulation.\n",
    "\n",
    "---\n",
    "\n",
    "## How prompt caching interacts with out-of-band responses\n",
    "\n",
    "Overriding `instructions` in `response.create` replaces the rendered system/instructions text for that single inference pass. Prompt caching, meanwhile, is strictly prefix-based: the API reuses the longest previously computed prefix, and caching only becomes eligible once the prompt is large enough (commonly described as 1,024+ tokens, then growing in chunks with exact-prefix matching).\n",
    "\n",
    "* **Main responses can cache across turns.** If the main assistant requests use a stable session prompt (or a stable per-response instructions override), they build a shared cache. Changing the instructions string every turn, even slightly and early in the prompt, busts the prefix match.\n",
    "\n",
    "* **OOB requests can cache among themselves.** If the OOB transcription instructions are identical each time, OOB requests build their own cache lineage and reuse it on later OOB turns.\n",
    "\n",
    "* **Main and OOB requests will not share a cache if their instructions differ.** The main request prefix begins with the assistant prompt, while the OOB prefix begins with the transcription prompt. The prefixes diverge at the first token, so a cached prefix from one can never apply to the other.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 84,
   "id": "482e82c2",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Streaming microphone audio at 24000 Hz (mono). 
Speak naturally; server VAD will stop listening when you pause.\n", - "\n", - "[client] Speech detected; streaming...\n", - "[client] Detected silence; preparing transcript...\n", - "\n", - "=== User turn (Realtime transcript) ===\n", - "Hello.\n", - "[Realtime out-of-band transcription usage]\n", - "{\n", - " \"total_tokens\": 1824,\n", - " \"input_tokens\": 1820,\n", - " \"output_tokens\": 4,\n", - " \"input_token_details\": {\n", - " \"text_tokens\": 1820,\n", - " \"audio_tokens\": 0,\n", - " \"image_tokens\": 0,\n", - " \"cached_tokens\": 0,\n", - " \"cached_tokens_details\": {\n", - " \"text_tokens\": 0,\n", - " \"audio_tokens\": 0,\n", - " \"image_tokens\": 0\n", - " }\n", - " },\n", - " \"output_token_details\": {\n", - " \"text_tokens\": 4,\n", - " \"audio_tokens\": 0\n", - " }\n", - "}\n", - "[Realtime out-of-band transcription cost estimate] text_in=$0.007280, text_in_cached=$0.000000, audio_in=$0.000000, audio_in_cached=$0.000000, text_out=$0.000064, audio_out=$0.000000, total=$0.007344\n", - "\n", - "=== User turn (Transcription model) ===\n", - "Hello\n", - "[Transcription model usage]\n", - "{\n", - " \"type\": \"tokens\",\n", - " \"total_tokens\": 19,\n", - " \"input_tokens\": 16,\n", - " \"input_token_details\": {\n", - " \"text_tokens\": 0,\n", - " \"audio_tokens\": 16\n", - " },\n", - " \"output_tokens\": 3\n", - "}\n", - "[Transcription model cost estimate] audio_in=$0.000096, text_in=$0.000000, text_out=$0.000030, total=$0.000126\n", - "\n", - "[Realtime usage]\n", - "{\n", - " \"total_tokens\": 1294,\n", - " \"input_tokens\": 1048,\n", - " \"output_tokens\": 246,\n", - " \"input_token_details\": {\n", - " \"text_tokens\": 1032,\n", - " \"audio_tokens\": 16,\n", - " \"image_tokens\": 0,\n", - " \"cached_tokens\": 0,\n", - " \"cached_tokens_details\": {\n", - " \"text_tokens\": 0,\n", - " \"audio_tokens\": 0,\n", - " \"image_tokens\": 0\n", - " }\n", - " },\n", - " \"output_token_details\": {\n", - " \"text_tokens\": 57,\n", - " \"audio_tokens\": 
189\n", - " }\n", - "}\n", - "\n", - "=== Assistant response ===\n", - "Thank you for calling OpenAI Insurance Claims. My name is Alex, and I’ll help you file your claim today. May I please have your full legal name as it appears on your policy?\n", - "\n", - "\n", - "[client] Speech detected; streaming...\n", - "[client] Detected silence; preparing transcript...\n", - "\n", - "=== User turn (Realtime transcript) ===\n", - "My legal name is M I N H A J U L H O Q U E\n", - "[Realtime out-of-band transcription usage]\n", - "{\n", - " \"total_tokens\": 1962,\n", - " \"input_tokens\": 1942,\n", - " \"output_tokens\": 20,\n", - " \"input_token_details\": {\n", - " \"text_tokens\": 1887,\n", - " \"audio_tokens\": 55,\n", - " \"image_tokens\": 0,\n", - " \"cached_tokens\": 1792,\n", - " \"cached_tokens_details\": {\n", - " \"text_tokens\": 1792,\n", - " \"audio_tokens\": 0,\n", - " \"image_tokens\": 0\n", - " }\n", - " },\n", - " \"output_token_details\": {\n", - " \"text_tokens\": 20,\n", - " \"audio_tokens\": 0\n", - " }\n", - "}\n", - "[Realtime out-of-band transcription cost estimate] text_in=$0.000380, text_in_cached=$0.000717, audio_in=$0.001760, audio_in_cached=$0.000000, text_out=$0.000320, audio_out=$0.000000, total=$0.003177\n", - "\n", - "=== User turn (Transcription model) ===\n", - "My legal name is M-I-N-H-A-J-U-L H-O-Q-U-E.\n", - "[Transcription model usage]\n", - "{\n", - " \"type\": \"tokens\",\n", - " \"total_tokens\": 67,\n", - " \"input_tokens\": 47,\n", - " \"input_token_details\": {\n", - " \"text_tokens\": 0,\n", - " \"audio_tokens\": 47\n", - " },\n", - " \"output_tokens\": 20\n", - "}\n", - "[Transcription model cost estimate] audio_in=$0.000282, text_in=$0.000000, text_out=$0.000200, total=$0.000482\n", - "\n", - "[Realtime usage]\n", - "{\n", - " \"total_tokens\": 1518,\n", - " \"input_tokens\": 1351,\n", - " \"output_tokens\": 167,\n", - " \"input_token_details\": {\n", - " \"text_tokens\": 1099,\n", - " \"audio_tokens\": 252,\n", - " 
\"image_tokens\": 0,\n", - " \"cached_tokens\": 1280,\n", - " \"cached_tokens_details\": {\n", - " \"text_tokens\": 1088,\n", - " \"audio_tokens\": 192,\n", - " \"image_tokens\": 0\n", - " }\n", - " },\n", - " \"output_token_details\": {\n", - " \"text_tokens\": 40,\n", - " \"audio_tokens\": 127\n", - " }\n", - "}\n", - "\n", - "=== Assistant response ===\n", - "Thank you. Let me confirm I got that right. Your full legal name is Minhajul Hoque. Is that correct?\n", - "\n", - "\n", - "[client] Speech detected; streaming...\n", - "[client] Detected silence; preparing transcript...\n", - "\n", - "=== User turn (Realtime transcript) ===\n", - "Yes, that is my name.\n", - "[Realtime out-of-band transcription usage]\n", - "{\n", - " \"total_tokens\": 1944,\n", - " \"input_tokens\": 1935,\n", - " \"output_tokens\": 9,\n", - " \"input_token_details\": {\n", - " \"text_tokens\": 1921,\n", - " \"audio_tokens\": 14,\n", - " \"image_tokens\": 0,\n", - " \"cached_tokens\": 1792,\n", - " \"cached_tokens_details\": {\n", - " \"text_tokens\": 1792,\n", - " \"audio_tokens\": 0,\n", - " \"image_tokens\": 0\n", - " }\n", - " },\n", - " \"output_token_details\": {\n", - " \"text_tokens\": 9,\n", - " \"audio_tokens\": 0\n", - " }\n", - "}\n", - "[Realtime out-of-band transcription cost estimate] text_in=$0.000516, text_in_cached=$0.000717, audio_in=$0.000448, audio_in_cached=$0.000000, text_out=$0.000144, audio_out=$0.000000, total=$0.001825\n", - "\n", - "=== User turn (Transcription model) ===\n", - "Yes, that is my name.\n", - "[Transcription model usage]\n", - "{\n", - " \"type\": \"tokens\",\n", - " \"total_tokens\": 30,\n", - " \"input_tokens\": 21,\n", - " \"input_token_details\": {\n", - " \"text_tokens\": 0,\n", - " \"audio_tokens\": 21\n", - " },\n", - " \"output_tokens\": 9\n", - "}\n", - "[Transcription model cost estimate] audio_in=$0.000126, text_in=$0.000000, text_out=$0.000090, total=$0.000216\n", - "\n", - "[Realtime usage]\n", - "{\n", - " \"total_tokens\": 1575,\n", 
- " \"input_tokens\": 1344,\n", - " \"output_tokens\": 231,\n", - " \"input_token_details\": {\n", - " \"text_tokens\": 1133,\n", - " \"audio_tokens\": 211,\n", - " \"image_tokens\": 0,\n", - " \"cached_tokens\": 832,\n", - " \"cached_tokens_details\": {\n", - " \"text_tokens\": 832,\n", - " \"audio_tokens\": 0,\n", - " \"image_tokens\": 0\n", - " }\n", - " },\n", - " \"output_token_details\": {\n", - " \"text_tokens\": 52,\n", - " \"audio_tokens\": 179\n", - " }\n", - "}\n", - "\n", - "=== Assistant response ===\n", - "Great, thank you for confirming. Now, could you please provide your policy number? It should be in the format of four digits, a dash, and then four more digits.\n", - "\n", - "\n", - "[client] Speech detected; streaming...\n", - "[client] Detected silence; preparing transcript...\n", - "\n", - "=== User turn (Realtime transcript) ===\n", - "My policy number would be P 0 X 7 6 5 2 0.\n", - "[Realtime out-of-band transcription usage]\n", - "{\n", - " \"total_tokens\": 2120,\n", - " \"input_tokens\": 2098,\n", - " \"output_tokens\": 22,\n", - " \"input_token_details\": {\n", - " \"text_tokens\": 1971,\n", - " \"audio_tokens\": 127,\n", - " \"image_tokens\": 0,\n", - " \"cached_tokens\": 1856,\n", - " \"cached_tokens_details\": {\n", - " \"text_tokens\": 1856,\n", - " \"audio_tokens\": 0,\n", - " \"image_tokens\": 0\n", - " }\n", - " },\n", - " \"output_token_details\": {\n", - " \"text_tokens\": 22,\n", - " \"audio_tokens\": 0\n", - " }\n", - "}\n", - "[Realtime out-of-band transcription cost estimate] text_in=$0.000460, text_in_cached=$0.000742, audio_in=$0.004064, audio_in_cached=$0.000000, text_out=$0.000352, audio_out=$0.000000, total=$0.005618\n", - "\n", - "=== User turn (Transcription model) ===\n", - "My policy number would be P0X7 6520.\n", - "[Transcription model usage]\n", - "{\n", - " \"type\": \"tokens\",\n", - " \"total_tokens\": 76,\n", - " \"input_tokens\": 61,\n", - " \"input_token_details\": {\n", - " \"text_tokens\": 0,\n", - " 
\"audio_tokens\": 61\n", - " },\n", - " \"output_tokens\": 15\n", - "}\n", - "[Transcription model cost estimate] audio_in=$0.000366, text_in=$0.000000, text_out=$0.000150, total=$0.000516\n", - "\n", - "[Realtime usage]\n", - "{\n", - " \"total_tokens\": 1704,\n", - " \"input_tokens\": 1507,\n", - " \"output_tokens\": 197,\n", - " \"input_token_details\": {\n", - " \"text_tokens\": 1183,\n", - " \"audio_tokens\": 324,\n", - " \"image_tokens\": 0,\n", - " \"cached_tokens\": 832,\n", - " \"cached_tokens_details\": {\n", - " \"text_tokens\": 832,\n", - " \"audio_tokens\": 0,\n", - " \"image_tokens\": 0\n", - " }\n", - " },\n", - " \"output_token_details\": {\n", - " \"text_tokens\": 38,\n", - " \"audio_tokens\": 159\n", - " }\n", - "}\n", - "\n", - "=== Assistant response ===\n", - "Thank you. Let me confirm your policy number. You said P0X7-6520. Is that correct?\n", - "\n", - "\n", - "[client] Speech detected; streaming...\n", - "[client] Detected silence; preparing transcript...\n", - "\n", - "=== User turn (Realtime transcript) ===\n", - "Yep, that's indeed correct. Do you have that?\n", - "[Realtime out-of-band transcription usage]\n", - "{\n", - " \"total_tokens\": 2163,\n", - " \"input_tokens\": 2150,\n", - " \"output_tokens\": 13,\n", - " \"input_token_details\": {\n", - " \"text_tokens\": 2005,\n", - " \"audio_tokens\": 145,\n", - " \"image_tokens\": 0,\n", - " \"cached_tokens\": 1792,\n", - " \"cached_tokens_details\": {\n", - " \"text_tokens\": 1792,\n", - " \"audio_tokens\": 0,\n", - " \"image_tokens\": 0\n", - " }\n", - " },\n", - " \"output_token_details\": {\n", - " \"text_tokens\": 13,\n", - " \"audio_tokens\": 0\n", - " }\n", - "}\n", - "[Realtime out-of-band transcription cost estimate] text_in=$0.000852, text_in_cached=$0.000717, audio_in=$0.004640, audio_in_cached=$0.000000, text_out=$0.000208, audio_out=$0.000000, total=$0.006417\n", - "\n", - "=== User turn (Transcription model) ===\n", - "Yep, that's indeed correct. 
Do you have that?\n", - "[Transcription model usage]\n", - "{\n", - " \"type\": \"tokens\",\n", - " \"total_tokens\": 51,\n", - " \"input_tokens\": 38,\n", - " \"input_token_details\": {\n", - " \"text_tokens\": 0,\n", - " \"audio_tokens\": 38\n", - " },\n", - " \"output_tokens\": 13\n", - "}\n", - "[Transcription model cost estimate] audio_in=$0.000228, text_in=$0.000000, text_out=$0.000130, total=$0.000358\n", - "\n", - "[Realtime usage]\n", - "{\n", - " \"total_tokens\": 1790,\n", - " \"input_tokens\": 1559,\n", - " \"output_tokens\": 231,\n", - " \"input_token_details\": {\n", - " \"text_tokens\": 1217,\n", - " \"audio_tokens\": 342,\n", - " \"image_tokens\": 0,\n", - " \"cached_tokens\": 832,\n", - " \"cached_tokens_details\": {\n", - " \"text_tokens\": 832,\n", - " \"audio_tokens\": 0,\n", - " \"image_tokens\": 0\n", - " }\n", - " },\n", - " \"output_token_details\": {\n", - " \"text_tokens\": 53,\n", - " \"audio_tokens\": 178\n", - " }\n", - "}\n", - "\n", - "=== Assistant response ===\n", - "Perfect, thank you. Now, could you tell me what type of accident this claim is related to? 
For example, is it an auto, home, or another type of incident?\n", - "\n", - "\n", - "[client] Speech detected; streaming...\n", - "[client] Detected silence; preparing transcript...\n", - "\n", - "=== User turn (Realtime transcript) ===\n", - "It's a auto incident.\n", - "[Realtime out-of-band transcription usage]\n", - "{\n", - " \"total_tokens\": 2256,\n", - " \"input_tokens\": 2249,\n", - " \"output_tokens\": 7,\n", - " \"input_token_details\": {\n", - " \"text_tokens\": 2056,\n", - " \"audio_tokens\": 193,\n", - " \"image_tokens\": 0,\n", - " \"cached_tokens\": 1792,\n", - " \"cached_tokens_details\": {\n", - " \"text_tokens\": 1792,\n", - " \"audio_tokens\": 0,\n", - " \"image_tokens\": 0\n", - " }\n", - " },\n", - " \"output_token_details\": {\n", - " \"text_tokens\": 7,\n", - " \"audio_tokens\": 0\n", - " }\n", - "}\n", - "[Realtime out-of-band transcription cost estimate] text_in=$0.001056, text_in_cached=$0.000717, audio_in=$0.006176, audio_in_cached=$0.000000, text_out=$0.000112, audio_out=$0.000000, total=$0.008061\n", - "\n", - "=== User turn (Transcription model) ===\n", - "It's a auto incident.\n", - "[Transcription model usage]\n", - "{\n", - " \"type\": \"tokens\",\n", - " \"total_tokens\": 36,\n", - " \"input_tokens\": 29,\n", - " \"input_token_details\": {\n", - " \"text_tokens\": 0,\n", - " \"audio_tokens\": 29\n", - " },\n", - " \"output_tokens\": 7\n", - "}\n", - "[Transcription model cost estimate] audio_in=$0.000174, text_in=$0.000000, text_out=$0.000070, total=$0.000244\n", - "\n", - "[Realtime usage]\n", - "{\n", - " \"total_tokens\": 1812,\n", - " \"input_tokens\": 1658,\n", - " \"output_tokens\": 154,\n", - " \"input_token_details\": {\n", - " \"text_tokens\": 1268,\n", - " \"audio_tokens\": 390,\n", - " \"image_tokens\": 0,\n", - " \"cached_tokens\": 832,\n", - " \"cached_tokens_details\": {\n", - " \"text_tokens\": 832,\n", - " \"audio_tokens\": 0,\n", - " \"image_tokens\": 0\n", - " }\n", - " },\n", - " \"output_token_details\": {\n", 
- " \"text_tokens\": 38,\n", - " \"audio_tokens\": 116\n", - " }\n", - "}\n", "\n", "=== Assistant response ===\n", - "Thank you. Now I’d like to get your preferred phone number for follow-up. Could you please provide that number?\n", + "Thank you, Minhajul Hoque. I’ve noted your full legal name. Next, could you please provide your policy number? Remember, it's usually in a format like XXXX-XXXX.\n", "\n", "\n", "[client] Speech detected; streaming...\n", "[client] Detected silence; preparing transcript...\n", "\n", "=== User turn (Realtime transcript) ===\n", - "For sure, I can provide that—wait, can you say my name again? I think you got it wrong.\n", + "My policy number is X007-PX75.\n", "[Realtime out-of-band transcription usage]\n", "{\n", - " \"total_tokens\": 2303,\n", - " \"input_tokens\": 2277,\n", - " \"output_tokens\": 26,\n", + " \"total_tokens\": 1821,\n", + " \"input_tokens\": 1809,\n", + " \"output_tokens\": 12,\n", " \"input_token_details\": {\n", - " \"text_tokens\": 2090,\n", - " \"audio_tokens\": 187,\n", + " \"text_tokens\": 1809,\n", + " \"audio_tokens\": 0,\n", " \"image_tokens\": 0,\n", " \"cached_tokens\": 1792,\n", " \"cached_tokens_details\": {\n", @@ -3150,66 +2044,45 @@ " }\n", " },\n", " \"output_token_details\": {\n", - " \"text_tokens\": 26,\n", + " \"text_tokens\": 12,\n", " \"audio_tokens\": 0\n", " }\n", "}\n", - "[Realtime out-of-band transcription cost estimate] text_in=$0.001192, text_in_cached=$0.000717, audio_in=$0.005984, audio_in_cached=$0.000000, text_out=$0.000416, audio_out=$0.000000, total=$0.008309\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.000068, text_in_cached=$0.000717, audio_in=$0.000000, audio_in_cached=$0.000000, text_out=$0.000192, audio_out=$0.000000, total=$0.000977\n", "\n", "=== User turn (Transcription model) ===\n", - "For sure I can provide that but can you say my name again? 
I think you got it wrong.\n", + "Sure, my policy number is AG007-PX75.\n", "[Transcription model usage]\n", "{\n", " \"type\": \"tokens\",\n", - " \"total_tokens\": 79,\n", - " \"input_tokens\": 56,\n", + " \"total_tokens\": 102,\n", + " \"input_tokens\": 88,\n", " \"input_token_details\": {\n", " \"text_tokens\": 0,\n", - " \"audio_tokens\": 56\n", + " \"audio_tokens\": 88\n", " },\n", - " \"output_tokens\": 23\n", + " \"output_tokens\": 14\n", "}\n", - "[Transcription model cost estimate] audio_in=$0.000336, text_in=$0.000000, text_out=$0.000230, total=$0.000566\n", + "[Transcription model cost estimate] audio_in=$0.000528, text_in=$0.000000, text_out=$0.000140, total=$0.000668\n", "\n", - "[Realtime usage]\n", - "{\n", - " \"total_tokens\": 1852,\n", - " \"input_tokens\": 1686,\n", - " \"output_tokens\": 166,\n", - " \"input_token_details\": {\n", - " \"text_tokens\": 1302,\n", - " \"audio_tokens\": 384,\n", - " \"image_tokens\": 0,\n", - " \"cached_tokens\": 832,\n", - " \"cached_tokens_details\": {\n", - " \"text_tokens\": 832,\n", - " \"audio_tokens\": 0,\n", - " \"image_tokens\": 0\n", - " }\n", - " },\n", - " \"output_token_details\": {\n", - " \"text_tokens\": 39,\n", - " \"audio_tokens\": 127\n", - " }\n", - "}\n", "\n", "=== Assistant response ===\n", - "Of course, I want to make sure I have it exactly right. Could you please spell your full legal name for me?\n", + "Thank you. Just to confirm, I heard your policy number as E G 0 0 7 - P X 7 5. 
Is that correct?\n", "\n", "\n", "[client] Speech detected; streaming...\n", "[client] Detected silence; preparing transcript...\n", "\n", "=== User turn (Realtime transcript) ===\n", - "For sure, it's Minhajul Hoque.\n", + "No, I said X007-PX75.\n", "[Realtime out-of-band transcription usage]\n", "{\n", - " \"total_tokens\": 2363,\n", - " \"input_tokens\": 2351,\n", + " \"total_tokens\": 1821,\n", + " \"input_tokens\": 1809,\n", " \"output_tokens\": 12,\n", " \"input_token_details\": {\n", - " \"text_tokens\": 2127,\n", - " \"audio_tokens\": 224,\n", + " \"text_tokens\": 1809,\n", + " \"audio_tokens\": 0,\n", " \"image_tokens\": 0,\n", " \"cached_tokens\": 1792,\n", " \"cached_tokens_details\": {\n", @@ -3223,62 +2096,41 @@ " \"audio_tokens\": 0\n", " }\n", "}\n", - "[Realtime out-of-band transcription cost estimate] text_in=$0.001340, text_in_cached=$0.000717, audio_in=$0.007168, audio_in_cached=$0.000000, text_out=$0.000192, audio_out=$0.000000, total=$0.009417\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.000068, text_in_cached=$0.000717, audio_in=$0.000000, audio_in_cached=$0.000000, text_out=$0.000192, audio_out=$0.000000, total=$0.000977\n", "\n", "=== User turn (Transcription model) ===\n", - "Өршөөөр, энэ минь хажуу хог.\n", + "No, I said X007-PX75.\n", "[Transcription model usage]\n", "{\n", " \"type\": \"tokens\",\n", - " \"total_tokens\": 42,\n", - " \"input_tokens\": 26,\n", + " \"total_tokens\": 65,\n", + " \"input_tokens\": 53,\n", " \"input_token_details\": {\n", " \"text_tokens\": 0,\n", - " \"audio_tokens\": 26\n", + " \"audio_tokens\": 53\n", " },\n", - " \"output_tokens\": 16\n", + " \"output_tokens\": 12\n", "}\n", - "[Transcription model cost estimate] audio_in=$0.000156, text_in=$0.000000, text_out=$0.000160, total=$0.000316\n", + "[Transcription model cost estimate] audio_in=$0.000318, text_in=$0.000000, text_out=$0.000120, total=$0.000438\n", "\n", - "[Realtime usage]\n", - "{\n", - " \"total_tokens\": 1920,\n", - " 
\"input_tokens\": 1760,\n", - " \"output_tokens\": 160,\n", - " \"input_token_details\": {\n", - " \"text_tokens\": 1339,\n", - " \"audio_tokens\": 421,\n", - " \"image_tokens\": 0,\n", - " \"cached_tokens\": 832,\n", - " \"cached_tokens_details\": {\n", - " \"text_tokens\": 832,\n", - " \"audio_tokens\": 0,\n", - " \"image_tokens\": 0\n", - " }\n", - " },\n", - " \"output_token_details\": {\n", - " \"text_tokens\": 43,\n", - " \"audio_tokens\": 117\n", - " }\n", - "}\n", "\n", "=== Assistant response ===\n", - "Thank you for clarifying. Could you spell it out for me letter by letter, so I can make sure it’s 100% correct?\n", + "Thank you for clarifying. I’ve got it now. Your policy number is E G 0 0 7 - P X 7 5. Let’s move on. Could you tell me the type of accident—is it auto, home, or something else?\n", "\n", "\n", "[client] Speech detected; streaming...\n", "[client] Detected silence; preparing transcript...\n", "\n", "=== User turn (Realtime transcript) ===\n", - "Yes, M I N H A J U L H O Q U E\n", + "It's an auto, but I think you got my name wrong, can you ask me again?\n", "[Realtime out-of-band transcription usage]\n", "{\n", - " \"total_tokens\": 2434,\n", - " \"input_tokens\": 2417,\n", - " \"output_tokens\": 17,\n", + " \"total_tokens\": 1830,\n", + " \"input_tokens\": 1809,\n", + " \"output_tokens\": 21,\n", " \"input_token_details\": {\n", - " \"text_tokens\": 2168,\n", - " \"audio_tokens\": 249,\n", + " \"text_tokens\": 1809,\n", + " \"audio_tokens\": 0,\n", " \"image_tokens\": 0,\n", " \"cached_tokens\": 1792,\n", " \"cached_tokens_details\": {\n", @@ -3288,66 +2140,45 @@ " }\n", " },\n", " \"output_token_details\": {\n", - " \"text_tokens\": 17,\n", + " \"text_tokens\": 21,\n", " \"audio_tokens\": 0\n", " }\n", "}\n", - "[Realtime out-of-band transcription cost estimate] text_in=$0.001504, text_in_cached=$0.000717, audio_in=$0.007968, audio_in_cached=$0.000000, text_out=$0.000272, audio_out=$0.000000, total=$0.010461\n", + "[Realtime out-of-band 
transcription cost estimate] text_in=$0.000068, text_in_cached=$0.000717, audio_in=$0.000000, audio_in_cached=$0.000000, text_out=$0.000336, audio_out=$0.000000, total=$0.001121\n", "\n", "=== User turn (Transcription model) ===\n", - "Yes, M-N-S-H-A-J-U-L-H-O-Q-U-E.\n", + "It's an auto, but I think you got my name wrong. Can you ask me again?\n", "[Transcription model usage]\n", "{\n", " \"type\": \"tokens\",\n", - " \"total_tokens\": 53,\n", - " \"input_tokens\": 35,\n", + " \"total_tokens\": 67,\n", + " \"input_tokens\": 46,\n", " \"input_token_details\": {\n", " \"text_tokens\": 0,\n", - " \"audio_tokens\": 35\n", + " \"audio_tokens\": 46\n", " },\n", - " \"output_tokens\": 18\n", + " \"output_tokens\": 21\n", "}\n", - "[Transcription model cost estimate] audio_in=$0.000210, text_in=$0.000000, text_out=$0.000180, total=$0.000390\n", + "[Transcription model cost estimate] audio_in=$0.000276, text_in=$0.000000, text_out=$0.000210, total=$0.000486\n", "\n", - "[Realtime usage]\n", - "{\n", - " \"total_tokens\": 2027,\n", - " \"input_tokens\": 1826,\n", - " \"output_tokens\": 201,\n", - " \"input_token_details\": {\n", - " \"text_tokens\": 1380,\n", - " \"audio_tokens\": 446,\n", - " \"image_tokens\": 0,\n", - " \"cached_tokens\": 832,\n", - " \"cached_tokens_details\": {\n", - " \"text_tokens\": 832,\n", - " \"audio_tokens\": 0,\n", - " \"image_tokens\": 0\n", - " }\n", - " },\n", - " \"output_token_details\": {\n", - " \"text_tokens\": 44,\n", - " \"audio_tokens\": 157\n", - " }\n", - "}\n", "\n", "=== Assistant response ===\n", - "Thank you for that clarification. I have it as M-I-N-H-A-J-U-L H-O-Q-U-E. Is that correct?\n", + "Of course, I’m happy to correct that. Let’s go back. 
Could you please spell your full legal name for me, so I can make sure I’ve got it exactly right?\n", "\n", "\n", "[client] Speech detected; streaming...\n", "[client] Detected silence; preparing transcript...\n", "\n", "=== User turn (Realtime transcript) ===\n", - "Yep, that's correct, thank you.\n", + "Yeah, my full legal name is Minhajul Haque.\n", "[Realtime out-of-band transcription usage]\n", "{\n", - " \"total_tokens\": 2536,\n", - " \"input_tokens\": 2526,\n", - " \"output_tokens\": 10,\n", + " \"total_tokens\": 1824,\n", + " \"input_tokens\": 1809,\n", + " \"output_tokens\": 15,\n", " \"input_token_details\": {\n", - " \"text_tokens\": 2210,\n", - " \"audio_tokens\": 316,\n", + " \"text_tokens\": 1809,\n", + " \"audio_tokens\": 0,\n", " \"image_tokens\": 0,\n", " \"cached_tokens\": 1792,\n", " \"cached_tokens_details\": {\n", @@ -3357,58 +2188,63 @@ " }\n", " },\n", " \"output_token_details\": {\n", - " \"text_tokens\": 10,\n", + " \"text_tokens\": 15,\n", " \"audio_tokens\": 0\n", " }\n", "}\n", - "[Realtime out-of-band transcription cost estimate] text_in=$0.001672, text_in_cached=$0.000717, audio_in=$0.010112, audio_in_cached=$0.000000, text_out=$0.000160, audio_out=$0.000000, total=$0.012661\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.000068, text_in_cached=$0.000717, audio_in=$0.000000, audio_in_cached=$0.000000, text_out=$0.000240, audio_out=$0.000000, total=$0.001025\n", "\n", "=== User turn (Transcription model) ===\n", - "Yep, that's correct, thank you.\n", + "Yeah, my full legal name is Minhajul Haque.\n", "[Transcription model usage]\n", "{\n", " \"type\": \"tokens\",\n", - " \"total_tokens\": 37,\n", - " \"input_tokens\": 27,\n", + " \"total_tokens\": 60,\n", + " \"input_tokens\": 45,\n", " \"input_token_details\": {\n", " \"text_tokens\": 0,\n", - " \"audio_tokens\": 27\n", + " \"audio_tokens\": 45\n", " },\n", - " \"output_tokens\": 10\n", + " \"output_tokens\": 15\n", "}\n", - "[Transcription model cost estimate] 
audio_in=$0.000162, text_in=$0.000000, text_out=$0.000100, total=$0.000262\n", + "[Transcription model cost estimate] audio_in=$0.000270, text_in=$0.000000, text_out=$0.000150, total=$0.000420\n", "\n", - "[Realtime usage]\n", - "{\n", - " \"total_tokens\": 2087,\n", - " \"input_tokens\": 1935,\n", - " \"output_tokens\": 152,\n", - " \"input_token_details\": {\n", - " \"text_tokens\": 1422,\n", - " \"audio_tokens\": 513,\n", - " \"image_tokens\": 0,\n", - " \"cached_tokens\": 1600,\n", - " \"cached_tokens_details\": {\n", - " \"text_tokens\": 1344,\n", - " \"audio_tokens\": 256,\n", - " \"image_tokens\": 0\n", - " }\n", - " },\n", - " \"output_token_details\": {\n", - " \"text_tokens\": 35,\n", - " \"audio_tokens\": 117\n", - " }\n", - "}\n", "\n", "=== Assistant response ===\n", - "Great, thank you for confirming. Now, could you please provide your preferred phone number for follow-up?\n", + "Thank you for that. Just to confirm, your full legal name is Minhajul Hoque. Is that correct?\n", "\n", "Session cancelled; closing.\n" ] } ], "source": [ - "await run_realtime_session(debug_usage_and_cost=True)" + "await run_realtime_session(debug_usage_and_cost=True, only_last_user_turn=True)" + ] + }, + { + "cell_type": "markdown", + "id": "820420e5", + "metadata": {}, + "source": [ + "#### Cost Analysis Summary\n", + "\n", + "Realtime Out-of-Band Transcription (OOB)\n", + "\n", + "* **Total Cost:** $0.013354\n", + "* **Average per Turn:** ~$0.001908\n", + "\n", + "Dedicated Transcription Model\n", + "\n", + "* **Total Cost:** $0.002630\n", + "* **Average per Turn:** ~$0.000376\n", + "\n", + "\n", + "Difference in Costs\n", + "\n", + "* **Additional cost using OOB:** **+$0.010724**\n", + "* **Cost Multiplier:** OOB is about **5.08×** more expensive than the dedicated transcription model.\n", + "\n", + "Significantly less than using the full session context. 
You should look at the characteristic of your use case and determine if regular transcription, out-of-band transcription with full context or latest recent turn is better suited for your use case. YOu can also do ternmediate where you past the last N turns\n" ] }, { @@ -3424,6 +2260,10 @@ "* You need a more reliable and steerable method for generating transcriptions.\n", "* The current transcripts fail to normalize entities correctly, causing downstream issues.\n", "\n", + "Keep in mind the trade-offs:\n", + "- Cost: Out-of-band (OOB) transcription is more expensive. Be sure that the extra expense makes sense for your typical session lengths and business needs.\n", + "- Complexity: Implementing OOB transcription takes extra engineering effort to connect all the pieces correctly. Only choose this approach if its benefits are important for your use case.\n", + "\n", "If you decide to pursue this method, make sure you:\n", "\n", "* Set up the transcription trigger correctly, ensuring it activates after the audio commit.\n", From 54f62b1c06298812da1720ab01a49533eacadb65 Mon Sep 17 00:00:00 2001 From: minh-hoque Date: Tue, 25 Nov 2025 12:26:53 -0500 Subject: [PATCH 4/5] Fixed cost assumptions --- .../Realtime_out_of_band_transcription.ipynb | 28 ++++++++----------- 1 file changed, 12 insertions(+), 16 deletions(-) diff --git a/examples/Realtime_out_of_band_transcription.ipynb b/examples/Realtime_out_of_band_transcription.ipynb index ab7356a1e1..a00e37dfcb 100644 --- a/examples/Realtime_out_of_band_transcription.ipynb +++ b/examples/Realtime_out_of_band_transcription.ipynb @@ -64,7 +64,7 @@ "\n", "- Realtime Model (for transcription):\n", " - Audio Input → Text Output: $32.00 per 1M audio tokens + $16.00 per 1M text tokens out.\n", - " - Cached Session Context: $0.40 per 1M cached context tokens (typically negligible).\n", + " - Cached Session Context: $0.40 per 1M cached context tokens.\n", "\n", " - Total Cost (for 1M audio tokens in + 1M text tokens out): ≈ $48.00\n", 
"\n", @@ -72,23 +72,19 @@ " - Audio Input: $6.00 per 1M audio tokens\n", "\n", - " - Text Input: $2.50 per 1M tokens (capped at 1024 tokens, negligible input prompt)\n", + " - Text Input: $2.50 per 1M tokens.\n", "\n", " - Text Output: $10.00 per 1M tokens\n", "\n", " - Total Cost (for 1M audio tokens in + 1M text tokens out): ≈ $16.00\n", "\n", - "- Direct Cost Comparison:\n", + "- Direct Cost Comparison (see examples at the end of the cookbook):\n", "\n", - " - Realtime Transcription: ~$48.00\n", "\n", - " - GPT-4o Transcription: ~$16.00\n", "\n", - " - Absolute Difference: $48.00 − $16.00 = $32.00\n", "\n", - " - Cost Ratio: $48.00 / $16.00 = 3×\n", "\n", - " Note: Costs related to cached session context ($0.40 per 1M tokens) and the capped text input tokens for GPT-4o ($2.50 per 1M tokens) are negligible and thus excluded from detailed calculations.\n", + " - Using full session context: 16-22x\n", + " - The cost is higher since you are always passing the growing session context. However, this can potentially help with transcription.\n", + " - Using only latest user turn: 3-5x\n", + " - The cost is lower since you are only transcribing the latest user audio turn. 
However, you no longer have access to the session context for transcription quality.\n", + " \n", "\n", "- Other Considerations:\n", "\n", @@ -156,7 +152,7 @@ }, { "cell_type": "code", - "execution_count": 125, + "execution_count": null, "id": "ac3afaab", "metadata": {}, "outputs": [], @@ -405,7 +401,7 @@ "- You **are** a **verbatim transcription tool** for the latest user turn.\n", "\n", "Your output must be the **precise, literal, and complete transcript of the most recent user utterance**—with no additional content, no corrections, and no commentary.\n", - "\"\"\"\n" + "\"\"\"" ] }, { @@ -2242,9 +2238,9 @@ "Difference in Costs\n", "\n", "* **Additional cost using OOB:** **+$0.010724**\n", - "* **Cost Multiplier:** OOB is about **5.08×** more expensive than the dedicated transcription model.\n", + "* **Cost Multiplier:** OOB is about **5×** more expensive than the dedicated transcription model.\n", "\n", - "Significantly less than using the full session context. You should look at the characteristic of your use case and determine if regular transcription, out-of-band transcription with full context or latest recent turn is better suited for your use case. YOu can also do ternmediate where you past the last N turns\n" + "This approach costs significantly less than using the full session context. You should evaluate your use case to decide whether regular transcription, out-of-band transcription with full context, or transcribing only the latest turn best fits your needs. You can also choose an intermediate strategy, such as including just the last N turns in the input.\n" ] }, { From bff3b85abe7e2942d78751a8ef7bab80f6880392 Mon Sep 17 00:00:00 2001 From: minh-hoque Date: Tue, 25 Nov 2025 14:41:07 -0500 Subject: [PATCH 5/5] Enhance Realtime out-of-band transcription example by adding detailed cost breakdowns for transcription based on usage, updating execution counts, and clarifying selective transcription request logic. 
Improve overall clarity and structure of the notebook. --- .../Realtime_out_of_band_transcription.ipynb | 254 +++++++++--------- 1 file changed, 130 insertions(+), 124 deletions(-) diff --git a/examples/Realtime_out_of_band_transcription.ipynb b/examples/Realtime_out_of_band_transcription.ipynb index a00e37dfcb..0d573e3a29 100644 --- a/examples/Realtime_out_of_band_transcription.ipynb +++ b/examples/Realtime_out_of_band_transcription.ipynb @@ -80,9 +80,9 @@ "\n", "- Direct Cost Comparison (see examples at the end of the cookbook):\n", "\n", - " - Using full session context: 16-22x\n", + " - Using full session context: 16-22x (if transcription costs $0.001/session, realtime transcription will be about $0.016/session)\n", " - The cost is higher since you are always passing the growing session context. However, this can potentially help with transcription.\n", - " - Using only latest user turn: 3-5x\n", + " - Using only latest user turn: 3-5x (if transcription costs $0.001/session, realtime transcription will be about $0.003/session)\n", " - The cost is lower since you are only transcribing the latest user audio turn. 
However, you no longer have access to the session context for transcription quality.\n", " \n", "\n", @@ -808,14 +808,136 @@ "`listen_for_events` drives the session:\n", "\n", "- Watches for `speech_started` / `speech_stopped` / `committed`\n", - "- Sends the out‑of‑band transcription request when a user turn finishes (`input_audio_buffer.committed`)\n", + "- Sends the out‑of‑band transcription request when a user turn finishes (`input_audio_buffer.committed`) if `only_last_user_turn == False`\n", + "- Sends the out‑of‑band transcription request when a user turn is added to the conversation (`conversation.item.added`) if `only_last_user_turn == True`\n", "- Streams assistant audio to the playback queue\n", "- Buffers text deltas per `response_id`" ] }, { "cell_type": "code", "execution_count": 148, "id": "32dc2aac", "metadata": {}, "outputs": [], "source": [ "# Pricing constants (USD per 1M tokens). See https://platform.openai.com/pricing.\n", "# gpt-4o-transcribe\n", "GPT4O_TRANSCRIBE_AUDIO_INPUT_PRICE_PER_1M = 6.00\n", "GPT4O_TRANSCRIBE_TEXT_INPUT_PRICE_PER_1M = 2.50\n", "GPT4O_TRANSCRIBE_TEXT_OUTPUT_PRICE_PER_1M = 10.00\n", "\n", "# gpt-realtime\n", "REALTIME_TEXT_INPUT_PRICE_PER_1M = 4\n", "REALTIME_TEXT_CACHED_INPUT_PRICE_PER_1M = 0.4\n", "REALTIME_TEXT_OUTPUT_PRICE_PER_1M = 16.00\n", "REALTIME_AUDIO_INPUT_PRICE_PER_1M = 32.00\n", "REALTIME_AUDIO_CACHED_INPUT_PRICE_PER_1M = 0.40\n", "REALTIME_AUDIO_OUTPUT_PRICE_PER_1M = 64.00\n", "\n", "def _compute_transcription_model_cost(usage: dict | None) -> dict | None:\n", " if not usage:\n", " return None\n", "\n", " input_details = usage.get(\"input_token_details\") or {}\n", " audio_input_tokens = input_details.get(\"audio_tokens\") or 0\n", " text_input_tokens = input_details.get(\"text_tokens\") or 0\n", " output_tokens = usage.get(\"output_tokens\") or 0\n", "\n", " audio_input_cost = (\n", " audio_input_tokens * 
GPT4O_TRANSCRIBE_AUDIO_INPUT_PRICE_PER_1M\n", + " / 1_000_000\n", + " )\n", + " text_input_cost = (\n", + " text_input_tokens * GPT4O_TRANSCRIBE_TEXT_INPUT_PRICE_PER_1M\n", + " / 1_000_000\n", + " )\n", + " text_output_cost = (\n", + " output_tokens * GPT4O_TRANSCRIBE_TEXT_OUTPUT_PRICE_PER_1M\n", + " / 1_000_000\n", + " )\n", + " total_cost = audio_input_cost + text_input_cost + text_output_cost\n", + "\n", + " return {\n", + " \"audio_input_cost\": audio_input_cost,\n", + " \"text_input_cost\": text_input_cost,\n", + " \"text_output_cost\": text_output_cost,\n", + " \"total_cost\": total_cost,\n", + " \"usage\": usage,\n", + " }\n", + "\n", + "def _compute_realtime_oob_cost(usage: dict | None) -> dict | None:\n", + " if not usage:\n", + " return None\n", + "\n", + " input_details = usage.get(\"input_token_details\") or {}\n", + " output_details = usage.get(\"output_token_details\") or {}\n", + " cached_details = input_details.get(\"cached_tokens_details\") or {}\n", + "\n", + " text_input_tokens = input_details.get(\"text_tokens\") or 0\n", + " cached_text_tokens = (\n", + " cached_details.get(\"text_tokens\")\n", + " or input_details.get(\"cached_tokens\")\n", + " or 0\n", + " )\n", + " non_cached_text_input_tokens = max(text_input_tokens - cached_text_tokens, 0)\n", + "\n", + " audio_input_tokens = input_details.get(\"audio_tokens\") or 0\n", + " cached_audio_tokens = cached_details.get(\"audio_tokens\") or 0\n", + " non_cached_audio_input_tokens = max(audio_input_tokens - cached_audio_tokens, 0)\n", + "\n", + " text_output_tokens = output_details.get(\"text_tokens\") or 0\n", + " audio_output_tokens = output_details.get(\"audio_tokens\") or 0\n", + "\n", + " text_input_cost = (\n", + " non_cached_text_input_tokens * REALTIME_TEXT_INPUT_PRICE_PER_1M\n", + " / 1_000_000\n", + " )\n", + " cached_text_input_cost = (\n", + " cached_text_tokens * REALTIME_TEXT_CACHED_INPUT_PRICE_PER_1M\n", + " / 1_000_000\n", + " )\n", + " audio_input_cost = (\n", + " 
non_cached_audio_input_tokens * REALTIME_AUDIO_INPUT_PRICE_PER_1M\n", + " / 1_000_000\n", + " )\n", + " cached_audio_input_cost = (\n", + " cached_audio_tokens * REALTIME_AUDIO_CACHED_INPUT_PRICE_PER_1M\n", + " / 1_000_000\n", + " )\n", + " text_output_cost = (\n", + " text_output_tokens * REALTIME_TEXT_OUTPUT_PRICE_PER_1M\n", + " / 1_000_000\n", + " )\n", + " audio_output_cost = (\n", + " audio_output_tokens * REALTIME_AUDIO_OUTPUT_PRICE_PER_1M\n", + " / 1_000_000\n", + " )\n", + "\n", + " total_cost = (\n", + " text_input_cost\n", + " + cached_text_input_cost\n", + " + audio_input_cost\n", + " + cached_audio_input_cost\n", + " + text_output_cost\n", + " + audio_output_cost\n", + " )\n", + "\n", + " return {\n", + " \"text_input_cost\": text_input_cost,\n", + " \"cached_text_input_cost\": cached_text_input_cost,\n", + " \"audio_input_cost\": audio_input_cost,\n", + " \"cached_audio_input_cost\": cached_audio_input_cost,\n", + " \"text_output_cost\": text_output_cost,\n", + " \"audio_output_cost\": audio_output_cost,\n", + " \"total_cost\": total_cost,\n", + " \"usage\": usage,\n", + " }" + ] + }, + { + "cell_type": "code", + "execution_count": 149, "id": "d099babd", "metadata": {}, "outputs": [], @@ -846,119 +968,6 @@ " only_last_user_turn: bool = bool(shared_state.get(\"only_last_user_turn\", False))\n", " last_user_audio_item_id: str | None = None\n", "\n", - " # Pricing constants (USD per 1M tokens). 
See https://platform.openai.com/pricing.\n", - " # gpt-4o-transcribe\n", - " GPT4O_TRANSCRIBE_AUDIO_INPUT_PRICE_PER_1M = 6.00\n", - " GPT4O_TRANSCRIBE_TEXT_INPUT_PRICE_PER_1M = 2.50\n", - " GPT4O_TRANSCRIBE_TEXT_OUTPUT_PRICE_PER_1M = 10.00\n", - "\n", - " # gpt-realtime\n", - " REALTIME_TEXT_INPUT_PRICE_PER_1M = 4\n", - " REALTIME_TEXT_CACHED_INPUT_PRICE_PER_1M = 0.4\n", - " REALTIME_TEXT_OUTPUT_PRICE_PER_1M = 16.00\n", - " REALTIME_AUDIO_INPUT_PRICE_PER_1M = 32.00\n", - " REALTIME_AUDIO_CACHED_INPUT_PRICE_PER_1M = 0.40\n", - " REALTIME_AUDIO_OUTPUT_PRICE_PER_1M = 64.00\n", - "\n", - " def _compute_transcription_model_cost(usage: dict | None) -> dict | None:\n", - " if not usage:\n", - " return None\n", - "\n", - " input_details = usage.get(\"input_token_details\") or {}\n", - " audio_input_tokens = input_details.get(\"audio_tokens\") or 0\n", - " text_input_tokens = input_details.get(\"text_tokens\") or 0\n", - " output_tokens = usage.get(\"output_tokens\") or 0\n", - "\n", - " audio_input_cost = (\n", - " audio_input_tokens * GPT4O_TRANSCRIBE_AUDIO_INPUT_PRICE_PER_1M\n", - " / 1_000_000\n", - " )\n", - " text_input_cost = (\n", - " text_input_tokens * GPT4O_TRANSCRIBE_TEXT_INPUT_PRICE_PER_1M\n", - " / 1_000_000\n", - " )\n", - " text_output_cost = (\n", - " output_tokens * GPT4O_TRANSCRIBE_TEXT_OUTPUT_PRICE_PER_1M\n", - " / 1_000_000\n", - " )\n", - " total_cost = audio_input_cost + text_input_cost + text_output_cost\n", - "\n", - " return {\n", - " \"audio_input_cost\": audio_input_cost,\n", - " \"text_input_cost\": text_input_cost,\n", - " \"text_output_cost\": text_output_cost,\n", - " \"total_cost\": total_cost,\n", - " \"usage\": usage,\n", - " }\n", - "\n", - " def _compute_realtime_oob_cost(usage: dict | None) -> dict | None:\n", - " if not usage:\n", - " return None\n", - "\n", - " input_details = usage.get(\"input_token_details\") or {}\n", - " output_details = usage.get(\"output_token_details\") or {}\n", - " cached_details = 
input_details.get(\"cached_tokens_details\") or {}\n", - "\n", - " text_input_tokens = input_details.get(\"text_tokens\") or 0\n", - " cached_text_tokens = (\n", - " cached_details.get(\"text_tokens\")\n", - " or input_details.get(\"cached_tokens\")\n", - " or 0\n", - " )\n", - " non_cached_text_input_tokens = max(text_input_tokens - cached_text_tokens, 0)\n", - "\n", - " audio_input_tokens = input_details.get(\"audio_tokens\") or 0\n", - " cached_audio_tokens = cached_details.get(\"audio_tokens\") or 0\n", - " non_cached_audio_input_tokens = max(audio_input_tokens - cached_audio_tokens, 0)\n", - "\n", - " text_output_tokens = output_details.get(\"text_tokens\") or 0\n", - " audio_output_tokens = output_details.get(\"audio_tokens\") or 0\n", - "\n", - " text_input_cost = (\n", - " non_cached_text_input_tokens * REALTIME_TEXT_INPUT_PRICE_PER_1M\n", - " / 1_000_000\n", - " )\n", - " cached_text_input_cost = (\n", - " cached_text_tokens * REALTIME_TEXT_CACHED_INPUT_PRICE_PER_1M\n", - " / 1_000_000\n", - " )\n", - " audio_input_cost = (\n", - " non_cached_audio_input_tokens * REALTIME_AUDIO_INPUT_PRICE_PER_1M\n", - " / 1_000_000\n", - " )\n", - " cached_audio_input_cost = (\n", - " cached_audio_tokens * REALTIME_AUDIO_CACHED_INPUT_PRICE_PER_1M\n", - " / 1_000_000\n", - " )\n", - " text_output_cost = (\n", - " text_output_tokens * REALTIME_TEXT_OUTPUT_PRICE_PER_1M\n", - " / 1_000_000\n", - " )\n", - " audio_output_cost = (\n", - " audio_output_tokens * REALTIME_AUDIO_OUTPUT_PRICE_PER_1M\n", - " / 1_000_000\n", - " )\n", - "\n", - " total_cost = (\n", - " text_input_cost\n", - " + cached_text_input_cost\n", - " + audio_input_cost\n", - " + cached_audio_input_cost\n", - " + text_output_cost\n", - " + audio_output_cost\n", - " )\n", - "\n", - " return {\n", - " \"text_input_cost\": text_input_cost,\n", - " \"cached_text_input_cost\": cached_text_input_cost,\n", - " \"audio_input_cost\": audio_input_cost,\n", - " \"cached_audio_input_cost\": cached_audio_input_cost,\n", - " 
\"text_output_cost\": text_output_cost,\n", - " \"audio_output_cost\": audio_output_cost,\n", - " \"total_cost\": total_cost,\n", - " \"usage\": usage,\n", - " }\n", - "\n", " async for raw in ws:\n", " if stop_event.is_set():\n", " break\n", @@ -1135,9 +1144,6 @@ " pending_transcription_prints.append(object())\n", " flush_pending_transcription_prints(shared_state)\n", " else:\n", - " # if debug_usage_and_cost and usage:\n", - " # print(\"[Realtime usage]\")\n", - " # print(json.dumps(usage, indent=2))\n", " print(\"\\n=== Assistant response ===\")\n", " print(text, flush=True)\n", " print()\n", @@ -1197,7 +1203,7 @@ }, { "cell_type": "code", - "execution_count": 142, + "execution_count": 150, "id": "35c4d7b5", "metadata": {}, "outputs": [], @@ -1871,12 +1877,12 @@ "\n", "* **Realtime Out-of-Band (OOB):** $0.040974 total (~$0.006829 per turn)\n", "* **Dedicated Transcription:** $0.002114 total (~$0.000352 per turn)\n", - "* **OOB is ~19× more expensive using FULL SESSION CONTEXT**\n", + "* **OOB is ~19× more expensive using full session context**\n", "\n", "#### Considerations\n", "\n", "* **Caching:** Because these conversations are short, you benefit little from caching beyond the initial system prompt.\n", - "* **Transcription System Prompt:** The transcription model uses a minimal system prompt, so input costs are not significantly increased.\n", + "* **Transcription System Prompt:** The transcription model uses a minimal system prompt, so input costs would typically be higher.\n", "\n", "#### Recommended Cost-Saving Strategy\n", "\n", @@ -1885,7 +1891,7 @@ "#### Understanding Cache Behavior\n", "\n", "* Effective caching requires stable prompt instructions (usually 1,024+ tokens).\n", - "* Different instruction prompts between OOB and main assistant sessions result in separate caches.\n" + "* Different instruction prompts between OOB and main assistant sessions result in separate caches." ] }, {