From fbfcb6b235ebac804b2257a855f5d90fdb523400 Mon Sep 17 00:00:00 2001 From: Timothy Luong Date: Thu, 25 Sep 2025 10:21:22 -0700 Subject: [PATCH 01/22] Adding agent websocket for web and mobile --- .../integrations/web-and-mobile-clients.mdx | 217 ++++++++++++++++++ fern/versions/2025-04-16.yml | 6 + 2 files changed, 223 insertions(+) create mode 100644 fern/agents/integrations/web-and-mobile-clients.mdx diff --git a/fern/agents/integrations/web-and-mobile-clients.mdx b/fern/agents/integrations/web-and-mobile-clients.mdx new file mode 100644 index 0000000..7e460d8 --- /dev/null +++ b/fern/agents/integrations/web-and-mobile-clients.mdx @@ -0,0 +1,217 @@ +# Web & Mobile Clients + +Some Cartesia customers might be interested in integrating their Line agents with their web or mobile application, rather than a telephony provider. For these folks, the Agents WebSocket provides real-time, bidirectional communication with voice agents through a streaming interface. This allows you to send audio data and receive agent responses in real-time for any client. + +graph LR + A1[Web Client] <-->|General Websocket Events| B[Cartesia API] + A2[Mobile Client] <-->|General Websocket Events| B + B <-->|Agent Events| C[Inferno Agents] + +## Connection + +Connect to the WebSocket endpoint: + +``` +ws://api.cartesia.ai/agents/stream/{agent_id} +``` + +**Headers:** +- `Authorization: Bearer {your_api_key}` +- `Cartesia-Version: 2025-04-16` + +## Protocol Overview + +The WebSocket protocol uses JSON messages for control and base64-encoded audio for media. The connection follows this flow: + +1. Client sends `start` event to initialize the stream +2. Server responds with `ack` event confirming configuration +3. Bidirectional exchange of events +4. Connection closes using WebSocket close frames + +Note that if you don't specify `streamSid` in the initial `start` message, one will be assigned and returned in the `ack` response. 
+ +## Input Events (Client → Server) + +### Start Event + +Initializes the audio stream configuration. **This must be the first message sent.** + +```json +{ + "event": "start", + // Optionally provide your own stream sid + "streamSid": "example_id", + "config": { + "inputFormat": "pcm_44100", + "outputFormat": "pcm_44100" + } +} +``` + +**Fields:** +- `streamSid` (optional): Stream identifier. If not provided, server generates one +- `config.inputFormat`: Audio format for client audio input (`mulaw_8000`, `pcm_16000`, `pcm_24000`, `pcm_44100`) +- `config.outputFormat`: Audio format for server audio output (`mulaw_8000`, `pcm_16000`, `pcm_24000`, `pcm_44100`) + +### Media Event + +Sends audio data to the agent. + +```json +{ + "event": "media", + "streamSid": "example_id", + "media": { + "payload": "base64_encoded_audio_data" + } +} +``` + +**Fields:** +- `streamSid`: Stream identifier from the ack response +- `media.payload`: Base64-encoded audio data in the format specified in the start event + +### DTMF Event + +Sends DTMF (dual-tone multi-frequency) tones. + +```json +{ + "event": "dtmf", + "streamSid": "example_id", + "dtmf": { + "digit": "1" + } +} +``` + +**Fields:** +- `streamSid`: Stream identifier +- `dtmf.digit`: DTMF digit (0-9, *, #) + +### Metadata Event + +Sends custom metadata to the agent. + +```json +{ + "event": "metadata", + "streamSid": "example_id", + "metadata": { + "user_id": "user123", + "session_info": "custom_data" + } +} +``` + +**Fields:** +- `streamSid`: Stream identifier +- `metadata`: Object containing key-value pairs of custom data + +## Output Events (Server → Client) + +### Ack Event + +Server acknowledgment of the start event, confirming stream configuration. + +```json +{ + "event": "ack", + "streamSid": "example_id", + "config": { + "inputFormat": "pcm_44100", + "outputFormat": "pcm_44100" + } +} +``` + +### Media Event + +Server sends agent audio response. 
+ +```json +{ + "event": "media", + "streamSid": "example_id", + "media": { + "payload": "base64_encoded_audio_data" + } +} +``` + +### Clear Event + +Indicates the agent wants to clear/interrupt the current audio stream. + +```json +{ + "event": "clear", + "streamSid": "example_id" +} +``` + +### DTMF Event + +Server sends DTMF tones from the agent. + +```json +{ + "event": "dtmf", + "streamSid": "example_id", + "dtmf": { + "digit": "5" + } +} +``` + +### Metadata Event + +Server sends metadata from the agent. + +```json +{ + "event": "metadata", + "streamSid": "example_id", + "metadata": { + "agent_state": "processing", + "confidence": 0.95, + "custom_data": "value" + } +} +``` + +## Connection Management + +### Ping/Pong Keepalive + +The WebSocket supports standard ping/pong frames for periodic connection healthchecks: + +```python +# Client sends ping +pong_waiter = await websocket.ping() +latency = await pong_waiter +``` + +The server automatically responds to ping frames with pong frames. + +### Connection Close + +The connection can be closed by either the client or server using WebSocket close frames. + +**Client-initiated close:** +```python +await websocket.close(code=1000, reason="session completed") +``` + +**Server-initiated close:** +When the agent ends the call, the server closes the connection with: +- **Code:** 1000 (Normal Closure) +- **Reason:** `"call ended by agent"` or `"call ended by agent, reason: {specific_reason}"` if additional context is available + +## Best Practices + +1. **Always send start event first** - The connection will be closed if any other event is sent before start +2. **Use appropriate audio formats** - Match your input format to your audio source capabilities. For telephony providers this is often MULAW 8k, while for web clients this will often be 44.1k +3. **Handle connection close gracefully** - Monitor close events and reasons for debugging +4. 
**Implement ping/pong for long connections** - Use WebSocket ping frames to periodically confirm connection health +5. **Monitor streamSid consistency** - Maintain your own streamSid's for the best observability diff --git a/fern/versions/2025-04-16.yml b/fern/versions/2025-04-16.yml index b8a4f34..2cd32a2 100644 --- a/fern/versions/2025-04-16.yml +++ b/fern/versions/2025-04-16.yml @@ -200,6 +200,12 @@ navigation: path: ../agents/integrations/telephony/agents-telephony-overview.mdx - page: Outbound Dialing path: ../agents/integrations/telephony/agents-telephony-outbound.mdx + - section: Web & Mobile Clients + icon: fa-browser + contents: + - page: Web & Mobile Clients + path: ../agents/integrations/web-and-mobile-clients.mdx + icon: fa-browser - section: Infrastructure contents: - page: Deployments From 0d54454f39e38631f609df0aa48f8922dd87a3fc Mon Sep 17 00:00:00 2001 From: Timothy Luong Date: Thu, 25 Sep 2025 10:58:31 -0700 Subject: [PATCH 02/22] Fixing fern build --- fern/agents/integrations/web-and-mobile-clients.mdx | 3 ++- fern/versions/2025-04-16.yml | 9 +++------ 2 files changed, 5 insertions(+), 7 deletions(-) diff --git a/fern/agents/integrations/web-and-mobile-clients.mdx b/fern/agents/integrations/web-and-mobile-clients.mdx index 7e460d8..69067b8 100644 --- a/fern/agents/integrations/web-and-mobile-clients.mdx +++ b/fern/agents/integrations/web-and-mobile-clients.mdx @@ -2,10 +2,12 @@ Some Cartesia customers might be interested in integrating their Line agents with their web or mobile application, rather than a telephony provider. For these folks, the Agents WebSocket provides real-time, bidirectional communication with voice agents through a streaming interface. This allows you to send audio data and receive agent responses in real-time for any client. 
+```mermaid graph LR A1[Web Client] <-->|General Websocket Events| B[Cartesia API] A2[Mobile Client] <-->|General Websocket Events| B B <-->|Agent Events| C[Inferno Agents] +``` ## Connection @@ -39,7 +41,6 @@ Initializes the audio stream configuration. **This must be the first message sen ```json { "event": "start", - // Optionally provide your own stream sid "streamSid": "example_id", "config": { "inputFormat": "pcm_44100", diff --git a/fern/versions/2025-04-16.yml b/fern/versions/2025-04-16.yml index 2cd32a2..d5da93f 100644 --- a/fern/versions/2025-04-16.yml +++ b/fern/versions/2025-04-16.yml @@ -200,12 +200,9 @@ navigation: path: ../agents/integrations/telephony/agents-telephony-overview.mdx - page: Outbound Dialing path: ../agents/integrations/telephony/agents-telephony-outbound.mdx - - section: Web & Mobile Clients - icon: fa-browser - contents: - - page: Web & Mobile Clients - path: ../agents/integrations/web-and-mobile-clients.mdx - icon: fa-browser + - page: Web & Mobile Clients + path: ../agents/integrations/web-and-mobile-clients.mdx + icon: fa-solid fa-browser - section: Infrastructure contents: - page: Deployments From f07da06ef37e9e10bab22c98555292104c59a0f7 Mon Sep 17 00:00:00 2001 From: Timothy Luong Date: Thu, 25 Sep 2025 11:02:34 -0700 Subject: [PATCH 03/22] Whoops it's wss not ws --- fern/agents/integrations/web-and-mobile-clients.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fern/agents/integrations/web-and-mobile-clients.mdx b/fern/agents/integrations/web-and-mobile-clients.mdx index 69067b8..306373e 100644 --- a/fern/agents/integrations/web-and-mobile-clients.mdx +++ b/fern/agents/integrations/web-and-mobile-clients.mdx @@ -14,7 +14,7 @@ graph LR Connect to the WebSocket endpoint: ``` -ws://api.cartesia.ai/agents/stream/{agent_id} +wss://api.cartesia.ai/agents/stream/{agent_id} ``` **Headers:** From 876c2d5d58f5d7a1c0ede4a05aeefb2c98675626 Mon Sep 17 00:00:00 2001 From: Timothy Luong Date: Thu, 25 Sep 2025 15:40:56 
-0700 Subject: [PATCH 04/22] Event names, params, and layout changes --- .../integrations/web-and-mobile-clients.mdx | 62 +++++++++---------- 1 file changed, 29 insertions(+), 33 deletions(-) diff --git a/fern/agents/integrations/web-and-mobile-clients.mdx b/fern/agents/integrations/web-and-mobile-clients.mdx index 306373e..312d4c9 100644 --- a/fern/agents/integrations/web-and-mobile-clients.mdx +++ b/fern/agents/integrations/web-and-mobile-clients.mdx @@ -30,7 +30,7 @@ The WebSocket protocol uses JSON messages for control and base64-encoded audio f 3. Bidirectional exchange of events 4. Connection closes using WebSocket close frames -Note that if you don't specify `streamSid` in the initial `start` message, one will be assigned and returned in the `ack` response. +Note that if you don't specify `stream_id` in the initial `start` message, one will be assigned and returned in the `ack` response. ## Input Events (Client → Server) @@ -41,18 +41,18 @@ Initializes the audio stream configuration. **This must be the first message sen ```json { "event": "start", - "streamSid": "example_id", + "stream_id": "example_id", "config": { - "inputFormat": "pcm_44100", - "outputFormat": "pcm_44100" + "input_format": "pcm_44100", + "output_format": "pcm_44100" } } ``` **Fields:** -- `streamSid` (optional): Stream identifier. If not provided, server generates one -- `config.inputFormat`: Audio format for client audio input (`mulaw_8000`, `pcm_16000`, `pcm_24000`, `pcm_44100`) -- `config.outputFormat`: Audio format for server audio output (`mulaw_8000`, `pcm_16000`, `pcm_24000`, `pcm_44100`) +- `stream_id` (optional): Stream identifier. If not provided, server generates one +- `config.input_format`: Audio format for client audio input (`mulaw_8000`, `pcm_16000`, `pcm_24000`, `pcm_44100`) +- `config.output_format`: Audio format for server audio output (`mulaw_8000`, `pcm_16000`, `pcm_24000`, `pcm_44100`) ### Media Event @@ -61,7 +61,7 @@ Sends audio data to the agent. 
```json { "event": "media", - "streamSid": "example_id", + "stream_id": "example_id", "media": { "payload": "base64_encoded_audio_data" } @@ -69,7 +69,7 @@ Sends audio data to the agent. ``` **Fields:** -- `streamSid`: Stream identifier from the ack response +- `stream_id`: Stream identifier from the ack response - `media.payload`: Base64-encoded audio data in the format specified in the start event ### DTMF Event @@ -79,25 +79,23 @@ Sends DTMF (dual-tone multi-frequency) tones. ```json { "event": "dtmf", - "streamSid": "example_id", - "dtmf": { - "digit": "1" - } + "stream_id": "example_id", + "dtmf": "1" } ``` **Fields:** -- `streamSid`: Stream identifier -- `dtmf.digit`: DTMF digit (0-9, *, #) +- `stream_id`: Stream identifier +- `dtmf`: DTMF digit (0-9, *, #) -### Metadata Event +### Custom Event Sends custom metadata to the agent. ```json { - "event": "metadata", - "streamSid": "example_id", + "event": "custom", + "stream_id": "example_id", "metadata": { "user_id": "user123", "session_info": "custom_data" @@ -106,7 +104,7 @@ Sends custom metadata to the agent. ``` **Fields:** -- `streamSid`: Stream identifier +- `stream_id`: Stream identifier - `metadata`: Object containing key-value pairs of custom data ## Output Events (Server → Client) @@ -118,10 +116,10 @@ Server acknowledgment of the start event, confirming stream configuration. ```json { "event": "ack", - "streamSid": "example_id", + "stream_id": "example_id", "config": { - "inputFormat": "pcm_44100", - "outputFormat": "pcm_44100" + "input_format": "pcm_44100", + "output_format": "pcm_44100" } } ``` @@ -133,7 +131,7 @@ Server sends agent audio response. ```json { "event": "media", - "streamSid": "example_id", + "stream_id": "example_id", "media": { "payload": "base64_encoded_audio_data" } @@ -147,7 +145,7 @@ Indicates the agent wants to clear/interrupt the current audio stream. 
```json { "event": "clear", - "streamSid": "example_id" + "stream_id": "example_id" } ``` @@ -158,21 +156,19 @@ Server sends DTMF tones from the agent. ```json { "event": "dtmf", - "streamSid": "example_id", - "dtmf": { - "digit": "5" - } + "stream_id": "example_id", + "dtmf": "5" } ``` -### Metadata Event +### Custom Event -Server sends metadata from the agent. +Server sends custom metadata from the agent. ```json { - "event": "metadata", - "streamSid": "example_id", + "event": "custom", + "stream_id": "example_id", "metadata": { "agent_state": "processing", "confidence": 0.95, @@ -215,4 +211,4 @@ When the agent ends the call, the server closes the connection with: 2. **Use appropriate audio formats** - Match your input format to your audio source capabilities. For telephony providers this is often MULAW 8k, while for web clients this will often be 44.1k 3. **Handle connection close gracefully** - Monitor close events and reasons for debugging 4. **Implement ping/pong for long connections** - Use WebSocket ping frames to periodically confirm connection health -5. **Monitor streamSid consistency** - Maintain your own streamSid's for the best observability +5. 
**Monitor stream_id consistency** - Maintain your own stream_id's for the best observability From c0f952b6a2ef05c2a4f22a6ed85d3c245facdee9 Mon Sep 17 00:00:00 2001 From: Timothy Luong Date: Thu, 25 Sep 2025 15:48:01 -0700 Subject: [PATCH 05/22] Updating to web call page, modifying description & diagram --- .../{web-and-mobile-clients.mdx => web-calls.mdx} | 12 ++++++++---- fern/versions/2025-04-16.yml | 4 ++-- 2 files changed, 10 insertions(+), 6 deletions(-) rename fern/agents/integrations/{web-and-mobile-clients.mdx => web-calls.mdx} (84%) diff --git a/fern/agents/integrations/web-and-mobile-clients.mdx b/fern/agents/integrations/web-calls.mdx similarity index 84% rename from fern/agents/integrations/web-and-mobile-clients.mdx rename to fern/agents/integrations/web-calls.mdx index 312d4c9..d8c4fb9 100644 --- a/fern/agents/integrations/web-and-mobile-clients.mdx +++ b/fern/agents/integrations/web-calls.mdx @@ -1,11 +1,10 @@ -# Web & Mobile Clients +# Web Calls -Some Cartesia customers might be interested in integrating their Line agents with their web or mobile application, rather than a telephony provider. For these folks, the Agents WebSocket provides real-time, bidirectional communication with voice agents through a streaming interface. This allows you to send audio data and receive agent responses in real-time for any client. +Some Cartesia customers might be interested in integrating their Line agents with their website, rather than a telephony provider. For these folks, the Agents WebSocket provides real-time, bidirectional communication with voice agents through a streaming interface. This allows you to send audio data and receive agent responses in real-time for any client. 
```mermaid graph LR A1[Web Client] <-->|General Websocket Events| B[Cartesia API] - A2[Mobile Client] <-->|General Websocket Events| B B <-->|Agent Events| C[Inferno Agents] ``` @@ -36,7 +35,11 @@ Note that if you don't specify `stream_id` in the initial `start` message, one w ### Start Event -Initializes the audio stream configuration. **This must be the first message sent.** +Initializes the audio stream configuration. +- The `config` parameter will optionally alter the input/output audio settings, overriding what your default agent configuration might otherwise be +- The `stream_id` can be set manually if you wish to maintain this on the client end for observability purposes. If not specified, we'll generate one and return it in the `ack` event + +**This must be the first message sent.** ```json { @@ -112,6 +115,7 @@ Sends custom metadata to the agent. ### Ack Event Server acknowledgment of the start event, confirming stream configuration. +If `stream_id` wasn't provided in the initial `start` event, this is where the user can obtain the server generated `stream_id`. 
```json { diff --git a/fern/versions/2025-04-16.yml b/fern/versions/2025-04-16.yml index d5da93f..997e433 100644 --- a/fern/versions/2025-04-16.yml +++ b/fern/versions/2025-04-16.yml @@ -200,8 +200,8 @@ navigation: path: ../agents/integrations/telephony/agents-telephony-overview.mdx - page: Outbound Dialing path: ../agents/integrations/telephony/agents-telephony-outbound.mdx - - page: Web & Mobile Clients - path: ../agents/integrations/web-and-mobile-clients.mdx + - page: Web Calls + path: ../agents/integrations/web-calls.mdx icon: fa-solid fa-browser - section: Infrastructure contents: From c1da309060414c76f659182fd2be9f0783fd3d28 Mon Sep 17 00:00:00 2001 From: Timothy Luong Date: Thu, 25 Sep 2025 16:17:21 -0700 Subject: [PATCH 06/22] Updating media event to distinguish between input and output --- fern/agents/integrations/web-calls.mdx | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/fern/agents/integrations/web-calls.mdx b/fern/agents/integrations/web-calls.mdx index d8c4fb9..6ab5aa8 100644 --- a/fern/agents/integrations/web-calls.mdx +++ b/fern/agents/integrations/web-calls.mdx @@ -57,13 +57,13 @@ Initializes the audio stream configuration. - `config.input_format`: Audio format for client audio input (`mulaw_8000`, `pcm_16000`, `pcm_24000`, `pcm_44100`) - `config.output_format`: Audio format for server audio output (`mulaw_8000`, `pcm_16000`, `pcm_24000`, `pcm_44100`) -### Media Event +### Media Input Event -Sends audio data to the agent. +Audio data sent from the client to the server. `payload` audio data should be base64 encoded. ```json { - "event": "media", + "event": "media_input", "stream_id": "example_id", "media": { "payload": "base64_encoded_audio_data" @@ -128,13 +128,13 @@ If `stream_id` wasn't provided in the initial `start` event, this is where the u } ``` -### Media Event +### Media Output Event -Server sends agent audio response. +Server sends agent audio response. `payload` is base 64 encoded audio data. 
 ```json
 {
-  "event": "media",
+  "event": "media_output",
   "stream_id": "example_id",
   "media": {
     "payload": "base64_encoded_audio_data"

From fbc69145db880caf2d7f435f77a1e74ecc9c6486 Mon Sep 17 00:00:00 2001
From: Timothy Luong
Date: Thu, 25 Sep 2025 17:08:14 -0700
Subject: [PATCH 07/22] Updating with keepalive and timeout information

---
 fern/agents/integrations/web-calls.mdx | 30 ++++++++++++++++++++++----
 1 file changed, 26 insertions(+), 4 deletions(-)

diff --git a/fern/agents/integrations/web-calls.mdx b/fern/agents/integrations/web-calls.mdx
index 6ab5aa8..451ff31 100644
--- a/fern/agents/integrations/web-calls.mdx
+++ b/fern/agents/integrations/web-calls.mdx
@@ -183,17 +183,38 @@ Server sends custom metadata from the agent.

 ## Connection Management

+### Inactivity Timeout
+
+The server automatically closes idle WebSocket connections after **30 seconds** of inactivity. Activity is defined as receiving any message from the client, including:
+
+- Application messages (media_input, dtmf, custom events)
+- Standard WebSocket ping frames
+- Any other valid WebSocket message
+
+When the timeout occurs, the connection is closed with:
+- **Code:** 1000 (Normal Closure)
+- **Reason:** `"connection idle timeout"`
+
 ### Ping/Pong Keepalive

-The WebSocket supports standard ping/pong frames for periodic connection healthchecks:
+To prevent inactivity timeouts during periods of silence, use standard WebSocket ping frames for periodic keepalive:

 ```python
-# Client sends ping
+# Client sends ping to reset inactivity timer
 pong_waiter = await websocket.ping()
 latency = await pong_waiter
 ```

-The server automatically responds to ping frames with pong frames.
+```javascript
+// Node.js example using the `ws` package (the browser WebSocket API has no ping() method)
+setInterval(() => {
+  if (websocket.readyState === WebSocket.OPEN) {
+    websocket.ping();
+  }
+}, 20000); // Send ping every 20 seconds
+```
+
+The server automatically responds to ping frames with pong frames and resets the inactivity timer upon receiving any message.

### Connection Close @@ -214,5 +235,6 @@ When the agent ends the call, the server closes the connection with: 1. **Always send start event first** - The connection will be closed if any other event is sent before start 2. **Use appropriate audio formats** - Match your input format to your audio source capabilities. For telephony providers this is often MULAW 8k, while for web clients this will often be 44.1k 3. **Handle connection close gracefully** - Monitor close events and reasons for debugging -4. **Implement ping/pong for long connections** - Use WebSocket ping frames to periodically confirm connection health +4. **Implement keepalive for calls with longer periods of silence** - Send WebSocket ping frames every 20-25 seconds to prevent the 30-second inactivity timeout during periods of silence 5. **Monitor stream_id consistency** - Maintain your own stream_id's for the best observability +6. **Prepare for timeout closures** - Handle `"connection idle timeout"` close reasons gracefully in your reconnection logic From 7f11c5a8ce0e771537ee47e4025dc4e6349adf17 Mon Sep 17 00:00:00 2001 From: Timothy Luong Date: Thu, 25 Sep 2025 17:24:53 -0700 Subject: [PATCH 08/22] Removing output format from the config --- fern/agents/integrations/web-calls.mdx | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/fern/agents/integrations/web-calls.mdx b/fern/agents/integrations/web-calls.mdx index 451ff31..c86843a 100644 --- a/fern/agents/integrations/web-calls.mdx +++ b/fern/agents/integrations/web-calls.mdx @@ -36,7 +36,7 @@ Note that if you don't specify `stream_id` in the initial `start` message, one w ### Start Event Initializes the audio stream configuration. 
-- The `config` parameter will optionally alter the input/output audio settings, overriding what your default agent configuration might otherwise be +- The `config` parameter will optionally alter the input audio settings, overriding what your default agent configuration might otherwise be - The `stream_id` can be set manually if you wish to maintain this on the client end for observability purposes. If not specified, we'll generate one and return it in the `ack` event **This must be the first message sent.** @@ -46,8 +46,7 @@ Initializes the audio stream configuration. "event": "start", "stream_id": "example_id", "config": { - "input_format": "pcm_44100", - "output_format": "pcm_44100" + "input_format": "pcm_44100" } } ``` @@ -55,7 +54,6 @@ Initializes the audio stream configuration. **Fields:** - `stream_id` (optional): Stream identifier. If not provided, server generates one - `config.input_format`: Audio format for client audio input (`mulaw_8000`, `pcm_16000`, `pcm_24000`, `pcm_44100`) -- `config.output_format`: Audio format for server audio output (`mulaw_8000`, `pcm_16000`, `pcm_24000`, `pcm_44100`) ### Media Input Event @@ -122,8 +120,7 @@ If `stream_id` wasn't provided in the initial `start` event, this is where the u "event": "ack", "stream_id": "example_id", "config": { - "input_format": "pcm_44100", - "output_format": "pcm_44100" + "input_format": "pcm_44100" } } ``` From 28168599272380b37a4a6e71370f9f7735ffbf87 Mon Sep 17 00:00:00 2001 From: Timothy Luong Date: Thu, 25 Sep 2025 17:29:24 -0700 Subject: [PATCH 09/22] Updating docs with specific metadata fields --- fern/agents/integrations/web-calls.mdx | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/fern/agents/integrations/web-calls.mdx b/fern/agents/integrations/web-calls.mdx index c86843a..f5f44f0 100644 --- a/fern/agents/integrations/web-calls.mdx +++ b/fern/agents/integrations/web-calls.mdx @@ -47,6 +47,10 @@ Initializes the audio stream configuration. 
"stream_id": "example_id", "config": { "input_format": "pcm_44100" + }, + "metadata": { + "to": "user@example.com", + "from": "+1234567890" } } ``` @@ -54,6 +58,9 @@ Initializes the audio stream configuration. **Fields:** - `stream_id` (optional): Stream identifier. If not provided, server generates one - `config.input_format`: Audio format for client audio input (`mulaw_8000`, `pcm_16000`, `pcm_24000`, `pcm_44100`) +- `metadata` (optional): Custom metadata object. These will be passed through to the user code, but there are some special fields you can use as well: + - `to` (optional): Destination identifier for call routing (defaults to agent ID) + - `from` (optional): Source identifier for the call (defaults to "websocket") ### Media Input Event From 9f02d2190a6157aca25ed6ddf3d394eb014bb80a Mon Sep 17 00:00:00 2001 From: Timothy Luong Date: Thu, 25 Sep 2025 17:47:47 -0700 Subject: [PATCH 10/22] Update fern/agents/integrations/web-calls.mdx Co-authored-by: Sauhard Jain --- fern/agents/integrations/web-calls.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fern/agents/integrations/web-calls.mdx b/fern/agents/integrations/web-calls.mdx index f5f44f0..914c790 100644 --- a/fern/agents/integrations/web-calls.mdx +++ b/fern/agents/integrations/web-calls.mdx @@ -5,7 +5,7 @@ Some Cartesia customers might be interested in integrating their Line agents wit ```mermaid graph LR A1[Web Client] <-->|General Websocket Events| B[Cartesia API] - B <-->|Agent Events| C[Inferno Agents] + B <-->|Agent Events| C[Cartesia Agents] ``` ## Connection From 9c55dc0bd826f1d64121d6e4d897b89d46cc60f0 Mon Sep 17 00:00:00 2001 From: Timothy Luong Date: Thu, 25 Sep 2025 23:31:40 -0700 Subject: [PATCH 11/22] Update fern/agents/integrations/web-calls.mdx Co-authored-by: Sauhard Jain --- fern/agents/integrations/web-calls.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fern/agents/integrations/web-calls.mdx b/fern/agents/integrations/web-calls.mdx index 
914c790..c22ddc0 100644 --- a/fern/agents/integrations/web-calls.mdx +++ b/fern/agents/integrations/web-calls.mdx @@ -1,6 +1,6 @@ # Web Calls -Some Cartesia customers might be interested in integrating their Line agents with their website, rather than a telephony provider. For these folks, the Agents WebSocket provides real-time, bidirectional communication with voice agents through a streaming interface. This allows you to send audio data and receive agent responses in real-time for any client. +The Agents WebSocket provides real-time, bidirectional communication between web clients and Cartesia voice agents. It enables streaming audio input and real-time agent responses for browser-based or custom applications. ```mermaid graph LR From 9fe84935f8c601ab4f55e9676b422bda40b47b7d Mon Sep 17 00:00:00 2001 From: Timothy Luong Date: Thu, 25 Sep 2025 23:31:50 -0700 Subject: [PATCH 12/22] Update fern/agents/integrations/web-calls.mdx Co-authored-by: Sauhard Jain --- fern/agents/integrations/web-calls.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fern/agents/integrations/web-calls.mdx b/fern/agents/integrations/web-calls.mdx index c22ddc0..195dd4d 100644 --- a/fern/agents/integrations/web-calls.mdx +++ b/fern/agents/integrations/web-calls.mdx @@ -44,7 +44,7 @@ Initializes the audio stream configuration. 
```json { "event": "start", - "stream_id": "example_id", + "stream_id": "unique_id", "config": { "input_format": "pcm_44100" }, From 8313ea0d6a104e38b506b07705939d100d5b6d1c Mon Sep 17 00:00:00 2001 From: Timothy Luong Date: Thu, 25 Sep 2025 23:31:57 -0700 Subject: [PATCH 13/22] Update fern/agents/integrations/web-calls.mdx Co-authored-by: Sauhard Jain --- fern/agents/integrations/web-calls.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fern/agents/integrations/web-calls.mdx b/fern/agents/integrations/web-calls.mdx index 195dd4d..db03476 100644 --- a/fern/agents/integrations/web-calls.mdx +++ b/fern/agents/integrations/web-calls.mdx @@ -69,7 +69,7 @@ Audio data sent from the client to the server. `payload` audio data should be ba ```json { "event": "media_input", - "stream_id": "example_id", + "stream_id": "unique_id", "media": { "payload": "base64_encoded_audio_data" } From ddbf0e69ce042e0dad7408df4e145a5ab2a80eb4 Mon Sep 17 00:00:00 2001 From: Timothy Luong Date: Thu, 25 Sep 2025 23:32:07 -0700 Subject: [PATCH 14/22] Update fern/agents/integrations/web-calls.mdx Co-authored-by: Sauhard Jain --- fern/agents/integrations/web-calls.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fern/agents/integrations/web-calls.mdx b/fern/agents/integrations/web-calls.mdx index db03476..8aafbe2 100644 --- a/fern/agents/integrations/web-calls.mdx +++ b/fern/agents/integrations/web-calls.mdx @@ -77,7 +77,7 @@ Audio data sent from the client to the server. 
`payload` audio data should be ba ``` **Fields:** -- `stream_id`: Stream identifier from the ack response +- `stream_id`: Unique identifier for the Stream from the ack response - `media.payload`: Base64-encoded audio data in the format specified in the start event ### DTMF Event From f812447b1661238cd54ff47a0a3b69102d4c8752 Mon Sep 17 00:00:00 2001 From: Timothy Luong Date: Thu, 25 Sep 2025 23:32:45 -0700 Subject: [PATCH 15/22] Update fern/agents/integrations/web-calls.mdx Co-authored-by: Sauhard Jain --- fern/agents/integrations/web-calls.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fern/agents/integrations/web-calls.mdx b/fern/agents/integrations/web-calls.mdx index 8aafbe2..b10308c 100644 --- a/fern/agents/integrations/web-calls.mdx +++ b/fern/agents/integrations/web-calls.mdx @@ -241,4 +241,4 @@ When the agent ends the call, the server closes the connection with: 3. **Handle connection close gracefully** - Monitor close events and reasons for debugging 4. **Implement keepalive for calls with longer periods of silence** - Send WebSocket ping frames every 20-25 seconds to prevent the 30-second inactivity timeout during periods of silence 5. **Monitor stream_id consistency** - Maintain your own stream_id's for the best observability -6. **Prepare for timeout closures** - Handle `"connection idle timeout"` close reasons gracefully in your reconnection logic +6. Always handle timeout closures (`1000 / connection idle timeout`) by reconnecting and resending a `start` event. 
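The close reasons documented in the patches above (`"connection idle timeout"` and `"call ended by agent"`, both sent with code 1000) can be told apart on the client before deciding whether to reconnect. The following is a minimal sketch, not part of the patch series: the function names and the category labels are hypothetical, and the choice to reconnect only after an idle timeout is an assumption made for illustration.

```python
def classify_close(code, reason):
    """Map a WebSocket close frame to a coarse category.

    The reason strings are the ones documented in the patches; the
    category names returned here are invented for this sketch.
    """
    if code == 1000 and reason == "connection idle timeout":
        return "idle_timeout"
    if code == 1000 and reason.startswith("call ended by agent"):
        return "agent_ended"
    return "other"


def should_reconnect(code, reason):
    """Assumed policy: reconnect (and resend `start`) only after an idle timeout."""
    return classify_close(code, reason) == "idle_timeout"
```

A client would call `should_reconnect` from its close handler with the code and reason it received; any other policy (for example, retrying on abnormal closures as well) fits the same shape.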
From d224a4abe37cc1ac9fed8f1b662f88bf5265e42b Mon Sep 17 00:00:00 2001
From: Timothy Luong
Date: Thu, 25 Sep 2025 23:32:56 -0700
Subject: [PATCH 16/22] Update fern/agents/integrations/web-calls.mdx

Co-authored-by: Sauhard Jain
---
 fern/agents/integrations/web-calls.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fern/agents/integrations/web-calls.mdx b/fern/agents/integrations/web-calls.mdx
index b10308c..f35f25f 100644
--- a/fern/agents/integrations/web-calls.mdx
+++ b/fern/agents/integrations/web-calls.mdx
@@ -240,5 +240,5 @@ When the agent ends the call, the server closes the connection with:
 2. **Use appropriate audio formats** - Match your input format to your audio source capabilities. For telephony providers this is often MULAW 8k, while for web clients this will often be 44.1k
 3. **Handle connection close gracefully** - Monitor close events and reasons for debugging
 4. **Implement keepalive for calls with longer periods of silence** - Send WebSocket ping frames every 20-25 seconds to prevent the 30-second inactivity timeout during periods of silence
-5. **Monitor stream_id consistency** - Maintain your own stream_id's for the best observability
+5. Send your own stream_id's for the best observability
 6. Always handle timeout closures (`1000 / connection idle timeout`) by reconnecting and resending a `start` event.
From 7ec9d4834af539d6714e0d9717ee2503f93e6985 Mon Sep 17 00:00:00 2001 From: Timothy Luong Date: Thu, 25 Sep 2025 23:33:04 -0700 Subject: [PATCH 17/22] Update fern/agents/integrations/web-calls.mdx Co-authored-by: Sauhard Jain --- fern/agents/integrations/web-calls.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fern/agents/integrations/web-calls.mdx b/fern/agents/integrations/web-calls.mdx index f35f25f..98f44cd 100644 --- a/fern/agents/integrations/web-calls.mdx +++ b/fern/agents/integrations/web-calls.mdx @@ -237,7 +237,7 @@ When the agent ends the call, the server closes the connection with: ## Best Practices 1. **Always send start event first** - The connection will be closed if any other event is sent before start -2. **Use appropriate audio formats** - Match your input format to your audio source capabilities. For telephony providers this is often MULAW 8k, while for web clients this will often be 44.1k +2. **Use appropriate audio formats** - Match your input format to your audio source capabilities. For telephony providers this is often `mulaw_8000`, while for web clients this will often be `pcm_44100` 3. **Handle connection close gracefully** - Monitor close events and reasons for debugging 4. **Implement keepalive for calls with longer periods of silence** - Send WebSocket ping frames every 20-25 seconds to prevent the 30-second inactivity timeout during periods of silence 5.
Send your own stream_id's for the best observability From cba41cd985c70ec4f3fd034f234aeda1c16c38a7 Mon Sep 17 00:00:00 2001 From: Timothy Luong Date: Thu, 25 Sep 2025 23:33:16 -0700 Subject: [PATCH 18/22] Update fern/agents/integrations/web-calls.mdx Co-authored-by: Sauhard Jain --- fern/agents/integrations/web-calls.mdx | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/fern/agents/integrations/web-calls.mdx b/fern/agents/integrations/web-calls.mdx index 98f44cd..7a1701e 100644 --- a/fern/agents/integrations/web-calls.mdx +++ b/fern/agents/integrations/web-calls.mdx @@ -22,14 +22,14 @@ wss://api.cartesia.ai/agents/stream/{agent_id} ## Protocol Overview -The WebSocket protocol uses JSON messages for control and base64-encoded audio for media. The connection follows this flow: +The WebSocket connection uses JSON messages for control events and base64-encoded audio for media. The lifecycle follows this sequence: -1. Client sends `start` event to initialize the stream -2. Server responds with `ack` event confirming configuration -3. Bidirectional exchange of events -4. Connection closes using WebSocket close frames +1. **Client → Server:** Send a start event to initialize the stream. +1. **Server → Client:** Receive an ack event confirming configuration. +1. **Bidirectional exchange:** Stream audio and control events in real time. +1. **Close:** Either side ends the session with a standard WebSocket close frame. -Note that if you don't specify `stream_id` in the initial `start` message, one will be assigned and returned in the `ack` response. +If the client doesn’t provide a `stream_id` in the initial `start` event, the server generates one and returns it in the `ack` response. 
## Input Events (Client → Server) From 6e012d2bffcffe10131904c83f337b82bc685617 Mon Sep 17 00:00:00 2001 From: Timothy Luong Date: Thu, 25 Sep 2025 23:33:30 -0700 Subject: [PATCH 19/22] Update fern/agents/integrations/web-calls.mdx Co-authored-by: Sauhard Jain --- fern/agents/integrations/web-calls.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fern/agents/integrations/web-calls.mdx b/fern/agents/integrations/web-calls.mdx index 7a1701e..d5a42d3 100644 --- a/fern/agents/integrations/web-calls.mdx +++ b/fern/agents/integrations/web-calls.mdx @@ -31,7 +31,7 @@ The WebSocket connection uses JSON messages for control events and base64-encode If the client doesn’t provide a `stream_id` in the initial `start` event, the server generates one and returns it in the `ack` response. -## Input Events (Client → Server) +## Client events ### Start Event From 958ca4a6651adb442d3b966458ffca2022096b8b Mon Sep 17 00:00:00 2001 From: Timothy Luong Date: Thu, 25 Sep 2025 23:33:37 -0700 Subject: [PATCH 20/22] Update fern/agents/integrations/web-calls.mdx Co-authored-by: Sauhard Jain --- fern/agents/integrations/web-calls.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fern/agents/integrations/web-calls.mdx b/fern/agents/integrations/web-calls.mdx index d5a42d3..a75db24 100644 --- a/fern/agents/integrations/web-calls.mdx +++ b/fern/agents/integrations/web-calls.mdx @@ -115,7 +115,7 @@ Sends custom metadata to the agent. 
- `stream_id`: Stream identifier - `metadata`: Object containing key-value pairs of custom data -## Output Events (Server → Client) +## Server events ### Ack Event From 2eba66c4fb2bc10b15d2a8dda5dd084aa5764d16 Mon Sep 17 00:00:00 2001 From: Timothy Luong Date: Thu, 25 Sep 2025 23:36:23 -0700 Subject: [PATCH 21/22] Updating with nits --- fern/agents/integrations/web-calls.mdx | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/fern/agents/integrations/web-calls.mdx b/fern/agents/integrations/web-calls.mdx index a75db24..9a55166 100644 --- a/fern/agents/integrations/web-calls.mdx +++ b/fern/agents/integrations/web-calls.mdx @@ -17,8 +17,11 @@ wss://api.cartesia.ai/agents/stream/{agent_id} ``` **Headers:** -- `Authorization: Bearer {your_api_key}` -- `Cartesia-Version: 2025-04-16` + +| Header | Value | +|--------|-------| +| `Authorization` | `Bearer {your_api_key}` | +| `Cartesia-Version` | `2025-04-16` | ## Protocol Overview @@ -26,7 +29,7 @@ The WebSocket connection uses JSON messages for control events and base64-encode 1. **Client → Server:** Send a start event to initialize the stream. 1. **Server → Client:** Receive an ack event confirming configuration. -1. **Bidirectional exchange:** Stream audio and control events in real time. +1. **Bidirectional exchange:** The client and server exchange streaming audio and control events until either side closes the connection, or the inactivity timeout is fired. 1. **Close:** Either side ends the session with a standard WebSocket close frame. If the client doesn’t provide a `stream_id` in the initial `start` event, the server generates one and returns it in the `ack` response. 
From 5a617f5d1357805e97dfc8db7ceb01316361beaf Mon Sep 17 00:00:00 2001 From: Timothy Luong Date: Thu, 25 Sep 2025 23:36:56 -0700 Subject: [PATCH 22/22] idk when the numbering changed --- fern/agents/integrations/web-calls.mdx | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/fern/agents/integrations/web-calls.mdx b/fern/agents/integrations/web-calls.mdx index 9a55166..09a6c64 100644 --- a/fern/agents/integrations/web-calls.mdx +++ b/fern/agents/integrations/web-calls.mdx @@ -28,9 +28,9 @@ wss://api.cartesia.ai/agents/stream/{agent_id} The WebSocket connection uses JSON messages for control events and base64-encoded audio for media. The lifecycle follows this sequence: 1. **Client → Server:** Send a start event to initialize the stream. -1. **Server → Client:** Receive an ack event confirming configuration. -1. **Bidirectional exchange:** The client and server exchange streaming audio and control events until either side closes the connection, or the inactivity timeout is fired. -1. **Close:** Either side ends the session with a standard WebSocket close frame. +2. **Server → Client:** Receive an ack event confirming configuration. +3. **Bidirectional exchange:** The client and server exchange streaming audio and control events until either side closes the connection, or the inactivity timeout is fired. +4. **Close:** Either side ends the session with a standard WebSocket close frame. If the client doesn’t provide a `stream_id` in the initial `start` event, the server generates one and returns it in the `ack` response.
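The lifecycle described in the final text above (send `start`, read the `stream_id` from `ack`, then stream base64 audio) might look like this on the message-building side. This is a sketch under assumptions: the `config` key names (`inputFormat`, `outputFormat`) are taken from the first patch in this series and may have been renamed since, the `ack` is assumed to carry `stream_id` as a top-level field, and all function names are made up for illustration.

```python
import base64
import json
from typing import Optional


def build_start_event(stream_id: Optional[str] = None, audio_format: str = "pcm_44100") -> str:
    """Build the first message of the stream; `stream_id` is optional."""
    event = {
        "event": "start",
        # Assumed key names, copied from the first patch in this series.
        "config": {"inputFormat": audio_format, "outputFormat": audio_format},
    }
    if stream_id is not None:
        event["stream_id"] = stream_id
    return json.dumps(event)


def stream_id_from_ack(raw_message: str) -> str:
    """Return the stream_id confirmed (or generated) by the server."""
    message = json.loads(raw_message)
    if message.get("event") != "ack":
        raise ValueError("expected an ack event after start")
    return message["stream_id"]


def build_media_event(stream_id: str, audio: bytes) -> str:
    """Wrap raw audio bytes as a base64 `media.payload`."""
    payload = base64.b64encode(audio).decode("ascii")
    return json.dumps({"event": "media", "stream_id": stream_id, "media": {"payload": payload}})
```

Reading the identifier back from `ack` rather than assuming the one sent in `start` covers both cases the text describes: a client-supplied `stream_id` and a server-generated one.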