Commit 61d1907

Improve Realtime (#134)
1 parent ffd2fd2 commit 61d1907

22 files changed (+2996 / −265 lines)

.coveragerc

Lines changed: 1 addition & 0 deletions
```diff
@@ -6,6 +6,7 @@ omit =
     */site-packages/*
     setup.py
     solana_agent/cli.py
+    solana_agent/interfaces/providers/realtime.py  # exclude interface-only module
 
 [report]
 exclude_lines =
```
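For intuition, coverage.py treats `omit` entries as file patterns. A rough sketch of the effect of the new entry, using `fnmatch` (which only approximates coverage.py's real matching rules — `is_omitted` is an illustrative helper, not part of coverage.py):

```python
from fnmatch import fnmatch

# Patterns from the updated [run] omit list (illustrative subset).
OMIT = [
    "*/site-packages/*",
    "setup.py",
    "solana_agent/cli.py",
    "solana_agent/interfaces/providers/realtime.py",
]

def is_omitted(path: str) -> bool:
    """Return True if a measured file matches any omit pattern."""
    return any(fnmatch(path, pattern) for pattern in OMIT)
```

The interface-only module is now excluded from the coverage report, so its abstract definitions no longer drag down the total.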

README.md

Lines changed: 114 additions & 8 deletions
```diff
@@ -62,6 +62,7 @@ Smart workflows are as easy as combining your tools and prompts.
 * Simple agent definition using JSON
 * Designed for a multi-agent swarm
 * Fast multi-modal processing of text, audio, and images
+* Dual modality realtime streaming with simultaneous audio and text output
 * Smart workflows that keep flows simple and smart
 * Interact with the Solana blockchain with many useful tools
 * MCP tool usage with first-class support for [Zapier](https://zapier.com/mcp)
```
```diff
@@ -96,7 +97,7 @@ Smart workflows are as easy as combining your tools and prompts.
 **OpenAI**
 * [gpt-4.1](https://platform.openai.com/docs/models/gpt-4.1) (agent & router)
 * [text-embedding-3-large](https://platform.openai.com/docs/models/text-embedding-3-large) (embedding)
-* [gpt-realtime](https://platform.openai.com/docs/models/gpt-realtime) (realtime audio agent)
+* [gpt-realtime](https://platform.openai.com/docs/models/gpt-realtime) (realtime audio agent with dual modality support)
 * [tts-1](https://platform.openai.com/docs/models/tts-1) (audio TTS)
 * [gpt-4o-mini-transcribe](https://platform.openai.com/docs/models/gpt-4o-mini-transcribe) (audio transcription)
 
```
````diff
@@ -245,6 +246,7 @@ async for response in solana_agent.process("user123", "What is the latest news o
 ### Audio/Text Streaming
 
 ```python
+## Realtime Usage
 from solana_agent import SolanaAgent
 
 config = {
````
````diff
@@ -275,28 +277,32 @@ async for response in solana_agent.process("user123", audio_content, audio_input
 
 ### Realtime Audio Streaming
 
-If input and/or output is encoded (compressed) like mp4/aac then you must have `ffmpeg` installed.
+If input and/or output is encoded (compressed) like mp4/mp3 then you must have `ffmpeg` installed.
 
 Due to the overhead of the router (API call) - realtime only supports a single agent setup.
 
 Realtime uses MongoDB for memory so Zep is not needed.
 
+By default, when `realtime=True` and you supply raw/encoded audio bytes as input, the system **always skips the HTTP transcription (STT) path** and relies solely on the realtime websocket session for input transcription. If you don't specify `rt_transcription_model`, a sensible default (`gpt-4o-mini-transcribe`) is auto-selected so you still receive input transcript events with minimal latency.
+
+Implications:
+- `llm_provider.transcribe_audio` is never invoked for realtime turns.
+- Lower end-to-end latency (no duplicate network round trip for STT).
+- Unified transcript sourcing from realtime events.
+- If you explicitly want to disable transcription altogether, send text (not audio bytes) or ignore transcript events client-side.
+
 This example will work using expo-audio on Android and iOS.
 
 ```python
 from solana_agent import SolanaAgent
 
 solana_agent = SolanaAgent(config=config)
-
-audio_content = await audio_file.read()
-
-async def generate():
-    async for chunk in solana_agent.process(
-        user_id=user_id,
+        user_id="user123",
         message=audio_content,
         realtime=True,
         rt_encode_input=True,
         rt_encode_output=True,
+        rt_output_modalities=["audio"],
         rt_voice="marin",
         output_format="audio",
         audio_output_format="mp3",
````
````diff
@@ -314,6 +320,106 @@ return StreamingResponse(
         "X-Accel-Buffering": "no",
     },
 )
+```
+
+### Realtime Text Streaming
+
+Due to the overhead of the router (API call) - realtime only supports a single agent setup.
+
+Realtime uses MongoDB for memory so Zep is not needed.
+
+When using realtime with text input, no audio transcription is needed. The same bypass rules apply: HTTP STT is never called in realtime mode.
+
+```python
+from solana_agent import SolanaAgent
+
+solana_agent = SolanaAgent(config=config)
+
+async def generate():
+    async for chunk in solana_agent.process(
+        user_id="user123",
+        message="What is the latest news on Solana?",
+        realtime=True,
+        rt_output_modalities=["text"],
+    ):
+        yield chunk
+```
+
+### Dual Modality Realtime Streaming
+
+Solana Agent supports **dual modality realtime streaming**, allowing you to stream both audio and text simultaneously from a single realtime session. This enables rich conversational experiences where users can receive both voice responses and text transcripts in real-time.
+
+#### Features
+- **Simultaneous Audio & Text**: Stream both modalities from the same conversation
+- **Flexible Output**: Choose audio-only, text-only, or both modalities
+- **Real-time Demuxing**: Automatically separate audio and text streams
+- **Mobile Optimized**: Works seamlessly with compressed audio formats (MP4/MP3)
+- **Memory Efficient**: Smart buffering and streaming for optimal performance
+
+#### Mobile App Integration Example
+
+```python
+from fastapi import UploadFile
+from fastapi.responses import StreamingResponse
+from solana_agent import SolanaAgent
+from solana_agent.interfaces.providers.realtime import RealtimeChunk
+import base64
+
+solana_agent = SolanaAgent(config=config)
+
+@app.post("/realtime/dual")
+async def realtime_dual_endpoint(audio_file: UploadFile):
+    """
+    Dual modality (audio + text) realtime endpoint using Server-Sent Events (SSE).
+    Emits:
+        event: audio (base64 encoded audio frames)
+        event: transcript (incremental text)
+    Notes:
+        - Do NOT set output_format when using both modalities.
+        - If only one modality is requested, plain str (text) or raw audio
+          bytes may be yielded instead of RealtimeChunk.
+    """
+    audio_content = await audio_file.read()
+
+    async def event_stream():
+        async for chunk in solana_agent.process(
+            user_id="mobile_user",
+            message=audio_content,
+            realtime=True,
+            rt_encode_input=True,
+            rt_encode_output=True,
+            rt_output_modalities=["audio", "text"],
+            rt_voice="marin",
+            audio_input_format="mp4",
+            audio_output_format="mp3",
+            # Optionally lock the transcription model (otherwise a default is auto-selected):
+            # rt_transcription_model="gpt-4o-mini-transcribe",
+        ):
+            if isinstance(chunk, RealtimeChunk):
+                if chunk.is_audio and chunk.audio_data:
+                    b64 = base64.b64encode(chunk.audio_data).decode("ascii")
+                    yield f"event: audio\ndata: {b64}\n\n"
+                elif chunk.is_text and chunk.text_data:
+                    # Incremental transcript (not duplicated at finalize)
+                    yield f"event: transcript\ndata: {chunk.text_data}\n\n"
+                continue
+            # Defensive fallback: handle plain bytes/str if a single modality slips through
+            if isinstance(chunk, bytes):
+                b64 = base64.b64encode(chunk).decode("ascii")
+                yield f"event: audio\ndata: {b64}\n\n"
+            elif isinstance(chunk, str):
+                yield f"event: transcript\ndata: {chunk}\n\n"
+
+        yield "event: done\ndata: end\n\n"
+
+    return StreamingResponse(
+        event_stream(),
+        media_type="text/event-stream",
+        headers={
+            "Cache-Control": "no-store",
+            "Access-Control-Allow-Origin": "*",
+        },
+    )
+```
 
 ### Image/Text Streaming
 
````
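The dual-modality endpoint above hand-rolls its SSE frames, so the wire format can be sketched (and unit-tested) independently of the agent. `sse_event`, `audio_event`, and `parse_sse` are illustrative helpers, not part of Solana Agent:

```python
import base64

def sse_event(event: str, data: str) -> str:
    """Format one Server-Sent Events frame: event line, data line, blank line."""
    return f"event: {event}\ndata: {data}\n\n"

def audio_event(chunk: bytes) -> str:
    """Audio is base64-encoded so binary frames survive the text-only SSE channel."""
    return sse_event("audio", base64.b64encode(chunk).decode("ascii"))

def parse_sse(payload: str) -> list[tuple[str, str]]:
    """Parse a text/event-stream payload back into (event, data) pairs."""
    events = []
    for frame in payload.strip().split("\n\n"):
        name, data = "", []
        for line in frame.split("\n"):
            if line.startswith("event: "):
                name = line[len("event: "):]
            elif line.startswith("data: "):
                data.append(line[len("data: "):])
        events.append((name, "\n".join(data)))
    return events
```

A client can route `audio` frames to a player and `transcript` frames to the UI after a round trip through `parse_sse`.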
docs/index.rst

Lines changed: 104 additions & 1 deletion
```diff
@@ -223,9 +223,10 @@ This example will work using expo-audio on Android and iOS.
             rt_encode_input=True,
             rt_encode_output=True,
             rt_voice="marin",
+            rt_output_modalities=["audio"],
             output_format="audio",
-            audio_output_format="mp3",
             audio_input_format="m4a",
+            audio_output_format="mp3",
         ):
             yield chunk
 
```
```diff
@@ -240,6 +241,108 @@ This example will work using expo-audio on Android and iOS.
     },
 )
 
+Realtime Text Streaming
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Due to the overhead of the router (API call) - realtime only supports a single agent setup.
+
+Realtime uses MongoDB for memory so Zep is not needed.
+
+.. code-block:: python
+
+    from solana_agent import SolanaAgent
+
+    solana_agent = SolanaAgent(config=config)
+
+    async def generate():
+        async for chunk in solana_agent.process(
+            user_id="user123",
+            message="What is the latest news on Solana?",
+            realtime=True,
+            rt_output_modalities=["text"],
+        ):
+            yield chunk
+
+Dual Modality Realtime Streaming
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Solana Agent now supports **dual modality realtime streaming**, allowing you to stream both audio and text simultaneously from a single realtime session. This enables rich conversational experiences where users can receive both voice responses and text transcripts in real-time.
+
+Features
+^^^^^^^^
+
+- **Simultaneous Audio & Text**: Stream both modalities from the same conversation
+- **Flexible Output**: Choose audio-only, text-only, or both modalities
+- **Real-time Demuxing**: Automatically separate audio and text streams
+- **Mobile Optimized**: Works seamlessly with compressed audio formats (MP4/MP3)
+- **Memory Efficient**: Smart buffering and streaming for optimal performance
+
+Mobile App Integration Example
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. code-block:: python
+
+    from fastapi import UploadFile
+    from fastapi.responses import StreamingResponse
+    from solana_agent import SolanaAgent
+    from solana_agent.interfaces.providers.realtime import RealtimeChunk
+    import base64
+
+    solana_agent = SolanaAgent(config=config)
+
+    @app.post("/realtime/dual")
+    async def realtime_dual_endpoint(audio_file: UploadFile):
+        """
+        Dual modality (audio + text) realtime endpoint using Server-Sent Events (SSE).
+        Emits:
+            event: audio (base64 encoded audio frames)
+            event: transcript (incremental text)
+        Notes:
+            - Do NOT set output_format when using both modalities.
+            - If only one modality is requested, plain str (text) or raw audio
+              bytes may be yielded instead of RealtimeChunk.
+        """
+        audio_content = await audio_file.read()
+
+        async def event_stream():
+            async for chunk in solana_agent.process(
+                user_id="mobile_user",
+                message=audio_content,
+                realtime=True,
+                rt_encode_input=True,
+                rt_encode_output=True,
+                rt_output_modalities=["audio", "text"],
+                rt_voice="marin",
+                audio_input_format="mp4",
+                audio_output_format="mp3",
+                # Optionally lock the transcription model (otherwise a default is auto-selected):
+                # rt_transcription_model="gpt-4o-mini-transcribe",
+            ):
+                if isinstance(chunk, RealtimeChunk):
+                    if chunk.is_audio and chunk.audio_data:
+                        b64 = base64.b64encode(chunk.audio_data).decode("ascii")
+                        yield f"event: audio\ndata: {b64}\n\n"
+                    elif chunk.is_text and chunk.text_data:
+                        # Incremental transcript (not duplicated at finalize)
+                        yield f"event: transcript\ndata: {chunk.text_data}\n\n"
+                    continue
+                # Defensive fallback: handle plain bytes/str if a single modality slips through
+                if isinstance(chunk, bytes):
+                    b64 = base64.b64encode(chunk).decode("ascii")
+                    yield f"event: audio\ndata: {b64}\n\n"
+                elif isinstance(chunk, str):
+                    yield f"event: transcript\ndata: {chunk}\n\n"
+
+            yield "event: done\ndata: end\n\n"
+
+        return StreamingResponse(
+            event_stream(),
+            media_type="text/event-stream",
+            headers={
+                "Cache-Control": "no-store",
+                "Access-Control-Allow-Origin": "*",
+            },
+        )
+
 Image/Text Streaming
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
```
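Both docs describe `rt_output_modalities` as accepting audio-only, text-only, or both. A hypothetical validator for that contract (illustrative only — `validate_modalities` is not the library's code, and the library's real error handling may differ):

```python
ALLOWED_MODALITIES = {"audio", "text"}

def validate_modalities(modalities: list[str]) -> list[str]:
    """Reject empty lists or unknown entries; dedupe while preserving order."""
    deduped = list(dict.fromkeys(modalities))
    if not deduped or not set(deduped) <= ALLOWED_MODALITIES:
        raise ValueError(
            "rt_output_modalities must be a non-empty subset of {'audio', 'text'}"
        )
    return deduped
```

Per the docstring in the examples above, requesting both modalities yields `RealtimeChunk` objects, while a single modality may yield plain `str` or `bytes`.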
poetry.lock

Lines changed: 6 additions & 6 deletions
Generated lockfile; diff not rendered.

pyproject.toml

Lines changed: 1 addition & 1 deletion
```diff
@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "solana-agent"
-version = "31.2.6"
+version = "31.3.0"
 description = "AI Agents for Solana"
 authors = ["Bevan Hunt <bevan@bevanhunt.com>"]
 license = "MIT"
```
