Commit 7900354

feat: complete TTS implementation
1 parent ef2b7bd commit 7900354

13 files changed, +923 -7 lines changed

README.md

Lines changed: 56 additions & 1 deletion
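@@ -23,6 +23,7 @@
 ## 🚀 Features
 
 - **OpenAI-compatible API**: Fully compatible with the OpenAI-format `/v1/chat/completions` endpoint
+- **TTS speech generation**: Supports single- and multi-speaker audio generation with Gemini 2.5 TTS models
 - **Smart model switching**: Dynamically switch models in AI Studio via the `model` field
 - **Anti-fingerprint detection**: Uses the Camoufox browser to reduce the risk of detection
 - **GUI launcher**: A feature-rich **web** launcher that simplifies configuration and management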
@@ -191,6 +192,59 @@ curl -X POST http://localhost:2048/v1/chat/completions \
 - **Model name**: `gemini-2.5-pro` (or another model supported by AI Studio)
 - **API key**: Leave empty or enter any characters, e.g. `123`
 
+### TTS Speech Generation
+
+Gemini 2.5 Flash/Pro TTS models are supported for single-speaker or multi-speaker audio generation:
+
+#### Single-Speaker Example
+
+```bash
+curl -X POST http://localhost:2048/generate-speech \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "gemini-2.5-flash-preview-tts",
+    "contents": "Hello, this is a test.",
+    "generationConfig": {
+      "responseModalities": ["AUDIO"],
+      "speechConfig": {
+        "voiceConfig": {
+          "prebuiltVoiceConfig": {"voiceName": "Kore"}
+        }
+      }
+    }
+  }'
+```
+
+#### Multi-Speaker Example
+
+```bash
+curl -X POST http://localhost:2048/generate-speech \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "gemini-2.5-flash-preview-tts",
+    "contents": "Joe: How are you?\nJane: I am fine, thanks!",
+    "generationConfig": {
+      "responseModalities": ["AUDIO"],
+      "speechConfig": {
+        "multiSpeakerVoiceConfig": {
+          "speakerVoiceConfigs": [
+            {"speaker": "Joe", "voiceConfig": {"prebuiltVoiceConfig": {"voiceName": "Kore"}}},
+            {"speaker": "Jane", "voiceConfig": {"prebuiltVoiceConfig": {"voiceName": "Puck"}}}
+          ]
+        }
+      }
+    }
+  }'
+```
+
+**Available voices**: Zephyr, Puck, Charon, Kore, Fenrir, Leda, Orus, Aoede, Callirrhoe, Autonoe, Enceladus, Iapetus, and others (30 in total).
+
+**Endpoints**:
+- `POST /generate-speech`
+- `POST /v1beta/models/{model}:generateContent` (compatible with the official API)
+
+**Response format**: Audio data is returned as Base64-encoded WAV in `candidates[0].content.parts[0].inlineData.data`.
+
 ### Ollama Compatibility Layer
 
 The project also provides Ollama-format API compatibility:
@@ -219,6 +273,7 @@ AIStudio2API/
 │ ├── browser/ # Browser automation modules
 │ ├── config/ # Configuration management
 │ ├── models/ # Data models
+│ ├── tts/ # TTS speech generation module
 │ ├── proxy/ # Streaming proxy
 │ └── static/ # Static resources
 ├── data/ # Runtime data directory
@@ -296,7 +351,7 @@ cp .env.example .env
 
 ## 📅 Development Roadmap
 
-- **TTS support**: Adapt the `gemini-2.5-flash/pro-preview-tts` speech generation models
+- **TTS support**: The `gemini-2.5-flash/pro-preview-tts` speech generation models are now supported
 - **Documentation**: Update and refine the detailed usage docs and API specification under `docs/`
 - **One-click deployment**: Provide fully automated install and launch scripts for Windows/Linux/macOS
 - **Docker support**: Provide a standard Dockerfile and Docker Compose files to simplify deployment
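
The single-speaker and multi-speaker request bodies in the README hunk above differ only in the `speechConfig` object. As a minimal sketch, the helper below builds either shape from Python; `build_tts_payload` and its parameters are hypothetical names chosen for illustration and are not part of this commit.

```python
import json
from typing import Dict, Optional


def build_tts_payload(
    text: str,
    model: str = "gemini-2.5-flash-preview-tts",
    voice: str = "Kore",
    speakers: Optional[Dict[str, str]] = None,
) -> dict:
    """Build a request body shaped like the README curl examples.

    `speakers` maps a speaker name to a voice name. When it is given,
    a multiSpeakerVoiceConfig is emitted instead of a single voiceConfig.
    (Hypothetical helper, not the project's own code.)
    """
    if speakers:
        speech_config = {
            "multiSpeakerVoiceConfig": {
                "speakerVoiceConfigs": [
                    {
                        "speaker": name,
                        "voiceConfig": {"prebuiltVoiceConfig": {"voiceName": voice_name}},
                    }
                    for name, voice_name in speakers.items()
                ]
            }
        }
    else:
        speech_config = {
            "voiceConfig": {"prebuiltVoiceConfig": {"voiceName": voice}}
        }
    return {
        "model": model,
        "contents": text,
        "generationConfig": {
            "responseModalities": ["AUDIO"],
            "speechConfig": speech_config,
        },
    }


if __name__ == "__main__":
    payload = build_tts_payload(
        "Joe: How are you?\nJane: I am fine, thanks!",
        speakers={"Joe": "Kore", "Jane": "Puck"},
    )
    # The printed JSON can be sent as the body of POST /generate-speech.
    print(json.dumps(payload, indent=2))
```

The speaker names passed in `speakers` should match the `Name:` prefixes used in `contents`, as they do in the README's Joe/Jane example.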

README_en.md

Lines changed: 56 additions & 1 deletion
@@ -23,6 +23,7 @@
 ## 🚀 Features
 
 - **OpenAI Compatible API**: Fully compatible with OpenAI format `/v1/chat/completions` endpoint
+- **TTS Speech Generation**: Supports Gemini 2.5 TTS models for single/multi-speaker audio generation
 - **Smart Model Switching**: Dynamically switch models in AI Studio via the `model` field
 - **Anti-Fingerprint Detection**: Uses Camoufox browser to reduce detection risk
 - **GUI Launcher**: Feature-rich **web** launcher for simplified configuration and management
@@ -185,6 +186,59 @@ Using Cherry Studio as an example:
 - **Model Name**: `gemini-2.5-pro` (or other AI Studio supported models)
 - **API Key**: Leave empty or enter any character like `123`
 
+### TTS Speech Generation
+
+Supports Gemini 2.5 Flash/Pro TTS models for single-speaker or multi-speaker audio generation:
+
+#### Single-Speaker Example
+
+```bash
+curl -X POST http://localhost:2048/generate-speech \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "gemini-2.5-flash-preview-tts",
+    "contents": "Hello, this is a test.",
+    "generationConfig": {
+      "responseModalities": ["AUDIO"],
+      "speechConfig": {
+        "voiceConfig": {
+          "prebuiltVoiceConfig": {"voiceName": "Kore"}
+        }
+      }
+    }
+  }'
+```
+
+#### Multi-Speaker Example
+
+```bash
+curl -X POST http://localhost:2048/generate-speech \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "gemini-2.5-flash-preview-tts",
+    "contents": "Joe: How are you?\nJane: I am fine, thanks!",
+    "generationConfig": {
+      "responseModalities": ["AUDIO"],
+      "speechConfig": {
+        "multiSpeakerVoiceConfig": {
+          "speakerVoiceConfigs": [
+            {"speaker": "Joe", "voiceConfig": {"prebuiltVoiceConfig": {"voiceName": "Kore"}}},
+            {"speaker": "Jane", "voiceConfig": {"prebuiltVoiceConfig": {"voiceName": "Puck"}}}
+          ]
+        }
+      }
+    }
+  }'
+```
+
+**Available Voices**: Zephyr, Puck, Charon, Kore, Fenrir, Leda, Orus, Aoede, Callirrhoe, Autonoe, Enceladus, Iapetus, and 18 more voices.
+
+**Endpoints**:
+- `POST /generate-speech`
+- `POST /v1beta/models/{model}:generateContent` (compatible with official API)
+
+**Response Format**: Audio data is returned as Base64-encoded WAV format in `candidates[0].content.parts[0].inlineData.data`.
+
 ### Ollama Compatibility Layer
 
 The project also provides Ollama format API compatibility:
@@ -213,6 +267,7 @@ AIStudio2API/
 │ ├── browser/ # Browser automation modules
 │ ├── config/ # Configuration management
 │ ├── models/ # Data models
+│ ├── tts/ # TTS Speech Generation modules
 │ ├── proxy/ # Streaming proxy
 │ └── static/ # Static resources
 ├── data/ # Runtime data directory
@@ -290,7 +345,7 @@ Issues and Pull Requests are welcome!
 
 ## 📅 Development Roadmap
 
-- **TTS Support**: Adapt `gemini-2.5-flash/pro-preview-tts` speech generation models
+- **TTS Support**: Adapted `gemini-2.5-flash/pro-preview-tts` speech generation models
 - **Documentation**: Update and optimize documentation in `docs/` directory
 - **One-Click Deployment**: Provide fully automated install and launch scripts for Windows/Linux/macOS
 - **Docker Support**: Provide standard Dockerfile and Docker Compose orchestration files
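
The README states that audio comes back as Base64-encoded WAV at `candidates[0].content.parts[0].inlineData.data`. The sketch below shows one way a client might decode that field using only the standard library. The function names are hypothetical, and `make_fake_response` fabricates a silent clip (24 kHz, 16-bit mono is an assumption made here, not something the diff specifies) so the example runs without a server.

```python
import base64
import io
import wave


def extract_wav_bytes(response_json: dict) -> bytes:
    """Decode the Base64 audio found at the path named in the README."""
    data = response_json["candidates"][0]["content"]["parts"][0]["inlineData"]["data"]
    return base64.b64decode(data)


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Read the WAV header and return the clip length in seconds."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def make_fake_response(seconds: float = 0.5, rate: int = 24000) -> dict:
    """Stand-in for a server reply: a silent mono WAV (assumed format)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    encoded = base64.b64encode(buf.getvalue()).decode("ascii")
    return {"candidates": [{"content": {"parts": [{"inlineData": {"data": encoded}}]}}]}


if __name__ == "__main__":
    audio = extract_wav_bytes(make_fake_response())
    print(f"{len(audio)} bytes, {wav_duration_seconds(audio):.2f} s")
    # To keep the audio, write the bytes to disk:
    #   with open("output.wav", "wb") as f:
    #       f.write(audio)
```

With a real reply, `response_json` would be the parsed JSON body returned by `POST /generate-speech`.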

data/excluded_models.txt

Lines changed: 0 additions & 2 deletions
@@ -2,8 +2,6 @@ veo-2.0-generate-001
 imagen-4.0-fast-generate-001
 imagen-4.0-ultra-generate-001
 imagen-4.0-generate-001
-gemini-2.5-flash-preview-tts
-gemini-2.5-pro-preview-tts
 gemini-2.5-flash-native-audio-preview-09-2025
 gemini-2.5-flash-image
 gemini-3-pro-image-preview
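
This hunk drops the two TTS model ids from `data/excluded_models.txt`. The diff does not show how the project reads that file, so the following is only a sketch under the assumption that it is a plain deny list with one model id per line. `parse_excluded` and `visible_models` are hypothetical names, not functions from the repository.

```python
from typing import Iterable, List, Set


def parse_excluded(text: str) -> Set[str]:
    """Parse a deny list: one model id per line, blank lines ignored."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def visible_models(all_models: Iterable[str], excluded: Set[str]) -> List[str]:
    """Keep only the models that are not on the deny list, preserving order."""
    return [model for model in all_models if model not in excluded]


if __name__ == "__main__":
    # Hypothetical catalogue of model ids, for illustration only.
    catalogue = [
        "gemini-2.5-pro",
        "gemini-2.5-flash-preview-tts",
        "gemini-2.5-pro-preview-tts",
        "gemini-2.5-flash-image",
    ]
    before = parse_excluded(
        "gemini-2.5-flash-preview-tts\n"
        "gemini-2.5-pro-preview-tts\n"
        "gemini-2.5-flash-image\n"
    )
    after = parse_excluded("gemini-2.5-flash-image\n")
    print("before commit:", visible_models(catalogue, before))
    print("after commit: ", visible_models(catalogue, after))
```

Under that assumption, removing the two lines is what would make the TTS models selectable.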

docs/api-usage.md

Lines changed: 35 additions & 0 deletions
@@ -225,6 +225,40 @@ else:
     print(f"Error: {response.status_code}\n{response.text}")
 ```
 
+### TTS Speech Generation
+
+**Endpoints**:
+- `POST /generate-speech`
+- `POST /v1beta/models/{model}:generateContent`
+
+Gemini 2.5 TTS models are supported for single-speaker or multi-speaker audio generation.
+
+**Supported models**:
+- `gemini-2.5-flash-preview-tts`
+- `gemini-2.5-pro-preview-tts`
+
+**Request example**:
+```bash
+curl -X POST http://localhost:2048/generate-speech \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "gemini-2.5-flash-preview-tts",
+    "contents": "Hello, this is a test.",
+    "generationConfig": {
+      "responseModalities": ["AUDIO"],
+      "speechConfig": {
+        "voiceConfig": {
+          "prebuiltVoiceConfig": {"voiceName": "Kore"}
+        }
+      }
+    }
+  }'
+```
+
+**Response format**: Audio data is returned as Base64-encoded WAV in `candidates[0].content.parts[0].inlineData.data`.
+
+**Full documentation**: See the [TTS usage guide](tts-guide.md)
+
 ### Ollama Compatibility Layer
 
 The project also provides Ollama-format API compatibility:
@@ -393,5 +427,6 @@ curl -X POST http://localhost:11434/api/chat \
 ## Next Steps
 
 Once API usage is configured, see:
+- [TTS speech generation guide](tts-guide.md)
 - [Troubleshooting guide](troubleshooting.md)
 - [Logging control guide](logging-control.md)

docs/development-guide.md

Lines changed: 5 additions & 0 deletions
@@ -56,6 +56,11 @@ AIStudio2API/
 │ ├── models/ # Data models
 │ │ ├── types.py # Chat/exception models
 │ │ └── websocket.py # WebSocket log models
+│ ├── tts/ # TTS speech generation module
+│ │ ├── __init__.py # Module initialization
+│ │ ├── models.py # TTS data models
+│ │ ├── tts_controller.py # TTS page controller
+│ │ └── tts_processor.py # TTS request handler
 │ ├── proxy/ # Streaming proxy service
 │ │ ├── runner.py # Proxy service entry point
 │ │ ├── server.py # Proxy server
