The Surgical Agentic Framework Demo is a multimodal agentic AI framework tailored for surgical procedures. It supports:

* **Speech-to-Text**: Real-time audio is captured and transcribed by Whisper.
* **VLM/LLM-based Conversational Agents**: A *selector agent* decides which specialized agent to invoke:
  * ChatAgent for general Q&A,
  * NotetakerAgent to record specific notes,
  * AnnotationAgent to automatically annotate progress in the background,
  * PostOpNoteAgent to summarize all data into a final post-operative note.
* **(Optional) Text-to-Speech**: The system can speak back the AI's response if you enable TTS (ElevenLabs is implemented, but any local TTS could be integrated as well).
* **Computer Vision**: Multimodal features are supported via a fine-tuned VLM (Vision Language Model), launched by vLLM.
* **Video Upload and Processing**: Support for uploading and analyzing surgical videos.
* **Post-Operation Note Generation**: Automatic generation of structured post-operative notes based on the procedure data.

## System Flow and Agent Overview

1. Microphone: The user clicks "Start Mic" in the web UI, or types a question.
2. Whisper ASR: Transcribes speech into text (via servers/whisper_online_server.py).
3. SelectorAgent: Receives text from the UI, corrects it (if needed), decides whether to direct it to:
* ChatAgent (general Q&A about the procedure)
* NotetakerAgent (records a note with timestamp + optional image frame)
* In the background, AnnotationAgent is also generating structured "annotations" every 10 seconds.
4. NotetakerAgent: If chosen, logs the note in a JSON file.
5. AnnotationAgent: Runs automatically, storing procedure annotations in ```procedure_..._annotations.json```.
6. PostOpNoteAgent (optional final step): Summarizes the entire procedure, reading from both the annotation JSON and the notetaker JSON, producing a final structured post-op note.

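The routing in step 3 can be sketched as a simple dispatch. This is a hypothetical, keyword-based illustration only — the actual SelectorAgent in `agents/selector_agent.py` is LLM-driven, not rule-based:

```python
# Simplified, keyword-based sketch of the selector routing described above.
# The real SelectorAgent uses the LLM to decide; these rules are illustrative.

def route_request(text: str) -> str:
    lowered = text.lower()
    if lowered.startswith("take a note") or "note:" in lowered:
        return "NotetakerAgent"   # record a timestamped note
    return "ChatAgent"            # default: general Q&A about the procedure

print(route_request("Take a note: The gallbladder is severely inflamed."))
print(route_request("What are the next steps after dissecting the cystic duct?"))
```

In the real system the selector also corrects the transcript before routing, which a keyword rule cannot do — hence the LLM-based design.
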
## System Requirements

* Python 3.12 or higher
* Node.js 14.x or higher
* CUDA-compatible GPU (recommended) for model inference

vLLM is already configured in the project scripts. If you need to set up a custom vLLM server, see https://docs.vllm.ai/en/latest/getting_started/installation.html

5. Models Folder:

* Place your model files in ```models/llm/``` for LLMs and ```models/whisper/``` for Whisper models.
* This repository is configured to use a Llama-3.2-11B model with surgical fine-tuning.
* The model is served using vLLM for optimal performance.
* Folder structure is:

```
models/
├── llm/
│   └── Llama-3.2-11B-lora-surgical-4bit/  <-- LLM model files
└── whisper/                               <-- Whisper models (downloaded at runtime)
```
6. Setup:

* Edit ```scripts/start_app.sh``` if you need to change ports or model file names.

7. Create necessary directories:

```bash
mkdir -p annotations uploaded_videos
```

## Running the Surgical Agentic Framework Demo

### Production Mode

1. Run the full stack with all services:

```
npm start
```

Or using the script directly:

```
./scripts/start_app.sh
```

What it does:

* Builds the CSS with Tailwind
* Starts vLLM server with the model on port 8000
* Waits 45 seconds for the model to load
* Starts Whisper (servers/whisper_online_server.py) on port 43001 (for ASR)
* Waits 5 seconds
* Launches ```python servers/app.py``` (the main Flask + WebSockets application)
* Waits for all processes to complete
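
If you adapt the startup scripts, the fixed 45-second wait can be replaced by polling the vLLM port until it accepts connections. A minimal sketch; `wait_for_port` is our own helper, not part of the project:

```python
import socket
import time

def wait_for_port(host: str, port: int, timeout: float = 120.0) -> bool:
    """Poll until a TCP port accepts connections, or give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(1.0)  # server not up yet; retry
    return False

# e.g. block until the vLLM server (port 8000 above) is ready before
# launching the Whisper server and the Flask app:
# ready = wait_for_port("127.0.0.1", 8000)
```

This starts dependent services as soon as the model is loaded instead of always paying the worst-case delay.
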
### Development Mode

For UI development with hot-reloading CSS changes:

```
npm run dev:web
```

This starts:
* The CSS watch process for automatic Tailwind compilation
* The web server only (no LLM or Whisper)

For full stack development:

```
npm run dev:full
```

This is the same as production mode but also watches for CSS changes.

You can also use the development script for faster startup during development:

```
./scripts/dev.sh
```

2. **Open** your browser at ```http://127.0.0.1:8050```. You should see the Surgical Agentic Framework Demo interface:

* A video sample (```sample_video.mp4```)
* Chat console
* A "Start Mic" button to begin ASR.

3. Try speaking or typing:
* If you say "Take a note: The gallbladder is severely inflamed," the system routes you to NotetakerAgent.
* If you say "What are the next steps after dissecting the cystic duct?" it routes you to ChatAgent.

4. Background Annotations:
* Meanwhile, ```AnnotationAgent``` writes a file like ```procedure_2025_01_18__10_25_03_annotations.json``` in the annotations folder every 10 seconds with structured timeline data.
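
These annotation files are plain JSON, so downstream tooling can load and render them in a few lines. The entry schema below (a list of objects with `timestamp` and `text` fields) is an assumption for illustration — check an actual generated annotations file for the real structure:

```python
import json

def summarize_annotations(raw: str) -> list[str]:
    """Render each annotation entry as a 'timestamp - text' line.

    Assumes a hypothetical schema: a JSON array of objects with
    'timestamp' and 'text' keys. Adjust to the real file format.
    """
    return [f"{a['timestamp']} - {a['text']}" for a in json.loads(raw)]

# Example with a hand-written entry (not real output):
sample = '[{"timestamp": "10:25:13", "text": "Port placement complete."}]'
for line in summarize_annotations(sample):
    print(line)
```
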
## Uploading and Processing Videos

1. Click on the "Upload Video" button to add your own surgical videos
2. Browse the video library by clicking "Video Library"
3. Select a video to analyze
4. Use the chat interface to ask questions about the video or create annotations

## Generating Post-Operation Notes

After accumulating annotations and notes during a procedure:

1. Click the "Generate Post-Op Note" button
2. The system will analyze all annotations and notes
3. A structured post-operation note will be generated with:
   * Procedure information
   * Key findings
   * Procedure timeline
   * Complications

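
At the data level, this step amounts to merging the annotation timeline with the notetaker entries into the sections above. A sketch under assumed field names (`timestamp`, `text`, `complication` are illustrative; the real PostOpNoteAgent uses the LLM to write the narrative):

```python
# Illustrative data-flow sketch only — not the PostOpNoteAgent implementation.
# Field names are hypothetical; the real agent reads the annotation JSON and
# notetaker JSON and asks the LLM to compose the note.

def build_post_op_note(procedure_info: dict, annotations: list, notes: list) -> dict:
    return {
        "procedure_information": procedure_info,
        "key_findings": [n["text"] for n in notes],
        "procedure_timeline": [(a["timestamp"], a["text"]) for a in annotations],
        "complications": [n["text"] for n in notes if n.get("complication")],
    }

note = build_post_op_note(
    {"procedure": "laparoscopic cholecystectomy"},
    [{"timestamp": "10:25", "text": "Cystic duct identified"}],
    [{"text": "Gallbladder severely inflamed", "complication": True}],
)
```
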
## Troubleshooting

Common issues and solutions:

1. **WebSocket Connection Errors**:
   * Check firewall settings to ensure ports 49000 and 49001 are open
   * Ensure no other applications are using these ports
   * If you experience frequent timeouts, adjust the WebSocket configuration in `servers/web_server.py`

2. **Model Loading Errors**:
   * Verify model paths are correct in configuration files
   * Ensure you have sufficient GPU memory for the models
   * Check the log files for specific error messages

3. **Audio Transcription Issues**:
   * Verify your microphone is working correctly
   * Check that the Whisper server is running
   * Adjust microphone settings in your browser

## Text-to-Speech (Optional)
If you want to enable TTS with ElevenLabs (or implement your own local TTS server):

A brief overview:

```
surgical_agentic_framework/
├── agents/                       <-- Agent implementations
│   ├── annotation_agent.py
│   ├── base_agent.py
│   ├── chat_agent.py
│   ├── notetaker_agent.py
│   ├── post_op_note_agent.py
│   └── selector_agent.py
├── configs/                      <-- Configuration files
│   ├── annotation_agent.yaml
│   ├── chat_agent.yaml
│   ├── notetaker_agent.yaml
│   ├── post_op_note_agent.yaml
│   └── selector.yaml
├── models/                       <-- Model files
│   ├── llm/                      <-- LLM model files
│   │   └── Llama-3.2-11B-lora-surgical-4bit/
│   └── whisper/                  <-- Whisper models (downloaded at runtime)
├── scripts/                      <-- Shell scripts for starting services
│   ├── dev.sh                    <-- Development script for quick startup
│   ├── run_vllm_server.sh
│   ├── start_app.sh              <-- Main script to launch everything
│   └── start_web_dev.sh          <-- Web UI development script
├── servers/                      <-- Server implementations
│   ├── app.py                    <-- Main application server
│   ├── uploaded_videos/          <-- Storage for uploaded videos
│   ├── web_server.py             <-- Web interface server
│   └── whisper_online_server.py  <-- Whisper ASR server
├── utils/                        <-- Utility classes and functions
│   ├── chat_history.py
│   ├── logging_utils.py
│   └── response_handler.py
├── web/                          <-- Web interface assets
│   ├── src/                      <-- Vue.js components
│   │   ├── App.vue
│   │   ├── components/
│   │   │   ├── Annotation.vue
│   │   │   ├── ChatMessage.vue
│   │   │   ├── Note.vue
│   │   │   ├── PostOpNote.vue
│   │   │   └── VideoCard.vue
│   │   └── main.js
│   ├── static/                   <-- CSS, JS, and other static assets
│   │   ├── audio.js
│   │   ├── bootstrap.bundle.min.js
│   │   ├── bootstrap.css
│   │   ├── chat.css
│   │   ├── jquery-3.6.3.min.js
│   │   ├── main.js
│   │   ├── nvidia-logo.png
│   │   ├── styles.css
│   │   ├── tailwind-custom.css
│   │   └── websocket.js
│   └── templates/
│       └── index.html
├── annotations/                  <-- Stored procedure annotations
├── uploaded_videos/              <-- Uploaded video storage
├── CLAUDE.md                     <-- Guidelines for Claude or other AI assistants
├── README.md                     <-- This file
├── package.json                  <-- Node.js dependencies and scripts
├── postcss.config.js             <-- PostCSS configuration for Tailwind