Commit adb8a5d

Merge pull request #2 from zephyrie/revamp
Revamp UI/UX Experience
2 parents 7ccc628 + 5bc8617

45 files changed: +14991, -717 lines

.gitignore

Lines changed: 51 additions & 0 deletions
````diff
@@ -1,2 +1,53 @@
 models/*
 !models/.gitignore
+
+# Node.js
+node_modules/
+npm-debug.log
+yarn-debug.log
+yarn-error.log
+
+# Python
+__pycache__/
+*.py[cod]
+*$py.class
+*.so
+.Python
+env/
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+
+# Logs
+*.log
+logs/
+log/
+
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+
+# IDE and editors
+.idea/
+.vscode/
+*.swp
+*.swo
+*~
+
+# Uploaded content
+uploaded_videos/
````

README.md

Lines changed: 197 additions & 54 deletions
````diff
@@ -1,85 +1,193 @@
-# Surgical Copilot
+# Surgical Agentic Framework Demo
 
-Surgical Copilot is a multimodal agentic AI framework tailored for surgical procedures. It supports:
+The Surgical Agentic Framework Demo is a multimodal agentic AI framework tailored for surgical procedures. It supports:
 
 * **Speech-to-Text**: Real-time audio is captured, transcribed by Whisper.
 * **VLM/LLM-based Conversational Agents**: A *selector agent* decides which specialized agent to invoke:
   * ChatAgent for general Q&A,
   * NotetakerAgent to record specific notes,
   * AnnotationAgent to automatically annotate progress in the background,
   * PostOpNoteAgent to summarize all data into a final post-operative note.
-* **(Optional) Text-to-Speech**: The system can speak back the AI’s response if you enable TTS (ElevenLabs is implemented, but any local TTS could be implemented as well).
-* **Computer Vision** or multimodal features are supported via a finetuned VLM (Vision Language Model), launched by Ollama.
+* **(Optional) Text-to-Speech**: The system can speak back the AI's response if you enable TTS (ElevenLabs is implemented, but any local TTS could be implemented as well).
+* **Computer Vision** or multimodal features are supported via a finetuned VLM (Vision Language Model), launched by vLLM.
+* **Video Upload and Processing**: Support for uploading and analyzing surgical videos.
+* **Post-Operation Note Generation**: Automatic generation of structured post-operative notes based on the procedure data.
 
 
 ## System Flow and Agent Overview
 
-1. Microphone: The user clicks Start Mic in the web UI, or types a question.
-2. Whisper ASR: Transcribes speech into text (via whisper_online_server.py).
+1. Microphone: The user clicks "Start Mic" in the web UI, or types a question.
+2. Whisper ASR: Transcribes speech into text (via servers/whisper_online_server.py).
 3. SelectorAgent: Receives text from the UI, corrects it (if needed), decides whether to direct it to:
    * ChatAgent (general Q&A about the procedure)
    * NotetakerAgent (records a note with timestamp + optional image frame)
-   * In the background, AnnotationAgent is also generating structured annotations every 10 seconds.
+   * In the background, AnnotationAgent is also generating structured "annotations" every 10 seconds.
4. NotetakerAgent: If chosen, logs the note in a JSON file.
 5. AnnotationAgent: Runs automatically, storing procedure annotations in ```procedure_..._annotations.json```.
 6. PostOpNoteAgent (optional final step): Summarizes the entire procedure, reading from both the annotation JSON and the notetaker JSON, producing a final structured post-op note.
````
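
The SelectorAgent in step 3 is the dispatch point for everything the user says or types. As a rough sketch of that routing idea (illustrative only: the real agent is LLM-driven and configured via `configs/selector.yaml`, so the keyword heuristic below is a stand-in, not the repository's logic):

```python
# Illustrative stand-in for the SelectorAgent's routing decision.
# The actual agent is LLM-driven (see configs/selector.yaml); this keyword
# heuristic only mirrors the behavior the README describes.

def select_agent(transcript: str) -> str:
    text = transcript.lower().strip()
    if text.startswith("take a note"):
        return "NotetakerAgent"  # notes are logged to JSON with a timestamp
    return "ChatAgent"           # everything else is general Q&A

# The README's own examples route as described:
assert select_agent("Take a note: The gallbladder is severely inflamed") == "NotetakerAgent"
assert select_agent("What are the next steps after dissecting the cystic duct?") == "ChatAgent"
```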

````diff
-Installation
+## System Requirements
+
+* Python 3.12 or higher
+* Node.js 14.x or higher
+* CUDA-compatible GPU (recommended) for model inference
+* Microphone for voice input (optional)
+* 16GB+ RAM recommended
+
+## Installation
 
 1. Clone or Download this repository:
 
 ```
-git clone https://github.com/project-monai/VLM-Surgical-Agent-Framework
-cd VLM-Surgical-Agent-Framework
+git clone https://github.com/monai/surgical_agentic_framework.git
+cd surgical_agentic_framework
 ```
 
-2. Install Dependencies:
+2. Setup vLLM (Optional)
+
+vLLM is already configured in the project scripts. If you need to set up a custom vLLM server, see https://docs.vllm.ai/en/latest/getting_started/installation.html
+
+3. Install Dependencies:
 
 ```
-conda create -n surgical_copilot python=3.12
-conda activate surgical_copilot
+conda create -n surgical_agentic_framework python=3.12
+conda activate surgical_agentic_framework
 pip install -r requirements.txt
 ```
 
-3. Models Folder:
+4. Install Node.js dependencies (for UI development):
 
-Download models from Huggingface here: TBD
+```
+npm install
+```
````
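
Before moving on, it can help to verify the environment against the System Requirements above. A minimal sketch (assuming PyTorch arrives via `requirements.txt`, which vLLM needs for GPU inference; adjust if your dependency set differs):

```python
# Sanity-check the environment against the stated system requirements.
import sys

assert sys.version_info >= (3, 12), "Python 3.12 or higher is required"

try:
    import torch  # assumed to come in via requirements.txt (vLLM depends on it)
    print("CUDA-compatible GPU visible:", torch.cuda.is_available())
except ImportError:
    print("PyTorch missing - run: pip install -r requirements.txt")
```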

````diff
-* Place your model directory in ```models/```. The folder structure is:
+5. Models Folder:
+
+* Place your model files in ```models/llm/``` for LLMs and ```models/whisper/``` for Whisper models.
+* This repository is configured to use a Llama-3.2-11B model with surgical fine-tuning.
+* The model is served using vLLM for optimal performance.
+
+* Folder structure is:
 
 ```
 models/
-├── Llama-3.2-11B-lora-surgical-4bit/
+├── llm/
+│   └── Llama-3.2-11B-lora-surgical-4bit/   <-- LLM model files
+└── whisper/                                <-- Whisper models (downloaded at runtime)
+```
+
+6. Setup:
+
+* Edit ```scripts/start_app.sh``` if you need to change ports or model file names.
+
+7. Create necessary directories:
+
+```bash
+mkdir -p annotations uploaded_videos
 ```
````
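
A quick way to confirm the layout above is in place before launching (the LLM folder name is the one this repository is configured for; Whisper models arrive at runtime, so only the directory needs to exist):

```python
# Verify the models/ layout shown above before starting the app.
from pathlib import Path

llm_dir = Path("models/llm/Llama-3.2-11B-lora-surgical-4bit")
if not llm_dir.is_dir() or not any(llm_dir.iterdir()):
    raise SystemExit(f"LLM model files missing - place them under {llm_dir}/")

Path("models/whisper").mkdir(parents=True, exist_ok=True)  # filled at runtime
print("models/ layout looks good")
```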

````diff
-4. Video Setup:
+## Running the Surgical Agentic Framework Demo
 
-* Use the UI to select a surgical video sample to use.
+### Production Mode
 
-5. Setup:
+1. Run the full stack with all services:
 
-* Edit ```start_app.sh``` if you need to change ports or model file names.
+```
+npm start
+```
 
-## Running Surgical Copilot
+Or using the script directly:
 
-1. Run the script:
+```
+./scripts/start_app.sh
+```
+
+What it does:
+
+* Builds the CSS with Tailwind
+* Starts vLLM server with the model on port 8000
+* Waits 45 seconds for the model to load
+* Starts Whisper (servers/whisper_online_server.py) on port 43001 (for ASR)
+* Waits 5 seconds
+* Launches ```python servers/app.py``` (the main Flask + WebSockets application)
+* Waits for all processes to complete
+
````
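
Because the script waits 45 seconds for the model to load, a readiness probe is handy. A sketch, assuming `start_app.sh` runs vLLM's OpenAI-compatible HTTP server on port 8000 (the port is from the list above; the `/v1/models` route is vLLM's convention, not something this README states):

```python
# Probe the vLLM server that scripts/start_app.sh launches on port 8000.
# Assumes vLLM's OpenAI-compatible API, which exposes GET /v1/models.
import json
import urllib.request

try:
    with urllib.request.urlopen("http://127.0.0.1:8000/v1/models", timeout=5) as resp:
        payload = json.load(resp)
    print("vLLM ready; models:", [m["id"] for m in payload.get("data", [])])
except OSError as exc:
    print("vLLM not reachable yet (it may still be loading):", exc)
```
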
````diff
+### Development Mode
+
+For UI development with hot-reloading CSS changes:
 
 ```
-./start_app.sh
+npm run dev:web
 ```
 
-2. **Open** your browser at ```http://127.0.0.1:8050```. You should see the Surgical Copilot interface:
+This starts:
+* The CSS watch process for automatic Tailwind compilation
+* The web server only (no LLM or Whisper)
+
+For full stack development:
+
+```
+npm run dev:full
+```
+
+This is the same as production mode but also watches for CSS changes.
+
+You can also use the development script for faster startup during development:
+
+```
+./scripts/dev.sh
+```
+
+2. **Open** your browser at ```http://127.0.0.1:8050```. You should see the Surgical Agentic Framework Demo interface:
   * A video sample (```sample_video.mp4```)
   * Chat console
   * A "Start Mic" button to begin ASR.
````

````diff
 3. Try speaking or Typing:
-   * If you say Take a note: The gallbladder is severely inflamed, the system routes you to NotetakerAgent.
-   * If you say What are the next steps after dissecting the cystic duct? it routes you to ChatAgent.
+   * If you say "Take a note: The gallbladder is severely inflamed," the system routes you to NotetakerAgent.
+   * If you say "What are the next steps after dissecting the cystic duct?" it routes you to ChatAgent.
 
 4. Background Annotations:
   * Meanwhile, ```AnnotationAgent``` writes a file like: ```procedure_2025_01_18__10_25_03_annotations.json``` in the annotations folder every 10 seconds with structured timeline data.
````
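
To peek at what `AnnotationAgent` has been writing, a short sketch (the filename pattern follows the README; the per-entry schema isn't documented here, so this just pretty-prints whatever it finds):

```python
# Skim the newest annotations file produced by AnnotationAgent.
import json
from pathlib import Path

files = sorted(Path("annotations").glob("procedure_*_annotations.json"))
if not files:
    print("No annotation files yet - one appears every 10 seconds during a procedure")
else:
    data = json.loads(files[-1].read_text())
    print(files[-1].name)
    print(json.dumps(data, indent=2)[:800])  # preview only; schema not assumed
```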

````diff
+## Uploading and Processing Videos
+
+1. Click on the "Upload Video" button to add your own surgical videos
+2. Browse the video library by clicking "Video Library"
+3. Select a video to analyze
+4. Use the chat interface to ask questions about the video or create annotations
+
+## Generating Post-Operation Notes
+
+After accumulating annotations and notes during a procedure:
+
+1. Click the "Generate Post-Op Note" button
+2. The system will analyze all annotations and notes
+3. A structured post-operation note will be generated with:
+   * Procedure information
+   * Key findings
+   * Procedure timeline
+   * Complications
+
````

````diff
+## Troubleshooting
+
+Common issues and solutions:
+
+1. **WebSocket Connection Errors**:
+   * Check firewall settings to ensure ports 49000 and 49001 are open
+   * Ensure no other applications are using these ports
+   * If you experience frequent timeouts, adjust the WebSocket configuration in `servers/web_server.py`
+
+2. **Model Loading Errors**:
+   * Verify model paths are correct in configuration files
+   * Ensure you have sufficient GPU memory for the models
+   * Check the log files for specific error messages
+
+3. **Audio Transcription Issues**:
+   * Verify your microphone is working correctly
+   * Check that the Whisper server is running
+   * Adjust microphone settings in your browser
+
````
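
For the WebSocket checks in item 1, a quick probe shows whether anything is listening on the two ports (port numbers from the list above; a refused connection means the service isn't up or is blocked):

```python
# Check whether the WebSocket ports named in the troubleshooting list are
# reachable on localhost.
import socket

for port in (49000, 49001):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(2)
        state = "open" if s.connect_ex(("127.0.0.1", port)) == 0 else "closed/blocked"
        print(f"port {port}: {state}")
```
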
````diff
 ## Text-to-Speech (Optional)
 
 If you want to enable TTS with ElevenLabs (or implement your own local TTS server):
@@ -91,42 +199,77 @@ If you want to enable TTS with ElevenLabs (or implement your own local TTS server):
 A brief overview:
 
 ```
-surgical_copilot/
-├── agents
+surgical_agentic_framework/
+├── agents/                     <-- Agent implementations
 │   ├── annotation_agent.py
 │   ├── base_agent.py
 │   ├── chat_agent.py
 │   ├── notetaker_agent.py
 │   ├── post_op_note_agent.py
 │   └── selector_agent.py
-├── app.py
-├── configs
+├── configs/                    <-- Configuration files
 │   ├── annotation_agent.yaml
 │   ├── chat_agent.yaml
 │   ├── notetaker_agent.yaml
 │   ├── post_op_note_agent.yaml
 │   └── selector.yaml
-├── models
-│   ├── mmproj-model-f16.gguf
-│   └── surgical_copilot_Q_6.gguf
-├── README.md            <-- this file
-├── requirements.txt
-├── start_app.sh         <-- main script to launch everything
-├── whisper              <-- directory for whisper servers
-│   ├── whisper_online_server.py
-│   └── jfk.flac
-└── web
-    ├── static
-    │   ├── audio.js
-    │   ├── bootstrap.bundle.min.js
-    │   ├── bootstrap.css
-    │   ├── chat.css
-    │   ├── favicon.ico
-    │   ├── jquery-3.6.3.min.js
-    │   ├── nvidia-logo.png
-    │   ├── sample_video.mp4
-    │   └── websocket.js
-    ├── templates
-    │   └── index.html
-    └── webserver.py
+├── models/                     <-- Model files
+│   ├── llm/                    <-- LLM model files
+│   │   └── Llama-3.2-11B-lora-surgical-4bit/
+│   └── whisper/                <-- Whisper models (downloaded at runtime)
+├── scripts/                    <-- Shell scripts for starting services
+│   ├── dev.sh                  <-- Development script for quick startup
+│   ├── run_vllm_server.sh
+│   ├── start_app.sh            <-- Main script to launch everything
+│   └── start_web_dev.sh        <-- Web UI development script
+├── servers/                    <-- Server implementations
+│   ├── app.py                  <-- Main application server
+│   ├── uploaded_videos/        <-- Storage for uploaded videos
+│   ├── web_server.py           <-- Web interface server
+│   └── whisper_online_server.py <-- Whisper ASR server
+├── utils/                      <-- Utility classes and functions
+│   ├── chat_history.py
+│   ├── logging_utils.py
+│   └── response_handler.py
+├── web/                        <-- Web interface assets
+│   ├── src/                    <-- Vue.js components
+│   │   ├── App.vue
+│   │   ├── components/
+│   │   │   ├── Annotation.vue
+│   │   │   ├── ChatMessage.vue
+│   │   │   ├── Note.vue
+│   │   │   ├── PostOpNote.vue
+│   │   │   └── VideoCard.vue
+│   │   └── main.js
+│   ├── static/                 <-- CSS, JS, and other static assets
+│   │   ├── audio.js
+│   │   ├── bootstrap.bundle.min.js
+│   │   ├── bootstrap.css
+│   │   ├── chat.css
+│   │   ├── jquery-3.6.3.min.js
+│   │   ├── main.js
+│   │   ├── nvidia-logo.png
+│   │   ├── styles.css
+│   │   ├── tailwind-custom.css
+│   │   └── websocket.js
+│   └── templates/
+│       └── index.html
+├── annotations/                <-- Stored procedure annotations
+├── uploaded_videos/            <-- Uploaded video storage
+├── CLAUDE.md                   <-- Guidelines for Claude or other AI assistants
+├── README.md                   <-- This file
+├── package.json                <-- Node.js dependencies and scripts
+├── postcss.config.js           <-- PostCSS configuration for Tailwind
+├── tailwind.config.js          <-- Tailwind CSS configuration
+├── vite.config.js              <-- Vite build configuration
+└── requirements.txt            <-- Python dependencies
 ```
+
+## Recent Updates
+
+* Added robust WebSocket connection handling with automatic reconnection
+* Improved video upload and management functionality
+* Enhanced post-operation note generation interface
+* Modernized UI with Vue.js components
+* Added tailored CSS styling with Tailwind
+* Implemented development scripts for faster iteration
````
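
Each agent module in `agents/` pairs with a YAML file in `configs/`. A minimal sketch of loading one (assuming PyYAML is available in the Python environment; the config schema itself isn't documented in this README, so nothing beyond valid YAML is assumed):

```python
# Load one of the per-agent configs from the tree above.
import yaml  # PyYAML, assumed available in the project environment

with open("configs/chat_agent.yaml") as f:
    cfg = yaml.safe_load(f)

print("chat_agent.yaml top-level keys:", sorted(cfg))
```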
