# Surgical Copilot

Surgical Copilot is a multimodal agentic AI framework tailored for surgical procedures. It supports:

* **Speech-to-Text**: Real-time audio is captured and transcribed by Whisper.
* **VLM/LLM-based Conversational Agents**: A *selector agent* decides which specialized agent to invoke:
  * ChatAgent for general Q&A,
  * NotetakerAgent to record specific notes,
  * AnnotationAgent to automatically annotate progress in the background,
  * PostOpNoteAgent to summarize all data into a final post-operative note.
* **(Optional) Text-to-Speech**: The system can speak the AI's responses aloud if you enable TTS (ElevenLabs is supported out of the box; any local TTS could be integrated as well).
* **Computer Vision**: Multimodal features are supported via a fine-tuned VLM (Vision Language Model) launched by Ollama.

## System Flow and Agent Overview

1. Microphone: The user clicks "Start Mic" in the web UI, or types a question.
2. Whisper ASR: Transcribes speech into text (via `whisper_online_server.py`).
3. SelectorAgent: Receives text from the UI, corrects it if needed, and decides whether to route it to one of the following (see the routing sketch after this list):
   * ChatAgent (general Q&A about the procedure)
   * NotetakerAgent (records a note with a timestamp + optional image frame)
   * In the background, AnnotationAgent is also generating structured "annotations" every 10 seconds.
4. NotetakerAgent: If chosen, logs the note in a JSON file.
5. AnnotationAgent: Runs automatically, storing procedure annotations in `procedure_..._annotations.json`.
6. PostOpNoteAgent (optional final step): Summarizes the entire procedure, reading from both the annotation JSON and the notetaker JSON, and producing a final structured post-op note.
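
The routing decision in step 3 can be pictured with a short sketch. Everything below is a hypothetical illustration, not the repository's actual API: the real selector is LLM-driven and lives in `agents/selector_agent.py`, and the names used here (`SelectorAgent.route`, `NOTE_TRIGGERS`) are made up for exposition.

```
# Hypothetical sketch of the selector's routing decision. The real logic
# in agents/selector_agent.py is LLM-driven; this keyword matcher only
# shows the shape of the interface.
class SelectorAgent:
    NOTE_TRIGGERS = ("take a note", "note that", "record a note")

    def route(self, transcript: str) -> str:
        """Return the name of the agent that should handle this utterance."""
        text = transcript.lower().strip()
        if any(trigger in text for trigger in self.NOTE_TRIGGERS):
            return "NotetakerAgent"  # log a note (timestamp + optional frame)
        return "ChatAgent"           # default: general Q&A about the procedure

# AnnotationAgent and PostOpNoteAgent are not routed to per utterance:
# the former runs on its own timer, the latter is an optional final step.
print(SelectorAgent().route("Take a note: the gallbladder is inflamed"))
```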

## Installation

1. Clone or Download this repository:

   ```
   git clone https://github.com/project-monai/VLM-Surgical-Agent-Framework
   cd VLM-Surgical-Agent-Framework
   ```

2. Install Dependencies:

   ```
   conda create -n surgical_copilot python=3.12
   conda activate surgical_copilot
   pip install -r requirements.txt
   ```

3. Models Folder:

   Download the models from Hugging Face here: TBD

   * Place your model directory in `models/`. The folder structure is:

   ```
   models/
   ├── Llama-3.2-11B-lora-surgical-4bit/
   ```

4. Video Setup:

   * Use the UI to select a surgical video sample.

5. Configuration:

   * Edit `start_app.sh` if you need to change ports or model file names.

## Running Surgical Copilot

1. Run the script:

   ```
   ./start_app.sh
   ```

2. Open your browser at `http://127.0.0.1:8050`. You should see the Surgical Copilot interface:
   * A video sample (`sample_video.mp4`)
   * A chat console
   * A "Start Mic" button to begin ASR

3. Try speaking or typing:
   * If you say "Take a note: The gallbladder is severely inflamed," the system routes you to NotetakerAgent.
   * If you say "What are the next steps after dissecting the cystic duct?", it routes you to ChatAgent.

4. Background Annotations:
   * Meanwhile, `AnnotationAgent` writes a file like `procedure_2025_01_18__10_25_03_annotations.json` to the annotations folder every 10 seconds with structured timeline data (a hypothetical example follows).
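
The exact schema of the annotation file is defined by the framework's configs; the entry below is only an assumed illustration, and every field name in it is hypothetical.

```
import json

# Hypothetical shape of ONE timeline entry in the annotations JSON --
# field names are illustrative, not the framework's actual schema.
entry = {
    "timestamp": "2025-01-18T10:25:13",  # when the annotation was generated
    "phase": "dissection",               # estimated surgical phase
    "description": "Dissection of the cystic duct in progress.",
}
print(json.dumps(entry, indent=2))
```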
## Text-to-Speech (Optional)

If you want to enable TTS with ElevenLabs (or implement your own local TTS server):
* Follow the instructions in `index.html` or in whichever code path calls the TTS route or API.
* Provide your TTS API key if needed (a minimal example follows).
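
As a rough illustration, a server-side ElevenLabs call might look like the sketch below. The endpoint shape follows ElevenLabs' public v1 REST API, but treat the voice ID, request fields, and output handling as assumptions and defer to the ElevenLabs docs and this repository's own TTS route.

```
import requests

ELEVENLABS_API_KEY = "your-api-key"  # assumption: real app reads this from env/config
VOICE_ID = "your-voice-id"           # hypothetical placeholder voice

def speak(text: str) -> bytes:
    """Synthesize `text` with ElevenLabs and return the audio bytes."""
    response = requests.post(
        f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
        headers={"xi-api-key": ELEVENLABS_API_KEY},
        json={"text": text},
        timeout=30,
    )
    response.raise_for_status()
    return response.content  # MP3 audio by default

with open("response.mp3", "wb") as f:
    f.write(speak("The gallbladder is severely inflamed."))
```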
## File Structure

A brief overview:

```
surgical_copilot/
├── agents
│   ├── annotation_agent.py
│   ├── base_agent.py
│   ├── chat_agent.py
│   ├── notetaker_agent.py
│   ├── post_op_note_agent.py
│   └── selector_agent.py
├── app.py
├── configs
│   ├── annotation_agent.yaml
│   ├── chat_agent.yaml
│   ├── notetaker_agent.yaml
│   ├── post_op_note_agent.yaml
│   └── selector.yaml
├── models
│   ├── mmproj-model-f16.gguf
│   └── surgical_copilot_Q_6.gguf
├── README.md          <-- this file
├── requirements.txt
├── start_app.sh       <-- main script to launch everything
├── whisper            <-- directory for whisper servers
│   ├── whisper_online_server.py
│   └── jfk.flac
└── web
    ├── static
    │   ├── audio.js
    │   ├── bootstrap.bundle.min.js
    │   ├── bootstrap.css
    │   ├── chat.css
    │   ├── favicon.ico
    │   ├── jquery-3.6.3.min.js
    │   ├── nvidia-logo.png
    │   ├── sample_video.mp4
    │   └── websocket.js
    ├── templates
    │   └── index.html
    └── webserver.py
```
