# Surgical Copilot

Surgical Copilot is a multimodal agentic AI framework tailored for surgical procedures. It supports:

* **Speech-to-Text**: Real-time audio is captured and transcribed by Whisper.
* **VLM/LLM-based Conversational Agents**: A *selector agent* decides which specialized agent to invoke:
  * ChatAgent for general Q&A,
  * NotetakerAgent to record specific notes,
  * AnnotationAgent to automatically annotate progress in the background,
  * PostOpNoteAgent to summarize all data into a final post-operative note.
* **(Optional) Text-to-Speech**: The system can speak the AI's response aloud if you enable TTS (ElevenLabs is implemented, but any local TTS server could be added as well).
* **Computer Vision**: Multimodal features are supported via a finetuned VLM (Vision Language Model), launched by Ollama.

## System Flow and Agent Overview

1. Microphone: The user clicks "Start Mic" in the web UI, or types a question.
2. Whisper ASR: Transcribes speech into text (via `whisper_online_server.py`).
3. SelectorAgent: Receives text from the UI, corrects it if needed, and decides whether to route it to one of the following (a minimal sketch of this dispatch appears after this list):
   * ChatAgent (general Q&A about the procedure), or
   * NotetakerAgent (records a note with a timestamp and an optional image frame).
4. NotetakerAgent: If chosen, logs the note in a JSON file.
5. AnnotationAgent: Runs automatically in the background, storing structured procedure annotations in `procedure_..._annotations.json` every 10 seconds.
6. PostOpNoteAgent (optional final step): Summarizes the entire procedure, reading from both the annotation JSON and the notetaker JSON, and produces a final structured post-op note.
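
The dispatch in step 3 boils down to "classify, then hand off." Below is a minimal illustrative sketch of that pattern; the function names are hypothetical, and the real `SelectorAgent` uses the LLM prompt in `configs/selector.yaml` to pick an agent, not keyword matching.

```python
# Illustrative sketch of the selector dispatch -- NOT the real selector_agent.py.
# classify() is a hypothetical stand-in for the LLM call that picks an agent.

def classify(text: str) -> str:
    """Stand-in for the LLM-based agent choice."""
    if text.lower().startswith("take a note"):
        return "NotetakerAgent"
    return "ChatAgent"

def route(text: str, handlers: dict) -> str:
    """Dispatch the utterance to the chosen agent's handler."""
    return handlers[classify(text)](text)

handlers = {
    "ChatAgent": lambda t: f"[ChatAgent] answering: {t}",
    "NotetakerAgent": lambda t: f"[NotetakerAgent] noted: {t}",
}
print(route("Take a note: the gallbladder is severely inflamed.", handlers))
```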

## Installation

1. Clone or Download this repository:

```
git clone https://github.com/project-monai/VLM-Surgical-Agent-Framework
cd VLM-Surgical-Agent-Framework
```

2. Install Dependencies:

```
conda create -n surgical_copilot python=3.12
conda activate surgical_copilot
pip install -r requirements.txt
```
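
Before moving on, a quick sanity check can confirm the environment resolved correctly. The snippet below assumes `requirements.txt` pulls in PyTorch and that you intend to run GPU inference; adjust to your setup.

```python
# Optional environment sanity check -- assumes PyTorch is among the
# installed dependencies. Run inside the activated conda environment.
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```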

3. Models Folder:

Download models from Hugging Face here: TBD

* Place your model directory in `models/`. The folder structure is:

```
models/
  ├── Llama-3.2-11B-lora-surgical-4bit/
```

4. Video Setup:

* Use the UI to select a surgical video sample.

5. Configuration:

* Edit `start_app.sh` if you need to change ports or model file names.

## Running Surgical Copilot

1. Run the script:

```
./start_app.sh
```

2. Open your browser at `http://127.0.0.1:8050`. You should see the Surgical Copilot interface:
   * A video sample (`sample_video.mp4`)
   * A chat console
   * A "Start Mic" button to begin ASR.

3. Try speaking or typing:
   * If you say "Take a note: The gallbladder is severely inflamed," the system routes you to NotetakerAgent.
   * If you ask "What are the next steps after dissecting the cystic duct?" it routes you to ChatAgent.

4. Background Annotations:
   * Meanwhile, `AnnotationAgent` writes a file like `procedure_2025_01_18__10_25_03_annotations.json` to the annotations folder every 10 seconds with structured timeline data; a hypothetical record is sketched below.
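
The exact annotation schema is defined by `annotation_agent.py` and `configs/annotation_agent.yaml`. Purely as an illustration, a single record might look like the entry below; the field names are assumptions, not the real schema.

```python
# Hypothetical shape of one AnnotationAgent record -- field names are
# illustrative; consult annotation_agent.py for the actual schema.
import json

example_annotation = {
    "timestamp": "2025-01-18T10:25:13",
    "surgical_phase": "dissection",
    "description": "Cystic duct exposed; clip applied.",
}
print(json.dumps(example_annotation, indent=2))
```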

## Text-to-Speech (Optional)

If you want to enable TTS with ElevenLabs (or implement your own local TTS server):
* Follow the instructions in `index.html` or the code path that calls the TTS route or API.
* Provide your TTS API key if needed.
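
As a rough sketch (not the route wired into this app), a server-side ElevenLabs call can look like the following. The voice ID is a placeholder, the API key is read from the environment, and the endpoint shown is ElevenLabs' public text-to-speech REST API.

```python
# Minimal ElevenLabs TTS sketch -- illustrative only, not the app's TTS route.
# VOICE_ID is a placeholder; export ELEVENLABS_API_KEY before running.
import os

import requests

VOICE_ID = "your-voice-id"  # placeholder
url = f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}"
headers = {
    "xi-api-key": os.environ["ELEVENLABS_API_KEY"],
    "Content-Type": "application/json",
}
resp = requests.post(url, headers=headers, json={"text": "Note recorded."})
resp.raise_for_status()

# The endpoint returns audio bytes (MP3 by default).
with open("tts_output.mp3", "wb") as f:
    f.write(resp.content)
```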

## File Structure

A brief overview:

```
surgical_copilot/
├── agents
│   ├── annotation_agent.py
│   ├── base_agent.py
│   ├── chat_agent.py
│   ├── notetaker_agent.py
│   ├── post_op_note_agent.py
│   └── selector_agent.py
├── app.py
├── configs
│   ├── annotation_agent.yaml
│   ├── chat_agent.yaml
│   ├── notetaker_agent.yaml
│   ├── post_op_note_agent.yaml
│   └── selector.yaml
├── models
│   ├── mmproj-model-f16.gguf
│   └── surgical_copilot_Q_6.gguf
├── README.md            <-- this file
├── requirements.txt
├── start_app.sh         <-- main script to launch everything
├── whisper              <-- directory for the Whisper servers
│   ├── whisper_online_server.py
│   └── jfk.flac
└── web
    ├── static
    │   ├── audio.js
    │   ├── bootstrap.bundle.min.js
    │   ├── bootstrap.css
    │   ├── chat.css
    │   ├── favicon.ico
    │   ├── jquery-3.6.3.min.js
    │   ├── nvidia-logo.png
    │   ├── sample_video.mp4
    │   └── websocket.js
    ├── templates
    │   └── index.html
    └── webserver.py
```