Commit 959105b

Merge pull request #103 from CatchTheTornado/feat/102-minicpm
[feat] minicpm-v support
2 parents 31b1a07 + 48cc738 commit 959105b

File tree

4 files changed: +101 -3 lines changed

README.md

Lines changed: 29 additions & 3 deletions

@@ -8,7 +8,7 @@ The API is built with FastAPI and uses Celery for asynchronous task processing.
 
 ## Features:
 - **No Cloud/external dependencies** all you need: PyTorch based OCR (EasyOCR) + Ollama are shipped and configured via `docker-compose` no data is sent outside your dev/server environment,
-- **PDF/Office to Markdown** conversion with very high accuracy using different OCR strategies including [llama3.2-vision](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/), [easyOCR](https://github.com/JaidedAI/EasyOCR)
+- **PDF/Office to Markdown** conversion with very high accuracy using different OCR strategies including [llama3.2-vision](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/), [easyOCR](https://github.com/JaidedAI/EasyOCR), [minicpm-v](https://github.com/OpenBMB/MiniCPM-o?tab=readme-ov-file#minicpm-v-26)
 - **PDF/Office to JSON** conversion using Ollama supported models (eg. LLama 3.1)
 - **LLM Improving OCR results** LLama is pretty good with fixing spelling and text issues in the OCR text
 - **Removing PII** This tool can be used for removing Personally Identifiable Information out of document - see `examples`

@@ -162,6 +162,32 @@ python client/cli.py ocr_upload --file examples/example-mri.pdf --ocr_cache --pr
 
 In case of any questions, help requests or just feedback - please [join us on Discord](https://discord.gg/NJzu47Ye3a)!
 
+
+## Text extract strategies
+
+### `easyocr`
+
+EasyOCR is available under an Apache-based license. It is a general-purpose OCR engine with support for more than 30 languages, probably with the best performance for English.
+
+Enabled by default. Use the `strategy=easyocr` CLI and URL parameters to select it.
+
+### `minicpm-v`
+
+MiniCPM-V is an OCR strategy available under an Apache-based license.
+
+The usage of MiniCPM-o/V model weights must strictly follow the [MiniCPM Model License.md](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md).
+
+The models and weights of MiniCPM are completely free for academic research. After filling out a ["questionnaire"](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g) for registration, they are also available for free commercial use.
+
+Enabled by default. Use the `strategy=minicpm_v` CLI and URL parameters to select it.
+
+### `llama_vision`
+
+The LLama 3.2 Vision strategy is licensed under the [Meta Community License Agreement](https://ollama.com/library/llama3.2-vision/blobs/0b4284c1f870). It works great for many languages, although due to the number of parameters (90b) this model is probably **the slowest** one.
+
+Enabled by default. Use the `strategy=llama_vision` CLI and URL parameters to select it. It is, by the way, the default strategy.
+
 ## Getting started with Docker
 
 ### Prerequisites

@@ -410,7 +436,7 @@ apiClient.uploadFile(formData).then(response => {
 - **Method**: POST
 - **Parameters**:
   - **file**: PDF, image or Office file to be processed.
-  - **strategy**: OCR strategy to use (`llama_vision` or `easyocr`).
+  - **strategy**: OCR strategy to use (`llama_vision`, `minicpm_v` or `easyocr`).
   - **ocr_cache**: Whether to cache the OCR result (true or false).
   - **prompt**: When provided, will be used for Ollama processing the OCR result
   - **model**: When provided along with the prompt - this model will be used for LLM processing

@@ -429,7 +455,7 @@ curl -X POST -H "Content-Type: multipart/form-data" -F "file=@examples/example-m
 - **Method**: POST
 - **Parameters** (JSON body):
   - **file**: Base64 encoded PDF file content.
-  - **strategy**: OCR strategy to use (`llama_vision` or `easyocr`).
+  - **strategy**: OCR strategy to use (`llama_vision`, `minicpm_v` or `easyocr`).
   - **ocr_cache**: Whether to cache the OCR result (true or false).
   - **prompt**: When provided, will be used for Ollama processing the OCR result.
   - **model**: When provided along with the prompt - this model will be used for LLM processing.
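The new `minicpm_v` value plugs into the existing `strategy` parameter on both endpoints. As a minimal sketch of the JSON-body variant described above — the field names (`file`, `strategy`, `ocr_cache`) come from the parameter list in this diff, while the stand-in PDF bytes and the choice of HTTP client are assumptions:

```python
import base64
import json

# Build the JSON body for the base64 OCR endpoint, selecting the new strategy.
pdf_bytes = b"%PDF-1.4 example content"  # stand-in for real PDF file bytes
payload = {
    "file": base64.b64encode(pdf_bytes).decode("ascii"),
    "strategy": "minicpm_v",
    "ocr_cache": True,
}
body = json.dumps(payload)
# POST `body` with Content-Type: application/json using any HTTP client.
```

The `file` field round-trips through base64, so the server can recover the original bytes with a single decode.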

config/strategies.yaml

Lines changed: 2 additions & 0 deletions

@@ -1,5 +1,7 @@
 strategies:
   llama_vision:
     class: text_extract_api.extract.strategies.llama_vision.LlamaVisionStrategy
+  minicpm_v:
+    class: text_extract_api.extract.strategies.minicpm_v.MiniCPMVStrategy
   easyocr:
     class: text_extract_api.extract.strategies.easyocr.EasyOCRStrategy
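Each entry maps a strategy name to a dotted class path. The project's actual loader is not part of this diff, so the following is only a sketch of how such a path could be resolved at runtime, demonstrated with a stdlib class since the project package is not importable here:

```python
import importlib

def resolve_class(dotted_path: str):
    """Split 'pkg.module.ClassName' and import the class it points to."""
    module_path, _, class_name = dotted_path.rpartition(".")
    module = importlib.import_module(module_path)
    return getattr(module, class_name)

# Stand-in for e.g. "text_extract_api.extract.strategies.minicpm_v.MiniCPMVStrategy":
cls = resolve_class("collections.OrderedDict")
```

Registering a new strategy is then just a matter of adding a name/class pair to the YAML, which is exactly what this two-line change does.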

scripts/entrypoint.sh

Lines changed: 1 addition & 0 deletions

@@ -24,6 +24,7 @@ else
   echo "Pulling LLM models, please wait until this process is done..."
   python client/cli.py llm_pull --model llama3.1
   python client/cli.py llm_pull --model llama3.2-vision
+  python client/cli.py llm_pull --model minicpm-v
   echo "LLM models are ready!"
 
 echo "Starting FastAPI app..."
text_extract_api/extract/strategies/minicpm_v.py

Lines changed: 69 additions & 0 deletions

@@ -0,0 +1,69 @@
+import os
+import tempfile
+import time
+
+import ollama
+
+from text_extract_api.extract.strategies.strategy import Strategy
+from text_extract_api.files.file_formats.file_format import FileFormat
+from text_extract_api.files.file_formats.image import ImageFileFormat
+
+
+class MiniCPMVStrategy(Strategy):
+    """MiniCPM-V OCR Strategy"""
+
+    @classmethod
+    def name(cls) -> str:
+        return "minicpm_v"
+
+    def extract_text(self, file_format: FileFormat, language: str = 'en') -> str:
+        if (
+            not isinstance(file_format, ImageFileFormat)
+            and not file_format.can_convert_to(ImageFileFormat)
+        ):
+            raise TypeError(
+                f"MiniCPM-V - format {file_format.mime_type} is not supported (yet?)"
+            )
+
+        images = FileFormat.convert_to(file_format, ImageFileFormat)
+        extracted_text = ""
+        start_time = time.time()
+        ocr_percent_done = 0
+        num_pages = len(images)
+        for i, image in enumerate(images):
+            with tempfile.NamedTemporaryFile(suffix=".jpg", delete=False) as temp_file:
+                temp_file.write(image.binary)
+                temp_filename = temp_file.name
+
+            # Generate text using the MiniCPM-V model via Ollama
+            try:
+                response = ollama.chat("minicpm-v", [{
+                    'role': 'user',
+                    'content': os.getenv('MINICPMV_PROMPT', "You are OCR. Convert image to markdown."),
+                    'images': [temp_filename]
+                }], stream=True)
+                num_chunk = 1
+                for chunk in response:
+                    meta = {
+                        'progress': str(30 + ocr_percent_done),
+                        'status': 'OCR Processing'
+                                  + ' (page ' + str(i + 1) + ' of ' + str(num_pages) + ')'
+                                  + ' chunk no: ' + str(num_chunk),
+                        'start_time': start_time,
+                        'elapsed_time': time.time() - start_time,
+                    }
+                    self.update_state_callback(state='PROGRESS', meta=meta)
+                    num_chunk += 1
+                    extracted_text += chunk['message']['content']
+
+                # 20% of the overall work is attributed to OCR - an assumption
+                # taken over from tasks.py
+                ocr_percent_done += int(20 / num_pages)
+            except ollama.ResponseError as e:
+                print('Error:', e.error)
+                raise Exception("Failed to generate text with MiniCPM-V model") from e
+            finally:
+                # Clean up the temp image even if the Ollama call fails
+                os.remove(temp_filename)
+
+        return extracted_text
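The streaming loop above accumulates `chunk['message']['content']` while reporting per-chunk progress. The pattern can be exercised in isolation with a stubbed stream shaped like the output of `ollama.chat(..., stream=True)` — no model call is made, and the helper name below is hypothetical:

```python
import time

def accumulate_stream(response, num_pages=1, page_index=0, report=lambda meta: None):
    """Concatenate streamed message chunks, reporting progress per chunk
    (mirrors the loop in MiniCPMVStrategy.extract_text)."""
    extracted_text = ""
    start_time = time.time()
    for num_chunk, chunk in enumerate(response, start=1):
        report({
            'status': f"OCR Processing (page {page_index + 1} of {num_pages})"
                      f" chunk no: {num_chunk}",
            'start_time': start_time,
            'elapsed_time': time.time() - start_time,
        })
        extracted_text += chunk['message']['content']
    return extracted_text

# Stubbed stream with the same chunk shape the strategy consumes:
fake_stream = ({'message': {'content': part}} for part in ["# Title", "\n", "Body text"])
text = accumulate_stream(fake_stream)
```

Because the callback fires once per chunk, a task queue such as Celery can surface fine-grained progress even for long pages.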
