Skip to content

Commit 3bd883e

Browse files
authored
Merge branch 'main' into pr110
2 parents 3b87d76 + 536802e commit 3bd883e

File tree

12 files changed

+186
-92
lines changed

12 files changed

+186
-92
lines changed

.env.example

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -2,7 +2,7 @@
22
REDIS_CACHE_URL=redis://redis:6379/1
33
OLLAMA_HOST=http://ollama:11434
44
STORAGE_PROFILE_PATH=./storage_profiles
5-
LLAMA_VISION_PROMPT="You are OCR. Convert image to markdown."
5+
REMOTE_API_URL=
66

77
# CLI settings
88
OCR_URL=http://localhost:8000/ocr/upload
@@ -15,3 +15,4 @@ LOAD_FILE_URL=http://localhost:8000/storage/load
1515
DELETE_FILE_URL=http://localhost:8000/storage/delete
1616
OCR_REQUEST_URL=http://localhost:8000/ocr/request
1717
OCR_UPLOAD_URL=http://localhost:8000/ocr/upload
18+

.env.localhost.example

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,7 @@
11
#APP_ENV=production # sets the app into prod mode, othervise dev mode with auto-reload on code changes
22
REDIS_CACHE_URL=redis://localhost:6379/1
3-
LLAMA_VISION_PROMPT="You are OCR. Convert image to markdown."
43
DISABLE_LOCAL_OLLAMA=0
4+
REMOTE_API_URL=
55

66
# CLI settings
77
OCR_URL=http://localhost:8000/ocr/upload

README.md

Lines changed: 47 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -7,8 +7,8 @@ The API is built with FastAPI and uses Celery for asynchronous task processing.
77
![hero doc extract](ocr-hero.webp)
88

99
## Features:
10-
- **No Cloud/external dependencies** all you need: PyTorch based OCR (EasyOCR) + Ollama are shipped and configured via `docker-compose`. No data is sent outside your dev/server environment.
11-
- **PDF/Office to Markdown** conversion with very high accuracy using different OCR strategies including [llama3.2-vision](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/), [easyOCR](https://github.com/JaidedAI/EasyOCR), [minicpm-v](https://github.com/OpenBMB/MiniCPM-o?tab=readme-ov-file#minicpm-v-26)
10+
- **No Cloud/external dependencies** all you need: PyTorch based OCR (EasyOCR) + Ollama are shipped and configured via `docker-compose` no data is sent outside your dev/server environment,
11+
- **PDF/Office to Markdown** conversion with very high accuracy using different OCR strategies including [llama3.2-vision](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/), [easyOCR](https://github.com/JaidedAI/EasyOCR), [minicpm-v](https://github.com/OpenBMB/MiniCPM-o?tab=readme-ov-file#minicpm-v-26), remote URL strategies including [marker-pdf](https://github.com/VikParuchuri/marker)
1212
- **PDF/Office to JSON** conversion using Ollama supported models (eg. LLama 3.1)
1313
- **LLM Improving OCR results** LLama is pretty good with fixing spelling and text issues in the OCR text
1414
- **Removing PII** This tool can be used for removing Personally Identifiable Information out of document - see `examples`
@@ -196,6 +196,49 @@ LLama 3.2 Vision Strategy is licensed on [Meta Community License Agreement](http
196196
197197
Enabled by default. Please do use the `strategy=llama_vision` CLI and URL parameters to use it. It's by the way the default strategy
198198
199+
200+
### `remote`
201+
202+
Some OCR's - like [Marker, state of the art PDF OCR](https://github.com/VikParuchuri/marker) - works really great for more than 50 languages, including great accuracy for Polish and other languages - let's say that are "diffult" to read for standard OCR.
203+
204+
The `marker-pdf` is however licensed on GPL3 license and **therefore it's not included** by default in this application (as we're bound to MIT).
205+
206+
The weights for the models are licensed cc-by-nc-sa-4.0, but I will waive that for any organization under $5M USD in gross revenue in the most recent 12-month period AND under $5M in lifetime VC/angel funding raised. You also must not be competitive with the Datalab API. If you want to remove the GPL license requirements (dual-license) and/or use the weights commercially over the revenue limit, check out the options here.
207+
208+
To have it up and running you can execute the following steps:
209+
210+
```bash
211+
mkdir marker-distribution # this should be outside of the `text-extract-api` folder!
212+
cd marker-distribution
213+
pip install marker-pdf
214+
pip install -U uvicorn fastapi python-multipart
215+
marker_server --port 8002
216+
```
217+
218+
Set the Remote API Url:
219+
220+
**Note: *** you might run `marker_server` on different port or server - then just make sure you export a proper env setting beffore starting off `text-extract-api` server:
221+
222+
```bash
223+
export REMOTE_API_URL=http://localhost:8002/marker/upload
224+
```
225+
226+
**Note: *** the URL might be also set via `/config/strategies.yaml` file
227+
228+
Run the `text-extract-api`:
229+
230+
```bash
231+
make run
232+
```
233+
234+
Please do use the `strategy=remote` CLI and URL parameters to use it. For example:
235+
236+
```bash
237+
curl -X POST -H "Content-Type: multipart/form-data" -F "file=@examples/example-mri.pdf" -F "strategy=remote" -F "ocr_cache=true" -F "prompt=" -F "model=" "http://localhost:8000/ocr/upload"
238+
```
239+
240+
We are connecting to remote OCR via it's API to not share the same license (GPL3) by having it all linked on the source code level.
241+
199242
## Getting started with Docker
200243
201244
### Prerequisites
@@ -444,7 +487,7 @@ apiClient.uploadFile(formData).then(response => {
444487
- **Method**: POST
445488
- **Parameters**:
446489
- **file**: PDF, image or Office file to be processed.
447-
- **strategy**: OCR strategy to use (`llama_vision`, `minicpm_v` or `easyocr`).
490+
- **strategy**: OCR strategy to use (`llama_vision`, `minicpm_v`, `remote` or `easyocr`). See the [available strategies](#text-extract-stratgies)
448491
- **ocr_cache**: Whether to cache the OCR result (true or false).
449492
- **prompt**: When provided, will be used for Ollama processing the OCR result
450493
- **model**: When provided along with the prompt - this model will be used for LLM processing
@@ -463,7 +506,7 @@ curl -X POST -H "Content-Type: multipart/form-data" -F "file=@examples/example-m
463506
- **Method**: POST
464507
- **Parameters** (JSON body):
465508
- **file**: Base64 encoded PDF file content.
466-
- **strategy**: OCR strategy to use (`llama_vision`, `minicpm_v` or `easyocr`).
509+
- **strategy**: OCR strategy to use (`llama_vision`, `minicpm_v`, `remote` or `easyocr`). See the [available strategies](#text-extract-stratgies)
467510
- **ocr_cache**: Whether to cache the OCR result (true or false).
468511
- **prompt**: When provided, will be used for Ollama processing the OCR result.
469512
- **model**: When provided along with the prompt - this model will be used for LLM processing.

config/strategies.yaml

Lines changed: 9 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,14 @@
11
strategies:
22
llama_vision:
3-
class: text_extract_api.extract.strategies.llama_vision.LlamaVisionStrategy
3+
class: text_extract_api.extract.strategies.ollama.OllamaStrategy
4+
model: llama3.2-vision
5+
prompt: You are OCR. Convert image to markdown. Return only the markdown with no explanation text. Do not exclude any content from the page.
46
minicpm_v:
5-
class: text_extract_api.extract.strategies.minicpm_v.MiniCPMVStrategy
7+
class: text_extract_api.extract.strategies.ollama.OllamaStrategy
8+
model: minicpm-v
9+
prompt: You are OCR. Convert image to markdown. Return only the markdown with no explanation text. Do not exclude any content from the page.
610
easyocr:
711
class: text_extract_api.extract.strategies.easyocr.EasyOCRStrategy
12+
remote:
13+
class: text_extract_api.extract.strategies.remote.RemoteStrategy
14+
url:

docker-compose.gpu.yml

Lines changed: 1 addition & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -18,7 +18,7 @@ services:
1818
- LIST_FILES_URL=${LIST_FILES_URL-http://localhost:8000/storage/list}
1919
- LOAD_FILE_URL=${LOAD_FILE_URL-http://localhost:8000/storage/load}
2020
- DELETE_FILE_URL=${DELETE_FILE_URL-http://localhost:8000/storage/delete}
21-
- LLAMA_VISION_PROMPT=${LLAMA_VISION_PROMPT-"You are OCR. Convert image to markdown."}
21+
- REMOTE_API_URL=${REMOTE_API_URL}
2222
depends_on:
2323
- redis
2424
- ollama
@@ -44,7 +44,6 @@ services:
4444
- LIST_FILES_URL=${LIST_FILES_URL-http://localhost:8000/storage/list}
4545
- LOAD_FILE_URL=${LOAD_FILE_URL-http://localhost:8000/storage/load}
4646
- DELETE_FILE_URL=${DELETE_FILE_URL-http://localhost:8000/storage/delete}
47-
- LLAMA_VISION_PROMPT=${LLAMA_VISION_PROMPT-"You are OCR. Convert image to markdown."}
4847
depends_on:
4948
- redis
5049
- fastapi_app

docker-compose.yml

Lines changed: 1 addition & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -18,7 +18,7 @@ services:
1818
- LIST_FILES_URL=${LIST_FILES_URL-http://localhost:8000/storage/list}
1919
- LOAD_FILE_URL=${LOAD_FILE_URL-http://localhost:8000/storage/load}
2020
- DELETE_FILE_URL=${DELETE_FILE_URL-http://localhost:8000/storage/delete}
21-
- LLAMA_VISION_PROMPT=${LLAMA_VISION_PROMPT-"You are OCR. Convert image to markdown."}
21+
- REMOTE_API_URL=${REMOTE_API_URL}
2222
depends_on:
2323
- redis
2424
- ollama
@@ -39,7 +39,6 @@ services:
3939
- LIST_FILES_URL=${LIST_FILES_URL-http://localhost:8000/storage/list}
4040
- LOAD_FILE_URL=${LOAD_FILE_URL-http://localhost:8000/storage/load}
4141
- DELETE_FILE_URL=${DELETE_FILE_URL-http://localhost:8000/storage/delete}
42-
- LLAMA_VISION_PROMPT=${LLAMA_VISION_PROMPT-"You are OCR. Convert image to markdown."}
4342
depends_on:
4443
- redis
4544
- fastapi_app

text_extract_api/extract/strategies/minicpm_v.py

Lines changed: 0 additions & 70 deletions
This file was deleted.

text_extract_api/extract/strategies/llama_vision.py renamed to text_extract_api/extract/strategies/ollama.py

Lines changed: 8 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -10,8 +10,8 @@
1010
from text_extract_api.files.file_formats.image import ImageFileFormat
1111

1212

13-
class LlamaVisionStrategy(Strategy):
14-
"""Llama 3.2 Vision OCR Strategy"""
13+
class OllamaStrategy(Strategy):
14+
"""Ollama models OCR strategy"""
1515

1616
@classmethod
1717
def name(cls) -> str:
@@ -24,7 +24,7 @@ def extract_text(self, file_format: FileFormat, language: str = 'en') -> Extract
2424
and not file_format.can_convert_to(ImageFileFormat)
2525
):
2626
raise TypeError(
27-
f"Llama Vision - format {file_format.mime_type} is not supported (yet?)"
27+
f"Ollama OCR - format {file_format.mime_type} is not supported (yet?)"
2828
)
2929

3030
images = FileFormat.convert_to(file_format, ImageFileFormat)
@@ -38,11 +38,12 @@ def extract_text(self, file_format: FileFormat, language: str = 'en') -> Extract
3838
temp_file.write(image.binary)
3939
temp_filename = temp_file.name
4040

41-
# Generate text using the Llama 3.2 Vision model
41+
print(self._strategy_config)
42+
# Generate text using the specified model
4243
try:
43-
response = ollama.chat("llama3.2-vision", [{
44+
response = ollama.chat(self._strategy_config.get('model'), [{
4445
'role': 'user',
45-
'content': os.getenv('LLAMA_VISION_PROMPT', "You are OCR. Convert image to markdown."),
46+
'content': self._strategy_config.get('prompt'),
4647
'images': [temp_filename]
4748
}], stream=True)
4849
os.remove(temp_filename)
@@ -63,7 +64,7 @@ def extract_text(self, file_format: FileFormat, language: str = 'en') -> Extract
6364
20 / num_pages) # 20% of work is for OCR - just a stupid assumption from tasks.py
6465
except ollama.ResponseError as e:
6566
print('Error:', e.error)
66-
raise Exception("Failed to generate text with Llama 3.2 Vision model")
67+
raise Exception("Failed to generate text with Ollama model " + self._strategy_config.get('model'))
6768

6869
print(response)
6970

Lines changed: 71 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,71 @@
1+
import os
2+
import tempfile
3+
import time
4+
5+
from extract.extract_result import ExtractResult
6+
7+
from text_extract_api.extract.strategies.strategy import Strategy
8+
from text_extract_api.files.file_formats.file_format import FileFormat
9+
from text_extract_api.files.file_formats.image import ImageFileFormat
10+
from text_extract_api.files.file_formats.pdf import PdfFileFormat
11+
import requests
12+
13+
14+
class RemoteStrategy(Strategy):
15+
"""Remote API Strategy"""
16+
17+
@classmethod
18+
def name(cls) -> str:
19+
return "remote"
20+
21+
def extract_text(self, file_format: FileFormat, language: str = 'en') -> ExtractResult:
22+
23+
if (
24+
not isinstance(file_format, PdfFileFormat)
25+
and not file_format.can_convert_to(PdfFileFormat)
26+
):
27+
raise TypeError(
28+
f"Marker PDF - format {file_format.mime_type} is not supported (yet?)"
29+
)
30+
31+
pdf_files = FileFormat.convert_to(file_format, PdfFileFormat)
32+
extracted_text = ""
33+
start_time = time.time()
34+
ocr_percent_done = 0
35+
36+
if len(pdf_files) > 1:
37+
raise ValueError("Only one PDF file is supported.")
38+
39+
if len(pdf_files) == 0:
40+
raise ValueError("No PDF file found - conversion error.")
41+
42+
try:
43+
url = os.getenv("REMOTE_API_URL", self._strategy_config.get("url"))
44+
if not url:
45+
raise Exception('Please do set the REMOTE_API_URL environment variable: export REMOTE_API_URL=http://...')
46+
files = {'file': ('document.pdf', pdf_files[0].binary, 'application/pdf')}
47+
data = {
48+
'page_range': None,
49+
'languages': language,
50+
'force_ocr': False,
51+
'paginate_output': False,
52+
'output_format': 'markdown' # TODO: support JSON output format
53+
}
54+
55+
meta = {
56+
'progress': str(30 + ocr_percent_done),
57+
'status': 'OCR Processing',
58+
'start_time': start_time,
59+
'elapsed_time': time.time() - start_time}
60+
self.update_state_callback(state='PROGRESS', meta=meta)
61+
62+
response = requests.post(url, files=files, data=data)
63+
if response.status_code != 200:
64+
raise Exception(f"Failed to upload PDF file: {response.content}")
65+
66+
extracted_text = response.json().get('output', '')
67+
except Exception as e:
68+
print('Error:', e)
69+
raise Exception("Failed to generate text with Remote API. Make sure the remote server is up and running")
70+
71+
return ExtractResult.from_text(extracted_text)

text_extract_api/extract/strategies/strategy.py

Lines changed: 9 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -12,9 +12,14 @@
1212

1313
class Strategy:
1414
_strategies: Dict[str, Strategy] = {}
15+
_strategy_config: Dict[str, Dict] = {}
1516

1617
def __init__(self):
1718
self.update_state_callback = None
19+
self._strategy_config = None
20+
21+
def set_strategy_config(self, config: Dict):
22+
self._strategy_config = config
1823

1924
def set_update_state_callback(self, callback):
2025
self.update_state_callback = callback
@@ -88,8 +93,10 @@ def load_strategies_from_config(cls, path: str = os.getenv('OCR_CONFIG_PATH', 'c
8893
module = importlib.import_module(module_path)
8994

9095
strategy = getattr(module, class_name)
91-
92-
cls.register_strategy(strategy(), strategy_name)
96+
strategy_instance = strategy()
97+
strategy_instance.set_strategy_config(strategy_config)
98+
99+
cls.register_strategy(strategy_instance, strategy_name)
93100
print(f"Loaded strategy from {config_file_path} {strategy_name} [{strategy_class_path}]")
94101

95102
return strategies

0 commit comments

Comments
 (0)