Commit 8d8cea0

Merge remote-tracking branch 'origin/main' into feature/54-add-docling-support

# Conflicts:
#	config/strategies.yaml
#	text_extract_api/extract/tasks.py
2 parents b8787dd + 536802e commit 8d8cea0

18 files changed: +351 -41 lines changed

.env.example

Lines changed: 2 additions & 1 deletion
@@ -2,7 +2,7 @@
22
REDIS_CACHE_URL=redis://redis:6379/1
33
OLLAMA_HOST=http://ollama:11434
44
STORAGE_PROFILE_PATH=./storage_profiles
5-
LLAMA_VISION_PROMPT="You are OCR. Convert image to markdown."
5+
REMOTE_API_URL=
66

77
# CLI settings
88
OCR_URL=http://localhost:8000/ocr/upload
@@ -15,3 +15,4 @@ LOAD_FILE_URL=http://localhost:8000/storage/load
1515
DELETE_FILE_URL=http://localhost:8000/storage/delete
1616
OCR_REQUEST_URL=http://localhost:8000/ocr/request
1717
OCR_UPLOAD_URL=http://localhost:8000/ocr/upload
18+

.env.localhost.example

Lines changed: 2 additions & 1 deletion
@@ -1,6 +1,7 @@
11
#APP_ENV=production # sets the app into prod mode, otherwise dev mode with auto-reload on code changes
22
REDIS_CACHE_URL=redis://localhost:6379/1
3-
LLAMA_VISION_PROMPT="You are OCR. Convert image to markdown."
3+
DISABLE_LOCAL_OLLAMA=0
4+
REMOTE_API_URL=
45

56
# CLI settings
67
OCR_URL=http://localhost:8000/ocr/upload

Makefile

Lines changed: 8 additions & 0 deletions
@@ -3,6 +3,13 @@ SHELL := /bin/bash
33
export DISABLE_VENV ?= 0
44
export DISABLE_LOCAL_OLLAMA ?= 0
55

6+
define load_env
7+
@if [ -f $(1) ]; then \
8+
echo "Loading environment from $(1)"; \
9+
set -o allexport; source $(1); set +o allexport; \
10+
fi
11+
endef
12+
613
.PHONY: help
714
help:
815
@echo "Available commands:"
@@ -81,6 +88,7 @@ install-requirements:
8188

8289
.PHONY: run
8390
run:
91+
@$(call load_env,.env.localhost)
8492
@echo "Starting the local application server..."; \
8593
DISABLE_VENV=$(DISABLE_VENV) DISABLE_LOCAL_OLLAMA=$(DISABLE_LOCAL_OLLAMA) ./run.sh
8694
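
The `load_env` macro above relies on `set -o allexport; source <file>` to export every `KEY=VALUE` pair from `.env.localhost`. As a rough illustration of what that step does, here is a minimal Python sketch. The helper names `parse_env_lines` and `load_env` are hypothetical and not part of this repository, and the parser handles only plain `KEY=VALUE` lines, full-line `#` comments, and one pair of surrounding quotes (no inline comments or shell expansion). One thing worth verifying in the Makefile itself: GNU Make normally runs each recipe line in its own shell, so variables sourced on the `load_env` line may not be visible to `./run.sh` on the following line unless both run in the same shell invocation.

```python
import os
from typing import Dict, Iterable, MutableMapping, Optional


def parse_env_lines(lines: Iterable[str]) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and full-line '#' comments."""
    parsed: Dict[str, str] = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Drop one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        parsed[key.strip()] = value
    return parsed


def load_env(
    lines: Iterable[str],
    environ: Optional[MutableMapping[str, str]] = None,
) -> Dict[str, str]:
    """Export parsed pairs into `environ` (os.environ when not given)."""
    target = os.environ if environ is None else environ
    parsed = parse_env_lines(lines)
    target.update(parsed)
    return parsed


# Example: load into a plain dict instead of the real process environment.
demo_env: Dict[str, str] = {}
load_env(
    [
        "# CLI settings",
        "REDIS_CACHE_URL=redis://localhost:6379/1",
        "DISABLE_LOCAL_OLLAMA=0",
        "REMOTE_API_URL=",
    ],
    demo_env,
)
```

Passing a dict as `environ` keeps the sketch side-effect free; omitting it would write into `os.environ` instead.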

README.md

Lines changed: 81 additions & 4 deletions
@@ -8,7 +8,7 @@ The API is built with FastAPI and uses Celery for asynchronous task processing.
88

99
## Features:
1010
- **No Cloud/external dependencies** all you need: PyTorch based OCR (EasyOCR) + Ollama are shipped and configured via `docker-compose` no data is sent outside your dev/server environment,
11-
- **PDF/Office to Markdown** conversion with very high accuracy using different OCR strategies including [llama3.2-vision](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/), [easyOCR](https://github.com/JaidedAI/EasyOCR)
11+
- **PDF/Office to Markdown** conversion with very high accuracy using different OCR strategies including [llama3.2-vision](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/), [easyOCR](https://github.com/JaidedAI/EasyOCR), [minicpm-v](https://github.com/OpenBMB/MiniCPM-o?tab=readme-ov-file#minicpm-v-26), remote URL strategies including [marker-pdf](https://github.com/VikParuchuri/marker)
1212
- **PDF/Office to JSON** conversion using Ollama supported models (eg. LLama 3.1)
1313
- **LLM Improving OCR results** LLama is pretty good with fixing spelling and text issues in the OCR text
1414
- **Removing PII** This tool can be used for removing Personally Identifiable Information out of document - see `examples`
@@ -162,6 +162,83 @@ python client/cli.py ocr_upload --file examples/example-mri.pdf --ocr_cache --pr
162162
163163
In case of any questions, help requests or just feedback - please [join us on Discord](https://discord.gg/NJzu47Ye3a)!
164164
165+
166+
## Text extract stratgies
167+
168+
### `easyocr`
169+
170+
EasyOCR is available under an Apache-based license. It is a general-purpose OCR engine with support for more than 30 languages, probably with the best performance for English.
171+
172+
Enabled by default. Pass `strategy=easyocr` as a CLI or URL parameter to use it.
173+
174+
175+
### `minicpm-v`
176+
177+
MiniCPM-V is an OCR strategy released under an Apache-based license.
178+
179+
The usage of MiniCPM-o/V model weights must strictly follow [MiniCPM Model License.md](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md).
180+
181+
The models and weights of MiniCPM are completely free for academic research. After filling out a ["questionnaire"](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g) for registration, they are also available for free commercial use.
182+
183+
Enabled by default. Pass `strategy=minicpm_v` as a CLI or URL parameter to use it.
184+
185+
| ⚠️ **Remember to pull the model in Ollama first** |
186+
|---------------------------------------------------------|
187+
| You need to pull the model in Ollama - use the command: |
188+
| `python client/cli.py llm_pull --model minicpm-v` |
189+
| Or, if you have Ollama locally: `ollama pull minicpm-v` |
190+
191+
192+
193+
### `llama_vision`
194+
195+
The Llama 3.2 Vision strategy is licensed under the [Meta Community License Agreement](https://ollama.com/library/llama3.2-vision/blobs/0b4284c1f870). It works great for many languages, although due to the number of parameters (90b) this model is probably **the slowest** one.
196+
197+
Enabled by default. Pass `strategy=llama_vision` as a CLI or URL parameter to use it. This is also the default strategy.
198+
199+
200+
### `remote`
201+
202+
Some OCR engines, such as [Marker, state of the art PDF OCR](https://github.com/VikParuchuri/marker), work really well for more than 50 languages, with great accuracy for Polish and other languages that are "difficult" for standard OCR to read.
203+
204+
However, `marker-pdf` is licensed under GPL3 and **therefore it's not included** by default in this application (as we're bound to MIT).
205+
206+
The weights for the models are licensed cc-by-nc-sa-4.0, but I will waive that for any organization under $5M USD in gross revenue in the most recent 12-month period AND under $5M in lifetime VC/angel funding raised. You also must not be competitive with the Datalab API. If you want to remove the GPL license requirements (dual-license) and/or use the weights commercially over the revenue limit, check out the options here.
207+
208+
To have it up and running you can execute the following steps:
209+
210+
```bash
211+
mkdir marker-distribution # this should be outside of the `text-extract-api` folder!
212+
cd marker-distribution
213+
pip install marker-pdf
214+
pip install -U uvicorn fastapi python-multipart
215+
marker_server --port 8002
216+
```
217+
218+
Set the remote API URL:
219+
220+
**Note:** you might run `marker_server` on a different port or server. In that case, make sure you export the proper env setting before starting the `text-extract-api` server:
221+
222+
```bash
223+
export REMOTE_API_URL=http://localhost:8002/marker/upload
224+
```
225+
226+
**Note:** the URL can also be set in the `/config/strategies.yaml` file.
227+
228+
Run the `text-extract-api`:
229+
230+
```bash
231+
make run
232+
```
233+
234+
Pass `strategy=remote` as a CLI or URL parameter to use it. For example:
235+
236+
```bash
237+
curl -X POST -H "Content-Type: multipart/form-data" -F "file=@examples/example-mri.pdf" -F "strategy=remote" -F "ocr_cache=true" -F "prompt=" -F "model=" "http://localhost:8000/ocr/upload"
238+
```
239+
240+
We connect to the remote OCR via its API so that the GPL3 code is not linked at the source-code level and this project does not have to share its license.
241+
165242
## Getting started with Docker
166243
167244
### Prerequisites
@@ -410,7 +487,7 @@ apiClient.uploadFile(formData).then(response => {
410487
- **Method**: POST
411488
- **Parameters**:
412489
- **file**: PDF, image or Office file to be processed.
413-
- **strategy**: OCR strategy to use (`llama_vision` or `easyocr`).
490+
- **strategy**: OCR strategy to use (`llama_vision`, `minicpm_v`, `remote` or `easyocr`). See the [available strategies](#text-extract-stratgies)
414491
- **ocr_cache**: Whether to cache the OCR result (true or false).
415492
- **prompt**: When provided, will be used for Ollama processing the OCR result
416493
- **model**: When provided along with the prompt - this model will be used for LLM processing
@@ -429,7 +506,7 @@ curl -X POST -H "Content-Type: multipart/form-data" -F "file=@examples/example-m
429506
- **Method**: POST
430507
- **Parameters** (JSON body):
431508
- **file**: Base64 encoded PDF file content.
432-
- **strategy**: OCR strategy to use (`llama_vision` or `easyocr`).
509+
- **strategy**: OCR strategy to use (`llama_vision`, `minicpm_v`, `remote` or `easyocr`). See the [available strategies](#text-extract-stratgies)
433510
- **ocr_cache**: Whether to cache the OCR result (true or false).
434511
- **prompt**: When provided, will be used for Ollama processing the OCR result.
435512
- **model**: When provided along with the prompt - this model will be used for LLM processing.
@@ -447,7 +524,7 @@ curl -X POST "http://localhost:8000/ocr/request" -H "Content-Type: application/j
447524
"prompt": "",
448525
"model": "llama3.1",
449526
"storage_profile": "default",
450-
"storage_filename": "example.pdf"
527+
"storage_filename": "example.md"
451528
}'
452529
```
453530
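
The JSON body documented above for `/ocr/request` can be assembled programmatically. The following is a minimal sketch, not the project's client code: `build_ocr_request_payload` and `KNOWN_STRATEGIES` are hypothetical names, the strategy list is copied from the README parameter description, and sending `ocr_cache` as a JSON boolean is an assumption. The sketch only builds the payload and does not perform any HTTP call.

```python
import base64
import json
from typing import Any, Dict

# Strategy names as listed in the README parameter description (assumed complete).
KNOWN_STRATEGIES = ("llama_vision", "minicpm_v", "remote", "easyocr")


def build_ocr_request_payload(
    file_bytes: bytes,
    strategy: str = "llama_vision",
    ocr_cache: bool = True,
    prompt: str = "",
    model: str = "llama3.1",
    storage_profile: str = "default",
    storage_filename: str = "example.md",
) -> Dict[str, Any]:
    """Build the JSON body for POST /ocr/request (hypothetical helper)."""
    if strategy not in KNOWN_STRATEGIES:
        raise ValueError(f"Unknown strategy: {strategy!r}")
    return {
        # The endpoint expects the file content Base64 encoded.
        "file": base64.b64encode(file_bytes).decode("ascii"),
        "strategy": strategy,
        "ocr_cache": ocr_cache,
        "prompt": prompt,
        "model": model,
        "storage_profile": storage_profile,
        "storage_filename": storage_filename,
    }


payload = build_ocr_request_payload(b"%PDF-1.4 demo", strategy="remote")
body = json.dumps(payload)  # string suitable for an application/json request body
```

The resulting `body` could then be posted to `http://localhost:8000/ocr/request` with any HTTP client.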

config/strategies.yaml

Lines changed: 12 additions & 1 deletion
@@ -1,7 +1,18 @@
11
strategies:
22
llama_vision:
3-
class: text_extract_api.extract.strategies.llama_vision.LlamaVisionStrategy
3+
class: text_extract_api.extract.strategies.ollama.OllamaStrategy
4+
model: llama3.2-vision
5+
prompt: You are OCR. Convert image to markdown. Return only the markdown with no explanation text. Do not exclude any content from the page.
6+
minicpm_v:
7+
class: text_extract_api.extract.strategies.ollama.OllamaStrategy
8+
model: minicpm-v
9+
prompt: You are OCR. Convert image to markdown. Return only the markdown with no explanation text. Do not exclude any content from the page.
410
easyocr:
511
class: text_extract_api.extract.strategies.easyocr.EasyOCRStrategy
612
docling:
713
class: text_extract_api.extract.strategies.docling.DoclingStrategy
14+
15+
# remote strategy example:
16+
#remote:
17+
# class: text_extract_api.extract.strategies.remote.RemoteStrategy
18+
# url:
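
Each entry in `strategies.yaml` pairs a dotted `class` path with per-strategy settings such as `model`, `prompt`, or `url`. How the application resolves these entries is not shown in this diff, so the following is only a sketch of one common approach using `importlib`. The helper names `load_class` and `split_entry` are hypothetical, a plain dict stands in for the parsed YAML, and a standard-library class is used so the example runs without the project installed.

```python
import importlib
from typing import Any, Dict, Tuple


def load_class(dotted_path: str) -> type:
    """Import `package.module.ClassName` and return the class object."""
    module_path, _, class_name = dotted_path.rpartition(".")
    if not module_path:
        raise ValueError(f"Expected a dotted path, got: {dotted_path!r}")
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


def split_entry(entry: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    """Separate the `class` path from the remaining per-strategy settings."""
    settings = {key: value for key, value in entry.items() if key != "class"}
    return entry["class"], settings


# Stand-in for one parsed YAML entry; a stdlib class replaces the real strategy
# class so this sketch does not depend on text_extract_api being importable.
EXAMPLE_ENTRY = {"class": "collections.OrderedDict", "model": "minicpm-v"}

class_path, settings = split_entry(EXAMPLE_ENTRY)
strategy_class = load_class(class_path)
```

With this shape, two entries such as `llama_vision` and `minicpm_v` can share one class and differ only in their settings, which matches the layout of the config above.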

docker-compose.gpu.yml

Lines changed: 1 addition & 2 deletions
@@ -18,7 +18,7 @@ services:
1818
- LIST_FILES_URL=${LIST_FILES_URL-http://localhost:8000/storage/list}
1919
- LOAD_FILE_URL=${LOAD_FILE_URL-http://localhost:8000/storage/load}
2020
- DELETE_FILE_URL=${DELETE_FILE_URL-http://localhost:8000/storage/delete}
21-
- LLAMA_VISION_PROMPT=${LLAMA_VISION_PROMPT-"You are OCR. Convert image to markdown."}
21+
- REMOTE_API_URL=${REMOTE_API_URL}
2222
depends_on:
2323
- redis
2424
- ollama
@@ -44,7 +44,6 @@ services:
4444
- LIST_FILES_URL=${LIST_FILES_URL-http://localhost:8000/storage/list}
4545
- LOAD_FILE_URL=${LOAD_FILE_URL-http://localhost:8000/storage/load}
4646
- DELETE_FILE_URL=${DELETE_FILE_URL-http://localhost:8000/storage/delete}
47-
- LLAMA_VISION_PROMPT=${LLAMA_VISION_PROMPT-"You are OCR. Convert image to markdown."}
4847
depends_on:
4948
- redis
5049
- fastapi_app

docker-compose.yml

Lines changed: 1 addition & 2 deletions
@@ -18,7 +18,7 @@ services:
1818
- LIST_FILES_URL=${LIST_FILES_URL-http://localhost:8000/storage/list}
1919
- LOAD_FILE_URL=${LOAD_FILE_URL-http://localhost:8000/storage/load}
2020
- DELETE_FILE_URL=${DELETE_FILE_URL-http://localhost:8000/storage/delete}
21-
- LLAMA_VISION_PROMPT=${LLAMA_VISION_PROMPT-"You are OCR. Convert image to markdown."}
21+
- REMOTE_API_URL=${REMOTE_API_URL}
2222
depends_on:
2323
- redis
2424
- ollama
@@ -39,7 +39,6 @@ services:
3939
- LIST_FILES_URL=${LIST_FILES_URL-http://localhost:8000/storage/list}
4040
- LOAD_FILE_URL=${LOAD_FILE_URL-http://localhost:8000/storage/load}
4141
- DELETE_FILE_URL=${DELETE_FILE_URL-http://localhost:8000/storage/delete}
42-
- LLAMA_VISION_PROMPT=${LLAMA_VISION_PROMPT-"You are OCR. Convert image to markdown."}
4342
depends_on:
4443
- redis
4544
- fastapi_app

scripts/entrypoint.sh

Lines changed: 1 addition & 0 deletions
@@ -24,6 +24,7 @@ else
2424
echo "Pulling LLM models, please wait until this process is done..."
2525
python client/cli.py llm_pull --model llama3.1
2626
python client/cli.py llm_pull --model llama3.2-vision
27+
python client/cli.py llm_pull --model minicpm-v
2728
echo "LLM models are ready!"
2829

2930
echo "Starting FastAPI app..."
Lines changed: 90 additions & 0 deletions
@@ -0,0 +1,90 @@
1+
from typing import Callable, Any
2+
3+
"""
4+
IMPORTANT INFORMATION ABOUT THIS CLASS:
5+
6+
This is not the final version of the object, namespace, or intended use.
7+
8+
For this reason, I am not creating an interface, etc. Add code here as soon as possible
9+
along with further integrations, and once we have gained sufficient experience, we will
10+
undertake a refactor.
11+
12+
Currently, the object's purpose is to replace the use of a primitive type, a string, for
13+
extract returns. The limitation of this approach became evident when returning only the
14+
resulting string caused us to lose valuable metadata about the document. Thanks to this
15+
class, we retain DoclingDocument and foresee that other converters/OCRs may have similar
16+
metadata.
17+
"""
18+
class ExtractResult:
19+
def __init__(
20+
self,
21+
value: Any,
22+
text_gatherer: Callable[[Any], str] = None
23+
):
24+
"""
25+
Initializes an ExtractResult instance.
26+
27+
Args:
28+
value (Any): The object containing or representing the text.
29+
text_gatherer (Callable[[Any], str], optional): A callable that extracts text
30+
from `value`. Defaults to `_default_text_gatherer`.
31+
32+
Raises:
33+
ValueError: If `text_gatherer` is not callable or not provided when `value` is not a string.
34+
35+
Examples:
36+
Using the default text gatherer
37+
38+
>>> unified = ExtractResult("Example text")
39+
>>> print(unified.text)
40+
Example text
41+
42+
Using a custom text gatherer
43+
44+
>>> def custom_gatherer(value): return f"Custom: {value}"
45+
>>> unified = ExtractResult(123, custom_gatherer)
46+
>>> print(unified.text)
47+
Custom: 123
48+
"""
49+
50+
if text_gatherer is not None and not callable(text_gatherer):
51+
raise ValueError("The `text_gatherer` provided to ExtractResult must be a callable.")
52+
53+
if not isinstance(value, str) and not callable(text_gatherer):
54+
raise ValueError("If `value` is not a string, `text_gatherer` must be provided.")
55+
56+
self.value = value
57+
self.text_gatherer = text_gatherer or self._default_text_gatherer
58+
59+
@staticmethod
60+
def from_text(value: str) -> 'ExtractResult':
61+
return ExtractResult(value)
62+
63+
@property
64+
def text(self) -> str:
65+
"""
66+
Retrieves text using the text gatherer.
67+
68+
Returns:
69+
str: The extracted text from `value`.
70+
"""
71+
return self.text_gatherer(self.value)
72+
73+
@staticmethod
74+
def _default_text_gatherer(value: Any) -> str:
75+
"""
76+
Default method to extract str from str.
77+
It simply returns the value unchanged.
78+
79+
Args:
80+
value (Any): The input value.
81+
82+
Returns:
83+
str: The text representation of the input value.
84+
85+
Raises:
86+
TypeError: If the `value` is not a string.
87+
"""
88+
if isinstance(value, str):
89+
return value
90+
raise TypeError("Default text gatherer only supports strings.")
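
To illustrate the motivation in the module docstring (keeping a structured result while still exposing plain text), here is a small sketch. The `ExtractResult` below is a condensed stand-in that mirrors the class in this diff so the example runs on its own. `FakeDocument` and its `to_markdown` method are hypothetical placeholders for a richer converter result and are not taken from any real library.

```python
from typing import Any, Callable, List, Optional


class ExtractResult:
    """Condensed stand-in mirroring the class added in this commit."""

    def __init__(self, value: Any, text_gatherer: Optional[Callable[[Any], str]] = None):
        if text_gatherer is not None and not callable(text_gatherer):
            raise ValueError("`text_gatherer` must be a callable.")
        if not isinstance(value, str) and text_gatherer is None:
            raise ValueError("If `value` is not a string, `text_gatherer` must be provided.")
        self.value = value
        # For plain strings the identity function is enough.
        self.text_gatherer = text_gatherer or (lambda v: v)

    @property
    def text(self) -> str:
        return self.text_gatherer(self.value)


class FakeDocument:
    """Hypothetical structured result that carries pages next to its text."""

    def __init__(self, pages: List[str]):
        self.pages = pages

    def to_markdown(self) -> str:
        return "\n\n".join(self.pages)


doc = FakeDocument(["# Page 1", "# Page 2"])
result = ExtractResult(doc, lambda d: d.to_markdown())

# `text` is a property, so it is read without parentheses.
markdown = result.text
# The original object stays available for callers that need its metadata.
original = result.value
```

The design keeps the text extraction lazy: the gatherer runs only when `text` is read, while `value` preserves whatever the converter returned.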

text_extract_api/extract/strategies/easyocr.py

Lines changed: 5 additions & 2 deletions
@@ -3,6 +3,7 @@
33
from PIL import Image
44
import easyocr
55

6+
from text_extract_api.extract.extract_result import ExtractResult
67
from text_extract_api.extract.strategies.strategy import Strategy
78
from text_extract_api.files.file_formats.file_format import FileFormat
89
from text_extract_api.files.file_formats.image import ImageFileFormat
@@ -13,7 +14,7 @@ class EasyOCRStrategy(Strategy):
1314
def name(cls) -> str:
1415
return "easyOCR"
1516

16-
def extract_text(self, file_format: FileFormat, language: str = 'en') -> str:
17+
def extract_text(self, file_format: FileFormat, language: str = 'en') -> ExtractResult:
1718
"""
1819
Extract text using EasyOCR after converting the input file to images
1920
(if not already an ImageFileFormat).
@@ -53,4 +54,6 @@ def extract_text(self, file_format: FileFormat, language: str = 'en') -> str:
5354

5455
# Join text from all images/pages
5556
full_text = "\n\n".join(all_extracted_text)
56-
return full_text
57+
58+
59+
return ExtractResult.from_text(full_text)
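
Because `extract_text` now returns an `ExtractResult` instead of a `str`, callers that previously used the return value directly need to read its `text` property. During a migration it can help to accept both shapes. The helper below is a sketch under that assumption: `as_text` and `LegacyFreeResult` are hypothetical names that do not appear in this commit.

```python
from typing import Any


def as_text(result: Any) -> str:
    """Return plain text from a legacy `str` or an object exposing a `text` string."""
    if isinstance(result, str):
        return result
    text = getattr(result, "text", None)
    if isinstance(text, str):
        return text
    raise TypeError(f"Cannot get text from {type(result).__name__}")


class LegacyFreeResult:
    """Tiny stand-in with the same `text` property shape as ExtractResult."""

    def __init__(self, value: str):
        self._value = value

    @property
    def text(self) -> str:
        return self._value


old_style = as_text("page one\n\npage two")
new_style = as_text(LegacyFreeResult("page one\n\npage two"))
```

Once every strategy returns `ExtractResult`, a helper like this could be dropped in favor of reading `.text` directly.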
