
Commit faacb7a

[fix]: marker strategy rename to remote
1 parent d370dd2 commit faacb7a

3 files changed: +18 -14 lines changed

README.md

Lines changed: 9 additions & 7 deletions
@@ -197,17 +197,19 @@ LLama 3.2 Vision Strategy is licensed on [Meta Community License Agreement](http
 Enabled by default. Please do use the `strategy=llama_vision` CLI and URL parameters to use it. It's by the way the default strategy


-### `marker`
+### `remote`

-[Marker, state of the art PDF OCR](https://github.com/VikParuchuri/marker) - works really great for more than 50 languages, including great accuracy for Polish and other languages - let's say that are "diffult" to read for standard OCR.
+Some OCR's - like [Marker, state of the art PDF OCR](https://github.com/VikParuchuri/marker) - works really great for more than 50 languages, including great accuracy for Polish and other languages - let's say that are "diffult" to read for standard OCR.

 The `marker-pdf` is however licensed on GPL3 license and **therefore it's not included** by default in this application (as we're bound to MIT).

 The weights for the models are licensed cc-by-nc-sa-4.0, but I will waive that for any organization under $5M USD in gross revenue in the most recent 12-month period AND under $5M in lifetime VC/angel funding raised. You also must not be competitive with the Datalab API. If you want to remove the GPL license requirements (dual-license) and/or use the weights commercially over the revenue limit, check out the options here.

-To have it up and running please execute the following steps:
+To have it up and running you can execute the following steps:

 ```bash
+mkdir marker-distribution # this should be outside of the `text-extract-api` folder!
+cd marker-distribution
 pip install marker-pdf
 pip install -U uvicorn fastapi python-multipart
 marker_server --port 8002
@@ -216,16 +218,16 @@ marker_server --port 8002
 **Note: *** you might run `marker_server` on different port - then just make sure you export a proper env setting beffore starting off `text-extract-api` server:

 ```bash
-export MARKER_API_URL=http://localhost:8002/marker/upload
+export REMOTE_API_URL=http://localhost:8002/marker/upload
 ```

-Please do use the `strategy=marker` CLI and URL parameters to use it. For example:
+Please do use the `strategy=remote` CLI and URL parameters to use it. For example:

 ```bash
-curl -X POST -H "Content-Type: multipart/form-data" -F "file=@examples/example-mri.pdf" -F "strategy=marker" -F "ocr_cache=true" -F "prompt=" -F "model=" "http://localhost:8000/ocr/upload"
+curl -X POST -H "Content-Type: multipart/form-data" -F "file=@examples/example-mri.pdf" -F "strategy=remote" -F "ocr_cache=true" -F "prompt=" -F "model=" "http://localhost:8000/ocr/upload"
 ```

-We are connecting to marker via it's API to not share the same license (GPL3) by having it all linked on the source code level.
+We are connecting to remote OCR via it's API to not share the same license (GPL3) by having it all linked on the source code level.

 ## Getting started with Docker

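Reviewer note on the renamed variable: after this commit the strategy reads only `REMOTE_API_URL` and falls back to the local Marker endpoint, so an environment that still exports `MARKER_API_URL` silently gets the default. A minimal sketch of that lookup follows; `resolve_remote_api_url` and `DEFAULT_REMOTE_API_URL` are hypothetical names used for illustration and do not exist in the repository.

```python
import os
from typing import Mapping, Optional

# Default taken from the os.getenv(...) call in marker.py in this commit.
DEFAULT_REMOTE_API_URL = "http://localhost:8002/marker/upload"


def resolve_remote_api_url(environ: Optional[Mapping[str, str]] = None) -> str:
    """Return the remote OCR endpoint the way the renamed strategy looks it up.

    Hypothetical helper: the real code calls os.getenv inline.
    Passing a mapping makes the lookup easy to exercise without
    touching the real process environment.
    """
    env = os.environ if environ is None else environ
    return env.get("REMOTE_API_URL", DEFAULT_REMOTE_API_URL)


# Only the new variable name is honoured; the old one is ignored.
old_style_env = {"MARKER_API_URL": "http://localhost:9000/marker/upload"}
new_style_env = {"REMOTE_API_URL": "http://localhost:9000/marker/upload"}

url_from_old = resolve_remote_api_url(old_style_env)
url_from_new = resolve_remote_api_url(new_style_env)
```

With the old variable set, `url_from_old` is the built-in default; with the new one, `url_from_new` is the exported value. Deployments that pinned a non-default port would need to rename the export.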

config/strategies.yaml

Lines changed: 2 additions & 2 deletions
@@ -5,5 +5,5 @@ strategies:
     class: text_extract_api.extract.strategies.minicpm_v.MiniCPMVStrategy
   easyocr:
     class: text_extract_api.extract.strategies.easyocr.EasyOCRStrategy
-  marker:
-    class: text_extract_api.extract.strategies.marker.MarkerStrategy
+  remote:
+    class: text_extract_api.extract.strategies.marker.RemoteStrategy
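Each entry in `config/strategies.yaml` maps a strategy key to a dotted class path; note that the module path is still `strategies.marker` because the file itself was not renamed. The loader is not part of this diff, so the following is only a rough sketch of how such a path could be resolved, assuming an `importlib`-based lookup. `load_class` is a hypothetical helper, and a standard-library class stands in for the strategy class so the snippet runs on its own.

```python
import importlib


def load_class(dotted_path: str) -> type:
    """Resolve 'package.module.ClassName' to the class object.

    Hypothetical helper for illustration; the project's real loader
    may differ.
    """
    module_path, _, class_name = dotted_path.rpartition(".")
    if not module_path:
        raise ValueError(
            f"Expected 'package.module.ClassName', got {dotted_path!r}"
        )
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


# Same shape as the YAML file: key -> {"class": dotted path}.
# A stdlib class is used here instead of the project's strategy class.
strategies = {"ordered": {"class": "collections.OrderedDict"}}

resolved = {
    name: load_class(entry["class"]) for name, entry in strategies.items()
}
```

Under this scheme a renamed class only has to match the last path segment, which is why the config change in this commit touches the key and the class name but not the module.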

text_extract_api/extract/strategies/marker.py

Lines changed: 7 additions & 5 deletions
@@ -2,21 +2,23 @@
 import tempfile
 import time

+from extract.extract_result import ExtractResult
+
 from text_extract_api.extract.strategies.strategy import Strategy
 from text_extract_api.files.file_formats.file_format import FileFormat
 from text_extract_api.files.file_formats.image import ImageFileFormat
 from text_extract_api.files.file_formats.pdf import PdfFileFormat
 import requests


-class MarkerStrategy(Strategy):
-    """Marker PDF via API - strategy"""
+class RemoteStrategy(Strategy):
+    """Remote API Strategy"""

     @classmethod
     def name(cls) -> str:
         return "marker"

-    def extract_text(self, file_format: FileFormat, language: str = 'en') -> str:
+    def extract_text(self, file_format: FileFormat, language: str = 'en') -> ExtractResult:

         if (
                 not isinstance(file_format, PdfFileFormat)
@@ -38,7 +40,7 @@ def extract_text(self, file_format: FileFormat, language: str = 'en') -> str:
             raise ValueError("No PDF file found - conversion error.")

         try:
-            url = os.getenv("MARKER_API_URL", "http://localhost:8002/marker/upload")
+            url = os.getenv("REMOTE_API_URL", "http://localhost:8002/marker/upload")
             files = {'file': ('document.pdf', pdf_files[0].binary, 'application/pdf')}
             data = {
                 'page_range': None,
@@ -64,4 +66,4 @@ def extract_text(self, file_format: FileFormat, language: str = 'en') -> str:
             print('Error:', e)
             raise Exception("Failed to generate text with Marker PDF API. Make sure marker-pdf server is up and running: marker_server --port 8002. Details: https://github.com/VikParuchuri/marker")

-        return extracted_text
+        return ExtractResult.from_text(extracted_text)
