Commit 9562593

feat: LICENSE change, marker removed
1 parent fe37006 commit 9562593

17 files changed: 70 additions & 768 deletions

LICENSE

Lines changed: 21 additions & 674 deletions
Large diffs are not rendered by default.

Makefile

Lines changed: 2 additions & 2 deletions
@@ -66,12 +66,12 @@ setup-local:
 .PHONY: install-linux
 install-linux:
 	@echo -e "\033[1;34m Installing Linux dependencies...\033[0m"; \
-	sudo apt update && sudo apt install -y libmagic1 tesseract-ocr poppler-utils pkg-config
+	sudo apt update && sudo apt install -y libmagic1 poppler-utils pkg-config
 
 .PHONY: install-macos
 install-macos:
 	@echo -e "\033[1;34m Installing macOS dependencies...\033[0m"; \
-	brew update && brew install libmagic tesseract poppler pkg-config ghostscript ffmpeg automake autoconf
+	brew update && brew install libmagic poppler pkg-config ghostscript ffmpeg automake autoconf
 
 .PHONY: install-requirements
 install-requirements:

README.md

Lines changed: 13 additions & 17 deletions
@@ -7,8 +7,8 @@ The API is built with FastAPI and uses Celery for asynchronous task processing.
 ![hero doc extract](ocr-hero.webp)
 
 ## Features:
-- **No Cloud/external dependencies** all you need: PyTorch based OCR (Marker) + Ollama are shipped and configured via `docker-compose` no data is sent outside your dev/server environment,
-- **PDF/Office to Markdown** conversion with very high accuracy using different OCR strategies including [marker](https://github.com/VikParuchuri/marker) and [llama3.2-vision](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/), [surya-ocr](https://github.com/VikParuchuri/surya) or [tessereact](https://github.com/h/pytesseract)
+- **No Cloud/external dependencies** all you need: PyTorch based OCR (EasyOCR) + Ollama are shipped and configured via `docker-compose` no data is sent outside your dev/server environment,
+- **PDF/Office to Markdown** conversion with very high accuracy using different OCR strategies including [llama3.2-vision](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/), [easyOCR](https://github.com/JaidedAI/EasyOCR)
 - **PDF/Office to JSON** conversion using Ollama supported models (eg. LLama 3.1)
 - **LLM Improving OCR results** LLama is pretty good with fixing spelling and text issues in the OCR text
 - **Removing PII** This tool can be used for removing Personally Identifiable Information out of document - see `examples`
@@ -39,8 +39,6 @@ Before running the example see [getting started](#getting-started)
 
 ![Converting Invoice to JSON](./screenshots/example-2.png)
 
-**Note:** As you may observe in the example above, `marker-pdf` sometimes mismatches the cols and rows which could have potentially great impact on data accuracy. To improve on it there is a feature request [#3](https://github.com/CatchTheTornado/text-extract-api/issues/3) for adding alternative support for [`tabled`](https://github.com/VikParuchuri/tabled) model - which is optimized for tables.
-
 ## Getting started
 
 You might want to run the app directly on your machine for development purposes OR to use for example Apple GPUs (which are not supported by Docker at the moment).
@@ -114,7 +112,7 @@ This command will install all the dependencies - including Redis (via Docker, so
 
 (MAC) - Dependencies
 ```
-brew update && brew install libmagic tesseract poppler pkg-config ghostscript ffmpeg automake autoconf
+brew update && brew install libmagic poppler pkg-config ghostscript ffmpeg automake autoconf
 ```
 
 (Mac) - You need to startup the celery worker
@@ -312,9 +310,11 @@ python client/cli.py llm_pull --model llama3.2-vision
 and only after to run this specific prompt query:
 
 ```bash
-python client/cli.py ocr_upload --file examples/example-mri.pdf --ocr_cache --prompt_file=examples/example-mri-remove-pii.txt
+python client/cli.py ocr_upload --file examples/example-mri.pdf --ocr_cache --prompt_file=examples/example-mri-remove-pii.txt --language en
 ```
 
+**Note:** The language argument is used for the OCR strategy to load the model weights for the selected language. You can specify multiple languages as a list: `en,de,pl` etc.
+
 The `ocr` command can store the results using the `storage_profiles`:
 - **storage_profile**: Used to save the result - the `default` profile (`./storage_profiles/default.yaml`) is used by default; if empty file is not saved
 - **storage_filename**: Outputting filename - relative path of the `root_path` set in the storage profile - by default a relative path to `/storage` folder; can use placeholders for dynamic formatting: `{file_name}`, `{file_extension}`, `{Y}`, `{mm}`, `{dd}` - for date formatting, `{HH}`, `{MM}`, `{SS}` - for time formatting
@@ -410,37 +410,39 @@ apiClient.uploadFile(formData).then(response => {
 - **Method**: POST
 - **Parameters**:
 - **file**: PDF, image or Office file to be processed.
-- **strategy**: OCR strategy to use (`marker`, `llama_vision` or `tesseract`).
+- **strategy**: OCR strategy to use (`llama_vision` or `easyocr`).
 - **ocr_cache**: Whether to cache the OCR result (true or false).
 - **prompt**: When provided, will be used for Ollama processing the OCR result
 - **model**: When provided along with the prompt - this model will be used for LLM processing
 - **storage_profile**: Used to save the result - the `default` profile (`./storage_profiles/default.yaml`) is used by default; if empty file is not saved
 - **storage_filename**: Outputting filename - relative path of the `root_path` set in the storage profile - by default a relative path to `/storage` folder; can use placeholders for dynamic formatting: `{file_name}`, `{file_extension}`, `{Y}`, `{mm}`, `{dd}` - for date formatting, `{HH}`, `{MM}`, `{SS}` - for time formatting
+- **language**: One or many (`en` or `en,pl,de`) language codes for the OCR to load the language weights
 
 Example:
 
 ```bash
-curl -X POST -H "Content-Type: multipart/form-data" -F "file=@examples/example-mri.pdf" -F "strategy=marker" -F "ocr_cache=true" -F "prompt=" -F "model=" "http://localhost:8000/ocr/upload"
+curl -X POST -H "Content-Type: multipart/form-data" -F "file=@examples/example-mri.pdf" -F "strategy=easyocr" -F "ocr_cache=true" -F "prompt=" -F "model=" "http://localhost:8000/ocr/upload"
 ```
 
 ### OCR Endpoint via JSON request
 - **URL**: /ocr/request
 - **Method**: POST
 - **Parameters** (JSON body):
 - **file**: Base64 encoded PDF file content.
-- **strategy**: OCR strategy to use (`marker`, `llama_vision` or `tesseract`).
+- **strategy**: OCR strategy to use (`llama_vision` or `easyocr`).
 - **ocr_cache**: Whether to cache the OCR result (true or false).
 - **prompt**: When provided, will be used for Ollama processing the OCR result.
 - **model**: When provided along with the prompt - this model will be used for LLM processing.
 - **storage_profile**: Used to save the result - the `default` profile (`/storage_profiles/default.yaml`) is used by default; if empty file is not saved.
 - **storage_filename**: Outputting filename - relative path of the `root_path` set in the storage profile - by default a relative path to `/storage` folder; can use placeholders for dynamic formatting: `{file_name}`, `{file_extension}`, `{Y}`, `{mm}`, `{dd}` - for date formatting, `{HH}`, `{MM}`, `{SS}` - for time formatting.
+- **language**: One or many (`en` or `en,pl,de`) language codes for the OCR to load the language weights
 
 Example:
 
 ```bash
 curl -X POST "http://localhost:8000/ocr/request" -H "Content-Type: application/json" -d '{
 "file": "<base64-encoded-file-content>",
-"strategy": "marker",
+"strategy": "easyocr",
 "ocr_cache": true,
 "prompt": "",
 "model": "llama3.1",
@@ -598,13 +600,7 @@ AWS_S3_BUCKET_NAME=your-bucket-name
 ```
 
 ## License
-This project is licensed under the GNU General Public License. See the [LICENSE](LICENSE) file for details.
-
-**Important note on [marker](https://github.com/VikParuchuri/marker) license***:
-
-The weights for the models are licensed `cc-by-nc-sa-4.0`, but Marker's author will waive that for any organization under $5M USD in gross revenue in the most recent 12-month period AND under $5M in lifetime VC/angel funding raised. You also must not be competitive with the [Datalab API](https://www.datalab.to/). If you want to remove the GPL license requirements (dual-license) and/or use the weights commercially over the revenue limit, check out the options [here](https://www.datalab.to/).
-
-
+This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
 
 ## Contact
 In case of any questions please contact us at: info@catchthetornado.com

client/cli.py

Lines changed: 12 additions & 6 deletions
@@ -6,17 +6,19 @@
 import math
 from ollama import pull
 
-def ocr_upload(file_path, ocr_cache, prompt, prompt_file=None, model='llama3.1', strategy='llama_vision', storage_profile='default', storage_filename=None):
+def ocr_upload(file_path, ocr_cache, prompt, prompt_file=None, model='llama3.1', strategy='llama_vision', storage_profile='default', storage_filename=None, language='en'):
     ocr_url = os.getenv('OCR_UPLOAD_URL', 'http://localhost:8000/ocr/upload')
     files = {'file': open(file_path, 'rb')}
     if not ocr_cache:
         print("OCR cache disabled.")
 
-    data = {'ocr_cache': ocr_cache, 'model': model, 'strategy': strategy, 'storage_profile': storage_profile}
+    data = {'ocr_cache': ocr_cache, 'model': model, 'strategy': strategy, 'storage_profile': storage_profile, 'language': language}
 
     if storage_filename:
         data['storage_filename'] = storage_filename
 
+    print(data)
+
     try:
         if prompt_file:
             prompt = open(prompt_file, 'r').read()
@@ -42,7 +44,7 @@ def ocr_upload(file_path, ocr_cache, prompt, prompt_file=None, model='llama3.1',
         print(f"Failed to upload file: {response.text}")
         return None
 
-def ocr_request(file_path, ocr_cache, prompt, prompt_file=None, model='llama3.1', strategy='llama_vision', storage_profile='default', storage_filename=None):
+def ocr_request(file_path, ocr_cache, prompt, prompt_file=None, model='llama3.1', strategy='llama_vision', storage_profile='default', storage_filename=None, language='en'):
     ocr_url = os.getenv('OCR_REQUEST_URL', 'http://localhost:8000/ocr/request')
     with open(file_path, 'rb') as f:
         file_content = base64.b64encode(f.read()).decode('utf-8')
@@ -52,7 +54,8 @@ def ocr_request(file_path, ocr_cache, prompt, prompt_file=None, model='llama3.1'
         'model': model,
         'strategy': strategy,
         'storage_profile': storage_profile,
-        'file': file_content
+        'file': file_content,
+        'language': language
     }
 
     if storage_filename:
@@ -175,6 +178,7 @@ def main():
     ocr_parser.add_argument('--print_progress', default=True, action='store_true', help='Print the progress of the OCR task')
     ocr_parser.add_argument('--storage_profile', type=str, default='default', help='Storage profile to use for the file')
     ocr_parser.add_argument('--storage_filename', type=str, default=None, help='Storage filename to use for the file. You may use some formatting - see the docs')
+    ocr_parser.add_argument('--language', type=str, default='en', help='Language to use for the OCR task')
     #ocr_parser.add_argument('--async_mode', action='store_true', help='Enable async mode for the OCR task')
 
     # Sub-command for uploading a file via file upload - @deprecated - it's a backward compatibility gimmick
@@ -189,6 +193,7 @@ def main():
     ocr_parser.add_argument('--print_progress', default=True, action='store_true', help='Print the progress of the OCR task')
     ocr_parser.add_argument('--storage_profile', type=str, default='default', help='Storage profile to use for the file')
     ocr_parser.add_argument('--storage_filename', type=str, default=None, help='Storage filename to use for the file. You may use some formatting - see the docs')
+    ocr_parser.add_argument('--language', type=str, default='en', help='Language to use for the OCR task')
     #ocr_parser.add_argument('--async_mode', action='store_true', help='Enable async mode for the OCR task')
 
 
@@ -204,6 +209,7 @@ def main():
     ocr_request_parser.add_argument('--print_progress', default=True, action='store_true', help='Print the progress of the OCR task')
     ocr_request_parser.add_argument('--storage_profile', type=str, default='default', help='Storage profile to use. You may use some formatting - see the docs')
     ocr_request_parser.add_argument('--storage_filename', type=str, default=None, help='Storage filename to use')
+    ocr_request_parser.add_argument('--language', type=str, default='en', help='Language to use for the OCR task')
 
     # Sub-command for getting the result
     result_parser = subparsers.add_parser('result', help='Get the OCR result by specified task id.')
@@ -239,7 +245,7 @@ def main():
 
     if args.command == 'ocr' or args.command == 'ocr_upload':
         print(args)
-        result = ocr_upload(args.file, False if args.disable_ocr_cache else args.ocr_cache, args.prompt, args.prompt_file, args.model, args.strategy, args.storage_profile, args.storage_filename)
+        result = ocr_upload(args.file, False if args.disable_ocr_cache else args.ocr_cache, args.prompt, args.prompt_file, args.model, args.strategy, args.storage_profile, args.storage_filename, args.language)
         if result is None:
             print("Error uploading file.")
             return
@@ -251,7 +257,7 @@ def main():
             if text_result:
                 print(text_result)
     elif args.command == 'ocr_request':
-        result = ocr_request(args.file, False if args.disable_ocr_cache else args.ocr_cache, args.prompt, args.prompt_file, args.model, args.strategy, args.storage_profile, args.storage_filename)
+        result = ocr_request(args.file, False if args.disable_ocr_cache else args.ocr_cache, args.prompt, args.prompt_file, args.model, args.strategy, args.storage_profile, args.storage_filename, args.language)
         if result is None:
             print("Error uploading file.")
             return
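
Review note: the `client/cli.py` changes above thread a new `language` value into both request bodies. As a minimal sketch of what the JSON body for `/ocr/request` might look like after this commit, the helper below assembles the fields listed in the README diff. The function name `build_ocr_request_payload` and its default values are hypothetical and are not part of the commit; no network call is made.

```python
import base64
import json


def build_ocr_request_payload(file_bytes, strategy="easyocr", language="en",
                              ocr_cache=True, prompt="", model="llama3.1",
                              storage_profile="default"):
    """Assemble a JSON-serializable body for POST /ocr/request.

    Hypothetical helper: the field names follow the README diff, but the
    function itself does not exist in the repository.
    """
    return {
        # The README describes `file` as Base64 encoded file content.
        "file": base64.b64encode(file_bytes).decode("utf-8"),
        "strategy": strategy,
        "ocr_cache": ocr_cache,
        "prompt": prompt,
        "model": model,
        "storage_profile": storage_profile,
        # One code (`en`) or several separated by commas (`en,pl,de`).
        "language": language,
    }


if __name__ == "__main__":
    payload = build_ocr_request_payload(b"%PDF-1.4 placeholder bytes",
                                        language="en,de,pl")
    print(json.dumps(payload, indent=2))
```

Keeping `language` as a single comma-separated string mirrors the `--language` CLI flag, which is declared with `type=str`.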

config/strategies.yaml

Lines changed: 0 additions & 2 deletions
@@ -1,7 +1,5 @@
 strategies:
   llama_vision:
     class: text_extract_api.extract.strategies.llama_vision.LlamaVisionStrategy
-  marker:
-    class: text_extract_api.extract.strategies.marker.MarkerStrategy
   easyocr:
     class: text_extract_api.extract.strategies.easyocr.EasyOCRStrategy
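
Review note: this diff does not show how the application consumes `config/strategies.yaml`, so the following is only a sketch of one common way a name-to-dotted-path mapping like this can be resolved, using the standard `importlib` module. The helper names `load_class` and `build_registry` are hypothetical, and the demo uses a standard-library class in place of the project's strategy classes so that it runs on its own.

```python
import importlib


def load_class(dotted_path):
    """Import and return the object named by 'package.module.ClassName'."""
    module_name, _, class_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"Expected a dotted path, got: {dotted_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


def build_registry(config):
    """Map each strategy name to its class, given the parsed YAML as a dict."""
    return {
        name: load_class(entry["class"])
        for name, entry in config["strategies"].items()
    }


if __name__ == "__main__":
    # Stand-in config: a stdlib class replaces the project's strategy classes.
    demo_config = {"strategies": {"ordered": {"class": "collections.OrderedDict"}}}
    registry = build_registry(demo_config)
    print(registry["ordered"].__name__)
```

Under a scheme like this, removing the `marker` entry from the YAML is enough to make the strategy unavailable by name, which matches the README change that drops `marker` from the accepted `strategy` values.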

dev.Dockerfile

Lines changed: 0 additions & 2 deletions
@@ -8,8 +8,6 @@ RUN apt-get clean && rm -rf /var/lib/apt/lists/* \
     && apt-get update --fix-missing \
     && apt-get install -y \
     libgl1-mesa-glx \
-    tesseract-ocr \
-    libtesseract-dev \
     poppler-utils \
     libmagic1 \
     libmagic-dev \

dev.gpu.Dockerfile

Lines changed: 0 additions & 2 deletions
@@ -43,8 +43,6 @@ RUN apt-get clean && rm -rf /var/lib/apt/lists/* \
     && apt-get update --fix-missing \
     && apt-get install -y \
     libgl1-mesa-glx \
-    tesseract-ocr \
-    libtesseract-dev \
     poppler-utils \
     libpoppler-cpp-dev \
     && rm -rf /var/lib/apt/lists/*

pyproject.toml

Lines changed: 0 additions & 3 deletions
@@ -15,7 +15,6 @@ dependencies = [
     "easyocr",
     "celery",
     "redis",
-    "pytesseract",
     "opencv-python-headless",
     "pdf2image",
     "ollama",
@@ -28,8 +27,6 @@ dependencies = [
     "google-auth-httplib2",
     "google-auth-oauthlib",
     "transformers",
-    "surya-ocr==0.4.14",
-    "marker-pdf==0.2.6",
     "boto3",
     "Pillow",
     "python-magic==0.4.27",

run.sh

Lines changed: 0 additions & 3 deletions
@@ -52,9 +52,6 @@ echo "Starting Redis"
 echo "Your ENV settings loaded from .env.localhost file: "
 printenv
 
-echo "Downloading models"
-python -c 'from marker.models import load_all_models; load_all_models()'
-
 CELERY_BIN="$(pwd)/.venv/bin/celery"
 CELERY_PID=$(pgrep -f "$CELERY_BIN")
 REDIS_PORT=6379 # will move it to .envs in near future

text_extract_api/extract/strategies/easyocr.py

Lines changed: 5 additions & 5 deletions
@@ -5,15 +5,15 @@
 
 from text_extract_api.extract.strategies.strategy import Strategy
 from text_extract_api.files.file_formats.file_format import FileFormat
-from text_extract_api.files.file_formats.image_file_format import ImageFileFormat
+from text_extract_api.files.file_formats.image import ImageFileFormat
 
 
-class EasyOCR(Strategy):
+class EasyOCRStrategy(Strategy):
     @classmethod
     def name(cls) -> str:
         return "easyOCR"
 
-    def extract_text(self, file_format: FileFormat) -> str:
+    def extract_text(self, file_format: FileFormat, language: str = 'en') -> str:
         """
         Extract text using EasyOCR after converting the input file to images
         (if not already an ImageFileFormat).
@@ -33,7 +33,7 @@ def extract_text(self, file_format: FileFormat) -> str:
 
         # Initialize the EasyOCR Reader
         # Add or change languages to your needs, e.g., ['en', 'fr']
-        reader = easyocr.Reader(['en'])
+        reader = easyocr.Reader(language.split(','))
 
         # Process each image, extracting text
         all_extracted_text = []
@@ -45,7 +45,7 @@ def extract_text(self, file_format: FileFormat) -> str:
             np_image = np.array(pil_image)
 
             # Perform OCR; with `detail=0`, we get just text, no bounding boxes
-            ocr_result = reader.readtext(np_image, detail=0)
+            ocr_result = reader.readtext(np_image, detail=0) # TODO: addd bounding boxes support as described in #37
 
             # Combine all lines into a single string for that image/page
             extracted_text = "\n".join(ocr_result)
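
Review note: the new `language.split(',')` call passes the pieces to `easyocr.Reader` unchanged, so an input such as `en, de` yields `['en', ' de']` with a leading space, and an empty string yields `['']`. A minimal sketch of a more forgiving parser is shown below; `parse_languages` is a hypothetical helper, not something this commit adds, and the fallback to `en` is an assumption chosen to match the parameter's default.

```python
def parse_languages(language="en"):
    """Turn 'en, de,,pl' into ['en', 'de', 'pl'].

    Hypothetical helper: trims whitespace, drops empty items, removes
    duplicates while keeping order, and falls back to ['en'] if nothing is left.
    """
    codes = [code.strip() for code in language.split(",")]
    codes = [code for code in codes if code]
    # dict.fromkeys keeps the first occurrence of each code, in order.
    unique_codes = list(dict.fromkeys(codes))
    return unique_codes or ["en"]


if __name__ == "__main__":
    print(parse_languages("en, de,,pl"))
    print(parse_languages(""))
```

With a helper like this, the reader could be built as `easyocr.Reader(parse_languages(language))` instead of splitting inline.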
