feat: LICENSE change, marker removed

pkarw · pkarw · commit 956259307e85 · 2025-01-17T14:26:20.000+01:00
diff --git a/LICENSE b/LICENSE
diff --git a/Makefile b/Makefile
@@ -66,12 +66,12 @@ setup-local:
 .PHONY: install-linux
 install-linux:
 	@echo -e "\033[1;34m   Installing Linux dependencies...\033[0m"; \
-	sudo apt update && sudo apt install -y libmagic1 tesseract-ocr poppler-utils pkg-config
+	sudo apt update && sudo apt install -y libmagic1 poppler-utils pkg-config
 
 .PHONY: install-macos
 install-macos:
 	@echo -e "\033[1;34m   Installing macOS dependencies...\033[0m"; \
-	brew update && brew install libmagic tesseract poppler pkg-config ghostscript ffmpeg automake autoconf
+	brew update && brew install libmagic poppler pkg-config ghostscript ffmpeg automake autoconf
 
 .PHONY: install-requirements
 install-requirements:
diff --git a/README.md b/README.md
@@ -7,8 +7,8 @@ The API is built with FastAPI and uses Celery for asynchronous task processing.
 ![hero doc extract](ocr-hero.webp)
 
 ## Features:
-- **No Cloud/external dependencies** all you need: PyTorch based OCR (Marker) + Ollama are shipped and configured via `docker-compose` no data is sent outside your dev/server environment,
-- **PDF/Office to Markdown** conversion with very high accuracy using different OCR strategies including [marker](https://github.com/VikParuchuri/marker) and [llama3.2-vision](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/), [surya-ocr](https://github.com/VikParuchuri/surya) or [tessereact](https://github.com/h/pytesseract)
+- **No Cloud/external dependencies** all you need: PyTorch based OCR (EasyOCR) + Ollama are shipped and configured via `docker-compose` no data is sent outside your dev/server environment,
+- **PDF/Office to Markdown** conversion with very high accuracy using different OCR strategies, including [llama3.2-vision](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/) and [EasyOCR](https://github.com/JaidedAI/EasyOCR)
 - **PDF/Office to JSON** conversion using Ollama supported models (eg. LLama 3.1)
 - **LLM Improving OCR results** LLama is pretty good with fixing spelling and text issues in the OCR text
 - **Removing PII** This tool can be used for removing Personally Identifiable Information out of document - see `examples`
@@ -39,8 +39,6 @@ Before running the example see [getting started](#getting-started)
 
 ![Converting Invoice to JSON](./screenshots/example-2.png)
 
-**Note:** As you may observe in the example above, `marker-pdf` sometimes mismatches the cols and rows which could have potentially great impact on data accuracy. To improve on it there is a feature request [#3](https://github.com/CatchTheTornado/text-extract-api/issues/3) for adding alternative support for [`tabled`](https://github.com/VikParuchuri/tabled) model - which is optimized for tables.
-
 ## Getting started
 
 You might want to run the app directly on your machine for development purposes OR to use for example Apple GPUs (which are not supported by Docker at the moment).
@@ -114,7 +112,7 @@ This command will install all the dependencies - including Redis (via Docker, so
 
 (MAC) - Dependencies
 ```
-brew update && brew install libmagic tesseract poppler pkg-config ghostscript ffmpeg automake autoconf
+brew update && brew install libmagic poppler pkg-config ghostscript ffmpeg automake autoconf
 ```
 
 (Mac) - You need to startup the celery worker
@@ -312,9 +310,11 @@ python client/cli.py llm_pull --model llama3.2-vision
 and only after to run this specific prompt query:
 
 ```bash
-python client/cli.py ocr_upload --file examples/example-mri.pdf --ocr_cache --prompt_file=examples/example-mri-remove-pii.txt
+python client/cli.py ocr_upload --file examples/example-mri.pdf --ocr_cache --prompt_file=examples/example-mri-remove-pii.txt --language en
 ```
 
+**Note:** The `--language` argument tells the OCR strategy which language model weights to load. To use several languages, pass them as a comma-separated list, e.g. `en,de,pl`.
+
 The `ocr` command can store the results using the `storage_profiles`:
   - **storage_profile**: Used to save the result - the `default` profile (`./storage_profiles/default.yaml`) is used by default; if empty file is not saved
   - **storage_filename**: Outputting filename - relative path of the `root_path` set in the storage profile - by default a relative path to `/storage` folder; can use placeholders for dynamic formatting: `{file_name}`, `{file_extension}`, `{Y}`, `{mm}`, `{dd}` - for date formatting, `{HH}`, `{MM}`, `{SS}` - for time formatting
@@ -410,37 +410,39 @@ apiClient.uploadFile(formData).then(response => {
 - **Method**: POST
 - **Parameters**:
   - **file**: PDF, image or Office file to be processed.
-  - **strategy**: OCR strategy to use (`marker`, `llama_vision` or `tesseract`).
+  - **strategy**: OCR strategy to use (`llama_vision` or `easyocr`).
   - **ocr_cache**: Whether to cache the OCR result (true or false).
   - **prompt**: When provided, will be used for Ollama processing the OCR result
   - **model**: When provided along with the prompt - this model will be used for LLM processing
   - **storage_profile**: Used to save the result - the `default` profile (`./storage_profiles/default.yaml`) is used by default; if empty file is not saved
   - **storage_filename**: Outputting filename - relative path of the `root_path` set in the storage profile - by default a relative path to `/storage` folder; can use placeholders for dynamic formatting: `{file_name}`, `{file_extension}`, `{Y}`, `{mm}`, `{dd}` - for date formatting, `{HH}`, `{MM}`, `{SS}` - for time formatting
+  - **language**: One or more comma-separated language codes (`en` or `en,pl,de`); the OCR strategy loads the weights for these languages
 
 Example:
 
 ```bash
-curl -X POST -H "Content-Type: multipart/form-data" -F "file=@examples/example-mri.pdf" -F "strategy=marker" -F "ocr_cache=true" -F "prompt=" -F "model=" "http://localhost:8000/ocr/upload" 
+curl -X POST -H "Content-Type: multipart/form-data" -F "file=@examples/example-mri.pdf" -F "strategy=easyocr" -F "ocr_cache=true" -F "prompt=" -F "model=" "http://localhost:8000/ocr/upload" 
 ```
 
 ### OCR Endpoint via JSON request
 - **URL**: /ocr/request
 - **Method**: POST
 - **Parameters** (JSON body):
   - **file**: Base64 encoded PDF file content.
-  - **strategy**: OCR strategy to use (`marker`, `llama_vision` or `tesseract`).
+  - **strategy**: OCR strategy to use (`llama_vision` or `easyocr`).
   - **ocr_cache**: Whether to cache the OCR result (true or false).
   - **prompt**: When provided, will be used for Ollama processing the OCR result.
   - **model**: When provided along with the prompt - this model will be used for LLM processing.
   - **storage_profile**: Used to save the result - the `default` profile (`/storage_profiles/default.yaml`) is used by default; if empty file is not saved.
   - **storage_filename**: Outputting filename - relative path of the `root_path` set in the storage profile - by default a relative path to `/storage` folder; can use placeholders for dynamic formatting: `{file_name}`, `{file_extension}`, `{Y}`, `{mm}`, `{dd}` - for date formatting, `{HH}`, `{MM}`, `{SS}` - for time formatting.
+  - **language**: One or more comma-separated language codes (`en` or `en,pl,de`); the OCR strategy loads the weights for these languages.
 
 Example:
 
 ```bash
 curl -X POST "http://localhost:8000/ocr/request" -H "Content-Type: application/json" -d '{
   "file": "<base64-encoded-file-content>",
-  "strategy": "marker",
+  "strategy": "easyocr",
   "ocr_cache": true,
   "prompt": "",
   "model": "llama3.1",
@@ -598,13 +600,7 @@ AWS_S3_BUCKET_NAME=your-bucket-name
 ```
 
 ## License
-This project is licensed under the GNU General Public License. See the [LICENSE](LICENSE) file for details.
-
-**Important note on [marker](https://github.com/VikParuchuri/marker) license***:
-
-The weights for the models are licensed `cc-by-nc-sa-4.0`, but Marker's author will waive that for any organization under $5M USD in gross revenue in the most recent 12-month period AND under $5M in lifetime VC/angel funding raised. You also must not be competitive with the [Datalab API](https://www.datalab.to/). If you want to remove the GPL license requirements (dual-license) and/or use the weights commercially over the revenue limit, check out the options [here](https://www.datalab.to/).
-
-
+This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
 
 ## Contact
 In case of any questions please contact us at: info@catchthetornado.com
diff --git a/client/cli.py b/client/cli.py
@@ -6,17 +6,19 @@
 import math
 from ollama import pull
 
-def ocr_upload(file_path, ocr_cache, prompt, prompt_file=None, model='llama3.1', strategy='llama_vision', storage_profile='default', storage_filename=None):
+def ocr_upload(file_path, ocr_cache, prompt, prompt_file=None, model='llama3.1', strategy='llama_vision', storage_profile='default', storage_filename=None, language='en'):
     ocr_url = os.getenv('OCR_UPLOAD_URL', 'http://localhost:8000/ocr/upload')
     files = {'file': open(file_path, 'rb')}
     if not ocr_cache:
         print("OCR cache disabled.")
 
-    data = {'ocr_cache': ocr_cache, 'model': model, 'strategy': strategy, 'storage_profile': storage_profile}
+    data = {'ocr_cache': ocr_cache, 'model': model, 'strategy': strategy, 'storage_profile': storage_profile, 'language': language}
 
     if storage_filename:
         data['storage_filename'] = storage_filename
     
+    print(data)
+
     try:
         if prompt_file:
             prompt = open(prompt_file, 'r').read()
@@ -42,7 +44,7 @@ def ocr_upload(file_path, ocr_cache, prompt, prompt_file=None, model='llama3.1',
         print(f"Failed to upload file: {response.text}")
         return None
 
-def ocr_request(file_path, ocr_cache, prompt, prompt_file=None, model='llama3.1', strategy='llama_vision', storage_profile='default', storage_filename=None):
+def ocr_request(file_path, ocr_cache, prompt, prompt_file=None, model='llama3.1', strategy='llama_vision', storage_profile='default', storage_filename=None, language='en'):
     ocr_url = os.getenv('OCR_REQUEST_URL', 'http://localhost:8000/ocr/request')
     with open(file_path, 'rb') as f:
         file_content = base64.b64encode(f.read()).decode('utf-8')
@@ -52,7 +54,8 @@ def ocr_request(file_path, ocr_cache, prompt, prompt_file=None, model='llama3.1'
         'model': model,
         'strategy': strategy,
         'storage_profile': storage_profile,
-        'file': file_content
+        'file': file_content,
+        'language': language
     }
 
     if storage_filename:
@@ -175,6 +178,7 @@ def main():
     ocr_parser.add_argument('--print_progress', default=True, action='store_true', help='Print the progress of the OCR task')
     ocr_parser.add_argument('--storage_profile', type=str, default='default', help='Storage profile to use for the file')
     ocr_parser.add_argument('--storage_filename', type=str, default=None, help='Storage filename to use for the file. You may use some formatting - see the docs')
+    ocr_parser.add_argument('--language', type=str, default='en', help='Language to use for the OCR task')
     #ocr_parser.add_argument('--async_mode', action='store_true', help='Enable async mode for the OCR task')
 
     # Sub-command for uploading a file via file upload - @deprecated - it's a backward compatibility gimmick
@@ -189,6 +193,7 @@ def main():
     ocr_parser.add_argument('--print_progress', default=True, action='store_true', help='Print the progress of the OCR task')
     ocr_parser.add_argument('--storage_profile', type=str, default='default', help='Storage profile to use for the file')
     ocr_parser.add_argument('--storage_filename', type=str, default=None, help='Storage filename to use for the file. You may use some formatting - see the docs')
+    ocr_parser.add_argument('--language', type=str, default='en', help='Language to use for the OCR task')
     #ocr_parser.add_argument('--async_mode', action='store_true', help='Enable async mode for the OCR task')
 
 
@@ -204,6 +209,7 @@ def main():
     ocr_request_parser.add_argument('--print_progress', default=True, action='store_true', help='Print the progress of the OCR task')
     ocr_request_parser.add_argument('--storage_profile', type=str, default='default', help='Storage profile to use. You may use some formatting - see the docs')
     ocr_request_parser.add_argument('--storage_filename', type=str, default=None, help='Storage filename to use')
+    ocr_request_parser.add_argument('--language', type=str, default='en', help='Language to use for the OCR task')
 
     # Sub-command for getting the result
     result_parser = subparsers.add_parser('result', help='Get the OCR result by specified task id.')
@@ -239,7 +245,7 @@ def main():
 
     if args.command == 'ocr' or args.command == 'ocr_upload':
         print(args)
-        result = ocr_upload(args.file, False if args.disable_ocr_cache else args.ocr_cache, args.prompt, args.prompt_file, args.model, args.strategy, args.storage_profile, args.storage_filename)
+        result = ocr_upload(args.file, False if args.disable_ocr_cache else args.ocr_cache, args.prompt, args.prompt_file, args.model, args.strategy, args.storage_profile, args.storage_filename, args.language)
         if result is None:
             print("Error uploading file.")
             return
@@ -251,7 +257,7 @@ def main():
             if text_result:
                 print(text_result)
     elif args.command == 'ocr_request':
-        result = ocr_request(args.file, False if args.disable_ocr_cache else args.ocr_cache, args.prompt, args.prompt_file, args.model, args.strategy, args.storage_profile, args.storage_filename)
+        result = ocr_request(args.file, False if args.disable_ocr_cache else args.ocr_cache, args.prompt, args.prompt_file, args.model, args.strategy, args.storage_profile, args.storage_filename, args.language)
         if result is None:
             print("Error uploading file.")
             return
diff --git a/config/strategies.yaml b/config/strategies.yaml
@@ -1,7 +1,5 @@
 strategies:
    llama_vision:
       class: text_extract_api.extract.strategies.llama_vision.LlamaVisionStrategy
-   marker:
-      class: text_extract_api.extract.strategies.marker.MarkerStrategy
    easyocr:
       class: text_extract_api.extract.strategies.easyocr.EasyOCRStrategy
diff --git a/dev.Dockerfile b/dev.Dockerfile
@@ -8,8 +8,6 @@ RUN apt-get clean && rm -rf /var/lib/apt/lists/* \
     && apt-get update --fix-missing \
     && apt-get install -y \
         libgl1-mesa-glx \
-        tesseract-ocr \
-        libtesseract-dev \
         poppler-utils \
         libmagic1 \
         libmagic-dev \
diff --git a/dev.gpu.Dockerfile b/dev.gpu.Dockerfile
@@ -43,8 +43,6 @@ RUN apt-get clean && rm -rf /var/lib/apt/lists/* \
     && apt-get update --fix-missing \
     && apt-get install -y \
         libgl1-mesa-glx \
-        tesseract-ocr \
-        libtesseract-dev \
         poppler-utils \
         libpoppler-cpp-dev \
     && rm -rf /var/lib/apt/lists/*
diff --git a/pyproject.toml b/pyproject.toml
@@ -15,7 +15,6 @@ dependencies = [
     "easyocr",
     "celery",
     "redis",
-    "pytesseract",
     "opencv-python-headless",
     "pdf2image",
     "ollama",
@@ -28,8 +27,6 @@ dependencies = [
     "google-auth-httplib2",
     "google-auth-oauthlib",
     "transformers",
-    "surya-ocr==0.4.14",
-    "marker-pdf==0.2.6",
     "boto3",
     "Pillow",
     "python-magic==0.4.27",
diff --git a/run.sh b/run.sh
@@ -52,9 +52,6 @@ echo "Starting Redis"
 echo "Your ENV settings loaded from .env.localhost file: "
 printenv
 
-echo "Downloading models"
-python -c 'from marker.models import load_all_models; load_all_models()'
-
 CELERY_BIN="$(pwd)/.venv/bin/celery"
 CELERY_PID=$(pgrep -f "$CELERY_BIN")
 REDIS_PORT=6379 # will move it to .envs in near future
diff --git a/text_extract_api/extract/strategies/easyocr.py b/text_extract_api/extract/strategies/easyocr.py
@@ -5,15 +5,15 @@
 
 from text_extract_api.extract.strategies.strategy import Strategy
 from text_extract_api.files.file_formats.file_format import FileFormat
-from text_extract_api.files.file_formats.image_file_format import ImageFileFormat
+from text_extract_api.files.file_formats.image import ImageFileFormat
 
 
-class EasyOCR(Strategy):
+class EasyOCRStrategy(Strategy):
     @classmethod
     def name(cls) -> str:
         return "easyOCR"
 
-    def extract_text(self, file_format: FileFormat) -> str:
+    def extract_text(self, file_format: FileFormat, language: str = 'en') -> str:
         """
         Extract text using EasyOCR after converting the input file to images
         (if not already an ImageFileFormat). 
@@ -33,7 +33,7 @@ def extract_text(self, file_format: FileFormat) -> str:
 
         # Initialize the EasyOCR Reader
         # Add or change languages to your needs, e.g., ['en', 'fr']
-        reader = easyocr.Reader(['en'])
+        reader = easyocr.Reader(language.split(','))
 
         # Process each image, extracting text
         all_extracted_text = []
@@ -45,7 +45,7 @@ def extract_text(self, file_format: FileFormat) -> str:
             np_image = np.array(pil_image)
 
             # Perform OCR; with `detail=0`, we get just text, no bounding boxes
-            ocr_result = reader.readtext(np_image, detail=0)
+            ocr_result = reader.readtext(np_image, detail=0)  # TODO: add bounding-box support as described in #37
 
             # Combine all lines into a single string for that image/page
             extracted_text = "\n".join(ocr_result)
diff --git a/text_extract_api/extract/strategies/llama_vision.py b/text_extract_api/extract/strategies/llama_vision.py
@@ -16,7 +16,7 @@ class LlamaVisionStrategy(Strategy):
     def name(cls) -> str:
         return "llama_vision"
 
-    def extract_text(self, file_format: FileFormat):
+    def extract_text(self, file_format: FileFormat, language: str = 'en') -> str:
 
         if (
                 not isinstance(file_format, ImageFileFormat)
diff --git a/text_extract_api/extract/strategies/marker.py b/text_extract_api/extract/strategies/marker.py
diff --git a/text_extract_api/extract/strategies/strategy.py b/text_extract_api/extract/strategies/strategy.py
@@ -27,7 +27,7 @@ def name(cls) -> str:
         raise NotImplementedError("Strategy subclasses must implement name")
 
     @classmethod
-    def extract_text(cls, file_format: Type["FileFormat"]):
+    def extract_text(cls, file_format: Type["FileFormat"], language: str = 'en') -> str:
         raise NotImplementedError("Strategy subclasses must implement extract_text method")
 
     @classmethod
diff --git a/text_extract_api/extract/tasks.py b/text_extract_api/extract/tasks.py
@@ -25,8 +25,9 @@ def ocr_task(
         ocr_cache: bool,
         prompt: str,
         model: str,
+        language: str,
         storage_profile: str,
-        storage_filename: Optional[str] = None
+        storage_filename: Optional[str] = None,
 ):
     """
     Celery task to perform OCR processing on a PDF/Office/image file.
@@ -51,7 +52,7 @@ def ocr_task(
         self.update_state(state='PROGRESS',
                           meta={'progress': 30, 'status': 'Extracting text from PDF', 'start_time': start_time,
                                 'elapsed_time': time.time() - start_time})  # Example progress update
-        extracted_text = strategy.extract_text(FileFormat.from_binary(binary_content))
+        extracted_text = strategy.extract_text(FileFormat.from_binary(binary_content), language)
     else:
         print("Using cached result...")
 
diff --git a/text_extract_api/main.py b/text_extract_api/main.py
diff --git a/utils/marker_cli.py b/utils/marker_cli.py
diff --git a/utils/requirements.txt b/utils/requirements.txt
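
Note on the new `language` parameter: the EasyOCR strategy in this commit passes `language.split(',')` straight to `easyocr.Reader`. A plain split keeps surrounding whitespace and empty items, so an input such as `en, de` yields `['en', ' de']`, and the second entry would probably not be recognised as a language code. The sketch below shows one possible way to normalise the value first. The helper name `parse_languages` and its `default` argument are hypothetical and are not part of this commit.

```python
from typing import List


def parse_languages(language: str, default: str = "en") -> List[str]:
    """Split a comma-separated language string into a clean list of codes.

    Hypothetical helper (not part of the commit): trims whitespace,
    drops empty items and duplicates, and falls back to `default`
    when nothing usable is left.
    """
    codes: List[str] = []
    for raw in language.split(","):
        code = raw.strip()
        # Skip empty fragments (e.g. from "en,,de" or a trailing comma)
        # and codes that were already seen, preserving input order.
        if code and code not in codes:
            codes.append(code)
    return codes or [default]


print(parse_languages("en, de,pl"))  # ['en', 'de', 'pl']
```

Under these assumptions, the strategy could call `easyocr.Reader(parse_languages(language))` in place of the bare split; whether to reject unknown codes early is left open here.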