Merge pull request #53 from CatchTheTornado/fix_46

pkarw · web-flow · commit eb4e3d320daf · 2025-01-08T11:22:46.000+01:00
Project rename
diff --git a/README.md b/README.md
@@ -1,17 +1,17 @@
-# pdf-extract-api
+# text-extract-api
 
-Convert any image or PDF to Markdown *text* or JSON structured document with super-high accuracy, including tabular data, numbers or math formulas.
+Convert any image, PDF or Office document to Markdown *text* or JSON structured document with super-high accuracy, including tabular data, numbers or math formulas.
 
 The API is built with FastAPI and uses Celery for asynchronous task processing. Redis is used for caching OCR results.
 
 ![hero doc extract](ocr-hero.webp)
 
 ## Features:
 - **No Cloud/external dependencies** all you need: PyTorch based OCR (Marker) + Ollama are shipped and configured via `docker-compose` no data is sent outside your dev/server environment,
-- **PDF to Markdown** conversion with very high accuracy using different OCR strategies including [marker](https://github.com/VikParuchuri/marker) and [llama3.2-vision](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/), [surya-ocr](https://github.com/VikParuchuri/surya) or [tessereact](https://github.com/h/pytesseract)
-- **PDF to JSON** conversion using Ollama supported models (eg. LLama 3.1)
+- **PDF/Office to Markdown** conversion with very high accuracy using different OCR strategies including [marker](https://github.com/VikParuchuri/marker) and [llama3.2-vision](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/), [surya-ocr](https://github.com/VikParuchuri/surya) or [tesseract](https://github.com/h/pytesseract)
+- **PDF/Office to JSON** conversion using Ollama-supported models (e.g. Llama 3.1)
 - **LLM Improving OCR results** LLama is pretty good with fixing spelling and text issues in the OCR text
-- **Removing PII** This tool can be used for removing Personally Identifiable Information out of PDF - see `examples`
+- **Removing PII** This tool can be used for removing Personally Identifiable Information from documents - see `examples`
 - **Distributed queue processing** using [Celery](https://docs.celeryq.dev/en/stable/getting-started/introduction.html))
 - **Caching** using Redis - the OCR results can be easily cached prior to LLM processing,
 - **Storage Strategies** switchable storage strategies (Google Drive, Local File System ...)
@@ -39,7 +39,7 @@ Before running the example see [getting started](#getting-started)
 
 ![Converting Invoice to JSON](./screenshots/example-2.png)
 
-**Note:** As you may observe in the example above, `marker-pdf` sometimes mismatches the cols and rows which could have potentially great impact on data accuracy. To improve on it there is a feature request [#3](https://github.com/CatchTheTornado/pdf-extract-api/issues/3) for adding alternative support for [`tabled`](https://github.com/VikParuchuri/tabled) model - which is optimized for tables.
+**Note:** As you may observe in the example above, `marker-pdf` sometimes mismatches the cols and rows which could have potentially great impact on data accuracy. To improve on it there is a feature request [#3](https://github.com/CatchTheTornado/text-extract-api/issues/3) for adding alternative support for [`tabled`](https://github.com/VikParuchuri/tabled) model - which is optimized for tables.
 
 ## Getting started
 
@@ -71,7 +71,7 @@ chmod +x run.sh
 run.sh
 ```
 
-This command will install all the dependencies - including Redis (via Docker, so it is not entirely docker free method of running `pdf-extract-api` anyways :)
+This command will install all the dependencies, including Redis (via Docker, so it is not an entirely Docker-free method of running `text-extract-api` anyway :)
 
 Then you're good to go with running some CLI commands like:
 
@@ -105,7 +105,7 @@ export RESULT_URL=https://doctractor:Aekie2ao@api.doctractor.com/ocr/result/
 python client/cli.py ocr_upload --file examples/example-mri.pdf --ocr_cache --prompt_file=examples/example-mri-remove-pii.txt
 ```
 
-[Demo Source code](https://github.com/CatchTheTornado/pdf-extract-api-demo)
+[Demo Source code](https://github.com/CatchTheTornado/text-extract-api-demo)
 
 **Note:** In the free demo we don't guarantee any processing times. The API is Open so please do **not send any secret documents neither any documents containing personal information**, If you do - you're doing it on your own risk and responsiblity.
 
@@ -123,8 +123,8 @@ python client/cli.py ocr_upload --file examples/example-mri.pdf --ocr_cache --pr
 ### Clone the Repository
 
 ```sh
-git clone https://github.com/CatchTheTornado/pdf-extract-api.git
-cd pdf-extract-api
+git clone https://github.com/CatchTheTornado/text-extract-api.git
+cd text-extract-api
 ```
 
 ### Setup environmental variables
@@ -194,7 +194,7 @@ This will start the following services:
 
 ## Cloud - paid edition
 
-If the on-prem is too much hassle [ask us about the hosted/cloud edition](mailto:info@catchthetornado.com?subject=pdf-extract-api%20but%20hosted) of pdf-extract-api, we can setup it you, billed just for the usage.
+If the on-prem setup is too much hassle, [ask us about the hosted/cloud edition](mailto:info@catchthetornado.com?subject=text-extract-api%20but%20hosted) of text-extract-api; we can set it up for you, billed just for the usage.
 
 ## CLI tool
 
@@ -225,7 +225,7 @@ python client/cli.py llm_pull --model llama3.1
 python client/cli.py llm_pull --model llama3.2-vision
 ```
 
-These models are required for most features supported by `pdf-extract-api`.
+These models are required for most features supported by `text-extract-api`.
 
 
 ### Upload a File for OCR (converting to Markdown)
@@ -321,20 +321,20 @@ python llm_generate --prompt "Your prompt here"
 
 ## API Clients
 
-You might want to use the decdicated API clients to use `pdf-extract-api`
+You might want to use the dedicated API clients to use `text-extract-api`
 
 ### Typescript
 
-There's a dedicated API client for Typescript - [pdf-extract-api-client](https://github.com/CatchTheTornado/pdf-extract-api-client) and the `npm` package by the same name:
+There's a dedicated API client for Typescript - [text-extract-api-client](https://github.com/CatchTheTornado/text-extract-api-client) and the `npm` package by the same name:
 
 ```bash
-npm install pdf-extract-api-client
+npm install text-extract-api-client
 ```
 
 Usage:
 
 ```js
-import { ApiClient, OcrRequest } from 'pdf-extract-api-client';
+import { ApiClient, OcrRequest } from 'text-extract-api-client';
 const apiClient = new ApiClient('https://api.doctractor.com/', 'doctractor', 'Aekie2ao');
 const formData = new FormData();
 formData.append('file', fileInput.files[0]);
@@ -354,7 +354,7 @@ apiClient.uploadFile(formData).then(response => {
 - **URL**: /ocr/upload
 - **Method**: POST
 - **Parameters**:
-  - **file**: PDF file to be processed.
+  - **file**: PDF, image or Office file to be processed.
   - **strategy**: OCR strategy to use (`marker`, `llama_vision` or `tesseract`).
   - **ocr_cache**: Whether to cache the OCR result (true or false).
   - **prompt**: When provided, will be used for Ollama processing the OCR result
diff --git a/app/main.py b/app/main.py
@@ -36,7 +36,7 @@ async def ocr_endpoint(
     storage_filename: str = Form(None)
 ):
     """
-    Endpoint to extract text from an uploaded PDF file using different OCR strategies.
+    Endpoint to extract text from an uploaded PDF, Image or Office file using different OCR strategies.
     Supports both synchronous and asynchronous processing.
     """
     # Validate input
@@ -50,10 +50,10 @@ async def ocr_endpoint(
 
     pdf_bytes = await file.read()
 
-    # Generate a hash of the PDF content for caching
+    # Generate a hash of the document content for caching
     pdf_hash = md5(pdf_bytes).hexdigest()
 
-    print(f"Processing PDF {file.filename} with strategy: {strategy}, ocr_cache: {ocr_cache}, model: {model}, storage_profile: {storage_profile}, storage_filename: {storage_filename}")
+    print(f"Processing Document {file.filename} with strategy: {strategy}, ocr_cache: {ocr_cache}, model: {model}, storage_profile: {storage_profile}, storage_filename: {storage_filename}")
 
     # Asynchronous processing using Celery
     task = ocr_task.apply_async(args=[pdf_bytes, strategy, file.filename, pdf_hash, ocr_cache, prompt, model, storage_profile, storage_filename])
@@ -71,7 +71,7 @@ async def ocr_upload_endpoint(
     storage_filename: str = Form(None)
 ):
     """
-    Alias endpoint to extract text from an uploaded PDF file using different OCR strategies.
+    Alias endpoint to extract text from an uploaded PDF/Office/Image file using different OCR strategies.
     Supports both synchronous and asynchronous processing.
     """
     return await ocr_endpoint(strategy, prompt, model, file, ocr_cache, storage_profile, storage_filename)
@@ -87,7 +87,7 @@ class OcrRequest(BaseModel):
     strategy: str = Field(..., description="OCR strategy to use")
     prompt: Optional[str] = Field(None, description="Prompt for the Ollama model")
     model: str = Field(..., description="Model to use for the Ollama endpoint")
-    file: str = Field(..., description="Base64 encoded PDF file")
+    file: str = Field(..., description="Base64 encoded document file")
     ocr_cache: bool = Field(..., description="Enable OCR result caching")
     storage_profile: Optional[str] = Field('default', description="Storage profile to use")
     storage_filename: Optional[str] = Field(None, description="Storage filename to use")
@@ -137,7 +137,7 @@ def validate_storage_profile(cls, v):
 @app.post("/ocr/request")
 async def ocr_request_endpoint(request: OcrRequest):
     """
-    Endpoint to extract text from an uploaded PDF file using different OCR strategies.
+    Endpoint to extract text from an uploaded PDF/Office/Image file using different OCR strategies.
     Supports both synchronous and asynchronous processing.
     """
     # Validate input
diff --git a/app/tasks.py b/app/tasks.py
@@ -21,7 +21,7 @@
 @celery.task(bind=True)
 def ocr_task(self, pdf_bytes, strategy_name, pdf_filename, pdf_hash, ocr_cache, prompt, model, storage_profile, storage_filename=None):
     """
-    Celery task to perform OCR processing on a PDF file.
+    Celery task to perform OCR processing on a PDF/Office/image file.
     """
     start_time = time.time()
     if strategy_name not in OCR_STRATEGIES:
diff --git a/utils/marker_cli.py b/utils/marker_cli.py
@@ -3,8 +3,8 @@
 import argparse
 
 if __name__ == "__main__":
-    parser = argparse.ArgumentParser(description="Process a PDF file.")
-    parser.add_argument("file", type=str, nargs='?', default="../examples/example-mri.pdf", help="The path to the PDF file to be processed.")
+    parser = argparse.ArgumentParser(description="Process a PDF/Office/Image file.")
+    parser.add_argument("file", type=str, nargs='?', default="../examples/example-mri.pdf", help="The path to the PDF/Office/Image file to be processed.")
     args = parser.parse_args()
 
     model_lst = load_all_models()