Commit eb4e3d3

Merge pull request #53 from CatchTheTornado/fix_46
Project rename
2 parents fbd31e1 + 486b7f5 commit eb4e3d3

4 files changed: 26 additions and 26 deletions

README.md

Lines changed: 17 additions & 17 deletions
@@ -1,17 +1,17 @@
-# pdf-extract-api
+# text-extract-api
 
-Convert any image or PDF to Markdown *text* or JSON structured document with super-high accuracy, including tabular data, numbers or math formulas.
+Convert any image, PDF or Office document to Markdown *text* or JSON structured document with super-high accuracy, including tabular data, numbers or math formulas.
 
 The API is built with FastAPI and uses Celery for asynchronous task processing. Redis is used for caching OCR results.
 
 ![hero doc extract](ocr-hero.webp)
 
 ## Features:
 - **No Cloud/external dependencies** all you need: PyTorch based OCR (Marker) + Ollama are shipped and configured via `docker-compose` no data is sent outside your dev/server environment,
-- **PDF to Markdown** conversion with very high accuracy using different OCR strategies including [marker](https://github.com/VikParuchuri/marker) and [llama3.2-vision](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/), [surya-ocr](https://github.com/VikParuchuri/surya) or [tessereact](https://github.com/h/pytesseract)
-- **PDF to JSON** conversion using Ollama supported models (eg. LLama 3.1)
+- **PDF/Office to Markdown** conversion with very high accuracy using different OCR strategies including [marker](https://github.com/VikParuchuri/marker) and [llama3.2-vision](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/), [surya-ocr](https://github.com/VikParuchuri/surya) or [tessereact](https://github.com/h/pytesseract)
+- **PDF/Office to JSON** conversion using Ollama supported models (eg. LLama 3.1)
 - **LLM Improving OCR results** LLama is pretty good with fixing spelling and text issues in the OCR text
-- **Removing PII** This tool can be used for removing Personally Identifiable Information out of PDF - see `examples`
+- **Removing PII** This tool can be used for removing Personally Identifiable Information out of document - see `examples`
 - **Distributed queue processing** using [Celery](https://docs.celeryq.dev/en/stable/getting-started/introduction.html))
 - **Caching** using Redis - the OCR results can be easily cached prior to LLM processing,
 - **Storage Strategies** switchable storage strategies (Google Drive, Local File System ...)
@@ -39,7 +39,7 @@ Before running the example see [getting started](#getting-started)
 
 ![Converting Invoice to JSON](./screenshots/example-2.png)
 
-**Note:** As you may observe in the example above, `marker-pdf` sometimes mismatches the cols and rows which could have potentially great impact on data accuracy. To improve on it there is a feature request [#3](https://github.com/CatchTheTornado/pdf-extract-api/issues/3) for adding alternative support for [`tabled`](https://github.com/VikParuchuri/tabled) model - which is optimized for tables.
+**Note:** As you may observe in the example above, `marker-pdf` sometimes mismatches the cols and rows which could have potentially great impact on data accuracy. To improve on it there is a feature request [#3](https://github.com/CatchTheTornado/text-extract-api/issues/3) for adding alternative support for [`tabled`](https://github.com/VikParuchuri/tabled) model - which is optimized for tables.
 
 ## Getting started
 
@@ -71,7 +71,7 @@ chmod +x run.sh
 run.sh
 ```
 
-This command will install all the dependencies - including Redis (via Docker, so it is not entirely docker free method of running `pdf-extract-api` anyways :)
+This command will install all the dependencies - including Redis (via Docker, so it is not entirely docker free method of running `text-extract-api` anyways :)
 
 Then you're good to go with running some CLI commands like:
 
@@ -105,7 +105,7 @@ export RESULT_URL=https://doctractor:Aekie2ao@api.doctractor.com/ocr/result/
 python client/cli.py ocr_upload --file examples/example-mri.pdf --ocr_cache --prompt_file=examples/example-mri-remove-pii.txt
 ```
 
-[Demo Source code](https://github.com/CatchTheTornado/pdf-extract-api-demo)
+[Demo Source code](https://github.com/CatchTheTornado/text-extract-api-demo)
 
 **Note:** In the free demo we don't guarantee any processing times. The API is Open so please do **not send any secret documents neither any documents containing personal information**, If you do - you're doing it on your own risk and responsiblity.
 
@@ -123,8 +123,8 @@ python client/cli.py ocr_upload --file examples/example-mri.pdf --ocr_cache --pr
 ### Clone the Repository
 
 ```sh
-git clone https://github.com/CatchTheTornado/pdf-extract-api.git
-cd pdf-extract-api
+git clone https://github.com/CatchTheTornado/text-extract-api.git
+cd text-extract-api
 ```
 
 ### Setup environmental variables
@@ -194,7 +194,7 @@ This will start the following services:
 
 ## Cloud - paid edition
 
-If the on-prem is too much hassle [ask us about the hosted/cloud edition](mailto:info@catchthetornado.com?subject=pdf-extract-api%20but%20hosted) of pdf-extract-api, we can setup it you, billed just for the usage.
+If the on-prem is too much hassle [ask us about the hosted/cloud edition](mailto:info@catchthetornado.com?subject=text-extract-api%20but%20hosted) of text-extract-api, we can setup it you, billed just for the usage.
 
 ## CLI tool
 
@@ -225,7 +225,7 @@ python client/cli.py llm_pull --model llama3.1
 python client/cli.py llm_pull --model llama3.2-vision
 ```
 
-These models are required for most features supported by `pdf-extract-api`.
+These models are required for most features supported by `text-extract-api`.
 
 
 ### Upload a File for OCR (converting to Markdown)
@@ -321,20 +321,20 @@ python llm_generate --prompt "Your prompt here"
 
 ## API Clients
 
-You might want to use the decdicated API clients to use `pdf-extract-api`
+You might want to use the decdicated API clients to use `text-extract-api`
 
 ### Typescript
 
-There's a dedicated API client for Typescript - [pdf-extract-api-client](https://github.com/CatchTheTornado/pdf-extract-api-client) and the `npm` package by the same name:
+There's a dedicated API client for Typescript - [text-extract-api-client](https://github.com/CatchTheTornado/text-extract-api-client) and the `npm` package by the same name:
 
 ```bash
-npm install pdf-extract-api-client
+npm install text-extract-api-client
 ```
 
 Usage:
 
 ```js
-import { ApiClient, OcrRequest } from 'pdf-extract-api-client';
+import { ApiClient, OcrRequest } from 'text-extract-api-client';
 const apiClient = new ApiClient('https://api.doctractor.com/', 'doctractor', 'Aekie2ao');
 const formData = new FormData();
 formData.append('file', fileInput.files[0]);
@@ -354,7 +354,7 @@ apiClient.uploadFile(formData).then(response => {
 - **URL**: /ocr/upload
 - **Method**: POST
 - **Parameters**:
-  - **file**: PDF file to be processed.
+  - **file**: PDF, image or Office file to be processed.
   - **strategy**: OCR strategy to use (`marker`, `llama_vision` or `tesseract`).
   - **ocr_cache**: Whether to cache the OCR result (true or false).
   - **prompt**: When provided, will be used for Ollama processing the OCR result
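
The `/ocr/upload` parameters listed in the hunk above can be exercised with a small client. The following is a minimal sketch, not the project's own client: the helper names `build_upload_fields` and `upload` are hypothetical, the base URL is whatever your deployment uses, and it assumes the endpoint accepts `ocr_cache` as the strings `true`/`false` and answers with JSON. The upload itself needs the third-party `requests` package.

```python
from typing import Optional

# Strategy names taken from the README parameter description above.
DOCUMENTED_STRATEGIES = ("marker", "llama_vision", "tesseract")


def build_upload_fields(strategy: str, ocr_cache: bool, prompt: Optional[str] = None) -> dict:
    """Build the non-file form fields for a POST to /ocr/upload (hypothetical helper)."""
    if strategy not in DOCUMENTED_STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy!r}")
    # Assumption: the boolean is sent as the form string "true" or "false".
    fields = {"strategy": strategy, "ocr_cache": "true" if ocr_cache else "false"}
    if prompt is not None:
        fields["prompt"] = prompt
    return fields


def upload(base_url: str, path: str, **kwargs) -> dict:
    """Send the document as multipart form data (hypothetical helper)."""
    import requests  # imported lazily so build_upload_fields has no dependencies

    with open(path, "rb") as fh:
        response = requests.post(
            f"{base_url.rstrip('/')}/ocr/upload",
            data=build_upload_fields(**kwargs),
            files={"file": fh},
        )
    response.raise_for_status()
    return response.json()  # assumption: the endpoint returns a JSON body
```

A call might look like `upload("http://localhost:8000", "examples/example-mri.pdf", strategy="marker", ocr_cache=True)`, where the host and port are assumptions about a local deployment.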

app/main.py

Lines changed: 6 additions & 6 deletions
@@ -36,7 +36,7 @@ async def ocr_endpoint(
     storage_filename: str = Form(None)
 ):
     """
-    Endpoint to extract text from an uploaded PDF file using different OCR strategies.
+    Endpoint to extract text from an uploaded PDF, Image or Office file using different OCR strategies.
     Supports both synchronous and asynchronous processing.
     """
     # Validate input
@@ -50,10 +50,10 @@ async def ocr_endpoint(
 
     pdf_bytes = await file.read()
 
-    # Generate a hash of the PDF content for caching
+    # Generate a hash of the document content for caching
     pdf_hash = md5(pdf_bytes).hexdigest()
 
-    print(f"Processing PDF {file.filename} with strategy: {strategy}, ocr_cache: {ocr_cache}, model: {model}, storage_profile: {storage_profile}, storage_filename: {storage_filename}")
+    print(f"Processing Document {file.filename} with strategy: {strategy}, ocr_cache: {ocr_cache}, model: {model}, storage_profile: {storage_profile}, storage_filename: {storage_filename}")
 
     # Asynchronous processing using Celery
     task = ocr_task.apply_async(args=[pdf_bytes, strategy, file.filename, pdf_hash, ocr_cache, prompt, model, storage_profile, storage_filename])
@@ -71,7 +71,7 @@ async def ocr_upload_endpoint(
     storage_filename: str = Form(None)
 ):
     """
-    Alias endpoint to extract text from an uploaded PDF file using different OCR strategies.
+    Alias endpoint to extract text from an uploaded PDF/Office/Image file using different OCR strategies.
     Supports both synchronous and asynchronous processing.
     """
     return await ocr_endpoint(strategy, prompt, model, file, ocr_cache, storage_profile, storage_filename)
@@ -87,7 +87,7 @@ class OcrRequest(BaseModel):
     strategy: str = Field(..., description="OCR strategy to use")
     prompt: Optional[str] = Field(None, description="Prompt for the Ollama model")
     model: str = Field(..., description="Model to use for the Ollama endpoint")
-    file: str = Field(..., description="Base64 encoded PDF file")
+    file: str = Field(..., description="Base64 encoded document file")
     ocr_cache: bool = Field(..., description="Enable OCR result caching")
     storage_profile: Optional[str] = Field('default', description="Storage profile to use")
     storage_filename: Optional[str] = Field(None, description="Storage filename to use")
@@ -137,7 +137,7 @@ def validate_storage_profile(cls, v):
 @app.post("/ocr/request")
 async def ocr_request_endpoint(request: OcrRequest):
     """
-    Endpoint to extract text from an uploaded PDF file using different OCR strategies.
+    Endpoint to extract text from an uploaded PDF/Office/Image file using different OCR strategies.
     Supports both synchronous and asynchronous processing.
     """
     # Validate input
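
The `OcrRequest` model shown above takes the document as a Base64 string rather than a multipart upload. A minimal sketch of building such a JSON body follows; the field names come from the model in the diff, while the helper name `build_ocr_request` and the example values are hypothetical.

```python
import base64
import json
from typing import Optional


def build_ocr_request(
    doc_bytes: bytes,
    strategy: str,
    model: str,
    ocr_cache: bool,
    prompt: Optional[str] = None,
    storage_profile: str = "default",
    storage_filename: Optional[str] = None,
) -> str:
    """Serialize a JSON body matching the OcrRequest fields (hypothetical helper)."""
    payload = {
        "strategy": strategy,
        "prompt": prompt,
        "model": model,
        # The file travels as Base64 text so it fits inside a JSON document.
        "file": base64.b64encode(doc_bytes).decode("ascii"),
        "ocr_cache": ocr_cache,
        "storage_profile": storage_profile,
        "storage_filename": storage_filename,
    }
    return json.dumps(payload)
```

The resulting string can be sent as the body of a `POST` to `/ocr/request` with a `Content-Type: application/json` header.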

app/tasks.py

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@
 @celery.task(bind=True)
 def ocr_task(self, pdf_bytes, strategy_name, pdf_filename, pdf_hash, ocr_cache, prompt, model, storage_profile, storage_filename=None):
     """
-    Celery task to perform OCR processing on a PDF file.
+    Celery task to perform OCR processing on a PDF/Office/image file.
     """
     start_time = time.time()
     if strategy_name not in OCR_STRATEGIES:

utils/marker_cli.py

Lines changed: 2 additions & 2 deletions
@@ -3,8 +3,8 @@
 import argparse
 
 if __name__ == "__main__":
-    parser = argparse.ArgumentParser(description="Process a PDF file.")
-    parser.add_argument("file", type=str, nargs='?', default="../examples/example-mri.pdf", help="The path to the PDF file to be processed.")
+    parser = argparse.ArgumentParser(description="Process a PDF/Office/Image file.")
+    parser.add_argument("file", type=str, nargs='?', default="../examples/example-mri.pdf", help="The path to the PDF/Office/Image file to be processed.")
     args = parser.parse_args()
 
     model_lst = load_all_models()
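
The parser in this hunk declares `file` as an optional positional argument (`nargs='?'`) with a default path. As a small illustration of how that behaves, the sketch below wraps the same two parser calls in a hypothetical `parse_args` function so it can be called with an explicit argument list; the wrapper itself is not part of the commit.

```python
import argparse


def parse_args(argv=None):
    """Parse the marker CLI arguments; argv=None falls back to sys.argv (hypothetical wrapper)."""
    parser = argparse.ArgumentParser(description="Process a PDF/Office/Image file.")
    # nargs='?' makes the positional optional: when it is omitted, the default is used.
    parser.add_argument(
        "file",
        type=str,
        nargs="?",
        default="../examples/example-mri.pdf",
        help="The path to the PDF/Office/Image file to be processed.",
    )
    return parser.parse_args(argv)
```

Calling `parse_args([])` yields the bundled example path, while `parse_args(["report.docx"])` yields the path given on the command line.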
