README.md: 17 additions & 17 deletions
--- a/README.md
+++ b/README.md
@@ -1,17 +1,17 @@
-# pdf-extract-api
+# text-extract-api

-Convert any image or PDF to Markdown *text* or JSON structured document with super-high accuracy, including tabular data, numbers or math formulas.
+Convert any image, PDF or Office document to Markdown *text* or JSON structured document with super-high accuracy, including tabular data, numbers or math formulas.

 The API is built with FastAPI and uses Celery for asynchronous task processing. Redis is used for caching OCR results.



 ## Features:
 - **No Cloud/external dependencies** all you need: PyTorch-based OCR (Marker) + Ollama are shipped and configured via `docker-compose`; no data is sent outside your dev/server environment,
-- **PDF to Markdown** conversion with very high accuracy using different OCR strategies including [marker](https://github.com/VikParuchuri/marker) and [llama3.2-vision](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/), [surya-ocr](https://github.com/VikParuchuri/surya) or [tesseract](https://github.com/h/pytesseract)
-- **PDF to JSON** conversion using Ollama supported models (e.g. Llama 3.1)
+- **PDF/Office to Markdown** conversion with very high accuracy using different OCR strategies including [marker](https://github.com/VikParuchuri/marker) and [llama3.2-vision](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/), [surya-ocr](https://github.com/VikParuchuri/surya) or [tesseract](https://github.com/h/pytesseract)
+- **PDF/Office to JSON** conversion using Ollama supported models (e.g. Llama 3.1)
 - **LLM Improving OCR results** Llama is pretty good at fixing spelling and text issues in the OCR text
-- **Removing PII** This tool can be used for removing Personally Identifiable Information from PDFs - see `examples`
+- **Removing PII** This tool can be used for removing Personally Identifiable Information from documents - see `examples`
 - **Distributed queue processing** using [Celery](https://docs.celeryq.dev/en/stable/getting-started/introduction.html)
 - **Caching** using Redis - the OCR results can be easily cached prior to LLM processing,
 - **Storage Strategies** switchable storage strategies (Google Drive, Local File System ...)
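
The **Caching** feature above (OCR results cached before LLM processing) can be illustrated with a minimal sketch. This is not the project's actual code: `OcrCache` is a hypothetical in-memory stand-in for Redis, and `cache_key`, `ocr_with_cache` and the `run_ocr` callback are names assumed here purely for illustration.

```python
import hashlib
from typing import Callable, Dict, Optional


class OcrCache:
    """Hypothetical in-memory stand-in for a Redis client (get/set only)."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._store.get(key)

    def set(self, key: str, value: str) -> None:
        self._store[key] = value


def cache_key(file_bytes: bytes, strategy: str) -> str:
    # Key on the file content and the OCR strategy, so the same document
    # processed with a different strategy is cached separately.
    digest = hashlib.sha256(file_bytes).hexdigest()
    return f"ocr:{strategy}:{digest}"


def ocr_with_cache(
    file_bytes: bytes,
    strategy: str,
    run_ocr: Callable[[bytes], str],
    cache: OcrCache,
) -> str:
    """Return cached OCR text if present; otherwise run OCR and store it."""
    key = cache_key(file_bytes, strategy)
    cached = cache.get(key)
    if cached is not None:
        return cached
    text = run_ocr(file_bytes)  # the expensive step we want to avoid repeating
    cache.set(key, text)
    return text
```

With this shape, re-running an LLM prompt over the same document would reuse the stored OCR text instead of repeating the OCR step; a real deployment would presumably swap `OcrCache` for a Redis client.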
@@ -39,7 +39,7 @@ Before running the example see [getting started](#getting-started)



-**Note:** As you may observe in the example above, `marker-pdf` sometimes mismatches the columns and rows, which could have a potentially great impact on data accuracy. To improve on this, there is a feature request [#3](https://github.com/CatchTheTornado/pdf-extract-api/issues/3) for adding alternative support for the [`tabled`](https://github.com/VikParuchuri/tabled) model, which is optimized for tables.
+**Note:** As you may observe in the example above, `marker-pdf` sometimes mismatches the columns and rows, which could have a potentially great impact on data accuracy. To improve on this, there is a feature request [#3](https://github.com/CatchTheTornado/text-extract-api/issues/3) for adding alternative support for the [`tabled`](https://github.com/VikParuchuri/tabled) model, which is optimized for tables.

 ## Getting started

@@ -71,7 +71,7 @@ chmod +x run.sh
 run.sh
 ```

-This command will install all the dependencies - including Redis (via Docker, so it is not an entirely Docker-free method of running `pdf-extract-api` anyway :)
+This command will install all the dependencies - including Redis (via Docker, so it is not an entirely Docker-free method of running `text-extract-api` anyway :)

 Then you're good to go with running some CLI commands like:
 **Note:** In the free demo we don't guarantee any processing times. The API is open, so please do **not send any secret documents or any documents containing personal information**. If you do, you're doing it at your own risk and responsibility.
@@ -194,7 +194,7 @@ This will start the following services:

 ## Cloud - paid edition

-If the on-prem is too much hassle, [ask us about the hosted/cloud edition](mailto:info@catchthetornado.com?subject=pdf-extract-api%20but%20hosted) of pdf-extract-api; we can set it up for you, billed just for the usage.
+If the on-prem is too much hassle, [ask us about the hosted/cloud edition](mailto:info@catchthetornado.com?subject=text-extract-api%20but%20hosted) of text-extract-api; we can set it up for you, billed just for the usage.
-You might want to use the dedicated API clients to use `pdf-extract-api`
+You might want to use the dedicated API clients to use `text-extract-api`

 ### Typescript

-There's a dedicated API client for Typescript - [pdf-extract-api-client](https://github.com/CatchTheTornado/pdf-extract-api-client) and the `npm` package by the same name:
+There's a dedicated API client for Typescript - [text-extract-api-client](https://github.com/CatchTheTornado/text-extract-api-client) and the `npm` package by the same name: