 -`--media`: Load in Whisper model to transcribe audio and video files.
 -`--web`: Set up selenium crawler.
 
+Download Models:
+If you want to download the models before starting the server
+
+```bash
+python download.py --documents --media --web
+```
+
+-`--documents`: Load in all the models that help you parse and ingest documents (Surya OCR series of models and Florence-2).
+-`--media`: Load in Whisper model to transcribe audio and video files.
+-`--web`: Set up selenium crawler.
+
 ## Supported Data Types
 
 | Type | Supported Extensions |
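The `download.py` invocation added in this hunk takes three switches. The sketch below shows one way such on/off flags could be parsed with `argparse`; it is an illustration only and not the project's actual `download.py`. The function names `parse_download_flags` and `selected_groups` are hypothetical, and the sketch assumes each flag is an independent boolean that can be passed on its own.

```python
import argparse


def parse_download_flags(argv):
    """Parse the three switches shown in the README (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Illustrative sketch of a model pre-download CLI"
    )
    # Each switch is off by default and turned on only when passed.
    parser.add_argument("--documents", action="store_true",
                        help="document parsing models (Surya OCR, Florence-2)")
    parser.add_argument("--media", action="store_true",
                        help="Whisper model for audio and video transcription")
    parser.add_argument("--web", action="store_true",
                        help="selenium crawler setup")
    return parser.parse_args(argv)


def selected_groups(args):
    """Return the names of the enabled groups, in a fixed order."""
    return [name for name in ("documents", "media", "web") if getattr(args, name)]


# Example: request only the document and media models.
print(selected_groups(parse_download_flags(["--documents", "--media"])))
```

Under that assumption, passing a subset of flags would select only the matching groups, which is useful when a machine only needs, say, document parsing.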
@@ -280,14 +291,16 @@ Arguments:
 ## Limitations
 There is a need for a GPU with 8~10 GB minimum VRAM as we are using deep learning models.
 \
+
 Document Parsing Limitations
 \
-[Marker](https://github.com/VikParuchuri/marker) which is the underlying PDF parser will not convert 100% of equations to LaTeX because it has to detect and then convert them.
-Tables are not always formatted 100% correctly; text can be in the wrong column.
-Whitespace and indentations are not always respected.
-Not all lines/spans will be joined properly.
-This works best on digital PDFs that won't require a lot of OCR. It's optimized for speed, and limited OCR is used to fix errors.
-To fit all the models in the GPU, we are using the smallest variants, which might not offer the best-in-class performance.
+-[Marker](https://github.com/VikParuchuri/marker) which is the underlying PDF parser will not convert 100% of equations to LaTeX because it has to detect and then convert them.
+- It is good at parsing english but might struggle for languages such as Chinese
+- Tables are not always formatted 100% correctly; text can be in the wrong column.
+- Whitespace and indentations are not always respected.
+- Not all lines/spans will be joined properly.
+- This works best on digital PDFs that won't require a lot of OCR. It's optimized for speed, and limited OCR is used to fix errors.
+- To fit all the models in the GPU, we are using the smallest variants, which might not offer the best-in-class performance.
 
 ## License
 OmniParse is licensed under the GPL-3.0 license. See `LICENSE` for more information.
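
The Limitations section in this hunk states a minimum of 8~10 GB of GPU VRAM. A quick pre-flight check along these lines might look like the sketch below. It is not part of OmniParse: the helper names are hypothetical, "GB" is assumed to mean 10^9 bytes, and detection assumes PyTorch is installed and reads only the first CUDA device.

```python
GB = 10 ** 9  # assumption: the README's "GB" is read as decimal gigabytes


def meets_vram_requirement(total_bytes, min_gb=8.0):
    """Return True if the given VRAM size reaches the stated minimum."""
    return total_bytes >= min_gb * GB


def detect_total_vram_bytes():
    """Return total VRAM of CUDA device 0, or None if it cannot be determined."""
    try:
        import torch  # optional dependency; absent on CPU-only setups
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_properties(0).total_memory


total = detect_total_vram_bytes()
if total is None:
    print("No CUDA GPU detected (or PyTorch is not installed).")
elif meets_vram_requirement(total):
    print(f"GPU VRAM looks sufficient: {total / GB:.1f} GB")
else:
    print(f"GPU VRAM may be too small: {total / GB:.1f} GB")
```

Passing `min_gb=10` checks against the upper end of the stated range instead of the lower one.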