Commit 3f36113

docs: update README with v1.2.0 content ordering feature
- Add Content Ordering section explaining Y-coordinate based sorting
- Update Recent Updates section with v1.2.0 highlights
- Add content ordering to Features list
- Include practical examples of content part ordering
- Document benefits for AI comprehension
1 parent 86a13ae commit 3f36113

1 file changed: +53 -3 lines changed

README.md

Lines changed: 53 additions & 3 deletions
@@ -17,6 +17,7 @@
 
 - 📄 **Extract text content** from PDF files (full document or specific pages)
 - 🖼️ **Extract embedded images** from PDF pages as base64-encoded data
+- 📐 **Preserve content order** - Text and images returned in exact document layout order (NEW v1.2.0)
 - 📊 **Get metadata** (author, title, creation date, etc.)
 - 🔢 **Count pages** in PDF documents
 - 🌐 **Support for both local files and URLs**
@@ -27,14 +28,23 @@
 
 ## 🆕 Recent Updates (October 2025)
 
+### v1.2.0 - Content Ordering (Latest)
+- **Y-Coordinate Based Ordering**: Text and images returned in exact document order
+- **Natural Reading Flow**: Content parts preserve the layout sequence as it appears in PDF
+- **Intelligent Grouping**: Automatically groups text items on the same line
+- **Optimized for AI**: Enables AI models to understand content in natural reading order
+
+### v1.1.0 - Image Extraction
+- **Image Extraction**: Extract embedded images from PDF pages as base64-encoded data
+- **Performance Optimization**: Parallel page processing for 5-10x speedup
+- **Deep Refactoring**: Modular architecture with 98.9% test coverage (91 tests)
+
+### Previous Updates
 - **Fixed critical bugs**: Buffer/Uint8Array compatibility for PDF.js v5.x
 - **Fixed schema validation**: Resolved `exclusiveMinimum` issue affecting Windsurf, Mistral API, and other tools
 - **Improved metadata extraction**: Robust fallback handling for PDF.js compatibility
 - **Updated dependencies**: All packages updated to latest versions
 - **Migrated to Biome**: 50x faster linting and formatting with unified tooling
-- **Added image extraction**: Extract embedded images from PDF pages
-- **Performance optimization**: Parallel page processing for 5-10x speedup
-- **Deep refactoring**: Modular architecture with 98.9% test coverage (90 tests)
 
 ## 📦 Installation
 

@@ -226,6 +236,46 @@ Extract embedded images from PDF pages as base64-encoded data:
 - 🔸 Set `include_images: false` (default) to extract text only
 - 🔸 Combine with `pages` parameter to limit extraction scope
 
+### Content Ordering (NEW in v1.2.0)
+
+**Text and images are now returned in exact document order!**
+
+The server uses Y-coordinates from PDF.js to preserve the natural reading flow of the document. This means AI models receive content parts in the same sequence as they appear on the page.
+
+**Example document layout**:
+```
+Page 1:
+  [Heading text]
+  [Image: Chart]
+  [Description text]
+  [Image: Photo A]
+  [Image: Photo B]
+  [Conclusion text]
+```
+
+**Content parts returned**:
+```
+[
+  { type: "text", text: "Heading text" },
+  { type: "image", data: "base64..." },  // Chart
+  { type: "text", text: "Description text" },
+  { type: "image", data: "base64..." },  // Photo A
+  { type: "image", data: "base64..." },  // Photo B
+  { type: "text", text: "Conclusion text" }
+]
+```
+
+**Benefits**:
+- ✅ AI understands context between text and images
+- ✅ Natural reading flow preserved
+- ✅ Better comprehension for complex documents
+- ✅ Automatic line grouping for multi-line text blocks
+
+**When is ordering applied?**
+- Automatically enabled when `include_images: true`
+- Works with both specific pages and full document extraction
+- Content on each page is independently sorted by Y-position
+
 ### Security: Relative Paths Only
 
 **Important:** The server only accepts **relative paths** for security reasons. Absolute paths are blocked to prevent unauthorized file system access.
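
The ordering behavior documented in this commit (sort a page's content by Y-position, group text items that share a line, emit text and images in sequence) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the `PositionedItem` and `ContentPart` types, the `orderPageContent` function, the `lineTolerance` default, and the left-to-right ordering within a line are all hypothetical. It also assumes PDF user-space coordinates, where y grows upward, so a larger y means higher on the page.

```typescript
// Hypothetical input shape: one extracted item with its page position.
type PositionedItem =
  | { kind: "text"; text: string; x: number; y: number }
  | { kind: "image"; data: string; x: number; y: number };

// Output shape mirroring the content parts shown in the README example.
type ContentPart =
  | { type: "text"; text: string }
  | { type: "image"; data: string };

// Orders the items of a single page top-to-bottom and merges text items
// whose y values differ by at most `lineTolerance` into one text part.
function orderPageContent(
  items: PositionedItem[],
  lineTolerance = 2, // assumed tolerance in PDF units, chosen for illustration
): ContentPart[] {
  // Larger y first, because y is assumed to grow upward from the page bottom.
  const sorted = [...items].sort((a, b) => b.y - a.y);

  const parts: ContentPart[] = [];
  let line: { text: string; x: number; y: number }[] = [];

  // Emits the pending line as one text part, ordered left to right.
  const flushLine = (): void => {
    if (line.length === 0) return;
    const text = line
      .sort((a, b) => a.x - b.x)
      .map((t) => t.text)
      .join(" ");
    parts.push({ type: "text", text });
    line = [];
  };

  for (const item of sorted) {
    if (item.kind === "image") {
      // An image ends the current text line and becomes its own part.
      flushLine();
      parts.push({ type: "image", data: item.data });
    } else {
      // Start a new line when this item sits too far below the line's first item.
      if (line.length > 0 && Math.abs(line[0].y - item.y) > lineTolerance) {
        flushLine();
      }
      line.push(item);
    }
  }
  flushLine();
  return parts;
}
```

Calling `orderPageContent` once per page matches the README's note that each page is sorted independently. A fixed tolerance is the simplest grouping rule; a real implementation might derive it from font size instead.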
