|
17 | 17 |
|
18 | 18 | - 📄 **Extract text content** from PDF files (full document or specific pages) |
19 | 19 | - 🖼️ **Extract embedded images** from PDF pages as base64-encoded data |
| 20 | +- 📐 **Preserve content order** - Text and images returned in exact document layout order (NEW in v1.2.0) |
20 | 21 | - 📊 **Get metadata** (author, title, creation date, etc.) |
21 | 22 | - 🔢 **Count pages** in PDF documents |
22 | 23 | - 🌐 **Support for both local files and URLs** |
|
27 | 28 |
|
28 | 29 | ## 🆕 Recent Updates (October 2025) |
29 | 30 |
|
| 31 | +### v1.2.0 - Content Ordering (Latest) |
| 32 | +- ✅ **Y-Coordinate Based Ordering**: Text and images returned in exact document order |
| 33 | +- ✅ **Natural Reading Flow**: Content parts preserve the layout sequence as it appears in PDF |
| 34 | +- ✅ **Intelligent Grouping**: Automatically groups text items on the same line |
| 35 | +- ✅ **Optimized for AI**: Enables AI models to understand content in natural reading order |
| 36 | + |
| 37 | +### v1.1.0 - Image Extraction |
| 38 | +- ✅ **Image Extraction**: Extract embedded images from PDF pages as base64-encoded data |
| 39 | +- ✅ **Performance Optimization**: Parallel page processing for 5-10x speedup |
| 40 | +- ✅ **Deep Refactoring**: Modular architecture with 98.9% test coverage (91 tests) |
| 41 | + |
| 42 | +### Previous Updates |
30 | 43 | - ✅ **Fixed critical bugs**: Buffer/Uint8Array compatibility for PDF.js v5.x |
31 | 44 | - ✅ **Fixed schema validation**: Resolved `exclusiveMinimum` issue affecting Windsurf, Mistral API, and other tools |
32 | 45 | - ✅ **Improved metadata extraction**: Robust fallback handling for PDF.js compatibility |
33 | 46 | - ✅ **Updated dependencies**: All packages updated to latest versions |
34 | 47 | - ✅ **Migrated to Biome**: 50x faster linting and formatting with unified tooling |
35 | | -- ✅ **Added image extraction**: Extract embedded images from PDF pages |
36 | | -- ✅ **Performance optimization**: Parallel page processing for 5-10x speedup |
37 | | -- ✅ **Deep refactoring**: Modular architecture with 98.9% test coverage (90 tests) |
38 | 48 |
|
39 | 49 | ## 📦 Installation |
40 | 50 |
|
@@ -226,6 +236,46 @@ Extract embedded images from PDF pages as base64-encoded data: |
226 | 236 | - 🔸 Set `include_images: false` (default) to extract text only |
227 | 237 | - 🔸 Combine with `pages` parameter to limit extraction scope |
228 | 238 |
|
| 239 | +### Content Ordering (NEW in v1.2.0) |
| 240 | + |
| 241 | +**Text and images are now returned in exact document order!** |
| 242 | + |
| 243 | +The server uses Y-coordinates from PDF.js to preserve the natural reading flow of the document. This means AI models receive content parts in the same sequence as they appear on the page. |
| 244 | + |
| 245 | +**Example document layout**: |
| 246 | +``` |
| 247 | +Page 1: |
| 248 | + [Heading text] |
| 249 | + [Image: Chart] |
| 250 | + [Description text] |
| 251 | + [Image: Photo A] |
| 252 | + [Image: Photo B] |
| 253 | + [Conclusion text] |
| 254 | +``` |
| 255 | + |
| 256 | +**Content parts returned**: |
| 257 | +``` |
| 258 | +[ |
| 259 | + { type: "text", text: "Heading text" }, |
| 260 | + { type: "image", data: "base64..." }, // Chart |
| 261 | + { type: "text", text: "Description text" }, |
| 262 | + { type: "image", data: "base64..." }, // Photo A |
| 263 | + { type: "image", data: "base64..." }, // Photo B |
| 264 | + { type: "text", text: "Conclusion text" } |
| 265 | +] |
| 266 | +``` |
| 267 | + |
| 268 | +**Benefits**: |
| 269 | +- ✅ AI understands context between text and images |
| 270 | +- ✅ Natural reading flow preserved |
| 271 | +- ✅ Better comprehension for complex documents |
| 272 | +- ✅ Automatic line grouping for multi-line text blocks |
| 273 | + |
| 274 | +**When is ordering applied?** |
| 275 | +- Automatically enabled when `include_images: true` |
| 276 | +- Works with both specific pages and full document extraction |
| 277 | +- Content on each page is independently sorted by Y-position |
| 278 | + |
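For readers of this change, the Y-coordinate ordering described above can be sketched roughly as follows. This is a minimal illustration, not the server's actual implementation: the `PositionedItem` and `ContentPart` types, the `orderPageContent` function, and the `LINE_TOLERANCE` value are all hypothetical names chosen for the example. It assumes each item already carries an `x`/`y` position in PDF user space, where the origin is at the bottom-left, so a larger `y` sits higher on the page.

```typescript
// Hypothetical shapes for illustration; the server's real types may differ.
type PositionedItem =
  | { type: "text"; text: string; x: number; y: number }
  | { type: "image"; data: string; x: number; y: number };

type ContentPart =
  | { type: "text"; text: string }
  | { type: "image"; data: string };

type TextItem = Extract<PositionedItem, { type: "text" }>;

// Assumed tolerance (in PDF units) for treating text items as one line.
const LINE_TOLERANCE = 2;

function orderPageContent(items: PositionedItem[]): ContentPart[] {
  // Larger y is higher on the page, so sort descending for top-to-bottom order.
  const sorted = [...items].sort((a, b) => b.y - a.y);
  const parts: ContentPart[] = [];
  let line: TextItem[] = [];

  // Emit the text items collected for the current line, left to right.
  const flushLine = (): void => {
    if (line.length === 0) return;
    const text = [...line]
      .sort((a, b) => a.x - b.x)
      .map((item) => item.text)
      .join(" ");
    parts.push({ type: "text", text });
    line = [];
  };

  for (const item of sorted) {
    if (item.type === "image") {
      // An image ends the current text line and is emitted in place.
      flushLine();
      parts.push({ type: "image", data: item.data });
      continue;
    }
    // Start a new line when this item sits too far below the line's first item.
    if (line.length > 0 && Math.abs(line[0].y - item.y) > LINE_TOLERANCE) {
      flushLine();
    }
    line.push(item);
  }
  flushLine();
  return parts;
}
```

In this sketch each visual line becomes its own text part, and images are emitted between the lines they fall between. A pure Y sort like this would interleave columns on multi-column pages, so a real implementation may need additional handling there.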
229 | 279 | ### Security: Relative Paths Only |
230 | 280 |
|
231 | 281 | **Important:** The server only accepts **relative paths** for security reasons. Absolute paths are blocked to prevent unauthorized file system access. |
|