|`files` (_shared.Files_) |`files` (_File_, _Blob_, _shared.Files_) | The file to process. |
14
+
| `files` (_shared.Files_) | `files` (_File_, _Blob_, _shared.Files_) | The file to process.
15
+
|`chunking_strategy` (_str_) |`chunkingStrategy` (_string_) | Use one of the supported strategies to chunk the returned elements after partitioning. When `chunking_strategy` is not specified, no chunking is performed and any other chunking parameters provided are ignored. Supported strategies: `"basic"`, `"by_title"`, `"by_page"`, `"by_similarity"`|
15
16
|`coordinates` (_bool_) |`coordinates` (_boolean_) | If true, return bounding box coordinates for each element extracted via OCR. Default: false |
16
17
|`encoding` (_str_) |`encoding` (_string_) | The encoding method used to decode the text input. Default: `utf-8`|
17
18
|`extract_image_block_types` (_List[str]_) |`extractImageBlockTypes` (_string[]_) | The types of elements to extract, for use in extracting image blocks as base64 encoded data stored in metadata fields |
@@ -25,8 +26,7 @@ The only required parameter is `files` - the file you wish to process.
25
26
|`split_pdf_page` (_bool_) |`splitPdfPage` (_boolean_) | Should the pdf file be split at client. Ignored on backend. |
26
27
|`strategy` (_str_) |`strategy` (_string_) | The strategy to use for partitioning PDF/image. Options are `fast`, `hi_res`, `auto`. Default: `auto`|
27
28
|`unique_element_ids` (_bool_) |`uniqueElementIds` (_boolean_) | When True, assign UUIDs to element IDs, which guarantees their uniqueness (useful when using them as primary keys in database). Otherwise a SHA-256 of element text is used. Default: False |
28
-
|`xml_keep_tags` (_bool_) |`xmlKeepTags` (_boolean_) | If True, will retain the XML tags in the output. Otherwise it will simply extract the text from within the tags. Only applies to XML documents. |
29
-
|`chunking_strategy` (_str_) |`chunkingStrategy` (_string_) | Use one of the supported strategies to chunk the returned elements after partitioning. When `chunking_strategy` is not specified, no chunking is performed and any other chunking parameters provided are ignored. Supported strategies: `"basic"`, `"by_title"`|
29
+
|`xml_keep_tags` (_bool_) |`xmlKeepTags` (_boolean_) | If True, will retain the XML tags in the output. Otherwise it will simply extract the text from within the tags. Only applies to XML documents. ||
30
30
31
31
The following parameters only apply when a `chunking_strategy` is specified. Otherwise, they are ignored.
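As a rough illustration of the rule above, the sketch below models how chunking options could be gated on `chunking_strategy`. The helper `effective_chunking_options` and its dictionary-based interface are hypothetical and are not part of the Unstructured API or SDK; only the parameter names come from the documentation.

```python
from typing import Any, Dict, Optional

# Hypothetical helper: models the documented rule that chunking options
# are ignored unless a chunking strategy is specified.
CHUNKING_ONLY_KEYS = ("max_characters", "new_after_n_chars", "overlap", "overlap_all")


def effective_chunking_options(params: Dict[str, Any]) -> Dict[str, Any]:
    """Return the chunking options that would take effect for `params`."""
    strategy: Optional[str] = params.get("chunking_strategy")
    if not strategy:
        # No strategy: no chunking is performed, so every chunking option is dropped.
        return {}
    options = {key: params[key] for key in CHUNKING_ONLY_KEYS if key in params}
    options["chunking_strategy"] = strategy
    return options


ignored = effective_chunking_options({"max_characters": 500})
applied = effective_chunking_options(
    {"chunking_strategy": "by_title", "max_characters": 500}
)
print(ignored)  # {}
print(applied)  # {'max_characters': 500, 'chunking_strategy': 'by_title'}
```

The first call shows `max_characters` being dropped because no strategy was given; the second shows it taking effect once `chunking_strategy` is set.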
Deprecation Warning: The legacy method of making API calls is currently supported, but it may be deprecated and could break in future releases. We advise all users to migrate to the new `PartitionRequest` object introduced in v0.25.0 to ensure compatibility with future updates.
22
+
</Note>
23
+
<Note>
24
+
Deprecation Warning: Defining `strategy`, `chunking_strategy`, and `output_format` parameters as strings may also be deprecated and could break in future releases. It is also advised to use the new classes for defining those parameters. Ex: `shared.Strategy.HI_RES`
25
+
</Note>
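To illustrate why enum members such as `shared.Strategy.HI_RES` can be safer than raw strings, here is a minimal sketch using a stand-in enum. The `Strategy` class and the `parse_strategy` helper below are defined locally for illustration; they assume only that the SDK's enum names map to the string values listed in the parameter table (`fast`, `hi_res`, `auto`), and they are not the SDK's own classes.

```python
from enum import Enum


class Strategy(str, Enum):
    """Local stand-in for the SDK's strategy enum (values are assumed)."""
    FAST = "fast"
    HI_RES = "hi_res"
    AUTO = "auto"


def parse_strategy(value: str) -> Strategy:
    """Convert a legacy string into an enum member, rejecting typos early."""
    try:
        return Strategy(value)
    except ValueError:
        allowed = ", ".join(member.value for member in Strategy)
        raise ValueError(
            f"unknown strategy {value!r}; expected one of: {allowed}"
        ) from None


chosen = parse_strategy("hi_res")
print(chosen is Strategy.HI_RES)  # True

try:
    parse_strategy("hires")  # a typo that a plain string would carry along silently
except ValueError as err:
    error_message = str(err)
    print(error_message)
```

With an enum, a misspelled value fails at the point where it is written instead of surfacing later as a rejected or misinterpreted request.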
26
+
20
27
Let's start with a simple example in which you send a pdf document to partition via Unstructured API using the Python SDK:
21
28
22
29
```python
23
-
from unstructured_client import UnstructuredClient
30
+
import unstructured_client
31
+
from unstructured_client.models import operations, shared
32
+
33
+
client = unstructured_client.UnstructuredClient(
34
+
api_key_auth="YOUR_API_KEY",
35
+
# you may need to provide your unique API URL
36
+
# server_url="YOUR_API_URL",
37
+
)
38
+
39
+
filename ="sample-docs/layout-parser-paper.pdf"
40
+
file=open(filename, "rb")
41
+
42
+
res = client.general.partition(request=operations.PartitionRequest(
43
+
partition_parameters=shared.PartitionParameters(
44
+
# Note that this currently only supports a single file
open-source/core-functionality/chunking.mdx
+5 -26 (5 additions & 26 deletions)
@@ -1,5 +1,5 @@
1
1
---
2
-
ttile: Chunking
2
+
title: Chunking
3
3
description: Chunking functions in `unstructured` use metadata and document elements detected with `partition` functions to split a document into smaller parts for uses cases such as Retrieval Augmented Generation (RAG).
4
4
---
5
5
@@ -77,33 +77,12 @@ for chunk in chunks:
77
77
78
78
## Chunking Strategies
79
79
80
-
There are currently two chunking strategies, _basic_ and _by\_title_. The `by_title` strategy shares most behaviors with the basic strategy so we’ll describe the baseline strategy first:
80
+
There are currently two chunking strategies, _basic_ and _by\_title_. The `by_title` strategy shares most behaviors with
81
+
the basic strategy so we'll describe the baseline strategy first:
* The basic strategy combines sequential elements to maximally fill each chunk while respecting both the specified `max_characters` (hard-max) and `new_after_n_chars` (soft-max) option values.
85
-
86
-
* A single element that by itself exceeds the hard-max is isolated (never combined with another element) and then divided into two or more chunks using text-splitting.
87
-
88
-
* A `Table` element is always isolated and never combined with another element. A `Table` can be oversized, like any other text element, and in that case is divided into two or more `TableChunk` elements using text-splitting.
89
-
90
-
* If specified, `overlap` is applied between split-chunks and is also applied between normal chunks when `overlap_all` is `True`.
91
-
92
-
93
-
### “by\_title” chunking strategy
94
-
95
-
The `by_title` chunking strategy preserves section boundaries and optionally page boundaries as well. “Preserving” here means that a single chunk will never contain text that occurred in two different sections. When a new section starts, the existing chunk is closed and a new one started, even if the next element would fit in the prior chunk.
96
-
97
-
In addition to the behaviors of the `basic` strategy above, the `by_title` strategy has the following behaviors:
98
-
99
-
***Detect section headings.** A `Title` element is considered to start a new section. When a `Title` element is encountered, the prior chunk is closed and a new chunk started, even if the `Title` element would fit in the prior chunk. This implements the first aspect of the “preserve section boundaries” contract.
100
-
101
-
***Detect metadata.section change.** An element with a new value in `element.metadata.section` is considered to start a new section. When a change in this value is encountered a new chunk is started. This implements the second aspect of preserving section boundaries. This metadata is not present in all document formats so is not used alone. An element having `None` for this metadata field is considered to be part of the prior section; a section break is only detected on an explicit change in value.
102
-
103
-
***Respect page boundaries.** Page boundaries can optionally also be respected using the `multipage_sections` argument. This defaults to `True` meaning that a page break does _not_ start a new chunk. Setting this to `False` will separate elements that occur on different pages into distinct chunks.
104
-
105
-
***Combine small sections.** In certain documents, partitioning may identify a list-item or other short paragraph as a `Title` element even though it does not serve as a section heading. This can produce chunks substantially smaller than desired. This behavior can be mitigated using the `combine_text_under_n_chars` argument. This defaults to the same value as `max_characters` such that sequential small sections are combined to maximally fill the chunking window. Setting this to `0` will disable section combining.