Commit 0344ee9

Merge branch 'main' into main
2 parents 967bc86 + 7b2a591 commit 0344ee9

File tree

10 files changed

+202
-74
lines changed

api-reference/api-services/api-parameters.mdx

Lines changed: 3 additions & 3 deletions
Original file line number | Diff line number | Diff line change
@@ -11,7 +11,8 @@ The only required parameter is `files` - the file you wish to process.
1111

1212
| Python & direct call | JavaScript | Description |
1313
|-------------------------------------------|------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
14-
| `files` (_shared.Files_) | `files` (_File_, _Blob_, _shared.Files_) | The file to process. |
14+
| `files` (_shared.Files_) | `files` (_File_, _Blob_, _shared.Files_) | The file to process. |
15+
| `chunking_strategy` (_str_) | `chunkingStrategy` (_string_) | Use one of the supported strategies to chunk the returned elements after partitioning. When `chunking_strategy` is not specified, no chunking is performed and any other chunking parameters provided are ignored. Supported strategies: `"basic"`, `"by_title"`, `"by_page"`, `"by_similarity"` |
1516
| `coordinates` (_bool_) | `coordinates` (_boolean_) | If true, return bounding box coordinates for each element extracted via OCR. Default: false |
1617
| `encoding` (_str_) | `encoding` (_string_) | The encoding method used to decode the text input. Default: `utf-8` |
1718
| `extract_image_block_types` (_List[str]_) | `extractImageBlockTypes` (_string[]_) | The types of elements to extract, for use in extracting image blocks as base64 encoded data stored in metadata fields |
@@ -25,8 +26,7 @@ The only required parameter is `files` - the file you wish to process.
2526
| `split_pdf_page` (_bool_) | `splitPdfPage` (_boolean_) | Should the pdf file be split at client. Ignored on backend. |
2627
| `strategy` (_str_) | `strategy` (_string_) | The strategy to use for partitioning PDF/image. Options are `fast`, `hi_res`, `auto`. Default: `auto` |
2728
| `unique_element_ids` (_bool_) | `uniqueElementIds` (_boolean_) | When True, assign UUIDs to element IDs, which guarantees their uniqueness (useful when using them as primary keys in database). Otherwise a SHA-256 of element text is used. Default: False |
28-
| `xml_keep_tags` (_bool_) | `xmlKeepTags` (_boolean_) | If True, will retain the XML tags in the output. Otherwise it will simply extract the text from within the tags. Only applies to XML documents. |
29-
| `chunking_strategy` (_str_) | `chunkingStrategy` (_string_) | Use one of the supported strategies to chunk the returned elements after partitioning. When `chunking_strategy` is not specified, no chunking is performed and any other chunking parameters provided are ignored. Supported strategies: `"basic"`, `"by_title"` |
29+
| `xml_keep_tags` (_bool_) | `xmlKeepTags` (_boolean_) | If True, will retain the XML tags in the output. Otherwise it will simply extract the text from within the tags. Only applies to XML documents. |
3030

3131
The following parameters only apply when a `chunking_strategy` is specified. Otherwise, they are ignored.
3232

Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
1+
---
2+
title: Chunking strategies
3+
---
4+
5+
Chunking functions use metadata and document elements detected with partition functions to split a document into
6+
appropriately-sized chunks for use cases such as Retrieval Augmented Generation (RAG).
7+
8+
If you are familiar with chunking methods that split long text documents into smaller chunks, you'll notice that
9+
Unstructured methods differ slightly, since the partitioning step already divides an entire document into its structural elements.
10+
11+
Individual elements will only be split if they exceed the desired maximum chunk size. Two or more consecutive text elements
12+
that will together fit within `max_characters` will be combined. After chunking, you will only have elements of the
13+
following types:
14+
15+
* `CompositeElement`: Any text element will become a `CompositeElement` after chunking. A composite element can be a
16+
combination of two or more original text elements that together fit within the maximum chunk size. It can also be a single
17+
element that doesn't leave room in the chunk for any others but fits by itself. Or it can be a fragment of an original
18+
text element that was too big to fit in one chunk and required splitting.
19+
* `Table`: A table element is never combined with other elements. If it fits within `max_characters`, it remains as is.
20+
* `TableChunk`: Large tables that exceed the `max_characters` chunk size are split into special `TableChunk` elements.
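
The combining and splitting rules above can be illustrated with a minimal sketch. This is not the Unstructured implementation: `chunk_basic` is a hypothetical helper that works on plain strings, ignores tables, and splits oversized text at fixed character offsets, which is an assumption made only for illustration.

```python
def chunk_basic(texts, max_characters=500, separator="\n\n"):
    """Hypothetical sketch: combine consecutive texts that fit, split oversized ones."""
    chunks = []
    current = ""  # text accumulated for the chunk being built

    for text in texts:
        # An element larger than the limit is isolated and cut into fragments.
        if len(text) > max_characters:
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(
                text[i:i + max_characters]
                for i in range(0, len(text), max_characters)
            )
            continue

        # Otherwise, add it to the current chunk only if the result still fits.
        candidate = text if not current else current + separator + text
        if len(candidate) <= max_characters:
            current = candidate
        else:
            chunks.append(current)
            current = text

    if current:
        chunks.append(current)
    return chunks
```

In this sketch, two short strings end up in one chunk when their combined length (including the separator) stays within `max_characters`, while a single long string becomes several fragments.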
21+
22+
23+
import SharedChunkingStrategies from '/snippets/concepts/chunking-strategies.mdx';
24+
25+
<SharedChunkingStrategies/>
26+
27+
### "by_page" chunking strategy
28+
29+
Only available in Unstructured API and Platform.
30+
31+
The `by_page` chunking strategy ensures that content from different pages does not end up in the same chunk.
32+
When a new page is detected, the existing chunk is completed and a new one is started, even if the next element would fit in the
33+
prior chunk.
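
As a rough illustration of the page-boundary rule, the sketch below closes the current chunk whenever the page number changes. The `Element` dataclass and `chunk_by_page` function are hypothetical stand-ins, not the API's own classes or implementation, and the sketch does not split elements that are larger than `max_characters` on their own.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical stand-in for a partitioned element."""
    text: str
    page_number: int


def chunk_by_page(elements, max_characters=500, separator="\n\n"):
    """Hypothetical sketch: never mix text from different pages in one chunk."""
    chunks = []
    current = []          # texts collected for the chunk being built
    current_page = None   # page of the elements in `current`

    for el in elements:
        page_changed = bool(current) and el.page_number != current_page
        too_long = len(separator.join(current + [el.text])) > max_characters
        # Close the chunk on a page change, even if the element would fit.
        if current and (page_changed or too_long):
            chunks.append(separator.join(current))
            current = []
        current.append(el.text)
        current_page = el.page_number

    if current:
        chunks.append(separator.join(current))
    return chunks
```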
34+
35+
### "by_similarity" chunking strategy
36+
37+
Only available in Unstructured API and Platform.
38+
39+
The `by_similarity` chunking strategy employs the `sentence-transformers/multi-qa-mpnet-base-dot-v1` embedding model to
40+
identify topically similar sequential elements and combine them into chunks.
41+
42+
As with other strategies, chunks will never exceed the hard-maximum chunk size set by `max_characters`. For this reason,
43+
not all elements that share a topic will necessarily appear in the same chunk. However, with this strategy you can
44+
guarantee that two elements with low similarity will not be combined in a single chunk.
45+
46+
You can control the level of topic similarity required between elements by setting the `similarity_threshold` parameter.
47+
`similarity_threshold` expects a value between 0.0 and 1.0 specifying the minimum similarity that text in consecutive elements
48+
must have to be included in the same chunk. The default is 0.5.
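
The role of `similarity_threshold` can be sketched as follows. This is only an illustration under stated assumptions: it replaces the `sentence-transformers/multi-qa-mpnet-base-dot-v1` embedding model with a toy bag-of-words vector and cosine similarity so that it runs without extra dependencies, and `toy_embedding`, `cosine_similarity`, and `chunk_by_similarity` are hypothetical names, not part of the API.

```python
import math
import re
from collections import Counter


def toy_embedding(text):
    """Bag-of-words counts; a stand-in for a real sentence-embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def chunk_by_similarity(texts, similarity_threshold=0.5, max_characters=500):
    """Hypothetical sketch: merge consecutive texts only if similar enough and they fit."""
    chunks = []
    current = []  # texts collected for the chunk being built

    for text in texts:
        if current:
            score = cosine_similarity(toy_embedding(current[-1]), toy_embedding(text))
            fits = len("\n\n".join(current + [text])) <= max_characters
            # Start a new chunk when the topic shifts or the hard maximum is hit.
            if score < similarity_threshold or not fits:
                chunks.append("\n\n".join(current))
                current = []
        current.append(text)

    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Raising the threshold in this sketch produces more, smaller chunks, because fewer consecutive pairs count as similar enough to merge.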

api-reference/api-services/free-api.mdx

Lines changed: 26 additions & 17 deletions
@@ -72,25 +72,34 @@ Next, use it to call the API:
7272
<CodeGroup>
7373

7474
```python Python
75-
from unstructured_client import UnstructuredClient
76-
from unstructured_client.models import shared
77-
from unstructured_client.models.errors import SDKError
75+
import unstructured_client
76+
from unstructured_client.models import operations, shared
7877

79-
client = UnstructuredClient(api_key_auth="YOUR_API_KEY")
80-
filename = "PATH_TO_FILE"
81-
82-
with open(filename, "rb") as f:
83-
files=shared.Files(
84-
content=f.read(),
85-
file_name=filename,
86-
)
87-
88-
req = shared.PartitionParameters(files=files)
78+
client = unstructured_client.UnstructuredClient(
79+
api_key_auth="YOUR_API_KEY",
80+
# you may need to provide your unique API URL
81+
# server_url="YOUR_API_URL",
82+
)
8983

90-
try:
91-
resp = client.general.partition(req)
92-
except SDKError as e:
93-
print(e)
84+
filename = "sample-docs/layout-parser-paper.pdf"
85+
file = open(filename, "rb")
86+
87+
res = client.general.partition(request=operations.PartitionRequest(
88+
partition_parameters=shared.PartitionParameters(
89+
# Note that this currently only supports a single file
90+
files=shared.Files(
91+
content=file.read(),
92+
file_name=filename,
93+
),
94+
# Other parameters
95+
strategy=shared.Strategy.HI_RES,
96+
chunking_strategy=shared.ChunkingStrategy.BY_PAGE,
97+
),
98+
))
99+
100+
if res.elements is not None:
101+
# handle response
102+
pass
94103
```
95104

96105
```javascript JavaScript

api-reference/api-services/python-sdk.mdx

Lines changed: 45 additions & 4 deletions
@@ -17,21 +17,62 @@ pip install unstructured-client
1717

1818
## Basics
1919

20+
<Note>
21+
Deprecation Warning: The legacy method of making API calls is currently supported, but it may be deprecated and could break in future releases. We advise all users to migrate to the new `PartitionRequest` object introduced in v0.25.0 to ensure compatibility with future updates.
22+
</Note>
23+
<Note>
24+
Deprecation Warning: Defining the `strategy`, `chunking_strategy`, and `output_format` parameters as strings may also be deprecated and could break in future releases. We advise using the new classes to define those parameters, for example `shared.Strategy.HI_RES`.
25+
</Note>
26+
2027
Let's start with a simple example in which you send a PDF document to be partitioned via the Unstructured API using the Python SDK:
2128

2229
```python
23-
from unstructured_client import UnstructuredClient
30+
import unstructured_client
31+
from unstructured_client.models import operations, shared
32+
33+
client = unstructured_client.UnstructuredClient(
34+
api_key_auth="YOUR_API_KEY",
35+
# you may need to provide your unique API URL
36+
# server_url="YOUR_API_URL",
37+
)
38+
39+
filename = "sample-docs/layout-parser-paper.pdf"
40+
file = open(filename, "rb")
41+
42+
res = client.general.partition(request=operations.PartitionRequest(
43+
partition_parameters=shared.PartitionParameters(
44+
# Note that this currently only supports a single file
45+
files=shared.Files(
46+
content=file.read(),
47+
file_name=filename,
48+
),
49+
# Other parameters
50+
strategy=shared.Strategy.HI_RES,
51+
chunking_strategy=shared.ChunkingStrategy.BY_PAGE,
52+
),
53+
))
54+
55+
if res.elements is not None:
56+
# handle response
57+
pass
58+
```
59+
60+
61+
Legacy method without `PartitionRequest`:
62+
```python
63+
import unstructured_client
2464
from unstructured_client.models import shared
2565
from unstructured_client.models.errors import SDKError
2666

27-
client = UnstructuredClient(
67+
client = unstructured_client.UnstructuredClient(
2868
api_key_auth="YOUR_API_KEY",
2969
# you may need to provide your unique API URL
3070
# server_url="YOUR_API_URL",
3171
)
3272

3373
filename = "sample-docs/layout-parser-paper.pdf"
3474
file = open(filename, "rb")
75+
3576
req = shared.PartitionParameters(
3677
# Note that this currently only supports a single file
3778
files=shared.Files(
@@ -50,10 +91,10 @@ except SDKError as e:
5091
```
5192

5293
In the example above we're sending the request to the free Unstructured API, in which case the API URL is the same for all
53-
users, and you don't need to provide it. Note, however, that you still need to authenticate yourself with
94+
users and you don't need to provide it. Note, however, that you still need to authenticate yourself with
5495
your individual API Key.
5596

56-
If you want to use the SaaS Unstructured API, you need to replace the URL in this example with the unique API URL that you have
97+
If you want to use the SaaS Unstructured API, you need to define `server_url` as the unique API URL that you
5798
received in the same email as your API key. For Unstructured API on Azure/AWS, use the API URL that you
5899
configured through those services.
59100

api-reference/api-services/saas-api-development-guide.mdx

Lines changed: 24 additions & 19 deletions
@@ -81,29 +81,34 @@ Unstructured API key to authenticate yourself.
8181
<CodeGroup>
8282

8383
```python Python
84-
from unstructured_client import UnstructuredClient
85-
from unstructured_client.models import shared
86-
from unstructured_client.models.errors import SDKError
84+
import unstructured_client
85+
from unstructured_client.models import operations, shared
8786

88-
client = UnstructuredClient(
87+
client = unstructured_client.UnstructuredClient(
8988
api_key_auth="YOUR_API_KEY",
90-
server_url="YOUR_API_URL",
89+
# you may need to provide your unique API URL
90+
# server_url="YOUR_API_URL",
9191
)
9292

93-
filename = "PATH_TO_FILE"
94-
95-
with open(filename, "rb") as f:
96-
files=shared.Files(
97-
content=f.read(),
98-
file_name=filename,
99-
)
100-
101-
req = shared.PartitionParameters(files=files)
102-
103-
try:
104-
resp = client.general.partition(req)
105-
except SDKError as e:
106-
print(e)
93+
filename = "sample-docs/layout-parser-paper.pdf"
94+
file = open(filename, "rb")
95+
96+
res = client.general.partition(request=operations.PartitionRequest(
97+
partition_parameters=shared.PartitionParameters(
98+
# Note that this currently only supports a single file
99+
files=shared.Files(
100+
content=file.read(),
101+
file_name=filename,
102+
),
103+
# Other parameters
104+
strategy=shared.Strategy.HI_RES,
105+
chunking_strategy=shared.ChunkingStrategy.BY_PAGE,
106+
),
107+
))
108+
109+
if res.elements is not None:
110+
# handle response
111+
pass
107112
```
108113

109114
```javascript JavaScript

mint.json

Lines changed: 3 additions & 2 deletions
@@ -32,7 +32,7 @@
3232
{
3333
"name": "Community",
3434
"icon": "slack",
35-
"url": "https://unstructuredw-kbe4326.slack.com/signup#/domain-signup"
35+
"url": "https://short.unstructured.io/pzw05l7"
3636
},
3737
{
3838
"name": "Product",
@@ -312,7 +312,8 @@
312312
"group": "Concepts",
313313
"pages": [
314314
"api-reference/api-services/document-elements",
315-
"api-reference/api-services/partitioning"
315+
"api-reference/api-services/partitioning",
316+
"api-reference/api-services/chunking"
316317
]
317318
},
318319

open-source/core-functionality/chunking.mdx

Lines changed: 5 additions & 26 deletions
@@ -1,5 +1,5 @@
11
---
2-
ttile: Chunking
2+
title: Chunking
33
description: Chunking functions in `unstructured` use metadata and document elements detected with `partition` functions to split a document into smaller parts for use cases such as Retrieval Augmented Generation (RAG).
44
---
55

@@ -77,33 +77,12 @@ for chunk in chunks:
7777

7878
## Chunking Strategies
7979

80-
There are currently two chunking strategies, _basic_ and _by\_title_. The `by_title` strategy shares most behaviors with the basic strategy so we’ll describe the baseline strategy first:
80+
There are currently two chunking strategies, _basic_ and _by\_title_. The `by_title` strategy shares most behaviors with
81+
the basic strategy so we'll describe the baseline strategy first:
8182

82-
### “basic” chunking strategy
83+
import SharedChunkingStrategies from '/snippets/concepts/chunking-strategies.mdx';
8384

84-
* The basic strategy combines sequential elements to maximally fill each chunk while respecting both the specified `max_characters` (hard-max) and `new_after_n_chars` (soft-max) option values.
85-
86-
* A single element that by itself exceeds the hard-max is isolated (never combined with another element) and then divided into two or more chunks using text-splitting.
87-
88-
* A `Table` element is always isolated and never combined with another element. A `Table` can be oversized, like any other text element, and in that case is divided into two or more `TableChunk` elements using text-splitting.
89-
90-
* If specified, `overlap` is applied between split-chunks and is also applied between normal chunks when `overlap_all` is `True`.
91-
92-
93-
### “by\_title” chunking strategy
94-
95-
The `by_title` chunking strategy preserves section boundaries and optionally page boundaries as well. “Preserving” here means that a single chunk will never contain text that occurred in two different sections. When a new section starts, the existing chunk is closed and a new one started, even if the next element would fit in the prior chunk.
96-
97-
In addition to the behaviors of the `basic` strategy above, the `by_title` strategy has the following behaviors:
98-
99-
* **Detect section headings.** A `Title` element is considered to start a new section. When a `Title` element is encountered, the prior chunk is closed and a new chunk started, even if the `Title` element would fit in the prior chunk. This implements the first aspect of the “preserve section boundaries” contract.
100-
101-
* **Detect metadata.section change.** An element with a new value in `element.metadata.section` is considered to start a new section. When a change in this value is encountered a new chunk is started. This implements the second aspect of preserving section boundaries. This metadata is not present in all document formats so is not used alone. An element having `None` for this metadata field is considered to be part of the prior section; a section break is only detected on an explicit change in value.
102-
103-
* **Respect page boundaries.** Page boundaries can optionally also be respected using the `multipage_sections` argument. This defaults to `True` meaning that a page break does _not_ start a new chunk. Setting this to `False` will separate elements that occur on different pages into distinct chunks.
104-
105-
* **Combine small sections.** In certain documents, partitioning may identify a list-item or other short paragraph as a `Title` element even though it does not serve as a section heading. This can produce chunks substantially smaller than desired. This behavior can be mitigated using the `combine_text_under_n_chars` argument. This defaults to the same value as `max_characters` such that sequential small sections are combined to maximally fill the chunking window. Setting this to `0` will disable section combining.
106-
85+
<SharedChunkingStrategies/>
10786

10887
## Recovering Chunk Elements
10988
