Merge branch 'main' into main

fzowl · web-flow · commit 0344ee9b0721 · 2024-05-28T10:06:55.000+02:00
diff --git a/api-reference/api-services/api-parameters.mdx b/api-reference/api-services/api-parameters.mdx
@@ -11,7 +11,8 @@ The only required parameter is `files` -  the file you wish to process.
 
 | Python & direct call                      | JavaScript                               | Description                                                                                                                                                                                                                                                                                                                                                                          |
 |-------------------------------------------|------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| `files` (_shared.Files_)                  | `files` (_File_, _Blob_, _shared.Files_) | The file to process.                                                                                                                                                                                                                                                                                                                                                                 |
+| `files` (_shared.Files_)                  | `files` (_File_, _Blob_, _shared.Files_) | The file to process. |
+| `chunking_strategy` (_str_)               | `chunkingStrategy` (_string_)            | Use one of the supported strategies to chunk the returned elements after partitioning. When `chunking_strategy` is not specified, no chunking is performed and any other chunking parameters provided are ignored. Supported strategies: `"basic"`, `"by_title"`, `"by_page"`, `"by_similarity"`    |
 | `coordinates` (_bool_)                    | `coordinates` (_boolean_)                | If true, return bounding box coordinates for each element extracted via OCR. Default: false                                                                                                                                                                                                                                                                                          |
 | `encoding` (_str_)                        | `encoding` (_string_)                    | The encoding method used to decode the text input. Default: `utf-8`                                                                                                                                                                                                                                                                                                                  |
 | `extract_image_block_types` (_List[str]_) | `extractImageBlockTypes` (_string[]_)    | The types of elements to extract, for use in extracting image blocks as base64 encoded data stored in metadata fields                                                                                                                                                                                                                                                                |
@@ -25,8 +26,7 @@ The only required parameter is `files` -  the file you wish to process.
 | `split_pdf_page` (_bool_)                 | `splitPdfPage` (_boolean_)               | Should the pdf file be split at client. Ignored on backend.                                                                                                                                                                                                                                                                                                                          |
 | `strategy` (_str_)                        | `strategy` (_string_)                    | The strategy to use for partitioning PDF/image. Options are `fast`, `hi_res`, `auto`. Default: `auto`                                                                                                                                                                                                                                                                                |
 | `unique_element_ids` (_bool_)             | `uniqueElementIds` (_boolean_)           | When True, assign UUIDs to element IDs, which guarantees their uniqueness (useful when using them as primary keys in database). Otherwise a SHA-256 of element text is used. Default: False                                                                                                                                                                                          |
-| `xml_keep_tags` (_bool_)                  | `xmlKeepTags` (_boolean_)                | If True, will retain the XML tags in the output. Otherwise it will simply extract the text from within the tags. Only applies to XML documents.                                                                                                                                                                                                                                      |
-| `chunking_strategy` (_str_)               | `chunkingStrategy` (_string_)            | Use one of the supported strategies to chunk the returned elements after partitioning. When `chunking_strategy` is not specified, no chunking is performed and any other chunking parameters provided are ignored. Supported strategies: `"basic"`, `"by_title"`                                                                                                                     |
+| `xml_keep_tags` (_bool_)                  | `xmlKeepTags` (_boolean_)                | If True, will retain the XML tags in the output. Otherwise it will simply extract the text from within the tags. Only applies to XML documents.                                                                                                                                                                                                                                      |                                                                                                                 |
 
 The following parameters only apply when a `chunking_strategy` is specified. Otherwise, they are ignored.
 
diff --git a/api-reference/api-services/chunking.mdx b/api-reference/api-services/chunking.mdx
@@ -0,0 +1,48 @@
+---
+title: Chunking strategies
+---
+
+Chunking functions use metadata and document elements detected with partition functions to split a document into
+appropriately sized chunks for use cases such as Retrieval Augmented Generation (RAG).
+
+If you are familiar with chunking methods that split long text documents into smaller chunks, you'll notice that
+Unstructured's methods differ slightly, since the partitioning step has already divided the document into its structural elements.
+
+Individual elements will only be split if they exceed the desired maximum chunk size. Two or more consecutive text elements
+that will together fit within `max_characters` will be combined. After chunking, you will only have elements of the
+following types:
+
+* `CompositeElement`: Any text element will become a `CompositeElement` after chunking. A composite element can be a
+combination of two or more original text elements that together fit within the maximum chunk size. It can also be a single
+element that doesn't leave room in the chunk for any others but fits by itself. Or it can be a fragment of an original
+text element that was too big to fit in one chunk and required splitting.
+* `Table`: A table element is never combined with other elements. If it fits within `max_characters`, it remains as is.
+* `TableChunk`: A table that exceeds `max_characters` is split into special `TableChunk` elements.
+
+
+import SharedChunkingStrategies from '/snippets/concepts/chunking-strategies.mdx';
+
+<SharedChunkingStrategies/>
+
+### "by_page" chunking strategy
+
+Only available in Unstructured API and Platform.
+
+The `by_page` chunking strategy ensures that content from different pages does not end up in the same chunk.
+When a new page is detected, the existing chunk is completed and a new one is started, even if the next element would fit in the
+prior chunk.
+
+### "by_similarity" chunking strategy
+
+Only available in Unstructured API and Platform.
+
+The `by_similarity` chunking strategy employs the `sentence-transformers/multi-qa-mpnet-base-dot-v1` embedding model to
+identify topically similar sequential elements and combine them into chunks.
+
+As with other strategies, chunks will never exceed the hard-maximum chunk size set by `max_characters`. For this reason,
+not all elements that share a topic will necessarily appear in the same chunk. However, this strategy does
+guarantee that two elements with low similarity will not be combined into a single chunk.
+
+You can control how similar elements must be in order to be combined by setting the `similarity_threshold` parameter.
+`similarity_threshold` expects a value between 0.0 and 1.0. It specifies the minimum similarity that the text in
+consecutive elements must have for them to be included in the same chunk. The default is 0.5.
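
A minimal sketch of the boundary rules that the new `chunking.mdx` page describes for `by_page` and `by_similarity`, offered only as an illustration and not as the service's implementation. Everything in it is hypothetical: the `Element` class, `chunk_elements`, the `by_page` and `by_similarity` helpers, and above all `jaccard`, a toy word-overlap score that stands in for the embedding model named in the docs. The sketch also omits the text-splitting of oversized elements.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Element:
    """Hypothetical stand-in for a partitioned document element."""
    text: str
    page_number: int


def jaccard(a: str, b: str) -> float:
    """Toy word-overlap similarity in [0.0, 1.0]. ASSUMPTION: a placeholder
    for an embedding-based similarity, not what the API actually computes."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def chunk_elements(
    elements: List[Element],
    max_characters: int,
    must_break: Callable[[Element, Element], bool],
) -> List[List[Element]]:
    """Combine consecutive elements until the hard maximum would be exceeded
    or the strategy-specific predicate demands a new chunk."""
    chunks: List[List[Element]] = []
    current: List[Element] = []
    for element in elements:
        if current:
            # Length of the chunk text if this element were appended,
            # assuming elements are joined with a single space.
            combined = len(" ".join(e.text for e in current)) + 1 + len(element.text)
            if combined > max_characters or must_break(current[-1], element):
                chunks.append(current)
                current = []
        current.append(element)
    if current:
        chunks.append(current)
    return chunks


def by_page(previous: Element, following: Element) -> bool:
    """Start a new chunk whenever the page number changes."""
    return previous.page_number != following.page_number


def by_similarity(similarity_threshold: float = 0.5) -> Callable[[Element, Element], bool]:
    """Start a new chunk when consecutive elements score below the threshold."""
    def must_break(previous: Element, following: Element) -> bool:
        return jaccard(previous.text, following.text) < similarity_threshold
    return must_break


elements = [
    Element("cats purr softly", 1),
    Element("cats purr loudly", 1),
    Element("stock markets fell", 1),
    Element("stock markets rose", 2),
]

page_chunks = chunk_elements(elements, 500, by_page)
similar_chunks = chunk_elements(elements, 500, by_similarity(0.5))

print([len(chunk) for chunk in page_chunks])     # [3, 1]
print([len(chunk) for chunk in similar_chunks])  # [2, 2]
```

With the same input, `by_page` keeps the three page-1 elements together regardless of topic, while the similarity predicate separates the two topics even though one of them spans a page break. Lowering `max_characters` forces earlier breaks under either predicate, which is why elements that share a page or a topic are not guaranteed to land in the same chunk.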
diff --git a/api-reference/api-services/free-api.mdx b/api-reference/api-services/free-api.mdx
@@ -72,25 +72,34 @@ Next, use it to call the API:
 <CodeGroup>
 
 ```python Python
-from unstructured_client import UnstructuredClient
-from unstructured_client.models import shared
-from unstructured_client.models.errors import SDKError
+import unstructured_client
+from unstructured_client.models import operations, shared
 
-client = UnstructuredClient(api_key_auth="YOUR_API_KEY")
-filename = "PATH_TO_FILE"
-
-with open(filename, "rb") as f:
-    files=shared.Files(
-        content=f.read(),
-        file_name=filename,
-    )
-
-req = shared.PartitionParameters(files=files)
+client = unstructured_client.UnstructuredClient(
+    api_key_auth="YOUR_API_KEY",
+    # you may need to provide your unique API URL
+    # server_url="YOUR_API_URL",
+)
 
-try:
-    resp = client.general.partition(req)
-except SDKError as e:
-    print(e)
+filename = "sample-docs/layout-parser-paper.pdf"
+file = open(filename, "rb")
+
+res = client.general.partition(request=operations.PartitionRequest(
+    partition_parameters=shared.PartitionParameters(
+        # Note that this currently only supports a single file
+        files=shared.Files(
+            content=file.read(),
+            file_name=filename,
+        ),
+        # Other parameters
+        strategy=shared.Strategy.HI_RES,
+        chunking_strategy=shared.ChunkingStrategy.BY_PAGE,
+    ),
+))
+
+if res.elements is not None:
+    # handle response
+    pass
 ```
 
 ```javascript JavaScript
diff --git a/api-reference/api-services/python-sdk.mdx b/api-reference/api-services/python-sdk.mdx
@@ -17,21 +17,62 @@ pip install unstructured-client
 
 ## Basics
 
+<Note>
+    Deprecation Warning: The legacy method of making API calls is currently supported, but it may be deprecated and could break in future releases. We advise all users to migrate to the new `PartitionRequest` object introduced in v0.25.0 to ensure compatibility with future updates.
+</Note>
+<Note>
+    Deprecation Warning: Defining the `strategy`, `chunking_strategy`, and `output_format` parameters as strings may also be deprecated and could break in future releases. We advise using the new classes to define those parameters, for example `shared.Strategy.HI_RES`.
+</Note>
+
 Let's start with a simple example in which you send a pdf document to partition via Unstructured API using the Python SDK:
 
 ```python
-from unstructured_client import UnstructuredClient
+import unstructured_client
+from unstructured_client.models import operations, shared
+
+client = unstructured_client.UnstructuredClient(
+    api_key_auth="YOUR_API_KEY",
+    # you may need to provide your unique API URL
+    # server_url="YOUR_API_URL",
+)
+
+filename = "sample-docs/layout-parser-paper.pdf"
+file = open(filename, "rb")
+
+res = client.general.partition(request=operations.PartitionRequest(
+    partition_parameters=shared.PartitionParameters(
+        # Note that this currently only supports a single file
+        files=shared.Files(
+            content=file.read(),
+            file_name=filename,
+        ),
+        # Other parameters
+        strategy=shared.Strategy.HI_RES,
+        chunking_strategy=shared.ChunkingStrategy.BY_PAGE,
+    ),
+))
+
+if res.elements is not None:
+    # handle response
+    pass
+```
+
+
+Legacy method without `PartitionRequest`:
+```python
+import unstructured_client
 from unstructured_client.models import shared
 from unstructured_client.models.errors import SDKError
 
-client = UnstructuredClient(
+client = unstructured_client.UnstructuredClient(
     api_key_auth="YOUR_API_KEY",
     # you may need to provide your unique API URL
     # server_url="YOUR_API_URL",
 )
 
 filename = "sample-docs/layout-parser-paper.pdf"
 file = open(filename, "rb")
+
 req = shared.PartitionParameters(
     # Note that this currently only supports a single file
     files=shared.Files(
@@ -50,10 +91,10 @@ except SDKError as e:
 ```
 
 In the example above we're sending the request to the free Unstructured API, in which case the API URL is the same for all
-users, and you don't need to provide it. Note, however, that you still need to authenticate yourself with
+users and you don't need to provide it. Note, however, that you still need to authenticate yourself with
 your individual API Key.
 
-If you want to use the SaaS Unstructured API, you need to replace the URL in this example with the unique API URL that you have
+If you want to use the SaaS Unstructured API, you need to define `server_url` as the unique API URL that you
 received in the same email as your API key. For Unstructured API on Azure/AWS, use the API URL that you
 configured through those services.
 
diff --git a/api-reference/api-services/saas-api-development-guide.mdx b/api-reference/api-services/saas-api-development-guide.mdx
@@ -81,29 +81,34 @@ Unstructured API key to authenticate yourself.
 <CodeGroup>
 
 ```python Python
-from unstructured_client import UnstructuredClient
-from unstructured_client.models import shared
-from unstructured_client.models.errors import SDKError
+import unstructured_client
+from unstructured_client.models import operations, shared
 
-client = UnstructuredClient(
+client = unstructured_client.UnstructuredClient(
     api_key_auth="YOUR_API_KEY",
-    server_url="YOUR_API_URL",
+    # you may need to provide your unique API URL
+    # server_url="YOUR_API_URL",
 )
 
-filename = "PATH_TO_FILE"
-
-with open(filename, "rb") as f:
-    files=shared.Files(
-        content=f.read(),
-        file_name=filename,
-    )
-
-req = shared.PartitionParameters(files=files)
-
-try:
-    resp = client.general.partition(req)
-except SDKError as e:
-    print(e)
+filename = "sample-docs/layout-parser-paper.pdf"
+file = open(filename, "rb")
+
+res = client.general.partition(request=operations.PartitionRequest(
+    partition_parameters=shared.PartitionParameters(
+        # Note that this currently only supports a single file
+        files=shared.Files(
+            content=file.read(),
+            file_name=filename,
+        ),
+        # Other parameters
+        strategy=shared.Strategy.HI_RES,
+        chunking_strategy=shared.ChunkingStrategy.BY_PAGE,
+    ),
+))
+
+if res.elements is not None:
+    # handle response
+    pass
 ```
 
 ```javascript JavaScript
diff --git a/mint.json b/mint.json
@@ -32,7 +32,7 @@
       {
         "name": "Community",
         "icon": "slack",
-        "url": "https://unstructuredw-kbe4326.slack.com/signup#/domain-signup"
+        "url": "https://short.unstructured.io/pzw05l7"
       },
       {
         "name": "Product",
@@ -312,7 +312,8 @@
         "group": "Concepts",
         "pages": [
           "api-reference/api-services/document-elements",
-          "api-reference/api-services/partitioning"
+          "api-reference/api-services/partitioning",
+          "api-reference/api-services/chunking"
         ]
       },
 
diff --git a/open-source/core-functionality/chunking.mdx b/open-source/core-functionality/chunking.mdx
@@ -1,5 +1,5 @@
 ---
-ttile: Chunking 
+title: Chunking
 description: Chunking functions in `unstructured` use metadata and document elements detected with `partition` functions to split a document into smaller parts for uses cases such as Retrieval Augmented Generation (RAG).
 ---
 
@@ -77,33 +77,12 @@ for chunk in chunks:
 
 ## Chunking Strategies
 
-There are currently two chunking strategies, _basic_ and _by\_title_. The `by_title` strategy shares most behaviors with the basic strategy so we’ll describe the baseline strategy first:
+There are currently two chunking strategies, _basic_ and _by\_title_. The `by_title` strategy shares most behaviors with
+the basic strategy so we'll describe the baseline strategy first:
 
-### “basic” chunking strategy
+import SharedChunkingStrategies from '/snippets/concepts/chunking-strategies.mdx';
 
-*   The basic strategy combines sequential elements to maximally fill each chunk while respecting both the specified `max_characters` (hard-max) and `new_after_n_chars` (soft-max) option values.
-    
-*   A single element that by itself exceeds the hard-max is isolated (never combined with another element) and then divided into two or more chunks using text-splitting.
-    
-*   A `Table` element is always isolated and never combined with another element. A `Table` can be oversized, like any other text element, and in that case is divided into two or more `TableChunk` elements using text-splitting.
-    
-*   If specified, `overlap` is applied between split-chunks and is also applied between normal chunks when `overlap_all` is `True`.
-    
-
-### “by\_title” chunking strategy
-
-The `by_title` chunking strategy preserves section boundaries and optionally page boundaries as well. “Preserving” here means that a single chunk will never contain text that occurred in two different sections. When a new section starts, the existing chunk is closed and a new one started, even if the next element would fit in the prior chunk.
-
-In addition to the behaviors of the `basic` strategy above, the `by_title` strategy has the following behaviors:
-
-*   **Detect section headings.** A `Title` element is considered to start a new section. When a `Title` element is encountered, the prior chunk is closed and a new chunk started, even if the `Title` element would fit in the prior chunk. This implements the first aspect of the “preserve section boundaries” contract.
-    
-*   **Detect metadata.section change.** An element with a new value in `element.metadata.section` is considered to start a new section. When a change in this value is encountered a new chunk is started. This implements the second aspect of preserving section boundaries. This metadata is not present in all document formats so is not used alone. An element having `None` for this metadata field is considered to be part of the prior section; a section break is only detected on an explicit change in value.
-    
-*   **Respect page boundaries.** Page boundaries can optionally also be respected using the `multipage_sections` argument. This defaults to `True` meaning that a page break does _not_ start a new chunk. Setting this to `False` will separate elements that occur on different pages into distinct chunks.
-    
-*   **Combine small sections.** In certain documents, partitioning may identify a list-item or other short paragraph as a `Title` element even though it does not serve as a section heading. This can produce chunks substantially smaller than desired. This behavior can be mitigated using the `combine_text_under_n_chars` argument. This defaults to the same value as `max_characters` such that sequential small sections are combined to maximally fill the chunking window. Setting this to `0` will disable section combining.
-    
+<SharedChunkingStrategies/>
 
 ## Recovering Chunk Elements
 
diff --git a/open-source/core-functionality/embedding.mdx b/open-source/core-functionality/embedding.mdx
diff --git a/snippets/concepts/chunking-strategies.mdx b/snippets/concepts/chunking-strategies.mdx
diff --git a/snippets/concepts/glossary.mdx b/snippets/concepts/glossary.mdx
