From 60a9b0365b2bacf457daa6e7c107db3a8323293c Mon Sep 17 00:00:00 2001 From: Paul Cornell Date: Wed, 10 Dec 2025 14:39:55 -0800 Subject: [PATCH 1/2] Unstructured API: On-Demand Jobs - Quickstart and Walkthrough example notebooks --- ...ctured_API_On_Demand_Jobs_Quickstart.ipynb | 486 +++++ ...tured_API_On_Demand_Jobs_Walkthrough.ipynb | 1686 +++++++++++++++++ 2 files changed, 2172 insertions(+) create mode 100644 notebooks/Unstructured_API_On_Demand_Jobs_Quickstart.ipynb create mode 100644 notebooks/Unstructured_API_On_Demand_Jobs_Walkthrough.ipynb diff --git a/notebooks/Unstructured_API_On_Demand_Jobs_Quickstart.ipynb b/notebooks/Unstructured_API_On_Demand_Jobs_Quickstart.ipynb new file mode 100644 index 0000000..dea44fa --- /dev/null +++ b/notebooks/Unstructured_API_On_Demand_Jobs_Quickstart.ipynb @@ -0,0 +1,486 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + }, + "language_info": { + "name": "python" + } + }, + "cells": [ + { + "cell_type": "markdown", + "source": [ + "# Unstructured API On-Demand Jobs Quickstart\n", + "\n", + "This notebook shows how to use the [Unstructured Python SDK](https://docs.unstructured.io/api-reference/workflow/overview#unstructured-python-sdk) to have Unstructured process local files by using its _on-demand jobs_ functionality, which is part of the Unstructured API's collection of [workflow operations](https://docs.unstructured.io/api-reference/workflow/overview).\n", + "\n", + "---\n", + "\n", + "πŸ“ **Note**: The on-demand jobs functionality is designed to work *only by processing local files*.\n", + "\n", + "To process files (and data) in remote file and blob storage, databases, and vector stores, you must use other workflow operations in the Unstructured API. To learn how, see the notebook [Dropbox-To-Pinecone Connector API Quickstart for Unstructured](https://colab.research.google.com/github/Unstructured-IO/notebooks/blob/main/notebooks/Dropbox_To_Pinecone_Connector_Quickstart.ipynb). \n", + "\n", + "---" + ], + "metadata": { + "id": "HF-z7mJua4Ms" + } + }, + { + "cell_type": "markdown", + "source": [ + "## Requirements\n", + "\n", + "To run this notebook, you will need:\n", + "\n", + "- An Unstructured account. To sign up for an account, go to https://unstructured.io. In the top navigation bar, click **Get started for free**, and follow the on-screen directions to finish signing up. After you sign up, you are immediately signed in to your new **Let's Go** account, at https://platform.unstructured.io.\n", + "- An Unstructured API key, as follows:\n", + "\n", + " 1. After you are signed in to your account, on the sidebar, click **API Keys**.\n", + " 2. Click **Generate New Key**.\n", + " 3. Enter some meaningful display name for the key, and then click **Continue**.\n", + " 4. Next to the new key's name, click the **Copy** icon. The key's value is copied to your system's clipboard. If you lose this key, simply return to the list and click **Copy** again.\n", + "\n", + "- One or more local files for Unstructured to process. This notebook assumes that the local files you want to process are all PDF files, and that these PDFs are in a folder that is accessible from this notebook. The easiest and fastest way to create this folder is as follows:\n", + "\n", + " 1. On this notebook's sidebar, click the folder (**Files**) icon. The **Files** pane opens and displays the contents of the `/content` folder. 
(This folder should have a hidden `.config` subfolder and a `sample_data` subfolder.)\n", + " 2. Right-click any blank area within the **Files** pane, and then click **New folder**.\n", + " 3. Enter a name for the new folder. This notebook assumes the folder is named `input`, and the path to this new folder is `/content/input`.\n", + " 4. To upload files to this folder, do the following:\n", + "\n", + " a. Rest your mouse pointer on the `input` folder.
\n", + " b. Click the ellipsis (three dots) icon, and then click **Upload**.
\n", + " c. Browse to and select the files on your local machine that you want to upload to this `input` folder.
\n", + "\n", + "---\n", + "\n", + "⚠️ **Important**: Each on-demand job is limited to 10 files, and each file is limited to 10 MB in size.\n", + "\n", + "---\n", + "\n", + "- A destination folder for Unstructured to send its processed results to. This notebook assumes that the destination folder is accessible from this notebook. The easiest and fastest way to create this folder is as follows:\n", + "\n", + " 1. If the **Files** pane is not already showing, on this notebook's sidebar, click the folder (**Files**) icon. The **Files** pane opens and displays the contents of the `/content` folder.\n", + " 2. Right-click any blank area within the **Files** pane, and then click **New folder**.\n", + " 3. Enter a name for the new destination folder. This notebook assumes the folder is named `output`, and the path to this new folder is `/content/output`.\n", + "\n", + "---\n", + "\n", + "⚠️ **Warning**: Any files that you upload to these `input` or `output` folders will be deleted whenever Google Colab disconnects or resets, for example due to inactivity, manual restart, or session timeout.\n", + "\n", + "---\n" + ], + "metadata": { + "id": "lM6kKnbmdI7O" + } + }, + { + "cell_type": "markdown", + "source": [ + "## Step 1: Install the Unstructured Python SDK and other required packages\n", + "\n", + "Run the following cell to install the Unstructured Python SDK on a virtual machine (VM) in Google's cloud. This VM is associated with this notebook." + ], + "metadata": { + "id": "Zzios_nIfjrP" + } + }, + { + "cell_type": "code", + "source": [ + "!pip install unstructured-client" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "ZlP0HJe-aoBX", + "outputId": "37a704de-c1a2-43a7-f3d6-251b594c6bff" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Collecting unstructured-client\n", + " Downloading unstructured_client-0.42.4-py3-none-any.whl.metadata (23 kB)\n", + "Requirement already satisfied: httpx in /usr/local/lib/python3.12/dist-packages (0.28.1)\n", + "Requirement already satisfied: aiofiles>=24.1.0 in /usr/local/lib/python3.12/dist-packages (from unstructured-client) (24.1.0)\n", + "Requirement already satisfied: cryptography>=3.1 in /usr/local/lib/python3.12/dist-packages (from unstructured-client) (43.0.3)\n", + "Requirement already satisfied: httpcore>=1.0.9 in /usr/local/lib/python3.12/dist-packages (from unstructured-client) (1.0.9)\n", + "Requirement already satisfied: pydantic>=2.11.2 in /usr/local/lib/python3.12/dist-packages (from unstructured-client) (2.12.3)\n", + "Collecting pypdf>=6.2.0 (from unstructured-client)\n", + " Downloading pypdf-6.4.0-py3-none-any.whl.metadata (7.1 kB)\n", + "Requirement already satisfied: requests-toolbelt>=1.0.0 in /usr/local/lib/python3.12/dist-packages (from unstructured-client) (1.0.0)\n", + "Requirement already satisfied: anyio in /usr/local/lib/python3.12/dist-packages (from httpx) (4.11.0)\n", + "Requirement already satisfied: certifi in /usr/local/lib/python3.12/dist-packages (from httpx) (2025.11.12)\n", + "Requirement already satisfied: idna in /usr/local/lib/python3.12/dist-packages (from httpx) (3.11)\n", + "Requirement already satisfied: h11>=0.16 in /usr/local/lib/python3.12/dist-packages (from httpcore>=1.0.9->unstructured-client) (0.16.0)\n", + "Requirement already satisfied: cffi>=1.12 in /usr/local/lib/python3.12/dist-packages (from cryptography>=3.1->unstructured-client) (2.0.0)\n", + "Requirement already satisfied: annotated-types>=0.6.0 
in /usr/local/lib/python3.12/dist-packages (from pydantic>=2.11.2->unstructured-client) (0.7.0)\n",
+ "Requirement already satisfied: pydantic-core==2.41.4 in /usr/local/lib/python3.12/dist-packages (from pydantic>=2.11.2->unstructured-client) (2.41.4)\n",
+ "Requirement already satisfied: typing-extensions>=4.14.1 in /usr/local/lib/python3.12/dist-packages (from pydantic>=2.11.2->unstructured-client) (4.15.0)\n",
+ "Requirement already satisfied: typing-inspection>=0.4.2 in /usr/local/lib/python3.12/dist-packages (from pydantic>=2.11.2->unstructured-client) (0.4.2)\n",
+ "Requirement already satisfied: requests<3.0.0,>=2.0.1 in /usr/local/lib/python3.12/dist-packages (from requests-toolbelt>=1.0.0->unstructured-client) (2.32.4)\n",
+ "Requirement already satisfied: sniffio>=1.1 in /usr/local/lib/python3.12/dist-packages (from anyio->httpx) (1.3.1)\n",
+ "Requirement already satisfied: pycparser in /usr/local/lib/python3.12/dist-packages (from cffi>=1.12->cryptography>=3.1->unstructured-client) (2.23)\n",
+ "Requirement already satisfied: charset_normalizer<4,>=2 in /usr/local/lib/python3.12/dist-packages (from requests<3.0.0,>=2.0.1->requests-toolbelt>=1.0.0->unstructured-client) (3.4.4)\n",
+ "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.12/dist-packages (from requests<3.0.0,>=2.0.1->requests-toolbelt>=1.0.0->unstructured-client) (2.5.0)\n",
+ "Downloading unstructured_client-0.42.4-py3-none-any.whl (207 kB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m207.9/207.9 kB\u001b[0m \u001b[31m6.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+ "\u001b[?25hDownloading pypdf-6.4.0-py3-none-any.whl (329 kB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m329.5/329.5 kB\u001b[0m \u001b[31m16.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+ "\u001b[?25hInstalling collected packages: pypdf, unstructured-client\n",
+ "Successfully installed pypdf-6.4.0 unstructured-client-0.42.4\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Step 2: Set your Unstructured API key\n",
+ "\n",
+ "In the following cell, replace `<your-unstructured-api-key>` with the value of your API key, and then run the cell.\n",
+ "\n",
+ "As a security best practice, you would typically set this key elsewhere (for example, as an environment variable or in a secure key vault) and then access it programmatically here. But to keep things simple for demonstration purposes, just specify your API key in plaintext in the following cell."
+ ],
+ "metadata": {
+ "id": "1AxBbfvEamKi"
+ }
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "-GdAyX5SalRV"
+ },
+ "outputs": [],
+ "source": [
+ "UNSTRUCTURED_API_KEY = \"<your-unstructured-api-key>\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Step 3: Run an on-demand job\n",
+ "\n",
+ "In this step, you use the Unstructured Python SDK to run an on-demand job. This job is based on a predefined Unstructured [workflow](https://docs.unstructured.io/ui/overview#how-does-it-work) definition that contains the following workflow nodes:\n",
+ "\n",
+ "- A [High Res partitioner](https://docs.unstructured.io/ui/partitioning).\n",
+ "- An [image description enrichment](https://docs.unstructured.io/ui/enriching/image-descriptions).\n",
+ "- A [table-to-HTML enrichment](https://docs.unstructured.io/ui/enriching/table-to-html).\n",
+ "- A [generative OCR enrichment](https://docs.unstructured.io/ui/enriching/generative-ocr).\n",
+ "\n",
+ "This predefined workflow definition does not apply [chunking](https://docs.unstructured.io/ui/chunking) or generate [embeddings](https://docs.unstructured.io/ui/embedding).\n",
+ "\n",
+ "Running this job returns a unique job ID, a list of input file IDs (one per input file), and a list of processing result outputs (one per input file). Each processing result output is referenced by its output node ID and file ID.\n",
+ "\n",
+ "To complete this step, run the following cell."
+ ],
+ "metadata": {
+ "id": "GtYim9e-f94B"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "from unstructured_client import UnstructuredClient\n",
+ "from unstructured_client.models.operations import CreateJobRequest\n",
+ "from unstructured_client.models.shared import BodyCreateJob, InputFiles\n",
+ "import os, json\n",
+ "\n",
+ "# Set variables for:\n",
+ "# - The on-demand job's type, workflow template name, and settings.\n",
+ "# - The path to the local input files.\n",
+ "# - The input files array.\n",
+ "# - The on-demand job's ID, input file IDs, and output node files.\n",
+ "\n",
+ "job_type = \"template\"\n",
+ "template_id = \"hi_res_and_enrichment\"\n",
+ "request_data = json.dumps({\"job_type\": job_type, \"template_id\": template_id})\n",
+ "input_dir = \"/content/input/\"\n",
+ "\n",
+ "files = []\n",
+ "job_id = \"\"\n",
+ "job_input_file_ids = []\n",
+ "job_output_node_files = []\n",
+ "\n",
+ "# Read in all input files.\n",
+ "for filename in os.listdir(input_dir):\n",
+ "    full_path = os.path.join(input_dir, filename)\n",
+ "\n",
+ "    # Skip non-files (for example, directories).\n",
+ "    if not os.path.isfile(full_path):\n",
+ "        continue\n",
+ "\n",
+ "    files.append(\n",
+ "        InputFiles(\n",
+ "            content=open(full_path, \"rb\"),\n",
+ "            file_name=filename,\n",
+ "            content_type=\"application/pdf\"\n",
+ "        )\n",
+ "    )\n",
+ "\n",
+ "# Run the on-demand job, capturing the job ID and the job's\n",
+ "# input/output file IDs and output node IDs.\n",
+ "with UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY) as client:\n",
+ "    response = client.jobs.create_job(\n",
+ "        request=CreateJobRequest(\n",
+ "            body_create_job=BodyCreateJob(\n",
+ "                request_data=request_data,\n",
+ "                input_files=files\n",
+ "            )\n",
+ "        )\n",
+ "    )\n",
+ "\n",
+ "    job_id = response.job_information.id\n",
+ "    print(f\"Job ID: {job_id}\\n\")\n",
+ "\n",
+ "    job_input_file_ids = response.job_information.input_file_ids\n",
+ "    print(\"Input file details:\\n\")\n",
+ "\n",
+ "    for job_input_file_id in job_input_file_ids:\n",
+ "        print(job_input_file_id)\n",
+ "\n",
+ "    job_output_node_files = response.job_information.output_node_files\n",
+ "    print(\"\\nOutput node file details:\\n\")\n",
+ "\n",
+ "    for output_node_file in job_output_node_files:\n",
+ "        print(output_node_file)"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": 
"https://localhost:8080/" + }, + "id": "f2aelIqfhlUS", + "outputId": "cb149a2c-0e13-4f86-d511-9dc1399399f1" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Job ID: 9bd1a2f3-90a2-410a-9064-8d29dd296067\n", + "\n", + "Input file details:\n", + "\n", + "250713305v1-d88db765.pdf\n", + "H03-Cryptosystem-proposed-by-Nash-61a58972.pdf\n", + "\n", + "Output node file details:\n", + "\n", + "file_id='250713305v1-d88db765.pdf' node_id='93fc2ce8-e7c8-424f-a6aa-41460fc5d35d' node_subtype='unstructured_api' node_type='partition'\n", + "file_id='250713305v1-d88db765.pdf' node_id='4eb78731-4669-438c-9e2c-c76fcb1c9a52' node_subtype='openai_image_description' node_type='prompter'\n", + "file_id='250713305v1-d88db765.pdf' node_id='35cacdfe-3ac1-4183-bbf4-826cd88c882c' node_subtype='anthropic_ocr' node_type='prompter'\n", + "file_id='250713305v1-d88db765.pdf' node_id='ee5d4bf2-3783-4818-9f69-9ebbaa8778ea' node_subtype='anthropic_table2html' node_type='prompter'\n", + "file_id='H03-Cryptosystem-proposed-by-Nash-61a58972.pdf' node_id='93fc2ce8-e7c8-424f-a6aa-41460fc5d35d' node_subtype='unstructured_api' node_type='partition'\n", + "file_id='H03-Cryptosystem-proposed-by-Nash-61a58972.pdf' node_id='4eb78731-4669-438c-9e2c-c76fcb1c9a52' node_subtype='openai_image_description' node_type='prompter'\n", + "file_id='H03-Cryptosystem-proposed-by-Nash-61a58972.pdf' node_id='35cacdfe-3ac1-4183-bbf4-826cd88c882c' node_subtype='anthropic_ocr' node_type='prompter'\n", + "file_id='H03-Cryptosystem-proposed-by-Nash-61a58972.pdf' node_id='ee5d4bf2-3783-4818-9f69-9ebbaa8778ea' node_subtype='anthropic_table2html' node_type='prompter'\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Step 4: Poll for job completion\n", + "\n", + "In this step, you monitor your job's progress and confirm its completion.\n", + "\n", + "To complete this step, run the following cell, which lets you know how the job is progressing and when the job is completed.\n", + "\n", + "Do not proceed to the next step until you see the message `Job is completed`." 
+ ], + "metadata": { + "id": "fkE6aXJKnKkf" + } + }, + { + "cell_type": "code", + "source": [ + "import time\n", + "\n", + "def poll_job_status(client, job_id):\n", + " while True:\n", + " response = client.jobs.get_job(\n", + " request={\n", + " \"job_id\": job_id\n", + " }\n", + " )\n", + "\n", + " job = response.job_information\n", + "\n", + " if job.status == \"SCHEDULED\":\n", + " print(\"Job is scheduled, polling again in 10 seconds...\")\n", + " time.sleep(10)\n", + " elif job.status == \"IN_PROGRESS\":\n", + " print(\"Job is in progress, polling again in 10 seconds...\")\n", + " time.sleep(10)\n", + " else:\n", + " print(\"Job is completed.\")\n", + " break\n", + "\n", + " return job\n", + "\n", + "with UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY) as client:\n", + " job = poll_job_status(client, job_id)\n", + " print(f\"Job details:\\n---\\n{job.model_dump_json(indent=4)}\")" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "HCgZ0gnWnZOS", + "outputId": "e7aa59f2-9699-4380-bfb9-976dab4b3097" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Job is in progress, polling again in 10 seconds...\n", + "Job is in progress, polling again in 10 seconds...\n", + "Job is in progress, polling again in 10 seconds...\n", + "Job is in progress, polling again in 10 seconds...\n", + "Job is in progress, polling again in 10 seconds...\n", + "Job is in progress, polling again in 10 seconds...\n", + "Job is in progress, polling again in 10 seconds...\n", + "Job is in progress, polling again in 10 seconds...\n", + "Job is in progress, polling again in 10 seconds...\n", + "Job is completed.\n", + "Job details:\n", + "---\n", + "{\n", + " \"created_at\": \"2025-12-10T18:45:45.911031Z\",\n", + " \"id\": \"de7b344b-f30a-4739-880b-e7e204d4be4f\",\n", + " \"status\": \"COMPLETED\",\n", + " \"workflow_id\": \"a61a2082-a37b-4273-96fe-37e69763ff7b\",\n", + " \"workflow_name\": \"Job de7b344b\",\n", + " \"job_type\": \"ephemeral\"\n", + "}\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Step 5: Download the job's processed results\n", + "\n", + "In this step, you use the on-demand job's job ID and the input file IDs from Step 3 to download the job's results into the `/content/output` folder that you created during this notebook's Requirements.\n", + "\n", + "To complete this step, run the following cell." 
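+ ,
+ "\n",
+ "\n",
+ "If you have not already created the `/content/output` folder manually, you can create it from code before downloading. This is a small sketch that uses only the Python standard library:\n",
+ "\n",
+ "```python\n",
+ "# Create the destination folder if it does not already exist.\n",
+ "import os\n",
+ "\n",
+ "os.makedirs(\"/content/output\", exist_ok=True)\n",
+ "```"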
+ ], + "metadata": { + "id": "S5RA8rofwPER" + } + }, + { + "cell_type": "code", + "source": [ + "from unstructured_client.models.operations import DownloadJobOutputRequest\n", + "import json\n", + "\n", + "output_dir = \"/content/output/\"\n", + "\n", + "with UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY) as client:\n", + " for job_input_file_id in job_input_file_ids:\n", + " print(f\"Attempting to get processed results from file_id '{job_input_file_id}'...\")\n", + "\n", + " response = client.jobs.download_job_output(\n", + " request=DownloadJobOutputRequest(\n", + " job_id=job_id,\n", + " file_id=job_input_file_id\n", + " )\n", + " )\n", + "\n", + " output_path = os.path.join(output_dir, f\"{job_input_file_id}.json\")\n", + "\n", + " with open(output_path, \"w\") as f:\n", + " json.dump(response.any, f, indent=4)\n", + "\n", + " print(f\"Saved output for file_id '{job_input_file_id}' to '{output_path}'.\\n\")" + ], + "metadata": { + "id": "sKvW8WdAn1Ex", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "3307a653-e05d-475d-b5b1-92c9daae6e7a" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Attempting to get processed results from file_id '250713305v1-d88db765.pdf'...\n", + "Saved output for file_id '250713305v1-d88db765.pdf' to '/content/output/250713305v1-d88db765.pdf.json'.\n", + "\n", + "Attempting to get processed results from file_id 'H03-Cryptosystem-proposed-by-Nash-61a58972.pdf'...\n", + "Saved output for file_id 'H03-Cryptosystem-proposed-by-Nash-61a58972.pdf' to '/content/output/H03-Cryptosystem-proposed-by-Nash-61a58972.pdf.json'.\n", + "\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Step 6: View the downloaded results\n", + "\n", + "To view the downloaded job's results, do the following:\n", + "\n", + "1. On this notebook's sidebar, click the folder (**Files**) icon, if the **Files** pane is not already shown.\n", + "2. In the **Files** pane, click to expand the `output` folder.\n", + "3. Double-click one of the files that end in `.json`.\n", + "4. The file's contents appear in a pane on the right side of this notebook. You should notice the following:\n", + "\n", + "- Unstructured outputs its results in industry-standard [JSON](https://www.json.org/) format, which is ideal for RAG, agentic AI, and model fine-tuning.\n", + "- Each object in the JSON is called a [document element](https://docs.unstructured.io/ui/document-elements) and contains a `text` representation of the content that Unstructured detected for the particular portion of the document that was analyzed.\n", + "- The `type` is the kind of document element that Unstructured categorizes it as, such as whether it is a title (`Title`), a table (`Table`), an image (`Image`), a series of well-formulated sentences (`NarrativeText`), some kind of free text (`UncategorizedText`), a part of a list (`ListItem`), and so on. [Learn more](https://docs.unstructured.io/ui/document-elements#element-type).\n", + "- The `element_id` is a unique identifier that Unstructured generates to refer to each document element. [Learn more](https://docs.unstructured.io/ui/document-elements#element-id).\n", + "- `metadata` contains supporting details about each document element, such as the page number it occurred on, the file it occurred in, and so on. 
[Learn more](https://docs.unstructured.io/ui/document-elements#metadata).\n"
+ ],
+ "metadata": {
+ "id": "swwJKKDbw3Lo"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Next steps\n",
+ "\n",
+ "Congratulations! You have just run an on-demand job with Unstructured.\n",
+ "\n",
+ "Learn more about [on-demand jobs](https://docs.unstructured.io/api-reference/workflow/overview#run-an-on-demand-job).\n",
+ "\n",
+ "You can also learn more about the [Unstructured API](https://docs.unstructured.io/api-reference/overview).\n",
+ "\n",
+ "This notebook shows how to process local files only. To process files (and data) in remote file and blob storage, databases, and vector stores, you must use other workflow operations in the Unstructured API. To learn how, see the notebook [Dropbox-To-Pinecone Connector API Quickstart for Unstructured](https://colab.research.google.com/github/Unstructured-IO/notebooks/blob/main/notebooks/Dropbox_To_Pinecone_Connector_Quickstart.ipynb)."
+ ],
+ "metadata": {
+ "id": "nuCVt0tVw9Vr"
+ }
+ }
+ ]
+}
\ No newline at end of file
diff --git a/notebooks/Unstructured_API_On_Demand_Jobs_Walkthrough.ipynb b/notebooks/Unstructured_API_On_Demand_Jobs_Walkthrough.ipynb
new file mode 100644
index 0000000..a9ba5af
--- /dev/null
+++ b/notebooks/Unstructured_API_On_Demand_Jobs_Walkthrough.ipynb
@@ -0,0 +1,1686 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "colab": {
+ "provenance": []
+ },
+ "kernelspec": {
+ "name": "python3",
+ "display_name": "Python 3"
+ },
+ "language_info": {
+ "name": "python"
+ }
+ },
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# Unstructured API On-Demand Jobs Walkthrough\n",
+ "\n",
+ "This walkthrough provides you with deep, hands-on experience with the [Unstructured API](https://docs.unstructured.io/api-reference/overview). As you follow along, you will learn how to use many of Unstructured's features for\n",
+ "[partitioning](https://docs.unstructured.io/ui/partitioning),\n",
+ "[enriching](https://docs.unstructured.io/ui/enriching/overview),\n",
+ "[chunking](https://docs.unstructured.io/ui/chunking),\n",
+ "and [embedding](https://docs.unstructured.io/ui/embedding).\n",
+ "These features are optimized for turning your source documents and data into information that is well-tuned for\n",
+ "[retrieval-augmented generation (RAG)](https://unstructured.io/blog/rag-whitepaper),\n",
+ "[agentic AI](https://unstructured.io/problems-we-solve#powering-agentic-ai),\n",
+ "and [model fine-tuning](https://www.geeksforgeeks.org/deep-learning/what-is-fine-tuning/).\n",
+ "\n",
+ "This walkthrough uses two sample files to demonstrate how Unstructured identifies and processes content such as images, graphs, complex tables, non-English characters, handwriting, and poorly scanned content. These files, which are available for you to download to your local machine, are as follows:\n",
+ "\n",
+ "- Wang, Z., Liu, X., & Zhang, M. (2022, November 23). _Breaking the Representation Bottleneck of Chinese Characters: Neural Machine Translation with Stroke Sequence Modeling_. arXiv.org. https://arxiv.org/pdf/2211.12781. This 12-page PDF file features English and non-English characters, images, graphs, and complex tables. Throughout this walkthrough, this file's title is shortened to β€œChinese characters” for brevity.\n",
+ "- United States Central Security Service. (2012, January 27). _National Cryptologic Museum Opens New Exhibit on Dr. John Nash_. United States National Security Agency. 
https://courses.csail.mit.edu/6.857/2012/files/H03-Cryptosystem-proposed-by-Nash.pdf. This PDF file features English handwriting and scanned images of documents. Throughout this walkthrough, this file's title is shortened to β€œNash letters” for brevity.\n", + "\n", + "If you are not able to complete any of the following steps, contact Unstructured Support at [support@unstructured.io](mailto:support@unstructured.io).\n", + "\n", + "---\n", + ">\n", + "> πŸ“ _Note_\n", + ">\n", + "> This notebook uses Unstructured's _on-demand jobs_ functionality, which\n", + "> is designed to work *only with local files*.\n", + ">\n", + "> To process files (and data) in remote file and blob storage, databases, and\n", + "> vector stores, you must use other workflow operations in the Unstructured\n", + "> API. To learn how, see for example the notebook\n", + ">[Dropbox-To-Pinecone Connector API Quickstart for Unstructured](https://colab.research.google.com/github/Unstructured-IO/notebooks/blob/main/notebooks/Dropbox_To_Pinecone_Connector_Quickstart.ipynb). \n", + ">\n", + "---\n", + ">\n", + "> πŸ’‘ _What's this?_\n", + ">\n", + "> As you move through this walkthrough, you will notice tips like this one.\n", + "> These tips are designed to help expand your knowledge about Unstructured as\n", + "> you go. Feel free to skip these tips for now if you are in a hurry. You can\n", + "> always return to them later to learn more.\n", + ">\n", + "---" + ], + "metadata": { + "id": "1cWcvwA4Uv9A" + } + }, + { + "cell_type": "markdown", + "source": [ + "## Requirements\n", + "\n", + "To run this notebook, you will need:\n", + "\n", + "- An Unstructured account. To sign up for an account, go to https://unstructured.io/?modal=try-for-free. After you sign up, you are immediately signed in to your new Unstructured **Let's Go** account, at https://platform.unstructured.io.\n", + "- An Unstructured API key, as follows:\n", + "\n", + " 1. After you are signed in to your account, on the sidebar click **API Keys**.\n", + " 2. Click **Generate New Key**.\n", + " 3. Enter some meaningful display name for the key, and then click **Continue**.\n", + " 4. Next to the new key's name, click the **Copy** icon. The key's value is copied to your system's clipboard. If you lose this key, simply return to the list and click **Copy** again.\n", + "\n", + "- One or more local files for Unstructured to process. This notebook assumes that you are using the \"Chinese characters\" and \"Nash letters\" PDF files, and that these PDFs are in a subfolder that is accessible from this notebook. The easiest and fastest way to create this subfolder is as follows:\n", + "\n", + " 1. On this notebook's sidebar, click the folder (**Files**) icon. By default, the **Files** pane will show all files and subfolders within the `/content` folder. (This folder also contains a hidden `.config` subfolder and a `sample_data` subfolder.)\n", + " 2. Right-click in a blank area anywhere below the existing list of subfolders and files in the **Files** pane, and then click **New folder**.\n", + " 3. Enter a name for the new subfolder within `/content`. This notebook assumes the subfolder is named `input`.\n", + " 4. To upload files to this subfolder, do the following:\n", + "\n", + " a. Rest your mouse pointer on the `input` subfolder.
\n", + " b. Click the ellipsis (three dots) icon, and then click **Upload**.
\n", + " c. Browse to and select the \"Chinese characters\" file on your local machine that you want to upload to this `input` subfolder.
\n", + " d. Repeat this process to upload the \"Nash letters\" file to this `input` subfolder.
\n", + "\n", + "---\n", + "\n", + "⚠️ **Important**: Each on-demand job is limited to 10 files, and each file is limited to 10 MB in size.\n", + "\n", + "---\n", + "\n", + "- A destination subfolder for Unstructured to send its processed results to. This notebook assumes that the destination folder is accessible from this notebook. The easiest and fastest way to create this folder is as follows:\n", + "\n", + " 1. If the **Files** pane is not already open, then on this notebook's sidebar, click the folder (**Files**) icon. By default, the **Files** pane will show all files and subfolders within the `/content` folder.\n", + " 2. Right-click in a blank area anywhere below the existing list of subfolders and files in the **Files** pane, and then click **New folder**.\n", + " 3. Enter a name for the new subfolder within `/content`. This notebook assumes the subfolder is named `output`.\n", + "\n", + "---\n", + ">\n", + "> ⚠️ **Warning**: Any files that you upload to these `input` or `output`\n", + "> subfolders will be deleted whenever Google Colab disconnects or resets, for\n", + "> example due to inactivity, manual restart, or session timeout.\n", + ">\n", + "---" + ], + "metadata": { + "id": "SUUJdV_ZWUQw" + } + }, + { + "cell_type": "markdown", + "source": [ + "## Step 1: Run a basic on-demand job\n", + "\n", + "In this step, you run an [on-demand job](https://docs.unstructured.io/api-reference/workflow/overview#run-an-on-demand-job) in your Unstructured account. This on-demand job contains a basic [workflow](https://docs.unstructured.io/api-reference/workflow/workflows). The workflow lasts only for the duration of this on-demand job.\n", + "\n", + "Workflows are defined sequences of processes that automate the flow of data from your source documents and data into Unstructured for processing. Unstructured then sends its processed data over into your destination file storage locations, databases, and vector stores. Your RAG apps, agents, and models can then use this processed data in those destinations to do things more quickly and accurately such as\n", + "[answering users' questions](https://learn.microsoft.com/en-us/azure/developer/ai/advanced-retrieval-augmented-generation),\n", + "[automating business processes](https://unstructured.io/problems-we-solve#business-process-automation),\n", + "and [expanding your organization's available body of knowledge](http://knowledgemanagement.ie/the-critical-role-of-knowledge-management-as-a-foundation-for-llms-and-ai/).\n", + "\n", + "In this notebook, your source documents are local, and Unstructured stores its processed data locally as well. 
However, as just described, Unstructured can work with remote source documents (and data), too.\n",
+ "\n",
+ "---\n",
+ ">\n",
+ "> πŸ’‘ _Which kinds of remote sources and destinations does Unstructured support?_\n",
+ ">\n",
+ "> Unstructured can connect to many types of remote sources and destinations including\n",
+ "> file storage services such as Amazon S3 and Google Cloud Storage; databases\n",
+ "> such as PostgreSQL; and vector storage and database services such as MongoDB\n",
+ "> Atlas and Pinecone.\n",
+ ">\n",
+ "> See the full list of [supported source connectors](https://docs.unstructured.io/api-reference/workflow/sources/overview) and [supported destination connectors](https://docs.unstructured.io/api-reference/workflow/destinations/overview).\n",
+ ">\n",
+ "---\n",
+ ">\n",
+ "> πŸ’‘ _Which kinds of files does Unstructured support?_\n",
+ ">\n",
+ "> Unstructured can process a wide variety of file types including PDFs, word\n",
+ "> processing documents, spreadsheets, slide decks, HTML, image files, emails,\n",
+ "> and more.\n",
+ ">\n",
+ "> See the full list of [supported file types](https://docs.unstructured.io/api-reference/supported-file-types).\n",
+ ">\n",
+ "---\n",
+ "\n",
+ "1. In the following cell, replace `<your-unstructured-api-key>` with your Unstructured API key's value, and then run the cell, which sets the constant `UNSTRUCTURED_API_KEY` to that value."
+ ],
+ "metadata": {
+ "id": "1oa8u7hNW3hM"
+ }
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {
+ "id": "hRE1zws-Ut9c"
+ },
+ "outputs": [],
+ "source": [
+ "UNSTRUCTURED_API_KEY = \"<your-unstructured-api-key>\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "2. Run the following cell to install the Unstructured Python SDK on a virtual machine (VM) in Google's cloud. This VM is associated with this notebook."
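+ ,
+ "\n",
+ "\n",
+ "   After the following cell's installation finishes, you can optionally confirm which SDK version was installed. This quick check is a sketch that uses only the Python standard library:\n",
+ "\n",
+ "   ```python\n",
+ "   # Optional sanity check: report the installed SDK version.\n",
+ "   from importlib.metadata import version\n",
+ "\n",
+ "   print(version(\"unstructured-client\"))\n",
+ "   ```"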
+ ], + "metadata": { + "id": "LDSQZELjZHTR" + } + }, + { + "cell_type": "code", + "source": [ + "!pip install unstructured-client" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "OICi0TNdZQiK", + "outputId": "9ea283c8-d61b-4beb-9668-c00fd6e56f49" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Collecting unstructured-client\n", + " Downloading unstructured_client-0.42.4-py3-none-any.whl.metadata (23 kB)\n", + "Requirement already satisfied: httpx in /usr/local/lib/python3.12/dist-packages (0.28.1)\n", + "Requirement already satisfied: aiofiles>=24.1.0 in /usr/local/lib/python3.12/dist-packages (from unstructured-client) (24.1.0)\n", + "Requirement already satisfied: cryptography>=3.1 in /usr/local/lib/python3.12/dist-packages (from unstructured-client) (43.0.3)\n", + "Requirement already satisfied: httpcore>=1.0.9 in /usr/local/lib/python3.12/dist-packages (from unstructured-client) (1.0.9)\n", + "Requirement already satisfied: pydantic>=2.11.2 in /usr/local/lib/python3.12/dist-packages (from unstructured-client) (2.12.3)\n", + "Collecting pypdf>=6.2.0 (from unstructured-client)\n", + " Downloading pypdf-6.4.0-py3-none-any.whl.metadata (7.1 kB)\n", + "Requirement already satisfied: requests-toolbelt>=1.0.0 in /usr/local/lib/python3.12/dist-packages (from unstructured-client) (1.0.0)\n", + "Requirement already satisfied: anyio in /usr/local/lib/python3.12/dist-packages (from httpx) (4.11.0)\n", + "Requirement already satisfied: certifi in /usr/local/lib/python3.12/dist-packages (from httpx) (2025.11.12)\n", + "Requirement already satisfied: idna in /usr/local/lib/python3.12/dist-packages (from httpx) (3.11)\n", + "Requirement already satisfied: h11>=0.16 in /usr/local/lib/python3.12/dist-packages (from httpcore>=1.0.9->unstructured-client) (0.16.0)\n", + "Requirement already satisfied: cffi>=1.12 in /usr/local/lib/python3.12/dist-packages (from cryptography>=3.1->unstructured-client) (2.0.0)\n", + "Requirement already satisfied: annotated-types>=0.6.0 in /usr/local/lib/python3.12/dist-packages (from pydantic>=2.11.2->unstructured-client) (0.7.0)\n", + "Requirement already satisfied: pydantic-core==2.41.4 in /usr/local/lib/python3.12/dist-packages (from pydantic>=2.11.2->unstructured-client) (2.41.4)\n", + "Requirement already satisfied: typing-extensions>=4.14.1 in /usr/local/lib/python3.12/dist-packages (from pydantic>=2.11.2->unstructured-client) (4.15.0)\n", + "Requirement already satisfied: typing-inspection>=0.4.2 in /usr/local/lib/python3.12/dist-packages (from pydantic>=2.11.2->unstructured-client) (0.4.2)\n", + "Requirement already satisfied: requests<3.0.0,>=2.0.1 in /usr/local/lib/python3.12/dist-packages (from requests-toolbelt>=1.0.0->unstructured-client) (2.32.4)\n", + "Requirement already satisfied: sniffio>=1.1 in /usr/local/lib/python3.12/dist-packages (from anyio->httpx) (1.3.1)\n", + "Requirement already satisfied: pycparser in /usr/local/lib/python3.12/dist-packages (from cffi>=1.12->cryptography>=3.1->unstructured-client) (2.23)\n", + "Requirement already satisfied: charset_normalizer<4,>=2 in /usr/local/lib/python3.12/dist-packages (from requests<3.0.0,>=2.0.1->requests-toolbelt>=1.0.0->unstructured-client) (3.4.4)\n", + "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.12/dist-packages (from requests<3.0.0,>=2.0.1->requests-toolbelt>=1.0.0->unstructured-client) (2.5.0)\n", + "Downloading unstructured_client-0.42.4-py3-none-any.whl (207 
kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m207.9/207.9 kB\u001b[0m \u001b[31m5.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading pypdf-6.4.0-py3-none-any.whl (329 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m329.5/329.5 kB\u001b[0m \u001b[31m14.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hInstalling collected packages: pypdf, unstructured-client\n", + "Successfully installed pypdf-6.4.0 unstructured-client-0.42.4\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "3. Run the following cell, which sets the notebook's local input and output subfolders." + ], + "metadata": { + "id": "2S8QYnyRZVUf" + } + }, + { + "cell_type": "code", + "source": [ + "INPUT_PATH = \"/content/input\"\n", + "OUTPUT_PATH = \"/content/output\"" + ], + "metadata": { + "id": "D0Cet6NNaLlm" + }, + "execution_count": 17, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "4. Run the following cell, which runs a basic on-demand job. This job contains an Unstructured custom workflow with a single stage or step (known as a workflow _node_) for partitioning. This workflow uses the **VLM** [partitioning](https://docs.unstructured.io/ui/partitioning) strategy to turn the contents of your documents and semi-structured data into a data format that is fine-tuned for well-tuned for retrieval-augmented generation (RAG), agentic AI, and model fine-tuning.\n", + "\n", + " The results of running this cell include the on-demand job's ID and a list of job outputs. This list's items include a file ID and an output node ID for each file that Unstructured processes. You will use these file IDs and output node IDs later to retrieve Unstructured's processed file data." + ], + "metadata": { + "id": "wKsxxyP4bFbh" + } + }, + { + "cell_type": "code", + "source": [ + "from unstructured_client import UnstructuredClient\n", + "from unstructured_client.models.operations import CreateJobRequest\n", + "from unstructured_client.models.shared import BodyCreateJob, InputFiles\n", + "import os, json\n", + "\n", + "def run_on_demand_job(client, input_dir, job_type, template_id=None, job_nodes=None):\n", + " request_data = {}\n", + " files = []\n", + "\n", + " for filename in os.listdir(input_dir):\n", + " full_path = os.path.join(input_dir, filename)\n", + "\n", + " # Skip non-files (for example, directories).\n", + " if not os.path.isfile(full_path):\n", + " continue\n", + "\n", + " files.append(\n", + " (\n", + " InputFiles(\n", + " content=open(full_path, \"rb\"),\n", + " file_name=filename,\n", + " content_type=\"application/pdf\"\n", + " )\n", + " )\n", + " )\n", + "\n", + " if job_type == \"template\":\n", + " request_data = json.dumps({\"job_type\": job_type, \"template_id\": template_id})\n", + " elif job_type == \"ephemeral\":\n", + " request_data = json.dumps({\"job_type\": job_type, \"job_nodes\": job_nodes})\n", + " else:\n", + " raise ValueError(f\"Invalid job type: '{job_type}'. 
Must be 'template' or 'ephemeral'.\")\n", + "\n", + " # Run the on-demand job, capturing the job ID and the job's\n", + " # input/output file IDs and output node IDs.\n", + " response = client.jobs.create_job(\n", + " request=CreateJobRequest(\n", + " body_create_job=BodyCreateJob(\n", + " request_data=request_data,\n", + " input_files=files\n", + " )\n", + " )\n", + " )\n", + "\n", + " job_id = response.job_information.id\n", + " job_input_file_ids = response.job_information.input_file_ids\n", + " job_output_node_files = response.job_information.output_node_files\n", + "\n", + " return job_id, job_input_file_ids, job_output_node_files\n", + "\n", + "vlm_partitioner_node = {\n", + " \"name\": \"Partitioner\",\n", + " \"subtype\": \"vlm\",\n", + " \"type\": \"partition\",\n", + " \"settings\": {\n", + " \"provider\": \"vertexai\",\n", + " \"model\": \"gemini-2.0-flash-001\",\n", + " \"is_dynamic\": False,\n", + " \"allow_fast\": True\n", + " }\n", + "}\n", + "\n", + "job_nodes = [ vlm_partitioner_node ]\n", + "\n", + "job_id = \"\"\n", + "job_input_file_ids = []\n", + "job_output_node_files = []\n", + "\n", + "with UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY) as client:\n", + " job_id, job_input_file_ids, job_output_node_files = run_on_demand_job(\n", + " client = client,\n", + " input_dir = INPUT_PATH,\n", + " job_type = \"ephemeral\",\n", + " job_nodes = job_nodes\n", + " )\n", + "\n", + " print(f\"Job ID: {job_id}\\n\")\n", + " print(\"Input file details:\\n\")\n", + "\n", + " for job_input_file_id in job_input_file_ids:\n", + " print(job_input_file_id)\n", + "\n", + " print(\"\\nOutput node file details:\\n\")\n", + "\n", + " for output_node_file in job_output_node_files:\n", + " print(output_node_file)" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "YSbKsE-lbUnB", + "outputId": "cdc43792-3232-415b-ca0d-12f728b27b1b" + }, + "execution_count": 29, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Job ID: 1f91e440-eeb2-407d-a4e8-9131693e392d\n", + "\n", + "Input file details:\n", + "\n", + "221112781v1-3b190ed1.pdf\n", + "H03-Cryptosystem-proposed-by-Nash-3edb6a01.pdf\n", + "\n", + "Output node file details:\n", + "\n", + "file_id='221112781v1-3b190ed1.pdf' node_id='28a283ce-fd39-4bfa-8e4a-7cb847856853' node_subtype='vlm' node_type='partition'\n", + "file_id='H03-Cryptosystem-proposed-by-Nash-3edb6a01.pdf' node_id='28a283ce-fd39-4bfa-8e4a-7cb847856853' node_subtype='vlm' node_type='partition'\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "---\n", + ">\n", + "> πŸ’‘ _What other kinds of settings can I set for my workflow?_\n", + ">\n", + "> Your workflow can be set to run automatically (on a regular time schedule).\n", + "> You can also set your workflow to use a paritioning strategy such as\n", + "> **Auto**, **Fast**, or **High Res**, instead of **VLM**.\n", + "> [Learn how](https://docs.unstructured.io/api-reference/workflow/workflows).\n", + ">\n", + "---\n", + ">\n", + "> πŸ’‘ _Why am I specifying this particular vision language model here? 
When would I choose one of the available models over another?_\n", + ">\n", + "> A _vision language model_ (VLM) is designed to use sophisticated AI\n", + "> techniques and logic to combine advanced image and text understanding,\n", + "> resulting in more accurate and contextually-rich output.\n", + ">\n", + "> The following code uses the Gemini Flash 2.0 VLM offered by Vertex AI.\n", + "> As VLMs are constantly being released\n", + "> and improved, Unstructured is always\n", + "> adding to and updating its list of supported VLMs. If you aren't getting\n", + "> consistent results with one VLM for a particular set of files, switching over\n", + "> to another one might improve your results, depending on that VLM's\n", + "> capabilities and the sample data that is was trained on.\n", + ">\n", + "---\n", + ">\n", + "> πŸ’‘ _When would I choose **Auto**, **Fast**, **High Res**, or **VLM**?_\n", + ">\n", + "> - **Auto** is recommended in most cases. It lets Unstructured figure out the\n", + "> best strategy to switch over to for each incoming file (and even for each\n", + "> page if the incoming file is a PDF), so you don't have to!\n", + "> - **Fast** is only for when you know for certain that none of your files have\n", + "> tables, images, or multilanguage, scanned, or handwritten content in them.\n", + "> It's optimized for partitioning text-only content and is the fastest of all\n", + "> the strategies. It can recognize the text for only a few languages other than\n", + "> English.\n", + "> - **High Res** is only for when you know for certain that at least one of your\n", + "> files has images or simple tables in them, and that none of your files also\n", + "> have scanned or handwritten content in them. It can recognize the text for\n", + "> more languages than **Fast** but not as many as **VLM**.\n", + "> - **VLM** is great for any file, but it is best when you know for certain that\n", + "> some of your files have a combination of tables (especially complex ones),\n", + "> images, and multilanguage, scanned, or handwritten content. It's the highest\n", + "> quality but slowest of all the strategies.\n", + ">\n", + "> In this walkthrough, you use the **VLM** strategy\n", + "> only to see how each of these strategies works with a combination of complex\n", + "> tables, images, and multilanguage, scanned, and handwritten content. In\n", + "> practice, for these kinds of files you would likely just want to choose\n", + "> **Auto**.\n", + ">\n", + "---\n", + ">\n", + "> πŸ’‘ _This workflow has only a partitioner. What about enriching, chunking, and embedding?_\n", + ">\n", + "> Don't worrry; you will add nodes to this workflow for enriching, chunking,\n", + "> and embedding later in this notebook.\n", + ">\n", + "---" + ], + "metadata": { + "id": "kKvxoAck9B_O" + } + }, + { + "cell_type": "markdown", + "source": [ + "5. Run the following cell, which uses the job's ID to get the job's status. Do not proceed until the job polling is complete. 
This job polling could take a few minutes or more.\n", + "\n", + " - If the job's final status is completed but with errors, or is failed, continue ahead to step 6 in this procedure to try to find out why.\n", + " - If the job's final status is stopped, try running the previous cell in step 4 again to produce a new job for this workflow.\n", + " - If the job's final status is successfully completed, skip ahead to step 7 in this procedure.\n", + "\n", + " [Learn more about this code](https://docs.unstructured.io/api-reference/workflow/overview#get-processing-details-for-a-job)." + ], + "metadata": { + "id": "kUoMey7J4MKL" + } + }, + { + "cell_type": "code", + "source": [ + "import time\n", + "from unstructured_client import UnstructuredClient\n", + "\n", + "def poll_job_status(client, job_id):\n", + " while True:\n", + " response = client.jobs.get_job(\n", + " request={\n", + " \"job_id\": job_id\n", + " }\n", + " )\n", + "\n", + " job = response.job_information\n", + "\n", + " if job.status == \"SCHEDULED\":\n", + " print(\"Job is scheduled, polling again in 10 seconds...\")\n", + " time.sleep(10)\n", + " elif job.status == \"IN_PROGRESS\":\n", + " print(\"Job is in progress, polling again in 10 seconds...\")\n", + " time.sleep(10)\n", + " else:\n", + " print(\"Job is completed.\")\n", + " break\n", + "\n", + " return job\n", + "\n", + "\n", + "with UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY) as client:\n", + " job = poll_job_status(client, job_id)\n", + " print(f\"Job details:\\n---\\n{job.model_dump_json(indent=4)}\")" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "BeZ9vBiS4QaP", + "outputId": "4897b373-18dc-4ac6-e75d-0753cc4a9b1e" + }, + "execution_count": 30, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Job is in progress, polling again in 10 seconds...\n", + "Job is in progress, polling again in 10 seconds...\n", + "Job is in progress, polling again in 10 seconds...\n", + "Job is in progress, polling again in 10 seconds...\n", + "Job is in progress, polling again in 10 seconds...\n", + "Job is in progress, polling again in 10 seconds...\n", + "Job is in progress, polling again in 10 seconds...\n", + "Job is in progress, polling again in 10 seconds...\n", + "Job is in progress, polling again in 10 seconds...\n", + "Job is in progress, polling again in 10 seconds...\n", + "Job is in progress, polling again in 10 seconds...\n", + "Job is in progress, polling again in 10 seconds...\n", + "Job is in progress, polling again in 10 seconds...\n", + "Job is in progress, polling again in 10 seconds...\n", + "Job is in progress, polling again in 10 seconds...\n", + "Job is completed.\n", + "Job details:\n", + "---\n", + "{\n", + " \"created_at\": \"2025-12-10T20:10:20.316640Z\",\n", + " \"id\": \"1f91e440-eeb2-407d-a4e8-9131693e392d\",\n", + " \"status\": \"COMPLETED\",\n", + " \"workflow_id\": \"760b3ad5-b747-47ed-b882-b84534c7d9b2\",\n", + " \"workflow_name\": \"Job 1f91e440\",\n", + " \"job_type\": \"ephemeral\"\n", + "}\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "6. If the job is completed but with errors, or is failed, run the following cell to try to find out why. Otherwise, skip ahead to step 7 in this procedure.\n", + "\n", + " [Learn more about this code](https://docs.unstructured.io/api-reference/workflow/overview#get-failed-file-details-for-a-job)." 
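+ ,
+ "\n",
+ "\n",
+ "   If any files are reported as failed, one possible recovery pattern is to copy just those files into a separate folder and then run a fresh on-demand job against that folder with the `run_on_demand_job` helper from step 4. The following is a sketch only, under the assumption that each `failed_file.document` value is the input file's name as listed in the job's input file details:\n",
+ "\n",
+ "   ```python\n",
+ "   # A sketch of retrying only the failed files. Assumes that\n",
+ "   # failed_file.document holds the input file's name.\n",
+ "   import os, shutil\n",
+ "\n",
+ "   retry_dir = \"/content/input_retry\"\n",
+ "   os.makedirs(retry_dir, exist_ok=True)\n",
+ "\n",
+ "   with UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY) as client:\n",
+ "       failed_files = client.jobs.get_job_failed_files(\n",
+ "           request=GetJobFailedFilesRequest(job_id=job_id)\n",
+ "       ).job_failed_files.failed_files\n",
+ "\n",
+ "       for failed_file in failed_files:\n",
+ "           source_path = os.path.join(INPUT_PATH, failed_file.document)\n",
+ "           if os.path.isfile(source_path):\n",
+ "               shutil.copy(source_path, retry_dir)\n",
+ "\n",
+ "       # A new job for just these files could then be started with:\n",
+ "       # run_on_demand_job(client, retry_dir, \"ephemeral\", job_nodes=job_nodes)\n",
+ "   ```"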
+ ],
+ "metadata": {
+ "id": "e2gjO7s44fOt"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "from unstructured_client.models.operations import GetJobFailedFilesRequest\n",
+ "\n",
+ "def get_failed_files_details(client, job_id):\n",
+ "    response = client.jobs.get_job_failed_files(\n",
+ "        request=GetJobFailedFilesRequest(\n",
+ "            job_id=job_id\n",
+ "        )\n",
+ "    )\n",
+ "\n",
+ "    info = response.job_failed_files\n",
+ "\n",
+ "    if len(info.failed_files) > 0:\n",
+ "        print(f\"{len(info.failed_files)} failed file(s):\")\n",
+ "\n",
+ "        for failed_file in info.failed_files:\n",
+ "            print(\"---\")\n",
+ "            print(f\"document: {failed_file.document}\")\n",
+ "            print(f\"error: {failed_file.error}\")\n",
+ "    else:\n",
+ "        print(\"No failed files.\")\n",
+ "\n",
+ "with UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY) as client:\n",
+ "    get_failed_files_details(client, job_id)"
+ ],
+ "metadata": {
+ "id": "U4Rvuhq99le8",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "outputId": "2a3d2b50-036f-4707-e008-567d9cebf77d"
+ },
+ "execution_count": 31,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "No failed files.\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "7. If the job is successfully completed, run the following cell to download the data that Unstructured generated for the \"Chinese characters\" and \"Nash letters\" files to this notebook's local `/content/output` subfolder.\n",
+ "\n",
+ "   [Learn more about this code](https://docs.unstructured.io/api-reference/workflow/overview#download-a-processed-local-file-from-a-job)."
+ ],
+ "metadata": {
+ "id": "RuY4qxqy9vj8"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "from unstructured_client.models.operations import DownloadJobOutputRequest\n",
+ "\n",
+ "def download_job_output(client, job_id, job_input_file_ids, output_dir):\n",
+ "    for job_input_file_id in job_input_file_ids:\n",
+ "        print(f\"Attempting to get processed results from file_id '{job_input_file_id}'...\")\n",
+ "\n",
+ "        response = client.jobs.download_job_output(\n",
+ "            request=DownloadJobOutputRequest(\n",
+ "                job_id=job_id,\n",
+ "                file_id=job_input_file_id\n",
+ "            )\n",
+ "        )\n",
+ "\n",
+ "        output_path = os.path.join(output_dir, f\"{job_input_file_id}.json\")\n",
+ "\n",
+ "        with open(output_path, \"w\") as f:\n",
+ "            json.dump(response.any, f, indent=4)\n",
+ "\n",
+ "        print(f\"Saved output for file_id '{job_input_file_id}' to '{output_path}'.\\n\")\n",
+ "\n",
+ "with UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY) as client:\n",
+ "    download_job_output(client, job_id, job_input_file_ids, OUTPUT_PATH)"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "74ZHppUe4gTM",
+ "outputId": "9d3b12e3-39f9-4265-d545-9dca62b79821"
+ },
+ "execution_count": 32,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Attempting to get processed results from file_id '221112781v1-3b190ed1.pdf'...\n",
+ "Saved output for file_id '221112781v1-3b190ed1.pdf' to '/content/output/221112781v1-3b190ed1.pdf.json'.\n",
+ "\n",
+ "Attempting to get processed results from file_id 'H03-Cryptosystem-proposed-by-Nash-3edb6a01.pdf'...\n",
+ "Saved output for file_id 'H03-Cryptosystem-proposed-by-Nash-3edb6a01.pdf' to '/content/output/H03-Cryptosystem-proposed-by-Nash-3edb6a01.pdf.json'.\n",
+ "\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "8. You can now look at the data that Unstructured generated. Do this as follows:\n",
+ "\n",
+ "    a. 
On this notebook's sidebar, click the folder (**Files**) icon, if the **Files** pane is not already shown.
\n", + " b. In the **Files** pane, click to expand the `output` folder.
\n", + " c. Double-click the file titled `221112781v1-.pdf.json`, where `221112781v1-.pdf` matches the related `file_id` value above.
\n", + " d. The file's contents appear in a pane on the right side of this notebook.\n", + "\n", + "---\n", + ">\n", + "> πŸ’‘ _What am I looking at in the output here?_\n", + ">\n", + "> - Unstructured outputs its results in industry-standard [JSON](https://www.json.org/) format, which is\n", + "> ideal for RAG, agentic AI, and model fine-tuning.\n", + "> - Each object in the JSON is called a [document element](https://docs.unstructured.io/ui/document-elements) and contains a `text`\n", + "> representation of the content that Unstructured detected for the particular\n", + "> portion of the document that was analyzed.\n", + "> - The `type` is the kind of document element that Unstructured categorizes it\n", + "> as, such as whether it is a title (`Title`), a table (`Table`), an image\n", + "> (`Image`), a series of well-formulated sentences (`NarrativeText`), some kind\n", + "> of free text (`UncategorizedText`), a part of a list (`ListItem`), and so on. [Learn more](https://docs.unstructured.io/ui/document-elements#element-type).\n", + "> - The `element_id` is a unique identifier that Unstructured generates to\n", + "> refer to each document element. [Learn more](https://docs.unstructured.io/ui/document-elements#element-id).\n", + "> - `metadata` contains supporting details about each document element, such as\n", + "> the page number it occurred on, the file it occurred in, and so on. [Learn more](https://docs.unstructured.io/ui/document-elements#metadata).\n", + ">\n", + "---\n", + "\n", + "9. Some interesting portions of the output include the following. To search for text in the output, right-click anywhere inside the output pane, click **Command Palette**, enter or click **Find**, and then enter the text you want to search for:\n", + "\n", + " - The Chinese characters on page 3. Search for the text `verbs. The characters`. Notice how the Chinese characters are interpreted as their Unicode equivalents. For example, `\\u624e` is ζ‰Ž.\n", + " - The tables on pages 1, 3, 6, 7-9, and 12. Search for the text `\"Table\"` (including the quotation marks). Notice especially the `text` and `metadata.text_as_html` output for each of these table elements. To quickly jump among each of the tables in this output, click the **Next Match (Enter)** (down arrow) icon in the **Find** bar.\n", + " - The images on pages 3, 7, and 8. Search for the text `\"Image\"` (including the quotation marks). Notice especially the `text` and `metadata.text_as_html` output for each of these image elements. To quickly jump among each of the images in this output, click the **Next Match (Enter)** (down arrow) icon in the **Find** bar.\n", + "\n", + "10. Now open the `H03-Cryptosystem-proposed-by-Nash-.pdf.json` file, where `H03-Cryptosystem-proposed-by-Nash-.pdf` matches the related `file_id` value above.\n", + "11. Some interesting portions of the output include the following:\n", + "\n", + " - The handwriting on page 3. Search for the text `Dear Major Grosjean`.\n", + " - The mimeograph on page 18. Search for the text `The system which`." 
+ ],
    "metadata": {
      "id": "0Ws3ZC51_9GO"
    }
  },
  {
    "cell_type": "markdown",
    "source": [
      "## Step 2: Experiment with enriching\n",
      "\n",
      "In this step, you add several [enrichments](https://docs.unstructured.io/ui/enriching/overview) to your workflow, such as summary descriptions of detected images and tables, HTML representations of detected tables, detected entities (such as people and organizations) along with the inferred relationships among these entities, and improved accuracy for initially detected complex text.\n",
      "\n",
      "---\n",
      ">\n",
      "> πŸ’‘ _Can you tell me more about what each of these enrichments actually does?_\n",
      ">\n",
      "> The _image description_ enrichment generates a summary description of each\n",
      "> detected image. This can help you to more quickly and easily understand what\n",
      "> each image is all about without having to stop to manually visualize and\n",
      "> interpret the image's content yourself. This also provides additional helpful\n",
      "> context about the image for your RAG apps, agents, and models.\n",
      "> [Learn more](https://docs.unstructured.io/ui/enriching/image-descriptions).\n",
      ">\n",
      "> The _table description_ enrichment generates a summary description of each\n",
      "> detected table. This can help you to more quickly and easily understand what\n",
      "> each table is all about without having to stop to manually read through the\n",
      "> table's content yourself. This also provides additional helpful context about\n",
      "> the table for your RAG apps, agents, and models.\n",
      "> [Learn more](https://docs.unstructured.io/ui/enriching/table-descriptions).\n",
      ">\n",
      "> The _table-to-HTML_ enrichment generates an HTML representation of each\n",
      "> detected table. This can help you to more quickly and accurately recreate the\n",
      "> table's content elsewhere later as needed. This also provides additional\n",
      "> context about the table's structure for your RAG apps, agents, and models.\n",
      "> [Learn more](https://docs.unstructured.io/ui/enriching/table-to-html).\n",
      ">\n",
      "> The _named entity recognition (NER)_ enrichment generates a list of detected\n",
      "> entities (such as people and organizations) and the inferred relationships\n",
      "> among these entities. This provides additional context about these entities'\n",
      "> types and their relationships for your graph databases, RAG apps, agents, and\n",
      "> models.\n",
      "> [Learn more](https://docs.unstructured.io/ui/enriching/ner).\n",
      ">\n",
      "> The _generative optical character recognition (generative OCR)_ enrichment\n",
      "> improves the accuracy of initially detected complex text, such as text that\n",
      "> is presented in non-standard orientations, watermarked text, or text with a\n",
      "> mixture of international characters.\n",
      "> [Learn more](https://docs.unstructured.io/ui/enriching/generative-ocr).\n",
      ">\n",
      "---\n",
      "\n",
      "1. Run the following cell, which adds workflow nodes for generating the image and table summary descriptions, table HTML, entities and their relationships, and generative OCR.\n",
      "\n",
      "   [Learn more about this code](https://docs.unstructured.io/api-reference/workflow/workflows#enrichment-node)."
+ ], + "metadata": { + "id": "adV8bM9fBCRK" + } + }, + { + "cell_type": "code", + "source": [ + "image_description_enrichment_node = {\n", + " \"name\": \"Anthropic Image Description\",\n", + " \"subtype\": \"anthropic_image_description\",\n", + " \"type\": \"prompter\",\n", + " \"settings\": {}\n", + "}\n", + "\n", + "table_description_enrichment_node = {\n", + " \"name\": \"Anthropic Table Description\",\n", + " \"subtype\": \"anthropic_table_description\",\n", + " \"type\": \"prompter\",\n", + " \"settings\": {}\n", + "}\n", + "\n", + "table_to_html_enrichment_node = {\n", + " \"name\": \"Anthropic Table to HTML\",\n", + " \"subtype\": \"anthropic_table2html\",\n", + " \"type\": \"prompter\",\n", + " \"settings\": {}\n", + "}\n", + "\n", + "named_entity_recognition_enrichment_node = {\n", + " \"name\": \"Anthropic NER\",\n", + " \"subtype\": \"anthropic_ner\",\n", + " \"type\": \"prompter\",\n", + " \"settings\": {}\n", + "}\n", + "\n", + "generative_ocr_enrichment_node = {\n", + " \"name\": \"Anthropic Generative OCR\",\n", + " \"subtype\": \"anthropic_ocr\",\n", + " \"type\": \"prompter\",\n", + " \"settings\": {}\n", + "}\n", + "\n", + "job_nodes = [\n", + " vlm_partitioner_node,\n", + " image_description_enrichment_node,\n", + " table_description_enrichment_node,\n", + " table_to_html_enrichment_node,\n", + " named_entity_recognition_enrichment_node,\n", + " generative_ocr_enrichment_node\n", + "]\n", + "\n", + "with UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY) as client:\n", + " job_id, job_input_file_ids, job_output_node_files = run_on_demand_job(\n", + " client = client,\n", + " input_dir = INPUT_PATH,\n", + " job_type = \"ephemeral\",\n", + " job_nodes = job_nodes\n", + " )\n", + "\n", + " print(f\"Job ID: {job_id}\\n\")\n", + " print(\"Input file details:\\n\")\n", + "\n", + " for job_input_file_id in job_input_file_ids:\n", + " print(job_input_file_id)\n", + "\n", + " print(\"\\nOutput node file details:\\n\")\n", + "\n", + " for output_node_file in job_output_node_files:\n", + " print(output_node_file)" + ], + "metadata": { + "id": "oqPlD5fyBaBp" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "2. Run the following cell, which uses the job's ID to get the job's status. Do not proceed until the job polling is complete. This job polling could take a few minutes or more." + ], + "metadata": { + "id": "DFj4GPGlPUDV" + } + }, + { + "cell_type": "code", + "source": [ + "with UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY) as client:\n", + " job = poll_job_status(client, job_id)\n", + " print(f\"Job details:\\n---\\n{job.model_dump_json(indent=4)}\")" + ], + "metadata": { + "id": "7t0JTj5RPMBe" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "3. If the job is successfully completed, run the following cell to clear out the downloaded data from the previous on-demand job, and then download the data that Unstructured generated for the \"Chinese characters\" and \"Nash letters\" files to this notebook's local `/content/output` subfolder." 
+ ], + "metadata": { + "id": "9MFRSjtAPdVC" + } + }, + { + "cell_type": "code", + "source": [ + "def clear_output_dir_files(output_dir):\n", + " for name in os.listdir(output_dir):\n", + " file_path = os.path.join(output_dir, name)\n", + " if os.path.isfile(file_path): # Skip any subfolders.\n", + " os.remove(file_path)\n", + "\n", + "\n", + "with UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY) as client:\n", + " clear_output_dir_files(OUTPUT_PATH)\n", + " download_job_output(client, job_id, job_input_file_ids, OUTPUT_PATH)" + ], + "metadata": { + "id": "vkVjtLTqP_C8" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "4. Some interesting portions of the output, for example in the \"Chinese characters\" file, include the following:\n", + "\n", + " - The tables on pages 1, 3, 6, 7-9, and 12. Search for the text `\"Table\"` (including the quotation marks). Notice the summary description for each of these tables in these table elements' `text` fields. Also notice the `text_as_html` field for each of these tables. Remember, to quickly jump among each of the tables in this output, click the **Next Match (Enter)** (down arrow) icon in the **Find** bar.\n", + " - The images on pages 3, 7, and 8. Search for the text `\"Image\"` (including the quotation marks). Notice the summary description for each of these image elements' `text` fields. Remember, to quickly jump among each of the images in this output, click the **Next Match (Enter)** (down arrow) icon in the **Find** bar.\n", + " - The identified entities and inferred relationships among them. Search for the text `Zhijun Wang`. Of the eight instances of this name, notice the author's identification as a `PERSON` three times, the author's `published` relationship twice, and the author's `affiliated_with` relationship twice. Remember, to quickly jump among each of the instances of `Zhijun Wang` in this output, click the **Next Match (Enter)** (down arrow) icon in the **Find** bar." + ], + "metadata": { + "id": "H8maYX2sQB9i" + } + }, + { + "cell_type": "markdown", + "source": [ + "## Step 3: Experiment with chunking\n", + "\n", + "In this step, you apply [chunking](https://docs.unstructured.io/ui/chunking) to your workflow. Chunking is the process where Unstructured rearranges the resulting document elements' `text` content into manageable \"chunks\" to stay within the limits of an AI model and to improve retrieval precision.\n", + "\n", + "---\n", + ">\n", + "> πŸ’‘ _What kind of chunking strategy should I use, and how big should my chunks be?_\n", + ">\n", + "> Unfortunately, there is no one-size-fits-all answer to this question.\n", + "> However, there are some general considerations and guidelines that can help\n", + "> you to determine the best chunking strategy and chunk size for your specific\n", + "> use case. Be sure of course to also consult the documentation for your target\n", + "> AI model and downstream application toolsets.\n", + ">\n", + "> Is your content primarily organized by title, by page, by interrelated\n", + "> subject matter, or none of these? This can help you determine whether a\n", + "> by-title, by-page, by-similarity, or basic (by-character) chunking strategy\n", + "> is best. (You'll experiment with each of these strategies here later.)\n", + ">\n", + "> If your chunks are too small, they might lose necessary context, leading to\n", + "> the model providing inaccurate, irrelevant, or hallucinated results. 
On the\n",
      "> other hand, if your chunks are too large, the model can struggle with the\n",
      "> sheer volume of information, leading to information overload, diluted\n",
      "> meaning, and potentially higher processing costs. You should aim for chunks\n",
      "> that are big enough to contain meaningful information, yet small enough to\n",
      "> enable performant applications and low-latency responses.\n",
      ">\n",
      "> For example, smaller chunks of 128 or 256 tokens might be sufficient for\n",
      "> capturing more granular semantic information, while larger chunks of 512 or\n",
      "> 1024 tokens might be better for retaining more context. It's important here\n",
      "> to note that _tokens_ and _characters_ are not the same thing! For English\n",
      "> text, a common approximation is that 1 token equals about 3 or 4 characters,\n",
      "> or about three-quarters of a word. Many AI model providers publish their own\n",
      "> token-to-character calculators online that you can use for estimation\n",
      "> purposes.\n",
      ">\n",
      "> You should experiment with a variety of chunk sizes, taking into account the\n",
      "> kinds of content, the length and complexity of user queries and agent tasks,\n",
      "> the intended end use, and of course the limits of the models you are using.\n",
      "> Try different chunking strategies and sizes with your models and evaluate the\n",
      "> results for yourself.\n",
      ">\n",
      "---\n",
      "\n",
      "1. Run the following code, which changes the workflow to add a workflow node that chunks the document elements' `text` by using a character-based strategy.\n",
      "\n",
      "   [Learn more about this code](https://docs.unstructured.io/api-reference/workflow/workflows#chunker-node).\n",
      "\n",
      "---\n",
      ">\n",
      "> πŸ’‘ _What kinds of chunking settings are available, and what does each setting do?_\n",
      ">\n",
      "> - Contextual chunking (`contextual_chunking_strategy`) prepends\n",
      ">   chunk-specific explanatory context to each chunk, which has been shown to\n",
      ">   yield significant improvements in downstream retrieval accuracy.\n",
      ">   [Learn more](https://docs.unstructured.io/ui/chunking#contextual-chunking).\n",
      "> - Include original elements (`include_orig_elements`) outputs the elements\n",
      ">   that were used to form each chunk into that chunk's\n",
      ">   `metadata.orig_elements` value.\n",
      ">   [Learn more](https://docs.unstructured.io/ui/chunking#include-original-elements-setting).\n",
      "> - Max characters (`max_characters`) is the \"hard\" or maximum number of\n",
      ">   characters that any one chunk can contain. 
Unstructured cannot exceed this\n",
      ">   number when forming chunks.\n",
      ">   [Learn more](https://docs.unstructured.io/ui/chunking#max-characters-setting).\n",
      "> - New after n characters (`new_after_n_chars`) is the \"soft\" or\n",
      ">   approximate number of characters that any one chunk can contain.\n",
      ">   Unstructured can exceed this number if needed when forming chunks (but\n",
      ">   still cannot exceed the max characters setting).\n",
      ">   [Learn more](https://docs.unstructured.io/ui/chunking#new-after-n-characters-setting).\n",
      "> - Overlap (`overlap`), when applied (see `overlap_all`), prepends to the\n",
      ">   current chunk the specified number of characters from the previous chunk,\n",
      ">   which can help provide additional context about this chunk relative to the\n",
      ">   previous chunk.\n",
      ">   [Learn more](https://docs.unstructured.io/ui/chunking#overlap-setting).\n",
      "> - Overlap all (`overlap_all`), when set to `True`, applies the `overlap`\n",
      ">   setting (if greater than zero) to all chunks. Setting `overlap_all` to\n",
      ">   `False` means that the `overlap` setting (if greater than zero) is applied\n",
      ">   only in edge cases where \"normal\" chunks cannot be formed by combining\n",
      ">   whole elements. Set `overlap_all` to `True` with caution as it can\n",
      ">   introduce noise into otherwise clean semantic units.\n",
      ">   [Learn more](https://docs.unstructured.io/ui/chunking#overlap-all-setting).\n",
      "> - Combine text under n characters (`combine_text_under_n_chars`)\n",
      ">   combines elements from a section into a chunk until a section reaches a\n",
      ">   length of this many characters.\n",
      ">   [Learn more](https://docs.unstructured.io/ui/chunking#combine-text-under-n-characters-setting).\n",
      "> - Multipage sections (`multipage_sections`), when set to `True`, allows\n",
      ">   sections to span multiple pages.\n",
      ">   [Learn more](https://docs.unstructured.io/ui/chunking#multipage-sections-setting).\n",
      "> - The similarity threshold (`similarity_threshold`) is a number between `0`\n",
      ">   and `1` exclusive (`0.01` to `0.99` inclusive). It controls how similar in\n",
      ">   semantic meaning two segments of text must be for them to be combined into\n",
      ">   the same chunk, when such combining must occur: values toward `0.01` allow\n",
      ">   even loosely related segments to be combined, while values toward `0.99`\n",
      ">   combine only segments that are nearly identical in semantic meaning.\n",
      ">   [Learn more](https://docs.unstructured.io/ui/chunking#similarity-threshold-setting).\n",
      ">\n",
      "---",
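      "\n",
      "For a rough sense of how the character-based settings above relate to tokens, here is a minimal back-of-the-envelope sketch that uses the approximate 4-characters-per-token rule of thumb for English mentioned earlier. It is an estimate only; for real numbers, use your model provider's own tokenizer or calculator:\n",
      "\n",
      "```python\n",
      "# Very rough approximation only: about 4 characters per token for English text.\n",
      "def estimate_tokens(text, chars_per_token=4):\n",
      "    return max(1, round(len(text) / chars_per_token))\n",
      "\n",
      "chunk = \"Chunking balances context against retrieval precision.\"\n",
      "print(f\"{len(chunk)} characters is roughly {estimate_tokens(chunk)} tokens.\")\n",
      "```"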
+ ],
    "metadata": {
      "id": "-lhFW_RRST27"
    }
  },
  {
    "cell_type": "code",
    "source": [
      "chunk_by_character_node = {\n",
      "    \"name\": \"Chunker\",\n",
      "    \"subtype\": \"chunk_by_character\",\n",
      "    \"type\": \"chunk\",\n",
      "    \"settings\": {\n",
      "        \"include_orig_elements\": True,\n",
      "        \"new_after_n_chars\": 400,\n",
      "        \"max_characters\": 500,\n",
      "        \"overlap\": 50\n",
      "    }\n",
      "}\n",
      "\n",
      "job_nodes = [\n",
      "    vlm_partitioner_node,\n",
      "    image_description_enrichment_node,\n",
      "    table_description_enrichment_node,\n",
      "    table_to_html_enrichment_node,\n",
      "    named_entity_recognition_enrichment_node,\n",
      "    generative_ocr_enrichment_node,\n",
      "    chunk_by_character_node\n",
      "]\n",
      "\n",
      "with UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY) as client:\n",
      "    job_id, job_input_file_ids, job_output_node_files = run_on_demand_job(\n",
      "        client = client,\n",
      "        input_dir = INPUT_PATH,\n",
      "        job_type = \"ephemeral\",\n",
      "        job_nodes = job_nodes\n",
      "    )\n",
      "\n",
      "    print(f\"Job ID: {job_id}\\n\")\n",
      "    print(\"Input file details:\\n\")\n",
      "\n",
      "    for job_input_file_id in job_input_file_ids:\n",
      "        print(job_input_file_id)\n",
      "\n",
      "    print(\"\\nOutput node file details:\\n\")\n",
      "\n",
      "    for output_node_file in job_output_node_files:\n",
      "        print(output_node_file)"
    ],
    "metadata": {
      "id": "Ip3q2GySS6ty"
    },
    "execution_count": null,
    "outputs": []
  },
  {
    "cell_type": "markdown",
    "source": [
      "2. Run the following cell, which uses the job's ID to get the job's status. Do not proceed until the job polling is complete. This job polling could take a few minutes or more."
    ],
    "metadata": {
      "id": "uQx-k377UCL1"
    }
  },
  {
    "cell_type": "code",
    "source": [
      "with UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY) as client:\n",
      "    job = poll_job_status(client, job_id)\n",
      "    print(f\"Job details:\\n---\\n{job.model_dump_json(indent=4)}\")"
    ],
    "metadata": {
      "id": "GeRPuD5SUBkW"
    },
    "execution_count": null,
    "outputs": []
  },
  {
    "cell_type": "markdown",
    "source": [
      "3. If the job is successfully completed, run the following cell to clear out the downloaded data from the previous on-demand job, and then download the data that Unstructured generated for the \"Chinese characters\" and \"Nash letters\" files to this notebook's local `/content/output` subfolder."
    ],
    "metadata": {
      "id": "5a28Vt-4UP6k"
    }
  },
  {
    "cell_type": "code",
    "source": [
      "with UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY) as client:\n",
      "    clear_output_dir_files(OUTPUT_PATH)\n",
      "    download_job_output(client, job_id, job_input_file_ids, OUTPUT_PATH)"
    ],
    "metadata": {
      "id": "SeyIy8VbUXNx"
    },
    "execution_count": null,
    "outputs": []
  },
  {
    "cell_type": "markdown",
    "source": [
      "4. Open the new version of the output files. To explore the chunker's results, search for the text `\"CompositeElement\"` (with quotation marks).\n",
      "5. (Steps 5-8 are optional) Now run the following cell, which changes the workflow's chunker node to chunk the document elements' text by using a title-based strategy."
+ ], + "metadata": { + "id": "tMgvcLc4Ukrv" + } + }, + { + "cell_type": "code", + "source": [ + "chunk_by_title_node = {\n", + " \"name\": \"Chunker\",\n", + " \"subtype\": \"chunk_by_title\",\n", + " \"type\": \"chunk\",\n", + " \"settings\": {\n", + " \"include_orig_elements\": True,\n", + " \"new_after_n_chars\": 400,\n", + " \"max_characters\": 500,\n", + " \"overlap\": 50\n", + " }\n", + "}\n", + "\n", + "job_nodes = [\n", + " vlm_partitioner_node,\n", + " image_description_enrichment_node,\n", + " table_description_enrichment_node,\n", + " table_to_html_enrichment_node,\n", + " named_entity_recognition_enrichment_node,\n", + " generative_ocr_enrichment_node,\n", + " chunk_by_title_node\n", + "]\n", + "\n", + "with UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY) as client:\n", + " job_id, job_input_file_ids, job_output_node_files = run_on_demand_job(\n", + " client = client,\n", + " input_dir = INPUT_PATH,\n", + " job_type = \"ephemeral\",\n", + " job_nodes = job_nodes\n", + " )\n", + "\n", + " print(f\"Job ID: {job_id}\\n\")\n", + " print(\"Input file details:\\n\")\n", + "\n", + " for job_input_file_id in job_input_file_ids:\n", + " print(job_input_file_id)\n", + "\n", + " print(\"\\nOutput node file details:\\n\")\n", + "\n", + " for output_node_file in job_output_node_files:\n", + " print(output_node_file)" + ], + "metadata": { + "id": "1vUfplrqVNhL" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "6. Run the following cell, which uses the job's ID to get the job's status. Do not proceed until the job polling is complete. This job polling could take a few minutes or more." + ], + "metadata": { + "id": "osbEG5h9XDNF" + } + }, + { + "cell_type": "code", + "source": [ + "with UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY) as client:\n", + " job = poll_job_status(client, job_id)\n", + " print(f\"Job details:\\n---\\n{job.model_dump_json(indent=4)}\")" + ], + "metadata": { + "id": "YgLZpZenW-Ao" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "7. If the job is successfully completed, run the following cell to clear out the downloaded data from the previous on-demand job, and then download the data that Unstructured generated for the \"Chinese characters\" and \"Nash letters\" files to this notebook's local `/content/output` subfolder." + ], + "metadata": { + "id": "dkyAdsw1XKeC" + } + }, + { + "cell_type": "code", + "source": [ + "with UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY) as client:\n", + " clear_output_dir_files(OUTPUT_PATH)\n", + " download_job_output(client, job_id, job_input_file_ids, OUTPUT_PATH)" + ], + "metadata": { + "id": "-Tkj4utxXWP-" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "8. Open the new version of the output files. To explore the chunker's results, search for the text \"CompositeElement\" (with quotation marks). Notice that the lengths of some of the chunks that immediately precede titles might be shortened due to the presence of the title impacting the chunk's size.\n", + "9. (Steps 9-12 are optional) Now run the following cell, which changes the workflow's chunker node to chunk the document elements' text by using a page-based strategy." 
+ ], + "metadata": { + "id": "q63j7DXkXXJG" + } + }, + { + "cell_type": "code", + "source": [ + "chunk_by_page_node = {\n", + " \"name\": \"Chunker\",\n", + " \"subtype\": \"chunk_by_page\",\n", + " \"type\": \"chunk\",\n", + " \"settings\": {\n", + " \"include_orig_elements\": True,\n", + " \"new_after_n_chars\": 400,\n", + " \"max_characters\": 500,\n", + " \"overlap\": 50\n", + " }\n", + "}\n", + "\n", + "job_nodes = [\n", + " vlm_partitioner_node,\n", + " image_description_enrichment_node,\n", + " table_description_enrichment_node,\n", + " table_to_html_enrichment_node,\n", + " named_entity_recognition_enrichment_node,\n", + " generative_ocr_enrichment_node,\n", + " chunk_by_page_node\n", + "]\n", + "\n", + "with UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY) as client:\n", + " job_id, job_input_file_ids, job_output_node_files = run_on_demand_job(\n", + " client = client,\n", + " input_dir = INPUT_PATH,\n", + " job_type = \"ephemeral\",\n", + " job_nodes = job_nodes\n", + " )\n", + "\n", + " print(f\"Job ID: {job_id}\\n\")\n", + " print(\"Input file details:\\n\")\n", + "\n", + " for job_input_file_id in job_input_file_ids:\n", + " print(job_input_file_id)\n", + "\n", + " print(\"\\nOutput node file details:\\n\")\n", + "\n", + " for output_node_file in job_output_node_files:\n", + " print(output_node_file)" + ], + "metadata": { + "id": "ZTGkuvLlYb3z" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "10. Run the following cell, which uses the job's ID to get the job's status. Do not proceed until the job polling is complete. This job polling could take a few minutes or more." + ], + "metadata": { + "id": "9ihOQWMsYq3P" + } + }, + { + "cell_type": "code", + "source": [ + "with UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY) as client:\n", + " job = poll_job_status(client, job_id)\n", + " print(f\"Job details:\\n---\\n{job.model_dump_json(indent=4)}\")" + ], + "metadata": { + "id": "cfArKeNFYvRN" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "11. If the job is successfully completed, run the following cell to clear out the downloaded data from the previous on-demand job, and then download the data that Unstructured generated for the \"Chinese characters\" and \"Nash letters\" files to this notebook's local `/content/output` subfolder." + ], + "metadata": { + "id": "gSclprpgYzXt" + } + }, + { + "cell_type": "code", + "source": [ + "with UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY) as client:\n", + " clear_output_dir_files(OUTPUT_PATH)\n", + " download_job_output(client, job_id, job_input_file_ids, OUTPUT_PATH)" + ], + "metadata": { + "id": "3L43Upg_Y6SY" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "12. Open the new version of the output files. To explore the chunker's results, search for the text \"CompositeElement\" (with quotation marks). Notice that the lengths of some of the chunks that immediately precede page breaks might be shortened due to the presence of the page break impacting the chunk's size.\n", + "13. (Steps 13-20 are optional) Now run the following cell, which changes the workflow's chunker node to chunk the document elements' text by using a similarity-based strategy." 
+ ], + "metadata": { + "id": "LKqpiaxnY-t8" + } + }, + { + "cell_type": "code", + "source": [ + "chunk_by_similarity_node = {\n", + " \"name\": \"Chunker\",\n", + " \"subtype\": \"chunk_by_similarity\",\n", + " \"type\": \"chunk\",\n", + " \"settings\": {\n", + " \"similarity_threshold\": 0.99,\n", + " \"include_orig_elements\": True,\n", + " \"max_characters\": 500\n", + " }\n", + "}\n", + "\n", + "job_nodes = [\n", + " vlm_partitioner_node,\n", + " image_description_enrichment_node,\n", + " table_description_enrichment_node,\n", + " table_to_html_enrichment_node,\n", + " named_entity_recognition_enrichment_node,\n", + " generative_ocr_enrichment_node,\n", + " chunk_by_similarity_node\n", + "]\n", + "\n", + "with UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY) as client:\n", + " job_id, job_input_file_ids, job_output_node_files = run_on_demand_job(\n", + " client = client,\n", + " input_dir = INPUT_PATH,\n", + " job_type = \"ephemeral\",\n", + " job_nodes = job_nodes\n", + " )\n", + "\n", + " print(f\"Job ID: {job_id}\\n\")\n", + " print(\"Input file details:\\n\")\n", + "\n", + " for job_input_file_id in job_input_file_ids:\n", + " print(job_input_file_id)\n", + "\n", + " print(\"\\nOutput node file details:\\n\")\n", + "\n", + " for output_node_file in job_output_node_files:\n", + " print(output_node_file)" + ], + "metadata": { + "id": "v0P2cBHVZUEk" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "14. Run the following cell, which uses the job's ID to get the job's status. Do not proceed until the job polling is complete. This job polling could take a few minutes or more." + ], + "metadata": { + "id": "QDS2fSJOZzmS" + } + }, + { + "cell_type": "code", + "source": [ + "with UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY) as client:\n", + " job = poll_job_status(client, job_id)\n", + " print(f\"Job details:\\n---\\n{job.model_dump_json(indent=4)}\")" + ], + "metadata": { + "id": "cpbpmfAJZ3O4" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "15. If the job is successfully completed, run the following cell to clear out the downloaded data from the previous on-demand job, and then download the data that Unstructured generated for the \"Chinese characters\" and \"Nash letters\" files to this notebook's local `/content/output` subfolder." + ], + "metadata": { + "id": "zcBaoi54Z4FZ" + } + }, + { + "cell_type": "code", + "source": [ + "with UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY) as client:\n", + " clear_output_dir_files(OUTPUT_PATH)\n", + " download_job_output(client, job_id, job_input_file_ids, OUTPUT_PATH)" + ], + "metadata": { + "id": "1EUqCguBaJ83" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "16. Open the new version of the output files. To explore the chunker's results, search for the text `\"CompositeElement\"` (with quotation marks). Notice that the lengths of many of the chunks fall well short of the `max_characters` limit. This is because a similarity threshold of `0.99` means that only sentences or text segments with a near-perfect semantic match will be grouped together into the same chunk. This is an extremely high threshold, resulting in very short, highly specific chunks of text.\n", + "17. Now run the following cell, which changes the existing workflow's chunker node to use a different similarity threshold." 
+ ], + "metadata": { + "id": "ftldsqBAaPsL" + } + }, + { + "cell_type": "code", + "source": [ + "chunk_by_similarity_node = {\n", + " \"name\": \"Chunker\",\n", + " \"subtype\": \"chunk_by_similarity\",\n", + " \"type\": \"chunk\",\n", + " \"settings\": {\n", + " \"similarity_threshold\": 0.01,\n", + " \"include_orig_elements\": True,\n", + " \"max_characters\": 500\n", + " }\n", + "}\n", + "\n", + "job_nodes = [\n", + " vlm_partitioner_node,\n", + " image_description_enrichment_node,\n", + " table_description_enrichment_node,\n", + " table_to_html_enrichment_node,\n", + " named_entity_recognition_enrichment_node,\n", + " generative_ocr_enrichment_node,\n", + " chunk_by_similarity_node\n", + "]\n", + "\n", + "with UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY) as client:\n", + " job_id, job_input_file_ids, job_output_node_files = run_on_demand_job(\n", + " client = client,\n", + " input_dir = INPUT_PATH,\n", + " job_type = \"ephemeral\",\n", + " job_nodes = job_nodes\n", + " )\n", + "\n", + " print(f\"Job ID: {job_id}\\n\")\n", + " print(\"Input file details:\\n\")\n", + "\n", + " for job_input_file_id in job_input_file_ids:\n", + " print(job_input_file_id)\n", + "\n", + " print(\"\\nOutput node file details:\\n\")\n", + "\n", + " for output_node_file in job_output_node_files:\n", + " print(output_node_file)" + ], + "metadata": { + "id": "_luQtP0Eagwz" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "18. Run the following cell, which uses the job's ID to get the job's status. Do not proceed until the job polling is complete. This job polling could take a few minutes or more." + ], + "metadata": { + "id": "DAVK1P-kapvj" + } + }, + { + "cell_type": "code", + "source": [ + "with UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY) as client:\n", + " job = poll_job_status(client, job_id)\n", + " print(f\"Job details:\\n---\\n{job.model_dump_json(indent=4)}\")" + ], + "metadata": { + "id": "adJT1R50bpGN" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "19. If the job is successfully completed, run the following cell to clear out the downloaded data from the previous on-demand job, and then download the data that Unstructured generated for the \"Chinese characters\" and \"Nash letters\" files to this notebook's local `/content/output` subfolder." + ], + "metadata": { + "id": "ZPUfLmXPbvHF" + } + }, + { + "cell_type": "code", + "source": [ + "with UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY) as client:\n", + " clear_output_dir_files(OUTPUT_PATH)\n", + " download_job_output(client, job_id, job_input_file_ids, OUTPUT_PATH)" + ], + "metadata": { + "id": "N0SutwT4buJC" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "20. Open the new version of the output files. To explore the chunker's results, search for the text `\"CompositeElement\"` (with quotation marks). Notice now that many of the chunks will now come closer to the `max_characters` limit. This is because a similarity threshold of `0.01` provides an extreme tolerance of differences between pieces of text, grouping almost anything together." + ], + "metadata": { + "id": "RFaL_aYYiasP" + } + }, + { + "cell_type": "markdown", + "source": [ + "## Step 4 (Optional): Experiment with embedding\n", + "\n", + "In this step, you generate [embeddings](https://docs.unstructured.io/ui/embedding) for your workflow. 
Embeddings are vectors of numbers that represent various aspects of the text that is extracted by Unstructured. These vectors are stored or \"embedded\" next to the text itself in a vector store or vector database. Chatbots, agents, and other AI solutions can use these vector embeddings to more efficiently and effectively find, analyze, and use the associated text. These vector embeddings are generated by an embedding model that is provided by an embedding provider. For the best embedding model to apply to your use case, see the documentation for your target downstream application toolsets.\n",
      "\n",
      "1. Run the following cell, which changes the workflow to add a workflow node that uses the `text-embedding-3-small` embedding model provided by Azure OpenAI."
    ],
    "metadata": {
      "id": "bnIsZ4c8imB9"
    }
  },
  {
    "cell_type": "code",
    "source": [
      "embedder_node = {\n",
      "    \"name\": \"Embedder\",\n",
      "    \"subtype\": \"azure_openai\",\n",
      "    \"type\": \"embed\",\n",
      "    \"settings\": {\n",
      "        \"model_name\": \"text-embedding-3-small\"\n",
      "    }\n",
      "}\n",
      "\n",
      "job_nodes = [\n",
      "    vlm_partitioner_node,\n",
      "    image_description_enrichment_node,\n",
      "    table_description_enrichment_node,\n",
      "    table_to_html_enrichment_node,\n",
      "    named_entity_recognition_enrichment_node,\n",
      "    generative_ocr_enrichment_node,\n",
      "    chunk_by_character_node,\n",
      "    embedder_node\n",
      "]\n",
      "\n",
      "with UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY) as client:\n",
      "    job_id, job_input_file_ids, job_output_node_files = run_on_demand_job(\n",
      "        client = client,\n",
      "        input_dir = INPUT_PATH,\n",
      "        job_type = \"ephemeral\",\n",
      "        job_nodes = job_nodes\n",
      "    )\n",
      "\n",
      "    print(f\"Job ID: {job_id}\\n\")\n",
      "    print(\"Input file details:\\n\")\n",
      "\n",
      "    for job_input_file_id in job_input_file_ids:\n",
      "        print(job_input_file_id)\n",
      "\n",
      "    print(\"\\nOutput node file details:\\n\")\n",
      "\n",
      "    for output_node_file in job_output_node_files:\n",
      "        print(output_node_file)"
    ],
    "metadata": {
      "id": "uPKzBQaGi-RM"
    },
    "execution_count": null,
    "outputs": []
  },
  {
    "cell_type": "markdown",
    "source": [
      "2. Run the following cell, which uses the job's ID to get the job's status. Do not proceed until the job polling is complete. This job polling could take a few minutes or more."
    ],
    "metadata": {
      "id": "LC__oMvyjWKW"
    }
  },
  {
    "cell_type": "code",
    "source": [
      "with UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY) as client:\n",
      "    job = poll_job_status(client, job_id)\n",
      "    print(f\"Job details:\\n---\\n{job.model_dump_json(indent=4)}\")"
    ],
    "metadata": {
      "id": "fXGqu06SjV4n"
    },
    "execution_count": null,
    "outputs": []
  },
  {
    "cell_type": "markdown",
    "source": [
      "3. If the job is successfully completed, run the following cell to clear out the downloaded data from the previous on-demand job, and then download the data that Unstructured generated for the \"Chinese characters\" and \"Nash letters\" files to this notebook's local `/content/output` subfolder."
+ ],
    "metadata": {
      "id": "rjfdnm2Djd0x"
    }
  },
  {
    "cell_type": "code",
    "source": [
      "with UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY) as client:\n",
      "    clear_output_dir_files(OUTPUT_PATH)\n",
      "    download_job_output(client, job_id, job_input_file_ids, OUTPUT_PATH)"
    ],
    "metadata": {
      "id": "l0AkMu-zjiI-"
    },
    "execution_count": null,
    "outputs": []
  },
  {
    "cell_type": "markdown",
    "source": [
      "4. To explore the embeddings, open the new version of the output files, and search for the text `\"embeddings\"`.\n",
      "\n",
      "---\n",
      ">\n",
      "> πŸ’‘ _What do all of these numbers mean?_\n",
      ">\n",
      "> All by themselves, the numbers in the `embeddings` field of the output have\n",
      "> no human-interpretable meaning. However, when combined with the specific\n",
      "> text that these numbers are associated with, and the embedding model's logic\n",
      "> that was used to generate these numbers, the numbers in the `embeddings`\n",
      "> field are extremely powerful when leveraged by downstream chatbots, agents,\n",
      "> and other AI solutions.\n",
      ">\n",
      "> These numbers typically represent complex, abstract attributes about the text\n",
      "> that are known only to the embedding model that generated these numbers.\n",
      "> These attributes can be about the text's overall sentiment, intent, subject,\n",
      "> semantic meaning, grammatical function, relationships between words, or any\n",
      "> number of other things that the model is good at figuring out. This is why\n",
      "> the embedding model you choose here must be the exact same embedding model\n",
      "> that you use in any related chatbot, agent, or other AI solution that relies\n",
      "> on these numbers. Otherwise, the numbers that are generated here will not\n",
      "> have the same meaning downstream. Also, the number of dimensions (that is,\n",
      "> the count of numbers in the `embeddings` field) you choose here must exactly\n",
      "> match the number of dimensions used downstream.\n",
      ">\n",
      "> To repeat: the name and number of dimensions of the embedding model you\n",
      "> choose here must exactly match the name and number of dimensions of the\n",
      "> embedding model you use in your related downstream chatbots, agents, and\n",
      "> other AI solutions that rely on this particular text and its associated\n",
      "> embeddings."
    ],
    "metadata": {
      "id": "Bvvl9AchjkBR"
    }
  },
  {
    "cell_type": "markdown",
    "source": [
      "## Next steps\n",
      "\n",
      "Congratulations! You are now able to run Unstructured on-demand jobs that partition, enrich, chunk, and embed your local source documents, producing context-rich data that is ready for retrieval-augmented generation (RAG), agentic AI, and model fine-tuning.\n",
      "\n",
      "Right now, your workflow only accepts local files for input. Your workflow also only sends Unstructured's processed data to be saved locally as a JSON file. You can create workflows that accept files and data from—and send Unstructured's processed data to—one or more remote file storage locations, databases, and vector stores. 
To learn how to do this, try the [Dropbox-to-Pinecone Connector API Quickstart for Unstructured](https://colab.research.google.com/github/Unstructured-IO/notebooks/blob/main/notebooks/Dropbox_To_Pinecone_Connector_Quickstart.ipynb) notebook.\n",
      "\n",
      "Unstructured also offers a user interface (UI), which lets you work with Unstructured graphically instead of only through the API. For details, see the [Unstructured UI Overview](https://docs.unstructured.io/ui/overview).\n",
      "\n",
      "If you are not able to complete any of the preceding steps, contact Unstructured Support at [support@unstructured.io](mailto:support@unstructured.io)."
    ],
    "metadata": {
      "id": "4uE0NSPLkBQj"
    }
  }
  ]
}
\ No newline at end of file

From 6dabcf664d2e4888482395959373ac638a890b60 Mon Sep 17 00:00:00 2001
From: Paul Cornell
Date: Fri, 12 Dec 2025 12:56:46 -0800
Subject: [PATCH 2/2] Remove job_type

---
 ...ctured_API_On_Demand_Jobs_Quickstart.ipynb |  4 +--
 ...tured_API_On_Demand_Jobs_Walkthrough.ipynb | 33 +++++++------------
 2 files changed, 13 insertions(+), 24 deletions(-)

diff --git a/notebooks/Unstructured_API_On_Demand_Jobs_Quickstart.ipynb b/notebooks/Unstructured_API_On_Demand_Jobs_Quickstart.ipynb
index dea44fa..b38bcd7 100644
--- a/notebooks/Unstructured_API_On_Demand_Jobs_Quickstart.ipynb
+++ b/notebooks/Unstructured_API_On_Demand_Jobs_Quickstart.ipynb
@@ -200,7 +200,6 @@
     "\n",
     "# Set variables for:\n",
     "\n",
-    "# - On-demand job's type.\n",
     "# - On-demand job's workflow template name.\n",
     "# - On-demand job's settings.\n",
     "# - Path to local input files.\n",
@@ -209,9 +208,8 @@
     "# - On-demand job ID.\n",
     "# - On-demand job output file IDs and output node IDs.\n",
     "\n",
-    "job_type = \"template\"\n",
     "template_id = \"hi_res_and_enrichment\"\n",
-    "request_data = json.dumps({\"job_type\": job_type, \"template_id\": template_id})\n",
+    "request_data = json.dumps({\"template_id\": template_id})\n",
     "input_dir = \"/content/input/\"\n",
     "\n",
     "files = []\n",
diff --git a/notebooks/Unstructured_API_On_Demand_Jobs_Walkthrough.ipynb b/notebooks/Unstructured_API_On_Demand_Jobs_Walkthrough.ipynb
index a9ba5af..69c187d 100644
--- a/notebooks/Unstructured_API_On_Demand_Jobs_Walkthrough.ipynb
+++ b/notebooks/Unstructured_API_On_Demand_Jobs_Walkthrough.ipynb
@@ -159,7 +159,7 @@
   },
   {
     "cell_type": "code",
-    "execution_count": 1,
+    "execution_count": null,
     "metadata": {
       "id": "hRE1zws-Ut9c"
     },
@@ -247,7 +247,7 @@
     "metadata": {
       "id": "D0Cet6NNaLlm"
     },
-    "execution_count": 17,
+    "execution_count": null,
     "outputs": []
   },
@@ -269,7 +269,7 @@
     "from unstructured_client.models.shared import BodyCreateJob, InputFiles\n",
     "import os, json\n",
     "\n",
-    "def run_on_demand_job(client, input_dir, job_type, template_id=None, job_nodes=None):\n",
+    "def run_on_demand_job(client, input_dir, template_id=None, job_nodes=None):\n",
     "    request_data = {}\n",
     "    files = []\n",
     "\n",
@@ -290,12 +290,12 @@
     "        )\n",
     "    )\n",
     "\n",
-    "    if job_type == \"template\":\n",
-    "        request_data = json.dumps({\"job_type\": job_type, \"template_id\": template_id})\n",
-    "    elif job_type == \"ephemeral\":\n",
-    "        request_data = json.dumps({\"job_type\": job_type, \"job_nodes\": job_nodes})\n",
+    "    if template_id is not None:\n",
+    "        request_data = json.dumps({\"template_id\": template_id})\n",
+    "    elif job_nodes is not None:\n",
+    "        request_data = json.dumps({\"job_nodes\": job_nodes})\n",
     "    else:\n",
-    "        raise ValueError(f\"Invalid job type: '{job_type}'. 
Must be 'template' or 'ephemeral'.\")\n", + " raise ValueError(f\"Must provide a workflow template ID or a custom workflow definition for this job (but not both).\")\n", "\n", " # Run the on-demand job, capturing the job ID and the job's\n", " # input/output file IDs and output node IDs.\n", @@ -336,7 +336,6 @@ " job_id, job_input_file_ids, job_output_node_files = run_on_demand_job(\n", " client = client,\n", " input_dir = INPUT_PATH,\n", - " job_type = \"ephemeral\",\n", " job_nodes = job_nodes\n", " )\n", "\n", @@ -358,7 +357,7 @@ "id": "YSbKsE-lbUnB", "outputId": "cdc43792-3232-415b-ca0d-12f728b27b1b" }, - "execution_count": 29, + "execution_count": null, "outputs": [ { "output_type": "stream", @@ -502,7 +501,7 @@ "id": "BeZ9vBiS4QaP", "outputId": "4897b373-18dc-4ac6-e75d-0753cc4a9b1e" }, - "execution_count": 30, + "execution_count": null, "outputs": [ { "output_type": "stream", @@ -583,7 +582,7 @@ }, "outputId": "2a3d2b50-036f-4707-e008-567d9cebf77d" }, - "execution_count": 31, + "execution_count": null, "outputs": [ { "output_type": "stream", @@ -638,7 +637,7 @@ "id": "74ZHppUe4gTM", "outputId": "9d3b12e3-39f9-4265-d545-9dca62b79821" }, - "execution_count": 32, + "execution_count": null, "outputs": [ { "output_type": "stream", @@ -805,7 +804,6 @@ " job_id, job_input_file_ids, job_output_node_files = run_on_demand_job(\n", " client = client,\n", " input_dir = INPUT_PATH,\n", - " job_type = \"ephemeral\",\n", " job_nodes = job_nodes\n", " )\n", "\n", @@ -1028,7 +1026,6 @@ " job_id, job_input_file_ids, job_output_node_files = run_on_demand_job(\n", " client = client,\n", " input_dir = INPUT_PATH,\n", - " job_type = \"ephemeral\",\n", " job_nodes = job_nodes\n", " )\n", "\n", @@ -1132,7 +1129,6 @@ " job_id, job_input_file_ids, job_output_node_files = run_on_demand_job(\n", " client = client,\n", " input_dir = INPUT_PATH,\n", - " job_type = \"ephemeral\",\n", " job_nodes = job_nodes\n", " )\n", "\n", @@ -1236,7 +1232,6 @@ " job_id, job_input_file_ids, job_output_node_files = run_on_demand_job(\n", " client = client,\n", " input_dir = INPUT_PATH,\n", - " job_type = \"ephemeral\",\n", " job_nodes = job_nodes\n", " )\n", "\n", @@ -1339,7 +1334,6 @@ " job_id, job_input_file_ids, job_output_node_files = run_on_demand_job(\n", " client = client,\n", " input_dir = INPUT_PATH,\n", - " job_type = \"ephemeral\",\n", " job_nodes = job_nodes\n", " )\n", "\n", @@ -1442,7 +1436,6 @@ " job_id, job_input_file_ids, job_output_node_files = run_on_demand_job(\n", " client = client,\n", " input_dir = INPUT_PATH,\n", - " job_type = \"ephemeral\",\n", " job_nodes = job_nodes\n", " )\n", "\n", @@ -1554,7 +1547,6 @@ "\n", "job_id, output_node_files = run_on_demand_job(\n", " input_dir = INPUT_PATH,\n", - " job_type = \"ephemeral\",\n", " job_nodes = job_nodes\n", ")\n", "\n", @@ -1562,7 +1554,6 @@ " job_id, job_input_file_ids, job_output_node_files = run_on_demand_job(\n", " client = client,\n", " input_dir = INPUT_PATH,\n", - " job_type = \"ephemeral\",\n", " job_nodes = job_nodes\n", " )\n", "\n",