
Commit 7f0e105

Christopher Maddock authored
OSS, Nav, FAQ, Code Samples, Notebooks Update (#38)
Co-authored-by: Christopher Maddock <chris@unstructured.io>
1 parent b7dc886 commit 7f0e105

File tree

17 files changed: +468 −86 lines

open-source/best-practices/table-extraction-from-pdf.mdx renamed to examplecode/codesamples/apioss/table-extraction-from-pdf.mdx

Lines changed: 1 addition & 2 deletions

@@ -4,8 +4,7 @@ description: This section describes two methods for extracting tables from PDF f
 ---
 
 <Note>
-
-To extract tables from any documents, set the `strategy` parameter to `hi_res` for both methods below.
+This sample code utilizes the [Unstructured Open Source](/open-source/introduction/overview "Open Source") library and also provides an alternative method utilizing the [Unstructured SaaS API](api-reference/api-services/overview "SaaS API").
 </Note>
 
 ## Method 1: Using partition\_pdf
Lines changed: 110 additions & 0 deletions
@@ -0,0 +1,110 @@
---
title: Multi-File API Processing
---
<Note>
This sample code utilizes the [Unstructured SaaS API](api-reference/api-services/overview "SaaS API").
</Note>

## Introduction

This guide demonstrates how to process multiple files using the Unstructured API and S3 Connector and implement context-aware chunking. The process involves installing dependencies, configuring settings, and utilizing Python scripts to manage and chunk data effectively.

## Prerequisites

Ensure you have an Unstructured API key and access to an S3 bucket containing the target files.

## Step-by-Step Process

### Step 1: Install Unstructured and S3 Dependency

Install the unstructured package with S3 support.

```bash
pip install "unstructured[s3]"
```

### Step 2: Import Libraries

Import the necessary libraries from the unstructured package for chunking and S3 processing.

```python
import os  # used below to read the API key from the environment

from unstructured.ingest.interfaces import (
    FsspecConfig,
    PartitionConfig,
    ProcessorConfig,
    ReadConfig,
)
from unstructured.ingest.runner import S3Runner

from unstructured.chunking.title import chunk_by_title
from unstructured.staging.base import dict_to_elements
```

### Step 3: Configuration

Set up the API key and S3 URL for accessing the data.

```python
UNSTRUCTURED_API_KEY = os.getenv('UNSTRUCTURED_API_KEY')
S3_URL = "s3://rh-financial-reports/world-development-bank-2023/"
```
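If the `UNSTRUCTURED_API_KEY` variable is not set in your shell, `os.getenv` returns `None`. For quick experiments you can set it in-process instead; the value below is a placeholder, not a real key.

```python
# Placeholder value for illustration only -- substitute your actual key.
os.environ["UNSTRUCTURED_API_KEY"] = "your-api-key-here"
```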
### Step 4: Python Runner

Configure and run the S3Runner for processing the data.

```python
runner = S3Runner(
    processor_config=ProcessorConfig(
        verbose=True,
        output_dir="Connector-Output",
        num_processes=8,
    ),
    read_config=ReadConfig(),
    partition_config=PartitionConfig(
        partition_endpoint="https://api.unstructured.io/general/v0/general",
        partition_by_api=True,
        api_key=UNSTRUCTURED_API_KEY,
        strategy="hi_res",
        hi_res_model_name="yolox",
    ),
    fsspec_config=FsspecConfig(
        remote_url=S3_URL,
    ),
)

runner.run(anonymous=True)
```

### Step 5: Combine JSON Files from Multi-File Ingestion

Combine the per-file JSON outputs into a single dataset for further processing, using a small helper function (a sketch of this helper follows below).

```python
combined_json_data = read_and_combine_json("Connector-Output/world-development-bank-2023")
```
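Note that `read_and_combine_json` is not part of the unstructured package; it is a helper you define yourself. A minimal sketch, assuming each ingest output file contains a JSON list of element dictionaries:

```python
import json
from pathlib import Path

def read_and_combine_json(directory):
    """Combine every JSON output file in the ingest output directory
    into a single list of element dictionaries."""
    combined = []
    for path in sorted(Path(directory).glob("*.json")):
        with open(path) as f:
            combined.extend(json.load(f))
    return combined
```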
95+
96+
97+
### Step 6: Convert into Unstructured Elements for Chunking
98+
99+
Convert the combined JSON data into Unstructured Elements and apply chunking by title.
100+
101+
```python
102+
elements = dict_to_elements(combined_json_data)
103+
chunks = chunk_by_title(elements)
104+
105+
```
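To sanity-check the result, you can inspect the first few chunks; this snippet is illustrative only.

```python
# Each chunk is an element (typically a CompositeElement) with a .text attribute.
for chunk in chunks[:3]:
    print(type(chunk).__name__, "-", chunk.text[:120])
```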
## Conclusion

Following these steps allows for efficient processing of multiple files using the Unstructured S3 Connector. The context-aware chunking helps in organizing and analyzing the data effectively.
Lines changed: 5 additions & 0 deletions

@@ -0,0 +1,5 @@
---
title: Examples
description: The following are some examples of how to use the library to parse documents. You can find example documents in the example-docs directory, along with instructions on how to download additional documents that are too large to store in the repo.
sidebarTitle: Overview
---
Lines changed: 61 additions & 0 deletions

@@ -0,0 +1,61 @@
---
title: Delta Table Source Connector
---
<Note>
This sample code utilizes the [Unstructured Open Source](/open-source/introduction/overview "Open Source") library.
</Note>

## Objectives

1. Extract text and metadata from a PDF file using the Unstructured.io Python SDK.

2. Process and store this data in a Databricks Delta Table.

3. Retrieve data from the Delta Table using the Unstructured.io Delta Table Connector.

## Prerequisites

* Unstructured Python SDK

* Databricks account and workspace

* AWS S3 for Delta Table storage
## Processing and Storing into Databricks Delta Table
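The numbering below starts at step 3 because steps 1 and 2, partitioning the PDF with the Unstructured Python SDK, are assumed to have already produced the `res` object used in step 4. A minimal sketch of what those steps might look like; the API key and file name are placeholders, and the exact SDK calls may differ by version.

```python
from unstructured_client import UnstructuredClient
from unstructured_client.models import shared

# Placeholder API key -- substitute your own.
client = UnstructuredClient(api_key_auth="YOUR_UNSTRUCTURED_API_KEY")

# Hypothetical file name for illustration.
with open("example.pdf", "rb") as f:
    req = shared.PartitionParameters(
        files=shared.Files(content=f.read(), file_name="example.pdf"),
        strategy="hi_res",
    )

res = client.general.partition(req)  # res.elements: list of element dicts
```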
3. Initialize PySpark.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('sparkdf').getOrCreate()
```

4. Convert the JSON output into a DataFrame.

```python
# res.elements is the list of element dictionaries returned by partitioning.
dataframe = spark.createDataFrame(res.elements)
```

5. Store the DataFrame as a Delta Table.

```python
dataframe.write.mode("overwrite").format("delta").saveAsTable("delta_table")
```
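To verify the write, or to pull the data back for downstream use, you can read the table with plain Spark; the Unstructured Delta Table source connector mentioned in the objectives is the alternative retrieval path.

```python
# Read the Delta table back and inspect a few rows.
df = spark.table("delta_table")
df.show(5, truncate=False)
```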
## Conclusion

This documentation covers the essential steps for converting unstructured PDF data into structured data and storing it in a Databricks Delta Table. It also outlines how to extract this data for further use.
Lines changed: 74 additions & 0 deletions

@@ -0,0 +1,74 @@
---
title: Vector Database Ingestion
---

<Note> This sample code utilizes the [Unstructured Open Source](/open-source/introduction/overview "Open Source") library. </Note>

In this guide, we demonstrate how to leverage Unstructured.io, ChromaDB, and LangChain to summarize topics from the front page of CNN Lite. Utilizing the modern LLM stack, including Unstructured, Chroma, and LangChain, this workflow comes in at fewer than two dozen lines of code.

## Gather Links with Unstructured

First, we gather links from the CNN Lite homepage using the partition\_html function from Unstructured. When Unstructured partitions HTML pages, links are included in the metadata for each element, making link collection straightforward.

```python
from unstructured.partition.html import partition_html

cnn_lite_url = "https://lite.cnn.com/"
elements = partition_html(url=cnn_lite_url)
links = []

for element in elements:
    if element.metadata.link_urls:
        # Drop the leading "/" and keep only links to dated articles.
        relative_link = element.metadata.link_urls[0][1:]
        if relative_link.startswith("2024"):
            links.append(f"{cnn_lite_url}{relative_link}")
```

## Ingest Individual Articles with UnstructuredURLLoader

With the links in hand, we preprocess individual news articles using UnstructuredURLLoader. This loader fetches content from the web and then uses the unstructured partition function to extract content and metadata. Here we preprocess HTML files, but it also works with other response types such as application/pdf. The result is a list of LangChain Document objects.

```python
from langchain.document_loaders import UnstructuredURLLoader

loaders = UnstructuredURLLoader(urls=links, show_progress_bar=True)
docs = loaders.load()
```

## Load Documents into ChromaDB

The next step is to load the preprocessed documents into ChromaDB. This process involves vectorizing the documents using OpenAI embeddings and loading them into Chroma's vector store. Once in Chroma, similarity search can be performed to retrieve documents related to specific topics.

```python
from langchain.vectorstores.chroma import Chroma
from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(docs, embeddings)
query_docs = vectorstore.similarity_search("Update on the coup in Niger.", k=1)
```

## Summarize the Documents

After retrieving relevant documents from Chroma, we summarize them using LangChain. The load\_summarize\_chain function allows for easy summarization, requiring only the choice of an LLM and a summarization chain type.

```python
from langchain.chat_models import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain

llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo-16k")
# "stuff" passes the retrieved documents to the LLM in a single prompt.
chain = load_summarize_chain(llm, chain_type="stuff")
chain.run(query_docs)
```

## Jupyter Notebook

To delve deeper into this example, you can access the full Jupyter Notebook here: [News of the Day Notebook](https://github.com/Unstructured-IO/unstructured/blob/main/examples/chroma-news-of-the-day/news-of-the-day.ipynb)

examplecode/notebooks.mdx

Lines changed: 20 additions & 6 deletions

@@ -2,35 +2,49 @@
 title: Notebooks
 sidebarTitle: Notebooks
 mode: wide
-description: ""
+description: "Notebooks contain complete working sample code for end-to-end solutions."
 ---
 
 <CardGroup cols={2}>
-  <Card title="Llama 3 Local RAG" icon="square-1" href="https://colab.research.google.com/drive/1ieDJ4LoxARrHFqxXWif8Lv8e8aZTgmtH">
+
+  <Card title="Simple PDF and HTML Parsing" icon="square-1" href="https://colab.research.google.com/drive/1U8VCjY2-x8c6y5TYMbSFtQGlQVFHCVIW#scrollTo=jZp37lfueaeZ">
+    <br/>
+    Quickstart guide for parsing simple PDF and HTML documents with Unstructured.
+    <br/>
+    ```Unstructured```
+    <br/>
+  </Card>
+  <Card title="Llama 3 Local RAG" icon="square-2" href="https://colab.research.google.com/drive/1ieDJ4LoxARrHFqxXWif8Lv8e8aZTgmtH">
     <br/>
     Build a local RAG app for your emails with Unstructured, LangChain and Ollama.
     <br/>
     ```Unstructured``` ```LangChain``` ```Ollama``` ```Llama 3```
   </Card>
-  <Card title="RAG with PDFs, LangChain and Llama 3" icon="square-2" href="https://colab.research.google.com/drive/1BJYYyrPVe0_9EGyXqeNyzmVZDrCRZwsg">
+  <Card title="RAG with PDFs, LangChain and Llama 3" icon="square-3" href="https://colab.research.google.com/drive/1BJYYyrPVe0_9EGyXqeNyzmVZDrCRZwsg">
     <br/>
     A RAG system with the Llama 3 model from Hugging Face.
     <br/>
     ```Unstructured``` ```🤗 Hugging Face``` ```LangChain``` ```Llama 3```
   </Card>
-  <Card title="Building RAG With Powerpoint Presentations" icon="square-3" href="https://colab.research.google.com/drive/1NmLSmUMb9ozlELnWa3J4WwdrBfGomwPk">
+  <Card title="Building RAG With Powerpoint Presentations" icon="square-4" href="https://colab.research.google.com/drive/1NmLSmUMb9ozlELnWa3J4WwdrBfGomwPk">
     <br/>
     A RAG solution that is based on Powerpoint files.
     <br/>
     ```Unstructured``` ```🤗 Hugging Face``` ```LangChain``` ```Llama 3```
   </Card>
-  <Card title="LLM Chatbot With Databricks" icon="square-4" href="https://notebooks.databricks.com/demos/llm-rag-chatbot/index.html#">
+  <Card title="LLM Chatbot With Databricks" icon="square-5" href="https://notebooks.databricks.com/demos/llm-rag-chatbot/index.html#">
     <br/>
     A Chatbot on Databricks with RAG, DBRX Instruct & Vector Search
     <br/>
     ```Unstructured``` ```Databricks``` ```LangChain```
   </Card>
-
+  <Card title="Synthetic Test Dataset Generation" icon="square-6" href="https://colab.research.google.com/drive/1VvOauC46xXeZrhh8nlTyv77yvoroMQjr?usp=sharing">
+    <br/>
+    Build a Synthetic Test Dataset for your RAG system in 5 easy steps.
+    <br/>
+    ```Unstructured``` ```GPT-4o``` ```Ragas``` ```LangChain```
+    <br/>
+  </Card>
 </CardGroup>
examplecode/sampleapps.mdx

Lines changed: 0 additions & 8 deletions
This file was deleted.
