
Commit 7f0e105

Christopher Maddock authored
OSS, Nav, FAQ, Code Samples, Notebooks Update (#38)
Co-authored-by: Christopher Maddock <chris@unstructured.io>
1 parent b7dc886 commit 7f0e105

File tree

17 files changed: +468 −86 lines

open-source/best-practices/table-extraction-from-pdf.mdx renamed to examplecode/codesamples/apioss/table-extraction-from-pdf.mdx

Lines changed: 1 addition & 2 deletions

@@ -4,8 +4,7 @@ description: This section describes two methods for extracting tables from PDF f
 ---
 
 <Note>
-
-To extract tables from any documents, set the `strategy` parameter to `hi_res` for both methods below.
+This sample code utilizes the [Unstructured Open Source](/open-source/introduction/overview "Open Source") library and also provides an alternative method utilizing the [Unstructured SaaS API](api-reference/api-services/overview "SaaS API").
 </Note>
 
 ## Method 1: Using partition\_pdf
Lines changed: 110 additions & 0 deletions
@@ -0,0 +1,110 @@
---
title: Multi-File API Processing
---
<Note>
This sample code utilizes the [Unstructured SaaS API](api-reference/api-services/overview "SaaS API").
</Note>

## Introduction

This guide demonstrates how to process multiple files using the Unstructured API and S3 Connector and implement context-aware chunking. The process involves installing dependencies, configuring settings, and utilizing Python scripts to manage and chunk data effectively.

## Prerequisites

Ensure you have an Unstructured API key and access to an S3 bucket containing the target files.

## Step-by-Step Process

### Step 1: Install Unstructured and S3 Dependency

Install the unstructured package with S3 support.

```bash
pip install "unstructured[s3]"
```

### Step 2: Import Libraries

Import the necessary libraries from the unstructured package for chunking and S3 processing.

```python
import os  # used below to read the API key from the environment

from unstructured.ingest.interfaces import (
    FsspecConfig,
    PartitionConfig,
    ProcessorConfig,
    ReadConfig,
)
from unstructured.ingest.runner import S3Runner

from unstructured.chunking.title import chunk_by_title
from unstructured.staging.base import dict_to_elements
```

### Step 3: Configuration

Set up the API key and S3 URL for accessing the data.

```python
UNSTRUCTURED_API_KEY = os.getenv('UNSTRUCTURED_API_KEY')
S3_URL = "s3://rh-financial-reports/world-development-bank-2023/"
```
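If the `UNSTRUCTURED_API_KEY` variable is not set in your shell, `os.getenv` returns `None`. For quick experiments you can set it in-process instead; the value below is a placeholder, not a real key.

```python
# Placeholder value for illustration only -- substitute your actual key.
os.environ["UNSTRUCTURED_API_KEY"] = "your-api-key-here"
```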
### Step 4: Python Runner

Configure and run the S3Runner for processing the data.

```python
runner = S3Runner(
    processor_config=ProcessorConfig(
        verbose=True,
        output_dir="Connector-Output",
        num_processes=8,
    ),
    read_config=ReadConfig(),
    partition_config=PartitionConfig(
        partition_endpoint="https://api.unstructured.io/general/v0/general",
        partition_by_api=True,
        api_key=UNSTRUCTURED_API_KEY,
        strategy="hi_res",
        hi_res_model_name="yolox",
    ),
    fsspec_config=FsspecConfig(
        remote_url=S3_URL,
    ),
)

runner.run(anonymous=True)
```

### Step 5: Combine JSON Files from Multi-File Ingestion

Combine the per-file JSON outputs into a single dataset for further processing, using a small helper function (a sketch of this helper follows below).

```python
combined_json_data = read_and_combine_json("Connector-Output/world-development-bank-2023")
```
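Note that `read_and_combine_json` is not part of the unstructured package; it is a helper you define yourself. A minimal sketch, assuming each ingest output file contains a JSON list of element dictionaries:

```python
import json
from pathlib import Path

def read_and_combine_json(directory):
    """Combine every JSON output file in the ingest output directory
    into a single list of element dictionaries."""
    combined = []
    for path in sorted(Path(directory).glob("*.json")):
        with open(path) as f:
            combined.extend(json.load(f))
    return combined
```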
95+
96+
97+
### Step 6: Convert into Unstructured Elements for Chunking
98+
99+
Convert the combined JSON data into Unstructured Elements and apply chunking by title.
100+
101+
```python
102+
elements = dict_to_elements(combined_json_data)
103+
chunks = chunk_by_title(elements)
104+
105+
```
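To sanity-check the result, you can inspect the first few chunks; this snippet is illustrative only.

```python
# Each chunk is an element (typically a CompositeElement) with a .text attribute.
for chunk in chunks[:3]:
    print(type(chunk).__name__, "-", chunk.text[:120])
```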
## Conclusion

Following these steps allows for efficient processing of multiple files using the Unstructured S3 Connector. The context-aware chunking helps in organizing and analyzing the data effectively.
Lines changed: 5 additions & 0 deletions

@@ -0,0 +1,5 @@
---
title: Examples
description: The following are some examples of how to use the library to parse documents. You can find example documents in the example-docs directory, along with instructions on how to download additional documents that are too large to store in the repo.
sidebarTitle: Overview
---
Lines changed: 61 additions & 0 deletions

@@ -0,0 +1,61 @@
---
title: Delta Table Source Connector
---
<Note>
This sample code utilizes the [Unstructured Open Source](/open-source/introduction/overview "Open Source") library.
</Note>

## Objectives

1. Extract text and metadata from a PDF file using the Unstructured.io Python SDK.

2. Process and store this data in a Databricks Delta Table.

3. Retrieve data from the Delta Table using the Unstructured.io Delta Table Connector.

## Prerequisites

* Unstructured Python SDK

* Databricks account and workspace

* AWS S3 for Delta Table storage
## Processing and Storing into Databricks Delta Table
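The numbering below starts at step 3 because steps 1 and 2, partitioning the PDF with the Unstructured Python SDK, are assumed to have already produced the `res` object used in step 4. A minimal sketch of what those steps might look like; the API key and file name are placeholders, and the exact SDK calls may differ by version.

```python
from unstructured_client import UnstructuredClient
from unstructured_client.models import shared

# Placeholder API key -- substitute your own.
client = UnstructuredClient(api_key_auth="YOUR_UNSTRUCTURED_API_KEY")

# Hypothetical file name for illustration.
with open("example.pdf", "rb") as f:
    req = shared.PartitionParameters(
        files=shared.Files(content=f.read(), file_name="example.pdf"),
        strategy="hi_res",
    )

res = client.general.partition(req)  # res.elements: list of element dicts
```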
3. Initialize PySpark.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('sparkdf').getOrCreate()
```

4. Convert the JSON output into a DataFrame.

```python
# res.elements is the list of element dictionaries returned by partitioning.
dataframe = spark.createDataFrame(res.elements)
```

5. Store the DataFrame as a Delta Table.

```python
dataframe.write.mode("overwrite").format("delta").saveAsTable("delta_table")
```
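To verify the write, or to pull the data back for downstream use, you can read the table with plain Spark; the Unstructured Delta Table source connector mentioned in the objectives is the alternative retrieval path.

```python
# Read the Delta table back and inspect a few rows.
df = spark.table("delta_table")
df.show(5, truncate=False)
```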
## Conclusion

This documentation covers the essential steps for converting unstructured PDF data into structured data and storing it in a Databricks Delta Table. It also outlines how to extract this data for further use.
Lines changed: 74 additions & 0 deletions

@@ -0,0 +1,74 @@
---
title: Vector Database Ingestion
---

<Note> This sample code utilizes the [Unstructured Open Source](/open-source/introduction/overview "Open Source") library. </Note>

In this guide, we demonstrate how to leverage Unstructured.io, ChromaDB, and LangChain to summarize topics from the front page of CNN Lite. Utilizing the modern LLM stack, including Unstructured, Chroma, and LangChain, this workflow comes in at fewer than two dozen lines of code.

## Gather Links with Unstructured

First, we gather links from the CNN Lite homepage using the partition\_html function from Unstructured. When Unstructured partitions HTML pages, links are included in the metadata for each element, making link collection straightforward.

```python
from unstructured.partition.html import partition_html

cnn_lite_url = "https://lite.cnn.com/"
elements = partition_html(url=cnn_lite_url)
links = []

for element in elements:
    if element.metadata.link_urls:
        # Drop the leading "/" and keep only links to dated articles.
        relative_link = element.metadata.link_urls[0][1:]
        if relative_link.startswith("2024"):
            links.append(f"{cnn_lite_url}{relative_link}")
```

## Ingest Individual Articles with UnstructuredURLLoader

With the links in hand, we preprocess individual news articles using UnstructuredURLLoader. This loader fetches content from the web and then uses the unstructured partition function to extract content and metadata. Here we preprocess HTML files, but it also works with other response types such as application/pdf. The result is a list of LangChain Document objects.

```python
from langchain.document_loaders import UnstructuredURLLoader

loaders = UnstructuredURLLoader(urls=links, show_progress_bar=True)
docs = loaders.load()
```

## Load Documents into ChromaDB

The next step is to load the preprocessed documents into ChromaDB. This process involves vectorizing the documents using OpenAI embeddings and loading them into Chroma's vector store. Once in Chroma, similarity search can be performed to retrieve documents related to specific topics.

```python
from langchain.vectorstores.chroma import Chroma
from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(docs, embeddings)
query_docs = vectorstore.similarity_search("Update on the coup in Niger.", k=1)
```

## Summarize the Documents

After retrieving relevant documents from Chroma, we summarize them using LangChain. The load\_summarize\_chain function allows for easy summarization, requiring only the choice of an LLM and a summarization chain type.

```python
from langchain.chat_models import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain

llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo-16k")
# "stuff" passes the retrieved documents to the LLM in a single prompt.
chain = load_summarize_chain(llm, chain_type="stuff")
chain.run(query_docs)
```

## Jupyter Notebook

To delve deeper into this example, you can access the full Jupyter Notebook here: [News of the Day Notebook](https://github.com/Unstructured-IO/unstructured/blob/main/examples/chroma-news-of-the-day/news-of-the-day.ipynb)

examplecode/notebooks.mdx

Lines changed: 20 additions & 6 deletions

@@ -2,35 +2,49 @@
 title: Notebooks
 sidebarTitle: Notebooks
 mode: wide
-description: ""
+description: "Notebooks contain complete working sample code for end-to-end solutions."
 ---
 
 <CardGroup cols={2}>
-  <Card title="Llama 3 Local RAG" icon="square-1" href="https://colab.research.google.com/drive/1ieDJ4LoxARrHFqxXWif8Lv8e8aZTgmtH">
+
+  <Card title="Simple PDF and HTML Parsing" icon="square-1" href="https://colab.research.google.com/drive/1U8VCjY2-x8c6y5TYMbSFtQGlQVFHCVIW#scrollTo=jZp37lfueaeZ">
+    <br/>
+    Quickstart guide for parsing simple PDF and HTML documents with Unstructured.
+    <br/>
+    ```Unstructured```
+    <br/>
+  </Card>
+  <Card title="Llama 3 Local RAG" icon="square-2" href="https://colab.research.google.com/drive/1ieDJ4LoxARrHFqxXWif8Lv8e8aZTgmtH">
     <br/>
     Build a local RAG app for your emails with Unstructured, LangChain and Ollama.
     <br/>
     ```Unstructured``` ```LangChain``` ```Ollama``` ```Llama 3```
   </Card>
-  <Card title="RAG with PDFs, LangChain and Llama 3" icon="square-2" href="https://colab.research.google.com/drive/1BJYYyrPVe0_9EGyXqeNyzmVZDrCRZwsg">
+  <Card title="RAG with PDFs, LangChain and Llama 3" icon="square-3" href="https://colab.research.google.com/drive/1BJYYyrPVe0_9EGyXqeNyzmVZDrCRZwsg">
     <br/>
     A RAG system with the Llama 3 model from Hugging Face.
     <br/>
     ```Unstructured``` ```🤗 Hugging Face``` ```LangChain``` ```Llama 3```
   </Card>
-  <Card title="Building RAG With Powerpoint Presentations" icon="square-3" href="https://colab.research.google.com/drive/1NmLSmUMb9ozlELnWa3J4WwdrBfGomwPk">
+  <Card title="Building RAG With Powerpoint Presentations" icon="square-4" href="https://colab.research.google.com/drive/1NmLSmUMb9ozlELnWa3J4WwdrBfGomwPk">
     <br/>
     A RAG solution that is based on Powerpoint files.
     <br/>
     ```Unstructured``` ```🤗 Hugging Face``` ```LangChain``` ```Llama 3```
   </Card>
-  <Card title="LLM Chatbot With Databricks" icon="square-4" href="https://notebooks.databricks.com/demos/llm-rag-chatbot/index.html#">
+  <Card title="LLM Chatbot With Databricks" icon="square-5" href="https://notebooks.databricks.com/demos/llm-rag-chatbot/index.html#">
     <br/>
     A Chatbot on Databricks with RAG, DBRX Instruct & Vector Search
     <br/>
     ```Unstructured``` ```Databricks``` ```LangChain```
   </Card>
-
+  <Card title="Synthetic Test Dataset Generation" icon="square-6" href="https://colab.research.google.com/drive/1VvOauC46xXeZrhh8nlTyv77yvoroMQjr?usp=sharing">
+    <br/>
+    Build a Synthetic Test Dataset for your RAG system in 5 easy steps.
+    <br/>
+    ```Unstructured``` ```GPT-4o``` ```Ragas``` ```LangChain```
+    <br/>
+  </Card>
 </CardGroup>
examplecode/sampleapps.mdx

Lines changed: 0 additions & 8 deletions
This file was deleted.
