From 0652d4ee52e0978c575b54db917076c2242df50d Mon Sep 17 00:00:00 2001 From: "David S. Batista" Date: Fri, 11 Oct 2024 15:48:01 +0200 Subject: [PATCH 1/7] initial import --- tutorials/36_Sentence_Window_Retriever.ipynb | 510 +++++++++++++++++++ 1 file changed, 510 insertions(+) create mode 100644 tutorials/36_Sentence_Window_Retriever.ipynb diff --git a/tutorials/36_Sentence_Window_Retriever.ipynb b/tutorials/36_Sentence_Window_Retriever.ipynb new file mode 100644 index 00000000..0792eb7b --- /dev/null +++ b/tutorials/36_Sentence_Window_Retriever.ipynb @@ -0,0 +1,510 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "b79cad40-2c9c-4598-8195-0d6cf525ff87", + "metadata": {}, + "source": [ + "## Introduction\n", + "\n", + "The Sentence-Window retrieval technique is a simple and effective way to retrieve more context for a user query that matched a document. It is based on the idea that the most relevant sentences are likely to be close to each other in the document. The technique selects a window of sentences around the sentence that matches the user query and returns the entire window instead of only the matching sentence. This technique can be particularly useful when the user query is a question or a phrase that requires more context to be understood." + ] + }, + { + "cell_type": "markdown", + "id": "53dee123-cac6-4451-a2f2-87248d218a7f", + "metadata": {}, + "source": [ + "## Haystack Component\n", + "\n", + "The `SentenceWindowRetriever` is the Haystack component that can be used in a Pipeline to implement the Sentence-Window retrieval technique." + ] + }, + { + "cell_type": "markdown", + "id": "5d9b88f6-f8d2-4450-a00f-105962f3f188", + "metadata": {}, + "source": [ + "`SentenceWindowRetriever(document_store=doc_store, window_size=2)`" + ] + }, + { + "cell_type": "markdown", + "id": "f60548f2-e188-4948-9b53-1d478f6e3e3a", + "metadata": {}, + "source": [ + "The component takes a document_store and a window_size as input. The document_store contains the documents we want to query, and the window_size determines the number of sentences to return around the matching sentence, so the number of sentences returned will be `2 * window_size + 1`. For example, with `window_size=2`, the matching sentence is returned together with the two sentences before it and the two sentences after it, five sentences in total. Although we use the term \"sentence\" because it is inherently attached to this technique, the `SentenceWindowRetriever` actually works with any split unit supported by the `DocumentSplitter` class, for instance: `word`, `sentence`, `page`." + ] + }, + { + "cell_type": "markdown", + "id": "ee689359-94bf-45b6-b69b-f17266382ff8", + "metadata": {}, + "source": [ + "## Introductory Example\n", + "\n", + "Let's see a simple example of how to use the `SentenceWindowRetriever` in isolation; later we will see how to use it within a pipeline. We start by creating a document and splitting it into sentences using the `DocumentSplitter` class."
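As a quick illustration of the two points above, the `2 * window_size + 1` rule and the fact that the retriever works with any `DocumentSplitter` split unit, the short sketch below splits a toy text by word instead of by sentence. It is only a sketch: the toy text and variable names are made up, and it assumes `haystack-ai` is installed (the install cell follows).

```python
from haystack import Document
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.retrievers import SentenceWindowRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Split a toy text into one-word chunks instead of sentences.
splitter = DocumentSplitter(split_by="word", split_length=1, split_overlap=0)
chunks = splitter.run([Document(content="one two three four five six seven")])

doc_store = InMemoryDocumentStore()
doc_store.write_documents(chunks["documents"])

# window_size=1 returns 2 * 1 + 1 = 3 chunks: the match plus one neighbour on each side.
retriever = SentenceWindowRetriever(document_store=doc_store, window_size=1)
result = retriever.run(retrieved_documents=[chunks["documents"][3]])
print(result["context_windows"])  # a three-word window around "four"
```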
+ ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "edf1b350-43da-4dcb-a1ef-68f3b7ad3ca7", + "metadata": {}, + "outputs": [], + "source": [ + "!pip install haystack-ai" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "aae36c73-f04d-4a99-a63a-0a5ebfa0242a", + "metadata": {}, + "outputs": [], + "source": [ + "from haystack.components.retrievers import SentenceWindowRetriever" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "fb04d1f9-5329-499c-9479-7e3b4b4fa126", + "metadata": {}, + "outputs": [], + "source": [ + "from haystack import Document\n", + "from haystack.components.preprocessors import DocumentSplitter\n", + "\n", + "splitter = DocumentSplitter(split_length=1, split_overlap=0, split_by=\"sentence\")\n", + "text = (\"Paul fell asleep to dream of an Arrakeen cavern, silent people all around him moving in the dim light \"\n", + " \"of glowglobes. It was solemn there and like a cathedral as he listened to a faint sound—the \"\n", + " \"drip-drip-drip of water. Even while he remained in the dream, Paul knew he would remember it upon \"\n", + " \"awakening. He always remembered the dreams that were predictions. The dream faded. Paul awoke to feel \"\n", + " \"himself in the warmth of his bed—thinking thinking. This world of Castle Caladan, without play or \"\n", + " \"companions his own age, perhaps did not deserve sadness in farewell. Dr Yueh, his teacher, had \"\n", + " \"hinted that the faufreluches class system was not rigidly guarded on Arrakis. The planet sheltered \"\n", + " \"people who lived at the desert edge without caid or bashar to command them: will-o’-the-sand people \"\n", + " \"called Fremen, marked down on no census of the Imperial Regate.\")\n", + "\n", + "doc = Document(content=text)\n", + "docs = splitter.run([doc])" + ] + }, + { + "cell_type": "markdown", + "id": "61d94c11-4c32-4022-979d-8316f9069ac8", + "metadata": {}, + "source": [ + "This will result in 9 sentences, each represented as a Haystack Document object. We can then write these documents to a DocumentStore and use the SentenceWindowRetriever to retrieve a window of sentences around a matching sentence." + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "24a7fd39-19df-486e-9914-166dc3e77cc4", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "9" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from haystack.document_stores.in_memory import InMemoryDocumentStore\n", + "from haystack.document_stores.types import DuplicatePolicy\n", + "\n", + "doc_store = InMemoryDocumentStore()\n", + "doc_store.write_documents(docs['documents'], policy=DuplicatePolicy.OVERWRITE)" + ] + }, + { + "cell_type": "markdown", + "id": "a1253dd5-e71e-4a26-814d-e8d256750aff", + "metadata": {}, + "source": [ + "Now we use the `SentenceWindowRetriever` to retrieve a window of sentences around a certain sentence. Note that, at run time, the `SentenceWindowRetriever` receives as input a `Document` present in the document store, and it relies on the document's metadata to retrieve the window of sentences around the matching sentence.
So, one important aspect to notice is that the `SentenceWindowRetriever` needs to be used in conjunction with another `Retriever` that handles the initial user query, such as the `InMemoryBM25Retriever`, and returns the matching documents.\n", + "\n", + "Let's pass the Document containing the sentence `The dream faded.` to the `SentenceWindowRetriever` and retrieve a window of 2 sentences around it. Note that we need to wrap it in a list as the `run` method expects a list of documents." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "c028b53b-e68c-4b02-bd83-f9096aa54079", + "metadata": {}, + "outputs": [], + "source": [ + "from haystack.components.retrievers import SentenceWindowRetriever\n", + "\n", + "retriever = SentenceWindowRetriever(document_store=doc_store, window_size=2)\n", + "result = retriever.run(retrieved_documents=[docs['documents'][4]])" + ] + }, + { + "cell_type": "markdown", + "id": "0992d153-6770-4519-a930-6b4c85115611", + "metadata": {}, + "source": [ + "The result is a dictionary with two keys:\n", + "\n", + "- `context_windows`: a list of strings containing the context windows around the matching sentence.\n", + "- `context_documents`: a list of lists of `Document` objects containing the context windows around the matching sentence." + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "050d1751-5742-44b9-a522-436088458653", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "dict_keys(['context_windows', 'context_documents'])" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "result.keys()" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "b92920a5-5937-4f6b-87fb-a68db4c79401", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[' Even while he remained in the dream, Paul knew he would remember it upon awakening. He always remembered the dreams that were predictions. The dream faded. Paul awoke to feel himself in the warmth of his bed—thinking thinking. 
This world of Castle Caladan, without play or companions his own age, perhaps did not deserve sadness in farewell.']" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "result['context_windows']" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "faded9fe-725a-4b50-8855-7356ec0749e7", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[[Document(id=5d093b6ec1a4bdc7e75f033ae0b570e237053213a09b42a56ad815b4d118943d, content: ' Even while he remained in the dream, Paul knew he would remember it upon awakening.', meta: {'source_id': 'b56504f244b7b650096b14d678bc82f3d7fe240bb135361c6a23a14c4b809596', 'page_number': 1, 'split_id': 2, 'split_idx_start': 219}),\n", + " Document(id=4ed71ff61df531053cc7d5f80e8a0bd1e702f3a396f3f3983ceeffe89878a684, content: ' He always remembered the dreams that were predictions.', meta: {'source_id': 'b56504f244b7b650096b14d678bc82f3d7fe240bb135361c6a23a14c4b809596', 'page_number': 1, 'split_id': 3, 'split_idx_start': 303}),\n", + " Document(id=f485258001abdf2deab98249c7f0826b4f6b1bef7c37763d14318e7b595f434f, content: ' The dream faded.', meta: {'source_id': 'b56504f244b7b650096b14d678bc82f3d7fe240bb135361c6a23a14c4b809596', 'page_number': 1, 'split_id': 4, 'split_idx_start': 358}),\n", + " Document(id=f39c29c3a3122affc5909dc7b98f5880d9bd984731380420134c440da6fee363, content: ' Paul awoke to feel himself in the warmth of his bed—thinking thinking.', meta: {'source_id': 'b56504f244b7b650096b14d678bc82f3d7fe240bb135361c6a23a14c4b809596', 'page_number': 1, 'split_id': 5, 'split_idx_start': 375}),\n", + " Document(id=15401623a2a4fed533db7c1bbe8df157f79a9395cf8d3d6e92dc5ae553d0dded, content: ' This world of Castle Caladan, without play or companions his own age, perhaps did not deserve sadn...', meta: {'source_id': 'b56504f244b7b650096b14d678bc82f3d7fe240bb135361c6a23a14c4b809596', 'page_number': 1, 'split_id': 6, 'split_idx_start': 446})]]" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "result['context_documents']" + ] + }, + { + "cell_type": "markdown", + "id": "289e207c-7f8f-45da-bdfd-61ed4955942d", + "metadata": {}, + "source": [ + "## Advanced Example" + ] + }, + { + "cell_type": "markdown", + "id": "6bc10c96-e453-4e43-9b64-d05fae6de040", + "metadata": {}, + "source": [ + "We will use the BBC news dataset to show how the `SentenceWindowRetriever` works with a dataset containing multiple news articles.\n", + "\n", + "### Reading the dataset\n", + "\n", + "The original dataset is available at http://mlg.ucd.ie/datasets/bbc.html, but it was already preprocessed and stored in\n", + "a single CSV file available here: https://raw.githubusercontent.com/amankharwal/Website-data/master/bbc-news-data.csv" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "82565cae-a730-4cd7-85f3-40be0e77b94d", + "metadata": {}, + "outputs": [], + "source": [ + "from typing import List\n", + "import csv\n", + "from haystack import Document\n", + "\n", + "def read_documents(file: str) -> List[Document]:\n", + " with open(file, \"r\") as file:\n", + " reader = csv.reader(file, delimiter=\"\\t\")\n", + " next(reader, None) # skip the headers\n", + " documents = []\n", + " for row in reader:\n", + " category = row[0].strip()\n", + " title = row[2].strip()\n", + " text = row[3].strip()\n", + " documents.append(Document(content=text, meta={\"category\": category, \"title\": title}))\n", + "\n", + " return documents" + ] 
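Based on the column indices used above, the file is assumed to be tab-separated with four columns per row (category, filename, title, content), and the filename column (`row[1]`) is skipped. The snippet below is only an illustrative sanity check of the parsed documents; it assumes the CSV has already been downloaded, which the next cell does with `wget`.

```python
# Illustrative sanity check of the parsed documents (assumes the file was already downloaded).
docs = read_documents("bbc-news-data.csv")

print(len(docs))              # number of articles parsed
print(docs[0].meta)           # expected keys: 'category' and 'title'
print(docs[0].content[:80])   # first characters of the article body
```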
+ }, + { + "cell_type": "code", + "execution_count": 18, + "id": "4f581e8b-0693-4b09-b82e-71e78cb83f1a", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "--2024-10-11 15:46:13-- https://raw.githubusercontent.com/amankharwal/Website-data/master/bbc-news-data.csv\n", + "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.111.133, ...\n", + "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.\n", + "HTTP request sent, awaiting response... 200 OK\n", + "Length: 5080260 (4.8M) [text/plain]\n", + "Saving to: ‘bbc-news-data.csv.2’\n", + "\n", + "bbc-news-data.csv.2 100%[===================>] 4.84M 8.53MB/s in 0.6s \n", + "\n", + "2024-10-11 15:46:14 (8.53 MB/s) - ‘bbc-news-data.csv.2’ saved [5080260/5080260]\n", + "\n" + ] + } + ], + "source": [ + "!wget https://raw.githubusercontent.com/amankharwal/Website-data/master/bbc-news-data.csv" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "1ab23051-7df1-49e6-a009-ba187855aab3", + "metadata": {}, + "outputs": [], + "source": [ + "docs = read_documents(\"bbc-news-data.csv\")" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "e3adbfda-86ba-44fc-a28c-681eb1b23351", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "2225" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "len(docs)" + ] + }, + { + "cell_type": "markdown", + "id": "4a003472-19c1-4bc0-b6df-995bc66e8904", + "metadata": {}, + "source": [ + "### Indexing the documents\n", + "\n", + "We will now apply the `DocumentSplitter` to split the documents into sentences and write them to an `InMemoryDocumentStore`." + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "eb3203f3-2f75-4a60-9d2a-f530a09113a0", + "metadata": {}, + "outputs": [], + "source": [ + "from haystack import Document\n", + "from haystack.components.preprocessors import DocumentSplitter\n", + "from haystack.document_stores.in_memory import InMemoryDocumentStore\n", + "from haystack.document_stores.types import DuplicatePolicy\n", + "\n", + "def index_documents(documents: List[Document]):\n", + " splitter = DocumentSplitter(split_length=1, split_overlap=0, split_by=\"sentence\")\n", + " docs = splitter.run(documents)\n", + " doc_store = InMemoryDocumentStore()\n", + " doc_store.write_documents(docs[\"documents\"], policy=DuplicatePolicy.OVERWRITE)\n", + "\n", + " return doc_store" + ] + }, + { + "cell_type": "markdown", + "id": "a0b6030d-ace7-471e-ae5f-b7dfc0ec1064", + "metadata": {}, + "source": [ + "### Querying the documents\n", + "\n", + "Let's now build a pipeline to query the documents using the `InMemoryBM25Retriever` and the `SentenceWindowRetriever`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 22, + "id": "7048bc9c-8c6a-4df0-92c4-20b6162cfdb4", + "metadata": {}, + "outputs": [], + "source": [ + "from haystack import Pipeline\n", + "from haystack.components.retrievers.in_memory import InMemoryBM25Retriever\n", + "from haystack.components.retrievers import SentenceWindowRetriever\n", + "\n", + "def querying_pipeline(doc_store: InMemoryDocumentStore, window_size: int = 2):\n", + " pipeline = Pipeline()\n", + " bm25_retriever = InMemoryBM25Retriever(document_store=doc_store)\n", + " sentence_window_retriever = SentenceWindowRetriever(doc_store, window_size=window_size)\n", + " pipeline.add_component(instance=bm25_retriever, name=\"BM25Retriever\")\n", + " pipeline.add_component(instance=sentence_window_retriever, name=\"SentenceWindowRetriever\")\n", + " pipeline.connect(\"BM25Retriever.documents\", \"SentenceWindowRetriever.retrieved_documents\")\n", + "\n", + " return pipeline" + ] + }, + { + "cell_type": "markdown", + "id": "e67abab4-10c1-4d95-b8fe-1ff24bb93161", + "metadata": {}, + "source": [ + "### Putting it all together\n", + "\n", + "We now read the raw documents, index them, build the querying pipeline, and query the document store for \"phishing attacks\", returning only the single highest-scoring document. We also include the outputs from the BM25Retriever\n", + "so that we can compare the results with and without the SentenceWindowRetriever." + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "id": "fdadd81e-3a4c-474d-9a92-85fa2f6a1264", + "metadata": {}, + "outputs": [], + "source": [ + "docs = read_documents(\"bbc-news-data.csv\")\n", + "doc_store = index_documents(docs)\n", + "pipeline = querying_pipeline(doc_store, window_size=2)\n", + "result = pipeline.run(data={'BM25Retriever': {'query': \"phishing attacks\", \"top_k\": 1}}, include_outputs_from={'BM25Retriever'})" + ] + }, + { + "cell_type": "markdown", + "id": "bbda91dc-238e-44a8-a241-9c35115efe88", + "metadata": {}, + "source": [ + "Let's now inspect the results from the BM25Retriever and the SentenceWindowRetriever. Since we split the documents by sentence, the BM25Retriever returns only the single sentence that matched the query." + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "id": "ff53bf0b-ec2f-49aa-a5ff-82e0686ac81d", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "' The Anti-Phishing Working group reported that the number of phishing attacks against new targets was growing at a rate of 30% or more per month.'" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "result['BM25Retriever']['documents'][0].content" + ] + }, + { + "cell_type": "markdown", + "id": "aca630e2-f0a9-4a07-bdcb-5f7e10415802", + "metadata": {}, + "source": [ + "The SentenceWindowRetriever, on the other hand, returns a window of sentences around the matching sentence, giving us more context to understand the sentence." + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "id": "b3453305-b536-460f-be3f-8fc6d7169673", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'\" In particular, phishing attacks, which typically use fake versions of bank websites to grab login details of customers, boomed during 2004. Web portal Lycos Europe reported a 500% increase in the number of phishing e-mail messages it was catching.
The Anti-Phishing Working group reported that the number of phishing attacks against new targets was growing at a rate of 30% or more per month. Those who fall victim to these attacks can find that their bank account has been cleaned out or that their good name has been ruined by someone stealing their identity. This change in the ranks of virus writers could mean the end of the mass-mailing virus which attempts to spread by tricking people into opening infected attachments on e-mail messages.'" + ] + }, + "execution_count": 25, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "result['SentenceWindowRetriever']['context_windows'][0]" + ] + }, + { + "cell_type": "markdown", + "id": "06614ff0-7b82-46dc-8416-a78633704583", + "metadata": {}, + "source": [ + "## Conclusion\n", + "\n", + "We saw how the `SentenceWindowRetriever` works and how it can be used to retrieve a window of sentences around a matching document, giving us more context to understand the document. One important aspect to notice is that the `SentenceWindowRetriever` doesn't handle queries directly but relies on the output of another `Retriever` that handles the initial user query. This allows the `SentenceWindowRetriever` to be used in conjunction with any other retriever in the pipeline, such as the `InMemoryBM25Retriever`." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.7" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} From d73049513f38d8a45e134ce73f74c41ae8aadbfa Mon Sep 17 00:00:00 2001 From: "David S.
Batista" Date: Fri, 11 Oct 2024 15:58:06 +0200 Subject: [PATCH 2/7] updating README.MD --- README.md | 56 +++++++++++++++++++++++++++---------------------------- 1 file changed, 28 insertions(+), 28 deletions(-) diff --git a/README.md b/README.md index 41477d1b..1dbc12a3 100644 --- a/README.md +++ b/README.md @@ -34,31 +34,31 @@ Haystack 2.0 -| Code | Colab | Code | Colab | -| :-------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :--------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | -| [Build Your First Question Answering System](./tutorials/01_Basic_QA_Pipeline.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/01_Basic_QA_Pipeline.ipynb) | [Your First QA Pipeline with Retrieval-Augmentation](./tutorials/27_First_RAG_Pipeline.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/27_First_RAG_Pipeline.ipynb) | -| [Fine Tune a Model on Your Data](./tutorials/02_Finetune_a_model_on_your_data.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/02_Finetune_a_model_on_your_data.ipynb) | [Generating Structured Output with Loop-Based Auto-Correction](./tutorials/28_Structured_Output_With_Loop.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/28_Structured_Output_With_Loop.ipynb) | -| [Build a Scalable Question Answering System](./tutorials/03_Scalable_QA_System.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/03_Scalable_QA_System.ipynb) | [Serializing Pipelines](./tutorials/29_Serializing_Pipelines.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/29_Serializing_Pipelines.ipynb) | -| [FAQ Style QA](./tutorials/04_FAQ_style_QA.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/04_FAQ_style_QA.ipynb) | [Preprocessing Different File Types](./tutorials/30_File_Type_Preprocessing_Index_Pipeline.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/30_File_Type_Preprocessing_Index_Pipeline.ipynb) | -| [Evaluation](./tutorials/05_Evaluation.ipynb) | [![Open In 
Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/05_Evaluation.ipynb) | [Metadata Filtering](./tutorials/31_Metadata_Filtering.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/31_Metadata_Filtering.ipynb) | -| [Better Retrieval via Embedding Retrieval](./tutorials/06_Better_Retrieval_via_Embedding_Retrieval.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/06_Better_Retrieval_via_Embedding_Retrieval.ipynb) | [Classifying Documents & Queries by Language](./tutorials/32_Classifying_Documents_and_Queries_by_Language.ipynb)| [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/32_Classifying_Documents_and_Queries_by_Language.ipynb)| -| [[OUTDATED] RAG Generator](./tutorials/07_RAG_Generator.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/07_RAG_Generator.ipynb) | [Build an Extractive QA Pipeline](./tutorials/34_Extractive_QA_Pipeline.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/34_Extractive_QA_Pipeline.ipynb) | -| [Preprocessing](./tutorials/08_Preprocessing.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/08_Preprocessing.ipynb) | [Model-Based Evaluation of RAG Pipelines](./tutorials/35_Model_Based_Evaluation_of_RAG_Pipelines.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/35_Model_Based_Evaluation_of_RAG_Pipelines.ipynb)| -| [DPR Training](./tutorials/09_DPR_training.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/09_DPR_training.ipynb) | | | -| [[OUTDATED] Knowledge Graph](./tutorials/10_Knowledge_Graph.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/10_Knowledge_Graph.ipynb) | | | -| [Pipelines](./tutorials/11_Pipelines.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/11_Pipelines.ipynb) | | | -| [[OUTDATED] Seq2SeqGenerator](./tutorials/12_LFQA.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/12_LFQA.ipynb) | | | -| [Question Generation](./tutorials/13_Question_generation.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/13_Question_generation.ipynb) | | | -| [Query 
Classifier](./tutorials/14_Query_Classifier.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/14_Query_Classifier.ipynb) | | | -| [Table QA](./tutorials/15_TableQA.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/15_TableQA.ipynb) | | | -| [Document Classifier at Index Time](./tutorials/16_Document_Classifier_at_Index_Time.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/16_Document_Classifier_at_Index_Time.ipynb) | | | -| [Make Your QA Pipelines Talk!](./tutorials/17_Audio.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/17_Audio.ipynb) | | | -| [Generative Pseudo Labeling](./tutorials/18_GPL.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/18_GPL.ipynb) | | | -| [Text-to-Image search](./tutorials/19_Text_to_Image_search_pipeline_with_MultiModal_Retriever.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/19_Text_to_Image_search_pipeline_with_MultiModal_Retriever.ipynb) | | | -| [Using Haystack with REST API](./tutorials/20_Using_Haystack_with_REST_API.ipynb) | Download | | | -| [Customizing PromptNode](./tutorials/21_Customizing_PromptNode.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/21_Customizing_PromptNode.ipynb) | | | -| [Generative QA Pipeline with Retrieval-Augmentation](./tutorials/22_Pipeline_with_PromptNode.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/22_Pipeline_with_PromptNode.ipynb) | | | -| [Answering Complex Questions with Agents](./tutorials/23_Answering_Multihop_Questions_with_Agents.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/23_Answering_Multihop_Questions_with_Agents.ipynb) | | | -| [Building a Conversational Chat App](./tutorials/24_Building_Chat_App.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/24_Building_Chat_App.ipynb) | | | -| [Customizing Agent to Chat with Your Documents](./tutorials/25_Customizing_Agent.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/25_Customizing_Agent.ipynb) | | | -| [Creating a Hybrid Retrieval Pipeline](./tutorials/26_Hybrid_Retrieval.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/26_Hybrid_Retrieval.ipynb) | | | +| 
Code | Colab | Code | Colab | +| :-------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |:------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| [Build Your First Question Answering System](./tutorials/01_Basic_QA_Pipeline.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/01_Basic_QA_Pipeline.ipynb) | [Your First QA Pipeline with Retrieval-Augmentation](./tutorials/27_First_RAG_Pipeline.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/27_First_RAG_Pipeline.ipynb) | +| [Fine Tune a Model on Your Data](./tutorials/02_Finetune_a_model_on_your_data.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/02_Finetune_a_model_on_your_data.ipynb) | [Generating Structured Output with Loop-Based Auto-Correction](./tutorials/28_Structured_Output_With_Loop.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/28_Structured_Output_With_Loop.ipynb) | +| [Build a Scalable Question Answering System](./tutorials/03_Scalable_QA_System.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/03_Scalable_QA_System.ipynb) | [Serializing Pipelines](./tutorials/29_Serializing_Pipelines.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/29_Serializing_Pipelines.ipynb) | +| [FAQ Style QA](./tutorials/04_FAQ_style_QA.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/04_FAQ_style_QA.ipynb) | [Preprocessing Different File Types](./tutorials/30_File_Type_Preprocessing_Index_Pipeline.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/30_File_Type_Preprocessing_Index_Pipeline.ipynb) | +| [Evaluation](./tutorials/05_Evaluation.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/05_Evaluation.ipynb) | [Metadata Filtering](./tutorials/31_Metadata_Filtering.ipynb) | [![Open In 
Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/31_Metadata_Filtering.ipynb) | +| [Better Retrieval via Embedding Retrieval](./tutorials/06_Better_Retrieval_via_Embedding_Retrieval.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/06_Better_Retrieval_via_Embedding_Retrieval.ipynb) | [Classifying Documents & Queries by Language](./tutorials/32_Classifying_Documents_and_Queries_by_Language.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/32_Classifying_Documents_and_Queries_by_Language.ipynb) | +| [[OUTDATED] RAG Generator](./tutorials/07_RAG_Generator.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/07_RAG_Generator.ipynb) | [Build an Extractive QA Pipeline](./tutorials/34_Extractive_QA_Pipeline.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/34_Extractive_QA_Pipeline.ipynb) | +| [Preprocessing](./tutorials/08_Preprocessing.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/08_Preprocessing.ipynb) | [Model-Based Evaluation of RAG Pipelines](./tutorials/35_Model_Based_Evaluation_of_RAG_Pipelines.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/35_Model_Based_Evaluation_of_RAG_Pipelines.ipynb) | +| [DPR Training](./tutorials/09_DPR_training.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/09_DPR_training.ipynb) | [Sentence Window Retriever](./tutorials/36_Sentence_Window_Retriever.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/36_Sentence_Window_Retriever.ipynb) | | | +| [[OUTDATED] Knowledge Graph](./tutorials/10_Knowledge_Graph.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/10_Knowledge_Graph.ipynb) | | | +| [Pipelines](./tutorials/11_Pipelines.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/11_Pipelines.ipynb) | | | +| [[OUTDATED] Seq2SeqGenerator](./tutorials/12_LFQA.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/12_LFQA.ipynb) | | | +| [Question Generation](./tutorials/13_Question_generation.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/13_Question_generation.ipynb) 
| | | +| [Query Classifier](./tutorials/14_Query_Classifier.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/14_Query_Classifier.ipynb) | | | +| [Table QA](./tutorials/15_TableQA.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/15_TableQA.ipynb) | | | +| [Document Classifier at Index Time](./tutorials/16_Document_Classifier_at_Index_Time.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/16_Document_Classifier_at_Index_Time.ipynb) | | | +| [Make Your QA Pipelines Talk!](./tutorials/17_Audio.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/17_Audio.ipynb) | | | +| [Generative Pseudo Labeling](./tutorials/18_GPL.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/18_GPL.ipynb) | | | +| [Text-to-Image search](./tutorials/19_Text_to_Image_search_pipeline_with_MultiModal_Retriever.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/19_Text_to_Image_search_pipeline_with_MultiModal_Retriever.ipynb) | | | +| [Using Haystack with REST API](./tutorials/20_Using_Haystack_with_REST_API.ipynb) | Download | | | +| [Customizing PromptNode](./tutorials/21_Customizing_PromptNode.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/21_Customizing_PromptNode.ipynb) | | | +| [Generative QA Pipeline with Retrieval-Augmentation](./tutorials/22_Pipeline_with_PromptNode.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/22_Pipeline_with_PromptNode.ipynb) | | | +| [Answering Complex Questions with Agents](./tutorials/23_Answering_Multihop_Questions_with_Agents.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/23_Answering_Multihop_Questions_with_Agents.ipynb) | | | +| [Building a Conversational Chat App](./tutorials/24_Building_Chat_App.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/24_Building_Chat_App.ipynb) | | | +| [Customizing Agent to Chat with Your Documents](./tutorials/25_Customizing_Agent.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/25_Customizing_Agent.ipynb) | | | +| [Creating a Hybrid Retrieval Pipeline](./tutorials/26_Hybrid_Retrieval.ipynb) | [![Open In 
Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/26_Hybrid_Retrieval.ipynb) | | | From 622bcc444d32ca776193b590a55521a55543e474 Mon Sep 17 00:00:00 2001 From: "David S. Batista" Date: Fri, 11 Oct 2024 16:03:23 +0200 Subject: [PATCH 3/7] updating README.MD --- README.md | 56 +- tutorials/36_Sentence_Window_Retriever.ipynb | 510 ------------------- 2 files changed, 28 insertions(+), 538 deletions(-) delete mode 100644 tutorials/36_Sentence_Window_Retriever.ipynb diff --git a/README.md b/README.md index 559dae2f..016441cc 100644 --- a/README.md +++ b/README.md @@ -32,31 +32,31 @@ Haystack 2.0 -| Code | Colab | Code | Colab | -| :-------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :--------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | -| [Build Your First Question Answering System](./tutorials/01_Basic_QA_Pipeline.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/01_Basic_QA_Pipeline.ipynb) | [Your First QA Pipeline with Retrieval-Augmentation](./tutorials/27_First_RAG_Pipeline.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/27_First_RAG_Pipeline.ipynb) | -| [Fine Tune a Model on Your Data](./tutorials/02_Finetune_a_model_on_your_data.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/02_Finetune_a_model_on_your_data.ipynb) | [Generating Structured Output with Loop-Based Auto-Correction](./tutorials/28_Structured_Output_With_Loop.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/28_Structured_Output_With_Loop.ipynb) | -| [Build a Scalable Question Answering System](./tutorials/03_Scalable_QA_System.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/03_Scalable_QA_System.ipynb) | [Serializing Pipelines](./tutorials/29_Serializing_Pipelines.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/29_Serializing_Pipelines.ipynb) | -| [FAQ Style QA](./tutorials/04_FAQ_style_QA.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/04_FAQ_style_QA.ipynb) | [Preprocessing Different File Types](./tutorials/30_File_Type_Preprocessing_Index_Pipeline.ipynb) | [![Open In 
Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/30_File_Type_Preprocessing_Index_Pipeline.ipynb) | -| [Evaluation](./tutorials/05_Evaluation.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/05_Evaluation.ipynb) | [Metadata Filtering](./tutorials/31_Metadata_Filtering.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/31_Metadata_Filtering.ipynb) | -| [Better Retrieval via Embedding Retrieval](./tutorials/06_Better_Retrieval_via_Embedding_Retrieval.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/06_Better_Retrieval_via_Embedding_Retrieval.ipynb) | [Classifying Documents & Queries by Language](./tutorials/32_Classifying_Documents_and_Queries_by_Language.ipynb)| [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/32_Classifying_Documents_and_Queries_by_Language.ipynb)| -| [[OUTDATED] RAG Generator](./tutorials/07_RAG_Generator.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/07_RAG_Generator.ipynb) | [Creating a Hybrid Retrieval Pipeline](./tutorials/33_Hybrid_Retrieval.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/33_Hybrid_Retrieval.ipynb) | -| [Preprocessing](./tutorials/08_Preprocessing.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/08_Preprocessing.ipynb) | [Build an Extractive QA Pipeline](./tutorials/34_Extractive_QA_Pipeline.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/34_Extractive_QA_Pipeline.ipynb) | -| [DPR Training](./tutorials/09_DPR_training.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/09_DPR_training.ipynb) | [Evaluating RAG Pipelines](./tutorials/35_Evaluating_RAG_Pipelines.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/35_Evaluating_RAG_Pipelines.ipynb)| -| [[OUTDATED] Knowledge Graph](./tutorials/10_Knowledge_Graph.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/10_Knowledge_Graph.ipynb) | [Building Pipelines with Conditional Routing](./tutorials/36_Building_Fallbacks_with_Conditional_Routing.ipynb) | [![Open In 
Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/36_Building_Fallbacks_with_Conditional_Routing.ipynb)| -| [Pipelines](./tutorials/11_Pipelines.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/11_Pipelines.ipynb) | [[OUTDATED] Simplifying Pipeline Inputs with Multiplexer](./tutorials/37_Simplifying_Pipeline_Inputs_with_Multiplexer.ipynb)| [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/37_Simplifying_Pipeline_Inputs_with_Multiplexer.ipynb)| -| [[OUTDATED] Seq2SeqGenerator](./tutorials/12_LFQA.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/12_LFQA.ipynb) | [Embedding Metadata for Improved Retrieval](./tutorials/39_Embedding_Metadata_for_Improved_Retrieval.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/39_Embedding_Metadata_for_Improved_Retrieval.ipynb)| -| [Question Generation](./tutorials/13_Question_generation.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/13_Question_generation.ipynb) | [Building a Chat Application with Function Calling](./tutorials/40_Building_Chat_Application_with_Function_Calling.ipynb)| [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/40_Building_Chat_Application_with_Function_Calling.ipynb)| -| [Query Classifier](./tutorials/14_Query_Classifier.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/14_Query_Classifier.ipynb) | | | -| [Table QA](./tutorials/15_TableQA.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/15_TableQA.ipynb) | | | -| [Document Classifier at Index Time](./tutorials/16_Document_Classifier_at_Index_Time.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/16_Document_Classifier_at_Index_Time.ipynb) | | | -| [Make Your QA Pipelines Talk!](./tutorials/17_Audio.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/17_Audio.ipynb) | | | -| [Generative Pseudo Labeling](./tutorials/18_GPL.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/18_GPL.ipynb) | | | -| [Text-to-Image search](./tutorials/19_Text_to_Image_search_pipeline_with_MultiModal_Retriever.ipynb) | [![Open In 
Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/19_Text_to_Image_search_pipeline_with_MultiModal_Retriever.ipynb) | | | -| [Using Haystack with REST API](./tutorials/20_Using_Haystack_with_REST_API.ipynb) | Download | | | -| [Customizing PromptNode](./tutorials/21_Customizing_PromptNode.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/21_Customizing_PromptNode.ipynb) | | | -| [Generative QA Pipeline with Retrieval-Augmentation](./tutorials/22_Pipeline_with_PromptNode.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/22_Pipeline_with_PromptNode.ipynb) | | | -| [Answering Complex Questions with Agents](./tutorials/23_Answering_Multihop_Questions_with_Agents.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/23_Answering_Multihop_Questions_with_Agents.ipynb) | | | -| [Building a Conversational Chat App](./tutorials/24_Building_Chat_App.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/24_Building_Chat_App.ipynb) | | | -| [Customizing Agent to Chat with Your Documents](./tutorials/25_Customizing_Agent.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/25_Customizing_Agent.ipynb) | | | -| [Creating a Hybrid Retrieval Pipeline](./tutorials/26_Hybrid_Retrieval.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/26_Hybrid_Retrieval.ipynb) | | | +| Code | Colab | Code | Colab | +|:----------------------------------------------------------------------------------------------------------| ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |:-----------------------------------------------------------------------------------------------------------------------------| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| [Build Your First Question Answering System](./tutorials/01_Basic_QA_Pipeline.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/01_Basic_QA_Pipeline.ipynb) | [Your First QA Pipeline with Retrieval-Augmentation](./tutorials/27_First_RAG_Pipeline.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/27_First_RAG_Pipeline.ipynb) | +| [Fine Tune a Model on Your Data](./tutorials/02_Finetune_a_model_on_your_data.ipynb) | [![Open In 
Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/02_Finetune_a_model_on_your_data.ipynb) | [Generating Structured Output with Loop-Based Auto-Correction](./tutorials/28_Structured_Output_With_Loop.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/28_Structured_Output_With_Loop.ipynb) | +| [Build a Scalable Question Answering System](./tutorials/03_Scalable_QA_System.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/03_Scalable_QA_System.ipynb) | [Serializing Pipelines](./tutorials/29_Serializing_Pipelines.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/29_Serializing_Pipelines.ipynb) | +| [FAQ Style QA](./tutorials/04_FAQ_style_QA.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/04_FAQ_style_QA.ipynb) | [Preprocessing Different File Types](./tutorials/30_File_Type_Preprocessing_Index_Pipeline.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/30_File_Type_Preprocessing_Index_Pipeline.ipynb) | +| [Evaluation](./tutorials/05_Evaluation.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/05_Evaluation.ipynb) | [Metadata Filtering](./tutorials/31_Metadata_Filtering.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/31_Metadata_Filtering.ipynb) | +| [Better Retrieval via Embedding Retrieval](./tutorials/06_Better_Retrieval_via_Embedding_Retrieval.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/06_Better_Retrieval_via_Embedding_Retrieval.ipynb) | [Classifying Documents & Queries by Language](./tutorials/32_Classifying_Documents_and_Queries_by_Language.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/32_Classifying_Documents_and_Queries_by_Language.ipynb)| +| [[OUTDATED] RAG Generator](./tutorials/07_RAG_Generator.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/07_RAG_Generator.ipynb) | [Creating a Hybrid Retrieval Pipeline](./tutorials/33_Hybrid_Retrieval.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/33_Hybrid_Retrieval.ipynb) | +| [Preprocessing](./tutorials/08_Preprocessing.ipynb) | [![Open In 
Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/08_Preprocessing.ipynb) | [Build an Extractive QA Pipeline](./tutorials/34_Extractive_QA_Pipeline.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/34_Extractive_QA_Pipeline.ipynb) | +| [DPR Training](./tutorials/09_DPR_training.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/09_DPR_training.ipynb) | [Evaluating RAG Pipelines](./tutorials/35_Evaluating_RAG_Pipelines.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/35_Evaluating_RAG_Pipelines.ipynb)| +| [[OUTDATED] Knowledge Graph](./tutorials/10_Knowledge_Graph.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/10_Knowledge_Graph.ipynb) | [Building Pipelines with Conditional Routing](./tutorials/36_Building_Fallbacks_with_Conditional_Routing.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/36_Building_Fallbacks_with_Conditional_Routing.ipynb)| +| [Pipelines](./tutorials/11_Pipelines.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/11_Pipelines.ipynb) | [[OUTDATED] Simplifying Pipeline Inputs with Multiplexer](./tutorials/37_Simplifying_Pipeline_Inputs_with_Multiplexer.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/37_Simplifying_Pipeline_Inputs_with_Multiplexer.ipynb)| +| [[OUTDATED] Seq2SeqGenerator](./tutorials/12_LFQA.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/12_LFQA.ipynb) | [Embedding Metadata for Improved Retrieval](./tutorials/39_Embedding_Metadata_for_Improved_Retrieval.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/39_Embedding_Metadata_for_Improved_Retrieval.ipynb)| +| [Question Generation](./tutorials/13_Question_generation.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/13_Question_generation.ipynb) | [Building a Chat Application with Function Calling](./tutorials/40_Building_Chat_Application_with_Function_Calling.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/40_Building_Chat_Application_with_Function_Calling.ipynb)| +| [Query Classifier](./tutorials/14_Query_Classifier.ipynb) | [![Open In 
Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/14_Query_Classifier.ipynb) | [Sentence Window Retriever](./tutorials/41_Sentence_Window_Retriever.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/40_Building_Chat_Application_with_Function_Calling.ipynb)| | | +| [Table QA](./tutorials/15_TableQA.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/15_TableQA.ipynb) | | | +| [Document Classifier at Index Time](./tutorials/16_Document_Classifier_at_Index_Time.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/16_Document_Classifier_at_Index_Time.ipynb) | | | +| [Make Your QA Pipelines Talk!](./tutorials/17_Audio.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/17_Audio.ipynb) | | | +| [Generative Pseudo Labeling](./tutorials/18_GPL.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/18_GPL.ipynb) | | | +| [Text-to-Image search](./tutorials/19_Text_to_Image_search_pipeline_with_MultiModal_Retriever.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/19_Text_to_Image_search_pipeline_with_MultiModal_Retriever.ipynb) | | | +| [Using Haystack with REST API](./tutorials/20_Using_Haystack_with_REST_API.ipynb) | Download | | | +| [Customizing PromptNode](./tutorials/21_Customizing_PromptNode.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/21_Customizing_PromptNode.ipynb) | | | +| [Generative QA Pipeline with Retrieval-Augmentation](./tutorials/22_Pipeline_with_PromptNode.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/22_Pipeline_with_PromptNode.ipynb) | | | +| [Answering Complex Questions with Agents](./tutorials/23_Answering_Multihop_Questions_with_Agents.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/23_Answering_Multihop_Questions_with_Agents.ipynb) | | | +| [Building a Conversational Chat App](./tutorials/24_Building_Chat_App.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/24_Building_Chat_App.ipynb) | | | +| [Customizing Agent to Chat with Your Documents](./tutorials/25_Customizing_Agent.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/25_Customizing_Agent.ipynb) | | | +| [Creating a Hybrid Retrieval 
Pipeline](./tutorials/26_Hybrid_Retrieval.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/26_Hybrid_Retrieval.ipynb) | | | diff --git a/tutorials/36_Sentence_Window_Retriever.ipynb b/tutorials/36_Sentence_Window_Retriever.ipynb deleted file mode 100644 index 0792eb7b..00000000 --- a/tutorials/36_Sentence_Window_Retriever.ipynb +++ /dev/null @@ -1,510 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "b79cad40-2c9c-4598-8195-0d6cf525ff87", - "metadata": {}, - "source": [ - "## Introduction\n", - "\n", - "The Sentence-Window retrieval technique is a simple and effective way to retrieve more context given a user query which matched some document. It is based on the idea that the most relevant sentences are likely to be close to each other in the document. The technique involves selecting a window of sentences around a sentence matching a user query and instead of returning the matching sentence, the entire window is returned. This technique can be particularly useful when the user query is a question or a phrase that requires more context to be understood." - ] - }, - { - "cell_type": "markdown", - "id": "53dee123-cac6-4451-a2f2-87248d218a7f", - "metadata": {}, - "source": [ - "## Haystack Component\n", - "\n", - "The `SentenceWindowRetriever` is the Haystack component that can be used in a Pipeline to implement the Sentence-Window retrieval technique." - ] - }, - { - "cell_type": "markdown", - "id": "5d9b88f6-f8d2-4450-a00f-105962f3f188", - "metadata": {}, - "source": [ - "`SentenceWindowRetriever(document_store=doc_store, window_size=2)`" - ] - }, - { - "cell_type": "markdown", - "id": "f60548f2-e188-4948-9b53-1d478f6e3e3a", - "metadata": {}, - "source": [ - "The component takes a document_store and a window_size as input. The document_store contains the documents we want to query, and the window_size is used to determine the number of sentences to return around the matching sentence. So the number of sentences returned will be `2 * window_size + 1`. Although we use the term \"sentence\" as it's inertly attached to this technique, the `SentenceWindowRetriever` actually works with any splitter from the `DocumentSplitter` class, for instance: `word`, `sentence`, `page`." - ] - }, - { - "cell_type": "markdown", - "id": "ee689359-94bf-45b6-b69b-f17266382ff8", - "metadata": {}, - "source": [ - "## Introductory Example\n", - "\n", - "Let's see a simple example of how to use the `SentenceWindowRetriever` in isolation, and later we can see how to use it within a pipeline. We start by creating a document and splitting it into sentences using the `DocumentSplitter` class." 
- ] - }, - { - "cell_type": "code", - "execution_count": 5, - "id": "edf1b350-43da-4dcb-a1ef-68f3b7ad3ca7", - "metadata": {}, - "outputs": [], - "source": [ - "!pip install haystack-ai" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "id": "aae36c73-f04d-4a99-a63a-0a5ebfa0242a", - "metadata": {}, - "outputs": [], - "source": [ - "from haystack.components.retrievers import SentenceWindowRetriever" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "id": "fb04d1f9-5329-499c-9479-7e3b4b4fa126", - "metadata": {}, - "outputs": [], - "source": [ - "from haystack import Document\n", - "from haystack.components.preprocessors import DocumentSplitter\n", - "\n", - "splitter = DocumentSplitter(split_length=1, split_overlap=0, split_by=\"sentence\")\n", - "text = (\"Paul fell asleep to dream of an Arrakeen cavern, silent people all around him moving in the dim light \"\n", - " \"of glowglobes. It was solemn there and like a cathedral as he listened to a faint sound—the \"\n", - " \"drip-drip-drip of water. Even while he remained in the dream, Paul knew he would remember it upon \"\n", - " \"awakening. He always remembered the dreams that were predictions. The dream faded. Paul awoke to feel \"\n", - " \"himself in the warmth of his bed—thinking thinking. This world of Castle Caladan, without play or \"\n", - " \"companions his own age, perhaps did not deserve sadness in farewell. Dr Yueh, his teacher, had \"\n", - " \"hinted that the faufreluches class system was not rigidly guarded on Arrakis. The planet sheltered \"\n", - " \"people who lived at the desert edge without caid or bashar to command them: will-o’-the-sand people \"\n", - " \"called Fremen, marked down on no census of the Imperial Regate.\")\n", - "\n", - "doc = Document(content=text)\n", - "docs = splitter.run([doc])" - ] - }, - { - "cell_type": "markdown", - "id": "61d94c11-4c32-4022-979d-8316f9069ac8", - "metadata": {}, - "source": [ - "this will result in 9 sentences represented as Haystack Document objects. We can then write these documents to a DocumentStore and use the SentenceWindowRetriever to retrieve a window of sentences around a matching sentence." - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "id": "24a7fd39-19df-486e-9914-166dc3e77cc4", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "9" - ] - }, - "execution_count": 9, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from haystack.document_stores.in_memory import InMemoryDocumentStore\n", - "from haystack.document_stores.types import DuplicatePolicy\n", - "\n", - "doc_store = InMemoryDocumentStore()\n", - "doc_store.write_documents(docs['documents'], policy=DuplicatePolicy.OVERWRITE)" - ] - }, - { - "cell_type": "markdown", - "id": "a1253dd5-e71e-4a26-814d-e8d256750aff", - "metadata": {}, - "source": [ - "Now we use the `SentenceWindowRetriever` to retrieve a window of sentences around a certain sentence. Note that the `SentenceWindowRetriever` receives as input in run time a `Document` present in the document store, and it will rely on the documents metadata to retrieve the window of sentences around the matching sentence. 
So, one important aspect to notice is that the `SentenceWindowRetriever` needs to be used in conjunction with another `Retriever` that handles the initial user query, such as the `InMemoryBM25Retriever`, and returns the matching documents.\n", - "\n", - "Let's pass the Document containing the sentence `The dream faded.` to the `SentenceWindowRetriever` and retrieve a window of 2 sentences around it. Note that we need to wrap it in a list as the `run` method expects a list of documents." - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "id": "c028b53b-e68c-4b02-bd83-f9096aa54079", - "metadata": {}, - "outputs": [], - "source": [ - "from haystack.components.retrievers import SentenceWindowRetriever\n", - "\n", - "retriever = SentenceWindowRetriever(document_store=doc_store, window_size=2)\n", - "result = retriever.run(retrieved_documents=[docs['documents'][4]])" - ] - }, - { - "cell_type": "markdown", - "id": "0992d153-6770-4519-a930-6b4c85115611", - "metadata": {}, - "source": [ - "The result is a dictionary with two keys:\n", - "\n", - "- `context_windows`: a list of strings containing the context windows around the matching sentence.\n", - "- `context_documents`: a list of lists of `Document` objects containing the context windows around the matching sentence." - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "id": "050d1751-5742-44b9-a522-436088458653", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "dict_keys(['context_windows', 'context_documents'])" - ] - }, - "execution_count": 14, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "result.keys()" - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "id": "b92920a5-5937-4f6b-87fb-a68db4c79401", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "[' Even while he remained in the dream, Paul knew he would remember it upon awakening. He always remembered the dreams that were predictions. The dream faded. Paul awoke to feel himself in the warmth of his bed—thinking thinking. 
This world of Castle Caladan, without play or companions his own age, perhaps did not deserve sadness in farewell.']" - ] - }, - "execution_count": 15, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "result['context_windows']" - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "id": "faded9fe-725a-4b50-8855-7356ec0749e7", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "[[Document(id=5d093b6ec1a4bdc7e75f033ae0b570e237053213a09b42a56ad815b4d118943d, content: ' Even while he remained in the dream, Paul knew he would remember it upon awakening.', meta: {'source_id': 'b56504f244b7b650096b14d678bc82f3d7fe240bb135361c6a23a14c4b809596', 'page_number': 1, 'split_id': 2, 'split_idx_start': 219}),\n", - " Document(id=4ed71ff61df531053cc7d5f80e8a0bd1e702f3a396f3f3983ceeffe89878a684, content: ' He always remembered the dreams that were predictions.', meta: {'source_id': 'b56504f244b7b650096b14d678bc82f3d7fe240bb135361c6a23a14c4b809596', 'page_number': 1, 'split_id': 3, 'split_idx_start': 303}),\n", - " Document(id=f485258001abdf2deab98249c7f0826b4f6b1bef7c37763d14318e7b595f434f, content: ' The dream faded.', meta: {'source_id': 'b56504f244b7b650096b14d678bc82f3d7fe240bb135361c6a23a14c4b809596', 'page_number': 1, 'split_id': 4, 'split_idx_start': 358}),\n", - " Document(id=f39c29c3a3122affc5909dc7b98f5880d9bd984731380420134c440da6fee363, content: ' Paul awoke to feel himself in the warmth of his bed—thinking thinking.', meta: {'source_id': 'b56504f244b7b650096b14d678bc82f3d7fe240bb135361c6a23a14c4b809596', 'page_number': 1, 'split_id': 5, 'split_idx_start': 375}),\n", - " Document(id=15401623a2a4fed533db7c1bbe8df157f79a9395cf8d3d6e92dc5ae553d0dded, content: ' This world of Castle Caladan, without play or companions his own age, perhaps did not deserve sadn...', meta: {'source_id': 'b56504f244b7b650096b14d678bc82f3d7fe240bb135361c6a23a14c4b809596', 'page_number': 1, 'split_id': 6, 'split_idx_start': 446})]]" - ] - }, - "execution_count": 16, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "result['context_documents']" - ] - }, - { - "cell_type": "markdown", - "id": "289e207c-7f8f-45da-bdfd-61ed4955942d", - "metadata": {}, - "source": [ - "## Advanced Example" - ] - }, - { - "cell_type": "markdown", - "id": "6bc10c96-e453-4e43-9b64-d05fae6de040", - "metadata": {}, - "source": [ - "We will use the BBC news dataset to show how the `SentenceWindowRetriever` works with a dataset containing multiple news articles.\n", - "\n", - "### Reading the dataset\n", - "\n", - "The original dataset is available at http://mlg.ucd.ie/datasets/bbc.html, but it was already preprocessed and stored in\n", - "a single CSV file available here: https://raw.githubusercontent.com/amankharwal/Website-data/master/bbc-news-data.csv" - ] - }, - { - "cell_type": "code", - "execution_count": 17, - "id": "82565cae-a730-4cd7-85f3-40be0e77b94d", - "metadata": {}, - "outputs": [], - "source": [ - "from typing import List\n", - "import csv\n", - "from haystack import Document\n", - "\n", - "def read_documents(file: str) -> List[Document]:\n", - " with open(file, \"r\") as file:\n", - " reader = csv.reader(file, delimiter=\"\\t\")\n", - " next(reader, None) # skip the headers\n", - " documents = []\n", - " for row in reader:\n", - " category = row[0].strip()\n", - " title = row[2].strip()\n", - " text = row[3].strip()\n", - " documents.append(Document(content=text, meta={\"category\": category, \"title\": title}))\n", - "\n", - " return documents" - ] 
- }, - { - "cell_type": "code", - "execution_count": 18, - "id": "4f581e8b-0693-4b09-b82e-71e78cb83f1a", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "--2024-10-11 15:46:13-- https://raw.githubusercontent.com/amankharwal/Website-data/master/bbc-news-data.csv\n", - "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.111.133, ...\n", - "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.\n", - "HTTP request sent, awaiting response... 200 OK\n", - "Length: 5080260 (4.8M) [text/plain]\n", - "Saving to: ‘bbc-news-data.csv.2’\n", - "\n", - "bbc-news-data.csv.2 100%[===================>] 4.84M 8.53MB/s in 0.6s \n", - "\n", - "2024-10-11 15:46:14 (8.53 MB/s) - ‘bbc-news-data.csv.2’ saved [5080260/5080260]\n", - "\n" - ] - } - ], - "source": [ - "!wget https://raw.githubusercontent.com/amankharwal/Website-data/master/bbc-news-data.csv" - ] - }, - { - "cell_type": "code", - "execution_count": 19, - "id": "1ab23051-7df1-49e6-a009-ba187855aab3", - "metadata": {}, - "outputs": [], - "source": [ - "docs = read_documents(\"bbc-news-data.csv\")" - ] - }, - { - "cell_type": "code", - "execution_count": 20, - "id": "e3adbfda-86ba-44fc-a28c-681eb1b23351", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "2225" - ] - }, - "execution_count": 20, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "len(docs)" - ] - }, - { - "cell_type": "markdown", - "id": "4a003472-19c1-4bc0-b6df-995bc66e8904", - "metadata": {}, - "source": [ - "### Indexing the documents\n", - "\n", - "We will now apply the `DocumentSplitter` to split the documents into sentences and write them to an `InMemoryDocumentStore`." - ] - }, - { - "cell_type": "code", - "execution_count": 21, - "id": "eb3203f3-2f75-4a60-9d2a-f530a09113a0", - "metadata": {}, - "outputs": [], - "source": [ - "from haystack import Document\n", - "from haystack.components.preprocessors import DocumentSplitter\n", - "from haystack.document_stores.in_memory import InMemoryDocumentStore\n", - "from haystack.document_stores.types import DuplicatePolicy\n", - "\n", - "def index_documents(documents: List[Document]):\n", - " splitter = DocumentSplitter(split_length=1, split_overlap=0, split_by=\"sentence\")\n", - " docs = splitter.run(documents)\n", - " doc_store = InMemoryDocumentStore()\n", - " doc_store.write_documents(docs[\"documents\"], policy=DuplicatePolicy.OVERWRITE)\n", - "\n", - " return doc_store" - ] - }, - { - "cell_type": "markdown", - "id": "a0b6030d-ace7-471e-ae5f-b7dfc0ec1064", - "metadata": {}, - "source": [ - "### Querying the documents\n", - "\n", - "Let's now build a pipeline to query the documents using the `InMemoryBM25Retriever` and the `SentenceWindowRetriever`." 
- ] - }, - { - "cell_type": "code", - "execution_count": 22, - "id": "7048bc9c-8c6a-4df0-92c4-20b6162cfdb4", - "metadata": {}, - "outputs": [], - "source": [ - "from haystack import Pipeline\n", - "from haystack.components.retrievers.in_memory import InMemoryBM25Retriever\n", - "from haystack.components.retrievers import SentenceWindowRetriever\n", - "\n", - "def querying_pipeline(doc_store: InMemoryDocumentStore, window_size: int = 2):\n", - " pipeline = Pipeline()\n", - " bm25_retriever = InMemoryBM25Retriever(document_store=doc_store)\n", - " sentence_window_retriever = SentenceWindowRetriever(doc_store, window_size=window_size)\n", - " pipeline.add_component(instance=bm25_retriever, name=\"BM25Retriever\")\n", - " pipeline.add_component(instance=sentence_window_retriever, name=\"SentenceWindowRetriever\")\n", - " pipeline.connect(\"BM25Retriever.documents\", \"SentenceWindowRetriever.retrieved_documents\")\n", - "\n", - " return pipeline" - ] - }, - { - "cell_type": "markdown", - "id": "e67abab4-10c1-4d95-b8fe-1ff24bb93161", - "metadata": {}, - "source": [ - "### Putting it all together\n", - "\n", - "We now read the raw documents, index them, build the querying pipeline, and query the document store for \"phishing attacks\", returning only the first top most scored document. We also include the outputs from the BM25Retriever\n", - "so that we can compare the results with and without the SentenceWindowRetriever." - ] - }, - { - "cell_type": "code", - "execution_count": 23, - "id": "fdadd81e-3a4c-474d-9a92-85fa2f6a1264", - "metadata": {}, - "outputs": [], - "source": [ - "docs = read_documents(\"bbc-news-data.csv\")\n", - "doc_store = index_documents(docs)\n", - "pipeline = querying_pipeline(doc_store, window_size=2)\n", - "result = pipeline.run(data={'BM25Retriever': {'query': \"phishing attacks\", \"top_k\": 1}}, include_outputs_from={'BM25Retriever'})" - ] - }, - { - "cell_type": "markdown", - "id": "bbda91dc-238e-44a8-a241-9c35115efe88", - "metadata": {}, - "source": [ - "Let's now inspect the results from the BM25Retriever and the SentenceWindowRetriever. Since we split the documents by sentence, the BM25Retriever returns only the sentence associated with the matching query." - ] - }, - { - "cell_type": "code", - "execution_count": 24, - "id": "ff53bf0b-ec2f-49aa-a5ff-82e0686ac81d", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "' The Anti-Phishing Working group reported that the number of phishing attacks against new targets was growing at a rate of 30% or more per month.'" - ] - }, - "execution_count": 24, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "result['BM25Retriever']['documents'][0].content" - ] - }, - { - "cell_type": "markdown", - "id": "aca630e2-f0a9-4a07-bdcb-5f7e10415802", - "metadata": {}, - "source": [ - "The SentenceWindowRetriever, on the other hand, returns a window of sentences around the matching sentence, giving us more context to understand the sentence." - ] - }, - { - "cell_type": "code", - "execution_count": 25, - "id": "b3453305-b536-460f-be3f-8fc6d7169673", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "'\" In particular, phishing attacks, which typically use fake versions of bank websites to grab login details of customers, boomed during 2004. Web portal Lycos Europe reported a 500% increase in the number of phishing e-mail messages it was catching. 
The Anti-Phishing Working group reported that the number of phishing attacks against new targets was growing at a rate of 30% or more per month. Those who fall victim to these attacks can find that their bank account has been cleaned out or that their good name has been ruined by someone stealing their identity. This change in the ranks of virus writers could mean the end of the mass-mailing virus which attempts to spread by tricking people into opening infected attachments on e-mail messages.'" - ] - }, - "execution_count": 25, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "result['SentenceWindowRetriever']['context_windows'][0]" - ] - }, - { - "cell_type": "markdown", - "id": "06614ff0-7b82-46dc-8416-a78633704583", - "metadata": {}, - "source": [ - "## Conclusion\n", - "\n", - "We saw how the `SentenceWindowRetriever` works and how it can be used to retrieve a window of sentences around a matching document, give us more context to understand the document. One important aspect to notice is that the `SentenceWindowRetriever` doesn't handle queries directly but relies on the output of another `Retriever` that handles the initial user query. This allows the `SentenceWindowRetriever` to be used in conjunction with any other retriever in the pipeline, such as the `InMemoryBM25Retriever`." - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.7" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} From 97cae28d14eab8862684098cea78dc22bc09f2fb Mon Sep 17 00:00:00 2001 From: "David S. Batista" Date: Fri, 11 Oct 2024 16:04:00 +0200 Subject: [PATCH 4/7] adding notebook --- tutorials/41_Sentence_Window_Retriever.ipynb | 510 +++++++++++++++++++ 1 file changed, 510 insertions(+) create mode 100644 tutorials/41_Sentence_Window_Retriever.ipynb diff --git a/tutorials/41_Sentence_Window_Retriever.ipynb b/tutorials/41_Sentence_Window_Retriever.ipynb new file mode 100644 index 00000000..0792eb7b --- /dev/null +++ b/tutorials/41_Sentence_Window_Retriever.ipynb @@ -0,0 +1,510 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "b79cad40-2c9c-4598-8195-0d6cf525ff87", + "metadata": {}, + "source": [ + "## Introduction\n", + "\n", + "The Sentence-Window retrieval technique is a simple and effective way to retrieve more context given a user query which matched some document. It is based on the idea that the most relevant sentences are likely to be close to each other in the document. The technique involves selecting a window of sentences around a sentence matching a user query and instead of returning the matching sentence, the entire window is returned. This technique can be particularly useful when the user query is a question or a phrase that requires more context to be understood." + ] + }, + { + "cell_type": "markdown", + "id": "53dee123-cac6-4451-a2f2-87248d218a7f", + "metadata": {}, + "source": [ + "## Haystack Component\n", + "\n", + "The `SentenceWindowRetriever` is the Haystack component that can be used in a Pipeline to implement the Sentence-Window retrieval technique." 
+ ] + }, + { + "cell_type": "markdown", + "id": "5d9b88f6-f8d2-4450-a00f-105962f3f188", + "metadata": {}, + "source": [ + "`SentenceWindowRetriever(document_store=doc_store, window_size=2)`" + ] + }, + { + "cell_type": "markdown", + "id": "f60548f2-e188-4948-9b53-1d478f6e3e3a", + "metadata": {}, + "source": [ + "The component takes a document_store and a window_size as input. The document_store contains the documents we want to query, and the window_size is used to determine the number of sentences to return around the matching sentence. So the number of sentences returned will be `2 * window_size + 1`. Although we use the term \"sentence\" as it's inherently attached to this technique, the `SentenceWindowRetriever` actually works with any splitter from the `DocumentSplitter` class, for instance: `word`, `sentence`, `page`." + ] + }, + { + "cell_type": "markdown", + "id": "ee689359-94bf-45b6-b69b-f17266382ff8", + "metadata": {}, + "source": [ + "## Introductory Example\n", + "\n", + "Let's see a simple example of how to use the `SentenceWindowRetriever` in isolation, and later we can see how to use it within a pipeline. We start by creating a document and splitting it into sentences using the `DocumentSplitter` class." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "edf1b350-43da-4dcb-a1ef-68f3b7ad3ca7", + "metadata": {}, + "outputs": [], + "source": [ + "!pip install haystack-ai" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "aae36c73-f04d-4a99-a63a-0a5ebfa0242a", + "metadata": {}, + "outputs": [], + "source": [ + "from haystack.components.retrievers import SentenceWindowRetriever" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "fb04d1f9-5329-499c-9479-7e3b4b4fa126", + "metadata": {}, + "outputs": [], + "source": [ + "from haystack import Document\n", + "from haystack.components.preprocessors import DocumentSplitter\n", + "\n", + "splitter = DocumentSplitter(split_length=1, split_overlap=0, split_by=\"sentence\")\n", + "text = (\"Paul fell asleep to dream of an Arrakeen cavern, silent people all around him moving in the dim light \"\n", + "       \"of glowglobes. It was solemn there and like a cathedral as he listened to a faint sound—the \"\n", + "       \"drip-drip-drip of water. Even while he remained in the dream, Paul knew he would remember it upon \"\n", + "       \"awakening. He always remembered the dreams that were predictions. The dream faded. Paul awoke to feel \"\n", + "       \"himself in the warmth of his bed—thinking thinking. This world of Castle Caladan, without play or \"\n", + "       \"companions his own age, perhaps did not deserve sadness in farewell. Dr Yueh, his teacher, had \"\n", + "       \"hinted that the faufreluches class system was not rigidly guarded on Arrakis. The planet sheltered \"\n", + "       \"people who lived at the desert edge without caid or bashar to command them: will-o’-the-sand people \"\n", + "       \"called Fremen, marked down on no census of the Imperial Regate.\")\n", + "\n", + "doc = Document(content=text)\n", + "docs = splitter.run([doc])" + ] + }, + { + "cell_type": "markdown", + "id": "61d94c11-4c32-4022-979d-8316f9069ac8", + "metadata": {}, + "source": [ + "This will result in 9 sentences represented as Haystack Document objects. We can then write these documents to a DocumentStore and use the SentenceWindowRetriever to retrieve a window of sentences around a matching sentence." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "24a7fd39-19df-486e-9914-166dc3e77cc4", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "9" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from haystack.document_stores.in_memory import InMemoryDocumentStore\n", + "from haystack.document_stores.types import DuplicatePolicy\n", + "\n", + "doc_store = InMemoryDocumentStore()\n", + "doc_store.write_documents(docs['documents'], policy=DuplicatePolicy.OVERWRITE)" + ] + }, + { + "cell_type": "markdown", + "id": "a1253dd5-e71e-4a26-814d-e8d256750aff", + "metadata": {}, + "source": [ + "Now we use the `SentenceWindowRetriever` to retrieve a window of sentences around a certain sentence. Note that the `SentenceWindowRetriever` receives as input, at runtime, a `Document` present in the document store, and it will rely on the document's metadata to retrieve the window of sentences around the matching sentence. So, one important aspect to notice is that the `SentenceWindowRetriever` needs to be used in conjunction with another `Retriever` that handles the initial user query, such as the `InMemoryBM25Retriever`, and returns the matching documents.\n", + "\n", + "Let's pass the Document containing the sentence `The dream faded.` to the `SentenceWindowRetriever` and retrieve a window of 2 sentences around it. Note that we need to wrap it in a list as the `run` method expects a list of documents." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "c028b53b-e68c-4b02-bd83-f9096aa54079", + "metadata": {}, + "outputs": [], + "source": [ + "from haystack.components.retrievers import SentenceWindowRetriever\n", + "\n", + "retriever = SentenceWindowRetriever(document_store=doc_store, window_size=2)\n", + "result = retriever.run(retrieved_documents=[docs['documents'][4]])" + ] + }, + { + "cell_type": "markdown", + "id": "0992d153-6770-4519-a930-6b4c85115611", + "metadata": {}, + "source": [ + "The result is a dictionary with two keys:\n", + "\n", + "- `context_windows`: a list of strings containing the context windows around the matching sentence.\n", + "- `context_documents`: a list of lists of `Document` objects containing the context windows around the matching sentence." + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "050d1751-5742-44b9-a522-436088458653", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "dict_keys(['context_windows', 'context_documents'])" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "result.keys()" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "b92920a5-5937-4f6b-87fb-a68db4c79401", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[' Even while he remained in the dream, Paul knew he would remember it upon awakening. He always remembered the dreams that were predictions. The dream faded. Paul awoke to feel himself in the warmth of his bed—thinking thinking. 
This world of Castle Caladan, without play or companions his own age, perhaps did not deserve sadness in farewell.']" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "result['context_windows']" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "faded9fe-725a-4b50-8855-7356ec0749e7", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[[Document(id=5d093b6ec1a4bdc7e75f033ae0b570e237053213a09b42a56ad815b4d118943d, content: ' Even while he remained in the dream, Paul knew he would remember it upon awakening.', meta: {'source_id': 'b56504f244b7b650096b14d678bc82f3d7fe240bb135361c6a23a14c4b809596', 'page_number': 1, 'split_id': 2, 'split_idx_start': 219}),\n", + " Document(id=4ed71ff61df531053cc7d5f80e8a0bd1e702f3a396f3f3983ceeffe89878a684, content: ' He always remembered the dreams that were predictions.', meta: {'source_id': 'b56504f244b7b650096b14d678bc82f3d7fe240bb135361c6a23a14c4b809596', 'page_number': 1, 'split_id': 3, 'split_idx_start': 303}),\n", + " Document(id=f485258001abdf2deab98249c7f0826b4f6b1bef7c37763d14318e7b595f434f, content: ' The dream faded.', meta: {'source_id': 'b56504f244b7b650096b14d678bc82f3d7fe240bb135361c6a23a14c4b809596', 'page_number': 1, 'split_id': 4, 'split_idx_start': 358}),\n", + " Document(id=f39c29c3a3122affc5909dc7b98f5880d9bd984731380420134c440da6fee363, content: ' Paul awoke to feel himself in the warmth of his bed—thinking thinking.', meta: {'source_id': 'b56504f244b7b650096b14d678bc82f3d7fe240bb135361c6a23a14c4b809596', 'page_number': 1, 'split_id': 5, 'split_idx_start': 375}),\n", + " Document(id=15401623a2a4fed533db7c1bbe8df157f79a9395cf8d3d6e92dc5ae553d0dded, content: ' This world of Castle Caladan, without play or companions his own age, perhaps did not deserve sadn...', meta: {'source_id': 'b56504f244b7b650096b14d678bc82f3d7fe240bb135361c6a23a14c4b809596', 'page_number': 1, 'split_id': 6, 'split_idx_start': 446})]]" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "result['context_documents']" + ] + }, + { + "cell_type": "markdown", + "id": "289e207c-7f8f-45da-bdfd-61ed4955942d", + "metadata": {}, + "source": [ + "## Advanced Example" + ] + }, + { + "cell_type": "markdown", + "id": "6bc10c96-e453-4e43-9b64-d05fae6de040", + "metadata": {}, + "source": [ + "We will use the BBC news dataset to show how the `SentenceWindowRetriever` works with a dataset containing multiple news articles.\n", + "\n", + "### Reading the dataset\n", + "\n", + "The original dataset is available at http://mlg.ucd.ie/datasets/bbc.html, but it was already preprocessed and stored in\n", + "a single CSV file available here: https://raw.githubusercontent.com/amankharwal/Website-data/master/bbc-news-data.csv" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "82565cae-a730-4cd7-85f3-40be0e77b94d", + "metadata": {}, + "outputs": [], + "source": [ + "from typing import List\n", + "import csv\n", + "from haystack import Document\n", + "\n", + "def read_documents(file: str) -> List[Document]:\n", + " with open(file, \"r\") as file:\n", + " reader = csv.reader(file, delimiter=\"\\t\")\n", + " next(reader, None) # skip the headers\n", + " documents = []\n", + " for row in reader:\n", + " category = row[0].strip()\n", + " title = row[2].strip()\n", + " text = row[3].strip()\n", + " documents.append(Document(content=text, meta={\"category\": category, \"title\": title}))\n", + "\n", + " return documents" + ] 
+ }, + { + "cell_type": "code", + "execution_count": 18, + "id": "4f581e8b-0693-4b09-b82e-71e78cb83f1a", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "--2024-10-11 15:46:13-- https://raw.githubusercontent.com/amankharwal/Website-data/master/bbc-news-data.csv\n", + "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.111.133, ...\n", + "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.\n", + "HTTP request sent, awaiting response... 200 OK\n", + "Length: 5080260 (4.8M) [text/plain]\n", + "Saving to: ‘bbc-news-data.csv.2’\n", + "\n", + "bbc-news-data.csv.2 100%[===================>] 4.84M 8.53MB/s in 0.6s \n", + "\n", + "2024-10-11 15:46:14 (8.53 MB/s) - ‘bbc-news-data.csv.2’ saved [5080260/5080260]\n", + "\n" + ] + } + ], + "source": [ + "!wget https://raw.githubusercontent.com/amankharwal/Website-data/master/bbc-news-data.csv" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "1ab23051-7df1-49e6-a009-ba187855aab3", + "metadata": {}, + "outputs": [], + "source": [ + "docs = read_documents(\"bbc-news-data.csv\")" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "e3adbfda-86ba-44fc-a28c-681eb1b23351", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "2225" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "len(docs)" + ] + }, + { + "cell_type": "markdown", + "id": "4a003472-19c1-4bc0-b6df-995bc66e8904", + "metadata": {}, + "source": [ + "### Indexing the documents\n", + "\n", + "We will now apply the `DocumentSplitter` to split the documents into sentences and write them to an `InMemoryDocumentStore`." + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "eb3203f3-2f75-4a60-9d2a-f530a09113a0", + "metadata": {}, + "outputs": [], + "source": [ + "from haystack import Document\n", + "from haystack.components.preprocessors import DocumentSplitter\n", + "from haystack.document_stores.in_memory import InMemoryDocumentStore\n", + "from haystack.document_stores.types import DuplicatePolicy\n", + "\n", + "def index_documents(documents: List[Document]):\n", + " splitter = DocumentSplitter(split_length=1, split_overlap=0, split_by=\"sentence\")\n", + " docs = splitter.run(documents)\n", + " doc_store = InMemoryDocumentStore()\n", + " doc_store.write_documents(docs[\"documents\"], policy=DuplicatePolicy.OVERWRITE)\n", + "\n", + " return doc_store" + ] + }, + { + "cell_type": "markdown", + "id": "a0b6030d-ace7-471e-ae5f-b7dfc0ec1064", + "metadata": {}, + "source": [ + "### Querying the documents\n", + "\n", + "Let's now build a pipeline to query the documents using the `InMemoryBM25Retriever` and the `SentenceWindowRetriever`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 22, + "id": "7048bc9c-8c6a-4df0-92c4-20b6162cfdb4", + "metadata": {}, + "outputs": [], + "source": [ + "from haystack import Pipeline\n", + "from haystack.components.retrievers.in_memory import InMemoryBM25Retriever\n", + "from haystack.components.retrievers import SentenceWindowRetriever\n", + "\n", + "def querying_pipeline(doc_store: InMemoryDocumentStore, window_size: int = 2):\n", + "    pipeline = Pipeline()\n", + "    bm25_retriever = InMemoryBM25Retriever(document_store=doc_store)\n", + "    sentence_window_retriever = SentenceWindowRetriever(doc_store, window_size=window_size)\n", + "    pipeline.add_component(instance=bm25_retriever, name=\"BM25Retriever\")\n", + "    pipeline.add_component(instance=sentence_window_retriever, name=\"SentenceWindowRetriever\")\n", + "    pipeline.connect(\"BM25Retriever.documents\", \"SentenceWindowRetriever.retrieved_documents\")\n", + "\n", + "    return pipeline" + ] + }, + { + "cell_type": "markdown", + "id": "e67abab4-10c1-4d95-b8fe-1ff24bb93161", + "metadata": {}, + "source": [ + "### Putting it all together\n", + "\n", + "We now read the raw documents, index them, build the querying pipeline, and query the document store for \"phishing attacks\", returning only the top-scored document. We also include the outputs from the BM25Retriever\n", + "so that we can compare the results with and without the SentenceWindowRetriever." + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "id": "fdadd81e-3a4c-474d-9a92-85fa2f6a1264", + "metadata": {}, + "outputs": [], + "source": [ + "docs = read_documents(\"bbc-news-data.csv\")\n", + "doc_store = index_documents(docs)\n", + "pipeline = querying_pipeline(doc_store, window_size=2)\n", + "result = pipeline.run(data={'BM25Retriever': {'query': \"phishing attacks\", \"top_k\": 1}}, include_outputs_from={'BM25Retriever'})" + ] + }, + { + "cell_type": "markdown", + "id": "bbda91dc-238e-44a8-a241-9c35115efe88", + "metadata": {}, + "source": [ + "Let's now inspect the results from the BM25Retriever and the SentenceWindowRetriever. Since we split the documents by sentence, the BM25Retriever returns only the sentence associated with the matching query." + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "id": "ff53bf0b-ec2f-49aa-a5ff-82e0686ac81d", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "' The Anti-Phishing Working group reported that the number of phishing attacks against new targets was growing at a rate of 30% or more per month.'" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "result['BM25Retriever']['documents'][0].content" + ] + }, + { + "cell_type": "markdown", + "id": "aca630e2-f0a9-4a07-bdcb-5f7e10415802", + "metadata": {}, + "source": [ + "The SentenceWindowRetriever, on the other hand, returns a window of sentences around the matching sentence, giving us more context to understand the sentence." + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "id": "b3453305-b536-460f-be3f-8fc6d7169673", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'\" In particular, phishing attacks, which typically use fake versions of bank websites to grab login details of customers, boomed during 2004. Web portal Lycos Europe reported a 500% increase in the number of phishing e-mail messages it was catching. 
The Anti-Phishing Working group reported that the number of phishing attacks against new targets was growing at a rate of 30% or more per month. Those who fall victim to these attacks can find that their bank account has been cleaned out or that their good name has been ruined by someone stealing their identity. This change in the ranks of virus writers could mean the end of the mass-mailing virus which attempts to spread by tricking people into opening infected attachments on e-mail messages.'" + ] + }, + "execution_count": 25, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "result['SentenceWindowRetriever']['context_windows'][0]" + ] + }, + { + "cell_type": "markdown", + "id": "06614ff0-7b82-46dc-8416-a78633704583", + "metadata": {}, + "source": [ + "## Conclusion\n", + "\n", + "We saw how the `SentenceWindowRetriever` works and how it can be used to retrieve a window of sentences around a matching document, giving us more context to understand the document. One important aspect to notice is that the `SentenceWindowRetriever` doesn't handle queries directly but relies on the output of another `Retriever` that handles the initial user query. This allows the `SentenceWindowRetriever` to be used in conjunction with any other retriever in the pipeline, such as the `InMemoryBM25Retriever`." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.7" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} From 6f307c7e6c0b074aa8dd0a2e1c3586e3e5a154bb Mon Sep 17 00:00:00 2001 From: "David S. 
Batista" Date: Fri, 11 Oct 2024 16:04:32 +0200 Subject: [PATCH 5/7] updating README.MD --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 016441cc..39ceca45 100644 --- a/README.md +++ b/README.md @@ -47,7 +47,7 @@ Haystack 2.0 | [Pipelines](./tutorials/11_Pipelines.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/11_Pipelines.ipynb) | [[OUTDATED] Simplifying Pipeline Inputs with Multiplexer](./tutorials/37_Simplifying_Pipeline_Inputs_with_Multiplexer.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/37_Simplifying_Pipeline_Inputs_with_Multiplexer.ipynb)| | [[OUTDATED] Seq2SeqGenerator](./tutorials/12_LFQA.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/12_LFQA.ipynb) | [Embedding Metadata for Improved Retrieval](./tutorials/39_Embedding_Metadata_for_Improved_Retrieval.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/39_Embedding_Metadata_for_Improved_Retrieval.ipynb)| | [Question Generation](./tutorials/13_Question_generation.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/13_Question_generation.ipynb) | [Building a Chat Application with Function Calling](./tutorials/40_Building_Chat_Application_with_Function_Calling.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/40_Building_Chat_Application_with_Function_Calling.ipynb)| -| [Query Classifier](./tutorials/14_Query_Classifier.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/14_Query_Classifier.ipynb) | [Sentence Window Retriever](./tutorials/41_Sentence_Window_Retriever.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/40_Building_Chat_Application_with_Function_Calling.ipynb)| | | +| [Query Classifier](./tutorials/14_Query_Classifier.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/14_Query_Classifier.ipynb) | [Sentence Window Retriever](./tutorials/41_Sentence_Window_Retriever.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/41_Sentence_Window_Retriever.ipynb)| | | | [Table QA](./tutorials/15_TableQA.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/15_TableQA.ipynb) | | | | [Document Classifier at Index Time](./tutorials/16_Document_Classifier_at_Index_Time.ipynb) | [![Open In 
Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/16_Document_Classifier_at_Index_Time.ipynb) | | | | [Make Your QA Pipelines Talk!](./tutorials/17_Audio.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/17_Audio.ipynb) | | | From 5511d0fdc3d0d0062b326d551cffe9bb0d312cdb Mon Sep 17 00:00:00 2001 From: Tuana Celik Date: Wed, 16 Oct 2024 17:54:45 +0200 Subject: [PATCH 6/7] trying bash --- tutorials/42_Sentence_Window_Retriever.ipynb | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/tutorials/42_Sentence_Window_Retriever.ipynb b/tutorials/42_Sentence_Window_Retriever.ipynb index 512c1adc..4a6d3871 100644 --- a/tutorials/42_Sentence_Window_Retriever.ipynb +++ b/tutorials/42_Sentence_Window_Retriever.ipynb @@ -5,7 +5,7 @@ "id": "b79cad40-2c9c-4598-8195-0d6cf525ff87", "metadata": {}, "source": [ - "# Tutorial: Query Classification with TransformersTextRouter and TransformersZeroShotTextRouter\n", + "# Tutorial: Retrieving a Context Window Around a Sentence\n", "\n", "- **Level**: Beginner\n", "- **Time to complete**: 10 minutes\n", @@ -298,7 +298,9 @@ } ], "source": [ - "!wget https://raw.githubusercontent.com/amankharwal/Website-data/master/bbc-news-data.csv" + "%%bash\n", + "\n", + "wget https://raw.githubusercontent.com/amankharwal/Website-data/master/bbc-news-data.csv" ] }, { From 1d141891b9c9b54768928bf9e06a41cc1b44d8af Mon Sep 17 00:00:00 2001 From: Tuana Celik Date: Mon, 21 Oct 2024 10:52:58 +0200 Subject: [PATCH 7/7] fixing data fetch --- .gitignore | 3 +- tutorials/42_Sentence_Window_Retriever.ipynb | 41 +++++++------------- 2 files changed, 16 insertions(+), 28 deletions(-) diff --git a/.gitignore b/.gitignore index 10866894..eb7961ae 100644 --- a/.gitignore +++ b/.gitignore @@ -137,5 +137,4 @@ dmypy.json text/** tutorials/data -saved_models -tutorials/bbc-news-data.csv +saved_models \ No newline at end of file diff --git a/tutorials/42_Sentence_Window_Retriever.ipynb b/tutorials/42_Sentence_Window_Retriever.ipynb index 4a6d3871..333c4caf 100644 --- a/tutorials/42_Sentence_Window_Retriever.ipynb +++ b/tutorials/42_Sentence_Window_Retriever.ipynb @@ -250,7 +250,7 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 5, "id": "82565cae-a730-4cd7-85f3-40be0e77b94d", "metadata": {}, "outputs": [], @@ -275,37 +275,26 @@ }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 3, "id": "4f581e8b-0693-4b09-b82e-71e78cb83f1a", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "--2024-10-16 16:28:55-- https://raw.githubusercontent.com/amankharwal/Website-data/master/bbc-news-data.csv\n", - "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8002::154, 2606:50c0:8003::154, 2606:50c0:8000::154, ...\n", - "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8002::154|:443... connected.\n", - "HTTP request sent, awaiting response... 
200 OK\n", - "Length: 5080260 (4,8M) [text/plain]\n", - "Saving to: ‘bbc-news-data.csv’\n", - "\n", - "bbc-news-data.csv 100%[===================>] 4,84M 20,0MB/s in 0,2s \n", - "\n", - "2024-10-16 16:28:56 (20,0 MB/s) - ‘bbc-news-data.csv’ saved [5080260/5080260]\n", - "\n" - ] - } - ], + "outputs": [], "source": [ - "%%bash\n", + "from pathlib import Path\n", + "import requests\n", + "\n", + "doc = requests.get('https://raw.githubusercontent.com/amankharwal/Website-data/master/bbc-news-data.csv')\n", "\n", - "wget https://raw.githubusercontent.com/amankharwal/Website-data/master/bbc-news-data.csv" + "datafolder = Path('data')\n", + "datafolder.mkdir(exist_ok=True)\n", + "with open(datafolder/'bbc-news-data.csv', 'wb') as f:\n", + " for chunk in doc.iter_content(512):\n", + " f.write(chunk)" ] }, { "cell_type": "code", - "execution_count": 17, + "execution_count": 6, "id": "1ab23051-7df1-49e6-a009-ba187855aab3", "metadata": {}, "outputs": [ @@ -315,13 +304,13 @@ "2225" ] }, - "execution_count": 17, + "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "docs = read_documents(\"bbc-news-data.csv\")\n", + "docs = read_documents(\"data/bbc-news-data.csv\")\n", "len(docs)" ] },