From 1a825c30317184803cc2a8da9e7c086eb2f3bcc2 Mon Sep 17 00:00:00 2001
From: Insutanto <19406290+Insutanto@users.noreply.github.com>
Date: Sun, 24 Aug 2025 10:41:14 +0900
Subject: [PATCH] Add codebase overview documentation

---
 README.md                 | 35 ++++++++++++++++++++++++++++++++++-
 docs/codebase_overview.md | 32 ++++++++++++++++++++++++++++++++
 2 files changed, 66 insertions(+), 1 deletion(-)
 create mode 100644 docs/codebase_overview.md

diff --git a/README.md b/README.md
index 00ed431..4400138 100644
--- a/README.md
+++ b/README.md
@@ -190,4 +190,37 @@ scrapy crawl
 
 `scrapy-rabbitmq-link`([scrapy-rabbitmq-link](https://github.com/mbriliauskas/scrapy-rabbitmq-link))
 
-`scrapy-redis`([scrapy-redis](https://github.com/rmax/scrapy-redis))
\ No newline at end of file
+`scrapy-redis`([scrapy-redis](https://github.com/rmax/scrapy-redis))
+
+## Codebase Overview
+
+The Scrapy-Distributed project enables building distributed crawlers on top of Scrapy. It supplies scheduler, queue, middleware, pipeline, and duplicate-filtering components that coordinate work across RabbitMQ, Kafka, and RedisBloom.
+
+### Repository Layout
+- `scrapy_distributed/` – library modules
+  - `amqp_utils/` – RabbitMQ helpers
+  - `common/` – queue configuration objects
+  - `dupefilters/` – Redis Bloom-based duplicate filter
+  - `middlewares/` – downloader middlewares for ACK/requeue
+  - `pipelines/` – item pipelines that publish items to queues
+  - `queues/` – RabbitMQ and Kafka queue implementations
+  - `redis_utils/` – Redis connection helpers
+  - `schedulers/` – distributed scheduler combining queue and dupe filter
+  - `spiders/` – mixins and example spiders
+- `examples/` – small Scrapy projects showing how to use RabbitMQ and Kafka
+- `tests/` – unit tests for key components
+
+### Key Components
+- **DistributedScheduler** combines queue-based scheduling with a Redis Bloom duplicate filter.
+- **RabbitQueue** and **KafkaQueue** serialize Scrapy requests for publishing and consuming through RabbitMQ or Kafka.
+- **RedisBloomDupeFilter** tracks seen URLs using Redis Bloom filters.
+- **RabbitMiddleware** and **RabbitPipeline** handle message acknowledgement and item publishing.
+
+### Example Usage
+Example projects under `examples/` demonstrate how to configure the scheduler, queue, middleware, and pipeline. Supporting services can be launched with the provided `docker-compose.dev.yaml`.
+
+### Learning Path
+1. Run the examples to see the distributed scheduler in action.
+2. Review the `schedulers` and `queues` modules to understand request flow.
+3. Customize queue and Bloom filter settings via the objects in `common` and `redis_utils`.
+4. Extend the middlewares or pipelines to integrate with additional services.
diff --git a/docs/codebase_overview.md b/docs/codebase_overview.md
new file mode 100644
index 0000000..79139d5
--- /dev/null
+++ b/docs/codebase_overview.md
@@ -0,0 +1,32 @@
+# Codebase Overview
+
+The Scrapy-Distributed project enables building distributed crawlers on top of Scrapy. It supplies scheduler, queue, middleware, pipeline, and duplicate-filtering components that coordinate work across RabbitMQ, Kafka, and RedisBloom.
+
+## Repository Layout
+- `scrapy_distributed/` – library modules
+  - `amqp_utils/` – RabbitMQ helpers
+  - `common/` – queue configuration objects
+  - `dupefilters/` – Redis Bloom-based duplicate filter
+  - `middlewares/` – downloader middlewares for ACK/requeue
+  - `pipelines/` – item pipelines that publish items to queues
+  - `queues/` – RabbitMQ and Kafka queue implementations
+  - `redis_utils/` – Redis connection helpers
+  - `schedulers/` – distributed scheduler combining queue and dupe filter
+  - `spiders/` – mixins and example spiders
+- `examples/` – small Scrapy projects showing how to use RabbitMQ and Kafka
+- `tests/` – unit tests for key components
+
+## Key Components
+- **DistributedScheduler** combines queue-based scheduling with a Redis Bloom duplicate filter.
+- **RabbitQueue** and **KafkaQueue** serialize Scrapy requests for publishing and consuming through RabbitMQ or Kafka.
+- **RedisBloomDupeFilter** tracks seen URLs using Redis Bloom filters.
+- **RabbitMiddleware** and **RabbitPipeline** handle message acknowledgement and item publishing.
+
+## Example Usage
+Example projects under `examples/` demonstrate how to configure the scheduler, queue, middleware, and pipeline. Supporting services can be launched with the provided `docker-compose.dev.yaml`.
+
+## Learning Path
+1. Run the examples to see the distributed scheduler in action.
+2. Review the `schedulers` and `queues` modules to understand request flow.
+3. Customize queue and Bloom filter settings via the objects in `common` and `redis_utils`.
+4. Extend the middlewares or pipelines to integrate with additional services.
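
Note for reviewers: as a companion to the Example Usage sections above, here is a minimal `settings.py` sketch of the kind of wiring the `examples/` projects perform. The dotted class paths, setting names, and connection URLs below are assumptions inferred from the repository layout (`schedulers/`, `queues/`, `dupefilters/`, `middlewares/`, `pipelines/`), not verbatim from the project; consult the `examples/` projects for the exact names.

```python
# Hypothetical Scrapy settings for a RabbitMQ-backed distributed crawl.
# All dotted paths and setting names below are assumptions based on the
# package layout described in the overview; verify them against examples/.

# Replace Scrapy's default scheduler with the distributed one, which
# pulls requests from a shared queue and filters duplicates via RedisBloom.
SCHEDULER = "scrapy_distributed.schedulers.DistributedScheduler"
SCHEDULER_QUEUE_CLASS = "scrapy_distributed.queues.amqp.RabbitQueue"

# Broker and Redis endpoints (assumed URL-style settings); the services
# can be started locally with docker-compose.dev.yaml.
RABBITMQ_CONNECTION_PARAMETERS = "amqp://guest:guest@localhost:5672/"
BLOOM_DUPEFILTER_REDIS_URL = "redis://localhost:6379/"
DUPEFILTER_CLASS = "scrapy_distributed.dupefilters.redis_bloom.RedisBloomDupeFilter"

# Middleware to ACK/requeue messages and a pipeline to publish items.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_distributed.middlewares.amqp.RabbitMiddleware": 542,
}
ITEM_PIPELINES = {
    "scrapy_distributed.pipelines.amqp.RabbitPipeline": 301,
}
```

With settings along these lines, multiple crawler processes on different hosts share one request queue and one Bloom filter, so each URL is fetched by at most one worker.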