Feature: Scrapy plugin for Pydoll (scrapy-pydoll) #248

@thalissonvs

Description


Make it trivial to use Pydoll inside Scrapy without custom glue code. The plugin should let a spider opt in per request to drive a headless tab, run small actions (clicks, waits), and return a rendered HtmlResponse that plays nicely with Scrapy selectors. It should feel like standard Scrapy, just powered by Pydoll when needed.

Proposed API

  • Installable optional plugin: pip install scrapy-pydoll
  • Enable via settings:
PYDOLL_ENABLED = True
PYDOLL_CONCURRENCY = 2
PYDOLL_BROWSER_OPTIONS = { "geolocation": "GB", "headless": True }
  • Per-request opt-in (meta) or helper Request:
yield scrapy.Request(
    url,
    meta={
        "pydoll": {
            "actions": [
                {"type": "wait", "for": "networkidle"},
                {"type": "click", "selector": "#show-more"},
            ],
            "timeout": 15000,
        },
        "cookiejar": "sessionA",
    },
    callback=self.parse_page,
)

# or
yield PydollRequest(url, actions=[...], timeout=15000)
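A PydollRequest helper could stay a thin wrapper that packs its arguments into the same meta shape the middleware reads, so both opt-in styles stay interchangeable. A minimal sketch of that meta-building logic, kept free of Scrapy imports so the dict shape is easy to unit-test (the pydoll_meta name and signature are hypothetical, not existing API):

```python
def pydoll_meta(actions=None, timeout=15000, extra_meta=None):
    """Build the request meta dict the proposed middleware would read.

    `actions` is a list of action dicts like {"type": "click", "selector": "#x"};
    `timeout` is milliseconds; `extra_meta` lets callers merge keys such as
    "cookiejar" without clobbering the "pydoll" entry.
    """
    meta = dict(extra_meta or {})
    meta["pydoll"] = {
        "actions": list(actions or []),
        "timeout": timeout,
    }
    return meta
```

A PydollRequest subclass could then just call this in __init__ and pass the result to scrapy.Request, e.g. scrapy.Request(url, meta=pydoll_meta([...], timeout=15000), callback=...).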

Requirements (MVP)

  • Deterministic rendered HtmlResponse compatible with .css() / .xpath()
  • Wait strategies: networkidle, selector, sleep(ms)
  • Small action set: click, type, scroll
  • Per-request headers/cookies merged with Pydoll context
  • Session reuse by cookiejar; graceful shutdown on spider_closed
  • Timeouts, retries surfaced as IgnoreRequest or similar
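Since malformed action dicts should fail fast in the spider process rather than deep inside the browser, the middleware could validate the small action vocabulary up front. A sketch against the MVP action set above (the function name, error messages, and exact per-action keys are illustrative assumptions):

```python
# Allowed keys per action type, per the MVP action set above.
ALLOWED_ACTIONS = {
    "wait": {"for"},                 # networkidle, a selector, or sleep(ms)
    "click": {"selector"},
    "type": {"selector", "text"},
    "scroll": {"selector", "y"},
}

def validate_actions(actions):
    """Raise ValueError on the first malformed action; return actions unchanged."""
    for i, action in enumerate(actions):
        kind = action.get("type")
        if kind not in ALLOWED_ACTIONS:
            raise ValueError(f"action {i}: unknown type {kind!r}")
        unknown = set(action) - ALLOWED_ACTIONS[kind] - {"type"}
        if unknown:
            raise ValueError(f"action {i}: unexpected keys {sorted(unknown)}")
    return actions
```

Rejections at this layer could then surface as the IgnoreRequest behaviour the requirements call for, without ever opening a tab.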

Follow-ups

  • Optionally attach Markdown (return_markdown=True) once exporter exists
  • Network record on error (integration with recorder feature)
  • Page bundle snapshot on exception for offline debugging
  • WebPoet/Scrapy-Poet provider to inject a Tab or rendered HTML

Example Spider

import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"

    def start_requests(self):
        yield scrapy.Request(
            "https://example.com/products",
            meta={"pydoll": {
                "actions": [{"type": "wait", "for": "networkidle"}],
                "timeout": 15000
            }},
            callback=self.parse_list
        )

    def parse_list(self, response):
        for href in response.css(".item a::attr(href)").getall():
            yield scrapy.Request(
                response.urljoin(href),
                meta={"pydoll": {"actions": [{"type": "click", "selector": "#accept"}]}},
                callback=self.parse_item
            )

    def parse_item(self, response):
        yield {
            "title": response.css("h1::text").get(),
            "price": response.css(".price::text").get(),
        }
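The spider's "networkidle" wait is one of the three MVP wait strategies (networkidle, selector, sleep(ms)). Dispatching the "for" value to a strategy could be as small as the sketch below (the parse_wait name and the (strategy, arg) tuple shape are assumptions for illustration):

```python
import re

def parse_wait(value):
    """Map a wait spec string to a (strategy, arg) pair:
    "networkidle" -> ("networkidle", None), "sleep(500)" -> ("sleep", 500),
    and anything else is treated as a CSS selector to wait for."""
    if value == "networkidle":
        return ("networkidle", None)
    m = re.fullmatch(r"sleep\((\d+)\)", value)
    if m:
        return ("sleep", int(m.group(1)))
    return ("selector", value)
```

Keeping the spec a plain string keeps the meta dict JSON-serializable, which matters if requests are ever persisted to a disk queue.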

Labels: enhancement (New feature or request), future planning (Ideas or features proposed for future development)