@tsbhangu tsbhangu commented Nov 22, 2025

Summary

  • Auto-extract domain_filter and path_filter from base_url when not provided to prevent accidentally scraping beyond the intended scope
  • Reduce default chunk_size from 1000 to 800 tokens for better search granularity and semantic matching
  • Replace print statements with structured logging for better debuggability
  • Add comprehensive test suite (11 tests) for WebsiteCrawlConfig

Changes

  1. Safety improvement: When users don't provide domain_filter or path_filter, the system now automatically extracts them from the base URL. For example:

    • base_url = "https://docs.example.com/guide/intro" automatically sets:
      • domain_filter = "docs.example.com"
      • path_filter = "/guide/intro"
    • This keeps the crawler from following links outside the intended domain and path
  2. Chunk size optimization: Default chunk size reduced from 1000 to 800 tokens for more granular search results

  3. Logging improvements: Replaced all print statements in the crawler with structured logging:

    • LOGGER.info() for progress updates (crawling URLs, successful chunks)
    • LOGGER.warning() for skipped pages
    • LOGGER.error() for HTTP/request errors
    • Makes debugging stuck indexing jobs much easier
  4. Test coverage: Added tests/utils/website/test_models.py with tests covering:

    • Domain and path extraction from base URL
    • Explicit filter overrides
    • Edge cases (root paths, trailing slashes, subdomains, ports)
    • Default values and overrides
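
The filter extraction described in item 1 can be sketched roughly as follows. This is a minimal illustration built on the standard library's `urllib.parse`, not the code from this PR: the helper name `derive_filters` is hypothetical, and the handling of ports (kept as part of the domain) and of empty paths (treated as `/`) are assumptions that the actual `WebsiteCrawlConfig` may resolve differently.

```python
from typing import Optional, Tuple
from urllib.parse import urlparse


def derive_filters(
    base_url: str,
    domain_filter: Optional[str] = None,
    path_filter: Optional[str] = None,
) -> Tuple[str, str]:
    """Return (domain_filter, path_filter), filling missing values from base_url.

    Hypothetical helper for illustration; not part of the PR's public API.
    """
    parsed = urlparse(base_url)
    # Explicitly provided filters always win over extracted ones.
    if domain_filter is None:
        # Assumption: netloc is used as-is, so any ":port" suffix is kept.
        domain_filter = parsed.netloc
    if path_filter is None:
        # Assumption: a URL with no path is treated as the site root.
        path_filter = parsed.path or "/"
    return domain_filter, path_filter


print(derive_filters("https://docs.example.com/guide/intro"))
# → ('docs.example.com', '/guide/intro')
```

Passing either filter explicitly leaves it untouched, so callers who want a wider or narrower scope than the base URL implies can still set it themselves.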

Test plan

  • All 95 website-related tests passing
  • Mypy type checking passes with no errors
  • Linting and formatting pass with make code-cleanup
  • New tests verify automatic filter extraction behavior

🤖 Generated with Claude Code

- Auto-extract domain_filter and path_filter from base_url when not provided
  to prevent accidentally scraping beyond intended scope
- Reduce default chunk_size from 1000 to 800 tokens for better granularity
- Add comprehensive test suite for WebsiteCrawlConfig

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

vercel bot commented Nov 22, 2025

| Project | Deployment | Preview | Updated (UTC) |
| --- | --- | --- | --- |
| dev.ferndocs.com | Ready | Preview | Nov 22, 2025 3:08pm |
| fern-dashboard | Ready | Preview | Nov 22, 2025 3:08pm |
| fern-dashboard-dev | Ready | Preview | Nov 22, 2025 3:08pm |
| prod-assets.ferndocs.com | Ready | Preview | Nov 22, 2025 3:08pm |
| prod.ferndocs.com | Ready | Preview | Nov 22, 2025 3:08pm |

1 Skipped Deployment

| Project | Deployment | Preview | Updated (UTC) |
| --- | --- | --- | --- |
| fern-platform | Ignored | | Nov 22, 2025 3:08pm |

- Replace all print() calls with LOGGER.info/warning/error
- Improves debuggability of stuck indexing jobs
- Maintains verbose parameter to control logging output
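
A rough sketch of the print-to-logging pattern this commit describes, using only the standard `logging` module. The function `report_page` and its parameters are hypothetical, and the assumption that `verbose` gates only the info-level progress messages is a guess; the crawler in this PR may wire it differently.

```python
import logging

LOGGER = logging.getLogger(__name__)


def report_page(url: str, status_code: int, chunk_count: int, verbose: bool = True) -> None:
    """Log the outcome of crawling one page (hypothetical helper for illustration)."""
    if status_code >= 400:
        # HTTP/request errors are always reported.
        LOGGER.error("Request failed for %s (status %d)", url, status_code)
    elif chunk_count == 0:
        # Pages that yield no content are skipped with a warning.
        LOGGER.warning("Skipping %s: no chunks produced", url)
    elif verbose:
        # Assumption: verbose only controls routine progress output.
        LOGGER.info("Crawled %s: %d chunks", url, chunk_count)
```

Calling `logging.basicConfig(level=logging.INFO)` once at startup makes the info-level progress messages visible; without it, only warnings and errors reach the console by default.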
