feat(fai): improve website indexing safety and reduce default chunk size #5443
Summary

- Derive `domain_filter` and `path_filter` from `base_url` when they are not provided, to prevent accidentally scraping beyond the intended scope
- Reduce the default `chunk_size` of `WebsiteCrawlConfig` from 1000 to 800 tokens for better search granularity and semantic matching

Changes
- Safety improvement: When users don't provide `domain_filter` or `path_filter`, the system now automatically extracts them from the base URL. For example, `base_url = "https://docs.example.com/guide/intro"` automatically sets:
  - `domain_filter = "docs.example.com"`
  - `path_filter = "/guide/intro"`
- Chunk size optimization: Default chunk size reduced from 1000 to 800 tokens for more granular search results
- Logging improvements: Replaced all `print` statements in the crawler with structured logging:
  - `LOGGER.info()` for progress updates (crawling URLs, successful chunks)
  - `LOGGER.warning()` for skipped pages
  - `LOGGER.error()` for HTTP/request errors
- Test coverage: Added `tests/utils/website/test_models.py` with tests covering:

Test plan
- `make code-cleanup`

🤖 Generated with Claude Code