π Release v0.2.0: GitHub Repository Scraping
Major feature release adding comprehensive GitHub repository documentation scraping capabilities!
β¨ What's New
GitHub Repository Scraping
Scrape documentation directly from GitHub repositories without cloning them locally. Perfect for:
- Documentation aggregation
- Offline reading preparation
- Multi-format conversion
- Research and analysis
Example Usage
# Using Streamlit UI
streamlit run src/scrape_api_docs/streamlit_app.py
# Enter: https://github.com/bmad-code-org/BMAD-METHOD/tree/main/src/modules/bmm/docs
# Using Python API
from scrape_api_docs import scrape_github_repo
scrape_github_repo('https://github.com/owner/repo/tree/main/docs')π― Key Features
URL Support
β
Full repository: https://github.com/owner/repo
β
Specific branch: https://github.com/owner/repo/tree/develop
β
Folder-specific: https://github.com/owner/repo/tree/main/docs
β
Single file: https://github.com/owner/repo/blob/main/README.md
β
SSH URLs: git@github.com:owner/repo.git
Intelligent Features
- Auto-detection: UI automatically detects GitHub URLs
- Smart filtering: Only processes documentation files (.md, .rst, .txt, etc.)
- Link conversion: Converts relative links to absolute GitHub URLs
- Rate limiting: Handles GitHub API limits (60/hr β 5,000/hr with token)
- Multi-format export: Markdown, PDF, EPUB, HTML, JSON
Performance
- Efficient API usage: ~1 + file_count total requests
- Concurrent file downloads
- Smart caching and session reuse
- Progress tracking in real-time
π¦ What's Included
Core Implementation
-
src/scrape_api_docs/github_scraper.py(653 lines)- GitHub REST API v3 integration
- URL detection and parsing
- Directory traversal
- File content downloading
- Relative link conversion
-
Streamlit UI Integration
- Auto-detection with metadata display
- GitHub token input for higher rate limits
- Max files limiter
- Metadata inclusion toggle
Testing
- 50+ test cases across 9 test classes
- Unit tests for all core functions
- Integration tests with real repositories
- Performance and error handling tests
- All tests passing β
Documentation
- User Guide:
docs/github-scraping-guide.md - API Reference:
docs/github-scraper-implementation.md - Integration Plan:
docs/github-integration-plan.md - Research:
docs/github-api-research.md - Examples:
examples/github_scraper_example.py - Changelog:
CHANGELOG.md
π Statistics
| Metric | Count |
|---|---|
| Files Created | 14 |
| Files Modified | 5 |
| Lines of Code Added | 9,975+ |
| Test Cases | 50+ |
| Documentation Pages | 9 |
| Functions | 7 main functions |
| URL Formats Supported | 6+ |
| File Types Supported | 8+ |
π§ Installation
From Source
git clone https://github.com/thepingdoctor/scrape-api-docs.git
cd scrape-api-docs
poetry installFrom PyPI (coming soon)
pip install scrape-api-docsπ‘οΈ Security & Best Practices
β
Token Safety: Tokens stored in memory only, never persisted
β
URL Validation: All URLs validated before API calls
β
Filename Sanitization: Prevents path traversal attacks
β
Rate Limiting: Respects GitHub's rate limits
β
SSRF Prevention: Validates all generated URLs
β
Error Handling: Graceful degradation on failures
π Changelog
See CHANGELOG.md for complete version history.
Added
- GitHub repository scraping via REST API v3
- Auto-detection of GitHub URLs in Streamlit UI
- Support for multiple GitHub URL formats
- Rate limiting with optional authentication
- Relative link to absolute URL conversion
- Comprehensive test suite (50+ tests)
- 9 documentation files
- Example usage scripts
- CHANGELOG.md for version tracking
Changed
- Version bumped from 0.1.0 to 0.2.0
- Package description updated
- Added GitHub-related keywords
- Enhanced .gitignore for Claude/claude-flow artifacts
- README.md updated with GitHub features
Dependencies
- No new dependencies required (uses existing packages)
π Acknowledgments
This feature was implemented using Claude Code with Hive Mind Collective Intelligence coordination:
- Swarm ID: swarm-1763159183466-ozce54lgp
- Agents: 4 specialized agents (researcher, coder, analyst, tester)
- Methodology: Byzantine consensus with hierarchical coordination
- Execution: Concurrent agent spawning for maximum efficiency
π Known Issues
None at this time. Please report issues at: https://github.com/thepingdoctor/scrape-api-docs/issues
π What's Next
Future enhancements planned:
- GitHub wiki scraping
- GitHub issues as documentation
- GitHub discussions scraping
- Multi-repository batch scraping
- Commit history inclusion
- Author attribution
- CLI support for GitHub URLs
Full Changelog: v0.1.0...v0.2.0