@thepingdoctor released this 14 Nov 22:51

🚀 Release v0.2.0: GitHub Repository Scraping

Major feature release adding comprehensive GitHub repository documentation scraping capabilities!

✨ What's New

GitHub Repository Scraping

Scrape documentation directly from GitHub repositories without cloning them locally. Perfect for:

  • Documentation aggregation
  • Offline reading preparation
  • Multi-format conversion
  • Research and analysis

Example Usage

# Using Streamlit UI
streamlit run src/scrape_api_docs/streamlit_app.py
# Enter: https://github.com/bmad-code-org/BMAD-METHOD/tree/main/src/modules/bmm/docs

# Using Python API
from scrape_api_docs import scrape_github_repo
scrape_github_repo('https://github.com/owner/repo/tree/main/docs')

🎯 Key Features

URL Support

✅ Full repository: https://github.com/owner/repo
✅ Specific branch: https://github.com/owner/repo/tree/develop
✅ Folder-specific: https://github.com/owner/repo/tree/main/docs
✅ Single file: https://github.com/owner/repo/blob/main/README.md
✅ SSH URLs: git@github.com:owner/repo.git
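
The URL forms above reduce to an owner, repo, ref, and optional path. The sketch below shows one way to parse them; the function name and return shape are illustrative, not the package's actual API:

import re
from typing import Optional

def parse_github_url(url: str) -> Optional[dict]:
    """Classify a GitHub URL into owner, repo, ref, and path (illustrative sketch)."""
    # SSH form: git@github.com:owner/repo.git
    ssh = re.match(r"git@github\.com:(?P<owner>[^/]+)/(?P<repo>.+?)(?:\.git)?$", url)
    if ssh:
        return {"owner": ssh["owner"], "repo": ssh["repo"], "ref": None, "path": ""}
    # HTTPS forms: repo root, /tree/<ref>[/<path>], /blob/<ref>/<path>
    https = re.match(
        r"https://github\.com/(?P<owner>[^/]+)/(?P<repo>[^/]+)"
        r"(?:/(?:tree|blob)/(?P<ref>[^/]+)(?:/(?P<path>.*))?)?/?$",
        url,
    )
    if https:
        return {
            "owner": https["owner"],
            "repo": https["repo"],
            "ref": https["ref"],             # None means the default branch
            "path": https["path"] or "",
        }
    return None  # not a recognized GitHub URL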

Intelligent Features

  • Auto-detection: UI automatically detects GitHub URLs
  • Smart filtering: Only processes documentation files (.md, .rst, .txt, etc.)
  • Link conversion: Converts relative links to absolute GitHub URLs
  • Rate limiting: Handles GitHub API limits (60 requests/hr unauthenticated → 5,000/hr with a token; sketch below)
  • Multi-format export: Markdown, PDF, EPUB, HTML, JSON
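
As a rough illustration of the authenticated limit increase, the standalone sketch below queries GitHub's rate_limit endpoint with and without a token; it is not code from this package:

import requests
from typing import Optional

def check_rate_limit(token: Optional[str] = None) -> dict:
    """Report the core API rate limit: 60/hr unauthenticated, 5,000/hr with a token."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    resp = requests.get("https://api.github.com/rate_limit", headers=headers, timeout=10)
    resp.raise_for_status()
    core = resp.json()["resources"]["core"]
    return {"limit": core["limit"], "remaining": core["remaining"]}

print(check_rate_limit())                  # typically {'limit': 60, ...}
print(check_rate_limit(token="ghp_..."))   # typically {'limit': 5000, ...}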

Performance

  • Efficient API usage: roughly 1 + file_count requests per scrape (one listing call plus one download per file)
  • Concurrent file downloads (see the sketch after this list)
  • Smart caching and session reuse
  • Real-time progress tracking
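
A minimal sketch of the concurrency and session-reuse idea follows; the names and defaults are illustrative, not the scraper's internals:

from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable, Optional
import requests

def download_files(raw_urls: Iterable[str], token: Optional[str] = None,
                   max_workers: int = 8) -> Dict[str, str]:
    """Fetch raw file contents concurrently over a single reusable HTTP session."""
    session = requests.Session()           # connection reuse across all requests
    if token:
        session.headers["Authorization"] = f"Bearer {token}"

    def fetch(url: str):
        resp = session.get(url, timeout=30)
        resp.raise_for_status()
        return url, resp.text

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(fetch, raw_urls))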

📦 What's Included

Core Implementation

  • src/scrape_api_docs/github_scraper.py (653 lines)

    • GitHub REST API v3 integration
    • URL detection and parsing
    • Directory traversal (sketched after this list)
    • File content downloading
    • Relative link conversion
  • Streamlit UI Integration

    • Auto-detection with metadata display
    • GitHub token input for higher rate limits
    • Max files limiter
    • Metadata inclusion toggle
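
One plausible shape for the directory traversal, consistent with the "~1 + file_count requests" figure under Performance, is a single recursive call to the git trees API; the sketch below works under that assumption and is not an excerpt from github_scraper.py:

import requests
from typing import Iterator, Optional, Tuple

DOC_EXTENSIONS = (".md", ".rst", ".txt")  # subset of the supported documentation types

def list_doc_files(owner: str, repo: str, ref: str = "main", folder: str = "",
                   token: Optional[str] = None) -> Iterator[Tuple[str, str]]:
    """Yield (path, raw_url) for documentation files under `folder` using one tree request."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    url = f"https://api.github.com/repos/{owner}/{repo}/git/trees/{ref}"
    resp = requests.get(url, params={"recursive": "1"}, headers=headers, timeout=30)
    resp.raise_for_status()
    for entry in resp.json()["tree"]:
        path = entry["path"]
        if (entry["type"] == "blob" and path.startswith(folder)
                and path.lower().endswith(DOC_EXTENSIONS)):
            yield path, f"https://raw.githubusercontent.com/{owner}/{repo}/{ref}/{path}"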

Testing

  • 50+ test cases across 9 test classes
  • Unit tests for all core functions
  • Integration tests with real repositories
  • Performance and error handling tests
  • All tests passing ✅

Documentation

  • User Guide: docs/github-scraping-guide.md
  • API Reference: docs/github-scraper-implementation.md
  • Integration Plan: docs/github-integration-plan.md
  • Research: docs/github-api-research.md
  • Examples: examples/github_scraper_example.py
  • Changelog: CHANGELOG.md

📊 Statistics

Files Created: 14
Files Modified: 5
Lines of Code Added: 9,975+
Test Cases: 50+
Documentation Pages: 9
Main Functions: 7
URL Formats Supported: 6+
File Types Supported: 8+

🔧 Installation

From Source

git clone https://github.com/thepingdoctor/scrape-api-docs.git
cd scrape-api-docs
poetry install

From PyPI (coming soon)

pip install scrape-api-docs

πŸ›‘οΈ Security & Best Practices

✅ Token Safety: Tokens stored in memory only, never persisted
✅ URL Validation: All URLs validated before API calls
✅ Filename Sanitization: Prevents path traversal attacks (sketch below)
✅ Rate Limiting: Respects GitHub's rate limits
✅ SSRF Prevention: Validates all generated URLs
✅ Error Handling: Graceful degradation on failures
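
For illustration, path-traversal-safe filename handling can look like the sketch below; this is an assumed approach, not a copy of the project's sanitizer:

import os
import re

def sanitize_filename(name: str) -> str:
    """Reduce an arbitrary repository path to a safe local filename (illustrative sketch)."""
    name = os.path.basename(name)                   # drop any directory components
    name = name.replace("..", "_")                  # block parent-directory escapes
    name = re.sub(r"[^A-Za-z0-9._-]", "_", name)    # keep a conservative character set
    return name or "unnamed"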

πŸ“ Changelog

See CHANGELOG.md for complete version history.

Added

  • GitHub repository scraping via REST API v3
  • Auto-detection of GitHub URLs in Streamlit UI
  • Support for multiple GitHub URL formats
  • Rate limiting with optional authentication
  • Relative link to absolute URL conversion
  • Comprehensive test suite (50+ tests)
  • 9 documentation files
  • Example usage scripts
  • CHANGELOG.md for version tracking

Changed

  • Version bumped from 0.1.0 to 0.2.0
  • Package description updated
  • Added GitHub-related keywords
  • Enhanced .gitignore for Claude/claude-flow artifacts
  • README.md updated with GitHub features

Dependencies

  • No new dependencies required (uses existing packages)

πŸ™ Acknowledgments

This feature was implemented using Claude Code with Hive Mind Collective Intelligence coordination:

  • Swarm ID: swarm-1763159183466-ozce54lgp
  • Agents: 4 specialized agents (researcher, coder, analyst, tester)
  • Methodology: Byzantine consensus with hierarchical coordination
  • Execution: Concurrent agent spawning for maximum efficiency

πŸ› Known Issues

None at this time. Please report issues at: https://github.com/thepingdoctor/scrape-api-docs/issues

🚀 What's Next

Future enhancements planned:

  • GitHub wiki scraping
  • GitHub issues as documentation
  • GitHub discussions scraping
  • Multi-repository batch scraping
  • Commit history inclusion
  • Author attribution
  • CLI support for GitHub URLs

Full Changelog: v0.1.0...v0.2.0