@thepingdoctor released this 14 Nov 22:51

🚀 Release v0.2.0: GitHub Repository Scraping

Major feature release adding comprehensive GitHub repository documentation scraping capabilities!

✨ What's New

GitHub Repository Scraping

Scrape documentation directly from GitHub repositories without cloning them locally. Perfect for:

  • Documentation aggregation
  • Offline reading preparation
  • Multi-format conversion
  • Research and analysis

Example Usage

# Using Streamlit UI
streamlit run src/scrape_api_docs/streamlit_app.py
# Enter: https://github.com/bmad-code-org/BMAD-METHOD/tree/main/src/modules/bmm/docs

# Using Python API
from scrape_api_docs import scrape_github_repo
scrape_github_repo('https://github.com/owner/repo/tree/main/docs')

🎯 Key Features

URL Support

✅ Full repository: https://github.com/owner/repo
✅ Specific branch: https://github.com/owner/repo/tree/develop
✅ Folder-specific: https://github.com/owner/repo/tree/main/docs
✅ Single file: https://github.com/owner/repo/blob/main/README.md
✅ SSH URLs: git@github.com:owner/repo.git
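
The URL forms above reduce to an owner, repo, ref, and optional path. The sketch below shows one way to parse them; the function name and return shape are illustrative, not the package's actual API:

import re
from typing import Optional

def parse_github_url(url: str) -> Optional[dict]:
    """Classify a GitHub URL into owner, repo, ref, and path (illustrative sketch)."""
    # SSH form: git@github.com:owner/repo.git
    ssh = re.match(r"git@github\.com:(?P<owner>[^/]+)/(?P<repo>.+?)(?:\.git)?$", url)
    if ssh:
        return {"owner": ssh["owner"], "repo": ssh["repo"], "ref": None, "path": ""}
    # HTTPS forms: repo root, /tree/<ref>[/<path>], /blob/<ref>/<path>
    https = re.match(
        r"https://github\.com/(?P<owner>[^/]+)/(?P<repo>[^/]+)"
        r"(?:/(?:tree|blob)/(?P<ref>[^/]+)(?:/(?P<path>.*))?)?/?$",
        url,
    )
    if https:
        return {
            "owner": https["owner"],
            "repo": https["repo"],
            "ref": https["ref"],             # None means the default branch
            "path": https["path"] or "",
        }
    return None  # not a recognized GitHub URL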

Intelligent Features

  • Auto-detection: UI automatically detects GitHub URLs
  • Smart filtering: Only processes documentation files (.md, .rst, .txt, etc.)
  • Link conversion: Converts relative links to absolute GitHub URLs
  • Rate limiting: Handles GitHub API limits (60 requests/hr unauthenticated → 5,000/hr with a token; sketch below)
  • Multi-format export: Markdown, PDF, EPUB, HTML, JSON
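
As a rough illustration of the authenticated limit increase, the standalone sketch below queries GitHub's rate_limit endpoint with and without a token; it is not code from this package:

import requests
from typing import Optional

def check_rate_limit(token: Optional[str] = None) -> dict:
    """Report the core API rate limit: 60/hr unauthenticated, 5,000/hr with a token."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    resp = requests.get("https://api.github.com/rate_limit", headers=headers, timeout=10)
    resp.raise_for_status()
    core = resp.json()["resources"]["core"]
    return {"limit": core["limit"], "remaining": core["remaining"]}

print(check_rate_limit())                  # typically {'limit': 60, ...}
print(check_rate_limit(token="ghp_..."))   # typically {'limit': 5000, ...}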

Performance

  • Efficient API usage: roughly 1 + file_count requests per scrape (one listing call plus one download per file)
  • Concurrent file downloads (see the sketch after this list)
  • Smart caching and session reuse
  • Real-time progress tracking
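
A minimal sketch of the concurrency and session-reuse idea follows; the names and defaults are illustrative, not the scraper's internals:

from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable, Optional
import requests

def download_files(raw_urls: Iterable[str], token: Optional[str] = None,
                   max_workers: int = 8) -> Dict[str, str]:
    """Fetch raw file contents concurrently over a single reusable HTTP session."""
    session = requests.Session()           # connection reuse across all requests
    if token:
        session.headers["Authorization"] = f"Bearer {token}"

    def fetch(url: str):
        resp = session.get(url, timeout=30)
        resp.raise_for_status()
        return url, resp.text

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(fetch, raw_urls))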

📦 What's Included

Core Implementation

  • src/scrape_api_docs/github_scraper.py (653 lines)

    • GitHub REST API v3 integration
    • URL detection and parsing
    • Directory traversal (sketched after this list)
    • File content downloading
    • Relative link conversion
  • Streamlit UI Integration

    • Auto-detection with metadata display
    • GitHub token input for higher rate limits
    • Max files limiter
    • Metadata inclusion toggle
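
One plausible shape for the directory traversal, consistent with the "~1 + file_count requests" figure under Performance, is a single recursive call to the git trees API; the sketch below works under that assumption and is not an excerpt from github_scraper.py:

import requests
from typing import Iterator, Optional, Tuple

DOC_EXTENSIONS = (".md", ".rst", ".txt")  # subset of the supported documentation types

def list_doc_files(owner: str, repo: str, ref: str = "main", folder: str = "",
                   token: Optional[str] = None) -> Iterator[Tuple[str, str]]:
    """Yield (path, raw_url) for documentation files under `folder` using one tree request."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    url = f"https://api.github.com/repos/{owner}/{repo}/git/trees/{ref}"
    resp = requests.get(url, params={"recursive": "1"}, headers=headers, timeout=30)
    resp.raise_for_status()
    for entry in resp.json()["tree"]:
        path = entry["path"]
        if (entry["type"] == "blob" and path.startswith(folder)
                and path.lower().endswith(DOC_EXTENSIONS)):
            yield path, f"https://raw.githubusercontent.com/{owner}/{repo}/{ref}/{path}"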

Testing

  • 50+ test cases across 9 test classes
  • Unit tests for all core functions
  • Integration tests with real repositories
  • Performance and error handling tests
  • All tests passing ✅

Documentation

  • User Guide: docs/github-scraping-guide.md
  • API Reference: docs/github-scraper-implementation.md
  • Integration Plan: docs/github-integration-plan.md
  • Research: docs/github-api-research.md
  • Examples: examples/github_scraper_example.py
  • Changelog: CHANGELOG.md

📊 Statistics

Files Created: 14
Files Modified: 5
Lines of Code Added: 9,975+
Test Cases: 50+
Documentation Pages: 9
Main Functions: 7
URL Formats Supported: 6+
File Types Supported: 8+

🔧 Installation

From Source

git clone https://github.com/thepingdoctor/scrape-api-docs.git
cd scrape-api-docs
poetry install

From PyPI (coming soon)

pip install scrape-api-docs

πŸ›‘οΈ Security & Best Practices

✅ Token Safety: Tokens stored in memory only, never persisted
✅ URL Validation: All URLs validated before API calls
✅ Filename Sanitization: Prevents path traversal attacks (sketch below)
✅ Rate Limiting: Respects GitHub's rate limits
✅ SSRF Prevention: Validates all generated URLs
✅ Error Handling: Graceful degradation on failures
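
For illustration, path-traversal-safe filename handling can look like the sketch below; this is an assumed approach, not a copy of the project's sanitizer:

import os
import re

def sanitize_filename(name: str) -> str:
    """Reduce an arbitrary repository path to a safe local filename (illustrative sketch)."""
    name = os.path.basename(name)                   # drop any directory components
    name = name.replace("..", "_")                  # block parent-directory escapes
    name = re.sub(r"[^A-Za-z0-9._-]", "_", name)    # keep a conservative character set
    return name or "unnamed"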

πŸ“ Changelog

See CHANGELOG.md for complete version history.

Added

  • GitHub repository scraping via REST API v3
  • Auto-detection of GitHub URLs in Streamlit UI
  • Support for multiple GitHub URL formats
  • Rate limiting with optional authentication
  • Relative link to absolute URL conversion
  • Comprehensive test suite (50+ tests)
  • 9 documentation files
  • Example usage scripts
  • CHANGELOG.md for version tracking

Changed

  • Version bumped from 0.1.0 to 0.2.0
  • Package description updated
  • Added GitHub-related keywords
  • Enhanced .gitignore for Claude/claude-flow artifacts
  • README.md updated with GitHub features

Dependencies

  • No new dependencies required (uses existing packages)

πŸ™ Acknowledgments

This feature was implemented using Claude Code with Hive Mind Collective Intelligence coordination:

  • Swarm ID: swarm-1763159183466-ozce54lgp
  • Agents: 4 specialized agents (researcher, coder, analyst, tester)
  • Methodology: Byzantine consensus with hierarchical coordination
  • Execution: Concurrent agent spawning for maximum efficiency

πŸ› Known Issues

None at this time. Please report issues at: https://github.com/thepingdoctor/scrape-api-docs/issues

🚀 What's Next

Future enhancements planned:

  • GitHub wiki scraping
  • GitHub issues as documentation
  • GitHub discussions scraping
  • Multi-repository batch scraping
  • Commit history inclusion
  • Author attribution
  • CLI support for GitHub URLs

Full Changelog: v0.1.0...v0.2.0