
Commit cb637fb

Merge pull request unclecode#1613 from unclecode/release/v0.7.7

2 parents: a99cd37 + 6244f56

File tree: 75 files changed (+10324 / -667 lines)
.gitignore

Lines changed: 13 additions & 0 deletions
@@ -271,6 +271,8 @@ continue_config.json
 CLAUDE_MONITOR.md
 CLAUDE.md

+.claude/
+
 tests/**/test_site
 tests/**/reports
 tests/**/benchmark_reports
@@ -282,3 +284,14 @@ docs/apps/linkdin/debug*/
 docs/apps/linkdin/samples/insights/*

 scripts/
+
+
+# Databse files
+*.sqlite3
+*.sqlite3-journal
+*.db-journal
+*.db-wal
+*.db-shm
+*.db
+*.rdb
+*.ldb

Dockerfile

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 FROM python:3.12-slim-bookworm AS build

 # C4ai version
-ARG C4AI_VER=0.7.6
+ARG C4AI_VER=0.7.7
 ENV C4AI_VERSION=$C4AI_VER
 LABEL c4ai.version=$C4AI_VER

README.md

Lines changed: 59 additions & 6 deletions
@@ -27,13 +27,13 @@

 Crawl4AI turns the web into clean, LLM ready Markdown for RAG, agents, and data pipelines. Fast, controllable, battle tested by a 50k+ star community.

-[✨ Check out latest update v0.7.6](#-recent-updates)
+[✨ Check out latest update v0.7.7](#-recent-updates)

-**New in v0.7.6**: Complete Webhook Infrastructure for Docker Job Queue API! Real-time notifications for both `/crawl/job` and `/llm/job` endpoints with exponential backoff retry, custom headers, and flexible delivery modes. No more polling! [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.6.md)
+**New in v0.7.7**: Complete Self-Hosting Platform with Real-time Monitoring! Enterprise-grade monitoring dashboard, comprehensive REST API, WebSocket streaming, smart browser pool management, and production-ready observability. Full visibility and control over your crawling infrastructure. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.7.md)

-✨ Recent v0.7.5: Docker Hooks System with function-based API for pipeline customization, Enhanced LLM Integration with custom providers, HTTPS Preservation, and multiple community-reported bug fixes. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.5.md)
+✨ Recent v0.7.6: Complete Webhook Infrastructure for Docker Job Queue API! Real-time notifications for both `/crawl/job` and `/llm/job` endpoints with exponential backoff retry, custom headers, and flexible delivery modes. No more polling! [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.6.md)

-✨ Previous v0.7.4: Revolutionary LLM Table Extraction with intelligent chunking, enhanced concurrency fixes, memory management refactor, and critical stability improvements. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.4.md)
+✨ Previous v0.7.5: Docker Hooks System with function-based API for pipeline customization, Enhanced LLM Integration with custom providers, HTTPS Preservation, and multiple community-reported bug fixes. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.5.md)

 <details>
 <summary>🤓 <strong>My Personal Story</strong></summary>
@@ -296,6 +296,7 @@ pip install -e ".[all]" # Install all optional features
 ### New Docker Features

 The new Docker implementation includes:
+- **Real-time Monitoring Dashboard** with live system metrics and browser pool visibility
 - **Browser pooling** with page pre-warming for faster response times
 - **Interactive playground** to test and generate request code
 - **MCP integration** for direct connection to AI tools like Claude Code
@@ -310,7 +311,8 @@ The new Docker implementation includes:
 docker pull unclecode/crawl4ai:latest
 docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:latest

-# Visit the playground at http://localhost:11235/playground
+# Visit the monitoring dashboard at http://localhost:11235/dashboard
+# Or the playground at http://localhost:11235/playground
 ```

 ### Quick Test
@@ -339,7 +341,7 @@ else:
     result = requests.get(f"http://localhost:11235/task/{task_id}")
 ```

-For more examples, see our [Docker Examples](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/docker_example.py). For advanced configuration, environment variables, and usage examples, see our [Docker Deployment Guide](https://docs.crawl4ai.com/basic/docker-deployment/).
+For more examples, see our [Docker Examples](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/docker_example.py). For advanced configuration, monitoring features, and production deployment, see our [Self-Hosting Guide](https://docs.crawl4ai.com/core/self-hosting/).

 </details>

@@ -550,6 +552,57 @@ async def test_news_crawl():

 ## ✨ Recent Updates

+<details>
+<summary><strong>Version 0.7.7 Release Highlights - The Self-Hosting & Monitoring Update</strong></summary>
+
+- **📊 Real-time Monitoring Dashboard**: Interactive web UI with live system metrics and browser pool visibility
+  ```python
+  # Access the monitoring dashboard
+  # Visit: http://localhost:11235/dashboard
+
+  # Real-time metrics include:
+  # - System health (CPU, memory, network, uptime)
+  # - Active and completed request tracking
+  # - Browser pool management (permanent/hot/cold)
+  # - Janitor cleanup events
+  # - Error monitoring with full context
+  ```
+
+- **🔌 Comprehensive Monitor API**: Complete REST API for programmatic access to all monitoring data
+  ```python
+  import httpx
+
+  async with httpx.AsyncClient() as client:
+      # System health
+      health = await client.get("http://localhost:11235/monitor/health")
+
+      # Request tracking
+      requests = await client.get("http://localhost:11235/monitor/requests")
+
+      # Browser pool status
+      browsers = await client.get("http://localhost:11235/monitor/browsers")
+
+      # Endpoint statistics
+      stats = await client.get("http://localhost:11235/monitor/endpoints/stats")
+  ```
+
+- **⚡ WebSocket Streaming**: Real-time updates every 2 seconds for custom dashboards
+- **🔥 Smart Browser Pool**: 3-tier architecture (permanent/hot/cold) with automatic promotion and cleanup
+- **🧹 Janitor System**: Automatic resource management with event logging
+- **🎮 Control Actions**: Manual browser management (kill, restart, cleanup) via API
+- **📈 Production Metrics**: 6 critical metrics for operational excellence with Prometheus integration
+- **🐛 Critical Bug Fixes**:
+  - Fixed async LLM extraction blocking issue (#1055)
+  - Enhanced DFS deep crawl strategy (#1607)
+  - Fixed sitemap parsing in AsyncUrlSeeder (#1598)
+  - Resolved browser viewport configuration (#1495)
+  - Fixed CDP timing with exponential backoff (#1528)
+  - Security update for pyOpenSSL (>=25.3.0)
+
+[Full v0.7.7 Release Notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.7.md)
+
+</details>
+
 <details>
 <summary><strong>Version 0.7.5 Release Highlights - The Docker Hooks & Security Update</strong></summary>

crawl4ai/__version__.py

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 # crawl4ai/__version__.py

 # This is the version that will be used for stable releases
-__version__ = "0.7.6"
+__version__ = "0.7.7"

 # For nightly builds, this gets set during build process
 __nightly_version__ = None

crawl4ai/async_configs.py

Lines changed: 80 additions & 0 deletions
@@ -1,6 +1,7 @@
 import os
 from typing import Union
 import warnings
+import requests
 from .config import (
     DEFAULT_PROVIDER,
     DEFAULT_PROVIDER_API_KEY,
@@ -649,6 +650,85 @@ def load(data: dict) -> "BrowserConfig":
             return config
         return BrowserConfig.from_kwargs(config)

+    def set_nstproxy(
+        self,
+        token: str,
+        channel_id: str,
+        country: str = "ANY",
+        state: str = "",
+        city: str = "",
+        protocol: str = "http",
+        session_duration: int = 10,
+    ):
+        """
+        Fetch a proxy from NSTProxy API and automatically assign it to proxy_config.
+
+        Get your NSTProxy token from: https://app.nstproxy.com/profile
+
+        Args:
+            token (str): NSTProxy API token.
+            channel_id (str): NSTProxy channel ID.
+            country (str, optional): Country code (default: "ANY").
+            state (str, optional): State code (default: "").
+            city (str, optional): City name (default: "").
+            protocol (str, optional): Proxy protocol ("http" or "socks5"). Defaults to "http".
+            session_duration (int, optional): Session duration in minutes (0 = rotate each request). Defaults to 10.
+
+        Raises:
+            ValueError: If the API response format is invalid.
+            PermissionError: If the API returns an error message.
+        """
+
+        # --- Validate input early ---
+        if not token or not channel_id:
+            raise ValueError("[NSTProxy] token and channel_id are required")
+
+        if protocol not in ("http", "socks5"):
+            raise ValueError(f"[NSTProxy] Invalid protocol: {protocol}")
+
+        # --- Build NSTProxy API URL ---
+        params = {
+            "fType": 2,
+            "count": 1,
+            "channelId": channel_id,
+            "country": country,
+            "protocol": protocol,
+            "sessionDuration": session_duration,
+            "token": token,
+        }
+        if state:
+            params["state"] = state
+        if city:
+            params["city"] = city
+
+        url = "https://api.nstproxy.com/api/v1/generate/apiproxies"
+
+        try:
+            response = requests.get(url, params=params, timeout=10)
+            response.raise_for_status()
+
+            data = response.json()
+
+            # --- Handle API error response ---
+            if isinstance(data, dict) and data.get("err"):
+                raise PermissionError(f"[NSTProxy] API Error: {data.get('msg', 'Unknown error')}")
+
+            if not isinstance(data, list) or not data:
+                raise ValueError("[NSTProxy] Invalid API response — expected a non-empty list")
+
+            proxy_info = data[0]
+
+            # --- Apply proxy config ---
+            self.proxy_config = ProxyConfig(
+                server=f"{protocol}://{proxy_info['ip']}:{proxy_info['port']}",
+                username=proxy_info["username"],
+                password=proxy_info["password"],
+            )
+
+        except Exception as e:
+            print(f"[NSTProxy] ❌ Failed to set proxy: {e}")
+            raise
+
 class VirtualScrollConfig:
     """Configuration for virtual scroll handling.
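
For reference, a minimal usage sketch of the new helper (the token and channel ID below are placeholders, and calling `set_nstproxy()` performs a live request to the NSTProxy API before assigning `proxy_config`):

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig

browser_config = BrowserConfig(headless=True)
browser_config.set_nstproxy(
    token="YOUR_NSTPROXY_TOKEN",    # placeholder - from https://app.nstproxy.com/profile
    channel_id="YOUR_CHANNEL_ID",   # placeholder
    country="US",                   # optional geo targeting
    session_duration=10,            # sticky session in minutes; 0 rotates per request
)

async def main():
    # The fetched proxy is already set on browser_config.proxy_config
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun("https://httpbin.org/ip")
        print(result.markdown)

asyncio.run(main())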

crawl4ai/async_crawler_strategy.py

Lines changed: 3 additions & 2 deletions
@@ -1383,9 +1383,10 @@ async def remove_overlay_elements(self, page: Page) -> None:
         try:
             await self.adapter.evaluate(page,
                 f"""
-                (() => {{
+                (async () => {{
                     try {{
-                        {remove_overlays_js}
+                        const removeOverlays = {remove_overlays_js};
+                        await removeOverlays();
                         return {{ success: true }};
                     }} catch (error) {{
                         return {{
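
The fix matters because `remove_overlays_js` is an async function expression: the wrapper now assigns it to a name and awaits it, so the evaluation reports success only after the cleanup actually runs. A standalone sketch of the same pattern, assuming Playwright's async API rather than crawl4ai's internal adapter (the overlay selector is illustrative only):

import asyncio
from playwright.async_api import async_playwright

# Stand-in for the injected helper: an async function expression, like remove_overlays_js
remove_overlays_js = """async () => {
    document.querySelectorAll('[role="dialog"], .modal, .overlay').forEach(el => el.remove());
}"""

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto("https://example.com")
        # Wrap in an async IIFE and await the helper, so the evaluated expression
        # resolves only once the (possibly async) cleanup has finished.
        result = await page.evaluate(f"""
            (async () => {{
                try {{
                    const removeOverlays = {remove_overlays_js};
                    await removeOverlays();
                    return {{ success: true }};
                }} catch (error) {{
                    return {{ success: false, error: error.message }};
                }}
            }})()
        """)
        print(result)
        await browser.close()

asyncio.run(main())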

crawl4ai/async_url_seeder.py

Lines changed: 59 additions & 13 deletions
@@ -845,6 +845,15 @@ async def _iter_sitemap(self, url: str):
             return

         data = gzip.decompress(r.content) if url.endswith(".gz") else r.content
+        base_url = str(r.url)
+
+        def _normalize_loc(raw: Optional[str]) -> Optional[str]:
+            if not raw:
+                return None
+            normalized = urljoin(base_url, raw.strip())
+            if not normalized:
+                return None
+            return normalized

         # Detect if this is a sitemap index by checking for <sitemapindex> or presence of <sitemap> elements
         is_sitemap_index = False
@@ -857,25 +866,42 @@
             # Use XML parser for sitemaps, not HTML parser
             parser = etree.XMLParser(recover=True)
             root = etree.fromstring(data, parser=parser)
+            # Namespace-agnostic lookups using local-name() so we honor custom or missing namespaces
+            sitemap_loc_nodes = root.xpath("//*[local-name()='sitemap']/*[local-name()='loc']")
+            url_loc_nodes = root.xpath("//*[local-name()='url']/*[local-name()='loc']")

-            # Define namespace for sitemap
-            ns = {'s': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
+            self._log(
+                "debug",
+                "Parsed sitemap {url}: {sitemap_count} sitemap entries, {url_count} url entries discovered",
+                params={
+                    "url": url,
+                    "sitemap_count": len(sitemap_loc_nodes),
+                    "url_count": len(url_loc_nodes),
+                },
+                tag="URL_SEED",
+            )

             # Check for sitemap index entries
-            sitemap_locs = root.xpath('//s:sitemap/s:loc', namespaces=ns)
-            if sitemap_locs:
+            if sitemap_loc_nodes:
                 is_sitemap_index = True
-                for sitemap_elem in sitemap_locs:
-                    loc = sitemap_elem.text.strip() if sitemap_elem.text else ""
+                for sitemap_elem in sitemap_loc_nodes:
+                    loc = _normalize_loc(sitemap_elem.text)
                     if loc:
                         sub_sitemaps.append(loc)

             # If not a sitemap index, get regular URLs
             if not is_sitemap_index:
-                for loc_elem in root.xpath('//s:url/s:loc', namespaces=ns):
-                    loc = loc_elem.text.strip() if loc_elem.text else ""
+                for loc_elem in url_loc_nodes:
+                    loc = _normalize_loc(loc_elem.text)
                     if loc:
                         regular_urls.append(loc)
+                if not regular_urls:
+                    self._log(
+                        "warning",
+                        "No <loc> entries found inside <url> tags for sitemap {url}. The sitemap might be empty or use an unexpected structure.",
+                        params={"url": url},
+                        tag="URL_SEED",
+                    )
         except Exception as e:
             self._log("error", "LXML parsing error for sitemap {url}: {error}",
                       params={"url": url, "error": str(e)}, tag="URL_SEED")
@@ -892,19 +918,39 @@ async def _iter_sitemap(self, url: str):

             # Check for sitemap index entries
             sitemaps = root.findall('.//sitemap')
+            url_entries = root.findall('.//url')
+            self._log(
+                "debug",
+                "ElementTree parsed sitemap {url}: {sitemap_count} sitemap entries, {url_count} url entries discovered",
+                params={
+                    "url": url,
+                    "sitemap_count": len(sitemaps),
+                    "url_count": len(url_entries),
+                },
+                tag="URL_SEED",
+            )
             if sitemaps:
                 is_sitemap_index = True
                 for sitemap in sitemaps:
                     loc_elem = sitemap.find('loc')
-                    if loc_elem is not None and loc_elem.text:
-                        sub_sitemaps.append(loc_elem.text.strip())
+                    loc = _normalize_loc(loc_elem.text if loc_elem is not None else None)
+                    if loc:
+                        sub_sitemaps.append(loc)

             # If not a sitemap index, get regular URLs
             if not is_sitemap_index:
-                for url_elem in root.findall('.//url'):
+                for url_elem in url_entries:
                     loc_elem = url_elem.find('loc')
-                    if loc_elem is not None and loc_elem.text:
-                        regular_urls.append(loc_elem.text.strip())
+                    loc = _normalize_loc(loc_elem.text if loc_elem is not None else None)
+                    if loc:
+                        regular_urls.append(loc)
+                if not regular_urls:
+                    self._log(
+                        "warning",
+                        "No <loc> entries found inside <url> tags for sitemap {url}. The sitemap might be empty or use an unexpected structure.",
+                        params={"url": url},
+                        tag="URL_SEED",
+                    )
         except Exception as e:
             self._log("error", "ElementTree parsing error for sitemap {url}: {error}",
                       params={"url": url, "error": str(e)}, tag="URL_SEED")
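
To see why the namespace-agnostic lookups matter, here is a small standalone sketch (not part of the commit) using the same lxml calls: a sitemap declaring a non-standard namespace yields nothing under the old namespace-bound XPath, while the local-name() query still finds the <loc> nodes, and urljoin resolves relative entries against the sitemap URL:

from urllib.parse import urljoin
from lxml import etree

# Toy sitemap with a non-standard namespace and one relative <loc> entry
data = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="https://example.com/not-the-sitemaps-namespace">
  <url><loc>https://example.com/page-1</loc></url>
  <url><loc>/page-2</loc></url>
</urlset>"""

base_url = "https://example.com/sitemap.xml"
root = etree.fromstring(data, parser=etree.XMLParser(recover=True))

# Old approach: bound to the official namespace, so nothing matches here
ns = {"s": "http://www.sitemaps.org/schemas/sitemap/0.9"}
print(root.xpath("//s:url/s:loc", namespaces=ns))           # []

# New approach: local-name() ignores whatever namespace (or none) is declared
locs = root.xpath("//*[local-name()='url']/*[local-name()='loc']")
print([urljoin(base_url, el.text.strip()) for el in locs])
# ['https://example.com/page-1', 'https://example.com/page-2']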

crawl4ai/async_webcrawler.py

Lines changed: 11 additions & 1 deletion
@@ -617,7 +617,17 @@ async def aprocess_html(
                 else config.chunking_strategy
             )
             sections = chunking.chunk(content)
-            extracted_content = config.extraction_strategy.run(url, sections)
+            # extracted_content = config.extraction_strategy.run(url, sections)
+
+            # Use async version if available for better parallelism
+            if hasattr(config.extraction_strategy, 'arun'):
+                extracted_content = await config.extraction_strategy.arun(url, sections)
+            else:
+                # Fallback to sync version run in thread pool to avoid blocking
+                extracted_content = await asyncio.to_thread(
+                    config.extraction_strategy.run, url, sections
+                )
+
             extracted_content = json.dumps(
                 extracted_content, indent=4, default=str, ensure_ascii=False
             )
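
The dispatch above is the usual "async if available, otherwise off-load to a worker thread" pattern. A self-contained sketch with toy strategy classes (not the real crawl4ai ones), showing that the event loop stays free either way:

import asyncio

class SyncStrategy:
    """Only exposes a blocking run(); it gets pushed onto a worker thread."""
    def run(self, url, sections):
        return {"url": url, "sections": len(sections)}

class AsyncStrategy(SyncStrategy):
    """Also exposes arun(); it is awaited directly on the event loop."""
    async def arun(self, url, sections):
        await asyncio.sleep(0)  # placeholder for real async work (e.g. LLM calls)
        return self.run(url, sections)

async def extract(strategy, url, sections):
    if hasattr(strategy, "arun"):
        return await strategy.arun(url, sections)                    # async path
    return await asyncio.to_thread(strategy.run, url, sections)      # sync path (Python 3.9+)

print(asyncio.run(extract(SyncStrategy(), "https://example.com", ["a", "b"])))
print(asyncio.run(extract(AsyncStrategy(), "https://example.com", ["a", "b"])))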
