
Commit cb637fb

Merge pull request unclecode#1613 from unclecode/release/v0.7.7

2 parents: a99cd37 + 6244f56

File tree: 75 files changed (+10324 / -667 lines)
.gitignore

Lines changed: 13 additions & 0 deletions
@@ -271,6 +271,8 @@ continue_config.json
 CLAUDE_MONITOR.md
 CLAUDE.md

+.claude/
+
 tests/**/test_site
 tests/**/reports
 tests/**/benchmark_reports
@@ -282,3 +284,14 @@ docs/apps/linkdin/debug*/
 docs/apps/linkdin/samples/insights/*

 scripts/
+
+
+# Databse files
+*.sqlite3
+*.sqlite3-journal
+*.db-journal
+*.db-wal
+*.db-shm
+*.db
+*.rdb
+*.ldb

Dockerfile

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 FROM python:3.12-slim-bookworm AS build

 # C4ai version
-ARG C4AI_VER=0.7.6
+ARG C4AI_VER=0.7.7
 ENV C4AI_VERSION=$C4AI_VER
 LABEL c4ai.version=$C4AI_VER

README.md

Lines changed: 59 additions & 6 deletions
@@ -27,13 +27,13 @@

 Crawl4AI turns the web into clean, LLM ready Markdown for RAG, agents, and data pipelines. Fast, controllable, battle tested by a 50k+ star community.

-[✨ Check out latest update v0.7.6](#-recent-updates)
+[✨ Check out latest update v0.7.7](#-recent-updates)

-**New in v0.7.6**: Complete Webhook Infrastructure for Docker Job Queue API! Real-time notifications for both `/crawl/job` and `/llm/job` endpoints with exponential backoff retry, custom headers, and flexible delivery modes. No more polling! [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.6.md)
+**New in v0.7.7**: Complete Self-Hosting Platform with Real-time Monitoring! Enterprise-grade monitoring dashboard, comprehensive REST API, WebSocket streaming, smart browser pool management, and production-ready observability. Full visibility and control over your crawling infrastructure. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.7.md)

-✨ Recent v0.7.5: Docker Hooks System with function-based API for pipeline customization, Enhanced LLM Integration with custom providers, HTTPS Preservation, and multiple community-reported bug fixes. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.5.md)
+✨ Recent v0.7.6: Complete Webhook Infrastructure for Docker Job Queue API! Real-time notifications for both `/crawl/job` and `/llm/job` endpoints with exponential backoff retry, custom headers, and flexible delivery modes. No more polling! [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.6.md)

-✨ Previous v0.7.4: Revolutionary LLM Table Extraction with intelligent chunking, enhanced concurrency fixes, memory management refactor, and critical stability improvements. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.4.md)
+✨ Previous v0.7.5: Docker Hooks System with function-based API for pipeline customization, Enhanced LLM Integration with custom providers, HTTPS Preservation, and multiple community-reported bug fixes. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.5.md)

 <details>
 <summary>🤓 <strong>My Personal Story</strong></summary>
@@ -296,6 +296,7 @@ pip install -e ".[all]" # Install all optional features
 ### New Docker Features

 The new Docker implementation includes:
+- **Real-time Monitoring Dashboard** with live system metrics and browser pool visibility
 - **Browser pooling** with page pre-warming for faster response times
 - **Interactive playground** to test and generate request code
 - **MCP integration** for direct connection to AI tools like Claude Code
@@ -310,7 +311,8 @@ The new Docker implementation includes:
 docker pull unclecode/crawl4ai:latest
 docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:latest

-# Visit the playground at http://localhost:11235/playground
+# Visit the monitoring dashboard at http://localhost:11235/dashboard
+# Or the playground at http://localhost:11235/playground
 ```

 ### Quick Test
@@ -339,7 +341,7 @@ else:
     result = requests.get(f"http://localhost:11235/task/{task_id}")
 ```

-For more examples, see our [Docker Examples](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/docker_example.py). For advanced configuration, environment variables, and usage examples, see our [Docker Deployment Guide](https://docs.crawl4ai.com/basic/docker-deployment/).
+For more examples, see our [Docker Examples](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/docker_example.py). For advanced configuration, monitoring features, and production deployment, see our [Self-Hosting Guide](https://docs.crawl4ai.com/core/self-hosting/).

 </details>

@@ -550,6 +552,57 @@ async def test_news_crawl():

 ## ✨ Recent Updates

+<details>
+<summary><strong>Version 0.7.7 Release Highlights - The Self-Hosting & Monitoring Update</strong></summary>
+
+- **📊 Real-time Monitoring Dashboard**: Interactive web UI with live system metrics and browser pool visibility
+  ```python
+  # Access the monitoring dashboard
+  # Visit: http://localhost:11235/dashboard
+
+  # Real-time metrics include:
+  # - System health (CPU, memory, network, uptime)
+  # - Active and completed request tracking
+  # - Browser pool management (permanent/hot/cold)
+  # - Janitor cleanup events
+  # - Error monitoring with full context
+  ```
+
+- **🔌 Comprehensive Monitor API**: Complete REST API for programmatic access to all monitoring data
+  ```python
+  import httpx
+
+  async with httpx.AsyncClient() as client:
+      # System health
+      health = await client.get("http://localhost:11235/monitor/health")
+
+      # Request tracking
+      requests = await client.get("http://localhost:11235/monitor/requests")
+
+      # Browser pool status
+      browsers = await client.get("http://localhost:11235/monitor/browsers")
+
+      # Endpoint statistics
+      stats = await client.get("http://localhost:11235/monitor/endpoints/stats")
+  ```
+
+- **⚡ WebSocket Streaming**: Real-time updates every 2 seconds for custom dashboards
+- **🔥 Smart Browser Pool**: 3-tier architecture (permanent/hot/cold) with automatic promotion and cleanup
+- **🧹 Janitor System**: Automatic resource management with event logging
+- **🎮 Control Actions**: Manual browser management (kill, restart, cleanup) via API
+- **📈 Production Metrics**: 6 critical metrics for operational excellence with Prometheus integration
+- **🐛 Critical Bug Fixes**:
+  - Fixed async LLM extraction blocking issue (#1055)
+  - Enhanced DFS deep crawl strategy (#1607)
+  - Fixed sitemap parsing in AsyncUrlSeeder (#1598)
+  - Resolved browser viewport configuration (#1495)
+  - Fixed CDP timing with exponential backoff (#1528)
+  - Security update for pyOpenSSL (>=25.3.0)
+
+[Full v0.7.7 Release Notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.7.md)
+
+</details>
+
 <details>
 <summary><strong>Version 0.7.5 Release Highlights - The Docker Hooks & Security Update</strong></summary>

crawl4ai/__version__.py

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 # crawl4ai/__version__.py

 # This is the version that will be used for stable releases
-__version__ = "0.7.6"
+__version__ = "0.7.7"

 # For nightly builds, this gets set during build process
 __nightly_version__ = None

crawl4ai/async_configs.py

Lines changed: 80 additions & 0 deletions
@@ -1,6 +1,7 @@
 import os
 from typing import Union
 import warnings
+import requests
 from .config import (
     DEFAULT_PROVIDER,
     DEFAULT_PROVIDER_API_KEY,
@@ -649,6 +650,85 @@ def load(data: dict) -> "BrowserConfig":
             return config
         return BrowserConfig.from_kwargs(config)

+    def set_nstproxy(
+        self,
+        token: str,
+        channel_id: str,
+        country: str = "ANY",
+        state: str = "",
+        city: str = "",
+        protocol: str = "http",
+        session_duration: int = 10,
+    ):
+        """
+        Fetch a proxy from NSTProxy API and automatically assign it to proxy_config.
+
+        Get your NSTProxy token from: https://app.nstproxy.com/profile
+
+        Args:
+            token (str): NSTProxy API token.
+            channel_id (str): NSTProxy channel ID.
+            country (str, optional): Country code (default: "ANY").
+            state (str, optional): State code (default: "").
+            city (str, optional): City name (default: "").
+            protocol (str, optional): Proxy protocol ("http" or "socks5"). Defaults to "http".
+            session_duration (int, optional): Session duration in minutes (0 = rotate each request). Defaults to 10.
+
+        Raises:
+            ValueError: If the API response format is invalid.
+            PermissionError: If the API returns an error message.
+        """
+
+        # --- Validate input early ---
+        if not token or not channel_id:
+            raise ValueError("[NSTProxy] token and channel_id are required")
+
+        if protocol not in ("http", "socks5"):
+            raise ValueError(f"[NSTProxy] Invalid protocol: {protocol}")
+
+        # --- Build NSTProxy API URL ---
+        params = {
+            "fType": 2,
+            "count": 1,
+            "channelId": channel_id,
+            "country": country,
+            "protocol": protocol,
+            "sessionDuration": session_duration,
+            "token": token,
+        }
+        if state:
+            params["state"] = state
+        if city:
+            params["city"] = city
+
+        url = "https://api.nstproxy.com/api/v1/generate/apiproxies"
+
+        try:
+            response = requests.get(url, params=params, timeout=10)
+            response.raise_for_status()
+
+            data = response.json()
+
+            # --- Handle API error response ---
+            if isinstance(data, dict) and data.get("err"):
+                raise PermissionError(f"[NSTProxy] API Error: {data.get('msg', 'Unknown error')}")
+
+            if not isinstance(data, list) or not data:
+                raise ValueError("[NSTProxy] Invalid API response — expected a non-empty list")
+
+            proxy_info = data[0]
+
+            # --- Apply proxy config ---
+            self.proxy_config = ProxyConfig(
+                server=f"{protocol}://{proxy_info['ip']}:{proxy_info['port']}",
+                username=proxy_info["username"],
+                password=proxy_info["password"],
+            )
+
+        except Exception as e:
+            print(f"[NSTProxy] ❌ Failed to set proxy: {e}")
+            raise
+
 class VirtualScrollConfig:
     """Configuration for virtual scroll handling.
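
For reference, a minimal usage sketch of the new helper (the token and channel ID below are placeholders, and calling `set_nstproxy()` performs a live request to the NSTProxy API before assigning `proxy_config`):

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig

browser_config = BrowserConfig(headless=True)
browser_config.set_nstproxy(
    token="YOUR_NSTPROXY_TOKEN",    # placeholder - from https://app.nstproxy.com/profile
    channel_id="YOUR_CHANNEL_ID",   # placeholder
    country="US",                   # optional geo targeting
    session_duration=10,            # sticky session in minutes; 0 rotates per request
)

async def main():
    # The fetched proxy is already set on browser_config.proxy_config
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun("https://httpbin.org/ip")
        print(result.markdown)

asyncio.run(main())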

crawl4ai/async_crawler_strategy.py

Lines changed: 3 additions & 2 deletions
@@ -1383,9 +1383,10 @@ async def remove_overlay_elements(self, page: Page) -> None:
         try:
             await self.adapter.evaluate(page,
                 f"""
-                (() => {{
+                (async () => {{
                     try {{
-                        {remove_overlays_js}
+                        const removeOverlays = {remove_overlays_js};
+                        await removeOverlays();
                         return {{ success: true }};
                     }} catch (error) {{
                         return {{
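
The fix matters because `remove_overlays_js` is an async function expression: the wrapper now assigns it to a name and awaits it, so the evaluation reports success only after the cleanup actually runs. A standalone sketch of the same pattern, assuming Playwright's async API rather than crawl4ai's internal adapter (the overlay selector is illustrative only):

import asyncio
from playwright.async_api import async_playwright

# Stand-in for the injected helper: an async function expression, like remove_overlays_js
remove_overlays_js = """async () => {
    document.querySelectorAll('[role="dialog"], .modal, .overlay').forEach(el => el.remove());
}"""

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto("https://example.com")
        # Wrap in an async IIFE and await the helper, so the evaluated expression
        # resolves only once the (possibly async) cleanup has finished.
        result = await page.evaluate(f"""
            (async () => {{
                try {{
                    const removeOverlays = {remove_overlays_js};
                    await removeOverlays();
                    return {{ success: true }};
                }} catch (error) {{
                    return {{ success: false, error: error.message }};
                }}
            }})()
        """)
        print(result)
        await browser.close()

asyncio.run(main())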

crawl4ai/async_url_seeder.py

Lines changed: 59 additions & 13 deletions
@@ -845,6 +845,15 @@ async def _iter_sitemap(self, url: str):
             return

         data = gzip.decompress(r.content) if url.endswith(".gz") else r.content
+        base_url = str(r.url)
+
+        def _normalize_loc(raw: Optional[str]) -> Optional[str]:
+            if not raw:
+                return None
+            normalized = urljoin(base_url, raw.strip())
+            if not normalized:
+                return None
+            return normalized

         # Detect if this is a sitemap index by checking for <sitemapindex> or presence of <sitemap> elements
         is_sitemap_index = False
@@ -857,25 +866,42 @@
             # Use XML parser for sitemaps, not HTML parser
             parser = etree.XMLParser(recover=True)
             root = etree.fromstring(data, parser=parser)
+            # Namespace-agnostic lookups using local-name() so we honor custom or missing namespaces
+            sitemap_loc_nodes = root.xpath("//*[local-name()='sitemap']/*[local-name()='loc']")
+            url_loc_nodes = root.xpath("//*[local-name()='url']/*[local-name()='loc']")

-            # Define namespace for sitemap
-            ns = {'s': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
+            self._log(
+                "debug",
+                "Parsed sitemap {url}: {sitemap_count} sitemap entries, {url_count} url entries discovered",
+                params={
+                    "url": url,
+                    "sitemap_count": len(sitemap_loc_nodes),
+                    "url_count": len(url_loc_nodes),
+                },
+                tag="URL_SEED",
+            )

             # Check for sitemap index entries
-            sitemap_locs = root.xpath('//s:sitemap/s:loc', namespaces=ns)
-            if sitemap_locs:
+            if sitemap_loc_nodes:
                 is_sitemap_index = True
-                for sitemap_elem in sitemap_locs:
-                    loc = sitemap_elem.text.strip() if sitemap_elem.text else ""
+                for sitemap_elem in sitemap_loc_nodes:
+                    loc = _normalize_loc(sitemap_elem.text)
                     if loc:
                         sub_sitemaps.append(loc)

             # If not a sitemap index, get regular URLs
             if not is_sitemap_index:
-                for loc_elem in root.xpath('//s:url/s:loc', namespaces=ns):
-                    loc = loc_elem.text.strip() if loc_elem.text else ""
+                for loc_elem in url_loc_nodes:
+                    loc = _normalize_loc(loc_elem.text)
                     if loc:
                         regular_urls.append(loc)
+                if not regular_urls:
+                    self._log(
+                        "warning",
+                        "No <loc> entries found inside <url> tags for sitemap {url}. The sitemap might be empty or use an unexpected structure.",
+                        params={"url": url},
+                        tag="URL_SEED",
+                    )
         except Exception as e:
             self._log("error", "LXML parsing error for sitemap {url}: {error}",
                       params={"url": url, "error": str(e)}, tag="URL_SEED")
@@ -892,19 +918,39 @@ async def _iter_sitemap(self, url: str):

             # Check for sitemap index entries
             sitemaps = root.findall('.//sitemap')
+            url_entries = root.findall('.//url')
+            self._log(
+                "debug",
+                "ElementTree parsed sitemap {url}: {sitemap_count} sitemap entries, {url_count} url entries discovered",
+                params={
+                    "url": url,
+                    "sitemap_count": len(sitemaps),
+                    "url_count": len(url_entries),
+                },
+                tag="URL_SEED",
+            )
             if sitemaps:
                 is_sitemap_index = True
                 for sitemap in sitemaps:
                     loc_elem = sitemap.find('loc')
-                    if loc_elem is not None and loc_elem.text:
-                        sub_sitemaps.append(loc_elem.text.strip())
+                    loc = _normalize_loc(loc_elem.text if loc_elem is not None else None)
+                    if loc:
+                        sub_sitemaps.append(loc)

             # If not a sitemap index, get regular URLs
             if not is_sitemap_index:
-                for url_elem in root.findall('.//url'):
+                for url_elem in url_entries:
                     loc_elem = url_elem.find('loc')
-                    if loc_elem is not None and loc_elem.text:
-                        regular_urls.append(loc_elem.text.strip())
+                    loc = _normalize_loc(loc_elem.text if loc_elem is not None else None)
+                    if loc:
+                        regular_urls.append(loc)
+                if not regular_urls:
+                    self._log(
+                        "warning",
+                        "No <loc> entries found inside <url> tags for sitemap {url}. The sitemap might be empty or use an unexpected structure.",
+                        params={"url": url},
+                        tag="URL_SEED",
+                    )
         except Exception as e:
             self._log("error", "ElementTree parsing error for sitemap {url}: {error}",
                       params={"url": url, "error": str(e)}, tag="URL_SEED")
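
To see why the namespace-agnostic lookups matter, here is a small standalone sketch (not part of the commit) using the same lxml calls: a sitemap declaring a non-standard namespace yields nothing under the old namespace-bound XPath, while the local-name() query still finds the <loc> nodes, and urljoin resolves relative entries against the sitemap URL:

from urllib.parse import urljoin
from lxml import etree

# Toy sitemap with a non-standard namespace and one relative <loc> entry
data = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="https://example.com/not-the-sitemaps-namespace">
  <url><loc>https://example.com/page-1</loc></url>
  <url><loc>/page-2</loc></url>
</urlset>"""

base_url = "https://example.com/sitemap.xml"
root = etree.fromstring(data, parser=etree.XMLParser(recover=True))

# Old approach: bound to the official namespace, so nothing matches here
ns = {"s": "http://www.sitemaps.org/schemas/sitemap/0.9"}
print(root.xpath("//s:url/s:loc", namespaces=ns))           # []

# New approach: local-name() ignores whatever namespace (or none) is declared
locs = root.xpath("//*[local-name()='url']/*[local-name()='loc']")
print([urljoin(base_url, el.text.strip()) for el in locs])
# ['https://example.com/page-1', 'https://example.com/page-2']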

crawl4ai/async_webcrawler.py

Lines changed: 11 additions & 1 deletion
@@ -617,7 +617,17 @@ async def aprocess_html(
                 else config.chunking_strategy
             )
             sections = chunking.chunk(content)
-            extracted_content = config.extraction_strategy.run(url, sections)
+            # extracted_content = config.extraction_strategy.run(url, sections)
+
+            # Use async version if available for better parallelism
+            if hasattr(config.extraction_strategy, 'arun'):
+                extracted_content = await config.extraction_strategy.arun(url, sections)
+            else:
+                # Fallback to sync version run in thread pool to avoid blocking
+                extracted_content = await asyncio.to_thread(
+                    config.extraction_strategy.run, url, sections
+                )
+
             extracted_content = json.dumps(
                 extracted_content, indent=4, default=str, ensure_ascii=False
             )
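
The dispatch above is the usual "async if available, otherwise off-load to a worker thread" pattern. A self-contained sketch with toy strategy classes (not the real crawl4ai ones), showing that the event loop stays free either way:

import asyncio

class SyncStrategy:
    """Only exposes a blocking run(); it gets pushed onto a worker thread."""
    def run(self, url, sections):
        return {"url": url, "sections": len(sections)}

class AsyncStrategy(SyncStrategy):
    """Also exposes arun(); it is awaited directly on the event loop."""
    async def arun(self, url, sections):
        await asyncio.sleep(0)  # placeholder for real async work (e.g. LLM calls)
        return self.run(url, sections)

async def extract(strategy, url, sections):
    if hasattr(strategy, "arun"):
        return await strategy.arun(url, sections)                    # async path
    return await asyncio.to_thread(strategy.run, url, sections)      # sync path (Python 3.9+)

print(asyncio.run(extract(SyncStrategy(), "https://example.com", ["a", "b"])))
print(asyncio.run(extract(AsyncStrategy(), "https://example.com", ["a", "b"])))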
