# Proxy & Security

This guide covers proxy configuration and security features in Crawl4AI, including SSL certificate analysis and proxy rotation strategies.

## Understanding Proxy Configuration

Crawl4AI recommends configuring proxies per request through `CrawlerRunConfig.proxy_config`. This gives you precise control, enables rotation strategies, and keeps examples simple enough to copy, paste, and run.

## Basic Proxy Setup

Configure proxies that apply to each crawl operation:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, ProxyConfig

# proxy_config accepts a ProxyConfig object, a dict, or a plain string:
run_config = CrawlerRunConfig(proxy_config=ProxyConfig(server="http://proxy.example.com:8080"))
# run_config = CrawlerRunConfig(proxy_config={"server": "http://proxy.example.com:8080"})
# run_config = CrawlerRunConfig(proxy_config="http://proxy.example.com:8080")


async def main():
    browser_config = BrowserConfig()
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://example.com", config=run_config)
        print(f"Success: {result.success} -> {result.url}")


if __name__ == "__main__":
    asyncio.run(main())
```

!!! note "Why request-level?"
    `CrawlerRunConfig.proxy_config` keeps each request self-contained, so swapping proxies or rotation strategies is just a matter of building a new run configuration.

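A minimal sketch of that swap, using two placeholder proxy addresses:

```python
from crawl4ai import CrawlerRunConfig

# Each request carries its own network settings, so switching proxies
# just means passing a different config to arun()
config_a = CrawlerRunConfig(proxy_config="http://proxy-a.example.com:8080")
config_b = CrawlerRunConfig(proxy_config="http://proxy-b.example.com:8080")

# Inside an AsyncWebCrawler context:
# result_a = await crawler.arun(url="https://example.com", config=config_a)
# result_b = await crawler.arun(url="https://example.com", config=config_b)
```
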
## Supported Proxy Formats

The `ProxyConfig.from_string()` method supports multiple formats:

```python
from crawl4ai import ProxyConfig

# HTTP proxy with authentication
proxy1 = ProxyConfig.from_string("http://user:pass@192.168.1.1:8080")

# HTTPS proxy
proxy2 = ProxyConfig.from_string("https://proxy.example.com:8080")

# SOCKS5 proxy
proxy3 = ProxyConfig.from_string("socks5://proxy.example.com:1080")

# Simple IP:port format
proxy4 = ProxyConfig.from_string("192.168.1.1:8080")

# IP:port:user:pass format
proxy5 = ProxyConfig.from_string("192.168.1.1:8080:user:pass")
```

## Authenticated Proxies

For proxies requiring authentication:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, ProxyConfig

run_config = CrawlerRunConfig(
    proxy_config=ProxyConfig(
        server="http://proxy.example.com:8080",
        username="your_username",
        password="your_password",
    )
)
# Or dictionary style:
# run_config = CrawlerRunConfig(proxy_config={
#     "server": "http://proxy.example.com:8080",
#     "username": "your_username",
#     "password": "your_password",
# })


async def main():
    browser_config = BrowserConfig()
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://example.com", config=run_config)
        print(f"Success: {result.success} -> {result.url}")


if __name__ == "__main__":
    asyncio.run(main())
```

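Equivalently, the same credentials can travel in a single URL-style string via `from_string()`, as shown in the formats section above:

```python
from crawl4ai import CrawlerRunConfig, ProxyConfig

# Same configuration as above, expressed as one proxy string
run_config = CrawlerRunConfig(
    proxy_config=ProxyConfig.from_string("http://your_username:your_password@proxy.example.com:8080")
)
```
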
## Environment Variable Configuration

Load proxies from environment variables for easy configuration:

```python
import os
from crawl4ai import ProxyConfig, CrawlerRunConfig

# Set environment variable
os.environ["PROXIES"] = "ip1:port1:user1:pass1,ip2:port2:user2:pass2,ip3:port3"

# Load all proxies
proxies = ProxyConfig.from_env()
print(f"Loaded {len(proxies)} proxies")

# Use first proxy
if proxies:
    run_config = CrawlerRunConfig(proxy_config=proxies[0])
```

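`from_env()` also accepts the variable name explicitly (the best-practices section below uses `from_env("PROXIES")`), so a pool can live under any key; `MY_PROXIES` here is just an illustrative name:

```python
import os
from crawl4ai import ProxyConfig

# Illustrative custom variable name, passed explicitly to from_env()
os.environ["MY_PROXIES"] = "192.168.1.1:8080:user:pass,192.168.1.2:8080"
proxies = ProxyConfig.from_env("MY_PROXIES")
```
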
## Rotating Proxies

Crawl4AI supports automatic proxy rotation to distribute requests across multiple proxy servers. Rotation is applied per request by attaching a rotation strategy to `CrawlerRunConfig`.

### Proxy Rotation (recommended)

```python
import asyncio
import re
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, ProxyConfig
from crawl4ai.proxy_strategy import RoundRobinProxyStrategy


async def main():
    # Load proxies from environment
    proxies = ProxyConfig.from_env()
    if not proxies:
        print("No proxies found! Set PROXIES environment variable.")
        return

    # Create rotation strategy
    proxy_strategy = RoundRobinProxyStrategy(proxies)

    # Configure per-request with proxy rotation
    browser_config = BrowserConfig(headless=True, verbose=False)
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        proxy_rotation_strategy=proxy_strategy,
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        urls = ["https://httpbin.org/ip"] * (len(proxies) * 2)  # Test each proxy twice

        print(f"🚀 Testing {len(proxies)} proxies with rotation...")
        results = await crawler.arun_many(urls=urls, config=run_config)

        for i, result in enumerate(results):
            if result.success:
                # Extract the IP address echoed back by httpbin
                ip_match = re.search(r'(?:[0-9]{1,3}\.){3}[0-9]{1,3}', result.html)
                if ip_match:
                    detected_ip = ip_match.group(0)
                    proxy_index = i % len(proxies)
                    expected_ip = proxies[proxy_index].ip

                    print(f"✅ Request {i+1}: Proxy {proxy_index+1} -> IP {detected_ip}")
                    if detected_ip == expected_ip:
                        print("   🎯 IP matches proxy configuration")
                    else:
                        print(f"   ⚠️ IP mismatch (expected {expected_ip})")
                else:
                    print(f"❌ Request {i+1}: Could not extract IP from response")
            else:
                print(f"❌ Request {i+1}: Failed - {result.error_message}")


if __name__ == "__main__":
    asyncio.run(main())
```

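The pool does not have to come from the environment; `RoundRobinProxyStrategy` takes any list of `ProxyConfig` objects. A sketch using the string formats from earlier:

```python
from crawl4ai import ProxyConfig
from crawl4ai.proxy_strategy import RoundRobinProxyStrategy

# Build the pool explicitly instead of reading PROXIES
proxies = [
    ProxyConfig.from_string("http://user:pass@192.168.1.1:8080"),
    ProxyConfig.from_string("socks5://proxy.example.com:1080"),
]
proxy_strategy = RoundRobinProxyStrategy(proxies)
```
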
## SSL Certificate Analysis

Combine proxy usage with SSL certificate inspection for enhanced security analysis. SSL certificate fetching is configured per request via `CrawlerRunConfig`.

### Per-Request SSL Certificate Analysis

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

run_config = CrawlerRunConfig(
    proxy_config={
        "server": "http://proxy.example.com:8080",
        "username": "user",
        "password": "pass",
    },
    fetch_ssl_certificate=True,  # Enable SSL certificate analysis for this request
)


async def main():
    browser_config = BrowserConfig()
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://example.com", config=run_config)

        if result.success:
            print(f"✅ Crawled via proxy: {result.url}")

            # Analyze SSL certificate
            if result.ssl_certificate:
                cert = result.ssl_certificate
                print("🔒 SSL Certificate Info:")
                print(f"   Issuer: {cert.issuer}")
                print(f"   Subject: {cert.subject}")
                print(f"   Valid until: {cert.valid_until}")
                print(f"   Fingerprint: {cert.fingerprint}")

                # Export certificate
                cert.to_json("certificate.json")
                print("💾 Certificate exported to certificate.json")
            else:
                print("⚠️ No SSL certificate information available")


if __name__ == "__main__":
    asyncio.run(main())
```

## Security Best Practices

### 1. Proxy Rotation for Anonymity

```python
from crawl4ai import CrawlerRunConfig, ProxyConfig
from crawl4ai.proxy_strategy import RoundRobinProxyStrategy

# Use multiple proxies to avoid IP blocking
proxies = ProxyConfig.from_env("PROXIES")
strategy = RoundRobinProxyStrategy(proxies)

# Configure rotation per request (recommended)
run_config = CrawlerRunConfig(proxy_rotation_strategy=strategy)

# For a fixed proxy across all requests, set proxy_config directly instead:
# static_run_config = CrawlerRunConfig(proxy_config=proxies[0])
```

### 2. SSL Certificate Verification

```python
from crawl4ai import CrawlerRunConfig

# Fetch and inspect SSL certificates whenever possible;
# this setting applies only to requests using this config
run_config = CrawlerRunConfig(fetch_ssl_certificate=True)
```

### 3. Environment Variable Security

```bash
# Keep sensitive proxy credentials in environment variables
# instead of hardcoding usernames/passwords in code
export PROXIES="ip1:port1:user1:pass1,ip2:port2:user2:pass2"
```

### 4. SOCKS5 for Enhanced Security

```python
from crawl4ai import CrawlerRunConfig

# SOCKS5 tunnels arbitrary TCP traffic rather than just HTTP,
# giving broader protocol support
run_config = CrawlerRunConfig(proxy_config="socks5://proxy.example.com:1080")
```

## Migration from the Deprecated `proxy` Parameter

!!! warning "Deprecation Notice"
    The legacy `proxy` argument on `BrowserConfig` is deprecated. Configure proxies through `CrawlerRunConfig.proxy_config` so each request fully describes its network settings.

```python
# Old (deprecated) approach
# from crawl4ai import BrowserConfig
# browser_config = BrowserConfig(proxy="http://proxy.example.com:8080")

# New (preferred) approach
from crawl4ai import CrawlerRunConfig

run_config = CrawlerRunConfig(proxy_config="http://proxy.example.com:8080")
```

### Safe Logging of Proxies

```python
from crawl4ai import ProxyConfig

def safe_proxy_repr(proxy: ProxyConfig) -> str:
    """Return a log-friendly description that never leaks credentials."""
    if getattr(proxy, "username", None):
        return f"{proxy.server} (auth: ****)"
    return proxy.server
```

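For example, assuming `from_string()` separates the credentials from the server address (so `server` holds only the URL):

```python
proxy = ProxyConfig.from_string("http://user:pass@192.168.1.1:8080")
print(safe_proxy_repr(proxy))  # e.g. http://192.168.1.1:8080 (auth: ****)
```
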
## Troubleshooting

### Common Issues

???+ question "Proxy connection failed"
    - Verify the proxy server is reachable from your network.
    - Double-check authentication credentials.
    - Ensure the protocol matches (`http`, `https`, or `socks5`).

???+ question "SSL certificate errors"
    - Some proxies break SSL inspection; switch proxies if you see repeated failures.
    - Consider temporarily disabling certificate fetching to isolate the issue.

???+ question "Environment variables not loading"
    - Confirm `PROXIES` (or your custom env var) is set before running the script.
    - Check formatting: `ip:port:user:pass,ip:port:user:pass`.

???+ question "Proxy rotation not working"
    - Ensure `ProxyConfig.from_env()` actually loaded entries (`len(proxies) > 0`).
    - Attach `proxy_rotation_strategy` to `CrawlerRunConfig`.
    - Validate the proxy definitions you pass into the strategy.
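
If none of these resolve the issue, a quick diagnostic is to load the pool and print what actually arrived, without leaking credentials. A minimal sketch:

```python
from crawl4ai import ProxyConfig

proxies = ProxyConfig.from_env()  # reads the PROXIES environment variable
print(f"Loaded {len(proxies)} proxies")
for p in proxies:
    auth = " (auth: ****)" if getattr(p, "username", None) else ""
    print(f"  {p.server}{auth}")
```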