Commit 263ac89

Enhance proxy configuration documentation with security features, SSL analysis, and improved examples
1 parent d56b0eb commit 263ac89

1 file changed: +288 −53 lines

# Proxy & Security

This guide covers proxy configuration and security features in Crawl4AI, including SSL certificate analysis and proxy rotation strategies.

## Understanding Proxy Configuration

Crawl4AI supports proxy configuration at two levels:

### BrowserConfig.proxy_config
Sets the proxy at the **browser level** - affects all pages/tabs in that browser instance. Use this when:
- You want all crawls from this browser to use the same proxy
- You're using a single proxy for the entire session
- You need persistent proxy settings across multiple crawls

### CrawlerRunConfig.proxy_config
Sets the proxy at the **request level** - can be different for each crawl operation. Use this when:
- You want per-request proxy control
- You're implementing proxy rotation
- Different URLs need different proxies

## Basic Proxy Setup

### Browser-Level Proxy (BrowserConfig)

Configure proxies that apply to the entire browser session:

```python
from crawl4ai import AsyncWebCrawler, BrowserConfig, ProxyConfig

# Using dictionary configuration
browser_config = BrowserConfig(proxy_config={
    "server": "http://proxy.example.com:8080"
})

# Using ProxyConfig object
proxy = ProxyConfig(server="http://proxy.example.com:8080")
browser_config = BrowserConfig(proxy_config=proxy)

# Using string (auto-parsed)
browser_config = BrowserConfig(proxy_config="http://proxy.example.com:8080")

async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun(url="https://example.com")
```

### Request-Level Proxy (CrawlerRunConfig)

Configure proxies that can be customized per crawl operation:

```python
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, ProxyConfig

# Using dictionary configuration
run_config = CrawlerRunConfig(proxy_config={
    "server": "http://proxy.example.com:8080"
})

# Using ProxyConfig object
proxy = ProxyConfig(server="http://proxy.example.com:8080")
run_config = CrawlerRunConfig(proxy_config=proxy)

# Using string (auto-parsed)
run_config = CrawlerRunConfig(proxy_config="http://proxy.example.com:8080")

browser_config = BrowserConfig()
async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun(url="https://example.com", config=run_config)
```

!!! note "Priority Order"
    When both `BrowserConfig.proxy_config` and `CrawlerRunConfig.proxy_config` are set, `CrawlerRunConfig.proxy_config` takes precedence for that specific crawl operation.
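
For example, here is a minimal sketch (with placeholder proxy addresses) showing the request-level setting winning for one specific crawl:

```python
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

# Browser-level default proxy (placeholder address)
browser_config = BrowserConfig(proxy_config="http://default-proxy.example.com:8080")

# Request-level override: this crawl goes through the override proxy,
# because CrawlerRunConfig.proxy_config takes precedence
run_config = CrawlerRunConfig(proxy_config="http://override-proxy.example.com:8080")

async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun(url="https://example.com", config=run_config)
```
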
## Supported Proxy Formats

The `ProxyConfig.from_string()` method supports multiple formats:

```python
from crawl4ai import ProxyConfig

# HTTP proxy with authentication
proxy1 = ProxyConfig.from_string("http://user:pass@192.168.1.1:8080")

# HTTPS proxy
proxy2 = ProxyConfig.from_string("https://proxy.example.com:8080")

# SOCKS5 proxy
proxy3 = ProxyConfig.from_string("socks5://proxy.example.com:1080")

# Simple IP:port format
proxy4 = ProxyConfig.from_string("192.168.1.1:8080")

# IP:port:user:pass format
proxy5 = ProxyConfig.from_string("192.168.1.1:8080:user:pass")
```

## Authenticated Proxies

For proxies requiring authentication:

```python
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, ProxyConfig

# Using dictionary
run_config = CrawlerRunConfig(proxy_config={
    "server": "http://proxy.example.com:8080",
    "username": "your_username",
    "password": "your_password"
})

# Using ProxyConfig object
proxy = ProxyConfig(
    server="http://proxy.example.com:8080",
    username="your_username",
    password="your_password"
)
run_config = CrawlerRunConfig(proxy_config=proxy)

browser_config = BrowserConfig()
async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun(url="https://example.com", config=run_config)
```

## Environment Variable Configuration

Load proxies from environment variables for easy configuration:

```python
import os
from crawl4ai import ProxyConfig, CrawlerRunConfig

# Set environment variable
os.environ["PROXIES"] = "ip1:port1:user1:pass1,ip2:port2:user2:pass2,ip3:port3"

# Load all proxies
proxies = ProxyConfig.from_env()
print(f"Loaded {len(proxies)} proxies")

# Use first proxy
if proxies:
    run_config = CrawlerRunConfig(proxy_config=proxies[0])
```

## Rotating Proxies

Crawl4AI supports automatic proxy rotation to distribute requests across multiple proxy servers. Rotation is applied per request using a rotation strategy on `CrawlerRunConfig`.

### Proxy Rotation (recommended)
```python
import asyncio
import re

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, ProxyConfig
from crawl4ai.proxy_strategy import RoundRobinProxyStrategy

async def main():
    # Load proxies from environment
    proxies = ProxyConfig.from_env()
    if not proxies:
        print("No proxies found! Set PROXIES environment variable.")
        return

    # Create rotation strategy
    proxy_strategy = RoundRobinProxyStrategy(proxies)

    # Configure per-request with proxy rotation
    browser_config = BrowserConfig(headless=True, verbose=False)
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        proxy_rotation_strategy=proxy_strategy,
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        urls = ["https://httpbin.org/ip"] * (len(proxies) * 2)  # Test each proxy twice

        print(f"🚀 Testing {len(proxies)} proxies with rotation...")
        results = await crawler.arun_many(urls=urls, config=run_config)

        for i, result in enumerate(results):
            if result.success:
                # Extract IP from response
                ip_match = re.search(r'(?:[0-9]{1,3}\.){3}[0-9]{1,3}', result.html)
                if ip_match:
                    detected_ip = ip_match.group(0)
                    proxy_index = i % len(proxies)
                    expected_ip = proxies[proxy_index].ip

                    print(f"✅ Request {i+1}: Proxy {proxy_index+1} -> IP {detected_ip}")
                    if detected_ip == expected_ip:
                        print("   🎯 IP matches proxy configuration")
                    else:
                        print(f"   ⚠️ IP mismatch (expected {expected_ip})")
                else:
                    print(f"❌ Request {i+1}: Could not extract IP from response")
            else:
                print(f"❌ Request {i+1}: Failed - {result.error_message}")

asyncio.run(main())
```
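
`RoundRobinProxyStrategy` cycles through the proxy list in order, which is why the example above expects request `i` to be served by proxy `i % len(proxies)`.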

## SSL Certificate Analysis

Combine proxy usage with SSL certificate inspection for enhanced security analysis. SSL certificate fetching is configured per request via `CrawlerRunConfig`.

### Per-Request SSL Certificate Analysis
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

# Configure proxy with SSL certificate fetching per request
run_config = CrawlerRunConfig(
    proxy_config={
        "server": "http://proxy.example.com:8080",
        "username": "user",
        "password": "pass"
    },
    fetch_ssl_certificate=True  # Enable SSL certificate analysis for this request
)

browser_config = BrowserConfig()
async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun(url="https://example.com", config=run_config)

    if result.success:
        print(f"✅ Crawled via proxy: {result.url}")

        # Analyze SSL certificate
        if result.ssl_certificate:
            cert = result.ssl_certificate
            print("🔒 SSL Certificate Info:")
            print(f"  Issuer: {cert.issuer}")
            print(f"  Subject: {cert.subject}")
            print(f"  Valid until: {cert.valid_until}")
            print(f"  Fingerprint: {cert.fingerprint}")

            # Export certificate
            cert.to_json("certificate.json")
            print("💾 Certificate exported to certificate.json")
        else:
            print("⚠️ No SSL certificate information available")
```

## Security Best Practices

### 1. Proxy Rotation for Anonymity
```python
# Use multiple proxies to avoid IP blocking
proxies = ProxyConfig.from_env("PROXIES")
strategy = RoundRobinProxyStrategy(proxies)

# Configure rotation per request (recommended)
run_config = CrawlerRunConfig(proxy_rotation_strategy=strategy)

# For a single static proxy across all requests,
# set a fixed ProxyConfig at the browser level instead:
# browser_config = BrowserConfig(proxy_config=proxies[0])
```

### 2. SSL Certificate Verification
```python
# Always verify SSL certificates when possible
# Per-request (affects specific requests)
run_config = CrawlerRunConfig(fetch_ssl_certificate=True)
```

### 3. Environment Variable Security
```bash
# Use environment variables for sensitive proxy credentials
# Avoid hardcoding usernames/passwords in code
export PROXIES="ip1:port1:user1:pass1,ip2:port2:user2:pass2"
```

### 4. SOCKS5 for Enhanced Security
```python
# Prefer SOCKS5 proxies for better protocol support
# Browser-level
browser_config = BrowserConfig(proxy_config="socks5://proxy.example.com:1080")

# Or request-level
run_config = CrawlerRunConfig(proxy_config="socks5://proxy.example.com:1080")
```

## Migration from Deprecated `proxy` Parameter

!!! warning "Deprecation Notice"
    The `proxy` parameter in `BrowserConfig` is deprecated. Use `proxy_config` in either `BrowserConfig` or `CrawlerRunConfig` instead.

```python
# Old (deprecated)
browser_config = BrowserConfig(proxy="http://proxy.example.com:8080")

# You will see a warning similar to:
# DeprecationWarning: BrowserConfig.proxy is deprecated and ignored. Use proxy_config instead.

# New (recommended) - Browser-level default
browser_config = BrowserConfig(proxy_config="http://proxy.example.com:8080")

# Or request-level override (takes precedence per request)
run_config = CrawlerRunConfig(proxy_config="http://proxy.example.com:8080")
```

### Safe Logging of Proxies
```python
from crawl4ai import ProxyConfig

def safe_proxy_repr(proxy: ProxyConfig) -> str:
    """Return a log-safe representation that never exposes credentials."""
    if getattr(proxy, "username", None):
        return f"{proxy.server} (auth: ****)"
    return proxy.server
```
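
A quick usage sketch (the exact post-parse `server` formatting shown in the comment is an assumption; the point is that credentials never reach the logs):

```python
proxy = ProxyConfig.from_string("http://user:pass@192.168.1.1:8080")
print(safe_proxy_repr(proxy))  # e.g. "http://192.168.1.1:8080 (auth: ****)"
```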

## Troubleshooting

### Common Issues

1. **Proxy Connection Failed**
   - Verify the proxy server is accessible
   - Check authentication credentials
   - Ensure the correct protocol (http/https/socks5)

2. **SSL Certificate Errors**
   - Some proxies may interfere with SSL inspection
   - Try a different proxy or disable SSL verification if necessary

3. **Environment Variables Not Loading**
   - Ensure the PROXIES variable is set correctly
   - Check comma separation and format: `ip:port:user:pass,ip:port:user:pass`

4. **Proxy Rotation Not Working** (see the diagnostic sketch below)
   - Verify proxies are loaded: `len(proxies) > 0`
   - Check the proxy strategy is set on `CrawlerRunConfig` via `proxy_rotation_strategy`
   - Ensure `proxy_config` is a valid `ProxyConfig` (when using a static proxy)
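
If rotation still misbehaves, a minimal diagnostic sketch like the following can help confirm the proxies were parsed at all (it assumes proxies come from the PROXIES environment variable, as above):

```python
from crawl4ai import ProxyConfig

# 1. Confirm proxies actually loaded from the environment
proxies = ProxyConfig.from_env()
print(f"Loaded {len(proxies)} proxies")
if not proxies:
    raise SystemExit("PROXIES is empty or malformed (expected ip:port:user:pass,...)")

# 2. Inspect what was parsed (server only, never credentials)
for i, p in enumerate(proxies, 1):
    print(f"Proxy {i}: {p.server}")
```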