Commit 8116b15

Merge pull request unclecode#1596 from unclecode/docs-proxy-security
unclecode#1591 enhance proxy configuration with security, SSL analysis, and rotation examples
2 parents 89cc29f + fe353c4 commit 8116b15

1 file changed: 267 additions & 61 deletions
# Proxy & Security

This guide covers proxy configuration and security features in Crawl4AI, including SSL certificate analysis and proxy rotation strategies.

## Understanding Proxy Configuration

Crawl4AI recommends configuring proxies per request through `CrawlerRunConfig.proxy_config`. This gives you precise control, enables rotation strategies, and keeps examples simple enough to copy, paste, and run.
## Basic Proxy Setup

Configure proxies that apply to each crawl operation:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, ProxyConfig

run_config = CrawlerRunConfig(proxy_config=ProxyConfig(server="http://proxy.example.com:8080"))
# run_config = CrawlerRunConfig(proxy_config={"server": "http://proxy.example.com:8080"})
# run_config = CrawlerRunConfig(proxy_config="http://proxy.example.com:8080")


async def main():
    browser_config = BrowserConfig()
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://example.com", config=run_config)
        print(f"Success: {result.success} -> {result.url}")


if __name__ == "__main__":
    asyncio.run(main())
```
!!! note "Why request-level?"
    `CrawlerRunConfig.proxy_config` keeps each request self-contained, so swapping proxies or rotation strategies is just a matter of building a new run configuration.
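For instance, a minimal sketch of that idea (the proxy endpoints below are placeholders):

```python
from crawl4ai import CrawlerRunConfig, ProxyConfig

# Placeholder endpoints; substitute your own proxy pool
proxy_pool = [
    ProxyConfig(server="http://proxy-a.example.com:8080"),
    ProxyConfig(server="http://proxy-b.example.com:8080"),
]

# One self-contained run config per proxy; pass whichever one
# you want to a given arun() call
run_configs = [CrawlerRunConfig(proxy_config=proxy) for proxy in proxy_pool]
```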
## Supported Proxy Formats

The `ProxyConfig.from_string()` method supports multiple formats:

```python
from crawl4ai import ProxyConfig

# HTTP proxy with authentication
proxy1 = ProxyConfig.from_string("http://user:pass@192.168.1.1:8080")

# HTTPS proxy
proxy2 = ProxyConfig.from_string("https://proxy.example.com:8080")

# SOCKS5 proxy
proxy3 = ProxyConfig.from_string("socks5://proxy.example.com:1080")

# Simple IP:port format
proxy4 = ProxyConfig.from_string("192.168.1.1:8080")

# IP:port:user:pass format
proxy5 = ProxyConfig.from_string("192.168.1.1:8080:user:pass")
```
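Whatever the input format, the parsed config plugs straight into a run configuration:

```python
from crawl4ai import CrawlerRunConfig, ProxyConfig

# Parse once, then hand the result to the request that should use it
proxy = ProxyConfig.from_string("socks5://proxy.example.com:1080")
run_config = CrawlerRunConfig(proxy_config=proxy)
```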
## Authenticated Proxies

For proxies requiring authentication:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, ProxyConfig

run_config = CrawlerRunConfig(
    proxy_config=ProxyConfig(
        server="http://proxy.example.com:8080",
        username="your_username",
        password="your_password",
    )
)
# Or dictionary style:
# run_config = CrawlerRunConfig(proxy_config={
#     "server": "http://proxy.example.com:8080",
#     "username": "your_username",
#     "password": "your_password",
# })


async def main():
    browser_config = BrowserConfig()
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://example.com", config=run_config)
        print(f"Success: {result.success} -> {result.url}")


if __name__ == "__main__":
    asyncio.run(main())
```
## Environment Variable Configuration

Load proxies from environment variables for easy configuration:

```python
import os
from crawl4ai import ProxyConfig, CrawlerRunConfig

# Set environment variable
os.environ["PROXIES"] = "ip1:port1:user1:pass1,ip2:port2:user2:pass2,ip3:port3"

# Load all proxies
proxies = ProxyConfig.from_env()
print(f"Loaded {len(proxies)} proxies")

# Use first proxy
if proxies:
    run_config = CrawlerRunConfig(proxy_config=proxies[0])
```
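`from_env()` is also called later in this guide with an explicit variable name, so you can presumably keep credentials under any variable you like; `SCRAPER_PROXIES` below is just an illustrative name:

```python
import os
from crawl4ai import ProxyConfig

# Illustrative variable name; from_env() with no argument reads "PROXIES"
os.environ["SCRAPER_PROXIES"] = "192.168.1.1:8080:user:pass"
proxies = ProxyConfig.from_env("SCRAPER_PROXIES")
```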
## Rotating Proxies

Crawl4AI supports automatic proxy rotation to distribute requests across multiple proxy servers. Rotation is applied per request using a rotation strategy on `CrawlerRunConfig`.

### Proxy Rotation (recommended)

```python
import asyncio
import re
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, ProxyConfig
from crawl4ai.proxy_strategy import RoundRobinProxyStrategy


async def main():
    # Load proxies from environment
    proxies = ProxyConfig.from_env()
    if not proxies:
        print("No proxies found! Set PROXIES environment variable.")
        return

    # Create rotation strategy
    proxy_strategy = RoundRobinProxyStrategy(proxies)

    # Configure per-request with proxy rotation
    browser_config = BrowserConfig(headless=True, verbose=False)
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        proxy_rotation_strategy=proxy_strategy,
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        urls = ["https://httpbin.org/ip"] * (len(proxies) * 2)  # Test each proxy twice

        print(f"🚀 Testing {len(proxies)} proxies with rotation...")
        results = await crawler.arun_many(urls=urls, config=run_config)

        for i, result in enumerate(results):
            if result.success:
                # Extract IP from response
                ip_match = re.search(r'(?:[0-9]{1,3}\.){3}[0-9]{1,3}', result.html)
                if ip_match:
                    detected_ip = ip_match.group(0)
                    proxy_index = i % len(proxies)
                    expected_ip = proxies[proxy_index].ip

                    print(f"✅ Request {i+1}: Proxy {proxy_index+1} -> IP {detected_ip}")
                    if detected_ip == expected_ip:
                        print("   🎯 IP matches proxy configuration")
                    else:
                        print(f"   ⚠️ IP mismatch (expected {expected_ip})")
                else:
                    print(f"❌ Request {i+1}: Could not extract IP from response")
            else:
                print(f"❌ Request {i+1}: Failed - {result.error_message}")


if __name__ == "__main__":
    asyncio.run(main())
```
## SSL Certificate Analysis

Combine proxy usage with SSL certificate inspection for enhanced security analysis. SSL certificate fetching is configured per request via `CrawlerRunConfig`.

### Per-Request SSL Certificate Analysis

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

run_config = CrawlerRunConfig(
    proxy_config={
        "server": "http://proxy.example.com:8080",
        "username": "user",
        "password": "pass",
    },
    fetch_ssl_certificate=True,  # Enable SSL certificate analysis for this request
)


async def main():
    browser_config = BrowserConfig()
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://example.com", config=run_config)

        if result.success:
            print(f"✅ Crawled via proxy: {result.url}")

            # Analyze SSL certificate
            if result.ssl_certificate:
                cert = result.ssl_certificate
                print("🔒 SSL Certificate Info:")
                print(f"  Issuer: {cert.issuer}")
                print(f"  Subject: {cert.subject}")
                print(f"  Valid until: {cert.valid_until}")
                print(f"  Fingerprint: {cert.fingerprint}")

                # Export certificate
                cert.to_json("certificate.json")
                print("💾 Certificate exported to certificate.json")
            else:
                print("⚠️ No SSL certificate information available")


if __name__ == "__main__":
    asyncio.run(main())
```
## Security Best Practices

### 1. Proxy Rotation for Anonymity

```python
from crawl4ai import CrawlerRunConfig, ProxyConfig
from crawl4ai.proxy_strategy import RoundRobinProxyStrategy

# Use multiple proxies to avoid IP blocking
proxies = ProxyConfig.from_env("PROXIES")
strategy = RoundRobinProxyStrategy(proxies)

# Configure rotation per request (recommended)
run_config = CrawlerRunConfig(proxy_rotation_strategy=strategy)

# For a fixed proxy across all requests, skip the rotation strategy
# and pin a single proxy instead
static_run_config = CrawlerRunConfig(proxy_config=proxies[0])
```
### 2. SSL Certificate Verification

```python
from crawl4ai import CrawlerRunConfig

# Always verify SSL certificates when possible
# Per-request (affects specific requests)
run_config = CrawlerRunConfig(fetch_ssl_certificate=True)
```

### 3. Environment Variable Security

```bash
# Use environment variables for sensitive proxy credentials
# Avoid hardcoding usernames/passwords in code
export PROXIES="ip1:port1:user1:pass1,ip2:port2:user2:pass2"
```

### 4. SOCKS5 for Enhanced Security

```python
from crawl4ai import CrawlerRunConfig

# Prefer SOCKS5 proxies for better protocol support
run_config = CrawlerRunConfig(proxy_config="socks5://proxy.example.com:1080")
```
## Migration from Deprecated `proxy` Parameter

!!! warning "Deprecation Notice"
    The legacy `proxy` argument on `BrowserConfig` is deprecated. Configure proxies through `CrawlerRunConfig.proxy_config` so each request fully describes its network settings.

```python
# Old (deprecated) approach
# from crawl4ai import BrowserConfig
# browser_config = BrowserConfig(proxy="http://proxy.example.com:8080")

# New (preferred) approach
from crawl4ai import CrawlerRunConfig

run_config = CrawlerRunConfig(proxy_config="http://proxy.example.com:8080")
```
### Safe Logging of Proxies

```python
from crawl4ai import ProxyConfig


def safe_proxy_repr(proxy: ProxyConfig) -> str:
    if getattr(proxy, "username", None):
        return f"{proxy.server} (auth: ****)"
    return proxy.server
```
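A quick usage sketch, reusing the helper above (and assuming, as the separate `username`/`password` fields suggest, that `from_string()` keeps credentials out of `server`):

```python
proxy = ProxyConfig.from_string("http://user:pass@192.168.1.1:8080")
print(safe_proxy_repr(proxy))  # e.g. "http://192.168.1.1:8080 (auth: ****)"
```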
## Troubleshooting

### Common Issues

???+ question "Proxy connection failed"
    - Verify the proxy server is reachable from your network.
    - Double-check authentication credentials.
    - Ensure the protocol matches (`http`, `https`, or `socks5`).

???+ question "SSL certificate errors"
    - Some proxies break SSL inspection; switch proxies if you see repeated failures.
    - Consider temporarily disabling certificate fetching to isolate the issue, as in the sketch below.
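A minimal sketch of that isolation step, toggling the same `fetch_ssl_certificate` flag used earlier:

```python
from crawl4ai import CrawlerRunConfig

# Re-run the failing request without certificate fetching; if it then
# succeeds, the problem lies in the SSL analysis path, not the proxy
debug_config = CrawlerRunConfig(fetch_ssl_certificate=False)
```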
???+ question "Environment variables not loading"
    - Confirm `PROXIES` (or your custom env var) is set before running the script.
    - Check formatting: `ip:port:user:pass,ip:port:user:pass`.

???+ question "Proxy rotation not working"
    - Ensure `ProxyConfig.from_env()` actually loaded entries (`len(proxies) > 0`).
    - Attach `proxy_rotation_strategy` to `CrawlerRunConfig`.
    - Validate the proxy definitions you pass into the strategy.
