+The `tables` field contains structured data extracted from HTML tables found on the crawled page. Tables are analyzed based on various criteria to determine if they are actual data tables (as opposed to layout tables), including:
+
+- Presence of thead and tbody sections
+- Use of th elements for headers
+- Column consistency
+- Text density
+- And other factors
+
+Tables that score above the threshold (default: 7) are extracted and stored in result.tables.
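To make the scoring idea concrete, here is a minimal sketch of how features like the ones listed could be combined into a score and compared against a threshold. The weights, the `TableFeatures`, `score_table`, and `is_data_table` names, and the sample HTML are all assumptions made for illustration; this is not Crawl4AI's actual detection code, and it ignores nested tables.

```python
from html.parser import HTMLParser


class TableFeatures(HTMLParser):
    """Collects a few structural features from a single <table> snippet."""

    def __init__(self):
        super().__init__()
        self.has_thead = False
        self.has_tbody = False
        self.th_count = 0
        self.row_widths = []  # number of cells seen in each <tr>
        self.text_chars = 0   # characters of text found inside cells
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "thead":
            self.has_thead = True
        elif tag == "tbody":
            self.has_tbody = True
        elif tag == "tr":
            self.row_widths.append(0)
        elif tag in ("td", "th"):
            if tag == "th":
                self.th_count += 1
            if self.row_widths:
                self.row_widths[-1] += 1
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.text_chars += len(data.strip())


def score_table(html: str) -> int:
    """Hypothetical scoring; the weights are illustrative only."""
    f = TableFeatures()
    f.feed(html)
    score = 0
    if f.has_thead:
        score += 2
    if f.has_tbody:
        score += 1
    if f.th_count > 0:
        score += 2
    # Column consistency: more than one row, all with the same cell count.
    if len(f.row_widths) > 1 and len(set(f.row_widths)) == 1:
        score += 2
    # Text density: on average at least 3 characters of text per cell.
    cells = sum(f.row_widths)
    if cells and f.text_chars / cells >= 3:
        score += 2
    return score


def is_data_table(html: str, threshold: int = 7) -> bool:
    # Assumption: the threshold is treated as a minimum score.
    return score_table(html) >= threshold


data_table = (
    "<table><thead><tr><th>Name</th><th>Price</th></tr></thead>"
    "<tbody><tr><td>Widget</td><td>9.99</td></tr>"
    "<tr><td>Gadget</td><td>19.50</td></tr></tbody></table>"
)
layout_table = (
    "<table><tr><td><div></div></td></tr>"
    "<tr><td></td><td></td></tr></table>"
)

print(score_table(data_table), is_data_table(data_table))      # 9 True
print(score_table(layout_table), is_data_table(layout_table))  # 0 False
```

With these illustrative weights, lowering `threshold` lets more borderline tables through, which is the same lever the `table_score_threshold` setting exposes.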
+
+### Accessing Table data:
+```python
+async with AsyncWebCrawler() as crawler:
+    result = await crawler.arun(
+        url="https://example.com/",
+        config=CrawlerRunConfig(
+            table_score_threshold=7  # Minimum score for table detection
+- Not all HTML tables are extracted - only those detected as "data tables" vs. layout tables.
+- Tables with inconsistent cell counts, nested tables, or those used purely for layout may be skipped.
+- If you're missing tables, try adjusting the `table_score_threshold` to a lower value (default is 7).
+
+The table detection algorithm scores tables based on features like consistent columns, presence of headers, text density, and more. Tables scoring above the threshold are considered data tables worth extracting.
+
+
+### 5.4 `screenshot`, `pdf`, and `mhtml`
 If you set `screenshot=True`, `pdf=True`, or `capture_mhtml=True` in **`CrawlerRunConfig`**, then:

@@ -228,7 +290,7 @@ if result.mhtml:

 The MHTML (MIME HTML) format is particularly useful as it captures the entire web page including all of its resources (CSS, images, scripts, etc.) in a single file, making it perfect for archiving or offline viewing.

-### 5.4 `ssl_certificate`
+### 5.5 `ssl_certificate`

 If `fetch_ssl_certificate=True`, `result.ssl_certificate` holds details about the site’s SSL cert, such as issuer, validity dates, etc.
docs/md_v2/core/link-media.md: 5 additions & 78 deletions
@@ -520,7 +520,8 @@ This approach is handy when you still want external links but need to block cert
 ### 4.1 Accessing `result.media`

-By default, Crawl4AI collects images, audio, video URLs, and data tables it finds on the page. These are stored in `result.media`, a dictionary keyed by media type (e.g., `images`, `videos`, `audio`, `tables`).
+By default, Crawl4AI collects images, audio and video URLs it finds on the page. These are stored in `result.media`, a dictionary keyed by media type (e.g., `images`, `videos`, `audio`).
+**Note: Tables have been moved from `result.media["tables"]` to the new `result.tables` format for better organization and direct access.**
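For code that has to handle both locations during the transition, a small helper along these lines may be useful. This is only a sketch: `get_tables` is a hypothetical name, and the `SimpleNamespace` objects and their table dictionaries are stand-ins for real crawl results, used so the snippet runs without a crawler.

```python
from types import SimpleNamespace


def get_tables(result):
    """Return extracted tables, preferring the new `result.tables` location."""
    tables = getattr(result, "tables", None)
    if tables is not None:
        return tables
    # Fall back to the old location, result.media["tables"], if present.
    media = getattr(result, "media", None) or {}
    return media.get("tables", [])


# Stand-in objects for illustration only; real results come from crawler.arun().
new_style = SimpleNamespace(tables=[{"label": "new"}], media={"images": []})
old_style = SimpleNamespace(media={"tables": [{"label": "old"}]})

print(get_tables(new_style))  # [{'label': 'new'}]
print(get_tables(old_style))  # [{'label': 'old'}]
```

Checking `is not None` rather than truthiness means an empty `result.tables` list is returned as-is instead of triggering the fallback.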
 **Basic Example**:

@@ -534,14 +535,6 @@ if result.success:
         print(f" Alt text: {img.get('alt', '')}")
         print(f" Score: {img.get('score')}")
         print(f" Description: {img.get('desc', '')}\n")
-
-    # Get tables
-    tables = result.media.get("tables", [])
-    print(f"Found {len(tables)} data tables in total.")
 This setting attempts to discard images from outside the primary domain, keeping only those from the site you’re crawling.

-### 3.3 Working with Tables
-
-Crawl4AI can detect and extract structured data from HTML tables. Tables are analyzed based on various criteria to determine if they are actual data tables (as opposed to layout tables), including:
-
-- Presence of thead and tbody sections
-- Use of th elements for headers
-- Column consistency
-- Text density
-- And other factors
-
-Tables that score above the threshold (default: 7) are extracted and stored in `result.media.tables`.
-
-**Accessing Table Data**:
-
-```python
-if result.success:
-    tables = result.media.get("tables", [])
-    print(f"Found {len(tables)} data tables on the page")
 - **`screenshot`**: Set to `True` if you want a full-page screenshot stored as `base64` in `result.screenshot`.
 - **`pdf`**: Set to `True` if you want a PDF version of the page in `result.pdf`.
@@ -695,7 +629,7 @@ The MHTML format is particularly useful because:
 ---

-## 4. Putting It All Together: Link & Media Filtering
+## 5. Putting It All Together: Link & Media Filtering

 Here’s a combined example demonstrating how to filter out external links, skip certain domains, and exclude external images:
@@ -743,7 +677,7 @@ if __name__ == "__main__":
 ---

-## 5. Common Pitfalls & Tips
+## 6. Common Pitfalls & Tips

 1. **Conflicting Flags**:
    - `exclude_external_links=True` but then also specifying `exclude_social_media_links=True` is typically fine, but understand that the first setting already discards *all* external links. The second becomes somewhat redundant.
@@ -762,10 +696,3 @@ if __name__ == "__main__":
 ---

 **That’s it for Link & Media Analysis!** You’re now equipped to filter out unwanted sites and zero in on the images and videos that matter for your project.
-### Table Extraction Tips
-
-- Not all HTML tables are extracted - only those detected as "data tables" vs. layout tables.
-- Tables with inconsistent cell counts, nested tables, or those used purely for layout may be skipped.
-- If you're missing tables, try adjusting the `table_score_threshold` to a lower value (default is 7).
-
-The table detection algorithm scores tables based on features like consistent columns, presence of headers, text density, and more. Tables scoring above the threshold are considered data tables worth extracting.