
Commit fddae30

docs: Update README.md and modify Media and Tables Documentation (unclecode#1271)

- Update Table-to-DataFrame Extraction example in README.md
- Replace old method of accessing tables via `result.media` directly with `result.tables` in the documentation
- Remove tables section from links & media page
- Add tables section to crawler result page
1 parent ff6ea41 commit fddae30

File tree: 3 files changed, +74 −85 lines changed

README.md

Lines changed: 4 additions & 4 deletions

````diff
@@ -618,16 +618,16 @@
     # Process results
     raw_df = pd.DataFrame()
     for result in results:
-        if result.success and result.media["tables"]:
+        if result.success and result.tables:
             raw_df = pd.DataFrame(
-                result.media["tables"][0]["rows"],
-                columns=result.media["tables"][0]["headers"],
+                result.tables[0]["rows"],
+                columns=result.tables[0]["headers"],
             )
             break
     print(raw_df.head())

 finally:
-    await crawler.stop()
+    await crawler.close()
 ```

 - **🚀 Browser Pooling**: Pages launch hot with pre-warmed browser instances for lower latency and memory usage
````
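For readers updating their own code to match this commit, the renamed access path can be sketched without a live crawl. This is an illustrative snippet: `sample_table` is hypothetical stand-in data shaped like the `headers`/`rows` dicts the updated README example reads from `result.tables`.

```python
import pandas as pd

# Hypothetical table dict mirroring the shape of a result.tables entry
# ("headers" plus "rows") used in the updated README example.
sample_table = {
    "headers": ["Name", "Age", "Location"],
    "rows": [
        ["John Doe", "34", "New York"],
        ["Jane Smith", "28", "San Francisco"],
    ],
}

# Same construction as the updated README snippet, minus the crawl itself:
# result.tables[0]["rows"] and result.tables[0]["headers"] map directly onto
# the DataFrame constructor's data and columns arguments.
raw_df = pd.DataFrame(sample_table["rows"], columns=sample_table["headers"])
print(raw_df.shape)  # (2, 3)
```

The only change migrating code needs is the lookup path; the dict shape of each table is unchanged, so the DataFrame construction itself is untouched.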

docs/md_v2/core/crawler-result.md

Lines changed: 65 additions & 3 deletions

````diff
@@ -187,7 +187,7 @@

 ---

-## 5. More Fields: Links, Media, and More
+## 5. More Fields: Links, Media, Tables and More

 ### 5.1 `links`

@@ -207,7 +207,69 @@
     print("Image URL:", img["src"], "Alt:", img.get("alt"))
 ```

-### 5.3 `screenshot`, `pdf`, and `mhtml`
+### 5.3 `tables`
+
+The `tables` field contains structured data extracted from HTML tables found on the crawled page. Tables are analyzed based on various criteria to determine if they are actual data tables (as opposed to layout tables), including:
+
+- Presence of thead and tbody sections
+- Use of th elements for headers
+- Column consistency
+- Text density
+- And other factors
+
+Tables that score above the threshold (default: 7) are extracted and stored in `result.tables`.
+
+### Accessing Table Data
+
+```python
+async with AsyncWebCrawler() as crawler:
+    result = await crawler.arun(
+        url="https://example.com/",
+        config=CrawlerRunConfig(
+            table_score_threshold=7  # Minimum score for table detection
+        )
+    )
+
+    if result.success and result.tables:
+        print(f"Found {len(result.tables)} tables")
+
+        for i, table in enumerate(result.tables):
+            print(f"\nTable {i+1}:")
+            print(f"Caption: {table.get('caption', 'No caption')}")
+            print(f"Headers: {table['headers']}")
+            print(f"Rows: {len(table['rows'])}")
+
+            # Print first few rows as example
+            for j, row in enumerate(table['rows'][:3]):
+                print(f"  Row {j+1}: {row}")
+```
+
+### Configuring Table Extraction
+
+You can adjust the sensitivity of the table detection algorithm with:
+
+```python
+config = CrawlerRunConfig(
+    table_score_threshold=5  # Lower value = more tables detected (default: 7)
+)
+```
+
+Each extracted table contains:
+
+- `headers`: Column header names
+- `rows`: List of rows, each containing cell values
+- `caption`: Table caption text (if available)
+- `summary`: Table summary attribute (if specified)
+
+### Table Extraction Tips
+
+- Not all HTML tables are extracted - only those detected as "data tables" vs. layout tables.
+- Tables with inconsistent cell counts, nested tables, or those used purely for layout may be skipped.
+- If you're missing tables, try adjusting the `table_score_threshold` to a lower value (default is 7).
+
+The table detection algorithm scores tables based on features like consistent columns, presence of headers, text density, and more. Tables scoring above the threshold are considered data tables worth extracting.
+
+### 5.4 `screenshot`, `pdf`, and `mhtml`

 If you set `screenshot=True`, `pdf=True`, or `capture_mhtml=True` in **`CrawlerRunConfig`**, then:

@@ -228,7 +290,7 @@

 The MHTML (MIME HTML) format is particularly useful as it captures the entire web page including all of its resources (CSS, images, scripts, etc.) in a single file, making it perfect for archiving or offline viewing.

-### 5.4 `ssl_certificate`
+### 5.5 `ssl_certificate`

 If `fetch_ssl_certificate=True`, `result.ssl_certificate` holds details about the site’s SSL cert, such as issuer, validity dates, etc.
````
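The new docs section describes each `result.tables` entry as a dict with `headers`, `rows`, and optional `caption`/`summary` keys. When pandas is not available, a small helper can turn one of these dicts into row dictionaries. This is an illustrative sketch, not part of Crawl4AI; `table_to_records` and the sample `table` data are hypothetical.

```python
# Illustrative helper (not part of Crawl4AI): convert one table dict of the
# documented shape ({"headers": [...], "rows": [...]}) into row dictionaries,
# which are often handier than positional lists for downstream processing.
def table_to_records(table):
    headers = table.get("headers", [])
    return [dict(zip(headers, row)) for row in table.get("rows", [])]

# Hypothetical sample shaped like a result.tables entry.
table = {
    "headers": ["Name", "Age"],
    "rows": [["Ada", "36"], ["Grace", "45"]],
    "caption": "People",
}
records = table_to_records(table)
print(records[0])  # {'Name': 'Ada', 'Age': '36'}
```

Using `.get()` with defaults keeps the helper safe on tables that lack a `caption` or come back with empty `rows`.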

docs/md_v2/core/link-media.md

Lines changed: 5 additions & 78 deletions

````diff
@@ -520,7 +520,8 @@

 ### 4.1 Accessing `result.media`

-By default, Crawl4AI collects images, audio, video URLs, and data tables it finds on the page. These are stored in `result.media`, a dictionary keyed by media type (e.g., `images`, `videos`, `audio`, `tables`).
+By default, Crawl4AI collects images, audio and video URLs it finds on the page. These are stored in `result.media`, a dictionary keyed by media type (e.g., `images`, `videos`, `audio`).
+**Note: Tables have been moved from `result.media["tables"]` to the new `result.tables` format for better organization and direct access.**

 **Basic Example**:

@@ -534,14 +535,6 @@
         print(f"  Alt text: {img.get('alt', '')}")
         print(f"  Score: {img.get('score')}")
         print(f"  Description: {img.get('desc', '')}\n")
-
-    # Get tables
-    tables = result.media.get("tables", [])
-    print(f"Found {len(tables)} data tables in total.")
-    for i, table in enumerate(tables):
-        print(f"[Table {i}] Caption: {table.get('caption', 'No caption')}")
-        print(f"  Columns: {len(table.get('headers', []))}")
-        print(f"  Rows: {len(table.get('rows', []))}")
 ```

 **Structure Example**:

@@ -568,19 +561,6 @@
     "audio": [
         # Similar structure but with audio-specific fields
     ],
-    "tables": [
-        {
-            "headers": ["Name", "Age", "Location"],
-            "rows": [
-                ["John Doe", "34", "New York"],
-                ["Jane Smith", "28", "San Francisco"],
-                ["Alex Johnson", "42", "Chicago"]
-            ],
-            "caption": "Employee Directory",
-            "summary": "Directory of company employees"
-        },
-        # More tables if present
-    ]
 }
 ```

@@ -608,53 +588,7 @@

 This setting attempts to discard images from outside the primary domain, keeping only those from the site you’re crawling.

-### 3.3 Working with Tables
-
-Crawl4AI can detect and extract structured data from HTML tables. Tables are analyzed based on various criteria to determine if they are actual data tables (as opposed to layout tables), including:
-
-- Presence of thead and tbody sections
-- Use of th elements for headers
-- Column consistency
-- Text density
-- And other factors
-
-Tables that score above the threshold (default: 7) are extracted and stored in `result.media.tables`.
-
-**Accessing Table Data**:
-
-```python
-if result.success:
-    tables = result.media.get("tables", [])
-    print(f"Found {len(tables)} data tables on the page")
-
-    if tables:
-        # Access the first table
-        first_table = tables[0]
-        print(f"Table caption: {first_table.get('caption', 'No caption')}")
-        print(f"Headers: {first_table.get('headers', [])}")
-
-        # Print the first 3 rows
-        for i, row in enumerate(first_table.get('rows', [])[:3]):
-            print(f"Row {i+1}: {row}")
-```
-
-**Configuring Table Extraction**:
-
-You can adjust the sensitivity of the table detection algorithm with:
-
-```python
-crawler_cfg = CrawlerRunConfig(
-    table_score_threshold=5  # Lower value = more tables detected (default: 7)
-)
-```
-
-Each extracted table contains:
-- `headers`: Column header names
-- `rows`: List of rows, each containing cell values
-- `caption`: Table caption text (if available)
-- `summary`: Table summary attribute (if specified)
-
-### 3.4 Additional Media Config
+### 4.3 Additional Media Config

 - **`screenshot`**: Set to `True` if you want a full-page screenshot stored as `base64` in `result.screenshot`.
 - **`pdf`**: Set to `True` if you want a PDF version of the page in `result.pdf`.

@@ -695,7 +629,7 @@

 ---

-## 4. Putting It All Together: Link & Media Filtering
+## 5. Putting It All Together: Link & Media Filtering

 Here’s a combined example demonstrating how to filter out external links, skip certain domains, and exclude external images:

@@ -743,7 +677,7 @@

 ---

-## 5. Common Pitfalls & Tips
+## 6. Common Pitfalls & Tips

 1. **Conflicting Flags**:
    - `exclude_external_links=True` but then also specifying `exclude_social_media_links=True` is typically fine, but understand that the first setting already discards *all* external links. The second becomes somewhat redundant.

@@ -762,10 +696,3 @@

 ---

 **That’s it for Link & Media Analysis!** You’re now equipped to filter out unwanted sites and zero in on the images and videos that matter for your project.
-### Table Extraction Tips
-
-- Not all HTML tables are extracted - only those detected as "data tables" vs. layout tables.
-- Tables with inconsistent cell counts, nested tables, or those used purely for layout may be skipped.
-- If you're missing tables, try adjusting the `table_score_threshold` to a lower value (default is 7).
-
-The table detection algorithm scores tables based on features like consistent columns, presence of headers, text density, and more. Tables scoring above the threshold are considered data tables worth extracting.
````
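Since this commit moves tables from `result.media["tables"]` to `result.tables`, code that must run against both old and new Crawl4AI result objects may want a small fallback. This is a hypothetical compatibility sketch; `get_tables` and the two stand-in result classes are illustrative, not part of the library.

```python
# Hypothetical compatibility helper: prefer the new result.tables attribute,
# fall back to the legacy result.media["tables"] location removed by this commit.
def get_tables(result):
    tables = getattr(result, "tables", None)
    if tables:
        return tables
    media = getattr(result, "media", None) or {}
    return media.get("tables", [])

class OldResult:  # stand-in for a pre-change result object
    media = {"tables": [{"headers": ["A"], "rows": [["1"]]}]}

class NewResult:  # stand-in for the updated result object
    tables = [{"headers": ["B"], "rows": [["2"]]}]
    media = {}

print(get_tables(OldResult())[0]["headers"])  # ['A']
print(get_tables(NewResult())[0]["headers"])  # ['B']
```

The `getattr` defaults make the helper degrade gracefully when neither location exists, returning an empty list rather than raising.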
