Commit 015c07f

Fine tune filtering.
1 parent ff52b6b commit 015c07f

5 files changed: +711 -19 lines changed

docs/html-to-plain-text.md

Lines changed: 217 additions & 0 deletions
@@ -0,0 +1,217 @@
# HTML to Plain Text: A Better Approach

When you need to convert HTML to plain text, `strip_tags()` only deletes the markup, so block structure is lost and adjacent text runs together. This document shows how `HtmlToDjot` combined with `Profile` filtering produces more readable output.

## The Problem with strip_tags()

```php
$html = '<table><tr><th>Name</th><th>Type</th></tr><tr><td>Djot</td><td>Markup</td></tr></table>';
echo strip_tags($html);
// Output: "NameTypeDjotMarkup" - unreadable!
```

Common issues:
- Table cells run together with no separation
- List items lose their structure
- Headings blend into body text
- Images disappear entirely (including alt text)
- Blockquotes lose their visual distinction
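
The issues listed above are easy to reproduce with nothing but PHP's built-in `strip_tags()`. The following minimal sketch uses made-up sample inputs (the `$samples` array and its keys are illustrative only) and prints what survives for each kind of element:

```php
<?php
declare(strict_types=1);

// Illustrative inputs only; the keys are labels chosen for this sketch.
$samples = [
    'list'       => '<ul><li>First</li><li>Second</li></ul>',
    'heading'    => '<h2>Title</h2><p>Body text.</p>',
    'image'      => '<img src="photo.jpg" alt="A beautiful photo">',
    'blockquote' => '<blockquote><p>A wise quote.</p></blockquote>',
];

// strip_tags() removes the markup and keeps only the text nodes,
// so nothing is inserted where a block boundary used to be.
$stripped = array_map('strip_tags', $samples);

foreach ($stripped as $label => $text) {
    printf("%-10s => %s\n", $label, var_export($text, true));
}
```

The list items and the heading fuse with their neighbours, the image yields an empty string, and the blockquote is indistinguishable from an ordinary paragraph.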

## Possible Solution: HtmlToDjot + Profile Filtering

```php
use Djot\Converter\HtmlToDjot;
use Djot\DjotConverter;
use Djot\NodeType;
use Djot\Profile;

function htmlToPlainText(string $html): string
{
    // Step 1: Convert HTML to Djot markup
    $htmlConverter = new HtmlToDjot();
    $djot = $htmlConverter->convert($html);

    // Step 2: Create a plain-text profile (only text and paragraphs allowed)
    $plainTextProfile = (new Profile())
        ->allowInline([NodeType::TEXT, NodeType::SOFT_BREAK, NodeType::HARD_BREAK])
        ->allowBlock([NodeType::PARAGRAPH]);

    // Step 3: Render with the plain-text profile (disallowed elements are converted to text)
    $djotConverter = new DjotConverter(profile: $plainTextProfile);
    $plainHtml = $djotConverter->convert($djot);

    // Step 4: Turn <br> and paragraph boundaries into newlines, strip remaining tags, decode entities
    $text = strip_tags(str_replace(['<br>', '</p><p>'], ["\n", "\n\n"], $plainHtml));

    return html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}
```

This custom profile is more restrictive than `Profile::minimal()`: it allows only plain text and paragraphs, and converts all other elements (headings, lists, tables, etc.) to their semantic text representation.
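
Step 4 relies only on PHP built-ins, so its behavior can be checked in isolation. The sketch below runs the same `str_replace()` / `strip_tags()` / `html_entity_decode()` chain on a hand-written HTML fragment. The fragment and the helper name `flattenRenderedHtml()` are assumptions for illustration, not output captured from `DjotConverter`. If the renderer separates paragraphs with a newline (`</p>` followed by a line break and `<p>`), the `'</p><p>'` search string will not match, so the replacement list is worth checking against the real output.

```php
<?php
declare(strict_types=1);

// Hypothetical helper mirroring Step 4 of htmlToPlainText().
function flattenRenderedHtml(string $plainHtml): string
{
    // Turn line breaks and paragraph boundaries into newlines before the tags are removed.
    $withNewlines = str_replace(['<br>', '</p><p>'], ["\n", "\n\n"], $plainHtml);

    // Drop the remaining tags, then decode entities such as &amp; and &quot;.
    return html_entity_decode(strip_tags($withNewlines), ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

// Hand-written sample input, not real DjotConverter output.
$sample = '<p>Tom &amp; Jerry<br>second line</p><p>Next &quot;paragraph&quot;</p>';

echo flattenRenderedHtml($sample), "\n";
```

Decoding entities last matters: if they were decoded before `strip_tags()`, an escaped `&lt;b&gt;` in the text would turn into a real tag and be removed.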

## Side-by-Side Comparison

Each example shows the actual output from both approaches.

### Table

**Input:**
```html
<table><tr><th>Name</th><th>Type</th></tr><tr><td>Djot</td><td>Markup</td></tr></table>
```

| Method | Output |
|--------|--------|
| `strip_tags()` | `NameTypeDjotMarkup` |
| HtmlToDjot+Profile | `Name \| Type`<br>`Djot \| Markup` |

### Heading

**Input:**
```html
<h2>Welcome to Our Site</h2>
```

| Method | Output |
|--------|--------|
| `strip_tags()` | `Welcome to Our Site` |
| HtmlToDjot+Profile | `## Welcome to Our Site` |

### Image

**Input:**
```html
<img src="photo.jpg" alt="A beautiful photo">
```

| Method | Output |
|--------|--------|
| `strip_tags()` | *(empty string)* |
| HtmlToDjot+Profile | `[img: A beautiful photo]` |

### Blockquote

**Input:**
```html
<blockquote><p>A wise quote.</p></blockquote>
```

| Method | Output |
|--------|--------|
| `strip_tags()` | `A wise quote.` |
| HtmlToDjot+Profile | `> A wise quote.` |

### Code Block

**Input:**
```html
<pre><code>echo "Hello";</code></pre>
```

| Method | Output |
|--------|--------|
| `strip_tags()` | `echo "Hello";` |
| HtmlToDjot+Profile | `` `echo "Hello";` `` |

### Thematic Break

**Input:**
```html
<hr>
```

| Method | Output |
|--------|--------|
| `strip_tags()` | *(empty string)* |
| HtmlToDjot+Profile | `---` |

### Definition List

**Input:**
```html
<dl><dt>Term</dt><dd>Definition here</dd></dl>
```

| Method | Output |
|--------|--------|
| `strip_tags()` | `TermDefinition here` |
| HtmlToDjot+Profile | `Term`<br>`- Definition here` |

### Unordered List

**Input:**
```html
<ul><li>First</li><li>Second</li></ul>
```

| Method | Output |
|--------|--------|
| `strip_tags()` | `FirstSecond` |
| HtmlToDjot+Profile | `- First`<br>`- Second` |

### Ordered List

**Input:**
```html
<ol><li>First</li><li>Second</li><li>Third</li></ol>
```

| Method | Output |
|--------|--------|
| `strip_tags()` | `FirstSecondThird` |
| HtmlToDjot+Profile | `1. First`<br>`2. Second`<br>`3. Third` |

## Use Cases

### Email Plain-Text Version

```php
// Generate plain-text alternative for HTML emails
$plainText = htmlToPlainText($htmlEmail);
```

### Search Indexing

```php
// Extract readable text for search engines
$searchableText = htmlToPlainText($articleHtml);
```

### Content Preview

```php
// Generate preview snippets (append the ellipsis only when the text was actually cut)
$text = htmlToPlainText($content);
$preview = mb_strlen($text) > 200 ? mb_substr($text, 0, 200) . '...' : $text;
```
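
The snippet above cuts at a fixed character count, which can split a word in half. A possible refinement is sketched below: `makePreview()` is a hypothetical helper (not part of the library) that takes already-converted plain text, leaves short text untouched, and otherwise backs up to the last space before the limit.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: shorten plain text to at most $limit characters
// (plus the ellipsis), preferring to cut at a word boundary.
function makePreview(string $text, int $limit = 200, string $ellipsis = '...'): string
{
    // Collapse runs of whitespace so newlines do not waste preview space.
    $text = trim((string) preg_replace('/\s+/u', ' ', $text));

    if (mb_strlen($text) <= $limit) {
        return $text; // Short enough: no truncation, no ellipsis.
    }

    $cut = mb_substr($text, 0, $limit);

    // Back up to the last space, unless the cut already ended on a
    // word boundary or the cut contains no usable space.
    $lastSpace = mb_strrpos($cut, ' ');
    if ($lastSpace !== false && $lastSpace > 0 && mb_substr($text, $limit, 1) !== ' ') {
        $cut = mb_substr($cut, 0, $lastSpace);
    }

    return rtrim($cut) . $ellipsis;
}

echo makePreview('The quick brown fox jumps over the lazy dog', 12), "\n";
// prints "The quick..."
```

In practice the input would be the result of `htmlToPlainText($content)`; a literal string is used here so the sketch runs without the library.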

### Accessibility / Screen Readers

```php
// Generate screen-reader friendly text summary
$accessibleText = htmlToPlainText($richContent);
```

## Performance Considerations

The HtmlToDjot approach parses the HTML and renders it again, which is slower than a single `strip_tags()` call. For high-volume processing, consider caching the result, keyed by a hash of the input:

```php
// $cache is assumed to expose get($key, $callback), where the callback computes the value on a miss
$cacheKey = 'plain_text_' . md5($html);
$plainText = $cache->get($cacheKey, fn() => htmlToPlainText($html));
```
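
The snippet above relies on a `$cache` object whose `get()` accepts a key and a callback that computes the value on a miss. Where no cache service is available, the same idea can be sketched with in-process memoization. `memoizeByContent()` is a hypothetical helper, and the stand-in converter passed to it here is `strip_tags()` only so that the example runs without the library; in practice the wrapped callable would be the `htmlToPlainText()` function.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: wrap a string-to-string converter so that each
// distinct input is converted only once per process.
function memoizeByContent(callable $convert): callable
{
    $results = [];

    return function (string $html) use ($convert, &$results): string {
        // Hash the content so long documents do not become long array keys.
        $key = hash('sha256', $html);

        if (!array_key_exists($key, $results)) {
            $results[$key] = $convert($html);
        }

        return $results[$key];
    };
}

// Stand-in converter that counts how often it is actually invoked.
$calls = 0;
$toText = memoizeByContent(function (string $html) use (&$calls): string {
    $calls++;
    return strip_tags($html);
});

$first = $toText('<p>Hello</p>');
$second = $toText('<p>Hello</p>'); // Served from the in-memory array.

echo $first, ' (converter calls: ', $calls, ")\n";
// prints "Hello (converter calls: 1)"
```

This only helps within a single request or worker process; a shared cache is still needed when the same HTML is converted across requests.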

## Summary

| Feature | `strip_tags()` | HtmlToDjot + Profile |
|---------|----------------|----------------------|
| Table cells | `NameType` (merged) | `Name \| Type` |
| Image alt text | *(lost)* | `[img: alt]` |
| Heading level | ❌ Plain text | `## Heading` |
| Blockquotes | ❌ No indicator | `> quote` |
| Code blocks | ⚠️ Plain text | `` `code` `` |
| Unordered lists | `FirstSecond` | `- First`, `- Second` |
| Ordered lists | `FirstSecond` | `1. First`, `2. Second` |
| Definition lists | `TermDef` | `Term` + `- Def` |
| Thematic breaks | *(lost)* | `---` |

For readable plain-text output from HTML, the HtmlToDjot + Profile approach keeps the structure that `strip_tags()` discards, at the cost of extra parsing time.
