Commit af81faf

test: update tests to reflect new libxml2 HTML5 parsing behaviors (#3308)
**What problem is this PR intended to solve?**

@nwellnhof has done some work in libxml2 to move towards HTML5 parsing behaviors. This branch is intended to update Nokogiri's expectations (primarily tests) as a feedback loop for both projects.

See https://gitlab.gnome.org/GNOME/libxml2/-/issues/758#note_2217350

**Have you included adequate test coverage?**

Yes.

**Does this change affect the behavior of either the C or the Java implementations?**

Not so far.
2 parents d992447 + a31c095 commit af81faf

11 files changed

+198
-103
lines changed

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -48,6 +48,7 @@ We've resolved many long-standing bugs in the various schema classes, validation
 * Introduce support for a new SAX callback `XML::SAX::Document#reference`, which is called to report some parsed XML entities when `XML::SAX::ParserContext#replace_entities` is set to the default value `false`. This is necessary functionality for some applications that were previously relying on incorrect entity error reporting which has been fixed (see below). For more information, read the docs for `Nokogiri::XML::SAX::Document`. [#1926] @flavorjones
 * `XML::SAX::Parser#parse_memory` and `#parse_file` now accept an optional `encoding` argument. When not provided, the parser will fall back to the encoding passed to the initializer, and then fall back to autodetection. [#3288] @flavorjones
 * `XML::SAX::ParserContext.memory` now accepts an optional `encoding` argument. When not provided, the encoding will be autodetected. [#3288] @flavorjones
+* `XML::DocumentFragment#parse_options` and `HTML4::DocumentFragment#parse_options` return the options used to parse the document fragment. @flavorjones
 * [CRuby] `Nokogiri::HTML5::Builder` is similar to `HTML4::Builder` but returns an `HTML5::Document`. [#3119] @flavorjones
 * [CRuby] Attributes in an HTML5 document can be serialized individually, something that has always been supported by the HTML4 serializer. [#3125, #3127] @flavorjones
 * [CRuby] Introduce a compile-time option, `--disable-xml2-legacy`, to remove from libxml2 its dependencies on `zlib` and `liblzma` and disable implicit `HTTP` network requests. These all remain enabled by default, and are present in the precompiled native gems. This option is a precursor for removing these libraries in a future major release, but may be interesting for the security-minded who do not need features like automatic decompression and would like to remove these dependencies. You can read more and give feedback on these plans in #3168. [#3247] @flavorjones

lib/nokogiri/html4/document_fragment.rb

Lines changed: 1 addition & 0 deletions
@@ -91,6 +91,7 @@ def initialize(document, tags = nil, ctx = nil, options = XML::ParseOptions::DEF
         return self unless tags
 
         options = Nokogiri::XML::ParseOptions.new(options) if Integer === options
+        @parse_options = options
         yield options if block_given?
 
         if ctx

lib/nokogiri/xml/document_fragment.rb

Lines changed: 6 additions & 0 deletions
@@ -4,6 +4,11 @@
 module Nokogiri
   module XML
     class DocumentFragment < Nokogiri::XML::Node
+      # The options used to parse the document fragment. Returns the value of any options that were
+      # passed into the constructor as a parameter or set in a config block, else the default
+      # options for the specific subclass.
+      attr_reader :parse_options
+
       ####
       # Create a Nokogiri::XML::DocumentFragment from +tags+
       def self.parse(tags, options = ParseOptions::DEFAULT_XML, &block)
@@ -20,6 +25,7 @@ def initialize(document, tags = nil, ctx = nil, options = ParseOptions::DEFAULT_
         return self unless tags
 
         options = Nokogiri::XML::ParseOptions.new(options) if Integer === options
+        @parse_options = options
         yield options if block_given?
 
         children = if ctx
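
The placement of `@parse_options = options` before `yield options` matters: because the same object is stored and then yielded, any in-place changes made by a config block stay visible through the reader. Below is a minimal standalone sketch of that pattern. `SketchOptions` and `SketchFragment` are hypothetical stand-ins written for illustration, not Nokogiri classes, and the flag values are arbitrary.

```ruby
# Hypothetical stand-ins for illustration only; these are not Nokogiri classes.
class SketchOptions
  attr_reader :flags

  def initialize(flags = 0)
    @flags = flags
  end

  # Mutates the receiver in place, as a config block typically would.
  def enable(bit)
    @flags |= bit
    self
  end
end

class SketchFragment
  # Returns the options object that was used during construction.
  attr_reader :parse_options

  def initialize(options = 0)
    # Normalize a bare Integer into an options object, as the diff does.
    options = SketchOptions.new(options) if Integer === options
    # Store the object before yielding, so in-place changes made by the
    # block remain visible through the reader afterwards.
    @parse_options = options
    yield options if block_given?
  end
end

fragment = SketchFragment.new(1) { |opts| opts.enable(4) }
puts fragment.parse_options.flags # prints 5
```

In the diff itself the assignment sits after `return self unless tags`, so a fragment constructed without tags would return early and leave the instance variable unset.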

test/html4/sax/test_document_error.rb

Lines changed: 2 additions & 7 deletions
@@ -20,15 +20,10 @@ def start_document
         end
 
         def test_warning_document_encounters_error_but_terminates_normally
-          # Probably I'm doing something wrong, but I can't make nekohtml report errors,
-          # despite setting http://cyberneko.org/html/features/report-errors.
-          # See https://nekohtml.sourceforge.net/settings.html for more info.
-          # I'd love some help here if someone finds this comment and cares enough to dig in.
-          skip_unless_libxml2("nekohtml sax parser does not seem to report errors?")
-
           warning_parser = Nokogiri::HTML4::SAX::Parser.new(Nokogiri::SAX::TestCase::Doc.new)
           warning_parser.parse("<html><body><<div att=")
-          refute_empty(warning_parser.document.errors, "error collector did not collect an error")
+
+          assert(warning_parser.document.end_document_called)
         end
       end
     end

test/html4/test_comments.rb

Lines changed: 50 additions & 5 deletions
@@ -96,7 +96,19 @@ class TestComment < Nokogiri::TestCase
         let(:doc) { Nokogiri::HTML4(html) }
         let(:subject) { doc.at_css("div#under-test") }
 
-        if Nokogiri.uses_libxml?
+        if Nokogiri.uses_libxml?(">= 2.14.0")
+          it "behaves as if the comment is closed immediately before the end of the input stream" do # COMPLIANT
+            assert_pattern do
+              subject => {
+                name: "div",
+                attributes: [{ name: "id", value: "under-test" }],
+                children: [
+                  { name: "comment", content: "start of unterminated comment" }
+                ]
+              }
+            end
+          end
+        elsif Nokogiri.uses_libxml?
           it "behaves as if the comment is unterminated and doesn't exist" do # NON-COMPLIANT
             assert_equal 0, subject.children.length
             assert_equal 1, doc.errors.length
@@ -132,8 +144,12 @@ class TestComment < Nokogiri::TestCase
             assert_equal inner_div, subject.children[1]
             assert_predicate subject.children[2], :comment?
             assert_equal "bar", subject.children[2].content
-            assert_equal 1, doc.errors.length
-            assert_match(/Comment incorrectly closed/, doc.errors.first.to_s)
+            if Nokogiri.uses_libxml?(">= 2.14.0")
+              assert_empty doc.errors
+            else
+              assert_equal 1, doc.errors.length
+              assert_match(/Comment incorrectly closed/, doc.errors.first.to_s)
+            end
           end
         else # jruby, or libxml2 system lib less than 2.9.11
           it "behaves as if the comment encompasses the inner div" do # NON-COMPLIANT
@@ -161,7 +177,22 @@ class TestComment < Nokogiri::TestCase
         let(:body) { doc.at_css("body") }
         let(:subject) { doc.at_css("div#under-test") }
 
-        if Nokogiri.uses_libxml?("= 2.9.14")
+        if Nokogiri.uses_libxml?(">= 2.14.0")
+          it "parses as comments" do # COMPLIANT
+            assert_pattern do
+              body.children => [
+                {
+                  name: "div",
+                  children: [
+                    { name: "comment", content: " comment <div id=do-i-exist" },
+                    { name: "text", content: "inner content" },
+                  ]
+                },
+                { name: "text", content: "-->hello" },
+              ]
+            end
+          end
+        elsif Nokogiri.uses_libxml?("= 2.9.14")
           it "parses as PCDATA" do # NON-COMPLIANT
             assert_equal 1, body.children.length
             assert_equal subject, body.children.first
@@ -212,7 +243,21 @@ class TestComment < Nokogiri::TestCase
         let(:body) { doc.at_css("body") }
         let(:subject) { doc.at_css("div#under-test") }
 
-        if Nokogiri.uses_libxml?("= 2.9.14")
+        if Nokogiri.uses_libxml?(">= 2.14.0")
+          it "parses the <! tags as comments" do
+            assert_pattern do
+              body.children => [
+                {
+                  name: "div", children: [
+                    { name: "comment", content: "[if foo]" },
+                    { name: "div", attributes: [{name: "id", value: "do-i-exist"}] },
+                    { name: "comment", content: "[endif]" },
+                  ]
+                }
+              ]
+            end
+          end
+        elsif Nokogiri.uses_libxml?("= 2.9.14")
           it "parses the <! tags as PCDATA" do
             assert_equal(1, body.children.length)
             assert_equal(subject, body.children.first)
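
The new expectations above use Ruby's rightward pattern matching (`value => pattern`) inside `assert_pattern`, so one expression checks a node's name, attributes, and ordered children together. The sketch below shows the same matching style on a hypothetical `SketchNode` Struct. It is an illustration only: Nokogiri nodes are not Structs, and the node names and contents are borrowed from the test for readability.

```ruby
# Hypothetical node type for illustration; Nokogiri nodes are not Structs.
# A Struct supports hash patterns because it defines deconstruct_keys.
SketchNode = Struct.new(:name, :content, :children, keyword_init: true)

comment = SketchNode.new(name: "comment", content: "[if foo]", children: [])
inner   = SketchNode.new(name: "div", content: "", children: [])
outer   = SketchNode.new(name: "div", content: "", children: [comment, inner])

# Rightward matching raises NoMatchingPatternError (or a subclass)
# when the structure does not match; here it matches and returns quietly.
outer => {
  name: "div",
  children: [
    { name: "comment", content: "[if foo]" },
    { name: "div" },
  ]
}

# `in` returns a boolean instead of raising. The parentheses are needed
# because `in` binds more loosely than assignment.
matched = (outer in { children: [{ name: "text" }, *] })
puts matched # prints false
```

An array pattern without a splat must match every element in order, which is why the tests can pin down the exact list of children without a separate length assertion.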

test/html4/test_document.rb

Lines changed: 11 additions & 10 deletions
@@ -363,10 +363,8 @@ def test_document_has_error
         html = Nokogiri::HTML4(<<~HTML)
           <html>
             <body>
-              <div awesome="asdf>
-                <p>inside div tag</p>
-              </div>
-              <p>outside div tag</p>
+              <div>
+              </foo>
             </body>
           </html>
         HTML
@@ -660,14 +658,15 @@ def test_capturing_nonparse_errors_during_document_clone
 
       def test_capturing_nonparse_errors_during_node_copy_between_docs
         # Errors should be emitted while parsing only, and should not change when moving nodes.
-        doc1 = Nokogiri::HTML4("<html><body><diva id='unique'>one</diva></body></html>")
-        doc2 = Nokogiri::HTML4("<html><body><dive id='unique'>two</dive></body></html>")
+        doc1 = Nokogiri::HTML4("<html><body><div id='unique'>one</foo1></body></html>")
+        doc2 = Nokogiri::HTML4("<html><body><div id='unique'>two</foo2></body></html>")
         node1 = doc1.at_css("#unique")
         node2 = doc2.at_css("#unique")
         original_errors1 = doc1.errors.dup
         original_errors2 = doc2.errors.dup
-        assert(original_errors1.any? { |e| e.to_s.include?("Tag diva invalid") }, "it should complain about the tag name")
-        assert(original_errors2.any? { |e| e.to_s.include?("Tag dive invalid") }, "it should complain about the tag name")
+
+        refute_empty(original_errors1)
+        refute_empty(original_errors2)
 
         node1.add_child(node2)
 
@@ -734,6 +733,8 @@ def test_silencing_nonparse_errors_during_attribute_insertion_1262
           doc = Nokogiri::HTML4::Document.parse(html)
           expected = if Nokogiri.jruby?
             [Nokogiri::XML::Node::COMMENT_NODE, Nokogiri::XML::Node::PI_NODE]
+          elsif Nokogiri.uses_libxml?(">= 2.14.0")
+            [Nokogiri::XML::Node::COMMENT_NODE, Nokogiri::XML::Node::COMMENT_NODE]
           elsif Nokogiri.uses_libxml?(">= 2.10.0")
             [Nokogiri::XML::Node::COMMENT_NODE]
           else
@@ -802,7 +803,7 @@ def test_silencing_nonparse_errors_during_attribute_insertion_1262
         end
 
         describe "read memory" do
-          let(:input) { "<html><body><div" }
+          let(:input) { "<html><body><div></foo>" }
 
           describe "strict parsing" do
             let(:parse_options) { html_strict }
@@ -824,7 +825,7 @@ def test_silencing_nonparse_errors_during_attribute_insertion_1262
         end
 
         describe "read io" do
-          let(:input) { StringIO.new("<html><body><div") }
+          let(:input) { StringIO.new("<html><body><div></foo>") }
 
           describe "strict parsing" do
             let(:parse_options) { html_strict }
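
The version-gated branches in this file are order-sensitive: any libxml2 version that satisfies `>= 2.14.0` also satisfies `>= 2.10.0`, so the stricter check has to come first in the `if`/`elsif` chain. The sketch below illustrates this with RubyGems' `Gem::Requirement`. `sketch_version_matches?`, `sketch_expected_branch`, and the returned labels are hypothetical names for illustration, and this is not a claim about how `Nokogiri.uses_libxml?` is implemented.

```ruby
require "rubygems" # provides Gem::Version and Gem::Requirement

# Hypothetical helper for illustration; not Nokogiri's implementation.
def sketch_version_matches?(loaded_version, requirement)
  Gem::Requirement.new(requirement).satisfied_by?(Gem::Version.new(loaded_version))
end

# Returns a label for the first matching branch, mirroring an if/elsif chain.
# The labels are made up for this sketch.
def sketch_expected_branch(loaded_version)
  if sketch_version_matches?(loaded_version, ">= 2.14.0")
    :two_comments
  elsif sketch_version_matches?(loaded_version, ">= 2.10.0")
    :one_comment
  else
    :older
  end
end

puts sketch_expected_branch("2.14.1") # prints two_comments
puts sketch_expected_branch("2.12.9") # prints one_comment
puts sketch_expected_branch("2.9.14") # prints older
```

Swapping the first two branches would make the `>= 2.14.0` case unreachable, since `>= 2.10.0` would match first.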

test/html4/test_document_encoding.rb

Lines changed: 1 addition & 1 deletion
@@ -148,7 +148,7 @@ def binopen(file)
         end
 
         describe "error handling" do
-          RAW = "<html><body><div"
+          RAW = "<html><body><div></foo>"
 
           { "read_memory" => RAW, "read_io" => StringIO.new(RAW) }.each do |flavor, input|
             it "#{flavor} should handle errors" do
