From c22cca7824d96f39642b69111b31869878d195d7 Mon Sep 17 00:00:00 2001
From: Jim Winstead
Date: Sat, 10 Aug 2024 15:12:18 -0700
Subject: [PATCH] Warn about libxml's unstable HTML parsing (fixes #2219) (#3649)

Consolidates the duplicated note between `DOMDocument::loadHTML` and
`DOMDocument::loadHTMLFile` and adds information about how libxml HTML5
parsing isn't stable across versions and the new `Dom\HTMLDocument`
functions should be used. (Not documented yet, so they aren't linked, but
will wire up automatically when it is.)
---
 language-snippets.ent                      | 24 ++++++++++++++++++++++
 reference/dom/domdocument/loadhtml.xml     | 17 +--------------
 reference/dom/domdocument/loadhtmlfile.xml | 17 +--------------
 3 files changed, 26 insertions(+), 32 deletions(-)

diff --git a/language-snippets.ent b/language-snippets.ent
index c5a64eb378..bf684e7945 100644
--- a/language-snippets.ent
+++ b/language-snippets.ent
@@ -1572,6 +1572,30 @@ it is inserted with (e.g.) DOMNo
 While malformed HTML should load successfully, this function may generate E_WARNING errors when it encounters bad markup. libxml's error handling functions may be used to handle these errors.'>
 The DOM extension uses UTF-8 encoding. Use mb_convert_encoding, UConverter::transcode, or iconv to handle other encodings.'>
 When using json_encode on a DOMDocument object the result will be that of encoding an empty object.'>
+
+
+ This function parses the input using an HTML 4 parser. The parsing rules
+ of HTML 5, which is what modern web browsers use, are different. Depending
+ on the input this might result in a different DOM structure. Therefore
+ this function cannot be safely used for sanitizing HTML.
+
+
+ The behavior when parsing HTML can depend on the version of
+ libxml that is being used, particularly with regards to
+ edge conditions and error handling.
+ For parsing that conforms to the HTML5 specification,
+ use Dom\HTMLDocument::createFromString or
+ Dom\HTMLDocument::createFromFile, added in PHP 8.4.
+
+
+ As an example, some HTML elements will implicitly close a parent element
+ when encountered. The rules for automatically closing parent elements
+ differ between HTML 4 and HTML 5 and thus the resulting DOM structure that
+ DOMDocument sees might be different from the DOM
+ structure a web browser sees, possibly allowing an attacker to break the
+ resulting HTML.
+
+'>
diff --git a/reference/dom/domdocument/loadhtml.xml b/reference/dom/domdocument/loadhtml.xml
index 0b3ff8c87e..bed82fe7be 100644
--- a/reference/dom/domdocument/loadhtml.xml
+++ b/reference/dom/domdocument/loadhtml.xml
@@ -18,22 +18,7 @@
 The function parses the HTML contained in the string source. Unlike loading XML, HTML does not have to be well-formed to load.
-
-
- This function parses the input using an HTML 4 parser. The parsing rules
- of HTML 5, which is what modern web browsers use, are different. Depending
- on the input this might result in a different DOM structure. Therefore
- this function cannot be safely used for sanitizing HTML.
-
-
- As an example, some HTML elements will implicitly close a parent element
- when encountered. The rules for automatically closing parent elements
- differ between HTML 4 and HTML 5 and thus the resulting DOM structure that
- DOMDocument sees might be different from the DOM
- structure a web browser sees, possibly allowing an attacker to break the
- resulting HTML.
-
-
+ &dom.domdocument.html5;
 &reftitle.parameters;
diff --git a/reference/dom/domdocument/loadhtmlfile.xml b/reference/dom/domdocument/loadhtmlfile.xml
index 8cf922ff78..ed236bc2e1 100644
--- a/reference/dom/domdocument/loadhtmlfile.xml
+++ b/reference/dom/domdocument/loadhtmlfile.xml
@@ -19,22 +19,7 @@
 filename. Unlike loading XML, HTML does not have to be well-formed to load.
-
-
- This function parses the input using an HTML 4 parser. The parsing rules
- of HTML 5, which is what modern web browsers use, are different. Depending
- on the input this might result in a different DOM structure. Therefore
- this function cannot be safely used for sanitizing HTML.
-
-
- As an example, some HTML elements will implicitly close a parent element
- when encountered. The rules for automatically closing parent elements
- differ between HTML 4 and HTML 5 and thus the resulting DOM structure that
- DOMDocument sees might be different from the DOM
- structure a web browser sees, possibly allowing an attacker to break the
- resulting HTML.
-
-
+ &dom.domdocument.html5;
 &reftitle.parameters;