mirror of https://github.com/php/doc-en.git synced 2026-03-23 23:32:18 +01:00

Warn about libxml's unstable HTML parsing (fixes #2219) (#3649)

Consolidates the warning duplicated between `DOMDocument::loadHTML` and `DOMDocument::loadHTMLFile` into a shared entity, and adds a note that libxml's HTML parsing isn't stable across versions and that the new `Dom\HTMLDocument` methods should be used instead. (Those methods aren't documented yet, so they aren't linked, but the links will wire up automatically once they are.)
Jim Winstead
2024-08-10 15:12:18 -07:00
committed by GitHub
parent 4acad9b77f
commit c22cca7824
3 changed files with 26 additions and 32 deletions

@@ -1572,6 +1572,30 @@ it is inserted with (e.g.) <function xmlns="http://docbook.org/ns/docbook">DOMNo
<!ENTITY dom.malformederror '<para xmlns="http://docbook.org/ns/docbook">While malformed HTML should load successfully, this function may generate <constant>E_WARNING</constant> errors when it encounters bad markup. <link linkend="function.libxml-use-internal-errors">libxml&apos;s error handling functions</link> may be used to handle these errors.</para>'>
<!ENTITY dom.note.utf8 '<note xmlns="http://docbook.org/ns/docbook"><para>The DOM extension uses UTF-8 encoding. Use <function>mb_convert_encoding</function>, <methodname>UConverter::transcode</methodname>, or <function>iconv</function> to handle other encodings.</para></note>'>
<!ENTITY dom.note.json '<note xmlns="http://docbook.org/ns/docbook"><para>When using <function>json_encode</function> on a <classname>DOMDocument</classname> object the result will be that of encoding an empty object.</para></note>'>
<!ENTITY dom.domdocument.html5 '<warning xmlns="http://docbook.org/ns/docbook">
<para>
This function parses the input using an HTML 4 parser. The parsing rules
of HTML 5, which is what modern web browsers use, are different. Depending
on the input this might result in a different DOM structure. Therefore
this function cannot be safely used for sanitizing HTML.
</para>
<para>
The behavior when parsing HTML can depend on the version of
<literal>libxml</literal> that is being used, particularly with regard to
edge cases and error handling.
For parsing that conforms to the HTML 5 specification,
use <methodname>Dom\HTMLDocument::createFromString</methodname> or
<methodname>Dom\HTMLDocument::createFromFile</methodname>, added in PHP 8.4.
</para>
<para>
As an example, some HTML elements will implicitly close a parent element
when encountered. The rules for automatically closing parent elements
differ between HTML 4 and HTML 5 and thus the resulting DOM structure that
<classname>DOMDocument</classname> sees might be different from the DOM
structure a web browser sees, possibly allowing an attacker to break the
resulting HTML.
</para>
</warning>'>
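
The difference this warning describes can be observed by parsing the same markup with both APIs. The following is only a minimal sketch, not part of the commit: the input string and the variable names are hypothetical, the output is deliberately not shown because it depends on the PHP and libxml versions in use, and the second half assumes PHP 8.4 or later.

```php
<?php
// Hypothetical input chosen for illustration only; what each parser
// produces depends on the PHP and libxml versions in use.
$html = '<!DOCTYPE html><html><body><p>one<div>two</div></p></body></html>';

// Legacy parser: collect libxml errors internally instead of emitting E_WARNING.
$previous = libxml_use_internal_errors(true);
$legacy = new DOMDocument();
$legacy->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);
echo "DOMDocument:      ", $legacy->saveHTML($legacy->documentElement), "\n";

// HTML 5 conforming parser, only available from PHP 8.4 onwards.
if (class_exists('Dom\HTMLDocument')) {
    $modern = Dom\HTMLDocument::createFromString($html, LIBXML_NOERROR);
    echo "Dom\\HTMLDocument: ", $modern->saveHtml($modern->documentElement), "\n";
}
```

If the two serializations differ for a given input, a sanitizer built on `DOMDocument` would be inspecting a different tree from the one a browser builds, which is the risk the warning describes.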

@@ -18,22 +18,7 @@
The function parses the HTML contained in the string <parameter>source</parameter>.
Unlike loading XML, HTML does not have to be well-formed to load.
</para>
<warning>
<para>
This function parses the input using an HTML 4 parser. The parsing rules
of HTML 5, which is what modern web browsers use, are different. Depending
on the input this might result in a different DOM structure. Therefore
this function cannot be safely used for sanitizing HTML.
</para>
<para>
As an example, some HTML elements will implicitly close a parent element
when encountered. The rules for automatically closing parent elements
differ between HTML 4 and HTML 5 and thus the resulting DOM structure that
<classname>DOMDocument</classname> sees might be different from the DOM
structure a web browser sees, possibly allowing an attacker to break the
resulting HTML.
</para>
</warning>
&dom.domdocument.html5;
</refsect1>
<refsect1 role="parameters">
&reftitle.parameters;

@@ -19,22 +19,7 @@
<parameter>filename</parameter>. Unlike loading XML, HTML does not have
to be well-formed to load.
</para>
<warning>
<para>
This function parses the input using an HTML 4 parser. The parsing rules
of HTML 5, which is what modern web browsers use, are different. Depending
on the input this might result in a different DOM structure. Therefore
this function cannot be safely used for sanitizing HTML.
</para>
<para>
As an example, some HTML elements will implicitly close a parent element
when encountered. The rules for automatically closing parent elements
differ between HTML 4 and HTML 5 and thus the resulting DOM structure that
<classname>DOMDocument</classname> sees might be different from the DOM
structure a web browser sees, possibly allowing an attacker to break the
resulting HTML.
</para>
</warning>
&dom.domdocument.html5;
</refsect1>
<refsect1 role="parameters">
&reftitle.parameters;