author | Chih-Hsuan Yen <[email protected]> | 2023-11-26 20:53:05 +0800 |
---|---|---|
committer | Chih-Hsuan Yen <[email protected]> | 2023-11-26 21:04:56 +0800 |
commit | d4da4dcc321ca65fb2cd19877f395cc5f75933ab (patch) | |
tree | 5667b4fb2b4dd42853fb638ef81e6fec10475c52 /classes | |
parent | 2c7e000120b23487ed4090241a206f528e6b11f5 (diff) |
Fix sanitizer with libxml2 >= 2.12.0
Somehow with newer libxml2, `<?xml encoding="UTF-8">` no longer enforces
UTF-8. Instead, non-ASCII contents are treated as ISO-8859-1 and get
broken.
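For example, `<p>中文</p>` comes out as mojibake: the six UTF-8 bytes of
`中文` (E4 B8 AD E6 96 87) are each decoded as a separate ISO-8859-1
character, when the two original characters should survive unchanged.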
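The misdecoding described above can be reproduced without libxml2. The following is a minimal sketch, assuming the mbstring extension is available; the variable names are illustrative and not taken from the patched code.

```php
<?php
// "中文" written as explicit UTF-8 bytes so the sketch does not depend
// on the encoding of this source file.
$utf8 = "\xE4\xB8\xAD\xE6\x96\x87";

// Show the raw bytes that the HTML parser receives.
echo strtoupper(bin2hex($utf8)), "\n"; // E4B8ADE69687

// Treat every byte as one ISO-8859-1 character and re-encode as UTF-8,
// which is the kind of misinterpretation the commit message describes.
$mojibake = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

// Two characters go in, six come out.
echo mb_strlen($utf8, 'UTF-8'), ' -> ', mb_strlen($mojibake, 'UTF-8'), "\n"; // 2 -> 6
```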
Switching to another trick mentioned in [1], prepending
`<meta charset="UTF-8">` instead, fixes the issue, and the new trick
still works with older libxml2 (tested 2.11.5).
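As a rough illustration of the replacement trick, the sketch below loads an HTML fragment with a `<meta charset="UTF-8">` prefix and reads the text back. The helper name `load_utf8_fragment` is hypothetical and not part of `Sanitizer.php`. It assumes PHP's dom extension is enabled, and the result depends on the installed libxml2 behaving as the commit message reports.

```php
<?php
// Hypothetical helper for illustration only; the real change is the
// one-line edit inside the Sanitizer class.
function load_utf8_fragment(string $html): DOMDocument {
    $doc = new DOMDocument();
    // The <meta charset> prefix tells libxml2's HTML parser that the
    // input is UTF-8, replacing the older <?xml encoding="UTF-8"> prefix.
    $doc->loadHTML('<meta charset="UTF-8">' . $html);
    return $doc;
}

// "中文" as explicit UTF-8 bytes, independent of the file encoding.
$doc = load_utf8_fragment("<p>\xE4\xB8\xAD\xE6\x96\x87</p>");
$p = $doc->getElementsByTagName('p')->item(0);

// DOM string properties are returned as UTF-8, so this compares the
// parsed text against the original bytes.
var_dump($p->textContent === "\xE4\xB8\xAD\xE6\x96\x87");
```

The injected `<meta>` element becomes part of the parsed document, so code that serializes the whole document rather than selected nodes may want to remove it afterwards.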
As a side note, DOMDocument::loadHTML uses HTMLParser in libxml2 [2][3].
[1] https://stackoverflow.com/questions/8218230/php-domdocument-loadhtml-not-encoding-utf-8-correctly
[2] https://github.com/php/php-src/blob/php-8.1.26/ext/dom/document.c#L1855
[3] https://gnome.pages.gitlab.gnome.org/libxml2/devhelp/libxml2-HTMLparser.html
Diffstat (limited to 'classes')
-rw-r--r-- | classes/Sanitizer.php | 2 |
1 files changed, 1 insertions, 1 deletions
diff --git a/classes/Sanitizer.php b/classes/Sanitizer.php
index a7bea9e5f..7af92f249 100644
--- a/classes/Sanitizer.php
+++ b/classes/Sanitizer.php
@@ -72,7 +72,7 @@ class Sanitizer {
 		$res = trim($str); if (!$res) return '';
 
 		$doc = new DOMDocument();
-		$doc->loadHTML('<?xml encoding="UTF-8">' . $res);
+		$doc->loadHTML('<meta charset="UTF-8">' . $res);
 		$xpath = new DOMXPath($doc);
 
 		// is it a good idea to possibly rewrite urls to our own prefix?