How to convert text with HTML entities and invalid characters to its UTF-8 equivalent?
I am changing the title because I was unaware of the special broken Windows characters that were causing my problems, which made the question look like a duplicate.
How do I convert HTML entities, character references of the form &#[0-9]+; and &#x[a-fA-F0-9]+;, invalid character references, and invalid Windows characters such as chr(151) to their UTF-8 equivalents?
Basically, how do I clean up some very bad text of variable encoding and save it as UTF-8?
Original question below:
Convert &#[0-9]+; and &#x[a-fA-F0-9]+; references to UTF-8 equivalents?
For example, a character reference for an em dash should be converted to
—
like a browser does it, but with PHP.
Edit: even the non-standard ones that Windows made but browsers still display.
Answering my own question with the solution that I used in the end
The problem:
I needed to replace HTML entities and decimal and hexadecimal character references that looked like this: a decimal reference (rendered on this page as ‚), &#x201A;, and &#emdash;
with their UTF-8 equivalents, like a normal browser would, and convert the text into UTF-8.
The problem was that there were often references in the range 130-150 and x82-x9F, which, as thirtydot found out, are invalid Windows (Word) characters that people use in ASCII text for special characters like em dashes, and which PHP's html_entity_decode does not support.
You would think that these invalid characters would not work in browsers, but it looks like browsers made a silent undocumented agreement to fix these characters and display them properly anyway.
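The remapping browsers apply is written down in the HTML Standard's rules for numeric character references: code points 0x80-0x9F are treated as Windows-1252 bytes. A minimal sketch of that idea follows. The function name decode_numeric_references and the constant CP1252_TO_UNICODE are hypothetical and are not the function this answer used; the sketch assumes PHP 7.2+ with the mbstring extension.

```php
<?php
// Hypothetical sketch, not the answer's original function.
// Windows-1252 bytes 0x80-0x9F and the Unicode code points they stand for.
const CP1252_TO_UNICODE = [
    0x80 => 0x20AC, 0x82 => 0x201A, 0x83 => 0x0192, 0x84 => 0x201E,
    0x85 => 0x2026, 0x86 => 0x2020, 0x87 => 0x2021, 0x88 => 0x02C6,
    0x89 => 0x2030, 0x8A => 0x0160, 0x8B => 0x2039, 0x8C => 0x0152,
    0x8E => 0x017D, 0x91 => 0x2018, 0x92 => 0x2019, 0x93 => 0x201C,
    0x94 => 0x201D, 0x95 => 0x2022, 0x96 => 0x2013, 0x97 => 0x2014,
    0x98 => 0x02DC, 0x99 => 0x2122, 0x9A => 0x0161, 0x9B => 0x203A,
    0x9C => 0x0153, 0x9E => 0x017E, 0x9F => 0x0178,
];

function decode_numeric_references(string $text): string
{
    $result = preg_replace_callback(
        // Group 1: hexadecimal digits, group 2: decimal digits.
        '/&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));/',
        function (array $m): string {
            $code = ($m[1] !== '') ? hexdec($m[1]) : (int) $m[2];
            // Remap the "bad" Windows range; leave other code points alone.
            $code = CP1252_TO_UNICODE[$code] ?? $code;
            $char = mb_chr($code, 'UTF-8');
            // Keep the reference untouched if it is not a valid code point.
            return ($char === false) ? $m[0] : $char;
        },
        $text
    );
    return $result ?? $text;
}

// All three references below decode to the same em dash.
echo decode_numeric_references('&#151; &#x2014; &#8212;'), "\n";
```

Named entities such as &mdash; are not handled here; html_entity_decode covers those.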
While trying to fix these references I also found that the actual characters, like
<?php echo chr(151);?>
were also being used, probably copied directly from Word, and they caused all sorts of problems, so I needed to fix them too.
What most answers that I found regarding encodings fail to mention is that the solution to an encoding-related problem often depends largely on the encoding used.
Here is an example:
The invalid Windows character
chr(151)
works with "ISO-8859-1" encoded text, and Josh B mentions, as per Jukka Korpela's suggestion, that you should fix them like this:
What it does is replace the Windows character with a safe ASCII alternative, but knowing that the text would be stored in UTF-8, I did not want to lose the original characters.
Changing them like this was not an option either, because ASCII does not support the proper Unicode character:
So what I did instead was to first replace the character with its HTML reference (while $str was still "ISO-8859-1" encoded):
Then I changed the encoding:
And finally I turned all the entities and character references into pure UTF-8 with my "html_character_reference_decode" function, which is largely based on Gumbo's solution and also fixes the bad Windows references, but only uses
preg_replace_callback
to go over the bad Windows characters.
So if your text is "ISO-8859-1" encoded, the complete solution looks like this:
It was tested with a wide range of situations and looks like it works.
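A shorter route for the single-byte case, offered as an alternative sketch rather than the three-step solution described above: treat the input as Windows-1252, which assigns printable characters to the 0x80-0x9F bytes, and convert it directly. This assumes the mbstring extension is available; the sample string is made up for illustration.

```php
<?php
// Assumption: single-byte text labelled ISO-8859-1 that may contain
// Windows-1252 characters such as chr(151) (em dash).
$str = 'caf' . chr(233) . ' ' . chr(151) . ' &mdash; &#8212;';

// Step 1: convert the raw bytes, reading them as Windows-1252.
$utf8 = mb_convert_encoding($str, 'UTF-8', 'Windows-1252');

// Step 2: decode named entities and valid numeric references.
$utf8 = html_entity_decode($utf8, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $utf8, "\n";
```

This does not remap bad references such as the decimal reference 151; those still need a separate pass.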
Let's look at the same situation with UTF-8 encoded text that contains bad Windows characters.
One reliable way to test for the presence of bad characters or "badly formed UTF-8" was to use iconv. It is slow, but it was more reliable than preg_match in my tests:
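A comparable check can be sketched with mbstring's mb_check_encoding instead of iconv. This is a swapped-in technique, not the code this answer used, and the helper name is_valid_utf8 is hypothetical.

```php
<?php
// Hypothetical helper: true when $text is well-formed UTF-8.
function is_valid_utf8(string $text): bool
{
    return mb_check_encoding($text, 'UTF-8');
}

$good = "\u{2014}";             // a valid UTF-8 em dash
$bad  = "\u{2014}" . chr(151);  // the same, followed by a stray Windows byte

var_dump(is_valid_utf8($good)); // bool(true)
var_dump(is_valid_utf8($bad));  // bool(false)
```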
This was pretty much the best I could come up with, as I found no reasonable way to find and replace the bad Windows characters in UTF-8 text. Let me explain why.
Let's take a string with a perfectly valid Unicode character and a bad Windows em dash:
$str = "—".chr(151);
I don't know which bad Windows characters might be present in the UTF-8 string, only that they might be present.
Using
str_replace
to try to fix the bad Windows character
chr(148)
(right double quote) in the above string, which contains a valid em dash and no double quotes at all, results in a scrambled character. At first I thought that
str_replace
might not be multibyte safe, and tried using
mb_eregi_replace
but the problem was the same.
The comments on the PHP website and Stack Overflow mention that
str_replace
is binary safe and works fine with well-formed UTF-8 text, because of the way UTF-8 was designed.
Why it breaks
It turns out that the bad Windows character
chr(148)
is made up of the bits "10010100", while the em dash character (http://www.fileformat.info/info/unicode/char/2014/index.htm), according to the fileformat website, is made up of 3 bytes: "11100010:10000000:10010100".
Notice that the bits in the last byte of the perfectly valid UTF-8 character match the bits of the bad Windows right double quote, so
str_replace
just replaces the last byte, breaking the UTF-8 character.
This problem happens with lots of Unicode characters, and would scramble lots of characters in Russian text, for example.
This can't happen with ASCII text because each character is always made up of a single byte.
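The byte collision described above can be reproduced in a few lines. This is only an illustration; the variable names are made up, and it assumes PHP 7+ with mbstring.

```php
<?php
$emDash = "\u{2014}";          // valid UTF-8 em dash
echo bin2hex($emDash), "\n";   // e28094: the last byte is 0x94, i.e. chr(148)

// Trying to "fix" the Windows right double quote chr(148) with a plain
// byte-wise replace also hits the last byte of the em dash.
$broken = str_replace(chr(148), '"', $emDash);

echo bin2hex($broken), "\n";                   // e28022
var_dump(mb_check_encoding($broken, 'UTF-8')); // bool(false)
```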
So when you get a UTF-8 string that contains any number of multibyte characters, you can no longer safely fix the bad Windows characters, and the only solution I found was to strip them with iconv.
The only solution that I can think of
You could, however, replace every valid Unicode character that contains one of the bad bytes with an encoded counterpart, then replace the bad characters, and then decode the good characters again, thus keeping everything :)
Like this:
First encode 11100010:10000000:10010100 as a character reference (rendered on this page as —).
Then replace the bad byte 10010100 with the reference for the proper character (also rendered on this page as —).
Finally decode the reference from the first step back to 11100010:10000000:10010100.
But you have to write down every multibyte character that contains bytes that match the bad characters to achieve this.
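One way to avoid listing characters by hand, offered as an untested-in-production sketch and not something this answer tried: match runs of well-formed UTF-8 at the byte level and only reinterpret 0x80-0x9F bytes that fall outside them. The function name fix_stray_cp1252_bytes is hypothetical, and the sketch assumes the mbstring extension.

```php
<?php
// Hypothetical helper: replace stray Windows-1252 bytes (0x80-0x9F) in text
// that is otherwise UTF-8, without touching bytes inside valid sequences.
function fix_stray_cp1252_bytes(string $text): string
{
    // Byte-level pattern (no /u flag). Group 1 consumes runs of well-formed
    // UTF-8; the last alternative only sees bytes outside such runs.
    $pattern = '/((?:[\x00-\x7F]'
        . '|[\xC2-\xDF][\x80-\xBF]'
        . '|\xE0[\xA0-\xBF][\x80-\xBF]'
        . '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'
        . '|\xED[\x80-\x9F][\x80-\xBF]'
        . '|\xF0[\x90-\xBF][\x80-\xBF]{2}'
        . '|[\xF1-\xF3][\x80-\xBF]{3}'
        . '|\xF4[\x80-\x8F][\x80-\xBF]{2})++)'
        . '|[\x80-\x9F]/';

    $result = preg_replace_callback(
        $pattern,
        function (array $m): string {
            if (($m[1] ?? '') !== '') {
                return $m[0]; // valid UTF-8 run: keep as is
            }
            // Stray byte: reinterpret it as Windows-1252.
            return mb_convert_encoding($m[0], 'UTF-8', 'Windows-1252');
        },
        $text
    );
    return $result ?? $text; // fall back to the input if PCRE reports an error
}

// A valid em dash followed by a bad Windows em dash: both come out valid.
echo bin2hex(fix_stray_cp1252_bytes("\u{2014}" . chr(151))), "\n";
```

The approach has an inherent ambiguity: a stray byte that happens to follow a lead byte such as 0xC3 forms a valid two-byte sequence and is left alone, so mixed input can still be misread.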
Related: What is the difference between EM Dash &#151; and &#8212;?
This is much more complicated than I thought it was when I wrote my answer.
Gumbo has updated his answer to a very similar question, so just read that:
How can I convert HTML character references (ף) to regular UTF-8?