How do I fix invalid HTML characters in pages that use a different encoding?

Posted on 2024-09-26 06:04:06

I have a number of websites that are rendering invalid characters. The pages' meta tags specify UTF-8 encoding. However, a number of pages contain bytes that are not valid UTF-8, probably because the files were saved with another encoding (such as ANSI). The one I'm particularly concerned about right now is a fancy apostrophe (as in "Bob’s"; sorry if that doesn't show up correctly). W3's validator reports the byte as "\x92", but it won't validate the file because that byte doesn't map to Unicode. And, of course, if I open the file in Notepad++ and change the encoding to UTF-8, the character is replaced by a 92 in a black box.

Here's my question: what's the easiest way to fix this? Do I have to open all the pages and replace that character with a conventional apostrophe? Or is there a quick fix I could add (say, to IIS) that might override or fix the encoding issue? Or do I have to brute-force find/replace? I have hundreds of pages on these websites and I have no idea how many of them I'd have to change, so if anyone knows a way I could either circumvent this problem or fix it quickly I would appreciate it.

Comments (4)

很酷不放纵 2024-10-03 06:04:06

Are you serving the pages as straight HTML, or do you have a script serving the content? If a script is serving the content, that script could simply look for any instance of \x92 and replace it with an apostrophe. In PHP this would be a simple str_replace() call.
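For illustration, here is the same idea as a minimal sketch in Python rather than PHP. The function name `replace_stray_apostrophe` is made up for this example, and it assumes the page body is available as raw bytes before it is sent. Note the caveat: a blind byte replace is only safe if the page contains no valid multi-byte UTF-8, because 0x92 can legitimately appear as a continuation byte (for example in C5 92, which is "Œ").

```python
def replace_stray_apostrophe(body: bytes) -> bytes:
    """Replace the raw Windows-1252 byte 0x92 with a plain ASCII apostrophe.

    Assumption: the body holds no valid multi-byte UTF-8 sequences, since
    0x92 may occur inside those as a continuation byte.
    """
    return body.replace(b"\x92", b"'")


page = b"<p>Bob\x92s page</p>"
print(replace_stray_apostrophe(page))  # b"<p>Bob's page</p>"
```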

If you're serving straight HTML, then you'll have to modify the files themselves. This can be automated, however (and probably should be if you have hundreds of files), depending on which tools you have available and which operating system you're on. Since you said you're using Notepad++, I suppose it's safe to assume you're on MS Windows (so no fun Unix commands to speed things up).

It may be possible to write a batch script that does this, however. There are very simple ASCII text editing tools built into Command Prompt. If that doesn't work, it's entirely feasible to write a C or C++ program to do this, provided you have a compiler on your system and a moderate knowledge of C. If you have the former and not the latter, ask and I'll whip up some source for you.

柏林苍穹下 2024-10-03 06:04:06

I'm not sure about the encoding part of it myself, but if you wind up having to do it by brute force, you could always write a short program that iterates through all of your web pages, loads each file into memory, runs a regex.replace to fix the problem character, and saves the file back to disk. Obviously not ideal but better than opening each file on your own.

Good Luck
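A short program of the kind described above might look like the following Python sketch. Instead of a regex replace, it swaps in a whole-file transcode: any file that fails to decode as UTF-8 is assumed to have been saved entirely as Windows-1252 ("ANSI") and is rewritten as UTF-8. That assumption, the `*.html` pattern, and the names `fix_file` and `fix_tree` are all illustrative and not taken from the question. Back up the site before running anything like this.

```python
from pathlib import Path
from typing import List


def fix_file(path: Path) -> bool:
    """Re-save one file as UTF-8 if it is not valid UTF-8 already.

    Assumption: an invalid file was saved entirely as Windows-1252.
    Returns True if the file was rewritten.
    """
    raw = path.read_bytes()
    try:
        raw.decode("utf-8")
        return False  # already valid UTF-8, leave it untouched
    except UnicodeDecodeError:
        pass
    # Bytes that cp1252 leaves undefined become U+FFFD instead of raising.
    text = raw.decode("cp1252", errors="replace")
    path.write_bytes(text.encode("utf-8"))
    return True


def fix_tree(root: str, pattern: str = "*.html") -> List[Path]:
    """Walk root recursively and return the files that were rewritten."""
    return [p for p in sorted(Path(root).rglob(pattern)) if fix_file(p)]


# Example call (the path is hypothetical):
# for changed in fix_tree(r"C:\inetpub\wwwroot"):
#     print("rewrote", changed)
```

Working on bytes rather than text keeps line endings intact, and the returned list shows exactly which pages were touched.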

终难遇 2024-10-03 06:04:06

I just ran into a similar issue where some non-breaking spaces ("xA0") got into a supposedly UTF-8 document. In Notepad++ these are displayed as a black box with "xA0" written in it. However, Notepad++ doesn't allow them to be copied or pasted.

I did a little research and found out what is going on. A hex editor reveals that these are being encoded as a single byte: "A0" which is invalid UTF-8. Anything not ASCII should be at least two bytes, so the proper encoding is "C2 A0" in hexadecimal.

For your fancy apostrophe, you are dealing with the same thing, though your case is a bit more complicated. In Windows-1252 (the "extended ASCII" code page that Windows calls ANSI), byte \x92 (decimal 146) is a curly apostrophe, but in Unicode U+0092 is a control character; the right single quotation mark is U+2019 (decimal 8217). Adding this symbol in Notepad++ (via Edit -> Character Panel) and inspecting the file in a hex editor shows that its proper UTF-8 encoding is "E2 80 99", which in binary is 11100010 10000000 10011001. Stripping the UTF-8 marker bits (the leading 1110 of the first byte and the leading 10 of each continuation byte) leaves 0010 000000 011001, i.e. 0010 0000 0001 1001, which is 0x2019, or 8217 in decimal.
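The arithmetic above can be checked with a few lines of Python. This is only a worked example; the variable names are arbitrary.

```python
quote = "\u2019"                 # RIGHT SINGLE QUOTATION MARK
encoded = quote.encode("utf-8")
print(encoded.hex(" "))          # e2 80 99

# Strip the marker bits: 1110xxxx 10xxxxxx 10xxxxxx
b0, b1, b2 = encoded
code_point = ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)
print(code_point, hex(code_point))  # 8217 0x2019

# The stray byte 0x92 is that same character in Windows-1252.
print(b"\x92".decode("cp1252") == quote)  # True
```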

The proper way to handle this is to open your file as a byte stream (unsigned char * in C) and search for invalid UTF-8 sequences. You can then either replace them with � (U+FFFD, the replacement character; see https://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences) or handle them case by case, with replacements such as A0 -> C2 A0 (improperly encoded non-breaking space) and 92 -> E2 80 99 (improperly encoded right single quotation mark).
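As a sketch of that approach in Python rather than C: the decoder below treats the data as UTF-8 and, only where it hits an invalid sequence, reinterprets the offending bytes as Windows-1252. That covers both A0 -> C2 A0 and 92 -> E2 80 99 without a hand-written table. The handler name `cp1252_fallback` and the function `repair_utf8` are invented for this example, and the choice of Windows-1252 as the fallback is an assumption about where the stray bytes came from.

```python
import codecs


def _cp1252_fallback(err):
    """Decode error handler: reinterpret invalid UTF-8 bytes as Windows-1252."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # Bytes that cp1252 leaves undefined (e.g. 0x81) become U+FFFD.
    return bad.decode("cp1252", errors="replace"), err.end


# Register under a made-up name so it can be passed as errors=...
codecs.register_error("cp1252_fallback", _cp1252_fallback)


def repair_utf8(raw: bytes) -> bytes:
    """Return valid UTF-8, keeping well-formed sequences untouched."""
    return raw.decode("utf-8", errors="cp1252_fallback").encode("utf-8")


print(repair_utf8(b"Bob\x92s"))  # b'Bob\xe2\x80\x99s'
print(repair_utf8(b"a\xa0b"))    # b'a\xc2\xa0b'
```

Unlike a blind byte replace, this leaves already-valid multi-byte sequences alone, so it can be run on files that mix correct UTF-8 with stray single bytes.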

别再吹冷风 2024-10-03 06:04:06

All special characters should be HTML encoded, e.g. a copyright symbol should appear in your HTML as

&copy;

HTML entity list:

http://www.w3schools.com/HTML/html_entities.asp

How you implement this depends largely on how you are creating the markup in the first place, but something like ASP.Net will have server-side functions such as:

Server.HTMLEncode("string with special chars")
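For comparison, a rough Python counterpart using only the standard library is sketched below. The function name `to_ascii_html` is made up for this example. It is meant for text content only, not for whole pages, because it would also escape the angle brackets of real tags.

```python
import html


def to_ascii_html(text: str) -> str:
    """Escape &, < and >, then turn every non-ASCII character into a
    numeric character reference so the result is pure ASCII."""
    escaped = html.escape(text, quote=False)
    return escaped.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


print(to_ascii_html("Bob\u2019s \u00a9 & Co"))  # Bob&#8217;s &#169; &amp; Co
```

Because the output is pure ASCII, it displays the same whichever encoding the page declares.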