HTML encoding issues - "Â" character showing up instead of "&nbsp;"

Posted on 2024-08-05 06:47:36

I've got a legacy app that has just started to misbehave, and I'm not sure why. It generates a bunch of HTML that gets turned into PDF reports by ActivePDF.

The process works like this:

  1. Pull an HTML template from a DB with tokens in it to be replaced (e.g. "~CompanyName~", "~CustomerName~", etc.)
  2. Replace the tokens with real data
  3. Tidy the HTML with a simple regex function that properly formats HTML tag attribute values (ensures quotation marks, etc., since ActivePDF's rendering engine hates anything but single quotes around attribute values)
  4. Send off the HTML to a web service that creates the PDF.

Somewhere in that mess, the non-breaking spaces from the HTML template (the &nbsp;s) are encoding as ISO-8859-1 so that they show up incorrectly as an "Â" character when viewing the document in a browser (Firefox). ActivePDF pukes on these non-UTF8 characters.

My question: since I don't know where the problem stems from and don't have time to investigate it, is there an easy way to re-encode or find-and-replace the bad characters? I've tried sending it through this little function I threw together, but it doesn't change anything.

Private Shared Function ConvertToUTF8(ByVal html As String) As String
    Dim isoEncoding As Encoding = Encoding.GetEncoding("iso-8859-1")
    Dim source As Byte() = isoEncoding.GetBytes(html)
    Return Encoding.UTF8.GetString(Encoding.Convert(isoEncoding, Encoding.UTF8, source))
End Function

Any ideas?

EDIT:

I'm getting by with this for now, though it hardly seems like a good solution:

Private Shared Function ReplaceNonASCIIChars(ByVal html As String) As String
    Return Regex.Replace(html, "[^\u0000-\u007F]", " ")
End Function

Comments (8)

猫卆 2024-08-12 06:47:36

Somewhere in that mess, the non-breaking spaces from the HTML template (the &nbsp;s) are encoding as ISO-8859-1 so that they show up incorrectly as an "Â" character

That'd be encoding to UTF-8 then, not ISO-8859-1. The non-breaking space character is byte 0xA0 in ISO-8859-1; when encoded to UTF-8 it'd be 0xC2,0xA0, which, if you (incorrectly) view it as ISO-8859-1, comes out as "Â" followed by a non-breaking space. That trailing nbsp is easy to miss; if that byte isn't there, then something else has mauled your document and we need to see further up to find out what.
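
The byte arithmetic above can be checked directly. The following is a minimal sketch in Python (not the VB.NET of the question, and the function names are made up for illustration): it encodes U+00A0 as UTF-8, misreads those bytes as ISO-8859-1, and then reverses the misreading.

```python
NBSP = "\u00a0"  # U+00A0 NO-BREAK SPACE


def misread_utf8_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then (incorrectly) decode the bytes as ISO-8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")


def repair_latin1_misreading(garbled: str) -> str:
    """Undo the misreading: recover the original bytes, then decode them as UTF-8."""
    return garbled.encode("iso-8859-1").decode("utf-8")


if __name__ == "__main__":
    print(NBSP.encode("iso-8859-1").hex())  # single byte a0
    print(NBSP.encode("utf-8").hex())       # two bytes c2 a0
    garbled = misread_utf8_as_latin1(NBSP)
    print(repr(garbled))                    # U+00C2 ("Â") followed by U+00A0
    print(repair_latin1_misreading(garbled) == NBSP)
```

The repair only round-trips when the text really was UTF-8 bytes read as ISO-8859-1; applying it to text that was never garbled this way will usually raise an error or corrupt it further.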

What's the regexp, how does the templating work? There would seem to be a proper HTML parser involved somewhere if your &nbsp; strings are (correctly) being turned into U+00A0 NON-BREAKING SPACE characters. If so, you could just process your template natively in the DOM, and ask it to serialise using the ASCII encoding to keep non-ASCII characters as character references. That would also stop you having to do regex post-processing on the HTML itself, which is always a highly dodgy business.
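
To illustrate what "serialise using the ASCII encoding" produces, here is a small Python sketch. It is not a DOM serialiser and not the asker's VB.NET code; it only uses Python's built-in `xmlcharrefreplace` error handler as a stand-in, and `to_ascii_html` is a hypothetical helper name.

```python
def to_ascii_html(html: str) -> bytes:
    """Encode HTML text as pure ASCII bytes.

    Any character outside ASCII is written as a numeric character
    reference (for example U+00A0 becomes &#160;), so the output is
    read the same way whatever charset the consumer assumes.
    """
    return html.encode("ascii", errors="xmlcharrefreplace")


if __name__ == "__main__":
    sample = "<td>Total:\u00a0\u00a35</td>"
    print(to_ascii_html(sample))
```

Because the output contains only ASCII bytes, it does not matter whether a consumer such as a browser or a PDF converter guesses ISO-8859-1 or UTF-8.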

Well anyway, for now you can add one of the following to your document's <head> and see if that makes it look right in the browser:

  • for HTML4: <meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
  • for HTML5: <meta charset="utf-8">

If you've done that, then any remaining problem is ActivePDF's fault.

原野 2024-08-12 06:47:36

If anyone has the same problem as me and the charset is already correct, simply do this:

  1. Copy all the code inside the .html file.
  2. Open Notepad (or any basic text editor) and paste the code.
  3. Go to "File -> Save As".
  4. Enter your file name, "example.html" (select "Save as type: All Files (*.*)").
  5. Select UTF-8 as the encoding.
  6. Hit Save. You can now delete your old .html file, and the encoding should be fixed.
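
The same re-save can be scripted. This is a minimal Python sketch, assuming the file was originally saved in Windows-1252 (that encoding is a guess and has to match how the file was really written); `resave_as_utf8` is a hypothetical helper name.

```python
from pathlib import Path


def resave_as_utf8(path: str, source_encoding: str = "cp1252") -> None:
    """Rewrite a text file in place as UTF-8.

    source_encoding is an assumption about how the file was saved.
    If it is wrong, decoding may raise UnicodeDecodeError or produce
    garbled text, so keep a backup of the original file.
    """
    file_path = Path(path)
    # Work on raw bytes so that line endings are left untouched.
    text = file_path.read_bytes().decode(source_encoding)
    file_path.write_bytes(text.encode("utf-8"))
```

Calling `resave_as_utf8("example.html")` converts the file in place. Unlike some editors, this writes no byte order mark.
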
伪装你 2024-08-12 06:47:36

Problem:
I faced a similar problem: we were sending '£' with some string in a POST request to a CRM system, but when we made the GET call to the CRM, it returned 'Â£' with the string content. So what we found is that '£' was getting converted to 'Â£'.

Analysis:
After some research, we found the cause: in the POST call we had set the HttpWebRequest ContentType to "text/xml", while in the GET call it was "text/xml; charset=utf-8".

Solution:
So as part of the solution we included charset=utf-8 in the POST request's ContentType, and it works.
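
As a rough illustration of the fix, here is a Python sketch using the standard library's `urllib.request` rather than the .NET `HttpWebRequest` from this answer. The URL and the `build_xml_post` helper are hypothetical, and the request is only constructed, never sent.

```python
import urllib.request


def build_xml_post(url: str, xml_text: str) -> urllib.request.Request:
    """Build a POST request whose body and declared charset agree."""
    body = xml_text.encode("utf-8")  # the bytes really are UTF-8 ...
    return urllib.request.Request(
        url,
        data=body,
        # ... and the header says so, so the receiver need not guess.
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_xml_post("http://example.invalid/crm", "<price>\u00a35</price>")
    print(request.get_header("Content-type"))
    print(request.data)
```

Without the charset parameter, a receiver may fall back to a single-byte encoding such as ISO-8859-1, and the two UTF-8 bytes of '£' (0xC2, 0xA3) are then read as 'Â£'.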

℡寂寞咖啡 2024-08-12 06:47:36

In my case this (an "a" with a circumflex, Â) occurred in code I generated from Visual Studio using my own code-generation tool. It was easy to solve:

Select a single space ( ) in the document. You should be able to see lots of single spaces that look different from the other single spaces; they are not selected. Select one of these other single spaces: they are the ones responsible for the unwanted characters in the browser. Go to Find and Replace and replace them with a single space ( ). Done.

PS: It's easier to see all similar characters when you place the cursor on one or if you select it in VS2017+; I hope other IDEs may have similar features
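
The same find-and-replace can be done outside an IDE. A minimal Python sketch, assuming the odd-looking spaces are U+00A0 NO-BREAK SPACE characters (the answer does not say which character they were), with `replace_nbsp` as a hypothetical helper name:

```python
NBSP = "\u00a0"  # U+00A0 NO-BREAK SPACE, assumed to be the odd-looking space


def replace_nbsp(text: str) -> tuple[int, str]:
    """Return how many no-break spaces were found and the cleaned text."""
    return text.count(NBSP), text.replace(NBSP, " ")


if __name__ == "__main__":
    found, cleaned = replace_nbsp("Dim\u00a0x As\u00a0Integer")
    print(found, repr(cleaned))
```

Note that this also removes no-break spaces that were intentional, so it suits generated source code better than page content.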

浸婚纱 2024-08-12 06:47:36

In my case I was getting a Latin cross sign instead of an nbsp, even though the page was correctly encoded as UTF-8. None of the above helped to resolve the issue, and I tried it all.

In the end, changing the font for IE (with browser-specific CSS) helped. I was using Helvetica Neue as the body font; changing it to Arial resolved the issue.

风铃鹿 2024-08-12 06:47:36

The reason for this is PHP doesn't recognise utf-8.

Here you can look up all the special characters in HTML:

http://www.degraeve.com/reference/specialcharacters.php

萌辣 2024-08-12 06:47:36

Well, I got this issue too on a few of my websites, and all I needed to do was customize the content filter for HTML entities. Before that, the more of them I deleted, the more I got. So I just changed the HTML filter (the parsing function) for the page, and it worked. It's mainly due to the HTML editors in most CMSs: the way they store and parse the data caused this issue (in my case). Maybe this will help in your case too.

囍孤女 2024-08-12 06:47:36

I was having the same sort of problem. Apparently it's simply because PHP doesn't recognise utf-8.

I was tearing my hair out at first when a '£' sign kept showing up as 'Â£', despite it appearing OK in Dreamweaver. Eventually I remembered I had been having problems with links relative to the index file: the pages would work with slideshows if viewed directly, but not when used with an include (but that's beside the point). Anyway, I wondered if this might be a similar problem, so instead of putting it into the page that I was having problems with, I simply put it into the index.php file. Problem fixed throughout.
