
KRL RSS parser: handling encoding problems?

Posted 2024-10-13 05:25:10

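I'm importing an RSS feed from Tumblr into a Kynetx app. The feed appears to have an encoding problem, because apostrophes show up like this:

[Image: incorrectly encoded apostrophe]

The feed (which you can find here) claims to be encoded as UTF-8.

Is there a way to specify the encoding, or to replace these characters with regular apostrophes?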


稚气少女 2024-10-20 05:25:10

While not optimal, you could try to catch these mis-decoded sequences and replace them with a plain apostrophe:

newstring = oldstring.replace(re/â€™/\'/);
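For readers who want to try the same targeted replacement outside KRL, here is a minimal Python sketch. The function name `replace_mangled_apostrophes` and the constant `MANGLED_APOSTROPHE` are hypothetical, chosen only for illustration. The sequence is written with Unicode escapes so the example itself cannot be damaged by an encoding mismatch.

```python
# The three characters a UTF-8 right single quote (U+2019) turns into when
# its bytes are read as Windows-1252: U+00E2, U+20AC, U+2122.
MANGLED_APOSTROPHE = "\u00e2\u20ac\u2122"


def replace_mangled_apostrophes(text: str) -> str:
    """Replace every mangled sequence with a plain ASCII apostrophe."""
    return text.replace(MANGLED_APOSTROPHE, "'")


# Hypothetical sample input, built from the escaped sequence above.
sample = "It" + MANGLED_APOSTROPHE + "s a feed"
print(replace_mangled_apostrophes(sample))
```

This only handles the one sequence seen in the feed; other curly punctuation would need its own entry.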

[Image: Windows special chars]

This appears to be a case of a service that specifies UTF-8 but doesn't explicitly enforce it. I uploaded an image of the RSS feed that you provided. For comparison, I cut and pasted the text into a Notepad document and then typed in the same text from my keyboard.

I don't know if you can tell from the image, but the apostrophe that is mangled is different from the apostrophe that is generated by my UTF-8 browser.

I suspect that this post was submitted via a Windows client. If you look at your encoding options, you will see an option for Western (Windows-1252).

Windows-1252 is a legacy Windows encoding that resembles ISO 8859-1, but it substitutes some of its own characters for the control characters in the ANSI standard and changes the codepage location of others.
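The mangling described above can be reproduced in a few lines of Python. This is only a sketch under the assumption that the feed's UTF-8 bytes were decoded as Windows-1252 somewhere along the way; the helper name `repair` is hypothetical. It also shows the difference from ISO 8859-1, where the same bytes map to invisible control characters instead of printable ones.

```python
curly = "\u2019"                    # right single quotation mark
raw = curly.encode("utf-8")         # the bytes a UTF-8 feed would carry
print(raw.hex())                    # e28099

# Windows-1252 maps bytes 0x80 and 0x99 to printable characters.
as_cp1252 = raw.decode("cp1252")
print(as_cp1252 == "\u00e2\u20ac\u2122")    # True

# ISO 8859-1 maps the same two bytes to C1 control characters.
as_latin1 = raw.decode("latin-1")
print(as_latin1 == "\u00e2\u0080\u0099")    # True


def repair(text: str) -> str:
    """Undo a UTF-8-read-as-Windows-1252 mix-up, if the text allows it."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Text was not mangled this way; return it unchanged.
        return text


print(repair("It" + as_cp1252 + "s") == "It\u2019s")    # True
```

Unlike a fixed replacement, the round trip restores the original curly apostrophe rather than an ASCII one, and it leaves text alone when the bytes do not form valid UTF-8.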

A couple of quotes from the Wikipedia page that I cite above:

It is very common to mislabel Windows-1252 text data with the charset label ISO-8859-1. Many web browsers and e-mail clients treat the MIME charset ISO-8859-1 as Windows-1252 characters in order to accommodate such mislabeling.
Many Microsoft programs, such as Word, will automatically substitute Windows-1252 characters when standard ASCII characters are entered, such as for "smart quotes" (e.g. substituting ’ for the apostrophe in a contraction) or substituting © for the three characters '(c)'.

KRL supports all of the language charsets supported by UTF-8, so it supports multi-byte international characters natively; however, that comes at the expense of the encoding fudging that is possible when you only have ISO-8859-1 or Windows-1252 to choose from.
