KRL RSS Parser: handling encoding problems?

Posted 2024-10-13 05:25:10

I'm importing an RSS feed from Tumblr into a Kynetx app. It appears that the RSS feed has some encoding issues, as apostrophes appear like this:

Apostrophes encoded incorrectly

The feed (which you can find here) claims to be encoded in UTF-8.

Is there a way to specify the encoding or else replace those characters with regular apostrophes?

Comments (1)

稚气少女 2024-10-20 05:25:10

While not optimal, you could try to catch these mis-encoded characters and replace them with a plain apostrophe:

newstring = oldstring.replace(re/’/\'/);
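For anyone who wants to try the same replacement idea outside KRL, here is a minimal Python sketch. It is an illustration only, not KRL code: the helper name `normalize_quotes` and the list of characters it maps are my own assumptions, and the feed may contain other characters that need the same treatment.

```python
# Hypothetical helper (not part of KRL or any library): map typographic
# quotation marks to their plain ASCII equivalents.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark (the "smart" apostrophe)
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})

def normalize_quotes(text: str) -> str:
    """Return text with curly quotes replaced by straight ones."""
    return text.translate(QUOTE_MAP)

print(normalize_quotes("It\u2019s a \u201ctest\u201d"))
```

A translation table is used rather than chained `replace` calls so that every mapping is applied in a single pass over the string.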

Windows special chars

This appears to be a case of a service that specifies UTF-8 but doesn't explicitly enforce it. I uploaded an image of the RSS feed that you provided. For comparison, I cut and pasted the text into a Notepad document and then typed in the same text from my keyboard.

I don't know if you can tell from the image, but the apostrophe that is mangled is different from the apostrophe that is generated by my UTF-8 browser.

I suspect that this post was submitted via a Windows client. If you look at your encoding options, you will see an option for Western (Windows-1252).
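To make the suspicion concrete, here is a small Python sketch of the two usual ways an apostrophe gets mangled between Windows-1252 and UTF-8. I have not confirmed which of the two the feed actually exhibits, so treat this as an assumption about the mechanism; `repair_mojibake` is a hypothetical helper name.

```python
# Case 1: a Windows-1252 "smart" apostrophe (byte 0x92) sitting in data
# that is labelled UTF-8. The lone byte is not valid UTF-8, so a lenient
# decoder substitutes U+FFFD, the replacement character.
cp1252_bytes = "it\u2019s".encode("cp1252")            # b"it\x92s"
read_as_utf8 = cp1252_bytes.decode("utf-8", errors="replace")

# Case 2: the reverse mistake. Correct UTF-8 bytes decoded as
# Windows-1252 turn the single apostrophe into three characters.
utf8_bytes = "it\u2019s".encode("utf-8")               # b"it\xe2\x80\x99s"
read_as_cp1252 = utf8_bytes.decode("cp1252")

def repair_mojibake(text: str) -> str:
    """Undo case 2 when possible; otherwise return the text unchanged."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except UnicodeError:
        return text

print(repr(read_as_utf8))
print(repr(read_as_cp1252))
print(repr(repair_mojibake(read_as_cp1252)))
```

The repair function only helps with the second case. In the first case the original byte has already been replaced, so the apostrophe has to be restored by other means, such as a substitution on the replacement character.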

Windows-1252 is a legacy Windows encoding that resembles ISO 8859-1, but it assigns printable characters of its own (curly quotes, among others) to the byte range 0x80-0x9F, which ISO 8859-1 reserves for control characters.
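The difference between the two code pages is easy to check with Python's built-in codecs. This is a sketch for illustration only; `differing_bytes` is a hypothetical helper name, and the result depends on Python's `cp1252` codec, which leaves a handful of bytes unassigned.

```python
# The byte Windows-1252 uses for the curly apostrophe.
apostrophe_byte = b"\x92"
as_cp1252 = apostrophe_byte.decode("cp1252")    # right single quotation mark
as_latin1 = apostrophe_byte.decode("latin-1")   # a C1 control character

def differing_bytes() -> list[int]:
    """List every byte value that the two code pages decode differently."""
    diffs = []
    for value in range(256):
        raw = bytes([value])
        try:
            if raw.decode("cp1252") != raw.decode("latin-1"):
                diffs.append(value)
        except UnicodeDecodeError:
            # Byte is unassigned in Python's cp1252 codec.
            diffs.append(value)
    return diffs

print(repr(as_cp1252), repr(as_latin1))
print([hex(v) for v in differing_bytes()])
```

Every byte outside that one block decodes identically in both encodings, which is why mislabelled text mostly looks fine until a smart quote or similar character shows up.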

A couple of quotes from the Wikipedia page that I cite above:

It is very common to mislabel Windows-1252 text data with the charset label ISO-8859-1. Many web browsers and e-mail clients treat the MIME charset ISO-8859-1 as Windows-1252 characters in order to accommodate such mislabeling

Many Microsoft programs, such as Word will automatically substitute Windows-1252 characters when standard ASCII characters are entered, such as for "smart quotes" (e.g. substituting ’ for the apostrophe in a contraction) or substituting © for the three characters '(c)'.

KRL supports all of the language charsets supported by UTF-8, so it handles multi-byte international characters natively; however, that comes at the expense of the encoding fudging that is possible when you only have ISO-8859-1 or Windows-1252 to choose from.
