Ruby string encoding

Posted on 2024-08-23 14:28:00


So, I'm trying to do some screen scraping off of a certain site using nokogiri, but the site owners failed to specify the proper encoding of the page in a <meta> tag. The upshot of this is that I'm trying to deal with strings that think they're utf-8, but really aren't.

(If you care, here are the files I was using to test this:

)

After doing a lot of searching around (this SO question was particularly useful), I found that calling encode('iso-8859-1', 'utf-8') on that test string "works", in that I get a proper © symbol. The issue now is that there are other characters in some other strings I want that really do not work at being converted to latin encoding (Shōta, for instance, turns into Sh�\x8Dta).

Now, I'm probably going to bother the appropriate webmasters and try and get them to fix their damn encodings, but in the meantime, I'd like to be able to use the bytes that I've got. I'm fairly certain that there is a way, but I just can't for the life of me figure out what it is.
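For the record, the round trip the question describes can be reversed in plain Ruby when the mis-decoding was lossless, i.e. no bytes were turned into the U+FFFD replacement character along the way (once a `�` appears, as in the `Sh�\x8Dta` case above, that information is gone). A minimal sketch, using a hand-built mojibake string as the input:

```ruby
# Mojibake as produced when the UTF-8 bytes of "Shōta"
# (53 68 C5 8D 74 61) are mistakenly decoded as ISO-8859-1:
# C5 becomes U+00C5 ("Å") and 8D becomes the C1 control U+008D.
mojibake = "Sh\u00C5\u008Dta"

# Re-encoding to ISO-8859-1 maps each codepoint back to its original
# byte, and force_encoding relabels those bytes as the UTF-8 they
# really are, without transcoding them again.
fixed = mojibake.encode("iso-8859-1").force_encoding("utf-8")
# fixed == "Shōta"
```

Note the difference from the question's `encode('iso-8859-1', 'utf-8')`: the key step is `force_encoding`, which changes the label on the bytes rather than converting them.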


Comments (2)

淡淡绿茶香 2024-08-30 14:28:00


Those pages appear to be correctly encoded as UTF-8. That's how my browser sees them, and when I view their source and tell the editor to decode them as UTF-8, they look fine. The only problem I see is that some copyright symbols seem to have been corrupted before (or as) they were added to the content. The o-macron and other non-ASCII letters come through just fine.

I don't know if you're aware of this, but the proper way to notify clients of a page's encoding is through a header. Pages may include that information in <meta> tags, but that's neither required nor expected; browsers typically ignore such tags if the header is present.

Since your pages are XHTML, they could also embed the encoding information in an XML processing instruction, but again, they're not required to. But it also means you could have Nokogiri treat them as XML instead of HTML, in which case I would expect it to use UTF-8 by default. But I'm not familiar with Nokogiri, so I can't be sure. And anyway, the header is still the final authority.
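Since the header is the authority, the charset can be read off the HTTP `Content-Type` header and handed to the parser. A small sketch of that extraction step (the helper name and regex are my own, not from any library):

```ruby
# Hypothetical helper: extract the charset parameter from an HTTP
# Content-Type header value such as "text/html; charset=UTF-8".
# Returns nil when the header carries no charset.
def charset_from_content_type(value)
  value && value[/charset=["']?([\w.-]+)/i, 1]
end

charset_from_content_type("text/html; charset=UTF-8")  # => "UTF-8"
charset_from_content_type("application/xhtml+xml")     # => nil
```

The returned name could then be passed straight to Nokogiri as the encoding argument described in the next answer.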

情场扛把子 2024-08-30 14:28:00


So, the issue is that ANN only specifies encoding via headers, and Nokogiri doesn't receive the headers from the open() function. So, Nokogiri guesses that the page is latin-encoded, and produces strings from which we really can't recover the original characters.

You can specify the encoding to Nokogiri as the 3rd parameter to Nokogiri::HTML(), which solves the issue I was initially trying to solve. So, I'll accept this answer, even though the more specific question I asked (how to get those non-latin characters out of a latin string) is unanswerable.
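A sketch of that fix. The raw bytes returned by a network read arrive untagged (ASCII-8BIT), so the parser has to guess unless told; the Nokogiri call is shown only in a comment, since the gem may not be installed here, and the equivalent pure-Ruby step is relabeling the byte string before parsing:

```ruby
# Raw bytes as a network read would return them: .b yields an
# ASCII-8BIT copy, mimicking an untagged response body.
html = "<html><head><title>Sh\u014Dta</title></head></html>".b

# Either tell Nokogiri explicitly (the 3rd parameter is the encoding):
#   doc = Nokogiri::HTML(html, nil, "UTF-8")
# ...or relabel the byte string before handing it to the parser:
html.force_encoding("utf-8")
```

Both routes avoid the bad Latin-1 guess; the relabeled string parses with its ō intact.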
