Incompatible encodings with Ruby and Nokogiri HTML
I'm parsing an external HTML page with Nokogiri. That page is encoded with ISO-8859-1. Part of the data I want to extract contains some – (dash) HTML entities:
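require 'nokogiri'
require 'open-uri'
# URI.open is the open-uri entry point on current Rubies; bare open(url) only works on older ones.
xml = Nokogiri.HTML(URI.open("http://flybynight.com.br/agenda.php"), nil, 'ISO-8859-1')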
f = xml.xpath("//div[@style='background-color:#D9DBD9; padding:15px 12px 10px 10px;']//div[@class='tit_inter_cnz']/text()")
f[0].text #=> Preview M/E/C/A \u0096 John Digweed
In the last line, the String should be rendered in the browser with a dash. The browser renders it correctly if I declare my page as ISO-8859-1; however, my Sinatra app uses UTF-8. How can I correctly display that text in the browser? Currently it is displayed as a square with a small number inside.
I tried force_encoding('ISO-8859-1'), but then I get a CompatibilityError from Sinatra.
Any clues?
[Edit]
Below are screenshots of the app:
-> Firefox with character encoding UTF-8
-> Firefox with character encoding Western (ISO-8859-1)
It's worth mentioning that in the ISO-8859-1 mode above, the dash is shown correctly, but there is another incorrect character just before it. Weird :(
After parsing a document in Nokogiri you can tell it to assume a different encoding. Try:
I can't reach that page from here to confirm this fixes the problem, but it has worked for similar problems.
Summary: The problematic characters are control characters from ISO-8859-1, not intended for display.
Details and Investigation:
Here's a test showing that you are getting valid UTF-8 from Nokogiri and Sinatra:
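A rough offline stand-in for such a test, assuming the byte in question is 0x96 (as this answer reports further down) and leaving out Nokogiri, Sinatra, and the network. It only checks what Ruby's own transcoding does with that byte:

```ruby
# The single byte as it arrives from the page, tagged with the encoding
# that was passed to Nokogiri in the question.
raw = [0x96].pack('C').force_encoding(Encoding::ISO_8859_1)

# Transcode to UTF-8, which is what the Sinatra app serves.
utf8 = raw.encode(Encoding::UTF_8)

puts utf8.valid_encoding?                                  # well-formed UTF-8?
puts utf8.bytes.map { |b| format('%02X', b) }.join(' ')    # bytes on the wire
puts format('U+%04X', utf8.ord)                            # resulting code point
```

The string is valid UTF-8, so nothing is technically broken in transit; the code point it carries is simply not a printable character.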
This properly serves up content with Content-Type:text/html;charset=utf-8 on my computer. Chrome does not show me this character in the browser, however.
Analyzing that response, the same UTF-8 byte pair comes back for the dash as is seen above: \xC2\x96. This is the encoding of Unicode character U+0096, which seems to be an odd dash.
I would chalk this up to bad source data, and simply throw:
at the top of your Ruby source file(s), and then put in:
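A minimal sketch of that approach, assuming the source-file line is a UTF-8 magic comment and that the stray character is U+0096 as analyzed above. The helper name `fix_dash` and the choice of an en dash as the replacement are hypothetical:

```ruby
# encoding: UTF-8
# (Magic comment: it only takes effect on the first line of a source file,
# or the second after a shebang. UTF-8 is already the default since Ruby 2.0.)

# Hypothetical helper: swap the C1 control character U+0096 for a real en dash.
def fix_dash(text)
  text.gsub("\u0096", "\u2013")
end

title = "Preview M/E/C/A \u0096 John Digweed"
puts fix_dash(title)
```

Replacing with the string "&ndash;" instead of "\u2013" would give the HTML-entity variant, provided the template does not escape the ampersand again.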
Edit: If you look at the browser test page for that character you will see (at least in Chrome and Firefox for me) that the UTF-8 literal version is blank, but the hex and decimal escape versions show up. I cannot fathom why this is, but there you have it. The browsers are simply not displaying your character correctly when it is presented in raw form.
Either make it an HTML entity, or a different Unicode dash. Either way a gsub is called for.
Edit #2: One more odd note: the character in the source encoding has a hexadecimal byte value of 0x96. As far as I can tell, this does not appear to be a printable ISO-8859-1 character. As shown in the official spec for ISO-8859-1, this falls in one of the two non-printing regions.
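One plausible explanation, not confirmed anywhere in this thread, is that the page is really encoded as Windows-1252. That encoding differs from ISO-8859-1 only in the 0x80-0x9F range, where it places printable characters, and there byte 0x96 is the en dash (U+2013). Browsers commonly treat a page labeled ISO-8859-1 as Windows-1252, which would fit the dash showing up in that mode. A small sketch comparing the two readings of the same byte:

```ruby
byte = [0x96].pack('C')  # the single byte found in the source page

# Same byte, two different claims about what encoding it is in.
as_latin1 = byte.dup.force_encoding('ISO-8859-1').encode('UTF-8')
as_cp1252 = byte.dup.force_encoding('Windows-1252').encode('UTF-8')

puts format('ISO-8859-1   -> U+%04X', as_latin1.ord)
puts format('Windows-1252 -> U+%04X', as_cp1252.ord)
```

If that assumption holds for the page, passing 'Windows-1252' rather than 'ISO-8859-1' as the encoding argument when parsing might make the gsub unnecessary; that is untested here.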
I work in publishing of scientific manuscripts and there are many dashes. The dash that you are using is not an ASCII dash, it is a Unicode dash. Forcing the ISO encoding is probably what makes the dash change.
http://www.fileformat.info/info/unicode/char/96/index.htm
That site is excellent for Unicode issues.
The reason you are getting a square is perhaps that your browser does not support this character. It is probably correctly rendered. I would keep the UTF-8 encoding, and if you want everyone to be able to see that dash, convert it to an ASCII dash.
You may want to try Iconv to convert the characters to ASCII/UTF-8 http://craigjolicoeur.com/blog/ruby-iconv-to-the-rescue
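Iconv was removed from Ruby's standard library in Ruby 2.0, and String#encode covers the same ground. A minimal sketch of the "convert it to an ASCII dash" idea, assuming the text contains U+0096 as in the question and that every non-ASCII character may be flattened to a hyphen:

```ruby
title = "Preview M/E/C/A \u0096 John Digweed"

# Any character with no US-ASCII equivalent is replaced by a plain hyphen.
ascii = title.encode('US-ASCII', undef: :replace, replace: '-')

puts ascii
puts ascii.encoding
```

This is a blunt tool: accented letters elsewhere on the page would become hyphens too, so a targeted gsub is usually the safer choice when only the dash is at fault.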