Incompatible encodings with Ruby and Nokogiri HTML

Posted 2024-10-14 12:50:39 · 941 words · 3 views · 0 comments

I'm parsing an external HTML page with Nokogiri. That page is encoded in ISO-8859-1. Part of the data I want to extract contains some – (dash) HTML entities:

require 'open-uri'
require 'nokogiri'

# URI.open comes from open-uri; plain Kernel#open no longer opens URLs in Ruby 3.0+
xml = Nokogiri.HTML(URI.open("http://flybynight.com.br/agenda.php"), nil, 'ISO-8859-1')
f = xml.xpath("//div[@style='background-color:#D9DBD9; padding:15px 12px 10px 10px;']//div[@class='tit_inter_cnz']/text()")
f[0].text #=> Preview M/E/C/A \u0096 John Digweed

In the last line, the string should be rendered in the browser with a dash. The browser renders it correctly if I serve my page as ISO-8859-1, but my Sinatra app uses UTF-8. How can I display that text correctly in the browser? Today it is being displayed as a square with a small number inside.
I tried force_encoding('ISO-8859-1'), but then I get a CompatibilityError from Sinatra.

Any clues?

[Edit]
Below are screenshots of the app:

-> Firefox with character encoding UTF-8

-> Firefox with character encoding Western (ISO-8859-1)

It's worth mentioning that in the ISO-8859-1 mode above the dash is shown correctly, but another, incorrect character appears just before it. Weird :(

征棹 2024-10-21 12:50:39

After parsing a document in Nokogiri you can tell it to assume a different encoding. Try:

require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML(URI.open("http://flybynight.com.br/agenda.php"), nil, 'ISO-8859-1')
doc.encoding = 'UTF-8'

I can't reach that page from here to confirm that this fixes the problem, but it has worked for similar problems.

撑一把青伞 2024-10-21 12:50:39

Summary: The problematic characters are control characters from ISO-8859-1, not intended for display.

Details and Investigation:
Here's a test showing that you are getting valid UTF-8 from Nokogiri and Sinatra:

require 'sinatra'
require 'open-uri'

get '/' do
  html = URI.open("http://flybynight.com.br/agenda.php").read
  p [ html.encoding, html.valid_encoding? ]
  #=> [#<Encoding:ISO-8859-1>, true]

  str  = html[ /Preview.+?John Digweed/ ]
  p [ str, str.encoding, str.valid_encoding? ]
  #=> ["Preview M/E/C/A \x96 John Digweed", #<Encoding:ISO-8859-1>, true]

  utf8 = str.encode('UTF-8')
  p [ utf8, utf8.encoding, utf8.valid_encoding? ]
  #=> ["Preview M/E/C/A \xC2\x96 John Digweed", #<Encoding:UTF-8>, true]

  require 'nokogiri'
  doc = Nokogiri.HTML(html, nil, 'ISO-8859-1')
  p doc.encoding
  #=> "ISO-8859-1"

  dig = doc.xpath("//div[@class='tit_inter_cnz']")[1]
  p [ dig.text, dig.text.encoding, dig.text.valid_encoding? ]
  #=> ["Preview M/E/C/A \xC2\x96 John Digweed", #<Encoding:UTF-8>, true]

  <<-ENDHTML
  <!DOCTYPE html>
  <html><head><title>Dig it!</title></head><body>
  <p>Here it comes...</p>
  <p>#{dig.text}</p>
  </body></html>
  ENDHTML
end

This properly serves up content with Content-Type: text/html;charset=utf-8 on my computer. Chrome does not show me this character in the browser, however.

Analyzing that response, the same UTF-8 byte pair comes back for the dash as is seen above: \xC2\x96. That pair decodes to the Unicode code point U+0096, which is an odd thing to find where a dash should be.
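To make the byte-level claim concrete, here is a small offline sketch. It uses the sample string copied from the output above rather than fetching the page, so it only illustrates the transcoding step, not what the live site returns.

```ruby
# Byte 0x96 as it appears in the ISO-8859-1 source (sample copied from the output above).
raw = "Preview M/E/C/A \x96 John Digweed".b   # .b returns a fresh binary copy

# Label the bytes as ISO-8859-1, then transcode to UTF-8.
latin1 = raw.force_encoding(Encoding::ISO_8859_1)
utf8   = latin1.encode(Encoding::UTF_8)

# Locate the single non-ASCII character and inspect it.
odd = utf8.chars.find { |c| !c.ascii_only? }

p odd.bytes.map { |b| format('%02X', b) }  #=> ["C2", "96"]
p format('U+%04X', odd.ord)                #=> "U+0096"
p utf8.valid_encoding?                     #=> true
```

So the string is perfectly valid UTF-8; it simply contains a control code point instead of a dash.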

I would chalk this up to bad source data, and simply throw:

#encoding: UTF-8

at the top of your Ruby source file(s), and then put in:

f = ...text.gsub( "\xC2\x96", "-" ) # Or a better Unicode character

Edit: If you look at the browser test page for that character you will see (at least in Chrome and Firefox for me) that the UTF-8 literal version is blank, but the hex and decimal escape versions show up. I cannot fathom why this is, but there you have it. The browsers are simply not displaying your character correctly when it is presented in raw form.

Either make it an HTML entity, or a different Unicode dash. Either way a gsub is called for.
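Both options could be sketched roughly as follows. The helper names are hypothetical, made up for this illustration, and the sample string stands in for the text extracted with Nokogiri.

```ruby
# U+0096 is the control character the page uses where a dash was meant.
C1_DASH = "\u0096"

# Hypothetical helper: replace it with a real Unicode en dash (U+2013).
def dash_as_unicode(text)
  text.gsub(C1_DASH, "\u2013")
end

# Hypothetical helper: replace it with an HTML entity instead.
def dash_as_entity(text)
  text.gsub(C1_DASH, '&ndash;')
end

sample = "Preview M/E/C/A \u0096 John Digweed"

puts dash_as_unicode(sample)
puts dash_as_entity(sample)  #=> Preview M/E/C/A &ndash; John Digweed
```

The entity version only makes sense if the string is written into the page unescaped, as in the heredoc above; a template that HTML-escapes its output would turn the ampersand into &amp;amp;, so the Unicode en dash is probably the safer choice there.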

Edit #2: One more odd note: the character in the source encoding has a hexadecimal byte value of 0x96. As far as I can tell, this does not appear to be a printable ISO-8859-1 character. As shown in the official spec for ISO-8859-1, this falls in one of the two non-printing regions.
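A plausible explanation, offered as an assumption and not verified against the page itself: in Windows-1252, byte 0x96 is the en dash, and pages labeled ISO-8859-1 are often really Windows-1252. Browsers decode the ISO-8859-1 label as Windows-1252, which would explain why the dash renders in "Western" mode. The sketch below decodes the same sample bytes both ways to show the difference.

```ruby
# Sample bytes copied from the output above; 0x96 is the byte in question.
raw = "Preview M/E/C/A \x96 John Digweed".b

# Decode the same bytes under each assumption, then transcode to UTF-8.
as_latin1 = raw.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
as_cp1252 = raw.dup.force_encoding(Encoding::Windows_1252).encode(Encoding::UTF_8)

p as_latin1.codepoints.include?(0x0096)  #=> true (C1 control character)
p as_cp1252.codepoints.include?(0x2013)  #=> true (EN DASH)
puts as_cp1252
```

If that assumption holds for this page, decoding the source as Windows-1252 instead of ISO-8859-1 may make the gsub unnecessary. I have not checked how Nokogiri handles that encoding name, so treat that part as untested.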
