open-uri returns ASCII-8BIT from a webpage encoded in iso-8859

Posted 2024-11-02 08:16:08, 551 characters, 5 views, 0 comments

I am using open-uri to read a webpage which claims to be encoded in iso-8859-1. When I read the contents of the page, open-uri returns a string encoded in ASCII-8BIT.

open("http://www.nigella.com/recipes/view/DEVILS-FOOD-CAKE-5310") {|f| p f.content_type, f.charset, f.read.encoding }
 => ["text/html", "iso-8859-1", #<Encoding:ASCII-8BIT>] 

I am guessing this is because the webpage contains the byte \x92, which does not map to a printable character in ISO-8859-1 (see http://en.wikipedia.org/wiki/ISO/IEC_8859-1).

I need to store webpages as UTF-8 encoded files. Any ideas on how to deal with webpages whose declared encoding is incorrect? I could catch the exception and try to guess the correct encoding, but that seems cumbersome and error-prone.

Comments (1)

流殇 2024-11-09 08:16:09
  • ASCII-8BIT is Ruby's binary encoding (Encoding::BINARY is an alias for it), so the string you got back is just raw bytes with no character encoding attached
  • open-uri does a funny thing: if the response is smaller than about 10 KB, it hands you a StringIO, and if it's bigger it hands you a Tempfile. That can be confusing if you're trying to deal with encoding issues.
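
To make the first point concrete, here is a small sketch of what the binary label means in practice. The sample bytes and the helper names (sample_bytes, relabel) are made up for illustration; only the Encoding constants and String methods come from Ruby itself.

```ruby
# Hypothetical sample: bytes as they might come off the wire, tagged binary.
def sample_bytes
  "It\x92s cake".b
end

# force_encoding only swaps the encoding label; the bytes are not converted.
def relabel(bytes, encoding)
  bytes.dup.force_encoding(encoding)
end

raw    = sample_bytes
latin1 = relabel(raw, Encoding::ISO_8859_1)

p Encoding::BINARY.equal?(Encoding::ASCII_8BIT) # both constants name the same encoding object
p raw.encoding == Encoding::BINARY              # the sample is tagged binary
p latin1.bytes == raw.bytes                     # relabelling leaves the bytes untouched
p latin1.valid_encoding?                        # every byte value is legal in ISO-8859-1
```

So a string tagged ASCII-8BIT is not damaged, it is only unlabelled. Turning it into UTF-8 still requires picking a source encoding and transcoding.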

If the files aren't huge, I would recommend manually loading them into strings:

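require 'uri'
require 'net/http' # HTTPS support is included; a separate require 'net/https' is not needed on current Rubies

# url_to_file is a String holding the address of the page to fetch
uri = URI.parse url_to_file

http = Net::HTTP.new(uri.host, uri.port)
if uri.scheme == 'https'
  http.use_ssl = true
  # possibly useful if you see ssl errors, but note that it turns off certificate verification
  # http.verify_mode = ::OpenSSL::SSL::VERIFY_NONE
end
# the response body comes back as raw bytes, with no transcoding applied
body = http.start { |session| session.get uri.request_uri }.body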

Then you can use the ensure-encoding gem (https://rubygems.org/gems/ensure-encoding):

require 'ensure/encoding'
utf8_body = body.ensure_encoding('UTF-8', :external_encoding => :sniff, :invalid_characters => :transcode)

I have been pretty happy with ensure-encoding... we use it in production at http://data.brighterplanet.com

Note that you can also say :invalid_characters => :ignore instead of :transcode.

Also, if you know the encoding somehow, you can pass :external_encoding => 'ISO-8859-1' instead of :sniff
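
If you would rather not add a gem, something similar can be sketched with the standard library alone. This is not how ensure-encoding works internally, just a rough alternative. The helper names (source_encoding_for, to_utf8) are hypothetical, and the code assumes that a page labelled ISO-8859-1 was really written in Windows-1252, where byte \x92 is a curly apostrophe. That is how browsers treat the label, but it remains a guess for any given page.

```ruby
# Hypothetical helpers built on the standard library only.

# Pick the encoding to decode from, given the charset the server declared.
def source_encoding_for(declared_charset)
  found = Encoding.find(declared_charset.to_s)
  # Assumption: pages labelled ISO-8859-1 usually contain Windows-1252 text.
  (found.nil? || found == Encoding::ISO_8859_1) ? Encoding::Windows_1252 : found
rescue ArgumentError
  # Unknown or missing label: fall back to Windows-1252 (also an assumption).
  Encoding::Windows_1252
end

# Return a valid UTF-8 copy of the raw bytes; anything that cannot be
# converted is replaced instead of raising an exception.
def to_utf8(bytes, declared_charset)
  bytes.dup
       .force_encoding(source_encoding_for(declared_charset))
       .encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
       .scrub # also cleans up input that was already labelled UTF-8
end

page = to_utf8("It\x92s cake".b, 'iso-8859-1')
puts page
```

Here the body string from the Net::HTTP snippet would take the place of the sample bytes, and the result can be written out with File.write. Replacing bad bytes is roughly comparable in spirit to the gem's :transcode option, though the two may differ in detail.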
