How do I convert a Net::HTTP response to a particular encoding in Ruby 1.9.1?

Posted 2024-07-29 03:13:54, 670 characters, 3 views, 0 comments

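I have a Sinatra application (http://analyzethis.espace-technologies.com) that does the following:

Retrieves an HTML page (via net/http)
Creates a Nokogiri document from response.body
Extracts some information and sends it back in the response, which should be UTF-8 encoded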

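I ran into this problem when trying to read sites that use the windows-1256 encoding, such as www.filfan.com or www.masrawy.com.

The problem is that the result of the encoding conversion is incorrect, yet no error is raised.

With net/http, response.body.encoding gives ASCII-8BIT, which cannot be converted to UTF-8.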
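A minimal sketch of why the direct conversion fails, assuming a body that holds Windows-1256 bytes. The sample bytes are illustrative and not taken from the sites mentioned above. ASCII-8BIT is Ruby's binary encoding, so encode has no source character set to map bytes above 0x7F from; relabelling the bytes with force_encoding first gives the transcoder something to work with.

```ruby
# Bytes for the Arabic word "سلام" in Windows-1256 (illustrative sample data).
# pack("C*") yields an ASCII-8BIT string, like the body Net::HTTP returns.
raw = [0xD3, 0xE1, 0xC7, 0xE3].pack("C*")

# Transcoding straight from binary fails for any byte above 0x7F.
direct_error = nil
begin
  raw.encode("UTF-8")
rescue Encoding::UndefinedConversionError => e
  direct_error = e
end

# Relabel the bytes with their real charset (no bytes change), then transcode.
utf8 = raw.dup.force_encoding("Windows-1256").encode("UTF-8")

puts raw.encoding        # ASCII-8BIT
puts direct_error.class  # Encoding::UndefinedConversionError
puts utf8.encoding       # UTF-8
```

Note that force_encoding only changes the label on the string, while encode rewrites the bytes, so the order of the two calls matters.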

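If I call Nokogiri::HTML(response.body) and use a CSS selector to get something from the page, say the content of the title tag, I get a string whose string.encoding returns WINDOWS-1256. I call string.encode("utf-8") and send the result in the response, but the response is still incorrect.

Any suggestions or ideas about what is wrong with my approach?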


清秋悲枫 2024-08-05 03:13:54

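This happens because Net::HTTP does not handle encodings correctly. See http://bugs.ruby-lang.org/issues/2567

Instead of parsing the whole response.body, you can parse response['content-type'], which contains the charset.

Then use force_encoding() to set the correct encoding:

response.body.force_encoding("UTF-8") (if the site is served as UTF-8).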
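The steps above could be sketched roughly as follows. Both helper names, charset_from_content_type and body_as_utf8, are made up for illustration, and falling back to UTF-8 when no charset is declared is an assumption rather than something the answer specifies.

```ruby
# Hypothetical helper: pull the charset parameter out of a Content-Type value
# such as "text/html; charset=windows-1256". Returns nil when none is declared.
def charset_from_content_type(content_type)
  return nil if content_type.nil?
  match = content_type.match(/charset\s*=\s*["']?([\w.:-]+)/i)
  match && match[1]
end

# Hypothetical helper: relabel the raw body with the declared charset and
# transcode it to UTF-8. Assumes UTF-8 when the header declares nothing.
# force_encoding raises ArgumentError if Ruby does not know the charset name.
def body_as_utf8(body, content_type)
  charset = charset_from_content_type(content_type) || "UTF-8"
  body.dup.force_encoding(charset).encode("UTF-8")
end

# Illustrative Windows-1256 bytes standing in for a real response body.
sample_body = [0xD3, 0xE1, 0xC7, 0xE3].pack("C*")
puts body_as_utf8(sample_body, "text/html; charset=windows-1256")
```

With a real response this would be called as body_as_utf8(response.body, response['content-type']). The dup keeps the original body untouched, which is a design choice of this sketch.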


听，心雨的声音 2024-08-05 03:13:54

I found that the following code works for me now:

def document
  # Parse lazily, and only once a response is available.
  if @document.nil? && response
    @document = if document_encoding
                  # Label the raw bytes with the detected charset, then transcode to UTF-8.
                  html = response.body.force_encoding(document_encoding).encode('utf-8')
                  Nokogiri::HTML(html, nil, 'utf-8')
                else
                  Nokogiri::HTML(response.body)
                end
  end
  @document
end

def document_encoding
  return @document_encoding if @document_encoding
  # Prefer the charset parameter of the Content-Type response header.
  response.type_params.each_pair do |k, v|
    @document_encoding = v.upcase if k =~ /charset/i
  end
  unless @document_encoding
    # Fall back to the <meta http-equiv="Content-Type"> tag in the raw body.
    # The character classes stop at the closing quote so trailing markup is not captured.
    if response.body =~ /<meta[^>]*http-equiv=["']Content-Type["'][^>]*content=["']([^"']*)["']/i &&
       $1 =~ /charset=([\w-]+)/i
      @document_encoding = $1.upcase
    end
  end
  @document_encoding
end
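The meta-tag fallback can also be tried on its own, without a live response or Nokogiri. This is a sketch only: charset_from_meta is a hypothetical name and the HTML snippet is made-up sample data.

```ruby
# Hypothetical helper: find the charset declared in a
# <meta http-equiv="Content-Type" content="..."> tag of a raw HTML body.
def charset_from_meta(html)
  # Isolate the meta tag first so the charset search cannot run past it.
  meta = html[/<meta[^>]*http-equiv=["']?Content-Type["']?[^>]*>/i]
  return nil unless meta
  charset = meta[/charset\s*=\s*["']?([\w-]+)/i, 1]
  charset && charset.upcase
end

# Made-up page, labelled as binary like a Net::HTTP body.
sample_html = '<html><head><meta http-equiv="Content-Type" ' \
              'content="text/html; charset=windows-1256" />' \
              '<title>x</title></head></html>'.force_encoding("ASCII-8BIT")

puts charset_from_meta(sample_html)
```

Because the pattern contains only ASCII characters, it can be matched against the binary body before any encoding has been chosen.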

