How can I make Mechanize automatically convert the body to UTF-8?

Posted on 2024-12-27 01:45:51

I found some solutions that use post_connect_hook and pre_connect_hook, but they don't seem to work. I'm using the latest Mechanize version (2.1). The new version has no [:response] fields, and I don't know where to get that data now.

Is it possible to make Mechanize return the body already encoded as UTF-8, rather than converting it manually with iconv?

Comments (4)

单身狗的梦 2025-01-03 01:45:51

Since Mechanize 2.0, the arguments passed to pre_connect_hooks() and post_connect_hooks() have changed.

See the Mechanize documentation:

pre_connect_hooks()

A list of hooks to call before retrieving a response. Hooks are called with the agent, the URI, the response, and the response body.

post_connect_hooks()

A list of hooks to call after retrieving a response. Hooks are called with the agent, the URI, the response, and the response body.

Because these values are now passed as separate arguments rather than in a single collection you can write back into, a hook can no longer replace the internal response body. The next best approach is to replace the internal parser with your own:

require 'mechanize'
require 'nkf' # only needed if you use the NKF example below

class MyParser
  # Receives the response body, the URL, and the encoding, then returns the parsed document.
  def self.parse(thing, url = nil, encoding = nil, options = Nokogiri::XML::ParseOptions::DEFAULT_HTML, &block)
    # Insert your conversion code here. For example:
    # thing = NKF.nkf("-wm0X", thing).sub(/Shift_JIS/,"utf-8") # you need to rewrite content charset if it exists.
    Nokogiri::HTML::Document.parse(thing, url, encoding, options, &block)
  end
end

agent = Mechanize.new
agent.html_parser = MyParser # use the custom parser for every HTML page
page = agent.get('http://somewhere.com/')
...
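
The conversion step hinted at in the comment above can be tried on its own. The following is a minimal sketch that uses Ruby's built-in String#encode in place of NKF, so it is not the exact call from the answer. The helper name shift_jis_to_utf8 and the sample markup are assumptions made for illustration, and no Mechanize is involved.

```ruby
# Hypothetical helper (not part of Mechanize or Nokogiri): convert a Shift_JIS
# HTML body to UTF-8 and update the charset named in the markup, so that a
# parser does not try to decode the converted text as Shift_JIS again.
def shift_jis_to_utf8(html)
  converted = html.encode('UTF-8', 'Shift_JIS', invalid: :replace, undef: :replace)
  converted.sub(/Shift_JIS/i, 'utf-8')
end

# Sample document, transcoded to Shift_JIS bytes only for this demonstration.
sjis_html = '<meta charset="Shift_JIS"><p>日本語</p>'.encode('Shift_JIS')

utf8_html = shift_jis_to_utf8(sjis_html)
puts utf8_html.encoding # UTF-8
puts utf8_html
```

A helper like this would be called where the "insert your conversion code here" comment sits, before handing the body to Nokogiri.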
野心澎湃 2025-01-03 01:45:51

I found a solution that works pretty well:

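require 'mechanize'

class HtmlParser
  def self.parse(body, url, encoding)
    # Transcode the body in place to UTF-8, dropping invalid or unmappable characters.
    body.encode!('UTF-8', encoding, invalid: :replace, undef: :replace, replace: '')
    # The body is now UTF-8, so tell Nokogiri to parse it as such.
    Nokogiri::HTML::Document.parse(body, url, 'UTF-8')
  end
end

Mechanize.new.tap do |web|
  web.html_parser = HtmlParser
end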

I haven't run into any issues with it so far.
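
To see what the encode! call in the parser above does in isolation, here is a minimal sketch that uses only Ruby's standard String API, without Mechanize or Nokogiri. The helper name to_utf8 and the sample bytes are assumptions made for illustration.

```ruby
# Hypothetical helper (not part of Mechanize): transcode a raw body to UTF-8,
# dropping bytes that are invalid in the source encoding or have no UTF-8 mapping.
def to_utf8(body, source_encoding)
  body.encode('UTF-8', source_encoding, invalid: :replace, undef: :replace, replace: '')
end

# "café" as ISO-8859-1 bytes: the "é" is the single byte 0xE9.
latin1 = "caf\xE9".b

utf8 = to_utf8(latin1, 'ISO-8859-1')
puts utf8.encoding        # UTF-8
puts utf8.valid_encoding? # true
puts utf8
```

This version returns a new string instead of modifying the body in place, which is a design choice of the sketch rather than of the answer's parser.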

浊酒尽余欢 2025-01-03 01:45:51

In your script, just set: page.encoding = 'utf-8'

However, depending on your scenario, you may instead need to set the encoding that the website itself uses. To find it, open the website in Firefox, select Tools in the menu bar, and open Page Info, which shows the page's encoding.

With that information, set the page's actual encoding instead (for example, page.encoding = 'windows-1252').
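
Why the declared encoding has to match the bytes the site actually sends can be shown in plain Ruby. This is only a sketch of the general principle with made-up sample bytes; it does not show what Mechanize does internally when page.encoding is set.

```ruby
# The two bytes below are "é" encoded as UTF-8 (0xC3 0xA9).
raw = "\xC3\xA9".b

# Declaring the correct source encoding yields a single character.
as_utf8 = raw.dup.force_encoding('UTF-8')

# Reading the same bytes as Windows-1252 yields two characters (mojibake).
as_cp1252 = raw.encode('UTF-8', 'Windows-1252')

puts as_utf8.length   # 1
puts as_cp1252.length # 2
puts as_cp1252
```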

无法回应 2025-01-03 01:45:51

How about something like this:

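require 'mechanize'

class Mechanize
    alias_method :original_get, :get

    # Wrap get so that every page it returns is treated as UTF-8.
    def get(*args, &block)
        doc = original_get(*args, &block)
        # Non-HTML responses may not have an encoding= setter, so check first.
        doc.encoding = 'utf-8' if doc.respond_to?(:encoding=)
        doc
    end
end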
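
The same wrapping idea can also be written with Module#prepend instead of alias_method, which avoids keeping a renamed copy of the original method. The sketch below is an alternative, not the answer's approach, and it runs against stand-in classes: FakeAgent, FakePage, and ForceUtf8 are hypothetical names used so that the example works without Mechanize installed.

```ruby
# Stand-ins for Mechanize and its page object, so the sketch runs on its own.
FakePage = Struct.new(:encoding)

class FakeAgent
  def get(_url)
    FakePage.new('windows-1252')
  end
end

# Wrapper module: call the original get via super, then force the encoding.
module ForceUtf8
  def get(*args, &block)
    page = super
    page.encoding = 'utf-8' if page.respond_to?(:encoding=)
    page
  end
end

FakeAgent.prepend(ForceUtf8)

puts FakeAgent.new.get('http://somewhere.com/').encoding # utf-8
```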