Using OpenUri, how can I get the contents of a redirected page?



I want to get data from this page:

http://www.canadapost.ca/cpotools/apps/track/personal/findByTrackNumber?trackingNumber=0656887000494793

But that page forwards to:

http://www.canadapost.ca/cpotools/apps/track/personal/findByTrackNumber?execution=eXs1

So, when I use open from OpenUri to try to fetch the data, it raises a RuntimeError saying "HTTP redirection loop".

I'm not really sure how to get that data after it redirects and throws that error.
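
For reference, the failing call looks roughly like this (a minimal sketch; on Ruby 2.5+ open-uri is used through URI.open, while older versions patch Kernel#open):

require 'open-uri'

url = "http://www.canadapost.ca/cpotools/apps/track/personal/findByTrackNumber?trackingNumber=0656887000494793"

# open-uri tracks the URLs it has already requested and raises
# RuntimeError ("HTTP redirection loop") when a redirect points back
# to one of them, which is what happens here.
html = URI.open(url).read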

3 Answers

平安喜乐 2024-09-07 00:11:14


You need a tool like Mechanize. From its description:

The Mechanize library is used for automating interaction with websites. Mechanize automatically stores and sends cookies, follows redirects, can follow links, and submit forms. Form fields can be populated and submitted. Mechanize also keeps track of the sites that you have visited as a history.

which is exactly what you need. So,

sudo gem install mechanize

then

require 'mechanize'
agent = WWW::Mechanize.new
page = agent.get "http://www.canadapost.ca/cpotools/apps/track/personal/findByTrackNumber?trackingNumber=0656887000494793"

page.content # Get the resulting page as a string
page.body # Get the body content of the resulting page as a string
page.search(".somecss") # Search for specific elements by XPath/CSS using nokogiri

and you're ready to rock 'n' roll.
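
Note that newer versions of the mechanize gem dropped the WWW namespace, so with a current install the setup is simply (same idea, updated constant name):

require 'mechanize'

agent = Mechanize.new   # was WWW::Mechanize.new in the old gem
page  = agent.get("http://www.canadapost.ca/cpotools/apps/track/personal/findByTrackNumber?trackingNumber=0656887000494793")

puts page.body          # HTML of the page the redirect lands on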

彩虹直至黑白 2024-09-07 00:11:14


The site seems to be doing some of the redirection logic with sessions. If you don't send back the session cookies they set on the first request, you will end up in a redirect loop. IMHO it's a crappy implementation on their part.

However, I tried to pass the cookies back to them, but I didn't get it to work, so I can't be completely sure that that is all that's going on here.
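
For what it's worth, manually passing the cookie back would look roughly like the sketch below using Net::HTTP (untested against this site, and the server may set more than one cookie, so treat it as an outline only):

require 'net/http'
require 'uri'

url = URI("http://www.canadapost.ca/cpotools/apps/track/personal/findByTrackNumber?trackingNumber=0656887000494793")

# First request: don't follow the redirect, just capture the session
# cookie and the Location header the server sends back.
first  = Net::HTTP.get_response(url)
cookie = first['Set-Cookie']
target = URI.join(url, first['Location'])

# Second request: follow the redirect ourselves, sending the cookie along.
request = Net::HTTP::Get.new(target.request_uri)
request['Cookie'] = cookie
response = Net::HTTP.start(target.host, target.port) { |http| http.request(request) }

puts response.body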

走野 2024-09-07 00:11:14


While Mechanize is a wonderful tool, I prefer to "cook" my own thing.

If you are serious about parsing, take a look at this code. It crawls thousands of sites internationally every day, and as far as I have researched and tweaked, there isn't a more stable approach that also lets you customize it heavily later to fit your needs.

require "open-uri"
require "zlib"
require "nokogiri"
require "sanitize"
require "htmlentities"
require "readability"

def crawl(url_address)
self.errors = Array.new
begin
  begin
    url_address = URI.parse(url_address)
  rescue URI::InvalidURIError
    url_address = URI.decode(url_address)
    url_address = URI.encode(url_address)
    url_address = URI.parse(url_address)
  end
  url_address.normalize!
  stream = ""
  timeout(8) { stream = url_address.open(SHINSO_HEADERS) }
  if stream.size > 0
    url_crawled = URI.parse(stream.base_uri.to_s)
  else
    self.errors << "Server said status 200 OK but document file is zero bytes."
    return
  end
rescue Exception => exception
  self.errors << exception
  return
end
# extract information before html parsing
self.url_posted       = url_address.to_s
self.url_parsed       = url_crawled.to_s
self.url_host         = url_crawled.host
self.status           = stream.status
self.content_type     = stream.content_type
self.content_encoding = stream.content_encoding
self.charset          = stream.charset
if    stream.content_encoding.include?('gzip')
  document = Zlib::GzipReader.new(stream).read
elsif stream.content_encoding.include?('deflate')
  document = Zlib::Deflate.new().deflate(stream).read
#elsif stream.content_encoding.include?('x-gzip') or
#elsif stream.content_encoding.include?('compress')
else
  document = stream.read
end
self.charset_guess = CharGuess.guess(document)
if not self.charset_guess.blank? and (not self.charset_guess.downcase == 'utf-8' or not self.charset_guess.downcase == 'utf8')
  document = Iconv.iconv("UTF-8", self.charset_guess, document).to_s
end
document = Nokogiri::HTML.parse(document,nil,"utf8")
document.xpath('//script').remove
document.xpath('//SCRIPT').remove
for item in document.xpath('//*[translate(@src, "ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz")]')
  item.set_attribute('src',make_absolute_address(item['src']))
end
document = document.to_s.gsub(/<!--(.|\s)*?-->/,'')
self.content = Nokogiri::HTML.parse(document,nil,"utf8")
end
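
For context, crawl is written to live on a model-like object that exposes errors, content and the other attributes it assigns. A hypothetical use (CrawledPage is an invented name here) would be:

page = CrawledPage.new
page.crawl("http://www.canadapost.ca/cpotools/apps/track/personal/findByTrackNumber?trackingNumber=0656887000494793")

if page.errors.empty?
  puts page.url_parsed   # final URL after redirects
  puts page.content      # cleaned-up Nokogiri document
else
  puts page.errors.inspect
end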