How to get the "Next Page" link with Scrubyt
I'm trying to use Scrubyt to get the details from this page: http://www.nuffieldtheatre.co.uk/cn/events/event_listings.php?section=events
I've managed to get the titles and detail URLs from the list, but I can't get next_page to make the scraper go to the next page. I assume that's because I'm not using the correct pattern for the next page link. I tried the string "Next Page", and I've also tried the XPath. Any other ideas?
The code is below:
require 'rubygems'
require 'scrubyt'

nuffield_data = Scrubyt::Extractor.define do
  fetch 'http://www.nuffieldtheatre.co.uk/cn/events/event_listings.php?section=events'

  event do
    title 'The Coast of Mayo'
    #url "href", :type => :attribute
    link_url
  end

  next_page "Next Page", :limit => 2
end

nuffield_data.to_xml.write($stdout,1)
Comments (1)
Try this with a slightly different URL, one without the query string.
Scrubyt seems to have trouble with the "?section=events" query at the end of the URL.
When it looks for the next page, it tries to request this URL:
http://www.nuffieldtheatre.co.uk/cn/events/?pageNum_rsSearch=1&totalRows_rsSearch=39&section=events
instead of:
http://www.nuffieldtheatre.co.uk/cn/events/event_listings.php?pageNum_rsSearch=1&totalRows_rsSearch=39&section=events
Removing the query string from the end of the URL seems to fix this. You might want to file this as a bug.
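For illustration, here is a minimal sketch of that workaround using only Ruby's standard URI library. The helper name strip_query and the variable names are made up for this example and are not part of Scrubyt; the sketch only shows one way to drop the query string before handing the URL to fetch.

```ruby
require 'uri'

# Hypothetical helper (not a Scrubyt method): returns the URL
# with its query string removed.
def strip_query(url)
  uri = URI.parse(url)
  uri.query = nil # drop everything after the "?"
  uri.to_s
end

listing_url = 'http://www.nuffieldtheatre.co.uk/cn/events/event_listings.php?section=events'
start_url = strip_query(listing_url)

# Prints the listing URL without "?section=events"
puts start_url
```

The resulting string would then replace the literal passed to fetch in the extractor from the question, with the event block and next_page line left unchanged. Whether the listing page returns the same events without section=events is an assumption to check against the site.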