Scraping hidden HTML (when visible = false) using Hpricot (Ruby on Rails)
I've run into an issue that I can't seem to get past. I'm also new to Ruby on Rails, hence the number of questions.
I am attempting to scrape a webpage such as the following:
http://www.yellowpages.com.mt/Malta/Grocers-Mini-Markets-Retail-In-Malta-Gozo.aspx
I would like to scrape the addresses, phone numbers and URL of the next page, which in this case is
http://www.yellowpages.com.mt/Malta/Grocers-Mini-Markets-Retail-In-Malta-Gozo+Ismol.aspx
I've tried just about everything I could think of, but nothing seems to work, apparently because the elements are set to invisible.
The address is within an h3 tag, but it does not appear to be scrapable. I've also been looking into ScRUBYt, from the following URL: http://www.rubyrailways.com/ajax-scraping-with-scrubyt-linkedin-google-analytics-yahoo-suggestions/, but I can't make heads or tails of how to apply it in this case.
I would really appreciate any pointers, as this is an obstacle I need to overcome in order to move forward with my assignment. Thanks in advance for any help.
Comments (3)
In the particular example you have given, the elements are not hidden but loaded via Ajax after the page loads. So what you need is an HTTP client that can run JavaScript (a web browser?) to see those addresses and the other content.
If you really want to automate the process and scrape data that is fetched through Ajax or JavaScript, you can try Selenium. Even though it was not developed for that purpose, it serves your needs.
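To illustrate the point about Ajax-loaded content, here is a minimal Ruby sketch, not a definitive implementation. It assumes you have found, with your browser's developer tools, the request the page makes after it loads. The helper names `fetch_body` and `extract_h3_texts` and both sample HTML strings are hypothetical, made up for this example. The regular expression only copes with simple fragments; a real parser such as Hpricot would be more robust.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: download the body of a URL as a string.
# Point it at the Ajax request seen in the browser's network tab,
# not only at the page URL. It is defined here but not called below.
def fetch_body(url)
  Net::HTTP.get(URI.parse(url))
end

# Hypothetical helper: return the non-empty text of every <h3> in an HTML string.
# A regex is used only to keep this sketch free of gem dependencies.
def extract_h3_texts(html)
  texts = html.scan(%r{<h3[^>]*>(.*?)</h3>}mi).map do |(inner)|
    inner.gsub(/<[^>]+>/, '').strip # drop nested tags, trim whitespace
  end
  texts.reject(&:empty?)
end

# Made-up sample: what a plain HTTP client might get for the page itself...
initial_page = '<div id="listing"><h3></h3></div>'
# ...and what a later Ajax response might contain.
ajax_fragment = '<div id="listing"><h3> 12 Example Street, <b>Valletta</b> </h3></div>'

p extract_h3_texts(initial_page)
p extract_h3_texts(ajax_fragment)
```

If no such request can be found or replayed with a plain HTTP client, driving a real browser (for example with Selenium, as suggested above) is the fallback.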
I don't have an answer to your specific question, but I thought I'd point to Ryan Bates' Railscasts episode on screen scraping with Ruby: http://railscasts.com/episodes/173-screen-scraping-with-scrapi
He uses a library called scrAPI instead of ScRUBYt, since he couldn't get ScRUBYt working. scrAPI seems to be a bit easier, maybe?
I hope this helps somewhat. Good luck with your assignment! :)
-John
There is a good script posted on the Google group. It seems to extract addresses and so on. You may want to look at the code for the script, page.txt.