Screen scraping

Posted on 2024-12-05 18:10:09

Comments (4)

泅人 2024-12-12 18:10:09

I find that if the page has a fairly static layout, the HTML Agility Pack is perfect for getting all the data I need. I have not yet run into a single page that it couldn't handle or that didn't give me the results I wanted.
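To make the static-page case concrete, here is a minimal sketch of the idea: parse markup that has already been downloaded and pull out the pieces you need. The HTML Agility Pack is a .NET library; this sketch swaps in Python's standard `html.parser` module purely for illustration, and the names `LinkCollector`, `extract_links` and `SAMPLE_PAGE` are made up for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (href, text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []    # finished (href, text) pairs
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return all (href, text) pairs found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical static page used only for this example.
SAMPLE_PAGE = """
<html><body>
  <ul id="products">
    <li><a href="/item/1">Red widget</a></li>
    <li><a href="/item/2">Blue <b>widget</b></a></li>
  </ul>
</body></html>
"""

if __name__ == "__main__":
    for href, text in extract_links(SAMPLE_PAGE):
        print(href, text)
```

This only works when the data is already present in the downloaded markup, which is exactly the "static layout" condition above.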

If you find that the page is rendered with a great deal of dynamic code, downloading the page is not enough; you have to actually execute it.

To do that, you'll need something like the WebKit .NET library (a .NET wrapper around the WebKit rendering engine), which lets you download the page and execute its JavaScript as well. Then, once you are sure the document has been rendered completely, you can read the page details.
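The "once you are sure the document has been rendered" step usually comes down to polling some readiness condition with a timeout. Below is a rough sketch of that pattern only; it does not use WebKit .NET or any real browser. `wait_until` and `FakePage` are hypothetical names, and `FakePage` merely simulates a page that becomes ready after a few polls. With a real engine the predicate would ask the page itself, for example whether `document.readyState` is `"complete"` or whether a particular element exists yet.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.25):
    """Poll predicate() until it returns True or `timeout` seconds pass.

    Returns True if the condition was met, False if the wait timed out.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


class FakePage:
    """Hypothetical stand-in for a browser page that finishes rendering later."""

    def __init__(self, polls_until_ready):
        self._remaining = polls_until_ready

    def is_ready(self):
        # Simulation only: becomes ready after a fixed number of polls.
        self._remaining -= 1
        return self._remaining <= 0


if __name__ == "__main__":
    page = FakePage(polls_until_ready=3)
    if wait_until(page.is_ready, timeout=2.0, interval=0.1):
        print("page rendered, safe to read its contents")
    else:
        print("gave up waiting")
```

Note that a page reporting itself as loaded does not guarantee that content fetched asynchronously has arrived, so waiting for the specific element you need is often the safer condition.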

可爱暴击 2024-12-12 18:10:09

For the very basics I use:

I don't have JavaScript enabled yet, but I'm planning on using Google's V8 JavaScript Engine. This requires that you make calls to unmanaged code, but the performance of V8 justifies it.

少钕鈤記 2024-12-12 18:10:09

For automating screen scraping, Selenium is a good tool. There are two things to set up: 1) install Selenium IDE (works only in Firefox); 2) install the Selenium RC server.

After starting Selenium IDE, go to the site you are trying to automate and start recording the actions you perform there. Think of it as recording a macro in the browser. Afterwards, you can export the recording as code in the language you want.
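If the macro analogy is unclear, this toy sketch shows what a recording boils down to: a list of (command, target, value) steps that get replayed against a browser driver. It is not the code Selenium IDE exports and it does not call the Selenium API. `RecordingDriver`, `RECORDED_STEPS` and `replay` are hypothetical names, and the driver here only logs what it is asked to do.

```python
class RecordingDriver:
    """Hypothetical stand-in for a browser driver; it only logs requests."""

    def __init__(self):
        self.log = []

    def open(self, url):
        self.log.append(("open", url))

    def type(self, locator, text):
        self.log.append(("type", locator, text))

    def click(self, locator):
        self.log.append(("click", locator))


# Made-up recording: each step is (command, target, value).
RECORDED_STEPS = [
    ("open", "/search", ""),
    ("type", "id=q", "red widget"),
    ("click", "id=submit", ""),
]


def replay(steps, driver):
    """Replay recorded steps, in order, against a driver-like object."""
    for command, target, value in steps:
        if command == "open":
            driver.open(target)
        elif command == "type":
            driver.type(target, value)
        elif command == "click":
            driver.click(target)
        else:
            raise ValueError(f"unsupported command: {command}")


if __name__ == "__main__":
    driver = RecordingDriver()
    replay(RECORDED_STEPS, driver)
    for entry in driver.log:
        print(entry)
```

In a real setup the driver would be an actual browser session, and the scraping part would be reading values back from the page after the last step.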

Just so you know, Browsermob uses Selenium for load testing and for automating tasks in the browser.

I've uploaded a PowerPoint presentation that I made a while back. It should save you a good amount of time: http://www.4shared.com/get/tlwT3qb_/SeleniumInstructions.html

At the link above, select the regular download option.

I spent a good amount of time figuring this out, so I thought it might save somebody else some time.

风流物 2024-12-12 18:10:09

The best tool "these days" is one that not only gives you the features you need (JavaScript, automation) but also doesn't have to be run by you. I am, of course, alluding to a cloud service. This approach saves you network bandwidth, delivers results faster (because it can scale better than a custom solution you would likely end up developing) and, most importantly, spares you the IT and maintenance headache.

On that note, check out a scraping solution called Bobik (http://usebobik.com). I've written an article about it at http://zscraper.wordpress.com/2012/07/03/a-comparison-shopping-android-app-without-backend/.

Hope this helps.
