Questions asking us to recommend or find a book, tool, software library, tutorial or other off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it.
Closed 10 years ago.
I find that if the page has a fairly static layout, the HTML Agility Pack is perfect for getting all the data I need. I have yet to run into a page that it couldn't handle or that didn't give me the results I wanted.
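To illustrate what "static layout" scraping amounts to, here is a minimal sketch in Python using only the standard library's `html.parser`. It is not the HTML Agility Pack API; the `LinkCollector` class, the `extract_links` helper and the sample markup are all made up for illustration. It just shows the general idea of pulling data out of markup that is already present in the downloaded page.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (href, text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Hypothetical helper: return all (href, text) pairs found in html."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up sample markup standing in for a downloaded static page.
sample = (
    '<ul><li><a href="/a">First</a></li>'
    '<li><a href="/b">Second <b>item</b></a></li></ul>'
)
print(extract_links(sample))
```

A dedicated parsing library will generally cope better with malformed real-world HTML than a hand-rolled handler like this one.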
If you find that the page is rendered with a great deal of dynamic code, downloading the page is not enough; you'll have to actually execute it.
To do that, you'll need something like the WebKit .NET library (a .NET wrapper around the WebKit rendering engine), which lets you download the page and execute its JavaScript as well. Then, once you are sure the document has been rendered completely, you can get the page details.
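One rough way to tell which of the two cases you are in is to check whether a value you can see in the browser is also present in the raw downloaded markup. The sketch below is only a heuristic, not something taken from the answer above: `needs_script_execution` and the two sample pages are hypothetical, and the regular expression for stripping `<script>` blocks is deliberately simplistic.

```python
import re

# Matches whole <script>...</script> blocks so their contents can be ignored.
SCRIPT_BLOCK = re.compile(
    r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL
)


def needs_script_execution(raw_html, expected_text):
    """Return True if expected_text is missing from the markup outside scripts.

    If text that is visible in a browser cannot be found in the raw markup,
    it is probably produced by JavaScript, and a plain download will not do.
    """
    markup_only = SCRIPT_BLOCK.sub("", raw_html)
    return expected_text not in markup_only


# Made-up pages: the first has the value in the markup, the second renders it.
static_page = "<html><body><h1>Price: 42</h1></body></html>"
dynamic_page = (
    '<html><body><div id="app"></div>'
    '<script>render("Price: 42")</script></body></html>'
)

print(needs_script_execution(static_page, "Price: 42"))
print(needs_script_execution(dynamic_page, "Price: 42"))
```

If the check comes back true for the data you care about, that is a hint that you need a tool that actually runs the page, as described above.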
For the very basics I use:
HttpWeb*
(preliminary tests showed that it was about 25% faster). I don't have JavaScript enabled yet, but I'm planning on using Google's V8 JavaScript engine. This requires making calls to unmanaged code, but the performance of V8 justifies it.
For automating screen scraping, Selenium is a good tool. There are two things to install: 1) Selenium IDE (works only in Firefox) and 2) the Selenium RC server.
After starting Selenium IDE, go to the site you are trying to automate and start recording the actions you perform there. Think of it as recording a macro in the browser. Afterwards, you get the code output in the language you want.
Just so you know, Browsermob uses Selenium for load testing and for automating tasks in the browser.
I've uploaded a PPT that I made a while back. This should save you a good amount of time: http://www.4shared.com/get/tlwT3qb_/SeleniumInstructions.html
At the above link, select the regular download option.
I spent a good amount of time figuring it out, so I thought it might save somebody else's time.
The best tool "these days" is one that not only gives you the features you need (JavaScript, automation) but that you also don't have to run yourself. I am, of course, alluding to a cloud service. This approach saves you network bandwidth, delivers results faster (because it can scale better than a custom solution you would likely end up developing) and, most importantly, saves you the IT and maintenance headache.
On that note, check out a scraping solution called Bobik (http://usebobik.com). I've written an article about it at http://zscraper.wordpress.com/2012/07/03/a-comparison-shopping-android-app-without-backend/.
Hope this helps.