I need a powerful web crawler library
I need a powerful web scraper library for mining content from the web. It can be paid or free; either is fine for me. Please suggest a library, or a better approach, for mining the data and storing it in my preferred database. I have searched but didn't find any good solution. I would appreciate a good suggestion from experts. Please help me out.
Comments (4)
Scraping really is easy: you just parse the content you download and collect all the links it contains.
The most important piece, though, is the part that processes the HTML. Because most browsers don't require clean (or standards-compliant) HTML in order to render a page, you need an HTML parser that can make sense of HTML that is not always well-formed.
I recommend the HTML Agility Pack for this purpose. It handles malformed HTML very well and provides an easy interface for using XPath queries to get nodes in the resulting document.
Beyond that, you just need to pick a data store to hold your processed data (any database technology will do) and a way to download content from the web, for which .NET provides two high-level mechanisms: the WebClient class and the HttpWebRequest/HttpWebResponse classes.
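To make the "parse the content and get all the associated links" step concrete, here is a minimal sketch. It is written in Python with only the standard library, not with the HTML Agility Pack, so treat it as an illustration of the idea rather than of that library's API. The names `LinkCollector` and `extract_links` and the example URL are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute link targets from <a href="..."> tags.

    html.parser is event-based and tolerant of unclosed or badly
    nested tags, which is the property the answer asks for.
    """

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url  # used to resolve relative links
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every link found in `html`, resolved against `base_url`."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page with unclosed <p> and <a> tags.
page = "<html><body><p>Unclosed paragraph<a href='/about'>About<a href='contact.html'>Contact</body>"
for link in extract_links(page, "http://example.com/docs/index.html"):
    print(link)
```

A crawler would feed each returned link back into its download queue; the download itself (WebClient or HttpWebRequest in .NET) is left out here so the sketch runs without network access.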
For simple websites (plain HTML only), Mechanize works really well and is fast. For sites that use JavaScript, AJAX, or even Flash, you need a real browser solution such as iMacros.
My advice:
You could look around for an HTML parser and then use it to parse information out of sites (like here). Then all you need to do is save that data into your database however you see fit.
I've made my own scraper a few times. It's pretty easy, and it lets you customize the data that is saved.
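As a rough sketch of the "parse, then save to your database" approach, the following uses Python's standard library only. The choice of SQLite, the `pages` table, and the names `TitleParser`, `extract_title`, and `save_page` are assumptions made for illustration; any parser and any database would fit the same shape.

```python
import sqlite3
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Captures the text inside the page's <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append rather than assign.
        if self._in_title:
            self.title += data


def extract_title(html):
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip()


def save_page(conn, url, html):
    """Store the fields we care about; re-saving a URL overwrites its row."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, title) VALUES (?, ?)",
        (url, extract_title(html)),
    )
    conn.commit()


# In-memory database and a hypothetical page, so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
save_page(conn, "http://example.com/", "<html><head><title> Example Page </title><body><p>unclosed")
print(conn.execute("SELECT url, title FROM pages").fetchall())
```

Customizing what gets saved then comes down to adding columns to the table and matching handlers to the parser.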
Data Mining Tools
If you really just want a ready-made tool that does this for you, you should have no problem finding one.