Questions asking us to recommend or find a tool, library or favorite off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it.
Closed 10 years ago.
Write your own; it isn't hard. If you are new to programming, or are free to choose your language, use Python: its library support for scraping is great.
As for how to attack the problem, there are two popular techniques. Regular expressions work best for ad-hoc screen scraping. If your target websites are well structured (read: not ad hoc), use a framework that lets you work with the DOM.
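To make the contrast concrete, here is a minimal sketch that pulls the same value out of a page both ways. The sample markup, the `price` class, and the function names are all made up for illustration. The tree version uses the standard library's `xml.etree.ElementTree`, which only copes with well-formed markup; for real-world HTML you would most likely want a dedicated HTML parser instead.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (an assumption for this sketch).
SAMPLE_PAGE = """<html><body>
<h1>Widget 3000</h1>
<span class="price">$19.99</span>
<a href="/catalog?page=2">next</a>
</body></html>"""


def price_with_regex(page):
    """Ad-hoc approach: pattern-match the raw text of the page."""
    match = re.search(r'class="price">\$([\d.]+)<', page)
    return match.group(1) if match else None


def price_with_tree(page):
    """Structured approach: parse into a tree, then query it by tag and attribute."""
    root = ET.fromstring(page)
    node = root.find(".//span[@class='price']")
    if node is None or node.text is None:
        return None
    return node.text.strip().lstrip("$")


if __name__ == "__main__":
    print(price_with_regex(SAMPLE_PAGE))
    print(price_with_tree(SAMPLE_PAGE))
```

The regex is quicker to write but breaks as soon as the attribute order or whitespace changes. The tree query only depends on the element being a `span` with that class, which is why it tends to hold up better on well-structured sites.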
Navigation and Extraction
These are the two phases of writing a spider. Your spider needs to navigate a website to visit different pages, and it needs to extract the information of interest. Both phases can be driven by either the DOM or regular expressions.
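A rough sketch of the two phases, assuming a tiny in-memory "site" so that it runs without network access. `FAKE_SITE`, `fetch`, `navigate`, `extract`, and `crawl` are names invented for this example, and the `example.test` URLs are placeholders. Navigation here is driven by a parser (the standard library's `html.parser`, which yields tag events rather than a full DOM) and extraction by a regular expression, but either phase could use either technique.

```python
import re
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical pages standing in for a real website.
FAKE_SITE = {
    "http://example.test/": '<title>Home</title><a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<title>Page A</title><a href="/b">B</a>',
    "http://example.test/b": '<title>Page B</title><a href="/">Home</a>',
}


def fetch(url):
    """Stand-in for an HTTP GET; a real spider might use urllib.request here."""
    return FAKE_SITE.get(url, "")


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def navigate(url, html):
    """Navigation phase: find the links on a page and make them absolute."""
    collector = LinkCollector()
    collector.feed(html)
    return [urljoin(url, href) for href in collector.links]


def extract(html):
    """Extraction phase: pull out the item of interest (here, the page title)."""
    match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


def crawl(start_url, max_pages=10):
    """Breadth-first crawl that visits each URL at most once."""
    seen = {start_url}
    queue = deque([start_url])
    results = {}
    while queue and len(results) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        results[url] = extract(html)
        for link in navigate(url, html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return results


if __name__ == "__main__":
    for url, title in crawl("http://example.test/").items():
        print(url, title)
```

Keeping `navigate` and `extract` as separate functions means you can swap the technique behind one without touching the other. The `seen` set stops the spider from looping on pages that link back to each other.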
P.S. Since your name indicates .NET, I should mention that I have written scrapers in C#, and it's a doddle.