Best open-source library or application for crawling and data-mining websites
I would like to know what the best open-source library is for crawling and analyzing websites. One example would be a crawler for property agencies, where I would like to grab information from a number of sites and aggregate it into my own site. For this I need to crawl the sites and extract the property ads.
Comments (4)
I do a lot of scraping using the excellent Python packages urllib2, mechanize, and BeautifulSoup.

I also suggest looking at lxml and Scrapy, though I don't use them currently (I'm still planning to try Scrapy).

Perl also has great facilities for scraping.
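The extract step these answers describe can be sketched with just the standard library. Note that urllib2 is Python 2; in Python 3 the equivalent is urllib.request, and BeautifulSoup would normally replace the hand-rolled parser below. This is a minimal sketch, assuming the ads sit in hypothetical `<div class="ad">` elements:

```python
from html.parser import HTMLParser

class AdParser(HTMLParser):
    """Collect the text content of <div class="ad"> elements."""
    def __init__(self):
        super().__init__()
        self.in_ad = False
        self.ads = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "div" and ("class", "ad") in attrs:
            self.in_ad = True

    def handle_endtag(self, tag):
        if tag == "div" and self.in_ad:
            self.in_ad = False

    def handle_data(self, data):
        if self.in_ad and data.strip():
            self.ads.append(data.strip())

# In a real crawler this HTML would come from urllib.request.urlopen(url).read()
page = '<div class="ad">2-bed flat, $900/mo</div><div class="nav">Home</div>'
parser = AdParser()
parser.feed(page)
print(parser.ads)  # ['2-bed flat, $900/mo']
```

BeautifulSoup does the same job with far less code (`soup.find_all("div", class_="ad")`) and tolerates the malformed HTML that real estate sites tend to serve, which is why the answers above recommend it.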
PHP/cURL is a very powerful combination, especially if you want to use the results directly in a web page...
In common with Mr Morozov, I do quite a bit of scraping too, principally of job sites. I've never had to resort to mechanize, if that helps any. BeautifulSoup in combination with urllib2 has always been sufficient.

I have used lxml, which is great. However, I believe it may not have been available on Google App Engine a few months ago when I tried it, if you need that.

My thanks are due to Mr Morozov for mentioning Scrapy. I hadn't heard of it.
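For what lxml usage looks like: lxml.etree implements a superset of the standard library's ElementTree API, so the sketch below uses xml.etree to stay self-contained; swapping the import for `from lxml import etree` gives the same code plus XPath support and a forgiving HTML parser. The listing markup here is hypothetical:

```python
import xml.etree.ElementTree as ET  # lxml.etree offers a superset of this API

# A hypothetical aggregated-listings document
doc = ET.fromstring(
    '<listings>'
    '<ad price="900"><title>2-bed flat</title></ad>'
    '<ad price="1200"><title>3-bed house</title></ad>'
    '</listings>'
)

# Pull out (title, price) pairs from each ad element
ads = [(ad.findtext("title"), int(ad.get("price"))) for ad in doc.findall("ad")]
print(ads)  # [('2-bed flat', 900), ('3-bed house', 1200)]
```

With lxml proper you could write the same extraction as `doc.xpath("//ad/title/text()")`, which is handy once the selectors get more complex.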
Besides Scrapy, you should also look at Parselets.