Best way to fetch data from many web pages (threads/event-driven)
I don't want to start a holy war; I just want some advice so I can continue development.
I need to write something like a crawler:
it must be able to fetch data from a list of URLs and parse it.
I am going to use Ruby (mechanize + nokogiri) or Python (mechanize + beautifulsoup).
But I need to process the pages in parallel for efficiency, and that is a big problem for me right now.
As far as I know, Mechanize (in both languages) is not thread-safe, and many programmers say that using threads is not "good practice". On the other hand, I have no idea about event-driven programming techniques or how they could be used in my case.
Any help is appreciated. Thanks.
Comments (1)
I've been using Scrapy with great success. It's quite straightforward and allows for multiple crawlers at once. It outputs to JSON, XML, etc., or directly to a database. It's worth a look for sure.
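If a full framework is more than a short URL list needs, one lighter option is a thread pool from the Python standard library. The sketch below is only an illustration under some assumptions: it uses `urllib` and `html.parser` instead of mechanize and beautifulsoup, it extracts just the page title as a stand-in for real parsing, and the names `fetch`, `parse_title`, `crawl`, and `TitleParser` are made up for this example. Each worker creates its own parser and shares no mutable state, which sidesteps the thread-safety worry about reusing one browser object across threads.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser
from urllib.request import urlopen


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def fetch(url, timeout=10):
    """Download one page and return its body as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def parse_title(html):
    """Stand-in for real parsing: return the page title."""
    parser = TitleParser()  # one parser per call, nothing shared between threads
    parser.feed(html)
    return parser.title.strip()


def crawl(urls, fetcher=fetch, max_workers=8):
    """Fetch and parse all URLs in parallel.

    Returns a dict mapping each URL to its title, or to the exception
    raised while handling that URL, so one bad page does not stop the rest.
    """
    urls = list(urls)

    def work(url):
        try:
            return parse_title(fetcher(url))
        except Exception as exc:
            return exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(urls, pool.map(work, urls)))
```

Calling `crawl(["https://example.com/", "https://example.org/"])` would fetch both pages concurrently. The `fetcher` parameter is there so a different download function (or a fake one for offline testing) can be swapped in. Threads suit this job because the workers spend most of their time waiting on the network.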