Multi-level web spider with regex matching?
I need a web spider that finds certain links using regular expressions.
The spider would visit a list of websites, find links that match any pattern in a list of regexes, visit those matched links, and repeat until it reaches a configured depth level.
I was about to write this in PHP, but I'm not very good with threads in PHP, and I need threads for this application.
So, what do you think is the best solution?
Maybe there's some existing app or code I could configure to create this spider.
Comments (1)
There are several crawlers available that you can use for free:
Nutch is probably the best. If you use it, I would recommend taking advantage of its OPIC (Online Page Importance Computation) functionality instead of specifying the crawl depth yourself. OPIC lets the crawler decide in an intelligent way which site should be crawled next, without the need for artificial depth limits.
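If you would rather write the spider yourself, the loop described in the question (fetch a list of pages, keep only the links that match a regex list, follow them, stop at a configured depth) can be sketched roughly as below. This is only a minimal illustration in Python using the standard library, not a replacement for Nutch. The names `LinkParser`, `fetch`, `matching_links` and `crawl`, the `workers` parameter, and the choice to count the seed pages as depth level 1 are all my own assumptions. Threads come from `concurrent.futures.ThreadPoolExecutor`, so no manual thread handling is needed.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url, timeout=10):
    """Download a page and return its text, or "" if anything goes wrong."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except Exception:
        # A sketch-level catch-all: a failed page simply yields no links.
        return ""


def matching_links(base_url, html, patterns):
    """Return the absolute URLs on a page that match at least one pattern."""
    parser = LinkParser()
    parser.feed(html)
    absolute = (urljoin(base_url, href) for href in parser.links)
    return {url for url in absolute if any(p.search(url) for p in patterns)}


def crawl(seeds, patterns, max_depth, fetch_page=fetch, workers=8):
    """Fetch up to max_depth levels of pages, following only matching links.

    Assumption: the seed pages are level 1, so max_depth=1 fetches the
    seeds only and reports the matching links found on them.
    """
    compiled = [re.compile(p) for p in patterns]
    seen = set(seeds)
    frontier = list(seeds)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(max_depth):
            if not frontier:
                break
            # Pages of one level are fetched in parallel by the thread pool.
            pages = pool.map(fetch_page, frontier)
            next_frontier = []
            for url, html in zip(frontier, pages):
                for link in sorted(matching_links(url, html, compiled)):
                    if link not in seen:
                        seen.add(link)
                        next_frontier.append(link)
            frontier = next_frontier
    # Every matched link that was discovered, excluding the seeds themselves.
    return seen - set(seeds)
```

A call such as `crawl(["http://example.com/"], [r"/post/\d+$"], max_depth=3)` would return the set of matched URLs. Passing your own `fetch_page` function lets you add politeness delays, robots.txt checks, or an offline stub for testing, none of which this sketch handles.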