Bot for web quality
I am looking for a good open-source bot to check some of the quality criteria that Google indexing typically requires.
For example:
- find duplicate titles
- invalid links (jspider does this, and I think many others do too)
- exactly the same page, but under different URLs
- etc., where etc. equals Google's quality requirements.
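For the first and third checks, a minimal sketch of what such a bot would do (this is an illustrative stdlib-only example, not any particular crawler's API): extract each page's `<title>` and hash each page's content, then group URLs that share a title or a hash.

```python
import hashlib
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Extracts the <title> text from an HTML document."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def audit(pages):
    """pages: dict mapping URL -> raw HTML (already fetched).
    Returns (dup_titles, dup_bodies): titles shared by more than one
    URL, and content hashes shared by more than one URL."""
    by_title, by_hash = {}, {}
    for url, html in pages.items():
        parser = TitleParser()
        parser.feed(html)
        by_title.setdefault(parser.title.strip(), []).append(url)
        digest = hashlib.sha256(html.encode()).hexdigest()
        by_hash.setdefault(digest, []).append(url)
    dup_titles = {t: urls for t, urls in by_title.items() if len(urls) > 1}
    dup_bodies = {h: urls for h, urls in by_hash.items() if len(urls) > 1}
    return dup_titles, dup_bodies
```

A real tool would normalize the HTML before hashing (whitespace, tracking parameters, session IDs), since "exactly the same page" at different URLs is rarely byte-identical in practice.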
Your requirements are very specific, so it's unlikely there is an open-source product that does exactly what you want.
There are, however, many open-source frameworks for building web crawlers. Which one you use depends on your language preference.
For example:
Generally, these frameworks provide classes for crawling and scraping a site's pages based on the rules you give, but it's then up to you to extract the data you need by hooking in your own code.
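The "hook in your own code" pattern described above can be sketched as follows (a toy stdlib example, not any real framework's API): the crawler handles fetching, link discovery, and deduplication, and calls a user-supplied callback for every page, including broken ones.

```python
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collects href targets; the crawler follows these links."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start, fetch, on_page):
    """Toy breadth-first crawler. `fetch(url)` returns HTML, or None
    for a broken link; `on_page(url, html, links)` is the user hook
    where quality checks (duplicate titles, etc.) would live."""
    seen, queue = set(), [start]
    while queue:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        html = fetch(url)
        if html is None:
            on_page(url, None, [])  # broken links reach the hook too
            continue
        parser = LinkParser()
        parser.feed(html)
        on_page(url, html, parser.links)
        queue.extend(parser.links)
    return seen
```

Real frameworks add politeness delays, robots.txt handling, and parallel fetching, but the division of labor is the same: the framework crawls, your callback judges.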
Google Webmaster Tools is a web-based service (rather than an on-demand bot). It doesn't do everything you've asked for, but it does some of it, plus many things you haven't asked for, and, being from Google, it no doubt matches your odd "etc., where etc. equals Google quality reqs." better than anything else will.