Best database design for a web crawler
Many database systems are suitable for a web crawler, but is there one developed specifically for web crawlers (in .NET)?
In my experience, a web crawler has many parts and services, and each part needs specific features. For example, to cache web pages we need something like SQL Server's FILESTREAM, and to check whether a URL already exists in the database, the best choice is memcached.
In fact, I have two questions:
1) What are the best database systems to use with a web crawler?
2) Is there any database system that covers all of these features?
Comments (2)
FYI, to my knowledge Google does not use any relational database engine; instead they have a proprietary file system, GFS, and their own data persistence abstractions.
Who told you that memcached is the best choice? Consider that if the amount of data is big you will run out of memory, unless of course you have a large data center and can share data in memory across machines.
I don't think this is about the best choice; the best is probably Google's, and they have built most of their things in house.
If you can accept a high level of performance (though still not the best), I think engines like SQL Server, Oracle, MySQL and many others could all perform well. It depends more on how you use them and how you architect your solution.
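To make the "how you use them" point a bit more concrete, here is a minimal sketch of the URL-already-seen check done on disk instead of in memory. It uses Python's built-in sqlite3 module purely as a stand-in for whichever relational engine you pick; the table name `seen_urls` and the helper names are made up for illustration, and a real crawler would normalize URLs first and batch its commits.

```python
import hashlib
import sqlite3


def open_frontier(path=":memory:"):
    """Open (or create) the hypothetical `seen_urls` table.

    Pass a file path instead of ":memory:" to keep the set on disk.
    """
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seen_urls ("
        " url_hash BLOB PRIMARY KEY,"  # fixed-size key keeps the index compact
        " url TEXT NOT NULL)"
    )
    return conn


def mark_seen(conn, url):
    """Record `url`; return True if it was new, False if already stored."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    # The primary key rejects duplicates, so one statement does both
    # the existence check and the insert.
    cur = conn.execute(
        "INSERT OR IGNORE INTO seen_urls (url_hash, url) VALUES (?, ?)",
        (digest, url),
    )
    conn.commit()
    return cur.rowcount == 1
```

Calling `mark_seen(conn, url)` before fetching a page tells you whether to skip it. The same pattern (unique key plus an insert that ignores duplicates) should carry over to SQL Server, Oracle or MySQL with their own syntax.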
Google uses a column-oriented database, Bigtable, to store its crawler results, and also for Google Docs and other Google products. Bigtable is built on top of GFS (Google File System). Their design is by far the best I know of.
Apache HBase is similar in implementation to Bigtable. HBase is built on top of HDFS (Hadoop Distributed File System).
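One design detail that matters in a Bigtable or HBase style store is the row key, because rows are kept sorted by key. As I understand the published Bigtable design, crawled pages are keyed by the URL with the host name reversed, so pages from the same domain end up next to each other. A rough sketch of building such a key (the function name `row_key` is my own, and this does not talk to HBase at all):

```python
from urllib.parse import urlsplit


def row_key(url):
    """Build a sortable row key with the host labels reversed.

    Hypothetical helper: 'www.cnn.com' becomes 'com.cnn.www' so that
    keys for one domain are contiguous in a sorted table.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    reversed_host = ".".join(reversed(host.split(".")))
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return reversed_host + path
```

With keys like these, a range scan over the prefix `com.cnn.` would touch every stored page under cnn.com without scanning the rest of the table.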