Which database should a web crawler use, and how do I use MySQL in a distributed environment?
Which database engine should I use for a web crawler: InnoDB or MyISAM? I have two PCs, each with a 1 TB hard drive. If one fills up, I'd like writes to go to the other PC automatically, while reads still go to whichever PC holds the data. How do I do that?
Comments (2)
As for the first part of your question, it depends on your precise implementation. If you are going to have a single crawler limited by network bandwidth, then MyISAM can be quicker. If you are using multiple crawlers, then InnoDB gives you advantages such as transactions, which may help.
AFAIK MySQL doesn't support the hardware configuration you are suggesting. If you need large storage, you may want to look at MySQL Cluster.
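If MySQL Cluster is more than you need, the spill-over behaviour from the question could in principle be handled in the crawler itself rather than by MySQL. The following is only a minimal sketch of that idea, not a MySQL feature: `Node` and `SpilloverRouter` are hypothetical names, each `Node` is an in-memory stand-in for one MySQL server, and the capacity numbers are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Stand-in for one MySQL server; `rows` plays the role of a table."""
    name: str
    capacity_bytes: int
    used_bytes: int = 0
    rows: dict = field(default_factory=dict)

    def has_room(self, size: int) -> bool:
        return self.used_bytes + size <= self.capacity_bytes


class SpilloverRouter:
    """Writes go to the first node with free space; reads follow a directory."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.directory = {}  # url -> Node that holds that page

    def save(self, url: str, body: bytes) -> str:
        if url in self.directory:
            raise KeyError(f"{url} is already stored")
        for node in self.nodes:
            if node.has_room(len(body)):
                node.rows[url] = body
                node.used_bytes += len(body)
                self.directory[url] = node
                return node.name
        raise RuntimeError("all nodes are full")

    def load(self, url: str) -> bytes:
        # The directory sends the read to the node that took the write.
        return self.directory[url].rows[url]


router = SpilloverRouter([Node("pc1", capacity_bytes=10),
                          Node("pc2", capacity_bytes=10)])
print(router.save("http://a.example/", b"12345678"))  # fits on pc1
print(router.save("http://b.example/", b"abcdef"))    # pc1 lacks room, so pc2
print(router.load("http://b.example/"))
```

In a real setup the directory would itself have to be stored somewhere durable, for example in a small table on one of the two servers, and each `Node` would wrap a database connection instead of a dict.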
MyISAM is the first choice, because your workload will be write-only and the crawlers, even when run in parallel, will presumably be configured to crawl different domains or URLs. So you do not need to worry about access conflicts.
When writing a lot of data to MySQL, especially text, avoid transactions, indexes, etc., because they slow MySQL down drastically.
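As a rough illustration of a write-heavy setup along these lines, the sketch below declares a table without secondary indexes and groups crawled pages into multi-row `INSERT` statements, which cuts the number of round trips to the server. The table name `pages`, its columns, and the helper names are assumptions made for this example. The `%s` placeholders follow the parameter style used by common Python MySQL drivers, and the code only builds the statements; it does not connect to a server.

```python
from itertools import islice

# Hypothetical schema; the ENGINE clause selects the storage engine per table.
CREATE_PAGES = (
    "CREATE TABLE pages ("
    " url VARCHAR(2048) NOT NULL,"
    " body LONGTEXT NOT NULL"
    ") ENGINE=MyISAM"
)


def chunked(rows, size):
    """Yield successive lists of at most `size` rows."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def build_multi_insert(table, columns, batch):
    """Return (sql, params) for one multi-row INSERT with %s placeholders."""
    row_marks = "(" + ", ".join(["%s"] * len(columns)) + ")"
    sql = "INSERT INTO {} ({}) VALUES {}".format(
        table, ", ".join(columns), ", ".join([row_marks] * len(batch))
    )
    # Flatten the rows so the values line up with the placeholders.
    params = [value for row in batch for value in row]
    return sql, params


rows = [("http://a.example/", "<html>a</html>"),
        ("http://b.example/", "<html>b</html>"),
        ("http://c.example/", "<html>c</html>")]
for batch in chunked(rows, 2):
    sql, params = build_multi_insert("pages", ["url", "body"], batch)
    print(sql, len(params))
```

With a real driver, each `(sql, params)` pair would be passed to a cursor's `execute` call. The batch size is a tuning knob: larger batches mean fewer statements, but each statement must stay under the server's maximum packet size.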