Ruby on Rails/Merb as a frontend for an application with several billion records
I am looking for a backend solution for an application written in Ruby on Rails or Merb that has to handle several billion records. I have a feeling I'm supposed to go with a distributed model, and so far I have looked at HBase and CouchDB.
The problems as I see them: HBase's Ruby support is not very strong, and CouchDB has not reached version 1.0 yet.
Do you have a suggestion for what you would use for such a large amount of data?
The data will sometimes require rather fast imports, 30-40 MB at once, but the imports will come in chunks. So roughly 95% of the time the data will be read-only.
5 Answers
Depending on your actual data usage, MySQL or Postgres should be able to handle a couple of billion records on the right hardware. If you have a particularly high volume of requests, both of these databases can be replicated across multiple servers (and read replication is quite easy to set up, compared to multi-master/write replication).
The big advantage of using an RDBMS with Rails or Merb is that you gain access to all of the excellent tool support for accessing these types of databases.
My advice is to actually profile your data in a couple of these systems and take it from there.
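For instance, a quick throwaway script along these lines can put numbers on the chunked-import and read-heavy patterns described in the question; the scratch database, the `records` table, and the row counts are all placeholder assumptions here:

```ruby
require "active_record"
require "benchmark"

# Hypothetical scratch database; swap the adapter for "postgresql"
# to compare the two suggestions side by side.
ActiveRecord::Base.establish_connection(
  adapter:  "mysql2",
  database: "profiling_test",
  username: "root"
)

# A throwaway table standing in for your real schema.
ActiveRecord::Schema.define do
  create_table :records, force: true do |t|
    t.string :payload
  end
end

class Record < ActiveRecord::Base; end

# Simulate the chunked imports from the question (~30-40 MB at a time).
import_time = Benchmark.realtime do
  Record.transaction do
    10_000.times { |i| Record.create!(payload: "row #{i}") }
  end
end

# Simulate the read-heavy steady state (~95% of traffic).
read_time = Benchmark.realtime do
  1_000.times { Record.find_by(id: rand(1..10_000)) }
end

puts format("import: %.2fs, reads: %.2fs", import_time, read_time)
```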
There are a number of different solutions people have used. In my experience, it really depends more on your usage patterns for that data than on the sheer number of rows per table.
For example, "How many inserts/updates per second are occurring." Questions like these will play into your decision of what back-end database solution you'll choose.
Take Google for example: There didn't really exist a storage/search solution that satisfied their needs, so they created their own based on a Map/Reduce model.
A word of warning about HBase and other projects of that nature (I don't know anything about CouchDB -- I think it's not really a db at all, just a key-value store):
The Hive project, also built on top of Hadoop, does support joins; so does Pig (but it's not really SQL). Point 1 applies to both. They are meant for heavy data-processing tasks, not the type of processing you are likely to be doing with Rails.
If you want scalability for a web app, basically the only strategy that works is partitioning your data and doing as much as possible to ensure the partitions are isolated (don't need to talk to each other). This is a little tricky with Rails, as it assumes by default that there is one central database. There may have been improvements on that front since I looked at the issue about a year and a half ago. If you can partition your data, you can scale horizontally fairly wide. A single MySQL machine can deal with a few million rows (PostgreSQL can probably scale to a larger number of rows but might work a little slower).
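As a very rough sketch of that partitioning idea under ActiveRecord (the shard databases and the user_id shard key are invented for illustration; gems such as Octopus later packaged this pattern properly):

```ruby
require "active_record"

# Two hypothetical shard databases; each holds the same schema.
SHARDS = {
  0 => { adapter: "mysql2", database: "app_shard_0", username: "root" },
  1 => { adapter: "mysql2", database: "app_shard_1", username: "root" },
}.freeze

class Event < ActiveRecord::Base
  # Route to a shard by partition key. NOTE: establish_connection swaps
  # the model's connection pool globally, so this is illustration only --
  # a real setup keeps one connection pool per shard.
  def self.on_shard_for(user_id)
    establish_connection(SHARDS[user_id % SHARDS.size])
    yield
  end
end

# All queries for user 42 stay inside user 42's shard.
Event.on_shard_for(42) { Event.where(user_id: 42).count }
```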
Another strategy that works is having a master-slave set up, where all writes are done by the master, and reads are shared among the slaves (and possibly the master). Obviously this has to be done fairly carefully! Assuming a high read/write ratio, this can scale pretty well.
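Rails did eventually grow first-class support for exactly this split; with the multiple-database API in Rails 6+, a minimal sketch might look like this, assuming `primary` and `primary_replica` entries exist in database.yml:

```ruby
# app/models/application_record.rb in a Rails 6+ app
class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true
  # "primary" and "primary_replica" must match entries in database.yml,
  # with the replica marked replica: true.
  connects_to database: { writing: :primary, reading: :primary_replica }
end

class Article < ApplicationRecord; end # hypothetical model

# Writes are pinned to the master/writer...
ApplicationRecord.connected_to(role: :writing) do
  Article.create!(title: "hello")
end

# ...and read-only traffic can be sent to a replica.
ApplicationRecord.connected_to(role: :reading) do
  Article.where(published: true).limit(20).to_a
end
```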
If your organization has deep pockets, check out what Vertica, AsterData, and Greenplum have to offer.
The backend will depend on the data and how the data will be accessed.
But for the ORM, I'd most likely use DataMapper and write a custom DataObjects adapter to get to whatever backend you choose.
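A bare-bones sketch of what that adapter might look like, going by dm-core's AbstractAdapter interface; the `BigStoreAdapter` name and the `backend_*` calls are placeholders for whatever store you choose, not a real API:

```ruby
require "dm-core"

module DataMapper
  module Adapters
    # Skeleton only: dm-core's AbstractAdapter expects create/read/
    # update/delete implementations.
    class BigStoreAdapter < AbstractAdapter
      def create(resources)
        resources.each { |resource| backend_insert(resource.attributes) }
        resources.size # number of records created
      end

      def read(query)
        backend_select(query) # return an array of attribute hashes
      end

      # update(attributes, collection) and delete(collection) follow the
      # same pattern.

      private

      # Hypothetical backend calls -- replace with your store's client.
      def backend_insert(attributes); end

      def backend_select(query)
        []
      end
    end
  end
end
```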
I'm not sure what CouchDB not being at 1.0 has to do with it. I'd recommend doing some testing with it (just generate a billion random documents) and see if it'll hold up. I'd say it will, despite not having a specific version number.
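Such a test might look roughly like this against CouchDB's HTTP API, bulk-loading batches of random documents through the `_bulk_docs` endpoint (the database name and batch/loop sizes are arbitrary):

```ruby
require "net/http"
require "json"
require "securerandom"

couch = URI("http://127.0.0.1:5984")
db    = "scale_test" # arbitrary database name

Net::HTTP.start(couch.host, couch.port) do |http|
  http.put("/#{db}", "") # create the database (409 if it already exists)

  # 1,000 batches of 1,000 docs = one million documents; raise the
  # loop counts to push toward the billions.
  1_000.times do
    docs = Array.new(1_000) do
      { value: SecureRandom.hex(16), created_at: Time.now.to_f }
    end
    http.post("/#{db}/_bulk_docs",
              { docs: docs }.to_json,
              "Content-Type" => "application/json")
  end
end
```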
CouchDB will help you a lot when it comes to partitioning/sharding your data, and it seems like it might fit your project -- especially if your data format might change in the future (adding or removing fields), since CouchDB databases have no schema.
There are plenty of optimizations in CouchDB for read-heavy apps as well, and based on my experience with it, that is where it really shines.