Which database is right for this job?
I am working on a feature and could use opinions on which database I should use to solve this problem.
We have a Rails application using MySQL. We have no issues with MySQL and it runs great, but for a new feature we are deciding whether to stay with MySQL or not. To simplify the problem, let's assume there is a User and a Message model. A user can create messages. Each message is delivered to other users based on their association with the poster.
Obviously there is an association based on friendship, but there are many more associations based on the user's profile. I plan to store some metadata about the poster along with the message, so that I don't have to pull the metadata each time I query the messages.
Therefore, a message might look like this:
{
id: 1,
message: "Hi",
created_at: 1234567890,
metadata: {
user_id: 555,
category_1: null,
category_2: null,
category_3: null,
...
}
}
When I query the messages, I need to be able to query based on zero or more metadata attributes. This call needs to be fast and occurs very often.
Due to the number of metadata attributes and the fact any number can be included in a query, creating SQL indexes here doesn't seem like a good idea.
Personally, I have experience with MySQL and MongoDB. I've started research on Cassandra, HBase, Riak and CouchDB. I could use some help from people who might have done the research as to which database is the right one for my task.
And yes, the messages table can easily grow into millions of rows.
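To make the access pattern concrete, here is a minimal sketch of a "zero or more metadata filters" query, written in Python against an in-memory SQLite database purely for illustration. The flattened column layout, the FILTERABLE whitelist and the find_messages helper are assumptions made for this example, not part of the real application.

```python
import sqlite3

# Hypothetical schema: the metadata attributes are flattened into columns.
FILTERABLE = ("user_id", "category_1", "category_2", "category_3")

def find_messages(conn, **filters):
    """Return (id, message) rows matching every given filter (zero or more)."""
    unknown = set(filters) - set(FILTERABLE)
    if unknown:
        raise ValueError(f"unsupported filter(s): {sorted(unknown)}")
    clauses, params = [], []
    for column, value in filters.items():
        if value is None:
            clauses.append(f"{column} IS NULL")   # NULL needs IS NULL, not '='
        else:
            clauses.append(f"{column} = ?")       # values are always bound
            params.append(value)
    sql = "SELECT id, message FROM messages"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY created_at DESC"
    return conn.execute(sql, params).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (id INTEGER PRIMARY KEY, message TEXT, "
    "created_at INTEGER, user_id INTEGER, "
    "category_1 TEXT, category_2 TEXT, category_3 TEXT)"
)
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, "Hi", 1234567890, 555, None, None, None),
        (2, "Hello", 1234567900, 555, "a", None, None),
        (3, "Hey", 1234567910, 556, "a", "b", None),
    ],
)

print(find_messages(conn))                               # no filters at all
print(find_messages(conn, user_id=555, category_1="a"))  # two filters
```

Column names are only ever taken from the whitelist, so the string interpolation does not open an injection hole. Without indexes, every one of these calls is a full table scan, which is exactly the concern above.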
Comments (6)
This is a very open-ended question, so all we can do is give advice based on experience. The first thing to consider is whether it's a good idea to pick something you haven't used before instead of MySQL, which you are familiar with. It's boring not to use shiny new things when you have the opportunity, but believe me, it's terrible when you've painted yourself into a corner because you thought the new toy would do everything it said on the box. Nothing ever works the way it says in the blog posts.
I mostly have experience with MongoDB. It's a terrible choice unless you want to spend a lot of time trying different things and realizing they don't work. Once you scale up a bit you basically can't use things like secondary indexes, updates, and the other things that make Mongo an otherwise awesomely nice tool. Most of this has to do with its global write lock and its on-disk database format: it basically sucks at concurrency and fragments really easily if you remove data.
I don't agree that HBase is out of the question. It doesn't have secondary indexes, but you can't use those anyway once you get above a certain traffic load. The same goes for Cassandra (which is easier to deploy and work with than HBase). Basically, you will have to implement your own indexing whichever solution you choose.
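As a rough idea of what "implement your own indexing" can mean, here is a minimal in-memory sketch in Python of an inverted index keyed by (attribute, value). The MetadataIndex class and its method names are hypothetical; in a real deployment the posting sets would live in extra tables or column families in the store itself, and keeping them consistent with the messages is the hard part that this sketch ignores.

```python
from collections import defaultdict

class MetadataIndex:
    """Hand-rolled inverted index: (attribute, value) -> set of message ids."""

    def __init__(self):
        self._postings = defaultdict(set)
        self._all_ids = set()

    def add(self, message):
        msg_id = message["id"]
        self._all_ids.add(msg_id)
        for attr, value in message["metadata"].items():
            self._postings[(attr, value)].add(msg_id)

    def query(self, **filters):
        """Intersect the posting sets of all filters; no filters means all ids."""
        result = set(self._all_ids)
        for attr, value in filters.items():
            result &= self._postings.get((attr, value), set())
        return result

index = MetadataIndex()
index.add({"id": 1, "message": "Hi",
           "metadata": {"user_id": 555, "category_1": None}})
index.add({"id": 2, "message": "Hello",
           "metadata": {"user_id": 555, "category_1": "a"}})
index.add({"id": 3, "message": "Hey",
           "metadata": {"user_id": 556, "category_1": "a"}})

print(index.query(category_1="a"))
print(index.query(category_1="a", user_id=555))
```

Every write now costs one extra update per indexed attribute, which is the price you pay for fast reads on arbitrary attribute combinations.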
What you should consider is whether you need consistency over availability or vice versa (e.g. how bad is it if a message is lost or delayed vs. how bad is it if a user can't post or read a message), and whether you will update your data. For example, data in Riak is an opaque blob: to change it you need to read it and write it back. In Cassandra, HBase and MongoDB you can add and remove properties without first reading the object. Ease of use is also an important factor. Mongo is certainly easy to use from the programmer's perspective and HBase is horrible, but spend some time writing your own library that encapsulates the nasty stuff and it will be worth it.
Finally, don't listen to me: try them out and see how they perform and how they feel. Make sure you load them as hard as you can, and make sure you test everything you will do. I made the mistake of not testing what happens when you remove lots of data in MongoDB, and have paid dearly for it.
I would recommend looking at the presentation "Why databases suck for messaging", which mainly argues why you shouldn't use databases such as MySQL for messaging.
I think in this scenario CouchDB's changes feed may come in quite handy, although you would probably also have to create some more complex views for querying message metadata. If speed is critical, also take a look at Redis, which is really fast and comes with pub/sub functionality. MongoDB, with its ad hoc query support, may also be a decent solution for this use case.
I think you're spot-on in storing metadata along with each message! Sacrificing storage for faster retrieval time is probably the way to go. Note that it could get complicated if you ever need to change a user's metadata and propagate that to all the messages. You should consider how often that might happen, whether you'll actually need to update all the message records, and based on that whether it's worth paying the price for the sake of fewer queries (it probably is worth it, but that depends on the specifics of your system).
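To illustrate the propagation cost, here is a small Python sketch, with plain dicts standing in for stored message records. The propagate_profile_change helper is hypothetical; the point is only that one profile change turns into one write per message that user ever posted.

```python
def propagate_profile_change(messages, user_id, changes):
    """Rewrite the embedded metadata copy on every message posted by user_id.

    Returns how many message records had to be touched.
    """
    touched = 0
    for message in messages:
        if message["metadata"]["user_id"] == user_id:
            message["metadata"].update(changes)
            touched += 1
    return touched

messages = [
    {"id": 1, "message": "Hi", "metadata": {"user_id": 555, "category_1": None}},
    {"id": 2, "message": "Hello", "metadata": {"user_id": 555, "category_1": None}},
    {"id": 3, "message": "Hey", "metadata": {"user_id": 556, "category_1": "a"}},
]

# User 555 changes one profile attribute: two stored messages must be rewritten.
rewritten = propagate_profile_change(messages, 555, {"category_1": "b"})
print(rewritten)
```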
I agree with @Andrej_L that HBase isn't the right solution for this problem. Cassandra falls in with it for the same reason.
CouchDB could solve your problem, but you're going to have to define views (materialized indices) for any metadata you want to query. If the whole point of not using MySQL here is to avoid indexing everything, then CouchDB is probably not the right solution either.
Riak would be a much better option since it queries your data using map-reduce. That allows you to build any query you like without the need to pre-index all your data as in CouchDB. Millions of rows are not a problem for Riak, no worries there. Should the need arise, it also scales very well by simply adding more nodes (and it can rebalance itself too, so this is really a non-issue).
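The shape of such an ad hoc map-reduce query can be sketched in plain Python. This is not Riak's API, only an illustration of the idea; the partitions list stands in for data spread across nodes, and all function names are made up for the example.

```python
from functools import reduce

def map_phase(partition, predicate):
    """Runs once per partition: keep the messages whose metadata matches."""
    return [m for m in partition if predicate(m["metadata"])]

def reduce_phase(left, right):
    """Merges the partial results coming back from the partitions."""
    return left + right

def run_query(partitions, predicate):
    mapped = [map_phase(p, predicate) for p in partitions]
    return reduce(reduce_phase, mapped, [])

# Two hypothetical partitions holding the message records.
partitions = [
    [{"id": 1, "metadata": {"user_id": 555, "category_1": None}},
     {"id": 2, "metadata": {"user_id": 555, "category_1": "a"}}],
    [{"id": 3, "metadata": {"user_id": 556, "category_1": "a"}}],
]

# Any predicate works without a pre-built index.
hits = run_query(partitions, lambda md: md["category_1"] == "a")
print([m["id"] for m in hits])
```

The flexibility comes from the map phase visiting every record in every partition, so query cost grows with the size of the data set. Whether that is fast enough for a query that runs very often is something to measure.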
So based on my own experience, I'd recommend Riak. However, unlike you, I have no direct experience with MongoDB, so you'll have to judge it against Riak yourself (or maybe someone else here can answer that).
From my experience, HBase is not a good solution for your application.
Because:
It has no secondary indexes by default (you would have to install a plugin or something similar), so you can only search efficiently by primary key. I have implemented a secondary index using HBase and additional tables. So you can't use it in an online application: to get results you would have to run a map/reduce job, and that takes a long time over millions of rows.
It's very difficult to maintain and tune this database. To work effectively you will need HBase together with Hadoop, which requires one powerful machine or several machines.
HBase is very useful when you need to build aggregate reports over large amounts of data. It seems that you don't.
It sounds like you need a join, so you can mostly forget about CouchDB until they sort out the multiview code that was being worked on (I'm not actually sure it is still being worked on).
Riak can query as fast as you make it; it depends on the nodes.
Mongo will let you create an index on any field, even if that field is an array.
CouchDB is very different: it builds indexes using a stored map-reduce (but without the reduce), which they call a "view".
RethinkDB will let you have SQL but a little faster
TokuDB will too
Redis will kill all in speed, but it's entirely stored in RAM
Single-level relations can be done in all of them, but differently for each.