Using spatial data with MongoDB or Cassandra
I am considering a proof of concept for handling a large volume of spatial data (more than 10 GB) that requires at least 200 writes per second and about 50 reads per second. The system is also growing. For performance reasons, I am currently considering moving this data into a NoSQL, Bigtable-style database.
I have taken a closer look at MongoDB and Cassandra. From what I have read so far:
MongoDB:
- seems to have a write-lock problem
- one Stack Overflow post suggested this database if there is no need for multiple servers
- indexes are kept in memory, so performance is said to deteriorate as the indexes grow
- its advantage is direct support for spatial data and indexing, along with features such as finding nearby locations
- the post "Cassandra Or MongoDB For Our Location Based Application" suggests MongoDB as the best choice
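To make the "finding nearby locations" point concrete, here is a minimal sketch of what a stored document and a proximity query look like in current MongoDB versions, built as plain Python dictionaries. The field name `location`, the coordinates, and the helper names are assumptions for illustration only; the `$near` / `$geometry` / `$maxDistance` operators expect a `2dsphere` index on the queried field.

```python
def point(lng, lat):
    """Build a GeoJSON Point. GeoJSON order is longitude first, then latitude."""
    return {"type": "Point", "coordinates": [lng, lat]}


def near_query(field, lng, lat, max_meters):
    """Build a MongoDB filter for documents within max_meters of a point.

    The field name is hypothetical; the collection needs a 2dsphere index on it.
    """
    return {
        field: {
            "$near": {
                "$geometry": point(lng, lat),
                "$maxDistance": max_meters,
            }
        }
    }


# A document as it would be stored (names and values are made up).
doc = {"name": "cafe", "location": point(-73.97, 40.77)}

# A filter for everything within 500 meters of the same spot.
query = near_query("location", -73.97, 40.77, 500)
```

With a driver, the `query` dictionary would be passed as the filter of a find call on a collection that has the geospatial index in place.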
Cassandra:
- seems to be the best among the related databases
- seems to have great write as well as read performance
- does not natively support spatial indexing, but this can be added via geohashing
I am actually leaning toward MongoDB because of its good documentation and direct support for spatial data. Has anybody had a bad experience using MongoDB for systems this large? I do see a lot of posts about MongoDB iostat performance.
If MongoDB is not suitable, can someone give some pointers on geohashing with Cassandra? I saw http://code.google.com/p/geospatialweb/ for creating the hashes, but I still have questions about how to query them.
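For readers unfamiliar with geohashing, the following is a minimal sketch of the standard geohash encoding in plain Python. It is not the code from the linked project, and the idea of using a hash prefix as a Cassandra partition key is an assumption shown only to illustrate how such hashes could be queried.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lng, precision=8):
    """Encode a latitude/longitude pair as a geohash string."""
    lat_lo, lat_hi = -90.0, 90.0
    lng_lo, lng_hi = -180.0, 180.0
    chars = []
    bits = 0
    bit_count = 0
    use_lng = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lng:
            mid = (lng_lo + lng_hi) / 2
            if lng >= mid:
                bits = (bits << 1) | 1
                lng_lo = mid
            else:
                bits <<= 1
                lng_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lng = not use_lng
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


full_hash = geohash_encode(57.64911, 10.40744, precision=8)
# Hypothetical layout: a short prefix groups nearby points into one cell.
cell = full_hash[:5]
```

Points that share a prefix fall into the same grid cell, so one possible approach is to store the prefix as the partition key and read whole cells. Two nearby points can still land in adjacent cells, so a proximity query generally has to check the neighboring cells as well.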
Comments (4)
I realize this is an older question, and I know this doesn't directly answer it, but depending on your queries, Cassandra may not be the best option. Getting your queries to work with indexing in MongoDB can be problematic as well (in my own experience). Mongo has a slight edge over Cassandra for heavy geo data and queries, imho.
I'd also suggest looking into ElasticSearch, which, depending on the shape of your data and the types of queries you'll be making, is probably the best solution. When you posted your question it was likely less of an option than it is today, though.
Try Cassandra + Solr.
This might be useful:
http://digbigdata.com/geospatial-search-cassandra-datastax-enterprise/
Regards,
Goutham Kumar
tl;dr
Elassandra, a combination of Cassandra and ElasticSearch.
A little update from the future.
I'm currently designing a concept for a big data real-time system that also needs to store geospatial data and query it at scale. Over the last few days I did a lot of research on how to arrange the data properly and still support a geospatial index and queries such as a bounding box.
The first option I read about was PostgreSQL + PostGIS, but the biggest instance is limited to a maximum of 200k writes/sec.
The second was a geospatial database, Tile38, which is able to scale queries but not writes. The only way around this would be to shard the data manually.
The third was MongoDB, because it has good documentation covering the geospatial functionality I need, but it was hard to determine whether the writes can be scaled.
So the last database was Cassandra. It is well known for horizontal write scaling and failover. The trade-off with Cassandra is that query performance is not good and it does not support geospatial data out of the box. For querying the data at scale, ElasticSearch is a good solution, as Tracker1 already suggested. Today I found a new database made up of Cassandra and ElasticSearch, called Elassandra, which allows writes at scale and also reads at scale in near real time. So far this is the best solution for me, with minimal effort for setup and maintenance.
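As a rough illustration of the bounding box query mentioned above, here is a minimal sketch of an ElasticSearch request body built as a Python dictionary. The field name `location` (assumed to be mapped as `geo_point`), the helper name, and the coordinates are assumptions; this only shows the shape of a `geo_bounding_box` filter, not a full client setup.

```python
import json


def bounding_box_query(field, top, left, bottom, right):
    """Build an ElasticSearch query body that filters by a bounding box.

    top/bottom are latitudes, left/right are longitudes. The field is a
    hypothetical geo_point field.
    """
    if top < bottom:
        raise ValueError("top latitude must not be below bottom latitude")
    return {
        "query": {
            "bool": {
                "filter": {
                    "geo_bounding_box": {
                        field: {
                            "top_left": {"lat": top, "lon": left},
                            "bottom_right": {"lat": bottom, "lon": right},
                        }
                    }
                }
            }
        }
    }


body = bounding_box_query("location", top=40.80, left=-74.05,
                          bottom=40.70, right=-73.90)
payload = json.dumps(body)  # JSON string to send to the search endpoint
```

Using a filter rather than a scored query keeps the box test out of relevance scoring, which is usually what a pure geographic lookup wants.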
We also use Cassandra at the moment and are looking for a spatial index solution. We are going with Lucene in order to provide full-text and attribute search, and along with it comes support for spatial indexing. Maybe you want to check this out, too.
Our current implementation shards the information based on a simple grid-based tree. Each shard is a Lucene index, and once it grows beyond a certain size, the index is split along either x or y. Since such a shard has a binary representation (a position in the grid consists of two bits, the next level adds the next two bits, and so on), a search is issued by position and is answered by any shard whose key is a prefix of the position at the given grid resolution. This simple system works well so far but is not in production use at the moment.
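A minimal sketch of the two-bits-per-level addressing described above, in Python. This is not our actual implementation: the function names, the normalized [0, 1) coordinates, and the x-bit-then-y-bit order are assumptions made only to show how prefix matching picks the shards.

```python
def cell_key(x, y, levels):
    """Return the grid position as a bit string, two bits per level.

    x and y are assumed to be normalized to [0, 1). At each level the
    first bit records the x half and the second bit the y half.
    """
    x_lo, x_hi = 0.0, 1.0
    y_lo, y_hi = 0.0, 1.0
    bits = []
    for _ in range(levels):
        x_mid = (x_lo + x_hi) / 2
        if x >= x_mid:
            bits.append("1")
            x_lo = x_mid
        else:
            bits.append("0")
            x_hi = x_mid
        y_mid = (y_lo + y_hi) / 2
        if y >= y_mid:
            bits.append("1")
            y_lo = y_mid
        else:
            bits.append("0")
            y_hi = y_mid
    return "".join(bits)


def shards_for(position_key, shard_keys):
    """Select the shards whose key is a prefix of the position key."""
    return [key for key in shard_keys if position_key.startswith(key)]


position = cell_key(0.75, 0.25, levels=2)
matching = shards_for(position, ["00", "10", "1011", "11"])
```

A shard with a short key covers a large cell and a shard with a longer key covers a smaller cell inside it, so a single position can match more than one shard.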