Best storage engine for high-volume static analytics data
I'm building an application that requires storing and querying a very large amount of timestamped data. Specifically, this is analytics-type data, whereby a particular "set" can contain up to 100 separate data points tied together by a shared timestamp. This data is static, in that it will never change once stored, but it may be queried multiple times before it is expunged from the database (we currently only store 12 weeks of data).
At the moment, we're storing this data within MySQL with several tables of around 100 million rows each (we add roughly 3 million rows daily). This is fine when storing and querying the data only by timestamp, but it becomes tedious when performing SELECTs with sorts and especially when trying to remove old stale data from the tables. A DELETE query using for example WHERE timestamp < $12weeksago often takes several hours to complete.
We would also like to keep some of the data beyond the 12-week period, which adds another WHERE clause (against another column) to the DELETE. This slows things down further.
What storage engine would you recommend based on this use case, bearing in mind the high volume of reads/writes and querying needs?
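For illustration, the retention delete described above might look like the following (table and column names are hypothetical, not from the original post):

```sql
-- Hypothetical schema: InnoDB table keyed by timestamp, ~100M rows.
-- A range-scan DELETE must locate, lock, and remove each row one at a
-- time and update every secondary index, which is why a purge like
-- this can run for hours on a table of this size.
DELETE FROM analytics_points
WHERE  recorded_at < NOW() - INTERVAL 12 WEEK
  AND  retain_flag = 0;   -- the extra predicate for rows kept past 12 weeks
```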
It all depends on the kind of SELECTs that you do. If a NoSQL database can handle them, that might be your best bet.
However, I think you can probably improve your performance a lot by partitioning the database. Since the data is already time-based, you can easily create a partition per day/week/month; that way you only have to query the partitions that are relevant to your current query. And deleting old data becomes a simple DROP TABLE (or DROP PARTITION) instead of a slow DELETE.
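A minimal sketch of that approach in MySQL (table, column, and partition names are illustrative, not from the original post). Note that MySQL requires the partitioning column to be part of every unique key, hence the composite primary key:

```sql
-- Weekly RANGE partitions on the timestamp column.
CREATE TABLE analytics_points (
    id          BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    recorded_at DATETIME        NOT NULL,
    metric      VARCHAR(64)     NOT NULL,
    value       DOUBLE          NOT NULL,
    PRIMARY KEY (id, recorded_at)
) ENGINE = InnoDB
PARTITION BY RANGE (TO_DAYS(recorded_at)) (
    PARTITION p2023w01 VALUES LESS THAN (TO_DAYS('2023-01-09')),
    PARTITION p2023w02 VALUES LESS THAN (TO_DAYS('2023-01-16')),
    PARTITION p_future VALUES LESS THAN MAXVALUE
);

-- Expiring a week of data is then a metadata operation, not a
-- row-by-row delete, and completes quickly regardless of size:
ALTER TABLE analytics_points DROP PARTITION p2023w01;
```

One caveat for this use case: rows that should survive past the 12-week window would need to be copied into a separate archive table before the partition holding them is dropped, since DROP PARTITION removes every row in the partition unconditionally.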
Try Redis or MongoDB. They were both designed for this sort of use case.