Is Amazon SimpleDB suitable for large temporal data sets from thousands of independent devices?
I'm trying to establish whether Amazon SimpleDB is suitable for a subset of data I have.
I have thousands of deployed autonomous sensor devices recording data.
Each sensor device essentially reports a couple of values four times an hour each day, over months and years. I need to keep all of this data for historic statistical analysis. Generally, it is write once, read many times. Server-based applications run regularly to query the data to infer other information.
Today, the rows of data in SQL look something like this:
- (id, device_id, utc_timestamp, value1, value2)
Our existing MySQL solution, with tens of millions of rows, is not going to scale up much further. We run queries such as "tell me the sum of all value1 readings from yesterday" or "show me the average of value2 over the last 8 hours". We do this in SQL but can happily change to doing it in code. SimpleDB's "eventual consistency" appears fine for our purposes.
I'm reading up all I can and am about to start experimenting with our AWS account, but it's not clear to me how the various SimpleDB concepts (items, domains, attributes, etc.) relate to our domain.
Is SimpleDB an appropriate vehicle for this and what would a generalised approach be?
PS: We mostly use Python, but this shouldn't matter when considering this at a high level. I'm aware of the boto library at this point.
Edit:
While continuing to search for solutions, I came across the Stack Overflow question "What is the best open source solution for storing time series data?", which was useful.
Comments (4)
Just following up on this one many months later...
I did actually have the opportunity to speak to Amazon directly about this last summer, and later got access to the beta programme for what became DynamoDB, but was not able to talk about it.
I would recommend it for this sort of scenario, where you need a primary key and what might be described as a secondary index or range, e.g. timestamps. This gives you much greater confidence in searches such as "show me all the data for device X between Monday and Friday".
We haven't actually moved to this yet for various reasons but do still plan to.
http://aws.amazon.com/dynamodb/
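To illustrate the "primary key plus range" access pattern mentioned above, here is a minimal in-memory sketch. It is not DynamoDB's API: the class `DeviceReadings` and its `put` and `query` methods are hypothetical names made up for this example, and the only assumption is that `device_id` acts as the primary (hash) key while the UTC timestamp acts as the sorted range key.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict
from datetime import datetime, timezone


class DeviceReadings:
    """Toy model of a table keyed by device_id (hash) and timestamp (range)."""

    def __init__(self):
        self._keys = defaultdict(list)  # device_id -> sorted timestamps
        self._rows = defaultdict(list)  # device_id -> rows in the same order

    def put(self, device_id, ts, value1, value2):
        # Keep each device's rows ordered by timestamp so that range
        # lookups never have to scan other devices' data.
        keys = self._keys[device_id]
        idx = bisect_right(keys, ts)
        keys.insert(idx, ts)
        self._rows[device_id].insert(idx, (ts, value1, value2))

    def query(self, device_id, start, end):
        """Return rows for one device with start <= timestamp <= end."""
        keys = self._keys.get(device_id, [])
        rows = self._rows.get(device_id, [])
        return rows[bisect_left(keys, start):bisect_right(keys, end)]


def utc(year, month, day, hour=0):
    return datetime(year, month, day, hour, tzinfo=timezone.utc)


table = DeviceReadings()
for day in range(1, 8):
    table.put("X", utc(2012, 1, day), value1=day, value2=day * 10)
table.put("Y", utc(2012, 1, 3), value1=99, value2=99)

# "Show me all the data for device X between Monday and Friday"
week_rows = table.query("X", utc(2012, 1, 2), utc(2012, 1, 6))
```

The point of the design is that a query names exactly one device and one contiguous time range, so the amount of work depends on the size of the result and not on the total number of readings stored.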
In my opinion, Amazon SimpleDB and Microsoft Azure Tables are fine solutions as long as your queries are quite simple. As soon as you try to do things that are a non-issue on relational databases, such as aggregates, you begin to run into trouble. So if you are going to do heavy reporting, it might get messy.
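If the aggregation has to move into application code, as the question suggests it could, the two example queries might look roughly like the sketch below. It assumes the rows have already been fetched into memory as dictionaries with timezone-aware UTC timestamps; the function names and the sample data are hypothetical and only there to make the example runnable.

```python
from datetime import datetime, timedelta, timezone


def sum_value1_for_day(rows, day):
    """Sum value1 over rows whose UTC timestamp falls on the given date."""
    return sum(r["value1"] for r in rows if r["utc_timestamp"].date() == day)


def average_value2_since(rows, now, window=timedelta(hours=8)):
    """Average value2 over the trailing window; None if no rows match."""
    values = [
        r["value2"] for r in rows
        if now - window <= r["utc_timestamp"] <= now
    ]
    return sum(values) / len(values) if values else None


def reading(device_id, ts, value1, value2):
    return {"device_id": device_id, "utc_timestamp": ts,
            "value1": value1, "value2": value2}


UTC = timezone.utc
now = datetime(2012, 1, 2, 12, 0, tzinfo=UTC)
rows = [
    reading("a", datetime(2012, 1, 1, 23, 45, tzinfo=UTC), 1.5, 10.0),
    reading("b", datetime(2012, 1, 1, 6, 0, tzinfo=UTC), 2.5, 20.0),
    reading("a", datetime(2012, 1, 2, 5, 0, tzinfo=UTC), 4.0, 30.0),
    reading("a", datetime(2012, 1, 2, 11, 0, tzinfo=UTC), 8.0, 50.0),
]

yesterday = (now - timedelta(days=1)).date()
total_yesterday = sum_value1_for_day(rows, yesterday)
recent_average = average_value2_since(rows, now)
```

This is where the "messy" part shows up: every row in the time window has to be transferred to the client before it can be summed, so the cost grows with the number of readings rather than staying inside the database.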
It sounds like your problem may be best handled by a round-robin database (RRD). An RRD stores time-variable data in such a way that the file size never grows beyond its initial setting. It's extremely cool and very useful for generating graphs and time series information.
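The fixed-size idea can be sketched with a ring buffer. This is only an illustration of the storage principle, not how any particular RRD tool is implemented; the class name `RoundRobinArchive` and its methods are hypothetical.

```python
from collections import deque


class RoundRobinArchive:
    """Fixed-capacity store: once full, each new sample replaces the oldest."""

    def __init__(self, slots):
        # deque(maxlen=...) drops the oldest entry automatically on append.
        self._samples = deque(maxlen=slots)

    def record(self, timestamp, value):
        self._samples.append((timestamp, value))

    def samples(self):
        return list(self._samples)


archive = RoundRobinArchive(slots=3)
for t in range(5):
    archive.record(t, t * 1.5)
kept = archive.samples()
```

Note the trade-off this makes visible: storage stays constant because old samples are overwritten, which is worth weighing against a requirement to keep every reading for historic analysis.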
I agree with Oliver Weichhold that a cloud-based database solution will handle the use case you described. You can spread your data across multiple SimpleDB domains (like partitions) and store it in such a way that most of your queries can be executed against a single domain without having to traverse the entire database. Defining your partition strategy will be key to the success of moving towards a cloud-based DB. Data set partitioning is talked about here
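One possible partitioning strategy, sketched under assumptions: one domain per calendar month, one item per reading, and every attribute stored as a string because SimpleDB compares values lexicographically. The domain prefix `readings`, the item-name format, and all function names here are hypothetical choices for illustration; no SimpleDB or boto calls are made.

```python
from datetime import datetime, timezone


def domain_for(ts, prefix="readings"):
    """Pick the domain for a reading: one domain per calendar month."""
    return f"{prefix}_{ts:%Y_%m}"


def to_item(device_id, ts, value1, value2):
    """Map one SQL row to (domain, item name, attributes).

    Assumes ts is already in UTC. The ISO-style timestamp string sorts
    lexicographically in the same order as the times it represents.
    """
    item_name = f"{device_id}#{ts:%Y%m%dT%H%M%SZ}"
    attributes = {
        "device_id": str(device_id),
        "utc_timestamp": ts.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "value1": str(value1),
        "value2": str(value2),
    }
    return domain_for(ts), item_name, attributes


def domains_between(start, end, prefix="readings"):
    """List the monthly domains a query over [start, end] must touch."""
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{prefix}_{year:04d}_{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names


ts = datetime(2012, 1, 2, 10, 15, tzinfo=timezone.utc)
domain, name, attrs = to_item(17, ts, 3.2, 41)
touched = domains_between(datetime(2011, 11, 20, tzinfo=timezone.utc), ts)
```

With a scheme like this, "yesterday" and "last 8 hours" queries touch one domain (two at a month boundary), while longer historical analyses fan out over the domains returned by `domains_between`. Numeric attributes would need zero-padding before they could be range-compared as strings; that is left out of this sketch.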