Subsampling data stored in SQL for plotting
Suppose you have a program that logs (timestamp, stock_price) to an SQL database every 30 seconds, and you want to generate plots of the stock price over various timescales. If you plot measurements over a 1-hour range, it's OK to use all 120 samples taken during that time. However, if you want to plot price over a 1-year range, you obviously don't want to pull over 1 million samples out of the database. It would be better to pull some representative subset of the samples out of the database.
This reminds me of the Level of Detail technique in computer graphics -- as you move farther from a 3d model, a lower-fidelity version of the model can be used.
Are there common techniques for representing Level of Detail information in a database, or for quickly querying an evenly spaced subset of data (e.g. give me 100 evenly spaced samples from January 2009)?
The solution I've come up with so far is to include a level_of_detail column in the database table. If level_of_detail=0, the row holds a single instantaneous sample. If level_of_detail=n, the row contains an average of the last (sample_interval*(2^n)) seconds of data, and there are 1/(2^n) as many rows at this level. The table has an index on (level_of_detail, timestamp), and when you want to generate a plot, you calculate an appropriate level_of_detail value based on the number of samples you want and query with that constraint. Disadvantages are:
- For N samples, the table needs to store 2*N rows
- The client must know to specify an appropriate level_of_detail constraint
- Some process needs to be responsible for building the averaged rows as samples are added to the table
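For concreteness, the scheme would look roughly like this (a sketch only; the samples table and its ts/price column names are placeholders I'm using here, and the bucketing expressions assume PostgreSQL date functions -- other databases would need a different bucketing expression):

-- Level 0 holds raw 30-second samples; level n holds averages over 30 * 2^n seconds.
CREATE TABLE samples (
    level_of_detail integer   NOT NULL,  -- 0 = raw sample, n = average over 30 * 2^n seconds
    ts              timestamp NOT NULL,  -- bucket timestamp (here, the start of the averaged interval)
    price           numeric   NOT NULL,  -- instantaneous price, or the interval average
    PRIMARY KEY (level_of_detail, ts)
);

-- Build level 1 by averaging level 0 rows into 60-second buckets; the same
-- pattern builds level n from level n-1. Some background process would run
-- this as new samples arrive.
INSERT INTO samples (level_of_detail, ts, price)
SELECT 1,
       to_timestamp(floor(extract(epoch FROM ts) / 60) * 60),
       avg(price)
FROM samples
WHERE level_of_detail = 0
GROUP BY to_timestamp(floor(extract(epoch FROM ts) / 60) * 60)
ON CONFLICT (level_of_detail, ts) DO NOTHING;

-- To plot about 100 points over one year, pick the largest n with
-- 30 * 2^n <= range_seconds / 100; that is n = 13 here (~128 rows).
SELECT ts, price
FROM samples
WHERE level_of_detail = 13
  AND ts >= '2009-01-01' AND ts < '2010-01-01'
ORDER BY ts;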
Comments (1)
For SQL Server, you could use ntile. This orders the dataset and then splits it into N different groups, returning 1 for the first group and N for the last group. This would return exactly 100 rows. Here's a copy of the test data if you're interested in trying the query:
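The test data and the full query aren't reproduced above, but a query along these lines illustrates the idea (a sketch only, using an assumed samples(ts, price) table; NTILE requires SQL Server 2005 or later):

-- Assign each row in the range to one of 100 buckets by timestamp order,
-- then average each bucket; this returns exactly 100 rows as long as the
-- range contains at least 100 samples.
WITH bucketed AS (
    SELECT ts,
           price,
           NTILE(100) OVER (ORDER BY ts) AS bucket
    FROM samples
    WHERE ts >= '2009-01-01' AND ts < '2009-02-01'
)
SELECT MIN(ts)    AS bucket_start,
       AVG(price) AS avg_price
FROM bucketed
GROUP BY bucket
ORDER BY bucket_start;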