Interpolating Large Datasets
I have a large data set of about 0.5 million records representing the USD/GBP exchange rate over the course of a given day.
I have an application that wants to be able to graph this data or maybe a subset. For obvious reasons I do not want to plot 0.5 million points on my graph.
What I need is a smaller data set (100 points or so) that represents the given data as accurately as possible. Does anyone know of any interesting and performant ways to produce such a data set?
Cheers, Karl
Comments (6)
How about making an enumeration/iterator wrapper? I'm not familiar with Java, but it might look similar to:
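A minimal sketch of such a wrapper, written in Python rather than Java. The name `every_nth` and the `step` parameter are hypothetical; the idea is only that the wrapper walks the underlying records lazily and hands back every `step`-th one, so the full data set is never copied.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def every_nth(records: Iterable[T], step: int) -> Iterator[T]:
    """Lazily yield records 0, step, 2*step, ... from any iterable."""
    if step < 1:
        raise ValueError("step must be at least 1")
    # islice consumes the source one item at a time and skips the rest
    return islice(records, 0, None, step)


# 500,000 records with a step of 5,000 leave 100 points to plot
points = list(every_nth(range(500_000), 5_000))
```

Plain decimation like this is cheap but can miss short spikes that fall between the sampled records.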
There are several statistical methods for reducing a large dataset to a smaller, easier to visualize dataset. It's not clear from your question what summary statistic you want. I've just assumed that you want to see how the exchange rate changes as a function of time, but perhaps you are interested in how often the exchange rate goes above a certain value, or some other statistic that I'm not considering.
Summarizing a trend over time
Here is an example using the lowess method in R (from the documentation on scatter plot smoothing):
The parameter f controls how tightly the regression fits your data. Choose it with some care, as you want something that follows your data accurately without overfitting. Rather than speed and distance, you could plot the exchange rate versus time.
It's also straightforward to access the results of the smoothing. Here's how to do that:
The data object that you get back contains entries named x and y, which correspond to the x and y values passed into the lowess function. In this case, x and y represent speed and dist.
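To make the role of f concrete, here is a simplified sketch of locally weighted regression in plain Python. It is not R's `lowess`: it does a single tricube-weighted linear fit per point and skips the robustness iterations that R performs. The function name `lowess_sketch` is hypothetical, and the cost is quadratic in the number of points, so it is only suitable for data that has already been thinned or binned.

```python
def lowess_sketch(xs, ys, f=2 / 3):
    """One weighted local linear fit per point; returns {'x': ..., 'y': ...}."""
    n = len(xs)
    k = max(2, min(n, int(round(f * n))))  # neighbours used in each local fit
    fitted = []
    for x0 in xs:
        # distance to the k-th nearest point sets the width of the window
        h = sorted(abs(x - x0) for x in xs)[k - 1] or 1e-12
        # tricube weights: 1 at x0, falling to 0 at distance h and beyond
        w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, xs)) / sw
        my = sum(wi * y for wi, y in zip(w, ys)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    # same shape as described above: entries named x and y
    return {"x": list(xs), "y": fitted}
```

A smaller f means fewer neighbours per fit and a curve that hugs the data more closely; a larger f gives a smoother curve.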
One thought is to use the DBMS to compress the data for you with an appropriate query. Something along the lines of having it take a median for a specific range, as a pseudo-query:
Here truncate_to_hour stands for whatever function is appropriate to your DBMS. A similar approach is to use some kind of function that segments the time into unique blocks (such as rounding to the nearest 5-minute interval), or to use another aggregate function that's appropriate in place of the median. Given the complexity of the time-segmenting procedure and how your DBMS optimizes it, it may be more efficient to run the query on a temporary table that holds the segmented time value.
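As a rough illustration of the time-bucket idea, the sketch below uses Python's built-in `sqlite3` module. The table `ticks`, its columns, and the synthetic rates are all made up for the example. SQLite has no built-in median aggregate, so `AVG` stands in for it here, and integer division plays the role of the time-truncating function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: ts is seconds since midnight, rate is the quote
conn.execute("CREATE TABLE ticks (ts INTEGER, rate REAL)")
# Synthetic data: one tick per second for one hour
conn.executemany(
    "INSERT INTO ticks VALUES (?, ?)",
    [(s, 1.5 + (s % 60) / 10000.0) for s in range(3600)],
)

BUCKET_SECONDS = 300  # 5-minute buckets

# Integer division truncates each timestamp to the start of its bucket
rows = conn.execute(
    """
    SELECT (ts / ?) * ? AS bucket_start,
           AVG(rate)    AS avg_rate,
           COUNT(*)     AS n
    FROM ticks
    GROUP BY bucket_start
    ORDER BY bucket_start
    """,
    (BUCKET_SECONDS, BUCKET_SECONDS),
).fetchall()

for bucket_start, avg_rate, n in rows:
    print(bucket_start, round(avg_rate, 5), n)
```

Choosing the bucket size as (time span) / (desired number of points) gives roughly the 100 points asked for in the question.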
If you wanted to write your own, one obvious solution would be to break your record set into fixed number-of-points chunks, for which the value would be the average (mean, median, ... pick one). This has the probable advantage of being the fastest, and shows overall trends.
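A minimal sketch of this chunking approach, assuming the rates are already in a list in time order. The function name `chunk_reduce` and its parameters are hypothetical; the reducer can be swapped between mean and median.

```python
from statistics import mean, median


def chunk_reduce(values, n_points, reducer=mean):
    """Split values into n_points consecutive, nearly equal chunks and
    reduce each chunk to a single number."""
    n = len(values)
    if n == 0 or n_points < 1:
        raise ValueError("need at least one value and one output point")
    n_points = min(n_points, n)  # never ask for more chunks than values
    reduced = []
    for i in range(n_points):
        start = i * n // n_points
        end = (i + 1) * n // n_points
        reduced.append(reducer(values[start:end]))
    return reduced


print(chunk_reduce(list(range(10)), 5))          # one mean per pair
print(chunk_reduce([1, 2, 3, 4, 5, 6], 2, median))
```

This is a single pass over the data, which is why it is likely to be the fastest option.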
But it lacks the drama of price ticks. A better solution would probably involve looking for the inflection points, then selecting among them using sliding windows. This has the advantage of better displaying the actual events of the day, but will be slower.
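One possible reading of this idea, sketched under two assumptions: "inflection points" is taken to mean turning points (local highs and lows), and the windows are consecutive, non-overlapping blocks rather than truly sliding ones. Within each window only the lowest and highest turning point are kept. All names here are hypothetical.

```python
def turning_points(values):
    """Indices where the series changes direction (local highs and lows).
    Flat stretches are skipped because the product below is zero there."""
    found = []
    for i in range(1, len(values) - 1):
        left = values[i] - values[i - 1]
        right = values[i + 1] - values[i]
        if left * right < 0:
            found.append(i)
    return found


def select_extremes(values, window):
    """Keep the first and last sample, plus the lowest and highest
    turning point inside each block of `window` samples."""
    if not values:
        return []
    best = {}  # block number -> (index of lowest, index of highest)
    for i in turning_points(values):
        block = i // window
        lo, hi = best.get(block, (i, i))
        if values[i] < values[lo]:
            lo = i
        if values[i] > values[hi]:
            hi = i
        best[block] = (lo, hi)
    keep = {0, len(values) - 1}
    for lo, hi in best.values():
        keep.update((lo, hi))
    return sorted(keep)
```

The returned indices can be used to pick both the timestamps and the rates to plot, so spikes survive the reduction instead of being averaged away.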
Something like RRDTool would do what you need automatically - the tutorial should get you started, and drraw will graph the data.
I use this at work for things like error graphs. I don't need 1-minute resolution for a 6-month time period, only for the most recent few hours. After that I keep 1-hour resolution for a few days, then 1-day resolution for a few months.
The naive approach is simply to calculate an average for each time interval that corresponds to one pixel.
http://commons.wikimedia.org/wiki/File:Euro_exchange_rate_to_AUD.svg
This does not show fluctuations. I would suggest also calculating the standard deviation in each time interval and plotting that too (essentially making each column taller than a single pixel). I could not locate an example, but I know that Gnuplot can do this (though it is not written in Java).
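A small sketch of the mean-plus-spread idea, assuming one output column per pixel and the population standard deviation as the measure of spread. The function name `mean_and_spread` is hypothetical; each result is a (low, mean, high) triple that a plotting library could draw as a vertical bar or shaded band.

```python
from statistics import mean, pstdev


def mean_and_spread(values, n_pixels):
    """For each pixel-wide interval, return (mean - sd, mean, mean + sd)."""
    n = len(values)
    if n == 0:
        return []
    n_pixels = max(1, min(n_pixels, n))  # at most one interval per value
    bands = []
    for i in range(n_pixels):
        chunk = values[i * n // n_pixels:(i + 1) * n // n_pixels]
        m = mean(chunk)
        sd = pstdev(chunk)  # population standard deviation of the interval
        bands.append((m - sd, m, m + sd))
    return bands
```

Intervals where the rate barely moved collapse to a single point, while volatile intervals show up as taller bars.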