Graphing large amounts of data

Posted on 2024-07-08 21:35:21

In a product I work on, there is an iteration loop that can run anywhere from a few hundred to a few million iterations. Each iteration computes a set of statistical variables (double precision); there can be up to 1000 variables, though typically there are 15-50.

As part of the loop, we graph how the variables change over the iterations: the x axis is the iteration number and the y axis shows the variable values (color-coded by variable):

http://sawtoothsoftware.com/download/temp/walt/graph.jpg

Currently the data are stored in a file of records, each containing:
a 4-byte integer identifying the variable,
a 4-byte integer for the iteration,
and an 8-byte double for the value.
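For a sense of scale, each record is 16 bytes, so 1000 variables over a million iterations would come to about 16 GB. The record layout can be sketched in Python as follows. This is only an illustration: the little-endian byte order, the absence of padding, and the names `RECORD`, `parse_records` and `group_by_variable` are assumptions, not details given in the question.

```python
import struct
from collections import defaultdict

# Assumed layout (not confirmed by the question): little-endian, no padding,
# int32 variable id, int32 iteration number, float64 value.
RECORD = struct.Struct("<iid")  # 4 + 4 + 8 = 16 bytes per record


def parse_records(buf):
    """Yield (variable, iteration, value) tuples from a bytes buffer."""
    # Ignore a trailing partial record, e.g. one that is still being written.
    usable = len(buf) - len(buf) % RECORD.size
    yield from RECORD.iter_unpack(buf[:usable])


def group_by_variable(records):
    """Collect records into {variable: [(iteration, value), ...]}."""
    series = defaultdict(list)
    for variable, iteration, value in records:
        series[variable].append((iteration, value))
    return dict(series)


if __name__ == "__main__":
    sample = RECORD.pack(0, 1, 0.5) + RECORD.pack(1, 1, -2.0) + RECORD.pack(0, 2, 0.75)
    print(group_by_variable(parse_records(sample)))
```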

The total scale of the y axis changes over time, and the graph should resize to accommodate the current scale (this can be seen in the picture).

At about 5 second intervals, the data are read and plotted on a bitmap which is then displayed to the user. We try to do a few optimizations to avoid repainting the whole thing, but if the number of iterations or the number of variables gets big, we end up with an enormous file which takes longer than 5 seconds to draw.
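One way to avoid re-reading the whole file on every refresh is to remember how far the previous pass got and read only the records appended since then, while tracking the running y range. The sketch below assumes the file is append-only and uses a little-endian 16-byte record (int32, int32, float64); the class name `IncrementalReader` is hypothetical.

```python
import struct

# Assumed record layout: little-endian int32 variable, int32 iteration, float64 value.
RECORD = struct.Struct("<iid")


class IncrementalReader:
    """Reads only the records appended since the previous call."""

    def __init__(self, path):
        self.path = path
        self.offset = 0                # bytes already consumed
        self.y_min = float("inf")      # running range of all values seen so far
        self.y_max = float("-inf")

    def read_new(self):
        """Return the new records and update the running y range."""
        with open(self.path, "rb") as f:
            f.seek(self.offset)
            buf = f.read()
        # Leave a trailing partial record for the next pass.
        usable = len(buf) - len(buf) % RECORD.size
        self.offset += usable
        records = list(RECORD.iter_unpack(buf[:usable]))
        for _variable, _iteration, value in records:
            self.y_min = min(self.y_min, value)
            self.y_max = max(self.y_max, value)
        return records
```

With this shape, a full repaint would only be needed when `y_min` or `y_max` changed during the pass; otherwise only the new points have to be drawn onto the existing bitmap.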

I'm looking for ideas on how to handle this much data more effectively and quickly if possible.

Comments (4)

恋你朝朝暮暮 2024-07-15 21:35:21

In SQL terms, you should group and aggregate the results. You can't possibly show all 10,000 data points on the graph without scrolling way off the screen. One approach is to group by a time scale (seconds, minutes, etc.) and query AVG(), MAX(), or MIN() to reduce the data to a manageable number of points.

MySQL example, group by seconds:

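-- `Table` is a placeholder name; it needs backticks because TABLE is a reserved word in MySQL.
SELECT UNIX_TIMESTAMP(time_collected) AS ts_second, AVG(value) AS avg_value
FROM `Table`
GROUP BY ts_second
ORDER BY ts_second;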

Also consider combining the aggregate values and visualizing them in a candlestick chart.
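The same grouping can be done outside a database. As a minimal sketch, assuming the values of one variable are already in memory in iteration order and that buckets are counted in iterations rather than seconds, the hypothetical function `candles` below produces the open/high/low/close figures a candlestick chart needs, plus the mean:

```python
def candles(values, bucket_size):
    """Summarize consecutive buckets of `values` as candlestick-style dicts."""
    if bucket_size < 1:
        raise ValueError("bucket_size must be at least 1")
    out = []
    for start in range(0, len(values), bucket_size):
        chunk = values[start:start + bucket_size]  # the last bucket may be shorter
        out.append({
            "start": start,           # index of the first sample in the bucket
            "open": chunk[0],
            "high": max(chunk),
            "low": min(chunk),
            "close": chunk[-1],
            "mean": sum(chunk) / len(chunk),
        })
    return out


if __name__ == "__main__":
    for candle in candles([1.0, 3.0, 2.0, 5.0, 4.0], bucket_size=2):
        print(candle)
```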

烟火散人牵绊 2024-07-15 21:35:21

You should ask yourself how valuable it is to display data for every iteration, and what about this data the user really cares about. I think the main thing you need to do here is just reduce the amount of data you display to the user.

For example, if the user only cares about the trend, you could easily get away with evaluating these functions only every so many iterations (instead of every iteration). On the graph above, you could probably get just as informative a plot by drawing only the value on the curve every 100 iterations, which would reduce the size of your data set (and the time your drawing algorithm takes) by a factor of 100. Obviously, you could adjust this if you happen to need more detail.
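As a rough illustration of the stride idea, the hypothetical helper `decimate` below picks the step from a target point count instead of hard-coding 100, and always keeps the final sample so the plot ends at the latest iteration. It assumes the points of one variable are held in a list in iteration order.

```python
import math


def decimate(points, max_points):
    """Keep every step-th point (plus the last one), at most max_points in total."""
    if max_points < 2:
        raise ValueError("max_points must be at least 2")
    n = len(points)
    if n <= max_points:
        return list(points)
    step = math.ceil((n - 1) / (max_points - 1))
    kept = list(points[::step])
    if (n - 1) % step != 0:
        kept.append(points[-1])  # the stride skipped the final sample; add it back
    return kept


if __name__ == "__main__":
    reduced = decimate(list(range(10_000)), max_points=101)
    print(len(reduced), reduced[:3], reduced[-1])
```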

To avoid having to recompute data points when you redraw, just keep around the small set of points you've already drawn in memory instead of recomputing or reloading all the data. You can avoid going to disk this way, and you won't be doing nearly as much work getting all those points rendered again.

If you're concerned about things like missing outliers due to sampling error, a simple thing you can do would be to compute the set of sample points based on sliding windows instead of single samples from the original data. You might keep around max, min, mean, median, and possibly compute error bars for the data you display to the user.
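A small sketch of the windowed summary, using consecutive non-overlapping windows as a simplification of the sliding windows described above; `window_summaries` is a hypothetical name. The usage at the bottom shows a spike that plain every-100th sampling would miss but the per-window maximum retains.

```python
import statistics


def window_summaries(values, window):
    """Return min/max/mean/median for each consecutive window of `values`."""
    if window < 1:
        raise ValueError("window must be at least 1")
    summaries = []
    for start in range(0, len(values), window):
        chunk = values[start:start + window]  # the last window may be shorter
        summaries.append({
            "start": start,
            "min": min(chunk),
            "max": max(chunk),
            "mean": statistics.fmean(chunk),
            "median": statistics.median(chunk),
        })
    return summaries


if __name__ == "__main__":
    values = [0.0] * 300
    values[137] = 9.0  # a single outlier between sampling points
    print(9.0 in values[::100])                               # stride sampling
    print([s["max"] for s in window_summaries(values, 100)])  # windowed maxima
```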

If you need to get really aggressive, people have come up with tons of fancy methods for reducing and displaying time series data. For further information, you could check out the Wikipedia article, or look into toolkits like R, which have a lot of these methods built in already.

Finally, this Stack Overflow question seems relevant, too.

梅窗月明清似水 2024-07-15 21:35:21

I see from the graph that you're plotting 10,000 iterations across a few hundred pixels, so just use one in every 100 data points for the graph and ignore the rest. It will look the same to users.
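Rather than fixing the ratio at 1 in 100, the stride can follow from the plot width. The sketch below, with the hypothetical name `one_sample_per_column`, maps each sample index to a pixel column and keeps the first sample that lands in each column; the 400-pixel width in the usage is only an assumed example.

```python
def one_sample_per_column(values, width_px):
    """Return [(index, value), ...] with at most one sample per pixel column."""
    if width_px < 1:
        raise ValueError("width_px must be at least 1")
    n = len(values)
    kept = {}
    for i, v in enumerate(values):
        col = i * width_px // n       # column in 0..width_px-1 for sample i
        kept.setdefault(col, (i, v))  # keep only the first sample in each column
    return [kept[col] for col in sorted(kept)]


if __name__ == "__main__":
    reduced = one_sample_per_column(list(range(10_000)), width_px=400)
    print(len(reduced), reduced[:2], reduced[-1])
```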

悲喜皆因你 2024-07-15 21:35:21

Why don't you produce a bitmap (or a pixmap such as XPM)? Each column (or row) corresponds to an iteration, and the height of each color (or the width, for rows) corresponds to the variable's value. The XPM format is simpler since it is textual (one character per pixel) and cross-platform.
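One possible reading of this suggestion, sketched under several assumptions: each variable has already been reduced to one value per pixel column, there are at most 26 variables (one lowercase letter each), and each variable is drawn as a single colored pixel per column rather than as a filled bar. The function name `render_xpm` and the colors are hypothetical.

```python
import string


def render_xpm(series, height, colors, name="graph"):
    """Return XPM3 text with one column per sample and one pixel per variable."""
    if not series or len(series) > 26 or len(colors) != len(series):
        raise ValueError("need 1..26 series and one color per series")
    width = len(series[0])
    lo = min(min(values) for values in series)
    hi = max(max(values) for values in series)
    span = (hi - lo) or 1.0                     # avoid dividing by zero on flat data
    symbols = string.ascii_lowercase[:len(series)]
    grid = [["."] * width for _ in range(height)]
    for sym, values in zip(symbols, series):
        for x, v in enumerate(values):
            y = round((hi - v) / span * (height - 1))  # row 0 is the top of the image
            grid[y][x] = sym                    # later variables overwrite earlier ones
    lines = ['"%d %d %d 1"' % (width, height, len(series) + 1), '". c #FFFFFF"']
    lines += ['"%s c %s"' % (sym, color) for sym, color in zip(symbols, colors)]
    lines += ['"' + "".join(row) + '"' for row in grid]
    return "/* XPM */\nstatic char *%s[] = {\n%s\n};\n" % (name, ",\n".join(lines))


if __name__ == "__main__":
    print(render_xpm([[0.0, 1.0], [1.0, 0.0]], height=3, colors=["#FF0000", "#0000FF"]))
```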
