Charting Large Amounts of Data

Posted 2024-10-14 17:52:55


Comments (4)

℡寂寞咖啡 2024-10-21 17:52:55

I've needed this before, and it's not easy to do. I ended up writing my own graph component because of this requirement. It turned out better in the end because I put in all the features we needed.

Basically, you need to get the range of data (min and max possible/needed index values), subdivide it into segments (let's say 100 segments), and then determine a value for each segment by some algorithm (average value, median value, etc.). Then you plot based on those summarized 100 elements. This is much faster than trying to plot millions of points :-).

So what I am saying is similar to what you are saying. You mention you do not want to plot every X element because there might be a long stretch of time (index values on the x-axis) between elements. What I am saying is that for each subdivision of data you determine the best value, and take that as the data point. My method is index-value-based, so in your example of no data between the 0-sec and 10-sec index values I would still put data points there; they would just share the same value.
The point is to summarize the data before you plot it. Think through your algorithm for doing that carefully; there are lots of ways to do it, so choose the one that works for your application.
You might get away with not writing your own graph component and just writing the data summarization algorithm.
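
A minimal sketch of that summarization step in Python (the thread names no language, so this is purely illustrative); the segment count, the use of the mean, and the carry-forward for empty segments are all assumptions drawn from the prose above:

    def summarize(points, num_segments=100):
        """Collapse (x, y) points into one averaged point per segment of the x range.

        Empty segments repeat the previous segment's value, matching the
        answer's note about gaps between index values.
        """
        xs = [x for x, _ in points]
        x_min, x_max = min(xs), max(xs)
        width = (x_max - x_min) / num_segments or 1.0  # guard a zero-width range

        # Bucket each point's y value by segment index.
        buckets = [[] for _ in range(num_segments)]
        for x, y in points:
            i = min(int((x - x_min) / width), num_segments - 1)
            buckets[i].append(y)

        summarized, last_y = [], None
        for i, ys in enumerate(buckets):
            y = sum(ys) / len(ys) if ys else last_y  # mean; a median works too
            if y is not None:
                summarized.append((x_min + (i + 0.5) * width, y))
            last_y = y
        return summarized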

难得心□动 2024-10-21 17:52:55

I would approach this in two steps:

  1. Pre-processing the data
  2. Displaying the data

Step 1
The file should be preprocessed into a binary fixed format file.
With an index added to the format, each record would be int,double,double.
See this article for speed comparisons:

http://www.codeproject.com/KB/files/fastbinaryfileinput.aspx

You can then break the file up into time intervals, say one per hour or per day, which gives you an easy way to access different time intervals. You could also just keep one big file and have an index file that tells you where to find specific times:

1,1/27/2011 8:30:00
13456,1/27/2011 9:30:00

By using one of these methods you will be able to quickly find any block of data, either by time (via the index or file name) or by number of entries (thanks to the fixed byte format).
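
As a sketch of that fixed-byte layout (the linked CodeProject article is .NET-oriented; Python's struct module is used here purely for illustration), assuming the int,double,double record is packed little-endian with no padding:

    import struct

    # One record per the proposed layout: int index, double time, double value.
    RECORD = struct.Struct("<idd")  # 4 + 8 + 8 = 20 bytes, packed little-endian

    def read_block(path, first_record, count):
        """Read `count` consecutive records starting at record `first_record`.

        Every record is RECORD.size bytes, so seeking to any record is one
        multiplication -- no scanning, whether the offset came from the
        index file or from an entry count.
        """
        with open(path, "rb") as f:
            f.seek(first_record * RECORD.size)
            data = f.read(count * RECORD.size)
        n = len(data) // RECORD.size  # tolerate a short read at end of file
        return [RECORD.unpack_from(data, i * RECORD.size) for i in range(n)]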

Step 2
Ways to show the data:
1. Just display each record by index.
2. Normalize the data and create aggregate data bars with open, high, low, close values, grouped (see the sketch after this list):
a. By time
b. By record count
c. By difference between values
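
A sketch of option 2a (time-based bars), assuming records shaped like the (index, time, value) tuples above; the one-hour bar width is an arbitrary choice:

    def ohlc_bars(records, bar_seconds=3600):
        """Aggregate (index, time, value) records, sorted by time, into
        open/high/low/close bars of fixed width in seconds."""
        bars = {}  # insertion-ordered in Python 3.7+
        for _idx, t, v in records:
            key = int(t // bar_seconds)       # which bar this record falls in
            if key not in bars:
                bars[key] = [v, v, v, v]      # open, high, low, close
            else:
                b = bars[key]
                b[1] = max(b[1], v)           # high
                b[2] = min(b[2], v)           # low
                b[3] = v                      # close
        return [(k * bar_seconds, *ohlc) for k, ohlc in bars.items()]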

For more possible ways to aggregate non-uniform data sets, you may want to look at
different methods used to aggregate trade data in the financial markets. Of course,
for speed in realtime rendering you would want to create files with this data already
aggregated.

逆光飞翔i 2024-10-21 17:52:55

1- How do we decimate data that is not arriving uniformly in time?

(Note - I'm assuming your loader datafile is in text format.)

On a similar project, I had to read datafiles that were more than 5GB in size. The only way I could parse it out was by reading it into an RDBMS table. We chose MySQL because it makes importing text files into datatables drop-dead simple. (An interesting aside -- I was on a 32-bit Windows machine and couldn't open the text file for viewing, but MySQL read it no problem.) The other perk was MySQL is screaming, screaming fast.

Once the data was in the database, we could easily sort it and distill large amounts of data into single summary queries (using built-in SQL summary functions like SUM). MySQL could even write its query results back out to a text file for use as loader data.

Long story short, consuming that much data mandates the use of a tool that can summarize the data. MySQL fits the bill (pun intended...it's free).
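
The answer gives no schema, so everything below is invented for illustration (the samples(t, value) table, the hourly bucket, the credentials); a sketch using the mysql-connector-python driver:

    import mysql.connector  # pip install mysql-connector-python

    # Placeholder credentials; allow_local_infile is required for LOAD DATA
    # LOCAL on the client side (the server must also permit local_infile).
    conn = mysql.connector.connect(
        host="localhost", user="user", password="secret",
        database="telemetry", allow_local_infile=True,
    )
    cur = conn.cursor()

    # Bulk-load the text loader file into a hypothetical samples(t, value) table.
    cur.execute("""
        LOAD DATA LOCAL INFILE 'loader.txt'
        INTO TABLE samples
        FIELDS TERMINATED BY ','
        (t, value)
    """)
    conn.commit()

    # Summarize with built-in SQL functions: one row per hour-sized bucket of t.
    cur.execute("""
        SELECT FLOOR(t / 3600) AS bucket, AVG(value), MIN(value), MAX(value)
        FROM samples
        GROUP BY bucket
        ORDER BY bucket
    """)
    for bucket, avg_v, min_v, max_v in cur:
        print(bucket, avg_v, min_v, max_v)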

动听の歌 2024-10-21 17:52:55

A relatively easy alternative I've found to do this is to do the following:

  1. Iterate through the data in small point groupings (say 3 to 5 points at a time - the larger the group, the faster the algorithm will work but the less accurate the aggregation will be).
  2. Compute the min & max of the small group.
  3. Remove all points that are not the min or max from that group (i.e. you only keep 2 points from each group and omit the rest).
  4. Keep looping through the data (repeating this process) from start to end, removing points until the aggregated data set has few enough points that it can be charted without choking the PC (a sketch of this loop follows the list).
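
A minimal sketch of that loop in Python, purely illustrative; the group size and the target point count are the two tunable values described above, and both defaults are arbitrary:

    def minmax_decimate(points, group_size=4, target=5000):
        """Repeatedly keep only each small group's min and max y until the
        series is small enough to chart.

        points: list of (x, y) tuples in x order; every kept point is a
        real point from the original data set.
        """
        while len(points) > target:
            kept = []
            for i in range(0, len(points), group_size):
                group = points[i:i + group_size]
                lo = min(group, key=lambda p: p[1])
                hi = max(group, key=lambda p: p[1])
                kept.extend(sorted({lo, hi}, key=lambda p: p[0]))  # keep x order
            if len(kept) == len(points):  # groups too small to reduce further
                break
            points = kept
        return points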

I've used this algorithm in the past to take datasets of ~10 million points down to the order of ~5K points without any obvious visible distortion to the graph.

The idea here is that, while throwing out points, you're preserving the peaks and valleys so the "signal" viewed in the final graph isn't "averaged down" (normally, if averaging, you'll see the peaks and the valleys become less prominent).

The other advantage is that you're always seeing "real" datapoints on the final graph (it's missing a bunch of points, but the points that are there were actually in the original dataset so, if you mouse over something, you can show the actual x & y values because they're real, not averaged).

Lastly, this also helps with the problem of not having consistent x-axis spacing (again, you'll have real points instead of averaging X-Axis positions).

I'm not sure how well this approach would work w/ 100s of millions of datapoints like you have, but it might be worth a try.
