Fastest & most flexible way to chart over 2 million rows of flat-file data?

Posted 2024-10-19 00:41:34

I'm collecting some system data in a flatfile, which has this format:

YYYY-MM-DD-HH24:MI:SS DD1 DD2 DD3 DD4

Where DD1-DD4 are four items of data.
An example of the file is this:

2011-02-01-13:29:53 16 8 7 68
2011-02-01-13:29:58 13 8 6 110
2011-02-01-13:30:03 26 25 1 109
2011-02-01-13:30:08 13 12 1 31
2011-02-01-13:30:14 192 170 22 34
2011-02-01-13:30:19 16 16 0 10
2011-02-01-13:30:24 137 61 76 9
2011-02-01-13:30:29 452 167 286 42
2011-02-01-13:30:34 471 177 295 11
2011-02-01-13:30:39 502 192 309 10

The file is over 2 million rows, with data points every five seconds.

I need to chart this data to be able to derive meaning from it.

What I've tried

At the moment I've tried gnuplot and rrdtool with a variety of unix tools (awk, sed, etc). Both of these work, but seem to require a lot of chopping and recutting the data each time I want to view it in a different way.
My gut feel is that rrdtool is the right way to go, but at the moment I'm struggling to get the data into it fast enough, partly because I have to convert my timestamp into Unix epoch. My understanding is also that if I decide I want a new granularity of aggregation I have to rebuild the rrd (which makes sense for realtime collection, but not retrospective loads like this). These things make me think maybe I'm using the wrong tool.

The collection of the data to flat file is fixed - for example I cannot pipe the collection directly into rrdtool.

My Question

I would like people's opinions on the best way to do charting. I have these requirements:

  1. It should be as fast as possible to create a graph (not just render, but set up to render too)
  2. It should be as flexible as possible - I need to muck around with the graphs to work
    out the best granularity for the data (5 seconds is probably too granular)
  3. It should be able to aggregate (MAX/AVG/etc) where necessary
  4. It should be repeatable as new data files come in
  5. Ideally I want to be able to overlay DD1 vs DD2, or DD1 last week with DD1 this week
  6. Unix or Windows, don't care. Prefer *nix though :-)

Any suggestions?

Comments (3)

月下客 2024-10-26 00:41:34

This is a really good question. I'm glad to see some R folks weighing in. I too think R is the right tool for the job, although it's my main hammer so everything looks a bit like a nail to me.

There are a handful of R concepts needed to tackle this challenge. As I see it, you need the following (references in parens):

  1. Import data into R. (R Import Export Guide; see the import sketch just after this list)
  2. Get the Data into an appropriate time series structure. (XTS Vignette PDF)
  3. A little bit of plotting. (Quick-R intro to graphics)
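
For steps 1 and 2 with the flat file described in the question, the import might look roughly like the sketch below. This is only an assumption about your setup: the filename "sysdata.txt" is a placeholder and the column names are mine.

require( xts )

## read the space-separated flat file; timestamps look like 2011-02-01-13:29:53
raw <- read.table( "sysdata.txt", header = FALSE,
                   col.names = c( "ts", "DD1", "DD2", "DD3", "DD4" ),
                   colClasses = c( "character", rep( "integer", 4 ) ) )

## parse the timestamps and build the xts object used in the rest of this answer
stamps <- as.POSIXct( raw$ts, format = "%Y-%m-%d-%H:%M:%S" )
myXts  <- xts( as.matrix( raw[ , -1 ] ), order.by = stamps )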

Here's example code using 2 million points. If you notice, I don't illustrate plotting all 2 million points. It's slow and not that informative. But this should give you some ideas on getting started. Feel free to come back with more specific questions if you do decide to jump down the R rabbit hole!

require( xts )
require( lubridate )

## set up some example data
dataLength <- 2e6
startTime <- ymd_hms("2011-02-01-13-29-53")
fistFullOfSeconds <- 1:dataLength
date <- startTime + fistFullOfSeconds
DD1 <- rnorm( dataLength )
DD2 <- DD1 + rnorm(dataLength, 0, .1 )
DD3 <- rnorm( dataLength, 10, 2)
DD4 <- rnorm( dataLength )

myXts <- xts(matrix( c( DD1, DD2, DD3, DD4 ), ncol=4 ), date)

## now all the data are in the myXts object so let's do some
## summarizing and visualization

## grabbing just a single day from the data
## converted to data.frame to illustrate default data frame plotting
oneDay <- data.frame( myXts["2011-02-02"] ) 
plot( oneDay )

The relationship between DD1 and DD2 kinda jumps out

boxplot( oneDay )

Boxplot is the piechart of statistical graphics. The plot you love to hate. Might as well link to this while we're here.

## look at the max value of each variable every minute
par(mfrow=c(4,1)) ## partitions the graph window
ep <- endpoints(myXts,'minutes')
plot(period.apply(myXts[,1],INDEX=ep,FUN=max))
plot(period.apply(myXts[,2],INDEX=ep,FUN=max))
plot(period.apply(myXts[,3],INDEX=ep,FUN=max))
plot(period.apply(myXts[,4],INDEX=ep,FUN=max))

Even at one minute resolution I'm not sure this is informative. Should probably subset.
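
For example, one could subset to a single day before aggregating to one-minute maxima; a small sketch building on the objects above:

## one day of data, then one-minute maxima of the first variable
oneDayXts <- myXts["2011-02-02"]
epDay <- endpoints( oneDayXts, 'minutes' )
plot( period.apply( oneDayXts[,1], INDEX=epDay, FUN=max ) )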

空袭的梦i 2024-10-26 00:41:34

Here's some R code for playing around with 8000000 numbers in 4 columns of 2000000 rows:

> d=matrix(runif(8000000),ncol=4)
> dim(d)
[1] 2000000       4
> plot(d[1:1000,1])
> plot(d[1:1000,1],type='l')
> plot(d[1:10000,1],type='l')

now it starts to get a bit slow:

> plot(d[1:100000,1],type='l')

what about correlation of two columns:

> cor(d[,1],d[,2])
[1] 0.001708502

-- instant. Fourier transform?

> f=fft(d[,1])

also instant. Don't try and plot it though.

Let's plot a thinned version of one of the columns:

> plot(d[seq(1,2000000,len=1000),1],type='l')

-- instant.

What's really missing is an interactive plot where you could zoom and pan around the whole data set.
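
One way to get that kind of interactivity (an addition, not part of the original answer) is an htmlwidget such as dygraphs, which renders an xts series as a zoomable, pannable plot in the browser. A rough sketch, assuming the dygraphs and xts packages are installed:

library(dygraphs)
library(xts)

## thin to ~10000 points first so the browser isn't asked to draw 2e6 of them
idx   <- seq(1, nrow(d), by=200)
times <- as.POSIXct("2011-02-01 13:29:53", tz="UTC") + 5*(idx - 1)
thin  <- xts(d[idx,1], order.by=times)

dyRangeSelector(dygraph(thin))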

浅唱々樱花落 2024-10-26 00:41:34

Here is an example along the lines of the data you have, showing how it can be loaded into R, aggregated, etc.

First, some dummy data to write out to a file:

stime <- as.POSIXct("2011-01-01-00:00:00", format = "%Y-%m-%d-%H:%M:%S")
## dummy data
dat <- data.frame(Timestamp = seq(from = stime, by = 5, length = 2000000),
                  DD1 = sample(1:1000, replace = TRUE),
                  DD2 = sample(1:1000, replace = TRUE),
                  DD3 = sample(1:1000, replace = TRUE),
                  DD4 = sample(1:1000, replace = TRUE))
## write it out
write.csv(dat, file = "timestamp_data.txt", row.names = FALSE)

Then we can time reading in the 2 million rows. To speed this up, we tell R the classes of the columns in the file: "POSIXct" is one way in R to store the sort of timestamps you have.

## read it in:
system.time({
             tsdat <- read.csv("timestamp_data.txt", header = TRUE,
                                 colClasses = c("POSIXct",rep("integer", 4)))
            })

which takes about 13 seconds to read in and format as internal Unix times on my modest laptop.

   user  system elapsed 
 13.698   5.827  19.643 
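
Note that the question's real file is space-separated and puts a dash between the date and the time, so it can't use the "POSIXct" colClasses shortcut directly; a sketch of the equivalent read (the filename "sysdata.txt" is a placeholder):

## read the question's format: space-separated, timestamps like 2011-02-01-13:29:53
tsdat <- read.table("sysdata.txt", header = FALSE,
                    col.names = c("Timestamp", "DD1", "DD2", "DD3", "DD4"),
                    colClasses = c("character", rep("integer", 4)))
tsdat <- transform(tsdat,
                   Timestamp = as.POSIXct(Timestamp, format = "%Y-%m-%d-%H:%M:%S"))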

Aggregation can be done in lots of ways; one is using aggregate(). Say we aggregate to the hourly mean:

## Generate some indexes that we'll use the aggregate over
tsdat <- transform(tsdat,
                   hours   = factor(strftime(tsdat$Timestamp, format = "%H")),
                   jday    = factor(strftime(tsdat$Timestamp, format = "%j")))
## compute the mean of the 4 variables for each hour of each day
out <- aggregate(cbind(Timestamp, DD1, DD2, DD3, DD4) ~ hours + jday, 
                 data = tsdat, FUN = mean)
## convert average Timestamp to a POSIX time
out <- transform(out,
                 Timestamp = as.POSIXct(Timestamp, 
                                        origin = ISOdatetime(1970,1,1,0,0,0)))

That (the line creating out) takes ~16 seconds on my laptop, and gives the following output:

> head(out)
  hours jday           Timestamp      DD1      DD2      DD3      DD4
1    00  001 2010-12-31 23:29:57 500.2125 491.4333 510.7181 500.4833
2    01  001 2011-01-01 00:29:57 516.0472 506.1264 519.0931 494.2847
3    02  001 2011-01-01 01:29:57 507.5653 499.4972 498.9653 509.1389
4    03  001 2011-01-01 02:29:57 520.4111 500.8708 514.1514 491.0236
5    04  001 2011-01-01 03:29:57 498.3222 500.9139 513.3194 502.6514
6    05  001 2011-01-01 04:29:57 515.5792 497.1194 510.2431 496.8056

Simple plotting can be achieved using the plot() function:

plot(DD1 ~ Timestamp, data = out, type = "l")

We can overlay more variables via, e.g.:

ylim <- with(out, range(DD1, DD2))
plot(DD1 ~ Timestamp, data = out, type = "l", ylim = ylim)
lines(DD2 ~ Timestamp, data = out, type = "l", col = "red")

or via multiple panels:

layout(1:2)
plot(DD1 ~ Timestamp, data = out, type = "l", col = "blue")
plot(DD2 ~ Timestamp, data = out, type = "l", col = "red")
layout(1)
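
One of the question's requirements was overlaying DD1 from one period against DD1 from another. A sketch using the hourly aggregates built above; the two days compared ("002" and "009") are arbitrary choices:

## compare DD1 for two different days on a common hour-of-day axis
d1 <- subset(out, jday == "002")
d2 <- subset(out, jday == "009")
ylim <- range(d1$DD1, d2$DD1)
plot(as.integer(as.character(d1$hours)), d1$DD1, type = "l", col = "blue",
     xlab = "Hour of day", ylab = "DD1", ylim = ylim)
lines(as.integer(as.character(d2$hours)), d2$DD1, col = "red")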

This has all been done with base R functionality. Others have shown how add-on packages can make working with dates easier.
