Merging data from multiple files and plotting them
I have written an application that analyzes data and writes the results to a CSV file. It contains three columns: id, diff and count.
1. id is the id of the cycle - in theory, the greater the id, the lower diff should be
2. diff is the sum of (Estimator - RealValue)^2 over all observations in the cycle
3. count is the number of observations during the cycle
For 15 different values of a parameter K, I am generating a CSV file named %K%.csv, where %K% is the value used. My total number of files is 15.
What I would like to do is write a simple loop in R that plots the content of my files, so that I can decide which value of K is the best (the one for which, in general, diff is the lowest).
For a single file I am doing something like:
ggplot(data = data) + geom_point(aes(x= id, y=sqrt(diff/count)))
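For reference, the quantity being plotted, sqrt(diff/count), is the per-cycle root-mean-square error. A self-contained version of the command above, using made-up numbers in place of one %K%.csv file:

```r
library(ggplot2)

# Toy stand-in for the contents of one %K%.csv (columns id, diff, count);
# the numbers are invented purely for illustration.
data <- data.frame(id    = 1:5,
                   diff  = c(40, 25, 16, 9, 4),
                   count = rep(4, 5))

# Plot the per-cycle root-mean-square error, sqrt(diff / count), against id.
p <- ggplot(data = data) + geom_point(aes(x = id, y = sqrt(diff / count)))
```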
Does what I am trying to do make sense? Please note that statistics is completely not my domain, nor is R (but you could probably have figured that out already).
Is there a better approach I could choose? And from a theoretical point of view, am I doing what I expect to be doing?
I would be very grateful for any comments, hints, criticism and answers.
Answers (2)
Edited to clean up some typos and address the multiple K value issue.
I'm going to assume that you've placed all your .csv files in a single directory (and that there's nothing else in this directory). I will also assume that each .csv really does have the same structure (same number of columns, in the same order). I would begin by generating a list of the file names:
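The answer's original code block did not survive extraction; a minimal sketch of the file-listing step, where "results_dir" is a placeholder for the directory that holds the fifteen %K%.csv files:

```r
# List every .csv file in the results directory; full.names = TRUE keeps the
# directory prefix so the files can be read from anywhere.
# "results_dir" is a placeholder path -- substitute your own directory.
csv_files <- list.files(path = "results_dir", pattern = "\\.csv$",
                        full.names = TRUE)
```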
Then I would 'loop' over the list of file names using lapply, reading each file into a data frame using read.csv:
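A sketch of that reading step (again with a placeholder "results_dir" directory), producing a list with one data frame per value of K:

```r
# Build the file list, then read each file into a data frame.
# "results_dir" is a placeholder path -- substitute your own directory.
csv_files <- list.files(path = "results_dir", pattern = "\\.csv$",
                        full.names = TRUE)

# lapply returns a list of data frames, one per CSV file.
data_list <- lapply(csv_files, read.csv)
```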
Depending on the structure of your .csv's, you may need to pass some additional arguments to read.csv. Finally, I would combine this list of data frames into a single data frame. Then you should have all your data in a single data frame, myData, that you can pass to ggplot.
As for the statistical aspect of your question, it's a little difficult to offer an opinion without concrete examples of your data. Once you've figured out the programming part, you could ask a separate question that provides some sample data (either here, or on stats.stackexchange.com), and folks will be able to suggest some visualization or analysis techniques that may help.
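Putting the pieces together, a self-contained sketch of the combine step. It writes two toy K files to a temporary directory so the example runs as-is; in practice you would point list.files at your real results directory. Tagging each row with its K value (recovered from the file name) is an assumption on my part, but it keeps the files distinguishable after stacking:

```r
# Create a temporary directory with two toy %K%.csv files (made-up numbers).
dir <- tempfile("results"); dir.create(dir)
write.csv(data.frame(id = 1:3, diff = c(9, 4, 1), count = rep(3, 3)),
          file.path(dir, "2.csv"), row.names = FALSE)
write.csv(data.frame(id = 1:3, diff = c(8, 5, 2), count = rep(3, 3)),
          file.path(dir, "5.csv"), row.names = FALSE)

# Read every file into a list of data frames.
csv_files <- list.files(path = dir, pattern = "\\.csv$", full.names = TRUE)
data_list <- lapply(csv_files, read.csv)

# Tag each row with the K value taken from the file name ("2.csv" -> "2").
for (i in seq_along(data_list)) {
  data_list[[i]]$K <- sub("\\.csv$", "", basename(csv_files[i]))
}

# Stack the per-file data frames into one data frame for ggplot.
myData <- do.call(rbind, data_list)
```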
I am not familiar with the background of your question, but I hope I can understand your request.
Your command:
ggplot(data = data) + geom_point(aes(x= id, y=sqrt(diff/count)))
plots the relationship between the normalized difference and the cycle id as an xy-plot. You mentioned that "in theory the greater id, the lower diff should be", so this plot lets you check that assumption visually. There is also a way to express it as a single number: the Spearman correlation coefficient, which can be computed with cor(x, y, method='spearman').
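A self-contained example of that correlation check, using made-up numbers in which diff shrinks as id grows:

```r
# Toy data for one K file: diff decreases monotonically as id increases.
data <- data.frame(id    = 1:10,
                   diff  = c(100, 80, 65, 50, 40, 30, 22, 15, 10, 6),
                   count = rep(10, 10))

# Spearman rank correlation between cycle id and the normalized error.
# A value near -1 supports "the greater the id, the lower the diff";
# here it is exactly -1 because the toy error decreases monotonically.
rho <- cor(data$id, sqrt(data$diff / data$count), method = "spearman")
```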
You mentioned "plot content of my files in order to let me decide, which value of K is the best (for which in general the diff is the lowest)". So you probably need to load all the files with something like sapply(filenames, read.csv, simplify = FALSE), and after that convert the loaded files into a single data set with four columns: K, id, diff and count. Then you can visualize the dataset in three dimensions with functions such as levelplot from the latticeExtra package (sorry, I don't know how to do this with ggplot2), or you can get a colour-coded 2-d view with ggplot2's geom_tile function, or you can use facets to visualize the data as a grid of small plots.
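A sketch of the last two suggestions (geom_tile and facets), using an invented four-column data set in place of the merged files:

```r
library(ggplot2)

# Invented four-column data set (K, id, diff, count) standing in for the
# merged files; the diff values are made up purely for illustration.
d <- expand.grid(K = 1:3, id = 1:5)
d$count <- 10
d$diff  <- d$count / (d$id * d$K)

# Colour-coded 2-d view: one tile per (id, K) cell, shaded by the
# normalized error sqrt(diff / count).
p_tile <- ggplot(d) +
  geom_tile(aes(x = id, y = factor(K), fill = sqrt(diff / count)))

# Small multiples: one panel per value of K.
p_facet <- ggplot(d) +
  geom_point(aes(x = id, y = sqrt(diff / count))) +
  facet_wrap(~ K)
```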