Merging data from multiple files and plotting them
I have written an application that analyzes data and writes the results to a CSV file. It contains three columns: id, diff and count.
1. id is the id of the cycle - in theory, the greater the id, the lower diff should be
2. diff is the sum of (Estimator - RealValue)^2 over all observations in the cycle
3. count is the number of observations during the cycle
For 15 different values of a parameter K, I am generating a CSV file named %K%.csv, where %K% is the value used. My total number of files is 15.
What I would like to do is write a simple loop in R that plots the content of my files, so that I can decide which value of K is the best (the one for which, in general, diff is the lowest).
For a single file I am doing something like:
ggplot(data = data) + geom_point(aes(x= id, y=sqrt(diff/count)))
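For reference, the quantity being plotted, sqrt(diff/count), is the per-cycle root-mean-square error. A self-contained version of the command above, using made-up numbers in place of one %K%.csv file:

```r
library(ggplot2)

# Toy stand-in for the contents of one %K%.csv (columns id, diff, count);
# the numbers are invented purely for illustration.
data <- data.frame(id    = 1:5,
                   diff  = c(40, 25, 16, 9, 4),
                   count = rep(4, 5))

# Plot the per-cycle root-mean-square error, sqrt(diff / count), against id.
p <- ggplot(data = data) + geom_point(aes(x = id, y = sqrt(diff / count)))
```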
Does what I am trying to do make sense? Please note that statistics is completely not my domain, nor is R (but you could probably have figured that out already).
Is there a better approach I could choose? And from a theoretical point of view, am I doing what I expect to be doing?
I would be very grateful for any comments, hints, criticism and answers.
Answers (2)
Edited to clean up some typos and address the multiple K value issue.
I'm going to assume that you've placed all your .csv files in a single directory (and that there's nothing else in this directory). I will also assume that each .csv really does have the same structure (same number of columns, in the same order). I would begin by generating a list of the file names:
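The answer's original code block did not survive extraction; a minimal sketch of the file-listing step, where "results_dir" is a placeholder for the directory that holds the fifteen %K%.csv files:

```r
# List every .csv file in the results directory; full.names = TRUE keeps the
# directory prefix so the files can be read from anywhere.
# "results_dir" is a placeholder path -- substitute your own directory.
csv_files <- list.files(path = "results_dir", pattern = "\\.csv$",
                        full.names = TRUE)
```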
Then I would 'loop' over the list of file names using lapply, reading each file into a data frame using read.csv:
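A sketch of that reading step (again with a placeholder "results_dir" directory), producing a list with one data frame per value of K:

```r
# Build the file list, then read each file into a data frame.
# "results_dir" is a placeholder path -- substitute your own directory.
csv_files <- list.files(path = "results_dir", pattern = "\\.csv$",
                        full.names = TRUE)

# lapply returns a list of data frames, one per CSV file.
data_list <- lapply(csv_files, read.csv)
```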
Depending on the structure of your .csv's, you may need to pass some additional arguments to read.csv. Finally, I would combine this list of data frames into a single data frame. Then you should have all your data in a single data frame, myData, that you can pass to ggplot.
As for the statistical aspect of your question, it's a little difficult to offer an opinion without concrete examples of your data. Once you've figured out the programming part, you could ask a separate question that provides some sample data (either here, or on stats.stackexchange.com), and folks will be able to suggest some visualization or analysis techniques that may help.
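Putting the pieces together, a self-contained sketch of the combine step. It writes two toy K files to a temporary directory so the example runs as-is; in practice you would point list.files at your real results directory. Tagging each row with its K value (recovered from the file name) is an assumption on my part, but it keeps the files distinguishable after stacking:

```r
# Create a temporary directory with two toy %K%.csv files (made-up numbers).
dir <- tempfile("results"); dir.create(dir)
write.csv(data.frame(id = 1:3, diff = c(9, 4, 1), count = rep(3, 3)),
          file.path(dir, "2.csv"), row.names = FALSE)
write.csv(data.frame(id = 1:3, diff = c(8, 5, 2), count = rep(3, 3)),
          file.path(dir, "5.csv"), row.names = FALSE)

# Read every file into a list of data frames.
csv_files <- list.files(path = dir, pattern = "\\.csv$", full.names = TRUE)
data_list <- lapply(csv_files, read.csv)

# Tag each row with the K value taken from the file name ("2.csv" -> "2").
for (i in seq_along(data_list)) {
  data_list[[i]]$K <- sub("\\.csv$", "", basename(csv_files[i]))
}

# Stack the per-file data frames into one data frame for ggplot.
myData <- do.call(rbind, data_list)
```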
I am not familiar with the background of your question, but I hope I can understand your request.
Your command:
ggplot(data = data) + geom_point(aes(x= id, y=sqrt(diff/count)))
plots the relationship between the normalized difference and the cycle id as an xy-plot. You mentioned that "in theory the greater id, the lower diff should be", so this plot lets you check that assumption visually. There is also a way to express it as a single number: the Spearman correlation coefficient, which can be computed with cor(x, y, method='spearman').
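A self-contained example of that correlation check, using made-up numbers in which diff shrinks as id grows:

```r
# Toy data for one K file: diff decreases monotonically as id increases.
data <- data.frame(id    = 1:10,
                   diff  = c(100, 80, 65, 50, 40, 30, 22, 15, 10, 6),
                   count = rep(10, 10))

# Spearman rank correlation between cycle id and the normalized error.
# A value near -1 supports "the greater the id, the lower the diff";
# here it is exactly -1 because the toy error decreases monotonically.
rho <- cor(data$id, sqrt(data$diff / data$count), method = "spearman")
```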
You mentioned "plot content of my files in order to let me decide, which value of K is the best (for which in general the diff is the lowest)". So you probably need to load all the files with something like sapply(filenames, read.csv, simplify = FALSE), and after that convert the loaded files into a single data set with four columns: K, id, diff and count. Then you can visualize the dataset in three dimensions with functions such as levelplot from the latticeExtra package (sorry, I don't know how to do this with ggplot2), or you can get a colour-coded 2-d view with ggplot2's geom_tile function, or you can use facets to visualize the data as a grid of small plots.
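A sketch of the last two suggestions (geom_tile and facets), using an invented four-column data set in place of the merged files:

```r
library(ggplot2)

# Invented four-column data set (K, id, diff, count) standing in for the
# merged files; the diff values are made up purely for illustration.
d <- expand.grid(K = 1:3, id = 1:5)
d$count <- 10
d$diff  <- d$count / (d$id * d$K)

# Colour-coded 2-d view: one tile per (id, K) cell, shaded by the
# normalized error sqrt(diff / count).
p_tile <- ggplot(d) +
  geom_tile(aes(x = id, y = factor(K), fill = sqrt(diff / count)))

# Small multiples: one panel per value of K.
p_facet <- ggplot(d) +
  geom_point(aes(x = id, y = sqrt(diff / count))) +
  facet_wrap(~ K)
```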