Normalizing histogram bins in gnuplot
I'm trying to plot a histogram whose bins are normalized by the number of elements in the bin.
I'm using the following
binwidth=5
bin(x,width)=width*floor(x/width) + binwidth/2.0
plot 'file' using (bin($2, binwidth)):($4) smooth freq with boxes
to get a basic histogram, but I want the value of each bin to be divided by the size of the bin. How can I go about this in gnuplot, or using external tools if necessary?
In gnuplot 4.4, functions gained a new property: they can evaluate several expressions in succession and then return a value (see gnuplot tricks). This means that you can count the number of points, n, within the gnuplot script itself, without having to know it in advance. This code runs on a file, "out.dat", containing one column: a list of n samples from a normal distribution:
The first plot statement reads through the datafile and increments sum once for each point, plotting a zero.
The second plot statement actually uses the value of sum to normalise the histogram.
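The two passes described above can be sketched as follows. This is a minimal illustration rather than the answer's original listing: the bin width of 0.3 is an arbitrary choice, and the counter is named `n` (with a helper `count`) instead of `sum`, to avoid any clash with gnuplot's `sum [...]` summation syntax in later releases.

```gnuplot
# Assumes out.dat holds one numeric sample per line.
binwidth = 0.3                 # illustrative bin width
set boxwidth binwidth
n = 0                          # running count of data points

# Serial evaluation: increment n, then return 0 as the value to plot.
count(x) = (n = n + 1, 0)
# Map x to the centre of the bin that contains it.
bin(x, width) = width*floor(x/width) + width/2.0

# Pass 1: visit every point once so that n ends up as the sample count.
plot "out.dat" using 1:(count($1))

# Pass 2: each point contributes 1/(binwidth*n), so the bars integrate to 1.
plot "out.dat" using (bin($1, binwidth)):(1.0/(binwidth*n)) smooth freq with boxes
```

Note that `n` must be reset to 0 before the first pass is run again, otherwise the count keeps growing.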
In gnuplot 4.6, you can count the number of points with the stats command, which is faster than plot. You do not actually need a trick such as s(x)=((sum=sum+1),0); instead, read the count directly from the variable STATS_records after running stats 'out.dat' u 1.
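A minimal sketch of this variant, assuming gnuplot 4.6 or later and the same one-column file "out.dat"; the bin width and the `bin` helper are illustrative choices, not part of the answer:

```gnuplot
# Assumes out.dat holds one numeric sample per line.
binwidth = 0.3                 # illustrative bin width
set boxwidth binwidth
bin(x, width) = width*floor(x/width) + width/2.0

# stats scans the column without drawing anything and fills the STATS_* variables.
stats 'out.dat' using 1 nooutput
n = STATS_records              # number of valid records that were read

plot 'out.dat' using (bin($1, binwidth)):(1.0/(binwidth*n)) smooth freq with boxes
```

Because `stats` sets its variables afresh on every call, there is no counter to reset between runs.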
Here is how I would do it, with n=500 random Gaussian variates generated in R with the following command:
I use much the same idea as yours to define a normalized histogram, where y is defined as 1/(binwidth * n), except that I use int instead of floor and I didn't recenter at the bin value. In short, this is a quick adaptation of the smooth.dem demo script, and a similar approach is described in Janert's textbook, Gnuplot in Action (Chapter 13, p. 257, freely available). You can replace my sample data file with random-points, which is available in the demo folder that comes with Gnuplot. Note that we need to specify the number of points, as Gnuplot has no counting facilities for records in a file.
Here is the result, with two bin widths:
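The recipe described in this answer could look roughly like the sketch below. It is an illustration under stated assumptions, not the answer's own script: the file name `rnd.dat`, the two bin widths, and the x range are all hypothetical, and n is supplied by hand as the answer says.

```gnuplot
# Assumes a one-column file 'rnd.dat' (hypothetical name) holding n = 500 samples.
n   = 500                      # number of points, supplied by hand
bw1 = 0.1                      # two illustrative bin widths
bw2 = 0.3

# int() truncates toward zero, so every value in (-width, width) lands in the bin labelled 0.
bin(x, width) = width*int(x/width)

set xrange [-3:3]
set multiplot layout 1,2
set boxwidth bw1
plot 'rnd.dat' using (bin($1, bw1)):(1.0/(bw1*n)) smooth freq with boxes title sprintf("binwidth = %.1f", bw1)
set boxwidth bw2
plot 'rnd.dat' using (bin($1, bw2)):(1.0/(bw2*n)) smooth freq with boxes title sprintf("binwidth = %.1f", bw2)
unset multiplot
```

Swapping `int` for `floor` removes the double-width bin around zero, at the cost of departing from what the answer describes.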
Besides, this really is a rough approach to histograms, and more elaborate solutions are readily available in R. Indeed, the problem is how to define a good bin width, and this issue has already been discussed on stats.stackexchange.com: the Freedman-Diaconis binning rule should not be too difficult to implement, although you'll need to compute the inter-quartile range.
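The Freedman-Diaconis rule sets the bin width to 2·IQR/n^(1/3). A minimal sketch of it inside gnuplot, assuming version 4.6 or later (so that `stats` provides the quartiles) and a one-column file "out.dat"; the `bin` helper is an illustrative choice:

```gnuplot
# Assumes out.dat holds one numeric sample per line.
stats 'out.dat' using 1 nooutput
n   = STATS_records
iqr = STATS_up_quartile - STATS_lo_quartile    # inter-quartile range

# Freedman-Diaconis rule: bin width = 2 * IQR / n^(1/3)
binwidth = 2.0 * iqr / n**(1.0/3.0)

set boxwidth binwidth
bin(x, width) = width*floor(x/width) + width/2.0
plot 'out.dat' using (bin($1, binwidth)):(1.0/(binwidth*n)) smooth freq with boxes
```

The rule breaks down when the inter-quartile range is zero, for example with heavily tied data, so a fallback width would be needed in that case.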
Here is how R would proceed with the same data set, with the default option (Sturges' rule, because in this particular case it won't make a difference) and with equally spaced bins like the ones used above.
The R code that was used is given below:
You can even look at how R does its job by inspecting the values returned when calling hist():
All that to say that you can use R results to process your data with Gnuplot if you like (although I would recommend using R directly :-).
Another way of counting the number of data points in a file is by using a system command. This proves useful if you are plotting multiple files, and you don't know the number of points beforehand. I used:
The countpoints function avoids counting lines that start with '#'. You would then use the already mentioned functions to plot the normalized histogram.
Here's a complete example:
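A minimal sketch of such an example, not necessarily the answer's original listing. It assumes a Unix-like shell with `grep` available; the file names, the bin width, and the exact shell command are illustrative.

```gnuplot
# Count the lines of a file that do not start with '#'.
# grep -c -v prints the number of non-matching lines (blank lines are counted too);
# adding 0 turns the string returned by system() into a number.
countpoints(file) = system(sprintf("grep -c -v '^#' %s", file)) + 0

binwidth = 0.3                 # illustrative bin width
set boxwidth binwidth
bin(x, width) = width*floor(x/width) + width/2.0

file1 = 'out1.dat'             # hypothetical file names
file2 = 'out2.dat'
n1 = countpoints(file1)
n2 = countpoints(file2)

plot file1 using (bin($1, binwidth)):(1.0/(binwidth*n1)) smooth freq with boxes, \
     file2 using (bin($1, binwidth)):(1.0/(binwidth*n2)) smooth freq with boxes
```

Each file is normalized by its own count, so histograms of data sets with different sizes remain comparable on the same axes.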
Simply