Normalizing histogram bins in gnuplot

Posted on 2024-11-03 02:39:54

I'm trying to plot a histogram whose bins are normalized by the number of elements in the bin.

I'm using the following

binwidth=5
bin(x,width)=width*floor(x/width) + binwidth/2.0
plot 'file' using (bin($2, binwidth)):($4) smooth freq with boxes

to get a basic histogram, but I want the value of each bin to be divided by the size of the bin. How can I go about this in gnuplot, or using external tools if necessary?

Comments (5)

总以为 2024-11-10 02:39:54

In gnuplot 4.4, functions gained a new property: they can execute several successive commands and then return a value (see gnuplot tricks). This means that you can compute the number of points, n, inside the gnuplot script without having to know it in advance. The following code runs on a file, "out.dat", containing one column: a list of n samples from a normal distribution:

binwidth = 0.1
set boxwidth binwidth
sum = 0

s(x)          = ((sum=sum+1), 0)
bin(x, width) = width*floor(x/width) + width/2.0

plot "out.dat" u ($1):(s($1))
plot "out.dat" u (bin($1, binwidth)):(1.0/(binwidth*sum)) smooth freq w boxes

The first plot statement reads through the datafile and increments sum once for each point, plotting a zero.

The second plot statement actually uses the value of sum to normalise the histogram.

人间☆小暴躁 2024-11-10 02:39:54

In gnuplot 4.6, you can count the number of points with the stats command, which is faster than plot. You do not actually need a trick such as s(x)=((sum=sum+1),0); instead, read the count directly from the variable STATS_records after running stats 'out.dat' u 1.
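
A minimal sketch of that suggestion, assuming gnuplot 4.6 or later and the same one-column file "out.dat" as in the previous answer. The variable name n is only illustrative; the bin width is the value used above.

```gnuplot
# Count the records with stats instead of a dummy plot, then normalise.
binwidth = 0.1
set boxwidth binwidth
bin(x, width) = width*floor(x/width) + width/2.0

# stats fills in STATS_records (and other STATS_* variables);
# nooutput suppresses the printed summary.
stats "out.dat" using 1 nooutput
n = STATS_records

# Each point contributes 1/(binwidth*n), so the boxes integrate to 1.
plot "out.dat" using (bin($1, binwidth)):(1.0/(binwidth*n)) smooth freq with boxes
```

Because stats reads the file the same way plot does, comment lines are not counted as records.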

深海夜未眠 2024-11-10 02:39:54

Here is how I would do it, with n=500 random Gaussian variates generated from R with the following command:

Rscript -e 'cat(rnorm(500), sep="\\n")' > rnd.dat

I use much the same idea as yours for defining a normalized histogram, where y is defined as 1/(binwidth * n), except that I use int instead of floor and I do not recenter the boxes on the bin midpoints. In short, this is a quick adaptation of the smooth.dem demo script, and a similar approach is described in Janert's textbook, Gnuplot in Action (Chapter 13, p. 257, freely available). You can replace my sample data file with random-points, which is available in the demo folder that comes with Gnuplot. Note that we need to specify the number of points ourselves, as Gnuplot has no counting facilities for the records in a file.

bw1=0.1
bw2=0.3
n=500
bin(x,width)=width*int(x/width)
set xrange [-3:3]
set yrange [0:1]
tstr(n)=sprintf("Binwidth = %1.1f\n", n) 
set multiplot layout 1,2
set boxwidth bw1
plot 'rnd.dat' using (bin($1,bw1)):(1./(bw1*n)) smooth frequency with boxes t tstr(bw1)
set boxwidth bw2
plot 'rnd.dat' using (bin($1,bw2)):(1./(bw2*n)) smooth frequency with boxes t tstr(bw2)

Here is the result, with two bin widths:

[Image: normalized histograms of rnd.dat with bin widths 0.1 and 0.3]

Besides, this really is a rough approach to histograms, and more elaborate solutions are readily available in R. Indeed, the problem is how to define a good bin width, and this issue has already been discussed on stats.stackexchange.com: the Freedman-Diaconis binning rule should not be too difficult to implement, although you'll need to compute the interquartile range.
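
As a rough sketch of how the Freedman-Diaconis rule (bin width = 2 * IQR / n^(1/3)) might look in Gnuplot itself, assuming version 4.6 or later so that the stats command can supply the quartiles and the record count. The names iqr, n and bw are illustrative, and floor with recentering is used here rather than the int variant above.

```gnuplot
# Freedman-Diaconis bin width computed from stats output (assumes gnuplot >= 4.6).
stats 'rnd.dat' using 1 nooutput
n   = STATS_records
iqr = STATS_up_quartile - STATS_lo_quartile   # interquartile range
bw  = 2.0*iqr / n**(1.0/3.0)                  # 2 * IQR / n^(1/3)

bin(x, width) = width*floor(x/width) + width/2.0
set boxwidth bw
plot 'rnd.dat' using (bin($1, bw)):(1./(bw*n)) smooth frequency with boxes \
     title sprintf("Binwidth = %.3f", bw)
```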

Here is how R handles the same data set, with the default option (Sturges' rule; in this particular case the choice makes no difference) and with equally spaced bins like the ones used above.

[Image: R histograms of the same data, Sturges' rule and bin width 0.1]

The R code that was used is given below:

rnd <- scan("rnd.dat")   # read the samples generated above
par(mfrow=c(1,2), las=1)
hist(rnd, main="Sturges", xlab="", ylab="", prob=TRUE)
hist(rnd, breaks=seq(-3.5,3.5,by=.1), main="Binwidth = 0.1", 
     xlab="", ylab="", prob=TRUE)

You can even look at how R does its job, by inspecting the values returned when calling hist():

> str(hist(rnd, plot=FALSE))
List of 7
 $ breaks     : num [1:14] -3.5 -3 -2.5 -2 -1.5 -1 -0.5 0 0.5 1 ...
 $ counts     : int [1:13] 1 1 12 20 49 79 108 87 71 43 ...
 $ intensities: num [1:13] 0.004 0.004 0.048 0.08 0.196 0.316 0.432 0.348 0.284 0.172 ...
 $ density    : num [1:13] 0.004 0.004 0.048 0.08 0.196 0.316 0.432 0.348 0.284 0.172 ...
 $ mids       : num [1:13] -3.25 -2.75 -2.25 -1.75 -1.25 -0.75 -0.25 0.25 0.75 1.25 ...
 $ xname      : chr "rnd"
 $ equidist   : logi TRUE
 - attr(*, "class")= chr "histogram"

All this is to say that you can use R's results to plot your data with Gnuplot if you like (although I would recommend using R directly :-).

别靠近我心 2024-11-10 02:39:54

Another way of counting the number of data points in a file is by using a system command. This proves useful if you are plotting multiple files, and you don't know the number of points beforehand. I used:

countpoints(file) = system( sprintf("grep -v '^#' %s| wc -l", file) )
file1count = countpoints (file1)
file2count = countpoints (file2)
file3count = countpoints (file3)
...

The countpoints function avoids counting lines that start with '#'. You would then use the functions already mentioned to plot the normalized histogram.

Here's a complete example:

n=100
xmin=-50.
xmax=50.
binwidth=(xmax-xmin)/n

bin(x,width)=width*floor(x/width)+width/2.0
countpoints(file) = system( sprintf("grep -v '^#' %s| wc -l", file) )

file1count = countpoints (file1)
file2count = countpoints (file2)
file3count = countpoints (file3)

plot file1 using (bin(($1),binwidth)):(1.0/(binwidth*file1count)) smooth freq with boxes,\
     file2 using (bin(($1),binwidth)):(1.0/(binwidth*file2count)) smooth freq with boxes,\
     file3 using (bin(($1),binwidth)):(1.0/(binwidth*file3count)) smooth freq with boxes
...
差↓一点笑了 2024-11-10 02:39:54

Simply

plot 'file' using (bin($2, binwidth)):($4/$4) smooth freq with boxes
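
Note that $4/$4 evaluates to 1 for every row with a non-zero fourth column, so the command above plots the number of rows in each bin. If what is wanted is the sum of column 4 in each bin divided by the number of rows in that bin (the per-bin mean, which is one reading of the question), a possible sketch uses smooth unique, which replaces points sharing the same x value with the average of their y values. The file name and columns are the ones from the question; the recentering with width/2.0 is an assumption.

```gnuplot
# Per-bin mean of column 4: smooth unique averages y over identical x values.
binwidth = 5
set boxwidth binwidth
bin(x, width) = width*floor(x/width) + width/2.0

plot 'file' using (bin($2, binwidth)):4 smooth unique with boxes
```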