How do I draw guide lines on a CDF generated by gnuplot?

Posted on 2024-12-28 13:19:11

At work I have a set of floating point values that I sort, compute a CDF for, and plot in gnuplot. I'd like to draw a line showing where the 80% and 90% thresholds of the CDF are, i.e. a line coming in from the left at the 0.8 y tic mark, touching the graph, and then dropping down to the corresponding x value. This is to help guide the viewer's eye.

The data is generated automatically and I make multiple plots so I don't want to have to hand craft these lines each time.

It's trivial to draw a horizontal arrow going completely across the plot at the 0.8 and 0.9 y-values, but I don't understand how to determine where the vertical line should be drawn.
There is an existing Q&A about drawing arrows, "Gnuplot: Vertical lines at specific positions", but there the positions are known a priori.

Here is some sample data (my work machine is not internet accessible so sharing is hard)

   X    |    Y
  5.0   |  0.143
  8.0   |  0.288
 16.0   |  0.429
 25.0   |  0.714
 39.0   |  0.857
 47.0   |  1.000

Any ideas?

瘫痪情歌 2025-01-04 13:19:11

Here is my take (using percentile ranks), which only assumes that a univariate series of measurements is available (your column headed X). You may want to tweak it a little to work with your pre-computed cumulative frequencies, but that's not really difficult.

# generate some artificial data
reset
set sample 200
set table 'rnd.dat'
plot invnorm(rand(0))
unset table

# display the CDF
unset key
set yrange [0:1]
perc80=system("cat rnd.dat | sed '1,4d' | awk '{print $2}' | sort -n | \
          awk 'BEGIN{i=0} {s[i]=$1; i++;} END{print s[int(NR*0.8-0.5)]}'")
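# vertical guide: from the x-axis up to the curve at the 80th percentile
set arrow from perc80,0 to perc80,0.8 nohead lt 2 lw 2
# horizontal guide: from the left border (graph 0) at y = 0.8 to the curve
set arrow from graph 0, first 0.8 to perc80,0.8 nohead lt 2 lw 2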
plot 'rnd.dat' using 2:(1./200.) smooth cumulative

This yields the following output:

[Figure: cumulative distribution of the generated data, with guide lines at the 80th percentile]

You can add as many percentile values as you want, of course; you just have to define a new variable, e.g. perc90, as well as ask for two other arrow commands, and replace every occurrence of 0.8 (ah... the joy of magic numbers!) by the desired one (in this case, 0.9).
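To avoid repeating the magic numbers by hand, the arrow commands can be generated for any list of thresholds. The following is a minimal sketch, not part of the original answer: `emit_guides` is a hypothetical helper name, and it assumes a plain text file with one measurement per line and no header. It uses the same index rule as the awk one-liner above, rewritten for awk's 1-based `NR`.

```shell
#!/bin/sh
# emit_guides FILE P1 [P2 ...]
# Hypothetical helper: prints two gnuplot `set arrow` commands per threshold.
# Assumes FILE holds one numeric value per line, with no header.
emit_guides() {
    file=$1
    shift
    for p in "$@"; do
        # Sort the values, then pick the one at rank int(N*p + 0.5),
        # clamped to the first value for very small p.
        x=$(sort -n "$file" | awk -v p="$p" '
            { s[NR] = $1 }
            END { i = int(NR * p + 0.5); if (i < 1) i = 1; print s[i] }')
        # Vertical guide from the x-axis up to the curve.
        printf 'set arrow from %s,0 to %s,%s nohead lt 2 lw 2\n' "$x" "$x" "$p"
        # Horizontal guide from the left border to the curve.
        printf 'set arrow from graph 0, first %s to %s,%s nohead lt 2 lw 2\n' "$p" "$x" "$p"
    done
}
```

For example, `emit_guides values.txt 0.8 0.9 > guides.gp` writes four commands, which gnuplot can read with `load 'guides.gp'` before the `plot` command.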

Some explanations about the above code:

I generated an artificial dataset, which was saved to disk.
The 80th percentile is computed using awk, but before that we need to
1. remove the header generated by table (first four lines); (we could ask awk to start at the 5th line, but let's go with that.)
2. keep only the second column;
3. sort the entries.
The awk command to compute the 80th percentile requires truncation, which is done as suggested here. (In R, I would simply use a function like trunc(rank(x))/length(x) to get the percentile ranks.)
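The three preparation steps and the index arithmetic can be sketched as one small shell function. This is an illustration under stated assumptions, not the answer's exact pipeline: `table_percentile` is a hypothetical name, and instead of deleting a fixed four lines it skips blank lines and lines starting with `#`, on the assumption that the table header consists only of such lines.

```shell
#!/bin/sh
# table_percentile FILE P
# Hypothetical helper: prints the value at percentile P (e.g. 0.8) of the
# second column of a gnuplot `set table` output file.
# Pipeline stages:
#   1+2. skip blank lines and '#' header lines, keep only column 2;
#   3.   sort numerically;
#   4.   pick rank int(N*P + 0.5), clamped to 1. With a 1-based array this
#        selects the same entry as s[int(NR*P - 0.5)] does with a 0-based one.
table_percentile() {
    awk 'NF && $1 !~ /^#/ { print $2 }' "$1" |
        sort -n |
        awk -v p="$2" '
            { s[NR] = $1 }
            END { i = int(NR * p + 0.5); if (i < 1) i = 1; print s[i] }'
}
```

Inside gnuplot, such a function would be called through `system(...)`, just like the one-liner above.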

If you want to give R a shot, you can safely replace that long series of sed/awk commands with a call to R like

Rscript -e 'x=read.table("~/rnd.dat")[,2]; sort(x)[trunc(length(x)*.8)]'

assuming rnd.dat is in your home directory.

Sidenote: if you can live without gnuplot, here are some R commands that produce this kind of graphic (even without using the quantile function):

x <- rnorm(200)
xs <- sort(x)
xf <- (1:length(xs))/length(xs)
plot(xs, xf, xlab="X", ylab="Cumulative frequency")
## quick outline of the 80th percentile rank
perc80 <- xs[trunc(length(x)*.8)]
abline(h=.8, v=perc80) 
## alternative solution
plot(ecdf(x))
segments(par("usr")[1], .8, perc80, .8)
segments(perc80, par("usr")[3], perc80, .8)


爱要勇敢去追 2025-01-04 13:19:11

You can use awk to calculate the line at a given value.

Example

If you have a data file Data.csv like so:

you can plot it with

plot "Data.csv" u 1:2 w l

Now, if you want to draw a line at 90% of the maximal value of the second column (in this case 90), run an awk script. Its purpose is to identify the minimum and maximum x-values and 90% of the maximal y-value. It could look something like this:

awk '
{
if(x_min == "") {x_min = x_max = $1; y_max = $2}; 
if($1 > x_max) {x_max = $1}; 
if($1 < x_min) {x_min = $1}; 
if(y_max < $2) {y_max = $2}} 
END {
print x_min, y_max * 0.9; 
print x_max, y_max * 0.9
}' Data.csv

Basically what it does is the following:

Check whether x_min has been set; if not (i.e. on the first row), initialize x_min and x_max to the first column and y_max to the second column of that row.
Check whether the current first column is larger than the current x_max; if so, set x_max to the value of the current first column.
Do the equivalent for x_min (smaller than) and for y_max (note: we only need the maximum of the second column, not the minimum).
After looping through the data file, print the result like so:
```
x_min y_max * 0.9
x_max y_max * 0.9
```

In order to make this work in gnuplot we append our script from above like so:

plot "Data.csv" u 1:2 w l, \
     "< awk '{if(x_min == \"\") {x_min = x_max = $1; y_max = $2}; if($1 > x_max) {x_max = $1}; if($1 < x_min) {x_min = $1}; if(y_max < $2) {y_max = $2}} END {print x_min, y_max * 0.9; print x_max, y_max * 0.9}' Data.csv" u 1:2 w l

Note the \" in the gnuplot script. The double quotes inside the awk program need to be escaped so that gnuplot does not treat them as the end of the string.

In the end you should have a plot like this:

[Figure: the data curve with a horizontal line at 90% of the maximal y-value]

The green line marks the 90% value of the maximal y-value.
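
The same awk approach can also produce the dropping guide line the question asks for when the cumulative frequencies are already computed. This is a minimal sketch under these assumptions: `guide_points` is a hypothetical helper, the file has two whitespace-separated columns (x, cumulative y) with no header and no `|` separators, the rows are sorted by x, and the y-axis starts at 0. It prints three points: the left end at the threshold, the first x whose cumulative value reaches the threshold, and the foot of that x on the axis.

```shell
#!/bin/sh
# guide_points FILE P
# Hypothetical helper: prints the three corner points of an L-shaped guide
# line for threshold P (e.g. 0.8) on a pre-computed CDF.
# Assumes FILE has columns "x y", sorted by x, with no header line.
guide_points() {
    awk -v p="$2" '
        # Remember the smallest x (first row, since the file is sorted).
        NR == 1 { x_min = $1 }
        # Remember the first x whose cumulative value reaches the threshold.
        !found && $2 + 0 >= p + 0 { x_p = $1; found = 1 }
        END {
            if (found) {
                print x_min, p   # left end of the horizontal segment
                print x_p, p     # corner where the guide meets the curve
                print x_p, 0     # foot of the vertical segment
            }
        }' "$1"
}
```

Plotting the three printed rows `with lines` (from a file, or through gnuplot's `"< command"` input as above) draws the horizontal and vertical parts of the guide in one stroke. With the sample data from the question and a threshold of 0.8, the corner lands at x = 39.0.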
