What does the right argument do when creating a histogram in R?

Posted on 2024-12-23 04:53:21

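I'm trying to figure out what the right argument of the hist function in R does. Unfortunately, the documentation is not clear to someone like me who has no deep knowledge of statistics.

The online documentation states:

right: logical; if TRUE, the histogram cells are right-closed (left open) intervals.

What does a right-closed (or left-open) interval mean?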

难忘№最初的完美 2024-12-30 04:53:21

When creating histograms of non-categorical data (things like pH, temperature, etc.), the values have to be grouped into things called "bins". Each bin covers an interval of values. For example, if I have the data:

11  12  13  14  15  16  17  18  19

I can create 5 bins with right-open, left-closed intervals like this:

1st bin: [10, 12)
2nd bin: [12, 14)
3rd bin: [14, 16)
4th bin: [16, 18)
5th bin: [18, 20)

What this means is that the first bin will "hold" values between 10 and 12, including 10 but not including 12. The interval notation used above is shorthand for this:

1st bin: 10 ≤ x < 12
2nd bin: 12 ≤ x < 14
3rd bin: 14 ≤ x < 16
4th bin: 16 ≤ x < 18
5th bin: 18 ≤ x < 20

So the value 11 goes into the 1st bin, but the value 12 goes into the 2nd bin, and so on. R does this binning for you and then draws the histogram based on how many items are in each bin. For the above data, you'll get a rather uninteresting (or interesting, depending on your expectations) histogram that is flat except at the first bin, which holds 1 data point instead of 2.
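As a quick check of the binning described above, here is a minimal sketch in R. The variable names x, breaks and h_left are only illustrative; the break points 10, 12, ..., 20 are the ones used in this example, passed to hist() explicitly instead of letting R pick them.

```r
# Data and break points from the example above
x <- c(11, 12, 13, 14, 15, 16, 17, 18, 19)
breaks <- seq(10, 20, by = 2)   # 10, 12, 14, 16, 18, 20

# right = FALSE asks for left-closed, right-open bins [a, b);
# plot = FALSE returns the bin counts without drawing anything
h_left <- hist(x, breaks = breaks, right = FALSE, plot = FALSE)
print(h_left$counts)   # 1 2 2 2 2
```

One caveat: hist() also has an include.lowest argument (TRUE by default), so with right = FALSE the last bin additionally includes its upper break, 20 here. That makes no difference for this data because no value equals 20.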

The following examples illustrate what the different combinations of brackets and parentheses mean when using interval notation (assume x is an element of the real number line):

(1, 4) --> 1 < x < 4    left-open, right-open
[3, 7) --> 3 ≤ x < 7    left-closed, right-open
(2, 9] --> 2 < x ≤ 9    left-open, right-closed
[5, 6] --> 5 ≤ x ≤ 6    left-closed, right-closed
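To make the notation concrete, the four cases can be written as a small membership test. The function in_interval() below is a hypothetical helper made up for this illustration, not something that ships with R.

```r
# Hypothetical helper: TRUE if x lies in the interval from lo to hi,
# where each endpoint is included only if the matching flag is TRUE
in_interval <- function(x, lo, hi, left_closed, right_closed) {
  above <- if (left_closed) x >= lo else x > lo
  below <- if (right_closed) x <= hi else x < hi
  above & below
}

in_interval(3, 3, 7, left_closed = TRUE,  right_closed = FALSE)  # [3, 7) contains 3: TRUE
in_interval(7, 3, 7, left_closed = TRUE,  right_closed = FALSE)  # [3, 7) excludes 7: FALSE
in_interval(2, 2, 9, left_closed = FALSE, right_closed = TRUE)   # (2, 9] excludes 2: FALSE
in_interval(9, 2, 9, left_closed = FALSE, right_closed = TRUE)   # (2, 9] contains 9: TRUE
```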

Note that you can't use brackets for infinities, assuming you're not using the extended real number line:

(-∞, ∞)   -->   -∞ < x < ∞ 
(-∞, 20]  -->   -∞ < x ≤ 20 
[20, ∞)   -->   20 ≤ x < ∞
(1000, ∞) --> 1000 < x < ∞
(-∞, ∞]   -->   Invalid
(41, ∞]   -->   Invalid

If I want left-open, right-closed intervals, then the bins would look like this:

1st bin: (10, 12] i.e. 10 < x ≤ 12
2nd bin: (12, 14]      12 < x ≤ 14
3rd bin: (14, 16]      14 < x ≤ 16
4th bin: (16, 18]      16 < x ≤ 18
5th bin: (18, 20]      18 < x ≤ 20

See the difference? In this case, the values 11 and 12 both go into the first bin. The way you bin the data can change the appearance of the histogram. This time the histogram is still almost flat, but now it is the 5th bin that differs from the rest (only 1 data point instead of 2).
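The same data can be binned both ways in R to see the counts shift. This is a sketch under the same assumptions as the example data: the break points 10, 12, ..., 20 are supplied explicitly, and the variable names are only illustrative.

```r
# Same data and break points, binned both ways
x <- c(11, 12, 13, 14, 15, 16, 17, 18, 19)
breaks <- seq(10, 20, by = 2)

# right = TRUE (the default): left-open, right-closed bins (a, b]
counts_right_closed <- hist(x, breaks = breaks, right = TRUE, plot = FALSE)$counts

# right = FALSE: left-closed, right-open bins [a, b)
counts_right_open <- hist(x, breaks = breaks, right = FALSE, plot = FALSE)$counts

print(counts_right_closed)   # 2 2 2 2 1
print(counts_right_open)     # 1 2 2 2 2
```

Only the values that sit exactly on a break point (12, 14, 16, 18) change bins between the two calls.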

Fortunately, in R you don't have to specify the bins yourself, and R lets you choose whether the bins are left-closed, right-open ([a, b)) or left-open, right-closed ((a, b]). That choice is what the "right" parameter of the hist() function controls: right = TRUE gives (a, b] bins and right = FALSE gives [a, b) bins.

吲‖鸣 2024-12-30 04:53:21

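The default is right = TRUE, which gives intervals of the form (a, b]. Let's look at an example to see what this means. Suppose our data contain the value 5, and suppose the histogram uses the break points 3, 4, 5, 6. The question is which interval our value 5 should fall into. With right = TRUE, the intervals actually used are (3, 4], (4, 5], (5, 6]. The notation (4, 5] means that the interval contains every value between 4 and 5; it excludes the value 4 itself but includes the value 5. So our data point 5 falls into this interval.

With right = FALSE, the intervals have the form [a, b), so the same break points 3, 4, 5, 6 give the intervals [3, 4), [4, 5), [5, 6). This time our data point goes into the interval [5, 6), because that interval contains 5 while [4, 5) does not.

Essentially, the "right" parameter tells R what to do when a data point falls exactly on a break point.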
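One way to see which interval the value 5 lands in is base R's cut(), which takes the same kind of right argument and labels each value with its interval. This is only an illustration of the interval rule using the break points 3, 4, 5, 6 from this answer; hist() does not return these labels itself, and the variable names are made up for the sketch.

```r
# Break points from the example: 3, 4, 5, 6
breaks <- c(3, 4, 5, 6)

# right = TRUE: intervals (3,4], (4,5], (5,6]
right_closed <- cut(5, breaks = breaks, right = TRUE)

# right = FALSE: intervals [3,4), [4,5), [5,6)
left_closed <- cut(5, breaks = breaks, right = FALSE)

print(as.character(right_closed))   # "(4,5]"
print(as.character(left_closed))    # "[5,6)"
```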
