Getting frequency values from a histogram in R

Posted on 2024-12-09 17:29:11

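I know how to plot a histogram and other frequency/percentage tables. What I want to know now is how to get those frequency values into a table for later use.

I have a large dataset, and I am plotting a histogram with a set binwidth. I want to extract the frequency value (i.e. the value on the y-axis) corresponding to each bin and save it somewhere.

Can someone help me with this? Thank you!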

记忆里有你的影子 2024-12-16 17:29:11

The hist function returns an object of class "histogram" (invisibly when it also draws the plot), so assign the result to keep the counts:

R> res <- hist(rnorm(100))
R> res
$breaks
[1] -4 -3 -2 -1  0  1  2  3  4

$counts
[1]  1  2 17 27 34 16  2  1

$intensities
[1] 0.01 0.02 0.17 0.27 0.34 0.16 0.02 0.01

$density
[1] 0.01 0.02 0.17 0.27 0.34 0.16 0.02 0.01

$mids
[1] -3.5 -2.5 -1.5 -0.5  0.5  1.5  2.5  3.5

$xname
[1] "rnorm(100)"

$equidist
[1] TRUE

attr(,"class")
[1] "histogram"
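
Since the question asks for the count in each bin at a fixed bin width, here is a minimal sketch of how the returned object might be turned into a table. The data `x`, the `binwidth` value, and the column names are assumptions for illustration; `plot = FALSE` skips the drawing step.

```r
# Example data (assumption; substitute your own numeric vector)
set.seed(42)
x <- rnorm(1000)

# Bin width chosen for illustration
binwidth <- 0.5

# Build breaks on multiples of binwidth that cover the full range of x
breaks <- seq(floor(min(x) / binwidth) * binwidth,
              ceiling(max(x) / binwidth) * binwidth,
              by = binwidth)

# plot = FALSE computes the histogram object without drawing it
res <- hist(x, breaks = breaks, plot = FALSE)

# One row per bin: boundaries, midpoint, and frequency
freq <- data.frame(lower = head(res$breaks, -1),
                   upper = res$breaks[-1],
                   mid   = res$mids,
                   count = res$counts)

head(freq)
```

The table could then be saved with, for example, write.csv(freq, "hist_counts.csv", row.names = FALSE), where the file name is only a placeholder.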


沧桑㈠ 2024-12-16 17:29:11

From ?hist:

Value

an object of class "histogram" which is a list with components:

breaks: the n+1 cell boundaries (= breaks if that was a vector). These are the nominal breaks, not with the boundary fuzz.
counts: n integers; for each cell, the number of x[] inside.
density: values f^(x[i]), as estimated density values. If all(diff(breaks) == 1), they are the relative frequencies counts/n and in general satisfy sum[i; f^(x[i]) (b[i+1]-b[i])] = 1, where b[i] = breaks[i].
intensities: same as density. Deprecated, but retained for compatibility.
mids: the n cell midpoints.
xname: a character string with the actual x argument name.
equidist: logical, indicating if the distances between breaks are all the same.

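breaks and density provide just about all you need; for raw frequencies, use counts instead of density:

x <- rnorm(100)    # example data (assumption; use your own vector)
histrv <- hist(x)
histrv$breaks      # bin boundaries
histrv$density     # density per bin
histrv$counts      # frequency per bin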
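
If only the densities are at hand, the ?hist text quoted above implies that the frequencies can be recovered as density * n * bin width. A small sketch checking that relation, with `x` as made-up example data:

```r
set.seed(1)
x <- rnorm(200)                      # example data (assumption)
histrv <- hist(x, plot = FALSE)      # compute bins without drawing

widths <- diff(histrv$breaks)        # width of each bin
n <- sum(histrv$counts)              # number of observations binned

# Recover frequencies from densities: counts = density * n * width
counts_from_density <- histrv$density * n * widths

# TRUE if the recovered values match the stored counts
all.equal(counts_from_density, as.numeric(histrv$counts))
```

In practice reading `histrv$counts` directly is simpler; the relation mainly matters when the bins have unequal widths and densities need to be interpreted.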


对你而言 2024-12-16 17:29:11

Just in case someone hits this question with ggplot's geom_histogram in mind, note that there is a way to extract the data from a ggplot object.

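The following convenience function outputs a data frame with the lower limit of each bin (xmin), the upper limit of each bin (xmax), the mid-point of each bin (x), and the plotted y value (y). In the example below y is the cumulative count, because the plot maps y to the cumulative sum of the bin counts; the per-bin counts are in the count column of ggplot_build(p)$data[[1]].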

library(ggplot2)

## Convenience function
# Extract the bin data that ggplot computed for the first layer of plot p
get_hist <- function(p) {
    d <- ggplot_build(p)$data[[1]]
    data.frame(x = d$x, xmin = d$xmin, xmax = d$xmax, y = d$y)
}

# make a dataframe for ggplot
set.seed(1)
x <- runif(100, 0, 10)
y <- cumsum(x)
df <- data.frame(x = sort(x), y = y)

# make geom_histogram
# after_stat(count) replaces the older ..count.. notation (ggplot2 >= 3.3.0)
p <- ggplot(data = df, aes(x = x)) +
    geom_histogram(aes(y = cumsum(after_stat(count))), binwidth = 1, boundary = 0,
                color = "black", fill = "white")

Illustration:

# named hist_df so the result does not mask the base hist() function
hist_df <- get_hist(p)
head(hist_df$x)
## [1] 0.5 1.5 2.5 3.5 4.5 5.5
head(hist_df$y)
## [1]  7 13 24 38 52 57
head(hist_df$xmax)
## [1] 1 2 3 4 5 6
head(hist_df$xmin)
## [1] 0 1 2 3 4 5

A related question I answered here (Cumulative histogram with ggplot2).
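
For comparison, the same kind of cumulative counts can be obtained without ggplot2. The sketch below reuses the simulated data from the example above and assumes breaks at 0, 1, ..., 10 to mirror binwidth = 1 and boundary = 0; the name `cum_df` is illustrative.

```r
set.seed(1)
x <- runif(100, 0, 10)

# Breaks at the integers 0..10, i.e. bins of width 1 starting at 0
res <- hist(x, breaks = seq(0, 10, by = 1), plot = FALSE)

# One row per bin, with per-bin and cumulative frequencies
cum_df <- data.frame(xmin      = head(res$breaks, -1),
                     xmax      = res$breaks[-1],
                     x         = res$mids,
                     count     = res$counts,
                     cum_count = cumsum(res$counts))

head(cum_df)
```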
