Getting the probability density of data

Posted on 2024-10-03 07:23:53

I need to analyze some data about internet sessions for a DSL Line. I wanted to have a look at how the session durations are distributed. I figured a simple way to do this would be to begin by making a probability density plot of the duration of all the sessions.

I have loaded the data in R and used the density() function, something like this:

plot(density(data$duration), type = "l", col = "blue", main = "Density Plot of Duration",
     xlab = "duration(h)", ylab = "probability density")

I am new to R and this kind of analysis. This was what I found from going through google. I got a plot but I was left with some questions. Is this the right function to do what I am trying to do or is there something else?

In the plot I found that the Y-axis scale was from 0...1.5. I don't get how it can be 1.5, shouldn't it be from 0...1?

Also, I would like to get a smoother curve. Since the data set is really large, the lines are really jagged. It would be nicer to have them smoothed out when I present this. How would I go about doing that?


网名女生简单气质 2024-10-10 07:23:53

As nico said, you should check out hist, but you can also combine the two: draw the histogram first, then add the density estimate on top with lines() instead of plot().
Example:

duration <- rpois(500, 10) # For duration data I assume Poisson distributed
hist(duration,
   probability = TRUE, # Instead of frequency
   breaks = "FD",      # For more breaks than the default
   col = "darkslategray4", border = "seashell3")
lines(density(duration - 0.5),   # Add the kernel density estimate (-.5 fix for the bins)
   col = "firebrick2", lwd = 3)

This should give you something like:
[Figure: Histogram of duration]

Note that the kernel density estimate uses a Gaussian kernel by default. But the bandwidth is often the most important factor. If you call density directly, it reports the default estimated bandwidth:

> density(duration)

Call:
        density.default(x = duration)

Data: duration (500 obs.);      Bandwidth 'bw' = 0.7752

       x                 y            
 Min.   : 0.6745   Min.   :1.160e-05  
 1st Qu.: 7.0872   1st Qu.:1.038e-03  
 Median :13.5000   Median :1.932e-02  
 Mean   :13.5000   Mean   :3.895e-02  
 3rd Qu.:19.9128   3rd Qu.:7.521e-02  
 Max.   :26.3255   Max.   :1.164e-01

Here it is 0.7752. Check it for your data and play around with it as nico suggested. You might want to look at ?bw.nrd.
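As a rough sketch of what playing around with the bandwidth could look like: the snippet below compares the rule-of-thumb bandwidth from bw.nrd0 (what density() uses when bw is not given) with bw.nrd, and overlays a smoother estimate obtained through the adjust argument. The seed and the factor of 2 are arbitrary choices for illustration, not values from the answer above.

```r
set.seed(42)                      # arbitrary seed, only for reproducibility
duration <- rpois(500, 10)        # same kind of simulated stand-in data as above

# Rule-of-thumb bandwidth selectors from the stats package
bw_default <- bw.nrd0(duration)   # used by density() when bw is not specified
bw_scott   <- bw.nrd(duration)    # variant with factor 1.06 instead of 0.9

d_default <- density(duration)              # default bandwidth
d_smooth  <- density(duration, adjust = 2)  # bandwidth multiplied by 2 (assumed factor)

plot(d_default, main = "Effect of bandwidth", xlab = "duration")
lines(d_smooth, col = "firebrick2", lwd = 2)
legend("topright", legend = c("default bw", "adjust = 2"),
       col = c("black", "firebrick2"), lwd = c(1, 2))
```

A larger adjust gives a smoother but flatter curve, so it is worth trying a few values and checking that real features of the data are not being smoothed away.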


哆兒滾 2024-10-10 07:23:53

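You should play with the bandwidth (bw) parameter to change the smoothness of the curve. Generally R does a good job and automatically gives a nice, smooth curve, but that may not be the case for your specific data set.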

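As for the call you are using: yes, it is correct. type = "l" is not necessary, since it is the default for plotting density objects. The area under the curve (i.e. the integral of your density function from -Inf to +Inf) is equal to 1, so the curve itself can rise above 1 wherever the data are concentrated in an interval narrower than one unit.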
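To make the point about the area concrete, here is a small sketch with hypothetical data: exponentially distributed durations in hours with a mean of 0.25 h, which is only an assumption for illustration. The estimated curve peaks above 1, while a trapezoidal approximation of the area under it stays close to 1.

```r
set.seed(1)                          # arbitrary seed
duration_h <- rexp(1000, rate = 4)   # hypothetical durations in hours, mean 0.25 h

d <- density(duration_h)
peak <- max(d$y)                     # height of the curve; not bounded by 1

# Trapezoidal approximation of the area under the estimated curve
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)

cat("peak height:", peak, "\n")
cat("area under curve:", area, "\n")
```

The y-axis shows probability per unit of x, not probability, so only the area over an interval can be read as a probability.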

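Now, is a density curve the best thing to use in your case? Maybe, maybe not... it really depends on what type of analysis you want to do. Using hist will probably be sufficient, and may even be more informative, since you can select specific bins of duration (see ?hist for more info).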
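For instance, a histogram with explicitly chosen bins might be sketched as follows. The data are simulated and the 15-minute bin width is an assumption made for illustration; pick whatever bins are meaningful for your sessions.

```r
set.seed(7)                          # arbitrary seed
duration_h <- rexp(1000, rate = 4)   # hypothetical durations in hours

# 15-minute (0.25 h) bins covering the whole range of the data
upper  <- ceiling(max(duration_h) * 4) / 4
breaks <- seq(0, upper, by = 0.25)

h <- hist(duration_h,
          breaks = breaks,
          probability = TRUE,        # density scale, so bar areas sum to 1
          col = "darkslategray4", border = "seashell3",
          main = "Session duration", xlab = "duration (h)")
```

With probability = TRUE the bars are on the same scale as a density() curve, so the two can be overlaid with lines() if needed.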


飘过的浮云 2024-10-10 07:23:53

I was going to add this as a comment to the previous answer, but it's too big.
The apparent skew is due to the way the values are binned in a histogram. It is often a mistake to use histograms for discrete data. See below ...

set.seed(1001)
tmpf <- function() {
  duration <- rpois(500, 10) # For duration data I assume Poisson distributed
  hist(duration,
       probability = TRUE, # Instead of frequency
       breaks = "FD",      # For more breaks than the default
       col = "darkslategray4", border = "seashell3",
       main="",ann=FALSE,axes=FALSE,xlim=c(0,25),ylim=c(0,0.15))
  box()
  lines(density(duration),   # Add the kernel density estimate
        col = "firebrick2", lwd = 3)
  par(new=TRUE)
  plot(table(factor(duration,levels=0:25))/length(duration),
       xlim=c(0,25),ylim=c(0,0.15),col=4,ann=FALSE,axes=FALSE)
}

par(mfrow=c(3,3),mar=rep(0,4))
replicate(9,tmpf())
