Getting volatility and peak-to-average ratios of internet traffic data using R

I have network traffic data in the following form, for each hour of a ten-day period, in an R dataset.

   Day   Hour         Volume          Category
    0    00            100            P2P
    0    00            50             email
    0    00            200            gaming
    0    00            200            video
    0    00            150            web
    0    00            120            P2P
    0    00            180            web
    0    00            80             email
    ....
    0    01            150            P2P
    0    01            200            P2P
    0    01             50            Web
    ...
    ...
    10   23            100            web
    10   23            200            email
    10   23            300            gaming
    10   23            300            gaming

As seen, the same Category can also appear more than once within a single hour. I need to calculate the volatility and the peak-hour-to-average-hour ratio of these different application categories.

Volatility: the standard deviation of the hourly volumes divided by the hourly mean.

Peak hour to avg. hour ratio: the ratio of the volume of the busiest hour to the mean hourly volume for that application.
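To make the two definitions concrete, here is a minimal R sketch on an invented vector of 24 hourly totals for one category (the name hourly_volume and the numbers are made up purely for illustration):

    # Invented hourly totals for one category over a 24-hour period
    hourly_volume <- c(100, 120,  90, 150, 200, 180, 160, 140,
                       130, 110, 170, 190, 210, 230, 220, 200,
                       180, 160, 150, 140, 130, 120, 110, 100)

    volatility <- sd(hourly_volume) / mean(hourly_volume)   # spread relative to the mean
    pa_ratio   <- max(hourly_volume) / mean(hourly_volume)  # busiest hour vs. average hour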

So how do I aggregate and calculate these two statistics for each category? I am new to R and don't have much knowledge of how to aggregate and get the averages as mentioned.

So the final result would look something like this, where the volume for each category is first aggregated onto a single 24-hour period by summing, and the two statistics are then calculated:

Category    Volatility      Peak to Avg. Ratio
Web            0.55            1.5
P2P            0.30            2.1
email          0.6             1.7
gaming         0.4             2.9

Edit: plyr got me as far as this.

library(plyr)  # provides ddply

stats <- ddply(
    .data = my_data
    , .variables = .(Hour, Category)
    , .fun = function(x) {
        data.frame(
            volatility = sd(x$Volume) / mean(x$Volume)   # sd of hourly volumes over their mean
            , pa_ratio = max(x$Volume) / mean(x$Volume)  # peak hour over the average hour
        )
    }
)

But this is not what I was hoping for. I want the statistics per Category, where all the days are first collapsed into a single 24-hour profile by summing the volumes within each hour, and the volatility and PA ratio are then calculated. Any suggestions for improvement?

℡寂寞咖啡 2024-10-25 07:30:29

You'd need to do it in two stages (using the plyr package). First, as you pointed out, there can be multiple Day-Hour combos for the same category, so we aggregate, for each category, its totals within each Hour, regardless of the day:

df1 <- ddply( df, .(Hour, Category), summarise, Volume = sum(Volume))

Then you get your stats:

> ddply(df1, .(Category), summarise,
+            Volatility = sd(Volume)/mean(Volume),
+            PeakToAvg = max(Volume)/mean(Volume) )

  Category Volatility PeakToAvg
1      P2P  0.3225399  1.228070
2      Web         NA  1.000000
3    email  0.2999847  1.212121
4   gaming  0.7071068  1.500000
5    video         NA  1.000000
6      web  0.7564398  1.534884
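Note that the NA volatilities for Web and video are expected given the sample shown: each of those categories appears in only one hour, and sd() of a single value returns NA in R.

For readers on more recent tooling, the same two-stage computation can be sketched with the dplyr package (this assumes dplyr >= 1.0 and the same data frame df; it is an equivalent alternative, not the answerer's original code):

    library(dplyr)

    df %>%
        group_by(Hour, Category) %>%
        summarise(Volume = sum(Volume), .groups = "drop") %>%  # stage 1: hourly totals per category
        group_by(Category) %>%
        summarise(
            Volatility = sd(Volume) / mean(Volume),            # stage 2: volatility
            PeakToAvg  = max(Volume) / mean(Volume)            #          and peak-to-average ratio
        )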