Winsorizing a data frame
I want to perform winsorization in a dataframe like this:
event_date beta_before beta_after
2000-05-05 1.2911707054 1.3215648954
1999-03-30 0.5089734305 0.4269575657
2000-05-05 0.5414700258 0.5326762272
2000-02-09 1.5491034852 1.2839988507
1999-03-30 1.9380674599 1.6169735009
1999-03-30 1.3109909155 1.4468207148
2000-05-05 1.2576420753 1.3659492507
1999-03-30 1.4393018341 0.7417777965
2000-05-05 0.2624037804 0.3860641307
2000-05-05 0.5532216441 0.2618245169
2000-02-08 2.6642931822 2.3815576738
2000-02-09 2.3007578964 2.2626960407
2001-08-14 3.2681270302 2.1611010935
2000-02-08 2.2509121123 2.9481325199
2000-09-20 0.6624503316 0.947935581
2006-09-26 0.6431111805 0.8745333151
By winsorization I mean finding the max and min of, for example, beta_before. Each of those values should be replaced by the second highest or second lowest value in the same column, without losing the rest of the details in the observation. For example, in this case the max value in beta_before is 3.2681270302 and should be replaced by 3.2681270302. The same process will be followed for the min, and then for the beta_after variable. Therefore, only 2 values per column will be changed, the highest and the lowest; the rest will remain the same.
Any advice? I tried different approaches in plyr, but I ended up replacing the whole observation, which I don’t want to do. I would like to create 2 new variables, for example beta_before_winsorized and beta_after_winsorized.
Comments (5)
I thought winsorizing usually finds the value x% (typically 10%, 15%, or 20%) from the bottom of the ordered list, and replaces all the values below it with that value. Same with the top. Here you're just choosing the top and bottom value, but winsorizing usually involves specifying a percentage of values at the top and bottom to replace.
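The percentile-based procedure described here can be sketched in base R as follows. This is a minimal sketch, not a library function: winsorize_pct and its argument p are hypothetical names, and it assumes that the default interpolation of quantile() is an acceptable way to pick the cut-offs.

```r
# Hypothetical helper: clamp everything below the p-th quantile up to it,
# and everything above the (1 - p)-th quantile down to it.
winsorize_pct <- function(x, p = 0.10) {
  stopifnot(p >= 0, p < 0.5)
  # Lower and upper cut-offs; names = FALSE returns a plain numeric vector.
  q <- quantile(x, probs = c(p, 1 - p), na.rm = TRUE, names = FALSE)
  # pmax raises low values to q[1], pmin lowers high values to q[2].
  pmin(pmax(x, q[1]), q[2])
}

winsorize_pct(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11), p = 0.10)
```

Unlike the procedure in the question, this can change more than one value per tail, and the replacement value need not be one of the observed data points.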
Here is a function that does the winsorization you describe:
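A minimal sketch of such a function, under the assumption that each column has exactly one minimum and one maximum. The name winsorize_ends is hypothetical; which.min() and which.max() are base R and return the position of the first extreme value only.

```r
# Hypothetical helper: replace the single smallest value with the second
# smallest, and the single largest value with the second largest.
winsorize_ends <- function(x) {
  sorted <- sort(x)                      # ascending order, NAs dropped
  n <- length(sorted)
  stopifnot(n >= 3)                      # need distinct interior positions
  x[which.min(x)] <- sorted[2]           # min -> second lowest
  x[which.max(x)] <- sorted[n - 1]       # max -> second highest
  x
}

winsorize_ends(c(5, 1, 3, 9, 7))
```

All other elements keep their original values and positions, so the result can be stored as a new column next to the original one.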
If your data are in a data frame dat, then we can winsorize the data using your procedure via:
which results in:
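A sketch of that step, assuming the question's table has been read into dat. The helper winsorize_ends is a hypothetical name for the function described in this answer; it is defined inside the snippet so that the code runs on its own.

```r
# Hypothetical helper: min -> second lowest, max -> second highest.
winsorize_ends <- function(x) {
  sorted <- sort(x)
  x[which.min(x)] <- sorted[2]
  x[which.max(x)] <- sorted[length(sorted) - 1]
  x
}

# The data from the question.
dat <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
event_date beta_before beta_after
2000-05-05 1.2911707054 1.3215648954
1999-03-30 0.5089734305 0.4269575657
2000-05-05 0.5414700258 0.5326762272
2000-02-09 1.5491034852 1.2839988507
1999-03-30 1.9380674599 1.6169735009
1999-03-30 1.3109909155 1.4468207148
2000-05-05 1.2576420753 1.3659492507
1999-03-30 1.4393018341 0.7417777965
2000-05-05 0.2624037804 0.3860641307
2000-05-05 0.5532216441 0.2618245169
2000-02-08 2.6642931822 2.3815576738
2000-02-09 2.3007578964 2.2626960407
2001-08-14 3.2681270302 2.1611010935
2000-02-08 2.2509121123 2.9481325199
2000-09-20 0.6624503316 0.947935581
2006-09-26 0.6431111805 0.8745333151
")

# Add the two new columns; the original columns are left untouched.
dat$beta_before_winsorized <- winsorize_ends(dat$beta_before)
dat$beta_after_winsorized  <- winsorize_ends(dat$beta_after)

# Show only the rows where beta_before was changed.
dat[dat$beta_before != dat$beta_before_winsorized,
    c("event_date", "beta_before", "beta_before_winsorized")]
```

In beta_before, row 9 (0.2624037804) becomes 0.5089734305 and row 13 (3.2681270302) becomes 2.6642931822. In beta_after, row 10 (0.2618245169) becomes 0.3860641307 and row 14 (2.9481325199) becomes 2.3815576738.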
I'm not sure where you got the value you suggest should replace the max in beta_before, as the second highest is 2.6642932 in the snippet of data provided, and that is what my function has used to replace the maximum value.
Note the function will only work if there is exactly one minimum and one maximum value in each column, owing to the way which.min() and which.max() are documented to work: each returns only the position of the first extreme value. If you have multiple entries taking the same max or min value then we would need something different: should do it (the latter is not tested).
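For columns with tied extremes, one possible sketch replaces every entry equal to the min or max by comparing against the sorted unique values. The name winsorize_ends_ties is hypothetical, and leaving a vector with fewer than three distinct values unchanged is an assumption of this sketch rather than something stated in the answer.

```r
# Hypothetical helper: replace ALL entries equal to the minimum with the
# second-lowest distinct value, and ALL entries equal to the maximum with
# the second-highest distinct value.
winsorize_ends_ties <- function(x) {
  vals <- sort(unique(x))                # distinct values, ascending, no NAs
  n <- length(vals)
  if (n < 3) return(x)                   # assumption: nothing sensible to do
  # which() drops NA comparisons, so missing values are left as they are.
  x[which(x == vals[1])] <- vals[2]
  x[which(x == vals[n])] <- vals[n - 1]
  x
}

winsorize_ends_ties(c(1, 1, 2, 5, 9, 9))
```

With no ties this gives the same result as replacing only the positions found by which.min() and which.max().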
Strictly speaking, "winsorization" is the act of replacing the most extreme data points with an acceptable percentile (as mentioned in some of the other answers). One fairly standard R function to do this is winsor from the psych package. Try:
I chose trim = to be 0.0625 (the 6.25th percentile and 93.75th percentile) because you only have 16 data points and you want to "rein in" the top and bottom ones: 1/16 = 0.0625.
Note that this might make the extreme data equal to a percentile number which may or may not exist in your data set: the theoretical n-th percentile of the data.
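A sketch of how that call could look for the beta_before column, assuming the psych package is installed. The vector is typed in by hand from the question so the snippet runs on its own; the variable names are illustrative only.

```r
# Requires the psych package: install.packages("psych")
library(psych)

# beta_before values copied from the question.
beta_before <- c(1.2911707054, 0.5089734305, 0.5414700258, 1.5491034852,
                 1.9380674599, 1.3109909155, 1.2576420753, 1.4393018341,
                 0.2624037804, 0.5532216441, 2.6642931822, 2.3007578964,
                 3.2681270302, 2.2509121123, 0.6624503316, 0.6431111805)

# trim = 1/16: values below the 6.25th percentile are raised to it and
# values above the 93.75th percentile are lowered to it.
beta_before_winsorized <- winsor(beta_before, trim = 0.0625)

# Compare the ranges before and after.
range(beta_before)
range(beta_before_winsorized)
```

Because the cut-offs are interpolated percentiles, the new minimum and maximum fall between the two most extreme observations at each end instead of being equal to the second lowest and second highest values.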
The statar package works very well for this. Copying the relevant snippet from the readme file: https://github.com/matthieugomez/statar
Follow-up to my previous point about actually replacing the to-be-trimmed values with the value at the trim position:
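One way to read "value at the trim position" is as an order statistic: with a trim fraction, the k lowest values are replaced by the (k + 1)-th lowest and the k highest by the (k + 1)-th highest. The sketch below follows that reading; the name winsorize_at_position, the trim argument, and the rule k = floor(trim * n) are assumptions made for illustration.

```r
# Hypothetical helper: replace the k most extreme values in each tail with
# the observed value sitting at the trim position.
winsorize_at_position <- function(x, trim = 0.1) {
  stopifnot(!anyNA(x), trim >= 0, trim < 0.5)
  n <- length(x)
  k <- floor(trim * n)     # number of values to replace in each tail
  s <- sort(x)
  lo <- s[k + 1]           # value at the lower trim position
  hi <- s[n - k]           # value at the upper trim position
  pmin(pmax(x, lo), hi)    # clamp into [lo, hi], order of x preserved
}

winsorize_at_position(c(10, 1, 5, 7, 3, 100, 8, 6, 4, 2), trim = 0.15)
```

With 16 observations and trim = 1/16 this gives k = 1, which reproduces the procedure from the question: the minimum becomes the second lowest value and the maximum becomes the second highest. Every replacement value is an observed data point, so nothing is interpolated.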