Winsorizing a data frame

Posted on 2024-11-13 09:54:59

I want to perform winsorization in a dataframe like this:

event_date  beta_before     beta_after
2000-05-05  1.2911707054    1.3215648954
1999-03-30  0.5089734305    0.4269575657
2000-05-05  0.5414700258    0.5326762272
2000-02-09  1.5491034852    1.2839988507
1999-03-30  1.9380674599    1.6169735009
1999-03-30  1.3109909155    1.4468207148
2000-05-05  1.2576420753    1.3659492507
1999-03-30  1.4393018341    0.7417777965
2000-05-05  0.2624037804    0.3860641307
2000-05-05  0.5532216441    0.2618245169
2000-02-08  2.6642931822    2.3815576738
2000-02-09  2.3007578964    2.2626960407
2001-08-14  3.2681270302    2.1611010935
2000-02-08  2.2509121123    2.9481325199
2000-09-20  0.6624503316    0.947935581
2006-09-26  0.6431111805    0.8745333151

By winsorization I mean finding the max and min of a column, for example beta_before. Each of those values should be replaced by the second highest or second lowest value in the same column, without losing the rest of the details in the observation. For example, in this case the max value in beta_before is 3.2681270302 and should be replaced by 3.2681270302. The same process is followed for the min, and then for the beta_after variable. Therefore only 2 values per column will change, the highest and the lowest; the rest remain the same.

Any advice? I tried different approaches in plyr, but I ended up replacing the whole observation, which I don't want to do. I would like to create 2 new variables, for example beta_before_winsorized and beta_after_winsorized.

Comments (5)

故人如初 2024-11-20 09:54:59

I thought winsorizing usually finds the value x% (typically 10%, 15%, or 20%) from the bottom of the ordered list, and replaces all the values below it with that value. Same with the top. Here you're just choosing the top and bottom value, but winsorizing usually involves specifying a percentage of values at the top and bottom to replace.
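
A minimal base-R sketch of the percentage-based approach described in this comment. The helper name `winsorize_pct` and its argument `p` (the fraction clamped in each tail) are made up for illustration; `quantile()`, `pmin()` and `pmax()` are base R.

```r
# Hypothetical helper: clamp everything below the p-th quantile and
# above the (1 - p)-th quantile to those quantile values.
winsorize_pct <- function(x, p = 0.1) {
  stopifnot(is.numeric(x), p >= 0, p < 0.5)
  # type = 1 returns observed data values instead of interpolating
  lims <- quantile(x, probs = c(p, 1 - p), na.rm = TRUE,
                   type = 1, names = FALSE)
  # pmin() caps the upper tail, pmax() raises the lower tail
  pmax(pmin(x, lims[2]), lims[1])
}

v <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)
w <- winsorize_pct(v, p = 0.1)
print(w)
```

The element order is untouched, so the result can be assigned straight back into a data frame column.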

挽清梦 2024-11-20 09:54:59

Here is a function that does the winsorization you describe:

winsorize <- function(x) {
    Min <- which.min(x)
    Max <- which.max(x)
    ord <- order(x)
    x[Min] <- x[ord][2]
    x[Max] <- x[ord][length(x)-1]
    x
}

If your data are in a data frame dat, then we can winsorize the data using your procedure via:

dat2 <- dat
dat2[, -1] <- sapply(dat[,-1], winsorize)

which results in:

R> dat2
   event_date beta_before beta_after
1  2000-05-05   1.2911707  1.3215649
2  1999-03-30   0.5089734  0.4269576
3  2000-05-05   0.5414700  0.5326762
4  2000-02-09   1.5491035  1.2839989
5  1999-03-30   1.9380675  1.6169735
6  1999-03-30   1.3109909  1.4468207
7  2000-05-05   1.2576421  1.3659493
8  1999-03-30   1.4393018  0.7417778
9  2000-05-05   0.5089734  0.3860641
10 2000-05-05   0.5532216  0.3860641
11 2000-02-08   2.6642932  2.3815577
12 2000-02-09   2.3007579  2.2626960
13 2001-08-14   2.6642932  2.1611011
14 2000-02-08   2.2509121  2.3815577
15 2000-09-20   0.6624503  0.9479356
16 2006-09-26   0.6431112  0.8745333

I'm not sure where you got the value you suggest should replace the max in beta_before: in the snippet of data provided the second-highest value is 2.6642932, and that is what my function uses to replace the maximum.

Note that this function only works if each column has a single minimum and a single maximum, because which.min() and which.max() return only the index of the first match. If several entries share the same max or min value, we need something different:

winsorize2 <- function(x) {
    Min <- which(x == min(x))
    Max <- which(x == max(x))
    ord <- order(x)
    x[Min] <- x[ord][length(Min)+1]
    x[Max] <- x[ord][length(x)-length(Max)]
    x
}

This should do it (the latter function is untested).
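
Since the question asks for two new columns rather than overwriting the existing ones, here is a hedged sketch of how that could look. The helper `winsorize_ends` is a hypothetical name, not part of the answer's code; it applies the same tie-aware idea through `sort(unique(x))`. The data values are copied from the question.

```r
# Rebuild the sample data from the question.
dat <- data.frame(
  event_date = as.Date(c(
    "2000-05-05", "1999-03-30", "2000-05-05", "2000-02-09",
    "1999-03-30", "1999-03-30", "2000-05-05", "1999-03-30",
    "2000-05-05", "2000-05-05", "2000-02-08", "2000-02-09",
    "2001-08-14", "2000-02-08", "2000-09-20", "2006-09-26")),
  beta_before = c(
    1.2911707054, 0.5089734305, 0.5414700258, 1.5491034852,
    1.9380674599, 1.3109909155, 1.2576420753, 1.4393018341,
    0.2624037804, 0.5532216441, 2.6642931822, 2.3007578964,
    3.2681270302, 2.2509121123, 0.6624503316, 0.6431111805),
  beta_after = c(
    1.3215648954, 0.4269575657, 0.5326762272, 1.2839988507,
    1.6169735009, 1.4468207148, 1.3659492507, 0.7417777965,
    0.3860641307, 0.2618245169, 2.3815576738, 2.2626960407,
    2.1611010935, 2.9481325199, 0.947935581,  0.8745333151)
)

# Hypothetical helper: replace every occurrence of the min (max) by the
# next distinct value above (below) it; ties are handled via unique().
winsorize_ends <- function(x) {
  u <- sort(unique(x))
  if (length(u) < 3) return(x)  # too few distinct values to replace with
  x[x == u[1]] <- u[2]
  x[x == u[length(u)]] <- u[length(u) - 1]
  x
}

# Add new columns; the original columns stay untouched.
dat$beta_before_winsorized <- winsorize_ends(dat$beta_before)
dat$beta_after_winsorized  <- winsorize_ends(dat$beta_after)
print(dat)
```

Only the rows holding the column minimum and maximum differ between each original column and its `_winsorized` counterpart.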

君勿笑 2024-11-20 09:54:59

Strictly speaking, "winsorization" is the act of replacing the most extreme data points with an acceptable percentile (as mentioned in some of the other answers). One fairly standard R function to do this is winsor from the psych package. Try:

dat$beta_before = psych::winsor(dat$beta_before, trim = 0.0625)
dat$beta_after  = psych::winsor(dat$beta_after , trim = 0.0625)

I chose trim = 0.0625 (the 6.25th and 93.75th percentiles) because you have only 16 data points and want to "rein in" the single top and bottom ones: 1/16 = 0.0625.

Note that this might make the extreme data equal to a percentile number which may or may not exist in your data set: the theoretical n-th percentile of the data.
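
To illustrate the last point with base R only: `quantile()` with its default settings interpolates between neighbouring observations, so the cut-off may be a number that never occurs in the data. Whether psych::winsor computes its limits exactly this way is an assumption here, and the vector `x` is made up for the demonstration.

```r
x <- c(1, 2, 3, 4, 100)

# Default (type = 7) interpolates between the two nearest order statistics.
q_interp <- quantile(x, probs = 0.9, names = FALSE)

# type = 1 (inverse empirical CDF) always returns an observed value.
q_observed <- quantile(x, probs = 0.9, names = FALSE, type = 1)

print(q_interp)
print(q_observed)
print(q_interp %in% x)    # is the interpolated cut-off an actual data point?
print(q_observed %in% x)
```

If the replacement values must be actual observations, a non-interpolating quantile type or the rank-based functions in the other answers are the safer choice.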

趁年轻赶紧闹 2024-11-20 09:54:59

The statar package works very well for this. Copying the relevant snippet from the readme file:

library(statar)  # assumes the package is installed
# winsorize (default based on 5 x interquartile range)
v <- c(1:4, 99)
winsorize(v)
winsorize(v, replace = NA)
winsorize(v, probs = c(0.01, 0.99))
winsorize(v, cutpoints = c(1, 50))

https://github.com/matthieugomez/statar
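
For readers without statar installed, the idea behind fixed cut points can be sketched in base R with `pmin()` and `pmax()`. This is only an illustration of clamping to given bounds, not statar's implementation; `cut_lo` and `cut_hi` are hypothetical names.

```r
v <- c(1:4, 99)
cut_lo <- 1    # lower bound, chosen for illustration
cut_hi <- 50   # upper bound, chosen for illustration

# Values above cut_hi are lowered to cut_hi, values below cut_lo raised to cut_lo.
clamped <- pmax(pmin(v, cut_hi), cut_lo)
print(clamped)
```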

吻泪 2024-11-20 09:54:59

Following up on my earlier point about actually replacing the to-be-trimmed values with the value at the trim position:

winsorized.sample<-function (x, trim = 0, na.rm = FALSE, ...) 
{
  if (!is.numeric(x) && !is.complex(x) && !is.logical(x)) {
    warning("argument is not numeric or logical: returning NA")
    return(NA_real_)
  }
  if (na.rm) 
    x <- x[!is.na(x)]
  if (!is.numeric(trim) || length(trim) != 1L) 
    stop("'trim' must be numeric of length one")
  n <- length(x)
  if (trim > 0 && n) {
    if (is.complex(x)) 
      stop("trimmed sample is not defined for complex data")
    if (any(is.na(x))) 
      return(NA_real_)
    if (trim >= 0.5) { 
      warning("trim >= 0.5 is odd...trying it anyway")
    }
    lo <- floor(n * trim) + 1
    hi <- n + 1 - lo
    #this line would work for just trimming 
    #  x <- sort.int(x, partial = unique(c(lo, hi)))[lo:hi]
    #instead, we're going to replace what would be trimmed
    #with value at trim position using the next 7 lines
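    idx <- seq_len(n)
    myframe <- data.frame(idx, x)
    myframe <- myframe[order(myframe$x, myframe$idx), ]   # sort by value
    myframe$x[1:lo] <- myframe$x[lo]                      # value at the lower trim position
    myframe$x[hi:n] <- myframe$x[hi]                      # value at the upper trim position
    myframe <- myframe[order(myframe$idx), ]              # restore the original order
    x <- myframe$x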
  }
  x
}
# test it
mydist <- c(1,20,1,5,2,40,5,2,6,1,5)
mydist2 <- winsorized.sample(mydist, trim=.2)
mydist
mydist2
# descStat() is not a base R function; summary() is used here instead
summary(mydist)
summary(mydist2)