Subsetting in data.table

Posted on 2024-10-27 10:49:03

I am trying to subset a data.table (from the data.table package) in R, not a data.frame. The table is keyed on a 4-digit year, and I would like to subset on several years at once. For example, I want to pull all records from 1999, 2000 and 2001.

I have tried passing each of the following to the binary-search syntax DT[J(year)]:

1999,2000,2001
c(1999,2000,2001)
1999, 2000, 2001

but none of these seems to work. Does anyone know how to subset on multiple years rather than just one?

半寸时光 2024-11-03 10:49:03

What works for a data.frame generally works for a data.table too, and subset() is one such case:

subset(DT, year %in% 1999:2001)

妄断弥空 2024-11-03 10:49:03

The question is not clear and does not provide sufficient data to work with, but it is useful, so if someone can edit it with the data I provide below, they are welcome to. The title of the post could also be more complete: Matthew Dowle often answers the question of subsetting on two vectors, but less often the one of subsetting with an %in%-style condition on a single vector. I looked for an answer for a while, until I found one for character vectors here.

Let's consider this data:

library(data.table)
n <- 100
X <- data.table(a=sample(c(10,20,25,30,40),n,replace=TRUE),b=1:n)

The data.table-style query corresponding to X[X$a %in% c(10,20),] is somewhat surprising:

setkey(X,a)
X[.(c(10,20))]
X[.(10,20)] # works for characters but not for integers
            # instead, treats 10 as the filter
            # and 20 as a new variable

# for comparison :
X[X$a %in% c(10,20),]
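As a quick check of the equivalence described above, here is a minimal sketch comparing the keyed lookup with the vector scan on the same kind of data. The seed and the names keyed and scanned are assumptions added for illustration only.

```r
library(data.table)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
X <- data.table(a = sample(c(10, 20, 25, 30, 40), 100, replace = TRUE), b = 1:100)
setkey(X, a)

# One vector inside .(): each element is looked up in the key column `a`
keyed <- X[.(c(10, 20))]

# Vector scan, for comparison
scanned <- X[a %in% c(10, 20)]

# Both approaches select the same rows (compared through the `b` values)
identical(sort(keyed$b), sort(scanned$b))
```

The point is that several values for one key column go into a single vector, X[.(c(10, 20))], whereas separate arguments to .() are matched against separate columns.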

Now, which is best? If your key is already set, data.table, obviously. Otherwise, it might not be, as the following timings show (on my computer with 1.75 GB of RAM):

n <- 1e7
X <- data.table(a=sample(c(10,20,25,30,40),n,replace=TRUE),b=1:n)
system.time(X[X$a %in% c(10,20),])
# utilisateur     système      écoulé (yes, I'm French) 
#        1.92        0.06        1.99
system.time(setkey(X,a))
# utilisateur     système      écoulé 
#       34.91        0.05       35.23 
system.time(X[J(c(10,20))])
# utilisateur     système      écoulé 
#        0.15        0.08        0.23

But maybe Matthew has better solutions...

[Matthew] You've discovered that sorting type numeric (a.k.a. double) is much slower than integer. For many years we didn't allow double in keys for fear of users falling into this trap and reporting terrible timings like this. We allowed double in keys with some trepidation because fast sorting isn't implemented for double yet. Fast sorting on integer and character is pretty good because those are done using a counting sort. ~~Hopefully we'll get to fast sorting numeric one day!~~ (Now implemented - see below).

Timings on data.table pre-1.9.0

n <- 1e7
X <- data.table(a=sample(c(10,20,25,30,40),n,replace=TRUE),b=1:n)      
system.time(setkey(X,a))
#   user  system elapsed 
# 13.898   0.138  14.216 

X <- data.table(a=sample(as.integer(c(10,20,25,30,40)),n,replace=TRUE),b=1:n)
system.time(setkey(X,a))
#   user  system elapsed 
#  0.381   0.019   0.408

Remember that 2 is type numeric in R by default; 2L is integer. Although data.table accepts numeric, it still much prefers integer.
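To make the numeric versus integer distinction concrete, the following sketch shows how one might convert a year column to integer before keying on it. The table DT and its columns are hypothetical, not taken from the question.

```r
library(data.table)

class(2)    # "numeric" (stored as double)
class(2L)   # "integer"

# Hypothetical table whose year column starts out as double
DT <- data.table(year = c(1999, 2000, 2001, 2002), value = 1:4)
class(DT$year)   # "numeric"

# Convert the column by reference, then key on it
DT[, year := as.integer(year)]
setkey(DT, year)
class(DT$year)   # "integer"

# Look up two years through the key, using integer literals
DT[.(c(1999L, 2001L))]
```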

Fast radix sort for numerics is implemented since v1.9.0.

From v1.9.0 on

n <- 1e7
X <- data.table(a=sample(c(10,20,25,30,40),n,replace=TRUE),b=1:n)      
system.time(setkey(X,a))
#    user  system elapsed 
#   0.832   0.026   0.871

当爱已成负担 2024-11-03 10:49:03

Similar to the above, but more data.table-esque:

DT[year %in% c(1999, 2000, 2001)]
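For anyone who wants to try this, here is a small self-contained sketch. The table below is made up for illustration, since the question does not include data.

```r
library(data.table)

# Hypothetical data: one row per year from 1995 to 2005
DT <- data.table(year = 1995:2005, value = seq_len(11))

# No key is needed for a vector scan with %in%
res <- DT[year %in% c(1999, 2000, 2001)]
res$year

# For a contiguous range, an ordinary comparison gives the same rows
res_range <- DT[year >= 1999 & year <= 2001]
```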

执手闯天涯 2024-11-03 10:49:03

This will work:

sample_DT = data.table(year = rep(1990:2010, length.out = 1000), 
                       random_number = rnorm(1000), key = "year")
year_subset = sample_DT[J(c(1990, 1995, 1997))]

Similarly, you can key an already existing data.table with setkey(existing_DT, year) and then use the J() syntax as shown above.

I think the problem may be that you didn't key the data first.
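A minimal sketch of both routes, assuming a reasonably recent data.table (the on= argument is not available in very old releases). The names DT, wanted, by_key and by_on are hypothetical.

```r
library(data.table)

# Hypothetical unkeyed table, shaped like the sample above
DT <- data.table(year = rep(1990:2010, length.out = 1000),
                 random_number = rnorm(1000))
wanted <- c(1999L, 2000L, 2001L)

# Option 1: set the key on a copy, then look up with J()
keyed <- copy(DT)
setkey(keyed, year)
by_key <- keyed[J(wanted), nomatch = 0L]   # nomatch = 0L drops years not present

# Option 2: leave DT unkeyed and name the join column with on=
by_on <- DT[.(year = wanted), on = "year", nomatch = 0L]

nrow(by_key) == nrow(by_on)
```

If the lookup fails or returns unexpected rows, checking key(DT) is a quick way to see whether the table is actually keyed on year.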
