Reshaping large data based on column name patterns

Posted 2025-01-16 12:07:06


I have a (large, in reality) dataset with street blocks; it has house numbers for the beginning ("from", in the variables below) and end ("to") of the block, for both the right and left side. Here's an example:

library(data.table)

# raw address file
raw <- data.table(id = letters[1:4],
                  rfrom = c(1, 101, 201, 301),
                  rto = c(99, 199, 299, 399),
                  lfrom = c(2, 102, 202, 302),
                  lto = c(100, 200, 300, 400),
                  street = c('birch st',
                             'main st',
                             'birch st',
                             'elm rd'),
                  rlat = c(1, 11, 21, 31),
                  llat = c(2, 12, 22, 32))
# for illustration only, hence the nonsensical coordinates

I want to reshape this long so that I have one observation for the left and right side of each block. I intended to use melt from data.table for this, which does give correct results:

long <- melt(raw,
             id.vars = c('id', 'street'),
             measure.vars = patterns('from', 'to', 'lat'),
             value.name = c('from', 'to', 'lat'))

# this produces the dataset I want:
long
   id   street variable from  to lat
1:  a birch st        1    1  99   1
2:  b  main st        1  101 199  11
3:  c birch st        1  201 299  21
4:  d   elm rd        1  301 399  31
5:  a birch st        2    2 100   2
6:  b  main st        2  102 200  12
7:  c birch st        2  202 300  22
8:  d   elm rd        2  302 400  32

However, I realized that melt is matching columns by position, not by the contents of the column name (i.e. I want it to look at the "r" or "l" prefix in the name). For example, if you list one "left side" variable before its right-side equivalent while keeping all other variables in right-before-left order, it returns incorrect results:

# now switch the order of the latitude variables:
raw <- data.table(id = letters[1:4],
                  rfrom = c(1, 101, 201, 301),
                  rto = c(99, 199, 299, 399),
                  lfrom = c(2, 102, 202, 302),
                  lto = c(100, 200, 300, 400),
                  street = c('birch st',
                             'main st',
                             'birch st',
                             'elm rd'),
                  llat = c(2, 12, 22, 32),
                  rlat = c(1, 11, 21, 31))

# melt then gives us incorrect results:
long <- melt(raw,
             id.vars = c('id', 'street'),
             measure.vars = patterns('from', 'to', 'lat'),
             value.name = c('from', 'to', 'lat'))
# latitudes are associated with the wrong observation now:
long
   id   street variable from  to lat
1:  a birch st        1    1  99   2
2:  b  main st        1  101 199  12
3:  c birch st        1  201 299  22
4:  d   elm rd        1  301 399  32
5:  a birch st        2    2 100   1
6:  b  main st        2  102 200  11
7:  c birch st        2  202 300  21
8:  d   elm rd        2  302 400  31

# a related but less important issue: I'd prefer the variable column to show l or r rather than 1 or 2
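
The mismatch can be reproduced without melt. The sketch below assumes, roughly as patterns() does, that each pattern is matched against the column names with grep and that matches are kept in column order. The names cols, groups, prefixes and aligned are illustrative only and not part of data.table.

```r
# Column names of the second `raw` table, in their actual order
cols <- c("id", "rfrom", "rto", "lfrom", "lto", "street", "llat", "rlat")

# For each pattern, collect the matching column names in column order
groups <- lapply(c(from = "from", to = "to", lat = "lat"),
                 function(p) grep(p, cols, value = TRUE))

groups$from  # "rfrom" "lfrom"
groups$lat   # "llat"  "rlat"  (left comes first here)

# The i-th match of every pattern ends up in the same output row,
# so compare the side prefixes position by position
prefixes <- sapply(groups, substr, 1, 1)
aligned  <- apply(prefixes, 1, function(x) length(unique(x)) == 1)

# Guard that would flag the problem before melting
if (!all(aligned)) message("side prefixes are not aligned across column groups")
```

A check like this could run right before the melt call, so that a reordered input fails loudly instead of silently pairing the wrong sides.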

Obviously reordering columns is an option, but I'd prefer to avoid it. Relying on column order strikes me as fragile as the program changes over time and other people edit it, particularly given how difficult it is to check that the results are correct after melting. (And in an ideal world I'd have a variable in the long dataset with the l/r prefix.)

It seems like this is a known issue, but (and I'm guessing I'm misunderstanding something) my attempts to use the new measure function as a workaround aren't working: I just get an error that R can't find a function called measure.
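
For reference, a minimal sketch of what the measure() route could look like. It assumes the installed data.table is recent enough to export measure(); the "can't find function" error suggests an older version, though that is a guess. The regex and the output column name side are choices made for this example.

```r
library(data.table)

raw <- data.table(id = letters[1:4],
                  rfrom = c(1, 101, 201, 301),
                  rto = c(99, 199, 299, 399),
                  lfrom = c(2, 102, 202, 302),
                  lto = c(100, 200, 300, 400),
                  street = c('birch st', 'main st', 'birch st', 'elm rd'),
                  llat = c(2, 12, 22, 32),   # deliberately listed before rlat
                  rlat = c(1, 11, 21, 31))

# measure() only exists in newer data.table releases; fail early if it is missing
stopifnot("measure" %in% getNamespaceExports("data.table"))

# First capture group -> new `side` column; second capture group names the
# value columns (from, to, lat). Matching is by name, not by position.
long <- melt(raw,
             id.vars = c('id', 'street'),
             measure.vars = measure(side, value.name,
                                    pattern = "^([rl])(from|to|lat)$"))

long[id == "a"]
```

Because the side is parsed out of each column name, the order of llat and rlat in the input should no longer matter, and the side column holds l or r rather than 1 or 2.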

At this point I think my best option is to switch to using pivot_longer from tidyr, but I thought I'd ask if someone could point out what, if anything, I'm doing wrong or let me know if there's a better/more efficient way.
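
A sketch of the pivot_longer alternative mentioned here, assuming tidyr 1.0.0 or later is installed. The data is rebuilt as a plain data frame (raw_df, an illustrative name) so the example does not depend on data.table. No claim is made about its speed on large data.

```r
library(tidyr)

raw_df <- data.frame(id = letters[1:4],
                     rfrom = c(1, 101, 201, 301),
                     rto = c(99, 199, 299, 399),
                     lfrom = c(2, 102, 202, 302),
                     lto = c(100, 200, 300, 400),
                     street = c('birch st', 'main st', 'birch st', 'elm rd'),
                     llat = c(2, 12, 22, 32),   # deliberately listed before rlat
                     rlat = c(1, 11, 21, 31))

# The first capture group becomes the `side` column; ".value" tells tidyr
# that the second capture group names the output value columns.
long_tidy <- pivot_longer(raw_df,
                          cols = -c(id, street),
                          names_to = c("side", ".value"),
                          names_pattern = "^([rl])(from|to|lat)$")

long_tidy
```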

I'd really appreciate any help.


太傻旳人生 2025-01-23 12:07:06


I'm pretty sure you have identified the problem correctly: the side of the street ("left" or "right") is encoded in the column names. This is similar to the flower part ("petal" and "sepal") embedded in the iris data column names. Because of the nature of the problem, melt with patterns() alone can't do the job, since it pairs columns by position rather than by name.

Transforming such data frames and isolating the data in the column names is a pivoting operation as you suggest. I have found the approach developed in the cdata package to be quite transparent. However, the transformation is not based on patterns...instead it uses a carefully designed and unambiguous control table to manage the transformation.

To illustrate, I start with your second raw data table.

library(data.table)

raw <- data.table(id = letters[1:4],
                  rfrom = c(1, 101, 201, 301),
                  rto = c(99, 199, 299, 399),
                  lfrom = c(2, 102, 202, 302),
                  lto = c(100, 200, 300, 400),
                  street = c('birch st',
                             'main st',
                             'birch st',
                             'elm rd'),
                  llat = c(2, 12, 22, 32),
                  rlat = c(1, 11, 21, 31))

Build a control table with a new key column indicating the side of the street and the value columns from, to, and lat.

control_table <- wrapr::build_frame(
    "side" ,  "from",  "to",  "lat" |
    "right", "rfrom", "rto", "rlat" |
    "left" , "lfrom", "lto", "llat" )
setDT(control_table)

If you don't want to use the wrapr package, you can build the control table as a conventional data table.

side <- c("right", "left")
from <- c("rfrom", "lfrom")
to   <- c("rto"  , "lto")
lat  <- c("rlat" , "llat")
control_table <- data.table(side, from, to, lat)

In either case, the control table has the same structure.

control_table[]
#>     side  from  to  lat
#> 1: right rfrom rto rlat
#> 2:  left lfrom lto llat
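
Since the control table is written by hand, a quick sanity check before the transformation may be worthwhile. The sketch below uses base R only. It assumes the convention that every source column starts with the first letter of its side. The names raw_names, referenced and prefix_ok are illustrative.

```r
control_table <- data.frame(side = c("right", "left"),
                            from = c("rfrom", "lfrom"),
                            to   = c("rto", "lto"),
                            lat  = c("rlat", "llat"),
                            stringsAsFactors = FALSE)

# Column names of the wide table (names(raw) in the example)
raw_names <- c("id", "rfrom", "rto", "lfrom", "lto", "street", "llat", "rlat")

# Every column named in the control table must exist, and none may be reused
referenced <- unlist(control_table[, -1], use.names = FALSE)
stopifnot(length(setdiff(referenced, raw_names)) == 0,
          !anyDuplicated(referenced))

# Each row should only reference columns whose prefix matches its side
prefix_ok <- substr(as.matrix(control_table[, -1]), 1, 1) ==
  substr(control_table$side, 1, 1)
stopifnot(all(prefix_ok))
```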

The second step is to operate on raw using rowrecs_to_blocks() from the cdata package, where the id and street columns are identified as values to repeat as many times as necessary to complete the transformation from row-record (wide) form to block-record (long) form.

DT <- cdata::rowrecs_to_blocks(
  wideTable     = raw, 
  controlTable  = control_table,
  columnsToCopy = c("id", "street")
)
setDT(DT)

If I understand your question correctly, then I think this produces the form you want regardless of the column order you start with. Note that the side of the street data is now an explicit "coordinate" that helps identify records.

DT[]
#>    id   street  side from  to lat
#> 1:  a birch st right    1  99   1
#> 2:  a birch st  left    2 100   2
#> 3:  b  main st right  101 199  11
#> 4:  b  main st  left  102 200  12
#> 5:  c birch st right  201 299  21
#> 6:  c birch st  left  202 300  22
#> 7:  d   elm rd right  301 399  31
#> 8:  d   elm rd  left  302 400  32

Of course other packages can produce similar results. I happen to like the consistency of the cdata transformation.
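
As one illustration of the "other packages" remark, base R's reshape() also accepts an explicit name mapping, much like the control table. This is only a sketch: mapping, raw_df and long_base are names chosen for the example, and it says nothing about performance on large data.

```r
raw_df <- data.frame(id = letters[1:4],
                     rfrom = c(1, 101, 201, 301),
                     rto = c(99, 199, 299, 399),
                     lfrom = c(2, 102, 202, 302),
                     lto = c(100, 200, 300, 400),
                     street = c('birch st', 'main st', 'birch st', 'elm rd'),
                     llat = c(2, 12, 22, 32),
                     rlat = c(1, 11, 21, 31))

# Explicit mapping: for each output column, the right-side source column
# first and the left-side source column second
mapping <- list(from = c("rfrom", "lfrom"),
                to   = c("rto", "lto"),
                lat  = c("rlat", "llat"))

long_base <- reshape(raw_df,
                     direction = "long",
                     varying   = unname(mapping),
                     v.names   = names(mapping),
                     timevar   = "side",
                     times     = c("right", "left"),  # same order as in mapping
                     idvar     = "id")
rownames(long_base) <- NULL

long_base[long_base$id == "a", ]
```

Because the source columns are listed by name, the column order of raw_df does not affect the result.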
