合并一段时间内的记录

发布于 12-16 23:25 字数 843 浏览 3 评论 0原文

首先我要说这个问题与 R(stat 编程语言)有关,但我对其他环境提出简单的建议。

目标是将数据帧 (df) A 的结果合并到 df B 中的子元素。这是一对多关系,但是,这是变化,一旦记录与键匹配,它们也具有匹配由开始时间和持续时间给出的特定时间帧。

例如,df A 中的一些记录:

    OBS ID StartTime Duration Outcome 
    1   01 10:12:06  00:00:10 Normal
    2   02 10:12:30  00:00:30 Weird
    3   01 10:15:12  00:01:15 Normal
    4   02 10:45:00  00:00:02 Normal

以及来自 df B:

    OBS ID Time       
    1   01 10:12:10  
    2   01 10:12:17  
    3   02 10:12:45  
    4   01 10:13:00  

合并的所需结果将是:

    OBS ID Time     Outcome  
    1   01 10:12:10 Normal 
    3   02 10:12:45 Weird 

所需结果:数据帧 B 的结果从 A 合并。请注意,观察 2 和 4 被删除,因为尽管它们与记录上的 ID 匹配在 A 中,它们不属于给定的任何时间间隔内。

问题

是否可以在 R 中执行此类操作以及如何开始?如果没有,您能推荐一个替代工具吗?

Let me begin by saying this question pertains to R (stat programming language) but I'm open straightforward suggestions for other environments.

The goal is to merge outcomes from dataframe (df) A to sub-elements in df B. This is a one to many relationship but, here's the twist, once the records are matched by keys they also have to match over a specific frame of time given by a start time and duration.

For example, a few records in df A:

    OBS ID StartTime Duration Outcome 
    1   01 10:12:06  00:00:10 Normal
    2   02 10:12:30  00:00:30 Weird
    3   01 10:15:12  00:01:15 Normal
    4   02 10:45:00  00:00:02 Normal

And from df B:

    OBS ID Time       
    1   01 10:12:10  
    2   01 10:12:17  
    3   02 10:12:45  
    4   01 10:13:00  

The desired outcome from the merge would be:

    OBS ID Time     Outcome  
    1   01 10:12:10 Normal 
    3   02 10:12:45 Weird 

Desired result: dataframe B with outcomes merged in from A. Notice observations 2 and 4 were dropped because although they matched IDs on records in A they did not fall within any of the time intervals given.

Question

Is it possible to perform this sort of operation in R and how would you get started? If not, can you suggest an alternative tool?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

浪漫之都2024-12-23 23:25:58

设置数据

首先设置输入数据框。我们创建两个版本的数据框:AB 仅使用字符列表示时间,AtBt > 使用 chron 包 "times" 类来表示时间(它比 "character" 类有优势,可以对它们进行加减运算):

LinesA <- "OBS ID StartTime Duration Outcome 
    1   01 10:12:06  00:00:10 Normal
    2   02 10:12:30  00:00:30 Weird
    3   01 10:15:12  00:01:15 Normal
    4   02 10:45:00  00:00:02 Normal"

LinesB <- "OBS ID Time       
    1   01 10:12:10  
    2   01 10:12:17  
    3   02 10:12:45  
    4   01 10:13:00"

A <- At <- read.table(textConnection(LinesA), header = TRUE, 
               colClasses = c("numeric", rep("character", 4)))
B <- Bt <- read.table(textConnection(LinesB), header = TRUE, 
               colClasses = c("numeric", rep("character", 2)))

# in At and Bt convert times columns to "times" class

library(chron) 

At$StartTime <- times(At$StartTime)
At$Duration <- times(At$Duration)
Bt$Time <- times(Bt$Time)

sqldf with times class

现在我们可以使用 sqldf 包。我们使用 method="raw" (它不会为输出分配类),因此我们必须将 "time" 类分配给输出 "Time"< /code> 列我们自己:

library(sqldf)

out <- sqldf("select Bt.OBS, ID, Time, Outcome from At join Bt using(ID)
   where Time between StartTime and StartTime + Duration",
   method = "raw")

out$Time <- times(as.numeric(out$Time))

结果是:

> out
      OBS ID     Time Outcome
1   1 01 10:12:10  Normal
2   3 02 10:12:45   Weird

使用 sqldf 的开发版本,无需使用 method="raw" 即可完成此操作,并且 "Time" 列将自动设置为 "times" 类sqldf 类分配启发式:

library(sqldf)
source("http://sqldf.googlecode.com/svn/trunk/R/sqldf.R") # grab devel ver 
sqldf("select Bt.OBS, ID, Time, Outcome from At join Bt using(ID)
    where Time between StartTime and StartTime + Duration")

具有字符类的 sqldf

实际上可以通过在 sqlite 中执行所有时间计算来使用“times”类使用 sqlite 的 strftime 函数的字符串。不幸的是,SQL 语句涉及更多一些:

sqldf("select B.OBS, ID, Time, Outcome from A join B using(ID)
    where strftime('%s', Time) - strftime('%s', StartTime)
       between 0 and strftime('%s', Duration) - strftime('%s', '00:00:00')")

编辑:

一系列编辑修复了语法,添加了额外的方法并修复/改进了 read.table 语句。

编辑:

简化/改进的最终 sqldf 语句。

Set up data

First set up the input data frames. We create two versions of the data frames: A and B just use character columns for the times and At and Bt use the chron package "times" class for the times (which has the advantage over "character" class that one can add and subtract them):

LinesA <- "OBS ID StartTime Duration Outcome 
    1   01 10:12:06  00:00:10 Normal
    2   02 10:12:30  00:00:30 Weird
    3   01 10:15:12  00:01:15 Normal
    4   02 10:45:00  00:00:02 Normal"

LinesB <- "OBS ID Time       
    1   01 10:12:10  
    2   01 10:12:17  
    3   02 10:12:45  
    4   01 10:13:00"

A <- At <- read.table(textConnection(LinesA), header = TRUE, 
               colClasses = c("numeric", rep("character", 4)))
B <- Bt <- read.table(textConnection(LinesB), header = TRUE, 
               colClasses = c("numeric", rep("character", 2)))

# in At and Bt convert times columns to "times" class

library(chron) 

At$StartTime <- times(At$StartTime)
At$Duration <- times(At$Duration)
Bt$Time <- times(Bt$Time)

sqldf with times class

Now we can perform the calculation using the sqldf package. We use method="raw" (which does not assign classes to the output) so we must assign the "times" class to the output "Time" column ourself:

library(sqldf)

out <- sqldf("select Bt.OBS, ID, Time, Outcome from At join Bt using(ID)
   where Time between StartTime and StartTime + Duration",
   method = "raw")

out$Time <- times(as.numeric(out$Time))

The result is:

> out
      OBS ID     Time Outcome
1   1 01 10:12:10  Normal
2   3 02 10:12:45   Weird

With the development version of sqldf this can be done without using method="raw" and the "Time" column will automatically be set to "times" class by the sqldf class assignment heuristic:

library(sqldf)
source("http://sqldf.googlecode.com/svn/trunk/R/sqldf.R") # grab devel ver 
sqldf("select Bt.OBS, ID, Time, Outcome from At join Bt using(ID)
    where Time between StartTime and StartTime + Duration")

sqldf with character class

Its actually possible to not use the "times" class by performing all time calculations in sqlite out of character strings employing sqlite's strftime function. The SQL statement is unfortunately a bit more involved:

sqldf("select B.OBS, ID, Time, Outcome from A join B using(ID)
    where strftime('%s', Time) - strftime('%s', StartTime)
       between 0 and strftime('%s', Duration) - strftime('%s', '00:00:00')")

EDIT:

A series of edits which fixed grammar, added additional approaches and fixed/improved the read.table statements.

EDIT:

Simplified/improved final sqldf statement.

北音执念2024-12-23 23:25:58

这是一个示例:

# first, merge by ID
z <- merge(A[, -1], B, by = "ID")

# convert string to POSIX time
z <- transform(z,
  s_t = as.numeric(strptime(as.character(z$StartTime), "%H:%M:%S")),
  dur = as.numeric(strptime(as.character(z$Duration), "%H:%M:%S")) - 
    as.numeric(strptime("00:00:00", "%H:%M:%S")),
  tim = as.numeric(strptime(as.character(z$Time), "%H:%M:%S")))

# subset by time range
subset(z, s_t < tim & tim < s_t + dur)

输出:

  ID StartTime Duration Outcome OBS     Time        s_t dur        tim
1  1  10:12:06 00:00:10  Normal   1 10:12:10 1321665126  10 1321665130
2  1  10:12:06 00:00:10  Normal   2 10:12:15 1321665126  10 1321665135
7  2  10:12:30 00:00:30   Weird   3 10:12:45 1321665150  30 1321665165

OBS #2 看起来在范围内。这有意义吗?

here is an example:

# first, merge by ID
z <- merge(A[, -1], B, by = "ID")

# convert string to POSIX time
z <- transform(z,
  s_t = as.numeric(strptime(as.character(z$StartTime), "%H:%M:%S")),
  dur = as.numeric(strptime(as.character(z$Duration), "%H:%M:%S")) - 
    as.numeric(strptime("00:00:00", "%H:%M:%S")),
  tim = as.numeric(strptime(as.character(z$Time), "%H:%M:%S")))

# subset by time range
subset(z, s_t < tim & tim < s_t + dur)

the output:

  ID StartTime Duration Outcome OBS     Time        s_t dur        tim
1  1  10:12:06 00:00:10  Normal   1 10:12:10 1321665126  10 1321665130
2  1  10:12:06 00:00:10  Normal   2 10:12:15 1321665126  10 1321665135
7  2  10:12:30 00:00:30   Weird   3 10:12:45 1321665150  30 1321665165

OBS #2 looks to be in the range. does it make sense?

浅暮の光2024-12-23 23:25:58

使用 merge() 将两个 data.frame 合并在一起。然后 subset() 得到结果 data.frame,条件为 time >= startTime & time <= startTime + Duration 或任何对您有意义的规则。

Merge the two data.frames together with merge(). Then subset() the resulting data.frame with the condition time >= startTime & time <= startTime + Duration or whatever rules make sense to you.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文