Processing hospital admission data with R (part 2)


Thanks all for providing suggestions on the question Processing of hospital admission data using R. I have an additional question on this issue; actually, it should be the task that comes before that question.

Now I have a dataset like this:

Patient_ID Date Ward
P001       1    A
P001       2    A
P001       3    A
P001       4    A
P001       4    B
P001       5    B
P001       6    B
P001       7    B
P001       7    C
P001       8    B
P001       9    B
P001       10   B

I need to convert it into:

Patient_ID Date Ward
P001       1    A
P001       2    A
P001       3    A
P001       4    A;B
P001       5    B
P001       6    B
P001       7    B;C
P001       8    B
P001       9    B
P001       10   B

Currently I convert it using ddply(); the code is attached below:

library(plyr)  # ddply() comes from the plyr package

data <- ddply(data,
              c("Patient_ID", "Date"),
              function(df) {
                # collapse the distinct wards seen on one date into "A;B" form
                data.frame(Ward = paste(unique(df[, "Ward"]), collapse = ";"))
              },
              .progress = "text")

This solves my problem, but it is VERY slow (more than 20 minutes on a P4 3.2 GHz machine) when the dataset has 8818 unique(Patient_ID) and 1861 unique(Date). How can I improve that? Thanks!


Comments (1)

叹梦 2024-10-08 23:34:44


Something that works is this, assuming your data are in an object called pdat:

res <- with(pdat,
            aggregate(Ward, by = list(Date = Date, Patient_ID = Patient_ID),
                      FUN = paste, collapse = ";"))
names(res)[3] <- "Ward"   # aggregate() names the collapsed column "x"
res <- res[, c(2, 1, 3)]  # reorder to Patient_ID, Date, Ward

and gives:

> res
   Patient_ID Date Ward
1        P001    1    A
2        P001    2    A
3        P001    3    A
4        P001    4  A;B
5        P001    5    B
6        P001    6    B
7        P001    7  B;C
8        P001    8    B
9        P001    9    B
10       P001   10    B
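One small difference from the ddply() version: as written, aggregate() pastes every Ward value in a group without the unique() step (which doesn't matter for the example data, where no ward repeats within a date). If duplicates can occur, a minimal variant that also de-duplicates, assuming the same pdat, is:

res <- with(pdat,
            aggregate(Ward, by = list(Date = Date, Patient_ID = Patient_ID),
                      FUN = function(w) paste(unique(w), collapse = ";")))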

It should extend happily to more patients etc., and is quite a bit faster than your ddply() version:

> system.time(replicate(1000,{
+ res <- with(pdat,
+             aggregate(Ward, by = list(Date = Date, Patient_ID = Patient_ID),
+                       FUN = paste, collapse = ";"))
+ names(res)[3] <- "Ward"
+ res <- res[, c(2,1,3)]
+ }))
   user  system elapsed 
  2.113   0.002   2.137

vs

> system.time(replicate(1000,{
+ ddply(pdat,
+       c("Patient_ID", "Date"),
+       function(df)
+       data.frame(Ward=paste(unique(df[,"Ward"]),collapse=";"))
+       )
+ }))
   user  system elapsed 
 12.862   0.006  12.966

However, this doesn't mean that ddply() cannot be sped up; I'm not that familiar with the package.
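One untested possibility along those lines, sketched here rather than benchmarked: plyr ships an immutable data frame wrapper, idata.frame(), intended to reduce the copying that can make ddply() slow on larger inputs.

library(plyr)

# Hedged sketch: the same grouping as above, run over an immutable data frame.
res2 <- ddply(idata.frame(pdat),
              c("Patient_ID", "Date"),
              function(df)
                data.frame(Ward = paste(unique(df$Ward), collapse = ";")))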

Whether the two versions scale in a similar manner remains to be seen: just because the aggregate() version is quicker in these repeated tests on simple data doesn't mean you'll get the same benefit on the much larger task. I'll leave you to test both versions on small subsets of your data, with more than a few patients, to see how well they scale.


Edit:
A quick test - repeating the patient data you gave us to generate four new patients (giving 5 in total), all with the same data, suggests that the aggregate() version scales a bit better. Execution time for the aggregate() version went up to 4.6 seconds for the 1000 reps (~ a doubling), whereas the timing for the ddply() version went up to 52 seconds (~ a quadrupling).
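For reference, a minimal sketch of how such a replicated test set can be built; the name pdat_big is made up for illustration:

# Stack five copies of the example data, each under its own Patient_ID.
pdat_big <- do.call(rbind, lapply(1:5, function(i) {
  d <- pdat
  d$Patient_ID <- sprintf("P%03d", i)
  d
}))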
