Processing of hospital admission data using R (part II)
Thanks all for providing suggestions on my question processing of hospital admission data using R. I have an additional question on this issue; actually, it should be the task that comes before that question.
Now I have a dataset like this:
Patient_ID Date Ward
P001 1 A
P001 2 A
P001 3 A
P001 4 A
P001 4 B
P001 5 B
P001 6 B
P001 7 B
P001 7 C
P001 8 B
P001 9 B
P001 10 B
I need to convert it into:
Patient_ID Date Ward
P001 1 A
P001 2 A
P001 3 A
P001 4 A;B
P001 5 B
P001 6 B
P001 7 B;C
P001 8 B
P001 9 B
P001 10 B
Currently I convert it using ddply(); the code is attached below:
library(plyr)

data <- ddply(data,
              c("Patient_ID", "Date"),
              function(df) {
                data.frame(Ward = paste(unique(df[, "Ward"]), collapse = ";"))
              },
              .progress = "text")
This solves my problem, but it is VERY slow (more than 20 minutes on a P4 3.2 GHz machine) when the dataset has 8818 unique(Patient_ID) and 1861 unique(Date) values. How can I improve that? Thanks!
1 Answer
Something that works is this, assuming your data are in object pdat, and gives:
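The code block from the original answer did not survive extraction; the following is a reconstruction of an aggregate()-based approach matching the answer's description, not the author's exact code. The data frame pdat is rebuilt from the question's example.

```r
# Rebuild the question's example data as `pdat` (assumed object name).
pdat <- data.frame(
  Patient_ID = "P001",
  Date = c(1, 2, 3, 4, 4, 5, 6, 7, 7, 8, 9, 10),
  Ward = c("A", "A", "A", "A", "B", "B", "B", "B", "C", "B", "B", "B"),
  stringsAsFactors = FALSE
)

# Collapse wards per patient per day, joining multiple wards with ";".
out <- aggregate(Ward ~ Patient_ID + Date, data = pdat,
                 FUN = function(x) paste(unique(x), collapse = ";"))
out <- out[order(out$Patient_ID, out$Date), ]
out
```

On the example data this produces one row per patient-day, with Ward "A;B" on Date 4 and "B;C" on Date 7.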
It should extend happily to more patients etc., and is quite a bit faster than your ddply() version.
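The system.time() comparison shown in the original answer was also lost; this is a sketch of how such a timing could be run, with pdat rebuilt from the question's example. The wrapper name agg_fn is illustrative; the ddply() version from the question can be wrapped and timed the same way.

```r
# Example data from the question (assumed to be the benchmark input).
pdat <- data.frame(
  Patient_ID = "P001",
  Date = c(1, 2, 3, 4, 4, 5, 6, 7, 7, 8, 9, 10),
  Ward = c("A", "A", "A", "A", "B", "B", "B", "B", "C", "B", "B", "B"),
  stringsAsFactors = FALSE
)

# Illustrative wrapper around the aggregate() approach.
agg_fn <- function(d) {
  aggregate(Ward ~ Patient_ID + Date, data = d,
            FUN = function(x) paste(unique(x), collapse = ";"))
}

# Time 1000 repetitions; do the same with the ddply() version and compare
# the "elapsed" entries.
tm <- system.time(for (i in 1:1000) agg_fn(pdat))
tm["elapsed"]
```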
However, this doesn't mean that ddply() cannot be sped up; I'm not familiar with this package. Whether the two versions scale in a similar manner remains to be seen: just because the aggregate() version is quicker in these repeated tests on simple data doesn't mean you'll get the same benefit when applied to the much larger task. I'll leave you to test the two versions on small subsets of your data with more than a few patients to see how well they scale.

Edit:
A quick test, repeating the patient data you gave us to generate four new patients (giving 5 in total), all with the same data, suggests that the aggregate() version scales a bit better. Execution time for the aggregate() version went up to 4.6 seconds for the 1000 reps (roughly a doubling), whereas the timing for the ddply() version went up to 52 seconds (roughly a quadrupling).
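The quick scaling test described above can be sketched as follows; the replication scheme (copying the one example patient under new IDs) follows the answer's description, while names like pdat5 are illustrative.

```r
# Example data from the question.
pdat <- data.frame(
  Patient_ID = "P001",
  Date = c(1, 2, 3, 4, 4, 5, 6, 7, 7, 8, 9, 10),
  Ward = c("A", "A", "A", "A", "B", "B", "B", "B", "C", "B", "B", "B"),
  stringsAsFactors = FALSE
)

# Replicate the example patient four more times (5 patients total),
# all with identical data, as described in the Edit.
pdat5 <- do.call(rbind, lapply(sprintf("P%03d", 1:5), function(id) {
  d <- pdat
  d$Patient_ID <- id
  d
}))

# Re-run the collapse on the larger data; repeat inside system.time()
# as in the timing sketch to measure how each version scales.
out5 <- aggregate(Ward ~ Patient_ID + Date, data = pdat5,
                  FUN = function(x) paste(unique(x), collapse = ";"))
nrow(out5)  # 5 patients x 10 distinct dates = 50 collapsed rows
```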