Parsing PubMed data and extracting multiple columns from multiple files
I have multiple XML files from PubMed. Several files are here.
How can I parse them and get these columns in a single dataframe?
If an article has several authors, I want each author on a separate row.
Expected output (all authors should be included):
Title   Year  ArticleTitle      LastName  ForeName
Nature  2021  Inter-mosaic ...  Roy       Suva
Nature  2021  Inter-mosaic ...  Pearson   John
Nature  2021  Neural dynamics   Pearson   John
Nature  2021  Neural dynamics   Mooney    Richard
Comments (1)
First, what you want is doable. Something like this should work for your second file, and you could add other files by wrapping the code in a for loop:
Output (for 34671166.xml):
Having said all that, I'm not sure a dataframe with each author in a separate line is the best idea for the type of data you have. In this example, since you have 8 co-authors, information such as the article title is repeated unnecessarily 8 times. You could give each author a separate set of columns, but then you'll have problems where articles have 3 or 10 co-authors...
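To make the trade-off above concrete, here is a minimal sketch of both layouts, using only Python's standard library. It is not the code from the answer. It assumes the usual PubMed export layout (Article elements containing Journal/Title, Journal/JournalIssue/PubDate/Year, ArticleTitle and AuthorList/Author), and the function names (text_of, author_rows, rows_from_files, group_by_article) are made up for illustration.

```python
import xml.etree.ElementTree as ET


def text_of(parent, path):
    """Return the stripped text found at `path` under `parent`, or None."""
    node = parent.find(path)
    if node is None:
        return None
    # itertext() also collects text nested in inline markup such as <i> or <sub>
    return "".join(node.itertext()).strip()


def author_rows(root):
    """Yield one dict per (article, author) pair from a parsed PubMed tree.

    Assumption: the standard PubMed element names. Records that carry a
    MedlineDate instead of a Year, or a CollectiveName instead of a
    LastName/ForeName pair, simply get None in those fields.
    """
    for article in root.iter("Article"):
        common = {
            "Title": text_of(article, "Journal/Title"),
            "Year": text_of(article, "Journal/JournalIssue/PubDate/Year"),
            "ArticleTitle": text_of(article, "ArticleTitle"),
        }
        for author in article.findall("AuthorList/Author"):
            yield {
                **common,
                "LastName": text_of(author, "LastName"),
                "ForeName": text_of(author, "ForeName"),
            }


def rows_from_files(paths):
    """The 'wrap it in a for loop' step: collect rows from several files."""
    rows = []
    for path in paths:
        rows.extend(author_rows(ET.parse(path).getroot()))
    return rows


def group_by_article(rows):
    """Alternative layout: one record per article, authors kept as a list."""
    grouped = {}
    for row in rows:
        key = (row["Title"], row["Year"], row["ArticleTitle"])
        grouped.setdefault(key, []).append((row["LastName"], row["ForeName"]))
    return [
        {"Title": t, "Year": y, "ArticleTitle": a, "Authors": authors}
        for (t, y, a), authors in grouped.items()
    ]


# Example call (the directory name is hypothetical):
#   from pathlib import Path
#   rows = rows_from_files(sorted(Path("pubmed_xml").glob("*.xml")))
```

If pandas is installed, pandas.DataFrame(rows) turns the result into the one-row-per-author dataframe the question asks for. The output of group_by_article avoids repeating the article title for every co-author, at the cost of a list-valued Authors column.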