Parsing PubMed data and extracting multiple columns from multiple files

Posted on 2025-01-29 10:13:32

I have multiple XML files from PubMed. Several files are here.

How can I parse them and get these columns into a single dataframe?
If an article has several authors, I want each author on a separate row.

Expected output (all authors should be included):

Title  Year ArticleTitle     LastName ForeName
Nature 2021 Inter-mosaic ... Roy      Suva
Nature 2021 Inter-mosaic ... Pearson  John
Nature 2021 Neural dynamics  Pearson  John
Nature 2021 Neural dynamics  Mooney   Richard

Comments (1)

行雁书 2025-02-05 10:13:32

First, what you want is doable. Something like this should work for your second file, and you could add other files by wrapping the code with a for loop:

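from lxml import etree
import pandas as pd

# Parse one PubMed record (this assumes one article per file)
doc = etree.parse('file.xml')

# 'ArticleDate' locates the date element; the column itself ends up holding the year
columns = ['Title', 'ArticleDate', 'ArticleTitle', 'LastName', 'ForeName']

# Article-level fields: take the first match of each
title = doc.xpath(f'//{columns[0]}/text()')[0]
year = doc.xpath(f'//{columns[1]}//Year/text()')[0]
article_title = doc.xpath(f'//{columns[2]}/text()')[0]

# One row per author, repeating the article-level fields
rows = []
for auth in doc.xpath('//Author'):
    last_name = auth.xpath(f'{columns[3]}/text()')[0]
    fore_name = auth.xpath(f'{columns[4]}/text()')[0]
    rows.append([title, year, article_title, last_name, fore_name])

df = pd.DataFrame(rows, columns=columns)
print(df)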

Output (for 34671166.xml):

   Title   ArticleDate  ArticleTitle                                       LastName        ForeName
0  Nature  2021         Neural dynamics underlying birdsong practice a...  Singh Alvarado  Jonnathan
1  Nature  2021         Neural dynamics underlying birdsong practice a...  Goffinet        Jack
2  Nature  2021         Neural dynamics underlying birdsong practice a...  Michael         Valerie
3  Nature  2021         Neural dynamics underlying birdsong practice a...  Liberti         William
4  Nature  2021         Neural dynamics underlying birdsong practice a...  Hatfield        Jordan
5  Nature  2021         Neural dynamics underlying birdsong practice a...  Gardner         Timothy
6  Nature  2021         Neural dynamics underlying birdsong practice a...  Pearson         John
7  Nature  2021         Neural dynamics underlying birdsong practice a...  Mooney          Richard
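The "wrap it in a for loop" step could look roughly like the sketch below. It is not the code from the answer: it swaps lxml for the standard-library `xml.etree.ElementTree` so it runs without extra packages, and the function names `author_rows` and `collect_rows` are made up for illustration. It also assumes that a record without an `ArticleDate` carries its year under `JournalIssue/PubDate`, and it leaves a name as `None` where an author entry has no `LastName` or `ForeName` (for example a group author) instead of raising an `IndexError`.

```python
import xml.etree.ElementTree as ET


def author_rows(source):
    """Return one [title, year, article_title, last_name, fore_name] list per author.

    `source` is a file path or a binary file object (hypothetical helper).
    """
    root = ET.parse(source).getroot()
    rows = []
    # iter() also matches the root itself, so this handles a file holding a
    # single PubmedArticle as well as a PubmedArticleSet with several.
    for article in root.iter("PubmedArticle"):
        title = article.findtext(".//Journal/Title")
        # Assumption: fall back to the journal issue date when ArticleDate is absent.
        year = (article.findtext(".//ArticleDate/Year")
                or article.findtext(".//JournalIssue/PubDate/Year"))
        # itertext() keeps text nested in inline markup such as <i> or <sub>.
        title_el = article.find(".//ArticleTitle")
        article_title = "".join(title_el.itertext()) if title_el is not None else None
        for author in article.iter("Author"):
            # findtext() returns None for a missing child instead of failing.
            rows.append([title, year, article_title,
                         author.findtext("LastName"),
                         author.findtext("ForeName")])
    return rows


def collect_rows(sources):
    """Concatenate the author rows of several PubMed XML sources."""
    rows = []
    for source in sources:
        rows.extend(author_rows(source))
    return rows
```

Called with a list of file names, for instance `collect_rows(sorted(glob.glob('*.xml')))`, this returns rows in the same order as the `columns` list above, so they can be passed to `pd.DataFrame(rows, columns=columns)` unchanged.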

Having said all that, I'm not sure a dataframe with each author in a separate line is the best idea for the type of data you have. In this example, since you have 8 co-authors, information such as the article title is repeated unnecessarily 8 times. You could give each author a separate set of columns, but then you'll have problems where articles have 3 or 10 co-authors...
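One possible middle ground, sketched here as an assumption and not something the answer proposes, is to keep one record per article and store the authors as a list, which avoids both the repeated titles and a variable number of author columns. The helper name `group_by_article` is hypothetical, and it takes rows shaped like the `rows` list above. It groups on journal title, year and article title only because those are the fields available; a real pipeline would more safely key on the PMID.

```python
def group_by_article(rows):
    """Collapse [title, year, article_title, last_name, fore_name] rows
    into one dict per article, with the authors gathered in a list."""
    articles = {}
    for title, year, article_title, last_name, fore_name in rows:
        key = (title, year, article_title)
        # Create the article record on first sight, then append each author.
        record = articles.setdefault(key, {
            "Title": title,
            "Year": year,
            "ArticleTitle": article_title,
            "Authors": [],
        })
        record["Authors"].append((last_name, fore_name))
    # Dicts preserve insertion order, so articles come out in input order.
    return list(articles.values())
```

If the one-row-per-author layout is needed later, a dataframe built from these records can be expanded again with pandas' `DataFrame.explode('Authors')`.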
