Extracting PubMed article details into a DataFrame

Posted 2025-01-24 04:50:03

Here is the code.

import pandas as pd
from pymed import PubMed

pubmed = PubMed(tool="PubMedSearcher", email="[email protected]")

## PUT YOUR SEARCH TERM HERE ##
search_term = 'Charlie Brown'
results = pubmed.query(search_term, max_results=100000)
articleList = []
articleInfo = []

for article in results:
    # Each result is either a PubMedBookArticle or a PubMedArticle.
    # Convert it to a dictionary with the built-in toDict() method.
    articleDict = article.toDict()
    articleList.append(articleDict)

# Build a list of dict records holding the article details fetched from the PubMed API
for article in articleList:
    # article['pubmed_id'] sometimes contains several IDs separated by newlines;
    # the first one is the article's own PubMed ID.
    pubmedId = article['pubmed_id'].partition('\n')[0]
    # Append the article info to the list of records
    articleInfo.append({'pubmed_id': pubmedId,
                        'publication_date': article['publication_date'],
                        'authors': article['authors']})

df = pd.json_normalize(articleInfo)

Running this code produces a DataFrame with three columns: pubmed_id, publication_date and authors.

Is there a way to unnest the authors column and keep the other two columns? Thanks so much in advance.

卷耳 2025-01-31 04:50:03

If you want to unnest the column, you first have to decide on a strategy. For example, you can format each author as lastname, firstname and join the authors with ;:

# New column to easily identify how many authors there are in the paper
df['n_authors'] = df['authors'].map(len)

# Unnest authors into a single string using the above-mentioned strategy
df['authors'] = df['authors'].map(lambda authors: ';'.join([f"{author['lastname']}, {author['firstname']}" for author in authors]))

Output:

   pubmed_id publication_date                                            authors  n_authors  
0   35435469       2022-04-19  Easwaran, Raju;Khan, Moin;Sancheti, Parag;Shya...         41  
1   34480858       2021-09-05  Flaxman, Amy;Marchevsky, Natalie G;Jenkin, Dan...         38  
2   30857579       2019-03-13                                     Brown, Charlie          1  
3   28640023       2017-06-24  Thornton, Kevin C;Schwarz, Jennifer J;Gross, A...         12  
4   24195874       2013-11-08  Bicket, Mark C;Gupta, Anita;Brown, Charlie H;C...          4  
5   21741796       2011-07-12  Bird, Jonathan H;Carmont, Michael R;Dhillon, M...          7  
6   21324873       2011-02-18  Cohen, Steven P;Brown, Charlie;Kurihara, Conni...          6  
7   20228712       2010-03-17  Cohen, Steven P;Kapoor, Shruti G;Nguyen, Cuong...          8  
8   20109957       2010-01-30  Cohen, Steven P;Brown, Charlie;Kurihara, Conni...          6  
9   18248779       2008-02-06  Whitaker, Iain S;Duggan, Eileen M;Alloway, Rit...         10  
10  16917639       2006-08-19  Drayton, William;Brown, Charlie;Hillhouse, Karin          3  
11  16282488       2005-11-12  Mao, Hanwen;Lafont, Bernard A P;Igarashi, Tats...          9  
12  14581571       2003-10-29  Moniuszko, Marcin;Brown, Charlie;Pal, Ranajit;...          7  
13  12163382       2002-08-07  Williams, Kenneth;Schwartz, Annette;Corey, Sar...         10
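
If you would rather have one row per author than one joined string, a possible alternative is to let pd.json_normalize do the unnesting through its record_path and meta parameters. The sketch below is only illustrative: article_info is a hypothetical hand-written sample that mimics the shape of articleInfo from the question (two records taken from the output above, with dates written as plain strings), and it assumes every author is a dict with lastname and firstname keys.

```python
import pandas as pd

# Hypothetical sample shaped like articleInfo in the question;
# real records would come from pymed, not be typed by hand.
article_info = [
    {"pubmed_id": "30857579", "publication_date": "2019-03-13",
     "authors": [{"lastname": "Brown", "firstname": "Charlie"}]},
    {"pubmed_id": "16917639", "publication_date": "2006-08-19",
     "authors": [
         {"lastname": "Drayton", "firstname": "William"},
         {"lastname": "Brown", "firstname": "Charlie"},
         {"lastname": "Hillhouse", "firstname": "Karin"},
     ]},
]

# record_path names the nested list to expand into rows;
# meta lists the top-level fields to repeat on every expanded row.
long_df = pd.json_normalize(
    article_info,
    record_path="authors",
    meta=["pubmed_id", "publication_date"],
)

print(long_df)
```

This gives one row per (article, author) pair, with pubmed_id and publication_date repeated on each row. Note that an article whose authors list is empty contributes no rows at all, so keep the original df around if you need those articles too.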
