Adding elements of the first pandas DataFrame to a new column of the second pandas DataFrame
I wrote a Python script with a nested for loop that adds elements of the first pandas DataFrame to a new column of the second pandas DataFrame, on the condition that an element of a column in the first DataFrame lies between the elements of two columns (start and end) in any row of the second DataFrame. The script runs fine, but it takes a long time to complete. It would be great if someone could help me improve it so that the run time is reduced, as my dataset is very large. Looking forward to your answers.
Here is a sample of the first pandas DataFrame (GT) and the second pandas DataFrame (GC):
    import pandas as pd

    GT = pd.DataFrame({'CHROM': ['chr1', 'chr1', 'chr1', 'chr1'],
                       'POS': [23197, 23308, 634553, 727233],
                       'HET': [0, 2, 3, 2]})
    GC = pd.DataFrame({'Gene_ID': ['ENSG00000227232', 'ENSG00000269981', 'ENSG00000279457', 'ENSG00000225972'],
                       'Gene_Name': ['WASH7P', 'ENSG00000269981', 'WASH9P', 'MTND1P23'],
                       'start': [14404, 137682, 185217, 629062],
                       'end': [29570, 137965, 195411, 629433]})
And here is the Python code that I ran on these DataFrames:
    import pandas as pd

    Gene_ID = []
    Gene_Name = []
    Gene_Coordinate_start = []
    Gene_Coordinate_end = []
    # Visit every row: range(len(...)), not len(...) - 1, which skipped the last row
    for m in range(len(GT)):
        pos = GT["POS"].iloc[m]
        hit = None  # row of the first gene whose [start, end] contains pos
        for n in range(len(GC)):
            if GC["start"].iloc[n] <= pos <= GC["end"].iloc[n]:
                hit = n
                break
        # Append exactly one value per GT row (None when no gene matches);
        # otherwise the lists cannot be assigned back to GT as columns
        Gene_ID.append(None if hit is None else GC["Gene_ID"].iloc[hit])
        Gene_Name.append(None if hit is None else GC["Gene_Name"].iloc[hit])
        Gene_Coordinate_start.append(None if hit is None else GC["start"].iloc[hit])
        Gene_Coordinate_end.append(None if hit is None else GC["end"].iloc[hit])
    GT["Gene_ID"] = Gene_ID
    GT["Gene_Name"] = Gene_Name
    GT["Gene_Coordinate_start"] = Gene_Coordinate_start
    GT["Gene_Coordinate_end"] = Gene_Coordinate_end
    GT.to_csv('merge_snp_gene.csv')
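One possible way to avoid the nested loop is to sort the gene start coordinates once and look up each position with a binary search. The sketch below is not a definitive solution: it assumes the gene intervals in GC do not overlap one another, it ignores chromosome just as the loop version does (GC has no chromosome column in the sample), and the helper name `annotate_positions` is made up for illustration. Positions that fall in no gene are left as missing values.

```python
import numpy as np
import pandas as pd

GT = pd.DataFrame({
    "CHROM": ["chr1", "chr1", "chr1", "chr1"],
    "POS": [23197, 23308, 634553, 727233],
    "HET": [0, 2, 3, 2],
})
GC = pd.DataFrame({
    "Gene_ID": ["ENSG00000227232", "ENSG00000269981", "ENSG00000279457", "ENSG00000225972"],
    "Gene_Name": ["WASH7P", "ENSG00000269981", "WASH9P", "MTND1P23"],
    "start": [14404, 137682, 185217, 629062],
    "end": [29570, 137965, 195411, 629433],
})


def annotate_positions(gt, gc):
    """Hypothetical helper: attach the gene whose [start, end] contains each POS.

    Assumption: gene intervals in `gc` do not overlap one another.
    """
    genes = gc.sort_values("start").reset_index(drop=True)
    pos = gt["POS"].to_numpy()

    # For each position, index of the last gene with start <= position (-1 if none)
    idx = np.searchsorted(genes["start"].to_numpy(), pos, side="right") - 1
    # Clip so -1 is a valid lookup index; those rows are masked out by `inside`
    safe = np.clip(idx, 0, len(genes) - 1)
    inside = (idx >= 0) & (pos <= genes["end"].to_numpy()[safe])

    matched = genes.iloc[safe].reset_index(drop=True)
    out = gt.copy()
    columns = {
        "Gene_ID": "Gene_ID",
        "Gene_Name": "Gene_Name",
        "start": "Gene_Coordinate_start",
        "end": "Gene_Coordinate_end",
    }
    for src, dst in columns.items():
        # Keep the matched value only where the position lies inside the interval
        out[dst] = matched[src].where(inside).to_numpy()
    return out


annotated = annotate_positions(GT, GC)
print(annotated)
```

With n positions and m genes, this does one sort of the genes plus one binary search per position, instead of comparing every position against every gene. If genes can overlap, a position may belong to several genes and this sketch would report at most one of them, so a different approach would be needed.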