Adding elements of the first pandas DataFrame to a new column of the second pandas DataFrame

Posted on 2025-01-18 03:14:02

I wrote a Python script with a nested for loop that adds elements of one pandas DataFrame as new columns of another, on the condition that a value in a column of the first DataFrame (POS in GT) lies between the values of two columns (start and end) in ANY row of the second DataFrame (GC). The script works, but it takes a long time to complete. My dataset is very large, so it would be great if someone could help me improve the script and reduce the run time. Looking forward to your answers.

Here is a sample of the first pandas DataFrame (GT) and the second pandas DataFrame (GC):

import pandas as pd

# Variant positions: chromosome, position, heterozygosity count
GT = pd.DataFrame({'CHROM': ['chr1', 'chr1', 'chr1', 'chr1'],
                   'POS': [23197, 23308, 634553, 727233],
                   'HET': [0, 2, 3, 2]})
# Gene coordinates: ID, name, start and end of each gene
GC = pd.DataFrame({'Gene_ID': ['ENSG00000227232', 'ENSG00000269981', 'ENSG00000279457', 'ENSG00000225972'],
                   'Gene_Name': ['WASH7P', 'ENSG00000269981', 'WASH9P', 'MTND1P23'],
                   'start': [14404, 137682, 185217, 629062],
                   'end': [29570, 137965, 195411, 629433]})

And here is the Python code that I ran on these DataFrames:

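import pandas as pd

Gene_ID = []
Gene_Name = []
Gene_Coordinate_start = []
Gene_Coordinate_end = []
# Compare every position in GT with every gene interval in GC.
# range(len(...)) visits every row, including the last one.
for m in range(len(GT["POS"])):
    pos = GT["POS"].iloc[m]
    match = None
    for n in range(len(GC["start"])):
        if GC["start"].iloc[n] <= pos <= GC["end"].iloc[n]:
            match = n
            break  # assumption: keep only the first matching gene
    # Append None when no gene contains the position,
    # so every list has exactly one entry per GT row.
    Gene_ID.append(None if match is None else GC["Gene_ID"].iloc[match])
    Gene_Name.append(None if match is None else GC["Gene_Name"].iloc[match])
    Gene_Coordinate_start.append(None if match is None else GC["start"].iloc[match])
    Gene_Coordinate_end.append(None if match is None else GC["end"].iloc[match])
GT["Gene_ID"] = Gene_ID
GT["Gene_Name"] = Gene_Name
GT["Gene_Coordinate_start"] = Gene_Coordinate_start
GT["Gene_Coordinate_end"] = Gene_Coordinate_end
GT.to_csv('merge_snp_gene.csv')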
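
For comparison, here is a minimal sketch of how the same interval lookup might be done without a Python-level nested loop, using a binary search over the sorted gene starts. It is only one possible approach and rests on an assumption not stated in the question: the gene intervals in GC do not overlap, so each position falls inside at most one gene. The function name annotate_positions is hypothetical, and, like the loop above, the sketch ignores the CHROM column because GC has no chromosome column in the sample.

```python
import numpy as np
import pandas as pd

GT = pd.DataFrame({'CHROM': ['chr1', 'chr1', 'chr1', 'chr1'],
                   'POS': [23197, 23308, 634553, 727233],
                   'HET': [0, 2, 3, 2]})
GC = pd.DataFrame({'Gene_ID': ['ENSG00000227232', 'ENSG00000269981',
                               'ENSG00000279457', 'ENSG00000225972'],
                   'Gene_Name': ['WASH7P', 'ENSG00000269981', 'WASH9P', 'MTND1P23'],
                   'start': [14404, 137682, 185217, 629062],
                   'end': [29570, 137965, 195411, 629433]})


def annotate_positions(gt, gc):
    """Hypothetical helper: attach the gene containing each POS, if any.

    Assumes the intervals in gc do not overlap.
    """
    gc_sorted = gc.sort_values("start").reset_index(drop=True)
    starts = gc_sorted["start"].to_numpy()
    ends = gc_sorted["end"].to_numpy()
    pos = gt["POS"].to_numpy()

    # Index of the last gene whose start is <= POS (-1 if there is none).
    idx = np.searchsorted(starts, pos, side="right") - 1
    safe_idx = np.clip(idx, 0, None)
    # The candidate gene is a hit only if POS is also <= its end.
    valid = (idx >= 0) & (pos <= ends[safe_idx])

    hits = gc_sorted.iloc[safe_idx].reset_index(drop=True)
    out = gt.copy()
    new_names = {"Gene_ID": "Gene_ID", "Gene_Name": "Gene_Name",
                 "start": "Gene_Coordinate_start", "end": "Gene_Coordinate_end"}
    for col, new_col in new_names.items():
        # Rows without a containing gene get NaN.
        out[new_col] = hits[col].where(valid).to_numpy()
    return out


annotated = annotate_positions(GT, GC)
print(annotated)
```

The search costs roughly O((N + M) log M) for N positions and M genes instead of O(N × M) for the nested loop. If genes can overlap, this sketch would report at most one of them per position, so a different method would be needed in that case.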
