Finding overlapping ranges based on multiple DataFrame column values

Posted on 2025-02-07 22:02:42

I have a TSV that looks as follows:

chr_1   start_1     chr_2   start_2
11      69633786    14      105884873
12      81940993    X       137690551
13      29782093    12      97838049
14      105864244   11      69633799
17      33207000    20      9992701
17      38446991    20      2102271
17      38447482    17      29623333
20      9992701     17      33207000
20      10426599    17      33094167
20      13765533    17      29469669
22      27415959    8       36197094
22      37191634    8       38983042
22      44464751    18      74004141
8       36197054    22      23130534
8       36197054    22      23131537
8       36197054    8       23130539

This is loaded into a pandas DataFrame that I will refer to as transDiffStartEndChr.

I am working on a program that takes this TSV as input and outputs the rows that share the same chr_1 and chr_2 with another row, and whose start_1 and start_2 values are within +/- 1000 of that other row's values.

Ideal output would look like:

chr_1   start_1 chr_2   start_2

8   36197054    8   23130539
8   36197054    22  23131537

Potentially, the hits could also be grouped by chr_1 and chr_2.

My current script/thoughts:

import pandas as pd

transDiffStartEndChr = pd.read_csv('test-input.tsv', sep='\t')

# First extract rows by chr_1; this test case uses chromosome 17.
# (extractChr is a helper function that is not shown here.)
rowsStartChr17 = transDiffStartEndChr[transDiffStartEndChr.apply(extractChr, chr='17', axis=1)]

# I figure I can brute-force this by comparing every row against every other row,
# but I feel like I'm not tackling this problem correctly.
# (proximityCheck is a helper function that is not shown here.)
for index, row in rowsStartChr17.iterrows():
    for index2, row2 in rowsStartChr17.iterrows():
        if index == index2:
            continue
        elif row['chr_1'] == row2['chr_1'] and row['chr_2'] == row2['chr_2']:
            if proximityCheck(row['start_1'], row2['start_1']) and proximityCheck(row['start_2'], row2['start_2']):
                print(f'Row: {index} Match: {index2}')

Any thoughts are appreciated.

雨夜星沙 2025-02-14 22:02:42

You can use NumPy and pandas to filter out the groups that don't match your requirements.

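import numpy as np
import pandas as pd

df = pd.read_csv('test-input.tsv', sep='\t')

def has_close_pair(group, threshold=1500):
    # Pairwise absolute differences between all start_2 values in the group
    v = group['start_2'].values
    close = np.abs(np.subtract.outer(v, v)) < threshold
    # Strict lower triangle only, so a row is never compared with itself
    return np.tril(close, -1).any()

df.groupby(['chr_1', 'chr_2']).filter(has_close_pair)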

The logic is to group by chr_1 and chr_2, then take the outer (pairwise) subtraction of the start_2 values within each group and check whether any pair of rows differs by less than 1500 (the threshold I used). Groups that contain no such pair are dropped.
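The filter above only compares start_2 and keeps whole groups. If you also want to require start_1 to be close and to see which rows matched each other, one possible approach is a self-merge on chr_1 and chr_2. This is a rough sketch, not a definitive implementation: the function name close_pairs, its threshold parameter, and the helper column "row" are names I made up for illustration, and the sample rows are a small subset of the table in the question.

```python
import io
import pandas as pd

def close_pairs(df, threshold=1000):
    """Return index pairs of rows that share chr_1/chr_2 and whose
    start_1 and start_2 both differ by at most `threshold`."""
    # Keep the original row position in a column (hypothetical name "row")
    indexed = df.reset_index().rename(columns={"index": "row"})
    # Self-merge: every row is paired with every row in the same chr_1/chr_2 group
    pairs = indexed.merge(indexed, on=["chr_1", "chr_2"], suffixes=("_a", "_b"))
    # Drop self-pairs and mirrored duplicates
    pairs = pairs[pairs["row_a"] < pairs["row_b"]]
    near = (
        ((pairs["start_1_a"] - pairs["start_1_b"]).abs() <= threshold)
        & ((pairs["start_2_a"] - pairs["start_2_b"]).abs() <= threshold)
    )
    return pairs.loc[near, ["row_a", "row_b"]].reset_index(drop=True)

# A few rows copied from the question's table
sample = io.StringIO(
    "chr_1\tstart_1\tchr_2\tstart_2\n"
    "17\t33207000\t20\t9992701\n"
    "8\t36197054\t22\t23130534\n"
    "8\t36197054\t22\t23131537\n"
    "8\t36197054\t8\t23130539\n"
)
# Read chromosome names as strings so '8' and 'X' are handled the same way
df = pd.read_csv(sample, sep="\t", dtype={"chr_1": str, "chr_2": str})

pairs_1500 = close_pairs(df, threshold=1500)
pairs_1000 = close_pairs(df, threshold=1000)
print(pairs_1500)
```

On these sample rows the two chr_1 = 8, chr_2 = 22 rows differ by 1003 in start_2, so they are reported with a threshold of 1500 but not with the +/- 1000 from the question. Note that the self-merge builds every pair within a group, so memory grows quadratically with group size.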
