如何在 python 中比较两个 CSV 中的列？

发布于 2024-11-07 10:23:38 字数 916 浏览 4 评论 0原文

你好我有两个 CSV 文件作为输入，例如：

file1 :

AK163828 chr5 s1 + e1 cttt 4
AK163828 chr5 s2 + e2 gtca 4
AK168688 chr6 s3 + e3 ggcg 4
AK168688 chr6 s4 + e4 tctg 4

file2 :

chr6s3+e3 ggcg
chr5s1+e1 cttt
chr6s4+e4 tata
chr5s2+e2 ggcg

#as you can see the file2 is randomly sorted

我想将 file2 的第 1 列与 file1 的合并后的第 2、3、4、5 列进行比较，同时将 file2 的第 2 列与文件的第 6 列进行比较1、仅选择匹配的行。

所需的输出是

chr6s3+e3 ggcg
chr5s1+e1 cttt

我尝试使用此代码：

import csv

reader1 = csv.reader(open(file1), dialect='excel-tab' )
reader2 = csv.reader(open(file2), dialect='excel-tab' )

     for row1, row2 in zip(reader1,reader2):
            F1 = row1[1] + row1[2] + row1[3] + row1[4] + '\t' row1[5]
            F2 = row2[0] + '\t' + row2[1]
            print set(F1) & set(F2)

但它不起作用。你能帮我修复我的代码或给我其他方法吗？感谢您的帮助！

原文

Hi
I have two CSV files as input, for example:

file1 :

AK163828 chr5 s1 + e1 cttt 4
AK163828 chr5 s2 + e2 gtca 4
AK168688 chr6 s3 + e3 ggcg 4
AK168688 chr6 s4 + e4 tctg 4

file2 :

chr6s3+e3 ggcg
chr5s1+e1 cttt
chr6s4+e4 tata
chr5s2+e2 ggcg

#as you can see the file2 is randomly sorted

I want to compare column 1 of file2 with column 2, 3, 4, 5 merged of file1 and at the same time the column 2 of file2 with column 6 of file 1, and select only the matching lines.

The desired output is

chr6s3+e3 ggcg
chr5s1+e1 cttt

I tried to use this code:

import csv

reader1 = csv.reader(open(file1), dialect='excel-tab' )
reader2 = csv.reader(open(file2), dialect='excel-tab' )

     for row1, row2 in zip(reader1,reader2):
            F1 = row1[1] + row1[2] + row1[3] + row1[4] + '\t' row1[5]
            F2 = row2[0] + '\t' + row2[1]
            print set(F1) & set(F2)

But it doesn't work. Can you help me to fix my code or give me an other way to do it?
Thanks for your help!

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

我们的影子 2024-11-14 10:23:38

快速而肮脏：

import csv

file1 = 'C:/Users/Me/Desktop/file1'
file2 = 'C:/Users/Me/Desktop/file2'

reader1 = csv.reader(open(file1))
reader2 = csv.reader(open(file2))

F1 = set(''.join(row1[0].strip().split()[1:6]) for row1 in reader1)
F2 = set(''.join(row2[0].strip().split()) for row2 in reader2)

for sequence in F1.intersection(F2):
    print(sequence[:-4]),
    print('\t'),
    print(sequence[-4:])

输出：

chr6s3+e3   ggcg
chr5s1+e1   cttt

Quick and dirty:

import csv

file1 = 'C:/Users/Me/Desktop/file1'
file2 = 'C:/Users/Me/Desktop/file2'

reader1 = csv.reader(open(file1))
reader2 = csv.reader(open(file2))

F1 = set(''.join(row1[0].strip().split()[1:6]) for row1 in reader1)
F2 = set(''.join(row2[0].strip().split()) for row2 in reader2)

for sequence in F1.intersection(F2):
    print(sequence[:-4]),
    print('\t'),
    print(sequence[-4:])

Output:

chr6s3+e3   ggcg
chr5s1+e1   cttt

回复收藏 0 原文

浅紫色的梦幻 2024-11-14 10:23:38

怎么样：

import csv

reader1 = csv.reader(open('file1.tsv'), dialect='excel-tab' )
reader2 = csv.reader(open('file2.tsv'), dialect='excel-tab' )

keys = set()

for row in reader1:
    keys.add((''.join(row[1:5]), row[5]))

for row in reader2:
    if (row[0], row[1]) in keys:
        print '\t'.join(row)

顺便说一句：您使用的格式 (dialect='excel-tab') 通常称为 TSV，而不是 CSV，尽管它是 CSV 的变体。您还必须确保您的值是用制表符分隔的，而不是像您的帖子中那样用空格分隔的。我猜是的，但由于 Stack Overflow 格式问题，您只有空格？

How about this:

import csv

reader1 = csv.reader(open('file1.tsv'), dialect='excel-tab' )
reader2 = csv.reader(open('file2.tsv'), dialect='excel-tab' )

keys = set()

for row in reader1:
    keys.add((''.join(row[1:5]), row[5]))

for row in reader2:
    if (row[0], row[1]) in keys:
        print '\t'.join(row)

By the way: the format you're using (dialect='excel-tab') is usually called TSV, and not CSV, although it is a variant of CSV. You also have to make sure your values are separated by tabs and not by spaces, like in your post. I guess they are, and you only have spaces beacause of Stack Overflow formatting issues?

回复收藏 0 原文

~没有更多了~