Finding duplicates across multiple lists read from a CSV file (Python)

Posted on 2024-11-05 18:42:45

Title seems confusing, but let's say I'm working with the following CSV file ('names.csv').

    name1,name2,name3
    Bob,Jane,Joe
    Megan,Tom,Jane
    Jane,Joe,Rob

My question is: how would I go about writing code that returns any string that occurs at least 3 times? The output here should be 'Jane', because it occurs 3 times. Really confused here.. perhaps some sample code would help me better understand?

So far I have:

    import csv
    reader = csv.DictReader(open("names.csv"))

    for row in reader:
        names = [row['name1'], row['name2'], row['name3']]
        print(names)

This returns:

    ['Bob', 'Jane', 'Joe']
    ['Megan', 'Tom', 'Jane']
    ['Jane', 'Joe', 'Rob']

Where do I go from here? Or am I going about this wrong? I'm really new to Python (well, programming altogether), so I have close to no clue what I'm doing..

Cheers

Comments (2)

绿光 2024-11-12 18:42:45

Putting it all together (and showing proper csv.reader usage):

    import csv
    import collections

    d = collections.defaultdict(int)
    # Python 3: open in text mode with newline="" (Python 2 used "rb" instead)
    with open("names.csv", newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the heading row
        for row in reader:
            for name in row:
                name = name.strip()
                if name:
                    d[name] += 1

    # keep names seen at least 3 times, most frequent first
    morethan3 = [(name, count) for name, count in d.items() if count >= 3]
    morethan3.sort(key=lambda x: x[1], reverse=True)
    for name, count in morethan3:
        print(name, count)
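As an alternative sketch (not part of the answer above), the same tally can be kept with `collections.Counter` in Python 3. The helper name `names_at_least` and its `min_count` parameter are made up for illustration, and the CSV is read from an in-memory string so the example runs without a file on disk.

```python
import csv
import io
from collections import Counter

# Sample data from the question, held in memory instead of names.csv
SAMPLE = "name1,name2,name3\nBob,Jane,Joe\nMegan,Tom,Jane\nJane,Joe,Rob\n"

def names_at_least(csv_file, min_count=3):
    """Return (name, count) pairs seen at least min_count times, most frequent first."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the heading row; tolerate an empty file
    counts = Counter(
        name.strip() for row in reader for name in row if name.strip()
    )
    # most_common() yields pairs sorted by descending count
    return [(name, n) for name, n in counts.most_common() if n >= min_count]

result = names_at_least(io.StringIO(SAMPLE))
print(result)
```

With a real file, `open("names.csv", newline="")` would take the place of the `io.StringIO` object.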

Update in response to comment:

You need to read through the whole CSV file whether you use the DictReader approach or not. If you want to e.g. ignore the 'name2' column (not row), then ignore it. You don't need to save all the data as your use of the variable name "rows" suggests. Here is code for a more general approach that doesn't rely on the column headings being in a particular order and allows selection/rejection of particular columns.

    # fragment: assumes f is the open file and d the defaultdict from the code above
    reader = csv.DictReader(f)
    required_columns = ['name1', 'name3'] #### adjust this line as needed ####
    for row in reader:
        for col in required_columns:
            name = row[col].strip()
            if name:
                d[name] += 1
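For reference, here is a minimal self-contained sketch of the column-selecting variant in Python 3. The function name `count_selected`, its parameters, and the in-memory sample are assumptions for illustration only; with a real file you would pass `open("names.csv", newline="")` instead of the `io.StringIO` object.

```python
import collections
import csv
import io

def count_selected(csv_file, required_columns, min_count=3):
    """Count names in the chosen columns; return those seen at least min_count times."""
    d = collections.defaultdict(int)
    for row in csv.DictReader(csv_file):  # the header row supplies the dict keys
        for col in required_columns:
            # a missing column or a short row yields None, treated here as empty
            name = (row.get(col) or "").strip()
            if name:
                d[name] += 1
    return {name: count for name, count in d.items() if count >= min_count}

sample = io.StringIO(
    "name1,name2,name3\nBob,Jane,Joe\nMegan,Tom,Jane\nJane,Joe,Rob\n"
)
selected = count_selected(sample, ["name1", "name3"], min_count=2)
print(selected)
```

Skipping `name2` drops Jane to two occurrences, which is why this example lowers `min_count` to 2.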
桃扇骨 2024-11-12 18:42:45

I'd do it like this:

    >>> from collections import defaultdict
    >>> d = defaultdict(int)
    >>> rows = [['Bob', 'Jane', 'Joe'],
    ...         ['Megan', 'Tom', 'Jane'],
    ...         ['Jane', 'Joe', 'Rob']]
    >>> for row in rows:
    ...     for name in row:
    ...         d[name] += 1
    ...
    >>> list(filter(lambda x: x[1] >= 3, d.items()))
    [('Jane', 3)]

It uses a dict with a default value of 0 to count how many times each name occurs in the file, and then filters the dict by the condition (count >= 3).
