What is the most efficient way to match list items to lines in a large file in Python?

Posted on 2024-12-03 19:46:05

I have a large file (5 GB) called my_file and a list called my_list. What is the most efficient way to read each line in the file and, if an item from my_list matches an item from a line in my_file, add to a new list called matches the items from that line in my_file AND the item from my_list where the match occurred? Here is what I am trying to do:

def calc(my_file, my_list):
    matches = []
    my_file.seek(0, 0)  # rewind in case the file was already read
    for i in my_file:
        i = i.rstrip('\n').split('\t')
        for v in my_list:  # scans the whole list for every line
            if v[1] == i[2]:
                item = v[0], i[1], i[3]
                matches.append(item)
    return matches

Here are some lines in my_file (tab-separated):

lion    4    blue    ch3
sheep   1    red     pq2
frog    9    green   xd7
donkey  2    aqua    zr8

Here are some items in my_list:

intel    yellow
amd      green
msi      aqua

The desired output, a list of lists, in the above example would be:

[['amd', 9, 'xd7'], ['msi', 2, 'zr8']]

My code currently works, albeit really slowly. Would using a generator or serialization help? Thanks.

安穩 2024-12-10 19:46:05

You could build a dictionary for looking up v. I added a few further small optimizations:

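def calc(my_file, my_list):

    # map the match key (v[1]) to the value to report (v[0])
    vd = dict((v[1], v[0]) for v in my_list)

    my_file.seek(0, 0)
    for line in my_file:
        # rstrip('\n') is safe even if the last line has no trailing newline
        f0, f1, f2, f3 = line.rstrip('\n').split('\t')
        v0 = vd.get(f2)
        if v0 is not None:
            yield (v0, f1, f3)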

This should be much faster for a large my_list.

Using get is faster than checking whether f2 is in vd and then accessing vd[f2], because it does one dictionary lookup instead of two.

For more speedup beyond these optimizations, I recommend http://www.cython.org
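One caveat with the dictionary approach: if two entries in my_list share the same key, a plain dict keeps only the last one, whereas the nested loop in the question reports both. Below is a minimal sketch that groups names by key to preserve that behaviour. The name calc_multi, the extra "ati" entry, and the rule of skipping lines with fewer than four fields are assumptions made for illustration, not part of the answer above.

```python
from collections import defaultdict
import io


def calc_multi(lines, my_list):
    """Yield (name, f1, f3) for every my_list entry whose key equals the third field."""
    # Group names by key so duplicate keys in my_list are not silently dropped.
    names_by_key = defaultdict(list)
    for name, key in my_list:
        names_by_key[key].append(name)

    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # assumption: blank or malformed lines are skipped
        # .get() avoids inserting empty lists for keys that never match
        for name in names_by_key.get(fields[2], ()):
            yield (name, fields[1], fields[3])


# Stand-in for the real file; any iterable of tab-separated lines works.
sample = io.StringIO("lion\t4\tblue\tch3\nfrog\t9\tgreen\txd7\ndonkey\t2\taqua\tzr8\n")
pairs = [("intel", "yellow"), ("amd", "green"), ("ati", "green"), ("msi", "aqua")]
result = list(calc_multi(sample, pairs))
```

Each line still costs one dictionary lookup, so the work stays proportional to the number of lines plus the number of matches rather than lines times list entries.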

过期以后 2024-12-10 19:46:05

Keep the items in a dictionary rather than a list (let's call it items). Now iterate through your file as you're doing, pick out the key to look for (i[2]), and check whether it's in items.

items would be:

items = dict(yellow="intel", green="amd", aqua="msi")

So the checking part would be:

if i[2] in items:
    yield [items[i[2]], i[1], i[3]]

Since you're just creating the list and returning it, using a generator might reduce the program's memory use: matches are produced one at a time instead of being collected into one big list before returning.
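To make the memory point concrete, here is a minimal sketch of consuming such a generator one match at a time and writing each result out immediately, so the full result set never sits in memory. The name iter_matches, the io.StringIO stand-ins for the input and output files, and the length check on each line are assumptions for illustration.

```python
import io


def iter_matches(lines, items):
    """Yield [name, f1, f3] lazily for each line whose third field is a key in items."""
    for line in lines:
        i = line.rstrip("\n").split("\t")
        if len(i) >= 4 and i[2] in items:
            yield [items[i[2]], i[1], i[3]]


items = dict(yellow="intel", green="amd", aqua="msi")
source = io.StringIO("lion\t4\tblue\tch3\nfrog\t9\tgreen\txd7\ndonkey\t2\taqua\tzr8\n")
out = io.StringIO()  # stand-in for an output file opened for writing

count = 0
for name, number, code in iter_matches(source, items):
    # Each match is written and then discarded; nothing accumulates.
    out.write("\t".join((name, number, code)) + "\n")
    count += 1
```

If the caller does need everything at once, wrapping the call in list(...) gives back the original behaviour at the original memory cost.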

冷︶言冷语的世界 2024-12-10 19:46:05

There isn't much you can do about the overhead of reading the file in, but based on your example code, you can speed up the matching by storing your list as a dict (with the target field as the key).

Here's an example, with a few extra optimisation tweaks:

mylist = {
    "yellow" : "intel",
    "green" : "amd",
    # ....
}

matches = []
for line in my_file:
    i = line.rstrip("\n").split("\t")
    try:  # easier to ask for forgiveness than permission
        matches.append([mylist[i[2]], i[1], i[3]])
    except KeyError:  # i[2] is not a key in mylist
        pass

But again, do note that once the lookup is a dict, most of the remaining time will likely be spent reading the file, so optimisation at this level may not have a big enough impact on the runtime.
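Whether reading or matching dominates is easy to check rather than guess. The sketch below times a read-only pass against a read-plus-match pass over a throwaway file; the file contents, the row count, and the variable names are all made up for illustration, and the second pass will usually benefit from the operating system's file cache, so treat the numbers as a rough guide only.

```python
import os
import tempfile
import time

lookup = {"yellow": "intel", "green": "amd", "aqua": "msi"}

# Build a small throwaway file; the row count is arbitrary.
rows = ["lion\t4\tblue\tch3\n", "frog\t9\tgreen\txd7\n", "donkey\t2\taqua\tzr8\n"]
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False) as tmp:
    tmp.writelines(rows * 10000)
    path = tmp.name

try:
    # Pass 1: iterate over the lines without doing any work on them.
    start = time.perf_counter()
    with open(path) as f:
        line_count = sum(1 for _ in f)
    read_seconds = time.perf_counter() - start

    # Pass 2: same iteration plus split and dict lookup.
    start = time.perf_counter()
    matches = []
    with open(path) as f:
        for line in f:
            i = line.rstrip("\n").split("\t")
            name = lookup.get(i[2])
            if name is not None:
                matches.append([name, i[1], i[3]])
    total_seconds = time.perf_counter() - start
finally:
    os.remove(path)

print(f"read only: {read_seconds:.3f}s, read + match: {total_seconds:.3f}s")
```

Running the same comparison on a slice of the real 5 GB file would show how much time is left to win once the nested loop is gone.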

oО清风挽发oО 2024-12-10 19:46:05

Here's a variation on @rocksportrocker's answer using the csv module:

import csv

def calc_csv(lines, lst):
    d = dict((v[1], v[0]) for v in lst) # use dict to speed up membership test
    return ((d[f2], f1, f3)
            for _, f1, f2, f3 in csv.reader(lines, dialect='excel-tab')
            if f2 in d) # assume that intersection is much less than the file

Example:

def test():
    # fields must be tab-separated; explicit \t keeps the tabs intact when copied
    my_file = [
        "lion\t4\tblue\tch3",
        "sheep\t1\tred\tpq2",
        "frog\t9\tgreen\txd7",
        "donkey\t2\taqua\tzr8",
    ]

    my_list = [
        ("intel", "yellow"),
        ("amd", "green"),
        ("msi", "aqua"),
    ]

    res = list(calc_csv(my_file, my_list))
    assert [('amd', '9', 'xd7'), ('msi', '2', 'zr8')] == res


if __name__=="__main__":
   test()
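For a real file on disk rather than a list of strings, a sketch along the same lines might look like the following. It opens the file with newline="" as the csv documentation recommends, and converts the second field to int so the result has the shape shown in the question. The function name calc_rows, the temporary file, and the assumption that the second column always holds an integer are illustrative, not part of the answer above.

```python
import csv
import os
import tempfile


def calc_rows(path, lst):
    """Return [name, int(f1), f3] for each row whose third field is a key in lst."""
    d = {key: name for name, key in lst}
    matches = []
    # newline="" lets the csv module handle line endings itself.
    with open(path, newline="") as f:
        for row in csv.reader(f, dialect="excel-tab"):
            if len(row) >= 4 and row[2] in d:
                # assumption: the second column is always an integer
                matches.append([d[row[2]], int(row[1]), row[3]])
    return matches


# Throwaway file standing in for my_file.
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False) as tmp:
    tmp.write("lion\t4\tblue\tch3\nsheep\t1\tred\tpq2\n")
    tmp.write("frog\t9\tgreen\txd7\ndonkey\t2\taqua\tzr8\n")
    path = tmp.name

try:
    rows = calc_rows(path, [("intel", "yellow"), ("amd", "green"), ("msi", "aqua")])
finally:
    os.remove(path)
```

This version collects the matches in a list, which is fine when the intersection is small; for a very large result, yielding rows as in calc_csv keeps memory flat.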
