Python 2 list comparison optimization

Posted on 2024-11-29 05:44:02

Given:
Two csv files (1.8 MB each): AllData_1, AllData_2. Each with ~8,000 lines. Each line consists of 8 columns.
[txt_0,txt_1,txt_2,txt_3,txt_4,txt_5,txt_6,txt_7,txt_8]

Goal:
Based on a match of txt_0 (or, AllData_1[0] == AllData_2 ), compare the contents of the next 4 columns for these individual rows. If the data is unequal, put the entire row from each data set in a list chosen by which column differs, and save the lists to an output file. If a txt_0 value is in one data set but not the other, save that row directly to the output file.

Example:

AllData_1 row x contains: [a1, b2, c3, d4, e5, f6, g7, h8]
AllData_2 row y contains: [a1, b2, c33c, d44d, e5, f6, g7, h8]

Program saves all of row x and y in lists corresponding to ListCol2 and ListCol3. After all comparing is finished, the lists are saved to file.

How can I make my code faster or change my code to a faster algorithm?

i = 0
x0list = []
y0list = []
# separate lists; chaining "= []" would bind every name to one shared list
col1_diff, col2_diff, col3a_diff, col3b_diff, col4_diff = [], [], [], [], []

#create list out of column 0
for y in AllData_2:
    y0list.append(y[0])

for entry in AllData_1:
    x0list.append(entry[0])
    if entry[0] not in y0list:
        pass  #code to save the line to file...

for y0 in AllData_2:
    if y0[0] not in x0list:
        pass  #code to save the line to file...

for yrow in AllData_2:
    i+=1

    for xrow in AllData_1:
        foundit = 0
        if yrow[0] == xrow[0] and foundit == 0 and (yrow[1] != xrow[1] or yrow[2] != xrow[2] or yrow[3] != xrow[3] or yrow[4] != xrow[4]):
            if yrow[1] != xrow[1]:
                col1_diff.append(yrow)
                col1_diff.append(xrow)
                foundit = 1

            elif yrow[2] != xrow[2]:
                col2_diff.append(yrow)
                col2_diff.append(xrow)
                foundit = 1

            elif len(yrow[3]) < len(xrow[3]):
                col3a_diff.append(yrow)
                col3a_diff.append(xrow)
                foundit = 1

            elif len(yrow[3]) >= len(xrow[3]):
                col3b_diff.append(yrow)
                col3b_diff.append(xrow)
                foundit = 1

            else:
                #col4 is actually a catch-all for any other differences between lines if [0]s are equal
                col4_diff.append(yrow)
                col4_diff.append(xrow)
                foundit = 1


Comments (2)

傲世九天 2024-12-06 05:44:02

Right off the top, you can make this a lot smaller.

y0list = []
for y in AllData_2:
    y0list.append(y[0])

is just a verbose way of saying

y0list = [y[0] for y in AllData_2]

You can also use the built-in list comparisons.
The expression below

(yrow[1] != xrow[1] or yrow[2] != xrow[2] or yrow[3] != xrow[3] or yrow[4] != xrow[4])

can be expressed as

yrow[1:5] != xrow[1:5]

which is much less prone to copy/paste errors. The slice stops at index 5 so that only columns 1 to 4 are compared, as in the original test.
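As a quick check of that equivalence, here is a small sketch using the two example rows from the question. The third row, zrow, is made up for illustration: it differs from xrow only in column 6, which shows why the slice has to end at 5 rather than run to the end of the row.

```python
# Example rows from the question, plus a hypothetical zrow that differs
# from xrow only in column 6 (outside the compared range).
xrow = ["a1", "b2", "c3", "d4", "e5", "f6", "g7", "h8"]
yrow = ["a1", "b2", "c33c", "d44d", "e5", "f6", "g7", "h8"]
zrow = ["a1", "b2", "c3", "d4", "e5", "f6", "CHANGED", "h8"]

# Element-by-element test, as written in the question
long_form = (yrow[1] != xrow[1] or yrow[2] != xrow[2]
             or yrow[3] != xrow[3] or yrow[4] != xrow[4])

# Slice test: lists compare element by element, so this is the same check
short_form = yrow[1:5] != xrow[1:5]

# zrow matches xrow in columns 1 to 4, but an open-ended slice still flags it
same_in_1_to_4 = zrow[1:5] == xrow[1:5]
open_slice_differs = zrow[1:] != xrow[1:]
```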

To make it faster, you can avoid doing O(n**2) comparisons. Since you only care when the first column element is the same, you can just bundle them by first element.

index = {}
for yrow in AllData_2:
    # group the AllData_2 rows by their first column
    index.setdefault(yrow[0], []).append(yrow)

for xrow in AllData_1:
    # rows of AllData_2 sharing this key (an empty list if there are none)
    for yrow in index.get(xrow[0], []):
        pass  # Do all your comparison here
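
For a version that runs end to end, the grouping idea could be sketched as below. The function names group_by_key and changed_pairs and the sample rows are made up for illustration, and the comparison is assumed to cover columns 1 to 4 only.

```python
from collections import defaultdict


def group_by_key(rows):
    """Map each column-0 value to the list of rows that carry it."""
    index = defaultdict(list)
    for row in rows:
        index[row[0]].append(row)
    return index


def changed_pairs(all_data_1, all_data_2):
    """Return (xrow, yrow) pairs with equal column 0 but different columns 1-4."""
    index = group_by_key(all_data_2)
    pairs = []
    for xrow in all_data_1:
        # .get() does not create an entry for a missing key
        for yrow in index.get(xrow[0], []):
            if yrow[1:5] != xrow[1:5]:
                pairs.append((xrow, yrow))
    return pairs


# Made-up sample rows, not taken from the real files
all_data_1 = [["a1", "b2", "c3", "d4", "e5"], ["k2", "b", "c", "d", "e"]]
all_data_2 = [["a1", "b2", "c33c", "d44d", "e5"], ["k2", "b", "c", "d", "e"]]
pairs = changed_pairs(all_data_1, all_data_2)
```

Each row of either file is visited once, and each key lookup is a hash lookup, so the work grows roughly linearly with the number of rows instead of with their product.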

夏雨凉 2024-12-06 05:44:02

If you can expect no two lines in a given file to have the same data in column 0, you can significantly improve your code with a few dicts. Instead of the lines

x0list.append(entry[0])
y0list.append(y[0])

You would use:

x0dict[entry[0]] = entry
y0dict[y[0]] = y

after initializing x0dict and y0dict to {}. Then, instead of looping through both complete sets of data again, you can loop over just one of the dicts:

# iterate over (key, row) pairs; a bare "in x0dict" would yield keys only
for x0, xrow in x0dict.items():
    if x0 in y0dict:
        yrow = y0dict[x0]
        # Do the col{1,2,3,4}_diff stuff here

As a bonus, the "not in" tests in your second and third loops work unchanged against the dicts, and each one becomes a hash lookup instead of a scan through a list.
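
Put together, the dict approach might look like the sketch below. It assumes column 0 is unique within each file, as stated above; the function name split_by_key and the sample rows are invented for the example.

```python
def split_by_key(all_data_1, all_data_2):
    """Split rows into only-in-1, only-in-2 and matched (xrow, yrow) pairs.

    Assumes no two rows in the same data set share a column-0 value.
    """
    x0dict = dict((row[0], row) for row in all_data_1)
    y0dict = dict((row[0], row) for row in all_data_2)

    # Walk the original lists so the output keeps the input order
    only_in_1 = [row for row in all_data_1 if row[0] not in y0dict]
    only_in_2 = [row for row in all_data_2 if row[0] not in x0dict]
    matched = [(row, y0dict[row[0]]) for row in all_data_1 if row[0] in y0dict]
    return only_in_1, only_in_2, matched


# Made-up sample rows, not taken from the real files
all_data_1 = [["a1", "b2", "c3", "d4", "e5"], ["only1", "b", "c", "d", "e"]]
all_data_2 = [["a1", "b2", "c33c", "d44d", "e5"], ["only2", "b", "c", "d", "e"]]
only_in_1, only_in_2, matched = split_by_key(all_data_1, all_data_2)
```

The two "only in" lists are the rows to write straight to the output file, and matched holds the pairs that still need the column-by-column comparison.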


The line

(yrow[1] != xrow[1] or yrow[2] != xrow[2] or yrow[3] != xrow[3] or yrow[4] != xrow[4])

can be replaced with the nicer-looking

yrow[1:5] != xrow[1:5]

As your code stands right now, i is never used, but if you need that count, it ends up being identical to just saying i = len(AllData_2), since it only increments once per run in a loop over AllData_2.


Finally, your foundit variable currently serves no purpose. It is reset to 0 at the top of every pass through the inner loop, immediately before the foundit == 0 test, so that test is always True and setting foundit to 1 has no effect.
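
If the intent behind foundit was to file each matching pair under exactly one list, moving the choice into a small function makes that explicit without a flag. The sketch below is a guess at that intent, not a drop-in replacement: the function name classify and the diffs dict are made up, and it tests column 3 for inequality before comparing lengths. In the posted code the two length tests together cover every case, so the final else branch can never run.

```python
def classify(xrow, yrow):
    """Return the bucket name for two rows with equal column 0, or None."""
    if yrow[1:5] == xrow[1:5]:
        return None  # columns 1 to 4 agree, nothing to record
    if yrow[1] != xrow[1]:
        return "col1"
    if yrow[2] != xrow[2]:
        return "col2"
    if yrow[3] != xrow[3]:
        # split column-3 differences by length, as in the question
        return "col3a" if len(yrow[3]) < len(xrow[3]) else "col3b"
    return "col4"  # only column 4 differs


# One independent list per bucket
diffs = {"col1": [], "col2": [], "col3a": [], "col3b": [], "col4": []}

# Example rows from the question
xrow = ["a1", "b2", "c3", "d4", "e5", "f6", "g7", "h8"]
yrow = ["a1", "b2", "c33c", "d44d", "e5", "f6", "g7", "h8"]

bucket = classify(xrow, yrow)
if bucket is not None:
    diffs[bucket].extend([yrow, xrow])
```

Because the function returns at the first difference, each pair lands in one bucket only, which is what the if/elif chain in the question already does.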
