Python 2 list comparison optimization
Given:
Two CSV files (1.8 MB each): AllData_1 and AllData_2, each with ~8,000 lines. Each line consists of 8 columns.
[txt_0, txt_1, txt_2, txt_3, txt_4, txt_5, txt_6, txt_7]
Goal:
Based on a match of txt_0 (that is, AllData_1[0] == AllData_2[0]), compare the contents of the next 4 columns for those rows. If the data is unequal, put the entire row from each data set in a list chosen by which column differs, and save the lists to an output file. If a txt_0 value is in one data set but not the other, save that row directly to the output file.
Example:
AllData_1 row x contains: [a1, b2, c3, d4, e5, f6, g7, h8]
AllData_2 row y contains: [a1, b2, c33c, d44d, e5, f6, g7, h8]
The program saves all of rows x and y in the lists corresponding to ListCol2 and ListCol3. After all comparing is finished, the lists are saved to file.
How can I make my code faster, or change it to use a faster algorithm?
i = 0
x0list = []
y0list = []
col1_diff = col2_diff = col3a_diff = col3b_diff = col4_diff = []
#create list out of column 0
for y in AllData_2:
    y0list.append(y[0])

for entry in AllData_1:
    x0list.append(entry[0])
    if entry[0] not in y0list:
        #code to save the line to file...

for y0 in AllData_2:
    if y0[0] not in x0list:
        #code to save the line to file...

for yrow in AllData_2:
    i+=1
    for xrow in AllData_1:
        foundit = 0
        if yrow[0] == xrow[0] and foundit == 0 and (yrow[1] != xrow[1] or yrow[2] != xrow[2] or yrow[3] != xrow[3] or yrow[4] != xrow[4]):
            if yrow[1] != xrow[1]:
                col1_diff.append(yrow)
                col1_diff.append(xrow)
                foundit = 1
            elif yrow[2] != xrow[2]:
                col2_diff.append(yrow)
                col2_diff.append(xrow)
                foundit = 1
            elif len(yrow[3]) < len(xrow[3]):
                col3a_diff.append(yrow)
                col3a_diff.append(xrow)
                foundit = 1
            elif len(yrow[3]) >= len(xrow[3]):
                col3b_diff.append(yrow)
                col3b_diff.append(xrow)
                foundit = 1
            else:
                #col4 is actually a catch-all for any other differences between lines if [0]s are equal
                col4_diff.append(yrow)
                col4_diff.append(xrow)
                foundit = 1
Comments (2)
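Right off the top, you can make this a lot smaller: the list-building code at the start is just a verbose way of saying something much shorter.
You can also rely on the built-in comparisons. The chain of column-by-column != tests can be expressed as a single comparison, which is much less prone to copy/paste errors.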
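The snippets this answer refers to are not shown, so the following is only a sketch of one plausible reading: it assumes the verbose part is the pair of append loops that build x0list and y0list, and that the single comparison is a slice comparison over columns 1 to 4. The sample rows are hypothetical stand-ins for the parsed CSV data.

```python
# Hypothetical sample rows standing in for the parsed CSV data.
AllData_1 = [["a1", "b2", "c3", "d4", "e5", "f6", "g7", "h8"]]
AllData_2 = [["a1", "b2", "c33c", "d44d", "e5", "f6", "g7", "h8"]]

# A list comprehension replaces the append loop that collects column 0.
x0list = [row[0] for row in AllData_1]
y0list = [row[0] for row in AllData_2]

xrow, yrow = AllData_1[0], AllData_2[0]

# Comparing slices checks columns 1 through 4 in one expression ...
differs = yrow[1:5] != xrow[1:5]

# ... and agrees with the chained per-column test from the question.
chained = (yrow[1] != xrow[1] or yrow[2] != xrow[2]
           or yrow[3] != xrow[3] or yrow[4] != xrow[4])
```

The slice form names the column range once, so there is no index to mistype when the range changes.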
To make it faster, you can avoid doing O(n**2) comparisons. Since you only care when the first column element is the same, you can just bundle them by first element.
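A minimal sketch of bundling by first element, assuming a defaultdict keyed on column 0 is an acceptable way to do it. The helper name bundle_by_first_column and the sample rows are hypothetical and do not come from the answer.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for the parsed CSV data.
AllData_1 = [
    ["a1", "b2", "c3", "d4", "e5", "f6", "g7", "h8"],
    ["k2", "b2", "c3", "d4", "e5", "f6", "g7", "h8"],
]
AllData_2 = [
    ["a1", "b2", "c33c", "d44d", "e5", "f6", "g7", "h8"],
    ["k3", "b2", "c3", "d4", "e5", "f6", "g7", "h8"],
]


def bundle_by_first_column(rows_1, rows_2):
    """Group rows from both data sets under their column-0 value."""
    # Each key maps to a pair: (rows from set 1, rows from set 2).
    bundles = defaultdict(lambda: ([], []))
    for row in rows_1:
        bundles[row[0]][0].append(row)
    for row in rows_2:
        bundles[row[0]][1].append(row)
    return bundles


bundles = bundle_by_first_column(AllData_1, AllData_2)

# Keys seen in only one data set, and keys seen in both.
only_in_1 = sorted(key for key, (a, b) in bundles.items() if a and not b)
only_in_2 = sorted(key for key, (a, b) in bundles.items() if b and not a)
in_both = sorted(key for key, (a, b) in bundles.items() if a and b)
```

Each row is visited once, so the work grows linearly with the number of rows instead of with the product of the two file sizes.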
If you can expect no two lines in a given file to have the same data in column 0, you can significantly improve your code with a few dicts. Instead of the lines that fill x0list and y0list, you would fill x0dict and y0dict, keyed on column 0, after initializing x0dict and y0dict to {}. Then, instead of looping through both complete sets of data again, you can loop over just one of the dicts. As a bonus, the not in in your second and third loops works the same.
One line can also be replaced with a nicer-looking equivalent.
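The answer's own snippets are not shown, so this is a sketch of how the dict approach might look, assuming each dict maps a column-0 value to its whole row and that keys are unique within each file. The names only_in_1, only_in_2 and changed, and the sample rows, are hypothetical.

```python
# Hypothetical sample rows standing in for the parsed CSV data.
AllData_1 = [
    ["a1", "b2", "c3", "d4", "e5", "f6", "g7", "h8"],
    ["k2", "b2", "c3", "d4", "e5", "f6", "g7", "h8"],
]
AllData_2 = [
    ["a1", "b2", "c33c", "d44d", "e5", "f6", "g7", "h8"],
    ["k3", "b2", "c3", "d4", "e5", "f6", "g7", "h8"],
]

# Index each data set by column 0 (assumes keys are unique per file).
x0dict = {}
for entry in AllData_1:
    x0dict[entry[0]] = entry
y0dict = {}
for y in AllData_2:
    y0dict[y[0]] = y

# "not in" reads the same as before, but a dict lookup is a hash
# lookup rather than a scan of the whole list.
only_in_1 = [row for row in AllData_1 if row[0] not in y0dict]
only_in_2 = [row for row in AllData_2 if row[0] not in x0dict]

# One loop over a single dict finds every matching pair of rows.
changed = []
for key, xrow in x0dict.items():
    yrow = y0dict.get(key)
    if yrow is not None and yrow[1:5] != xrow[1:5]:
        changed.append((yrow, xrow))
```

If column 0 can repeat within a file, a later row would silently overwrite an earlier one in the dict, so the uniqueness assumption matters here.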
As your code stands right now, i is never used, but if you need that count, it ends up being identical to just saying i = len(AllData_2), since it only increments once per iteration of the loop over AllData_2.
Finally, your foundit variable currently serves no purpose. It is only used to control the flow with foundit == 0, immediately after being set to 0, so that test always evaluates to True and setting foundit has no effect.
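A small sketch that checks both observations on hypothetical, shortened sample rows, then shows the same loop with the counter and the flag dropped. The closing aside on chained assignment is not part of the answer; it is a general Python fact that appears relevant to the col1_diff = col2_diff = ... = [] line in the question.

```python
# Hypothetical, shortened sample rows (5 columns instead of 8).
AllData_1 = [["a1", "b2", "c3", "d4", "e5"], ["k2", "b", "c", "d", "e"]]
AllData_2 = [["a1", "b2", "c33c", "d44d", "e5"], ["k2", "b", "c", "d", "e"]]

# Original pattern: foundit is reset immediately before it is tested.
i = 0
flag_checks = []
for yrow in AllData_2:
    i += 1
    for xrow in AllData_1:
        foundit = 0
        flag_checks.append(foundit == 0)  # records the test result

# Same matching logic without the counter or the flag.
mismatched = []
for yrow in AllData_2:
    for xrow in AllData_1:
        if yrow[0] == xrow[0] and yrow[1:5] != xrow[1:5]:
            mismatched.append((yrow, xrow))
count = len(AllData_2)

# Aside (not from the answer): chained assignment binds every name to
# the same list object, so separate lists need separate [] literals.
shared_a = shared_b = []
shared_a.append("row")
separate_a, separate_b = [], []
separate_a.append("row")
```

Here every entry in flag_checks is True and i equals count, which matches what the answer describes.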