Fastest way to read a CSV into a list of tuples with a condition/filter and column type assignment? (Python)
I need to read a CSV into a list of tuples while conditioning the list on a value (>= 0.75) and converting the columns to different types.
Please note: you cannot use pandas. NO PANDAS.
I'm trying to figure out the fastest way to do this.
This is how I did it (but I think it is not efficient):
from csv import reader
from datetime import datetime
import timeit

def load_csv_to_list(path):
    with open(path) as csv_file:
        table = list(reader(csv_file))
    lst = [table[0]]  # keep the header row
    for row in table[1:]:
        if float(row[2]) >= 0.75:
            date = datetime.strptime(row[0], "%d/%m/%Y").strftime("%d/%m/%Y")
            lst.append((date, int(row[1]), float(row[2])))
    return lst

start = timeit.default_timer()  # wall-clock timer around the call
load_csv_to_list(path)          # 'path' is the csv file path
end = timeit.default_timer()
print(end - start)
answer : 0.00013872199997422285
Comments (1)
The original code performs the same float(row[2]) conversion twice. In my testing, assigning the converted value to a variable and reusing it later gives a slight performance gain. Using the walrus operator :=, introduced in Python 3.8, gives a further improvement, and using batch processing or memory-mapping the data file gives the best performance. Timings were compared by loading a csv file with 1,000,000 rows.
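As a minimal sketch of the walrus-operator version described here (the function name and the date round-trip reused from the question are assumptions, not the answer's original code):

from csv import reader
from datetime import datetime

def load_csv_to_list_walrus(path):
    with open(path) as csv_file:
        rows = reader(csv_file)
        lst = [next(rows)]  # header row
        for row in rows:
            # convert once, test the condition, and reuse the converted value
            if (value := float(row[2])) >= 0.75:
                date = datetime.strptime(row[0], "%d/%m/%Y").strftime("%d/%m/%Y")
                lst.append((date, int(row[1]), value))
    return lst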
As a further experiment, I implemented a function to batch-process the data; its timing was measured on the same 1,000,000-row file.
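The exact batching strategy isn't spelled out above, so the following is only one plausible sketch that processes the reader in fixed-size chunks (the batch size, the itertools.islice approach, and the function name are assumptions):

from csv import reader
from datetime import datetime
from itertools import islice

def load_csv_to_list_batched(path, batch_size=10_000):
    with open(path) as csv_file:
        rows = reader(csv_file)
        lst = [next(rows)]  # header row
        while True:
            batch = list(islice(rows, batch_size))  # pull the next chunk of rows
            if not batch:
                break
            for row in batch:
                if (value := float(row[2])) >= 0.75:
                    date = datetime.strptime(row[0], "%d/%m/%Y").strftime("%d/%m/%Y")
                    lst.append((date, int(row[1]), value))
    return lst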
Python's mmap module provides memory-mapped file I/O. It takes advantage of lower-level operating system functionality to read files as if they were one large string/array. This version of the function decodes the mmapped_file content into a string using decode("utf-8") before creating the csv.reader; its timing was measured on the same file.
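Based on the description above (the mmapped_file name and the decode("utf-8") step), a sketch of this version might look like the following; the ACCESS_READ mapping, splitting the decoded text into lines, and the filter loop reused from the question are assumptions:

import mmap
from csv import reader
from datetime import datetime

def load_csv_to_list_mmap(path):
    with open(path, "rb") as csv_file:
        # map the whole file into memory and decode it to one large string
        with mmap.mmap(csv_file.fileno(), 0, access=mmap.ACCESS_READ) as mmapped_file:
            text = mmapped_file.read().decode("utf-8")
    rows = reader(text.splitlines())
    lst = [next(rows)]  # header row
    for row in rows:
        if (value := float(row[2])) >= 0.75:
            date = datetime.strptime(row[0], "%d/%m/%Y").strftime("%d/%m/%Y")
            lst.append((date, int(row[1]), value))
    return lst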
Code used to generate 1,000,000 rows of csv data:
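The generator script itself is not included here; a minimal sketch that writes a header plus 1,000,000 rows in the format the question expects (a dd/mm/YYYY date, an integer, and a float in [0, 1); header names and value ranges are assumptions) could be:

import csv
import random
from datetime import date, timedelta

def generate_csv(path, n_rows=1_000_000):
    start_date = date(2020, 1, 1)  # arbitrary starting date for the sample data
    with open(path, "w", newline="") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(["date", "count", "value"])  # header names are assumptions
        for i in range(n_rows):
            day = start_date + timedelta(days=i % 365)
            writer.writerow([day.strftime("%d/%m/%Y"),
                             random.randint(0, 100),
                             round(random.random(), 4)])

generate_csv("data.csv")  # illustrative output filename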