Fastest way to read a CSV into a list of tuples with a condition/filter and column type conversion? (Python)

Asked 2025-01-09 10:35:23


I need to read a CSV into a list of tuples while filtering rows on a value (>= 0.75) and converting the columns to different types.
Please note: you cannot use pandas. NO PANDAS.

I'm trying to figure out the FASTEST way to do this.

This is how I did it (but I think it is not efficient):

from csv import reader
from datetime import datetime
import timeit

def load_csv_to_list(path):
    with open(path) as csv_file:
        table = list(reader(csv_file))
    lst = [table[0]]  # keep the header row
    for row in table[1:]:
        if float(row[2]) >= 0.75:
            date = datetime.strptime(row[0], "%d/%m/%Y").strftime("%d/%m/%Y")
            row = (date, int(row[1]), float(row[2]))
            lst.append(row)
    return lst

start = timeit.default_timer()
load_csv_to_list(path)
end = timeit.default_timer()
print(end - start)

Output: 0.00013872199997422285
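A single wall-clock run like this is noisy. As a rough sketch (assuming `path` points at the CSV file), `timeit` can also call the function repeatedly and report the total elapsed time:

import timeit

# Time 10 full loads and print the total elapsed seconds
print(timeit.timeit(lambda: load_csv_to_list(path), number=10))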


烟燃烟灭 answered 2025-01-16 10:35:23


The original code performs the same float(row[2]) conversion twice. In my testing, assigning the converted value to a variable and reusing it later gives a slight performance gain. Utilising the walrus operator :=, introduced in Python 3.8, gives a further improvement. Batch processing or memory-mapping the data file gives the best performance.

from csv import reader
from datetime import datetime

def load_variable(path):
    with open(path) as csv_file:
        table = list(reader(csv_file))
    lst = [table[0]]
    for row in table[1:]:
        float_two = float(row[2])
        if float_two >= 0.75:
            date = datetime.strptime(row[0], "%d/%m/%Y").strftime("%d/%m/%Y")
            row = (date, int(row[1]), float_two)
            lst.append(row)
    return lst

def load_walrus(path):
    with open(path) as csv_file:
        table = list(reader(csv_file))
    lst = [table[0]]
    for row in table[1:]:
        if (float_two := float(row[2])) >= 0.75:
            date = datetime.strptime(row[0], "%d/%m/%Y").strftime("%d/%m/%Y")
            row = (date, int(row[1]), float_two)
            lst.append(row)
    return lst

Timings to load a csv file with 1,000,000 rows:

Function Name    | Fastest | Slowest | Average |
load_csv_to_list | 6.36s   | 6.69s   | 6.47s   |
load_variable    | 6.10s   | 6.65s   | 6.44s   |
load_walrus      | 5.95s   | 6.57s   | 6.29s   |
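(The exact benchmark harness isn't shown here; below is a sketch of one way such fastest/slowest/average figures could be collected, assuming the loaders above are defined and sample_data.csv is the file written by the generator at the end of this answer.)

import timeit

for func in (load_csv_to_list, load_variable, load_walrus):
    # Each repeat performs one full load; collect five samples per function
    runs = timeit.repeat(lambda: func("sample_data.csv"), number=1, repeat=5)
    print(f"{func.__name__:16} | {min(runs):.2f}s | {max(runs):.2f}s | {sum(runs) / len(runs):.2f}s")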

As a further experiment I implemented a function to batch process the data.

def batch_walrus(path, batch_size=1000):
    lst = []
    with open(path) as csv_file:
        csv_reader = reader(csv_file)
        header = next(csv_reader)  # Read the header
        lst.append(header)  # Add the header to the result list
        batch = []
        for row in csv_reader:
            # Check the condition and convert the date
            if (two := float(row[2])) >= 0.75:
                date = datetime.strptime(row[0], "%d/%m/%Y").strftime("%d/%m/%Y")
                batch.append((date, int(row[1]), two))
            # Flush the batch into the result list once it is full
            if len(batch) == batch_size:
                lst.extend(batch)
                batch = []
        if batch:  # Flush the final partial batch left over at end of file
            lst.extend(batch)
    return lst

Updated timing information:

Function Name    | Fastest | Slowest | Average |
load_csv_to_list | 6.36s   | 6.69s   | 6.47s   |
load_variable    | 6.10s   | 6.65s   | 6.44s   |
load_walrus      | 5.95s   | 6.57s   | 6.29s   |
batch_walrus     | 5.69s   | 5.89s   | 5.79s   |

Python's mmap module provides memory-mapped file I/O. It takes advantage of lower-level operating system functionality to read files as if they were one large string/array. This version of the function decodes the mmapped_file content into a string using decode("utf-8") before creating the csv.reader.

from csv import reader
from datetime import datetime
import mmap

def load_mmap_walrus(path):
    lst = []
    with open(path, "r") as csv_file:
        # Memory-map the file; length 0 means map the entire file
        with mmap.mmap(csv_file.fileno(), 0, access=mmap.ACCESS_READ) as mmapped_file:
            # Decode the bytes-like object to a string; the with block
            # closes the memory map once the content has been read
            content = mmapped_file.read().decode("utf-8")

        # Create a CSV reader from the decoded string
        csv_reader = reader(content.splitlines())

        header = next(csv_reader)  # Read the header
        lst.append(header)  # Add the header to the result list

        for row in csv_reader:
            # Check the condition and convert the date
            if (two := float(row[2])) >= 0.75:
                date = datetime.strptime(row[0], "%d/%m/%Y").strftime("%d/%m/%Y")
                lst.append((date, int(row[1]), two))

    return lst

Updated timing information:

Function Name    | Fastest | Slowest | Average |
load_csv_to_list | 6.36s   | 6.69s   | 6.47s   |
load_variable    | 6.10s   | 6.65s   | 6.44s   |
load_walrus      | 5.95s   | 6.57s   | 6.29s   |
batch_walrus     | 5.69s   | 5.89s   | 5.79s   |
load_mmap_walrus | 5.49s   | 5.68s   | 5.57s   |
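Since all five variants are meant to be interchangeable, it's worth a quick sanity check that they return identical output before comparing timings. A minimal sketch, assuming sample_data.csv exists:

loaders = (load_csv_to_list, load_variable, load_walrus, batch_walrus, load_mmap_walrus)
results = [load("sample_data.csv") for load in loaders]
# Every variant should produce exactly the same list of tuples
assert all(result == results[0] for result in results)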

Code used to generate 1,000,000 rows of csv data:

import csv
import random
from datetime import datetime, timedelta

# Function to generate a random date within a range
def random_date(start_date, end_date):
    delta = end_date - start_date
    random_days = random.randint(0, delta.days)
    return start_date + timedelta(days=random_days)

# Generate sample data
start_date = datetime(2000, 1, 1)
end_date = datetime(2023, 12, 31)

with open("sample_data.csv", "w", newline="") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["Date", "Integer", "Float"])
    for _ in range(1_000_000):
        date = random_date(start_date, end_date).strftime("%d/%m/%Y")
        integer = random.randint(0, 100)
        float_num = round(random.uniform(0, 1), 2)
        writer.writerow([date, integer, float_num])