Returning a generator in a with statement

Posted on 2025-02-09 18:43:12

I wanted to create a wrapper function around pandas.read_csv to change the default separator and format the file in a specific way. This is the code I had:

def custom_read(path, sep="|", **kwargs):
    if not kwargs.get("chunksize", False):
        df_ = pd.read_csv(path, sep=sep, **kwargs)
        return format_df(df_, path)
    else:
        with pd.read_csv(path, sep=sep, **kwargs) as reader:
            return (format_df(chunk, path) for chunk in reader)

It turns out that this segfaults when used like so:

L = [chunk.iloc[:10, :] for chunk in custom_read(my_file)]

From what I understood from the backtrace, the generator is created, then the file is closed, and the segfault happens when the generator tries to read from the now-closed file.
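The same ordering can be reproduced without pandas. The following is only a minimal sketch, assuming a plain text file stands in for the pandas reader; `read_upper_broken` and `demo` are hypothetical names used for illustration. A plain file object raises a ValueError here instead of crashing, but the sequence of events is the one described above: the generator expression is built lazily, the with block closes the file on return, and the first read happens afterwards.

```python
import os
import tempfile


def read_upper_broken(path):
    # Hypothetical stand-in for the wrapper: the generator expression is
    # created here, but no line is read until the caller iterates over it.
    with open(path) as f:
        return (line.strip().upper() for line in f)
    # Returning leaves the with block, so f is already closed
    # by the time the caller receives the generator.


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.txt")
        with open(path, "w") as f:
            f.write("a\nb\nc\n")
        gen = read_upper_broken(path)
        try:
            # The first read happens here, on a closed file.
            return list(gen)
        except ValueError as exc:
            return f"ValueError: {exc}"


if __name__ == "__main__":
    print(demo())
```

So the with statement is doing exactly what it promises; the issue is that the lazy generator outlives the block that owns the resource.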

I could avoid the segfault with a minor refactoring:

def custom_read(path, sep="|", **kwargs):
    if not kwargs.get("chunksize", False):
        df_ = pd.read_csv(path, sep=sep, **kwargs)
        return format_df(df_, path)
    else:
        reader = pd.read_csv(path, sep=sep, **kwargs)
        return (format_df(chunk, path) for chunk in reader)

I couldn't find anything on the particular use case of generators in with clauses. Is it something to avoid? Is this not supposed to work, or is this a bug of some kind?

Is there a way to avoid this error but still use the encouraged with statement?

许一世地老天荒 2025-02-16 18:43:12

You could use a generator function that contains the with block itself, which keeps the file open while the generator is being consumed. See the following example:

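def lines_format(lines):
    # Wrap each stripped line in asterisks and join them into one page.
    return "\n".join(f"*{line.strip()}*" for line in lines)

def chunk_gen(file, chunksize):
    # The with block lives inside the generator, so the file stays open
    # until the generator is exhausted or closed.
    with open(file, mode='r') as f:
        while True:
            # chunksize is a size hint for readlines, not a line count.
            lines = f.readlines(chunksize)
            if not lines:
                break
            yield lines_format(lines)

def get_formatted_pages(file, chunksize=0):
    if chunksize > 0:
        return chunk_gen(file, chunksize)
    else:
        # No chunking: everything is read eagerly, so the file can close here.
        with open(file, mode='r') as f:
            lines = f.readlines()
            return [lines_format(lines)]

# Text mode already translates "\n" to the platform line ending on write.
with open("abc.txt", mode='w') as f:
    f.write("\n".join('abc'))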
    
pages = get_formatted_pages("abc.txt")
for i, page in enumerate(pages, start=1):
    print(f"Page {i}")
    print(page)
    
pages = get_formatted_pages("abc.txt", chunksize=2)
for i, page in enumerate(pages, start=1):
    print(f"Page {i}")
    print(page)

Edit:
In your pandas.read_csv use case, this would look like:

import pandas as pd

df = pd.DataFrame({'char': list('abc'), "num": range(3)})
df.to_csv('abc.csv')

def format_df(df):
    # do something
    df['char'] = df['char'].str.capitalize()
    return df

def gen_chunk(file, chunksize):
    # The reader stays open while the generator is suspended at yield.
    with pd.read_csv(file, chunksize=chunksize, index_col=0) as reader:
        for chunk in reader:
            yield format_df(chunk)

def get_formatted_pages(file, chunksize=0):
    if chunksize > 0:
        return gen_chunk(file, chunksize)
    else:
        return [format_df(pd.read_csv(file, index_col=0))]
    
list(get_formatted_pages('abc.csv', chunksize=2))
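One thing to keep in mind with this approach: the with block inside a generator function only exits once the generator is exhausted or explicitly closed. The sketch below illustrates this with a plain text file rather than a pandas reader; `iter_lines`, `demo`, and the `state` dictionary are hypothetical names introduced only to observe the file object from outside the generator.

```python
import os
import tempfile


def iter_lines(path, state):
    # The with block is part of the generator body, so it is entered on
    # the first next() call and exited when the generator finishes.
    with open(path) as f:
        state["file"] = f  # expose the file object for inspection only
        for line in f:
            yield line.strip()


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.txt")
        with open(path, "w") as f:
            f.write("a\nb\nc\n")
        state = {}
        gen = iter_lines(path, state)
        first = next(gen)
        # The generator is suspended at yield; the file is still open.
        open_while_suspended = not state["file"].closed
        # close() raises GeneratorExit at the yield, which unwinds the
        # with block and closes the file.
        gen.close()
        closed_after_close = state["file"].closed
        return first, open_while_suspended, closed_after_close


if __name__ == "__main__":
    print(demo())
```

If a caller may stop iterating early, wrapping the generator in contextlib.closing or calling its close() method releases the file promptly instead of waiting for garbage collection.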