pyarrow read_csv - how to fill trailing optional columns with nulls
I can't find an option or workaround for this using pyarrow.csv.read_csv, and there are many other reasons why using pandas doesn't work for us.
We have csv files with final columns that are effectively optional, and the source data doesn't always include empty cells for them, for example:
name,date,serial_number,prior_name,comments
A,2021-01-01,1234
B,2021-01-02,1235,A,Name changed for new version
C,2021-01-02,1236,B
This fails with an error like pyarrow.lib.ArrowInvalid: CSV parse error: Expected 5 columns, got 3:
I've got to assume that pyarrow can handle this, but I can't see how. Even the invalid row handler doesn't appear to let me return the "appropriate" value, only to "skip" these rows. That would even be okay if I could save them and append later, but as arrow tables are immutable, it just seems like there should be a more straightforward way to handle these cases.
As Pace noted, this is unfortunately not presently available in pyarrow. We actually process the "csv" files on the fly after extracting from a zip, so creating intermediate files wasn't an option.
For anyone else needing a quick-ish way to handle this, I was able to get around it (and a couple of other issues) by creating a class to wrap the stream and override read(size, *args, **kwargs) to quickly perform the stripping. Even with the middleman class, it's faster than attempting to load in pandas (and there are several other reasons why we aren't using pandas here).
Here's a template example:
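A sketch of such a wrapper (the name ColumnPaddingStream and the details are illustrative; the naive delimiter counting assumes no quoted fields that themselves contain commas):

```python
import io


class ColumnPaddingStream(io.RawIOBase):
    """Wrap a binary CSV stream, padding short rows with trailing
    delimiters so every row has num_columns fields."""

    def __init__(self, raw, num_columns, delimiter=b","):
        self._raw = raw
        self._min_delims = num_columns - 1
        self._delimiter = delimiter
        self._buffer = b""

    def readable(self):
        return True

    def _pad(self, line):
        # Append the missing delimiters before the line ending.
        body = line.rstrip(b"\r\n")
        eol = line[len(body):]
        missing = self._min_delims - body.count(self._delimiter)
        return body + self._delimiter * max(missing, 0) + eol

    def read(self, size=-1, *args, **kwargs):
        # Pull whole lines from the wrapped stream, pad each one, and
        # serve the requested number of bytes out of an internal buffer.
        while size < 0 or len(self._buffer) < size:
            line = self._raw.readline()
            if not line:
                break
            self._buffer += self._pad(line)
        if size < 0:
            size = len(self._buffer)
        out, self._buffer = self._buffer[:size], self._buffer[size:]
        return out
```

Subclassing io.RawIOBase keeps the object file-like enough for consumers that probe readable()/closed in addition to calling read().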
This allows our call to be: