pyarrow read_csv - how to fill trailing optional columns with nulls
I can't find an option or workaround for this using pyarrow.csv.read_csv, and there are many other reasons why using pandas doesn't work for us.
We have csv files with final columns that are effectively optional, and the source data doesn't always include empty cells for them, for example:
name,date,serial_number,prior_name,comments
A,2021-01-01,1234
B,2021-01-02,1235,A,Name changed for new version
C,2021-01-02,1236,B
This fails with an error like pyarrow.lib.ArrowInvalid: CSV parse error: Expected 5 columns, got 3:
I've got to assume that pyarrow can handle this, but I can't see how. Even the invalid row handler doesn't appear to let me return the "appropriate" value, only to "skip" these rows. That would even be okay if I could save them and append later, but as arrow tables are immutable, it just seems like there should be a more straightforward way to handle these cases.
As Pace noted, this is unfortunately not presently available in pyarrow. We actually process the "csv" files on the fly after extracting from a zip, so creating intermediate files wasn't an option.
For anyone else needing a quick-ish way to handle this, I was able to get around it (and a couple of other issues) by creating a class to wrap the stream and override read(size, *args, **kwargs) to quickly perform the stripping. Even with the middleman class, it's faster than attempting to load in pandas (and there are several other reasons why we aren't using pandas here).
Here's a template example:
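A sketch of such a wrapper (the name ColumnPaddingStream and the details are illustrative; the naive delimiter counting assumes no quoted fields that themselves contain commas):

```python
import io


class ColumnPaddingStream(io.RawIOBase):
    """Wrap a binary CSV stream, padding short rows with trailing
    delimiters so every row has num_columns fields."""

    def __init__(self, raw, num_columns, delimiter=b","):
        self._raw = raw
        self._min_delims = num_columns - 1
        self._delimiter = delimiter
        self._buffer = b""

    def readable(self):
        return True

    def _pad(self, line):
        # Append the missing delimiters before the line ending.
        body = line.rstrip(b"\r\n")
        eol = line[len(body):]
        missing = self._min_delims - body.count(self._delimiter)
        return body + self._delimiter * max(missing, 0) + eol

    def read(self, size=-1, *args, **kwargs):
        # Pull whole lines from the wrapped stream, pad each one, and
        # serve the requested number of bytes out of an internal buffer.
        while size < 0 or len(self._buffer) < size:
            line = self._raw.readline()
            if not line:
                break
            self._buffer += self._pad(line)
        if size < 0:
            size = len(self._buffer)
        out, self._buffer = self._buffer[:size], self._buffer[size:]
        return out
```

Subclassing io.RawIOBase keeps the object file-like enough for consumers that probe readable()/closed in addition to calling read().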
This allows our call to be: