Is it possible to append rows to an existing Arrow (PyArrow) Table?
I am aware that "Many Arrow objects are immutable: once constructed, their logical properties cannot change anymore" (docs). In this blog post by one of the Arrow creators, it is said:
"Table columns in Arrow C++ can be chunked, so that appending to a table is a zero copy operation, requiring no non-trivial computation or memory allocation."
However, I am unable to find in the documentation how to append a row to a table. pyarrow.concat_tables(tables, promote=False) does something similar, but it is my understanding that it produces a new Table object rather than, say, adding chunks to the existing one.
I am unsure whether this operation is at all possible or makes sense (in which case I'd like to know how) or whether it isn't (in which case pyarrow.concat_tables is exactly what I need).
Similar questions:
- "In PyArrow, how to append rows of a table to a memory mapped file?" asks specifically about memory-mapped files. I am asking generally about any Table object. Could be coming from a read_csv operation or be manually constructed.
- "Using pyarrow how do you append to parquet file?" talks about Parquet files. See above.
- "Pyarrow Write/Append Columns Arrow File" talks about columns, but I'm talking about rows.
- https://github.com/apache/arrow/issues/3622 asks this same question, but it doesn't have a satisfying answer (in my opinion).
Comments (1)
Basically, a Table in PyArrow/Arrow C++ isn't really the data itself, but rather a container consisting of pointers to data. Each column of a Table is a ChunkedArray, that is, a list of Arrays (chunks), whereas each column of a RecordBatch is a single contiguous Array.
Hence, you can concatenate two Tables "zero copy" with pyarrow.concat_tables, by just copying pointers. But you cannot concatenate two RecordBatches "zero copy", because you have to concatenate the Arrays, and then you have to copy data out of buffers.