Is it possible to append rows to an existing Arrow (PyArrow) Table?

Posted 2025-01-13 01:38:59

I am aware that "Many Arrow objects are immutable: once constructed, their logical properties cannot change anymore" (docs). In this blog post by one of the Arrow creators it's said

Table columns in Arrow C++ can be chunked, so that appending to a table is a zero copy operation, requiring no non-trivial computation or memory allocation.

However, I am unable to find in the documentation how to append a row to a table. pyarrow.concat_tables(tables, promote=False) does something similar, but it is my understanding that it produces a new Table object, rather than, say, adding chunks to the existing one.

I am unsure whether this operation is at all possible or makes sense (in which case I'd like to know how to do it) or whether it doesn't (in which case pyarrow.concat_tables is exactly what I need).

Similar questions:

Comments (1)

南风几经秋 2025-01-20 01:38:59

Basically, a Table in PyArrow/Arrow C++ isn't really the data itself, but rather a container consisting of pointers to data. How it works is:

  • A Buffer represents an actual, singular allocation. In other words, Buffers are contiguous, full stop. They may be mutable or immutable.
  • An Array contains 0+ Buffers and imposes some sort of semantics onto them. (For instance, an array of integers, or an array of strings.) Arrays are "contiguous" in the sense that each buffer is contiguous, and conceptually the "column" is not "split" across multiple buffers. (This gets really fuzzy with nested arrays: a struct array does split its data across multiple buffers, in some sense! I need to come up with a better wording of this, and will contribute this to upstream docs. But I hope what I mean here is reasonably clear.)
  • A ChunkedArray contains 0+ Arrays. A ChunkedArray is not logically contiguous. It's kinda like a linked list of chunks of data. Two ChunkedArrays can be concatenated "zero copy", i.e. the underlying buffers will not get copied.
  • A Table contains 0+ ChunkedArrays. A Table is a 2D data structure (both columns and rows).
  • A RecordBatch contains 0+ Arrays. A RecordBatch is also a 2D data structure.

Hence, you can concatenate two Tables "zero copy" with pyarrow.concat_tables, by just copying pointers. But you cannot concatenate two RecordBatches "zero copy", because you have to concatenate the Arrays, and then you have to copy data out of buffers.
