Difference between the table and dataset APIs in Arrow
From the documentation, I understand that Arrow provides the datasets API to deal with larger-than-memory data. Both APIs support automatic predicate/projection pushdown (which lets them handle larger-than-memory data anyway, since only what is needed is loaded) and can read partitioned files. The table API ships with a lot of compute functions, but these are not available for datasets.
But I am trying to understand the real difference between working with the datasets and table APIs. datasets can read multiple files while table can't. Is that all? Also, if there is no big difference, why do tables and datasets exist as two separate entities, and will they be merged into a unified element in the future?