Is there a way to iterate backwards through a Dask dataframe?
I want to read_parquet, but read backwards from where you start (assuming a sorted index). I don't want to read the entire parquet file into memory, because that defeats the whole point of using it. Is there a nice way to do this?
2 Answers
Assuming the dataframe is indexed, inverting it can be done as a two-step process: invert the order of the partitions, then invert the index within each partition:
If the last N rows are all in the last partition, you can use dask.dataframe.tail. If not, you can iterate backwards using the dask.dataframe.partitions attribute. This isn't particularly smart and will blow up your memory if you request too many rows, but it should do the trick.

For example, here's a dataframe with 20 rows and 5 partitions:
You can call the above function with any number of rows to get that many rows in the tail:
Requesting more rows than are in the dataframe just computes the whole dataframe:
Note that this requests the data iteratively, so it may be very inefficient if your graph is complex and involves lots of shuffles.