Is there a way to traverse a Dask dataframe backwards?

Posted on 2025-02-10 03:23:43

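I want to use read_parquet, but read backwards from the starting position (assuming a sorted index). I don't want to read the entire Parquet dataset into memory, because that defeats the whole point of using Dask. Is there a nice way to do this?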

枉心 2025-02-17 03:23:43

Assuming that the dataframe has a sorted index, the order can be inverted in two steps: reverse the order of the partitions, then reverse the index within each partition:

from dask.datasets import timeseries

ddf = timeseries()

ddf_inverted = (
    ddf
    .partitions[::-1]
    .map_partitions(lambda df: df.sort_index(ascending=False))
)

罪#恶を代价 2025-02-17 03:23:43

If the last N rows are all in the last partition, you can use the dask dataframe's tail method. If not, you can iterate backwards over the partitions using the dataframe's partitions attribute. This isn't particularly smart and will exhaust your memory if you request too many rows, but it should do the trick:

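import pandas as pd


def get_last_n(n, df):
    """Return the last n rows of a dask dataframe as a pandas DataFrame."""
    read = []
    lines_read = 0
    # Walk the partitions from last to first, computing one at a time.
    for i in range(df.npartitions - 1, -1, -1):
        # tail() on a single partition returns a pandas DataFrame.
        p = df.partitions[i].tail(n - lines_read)

        # Prepend so the result keeps the original row order.
        read.insert(0, p)
        lines_read += len(p)
        if lines_read >= n:
            break

    return pd.concat(read, axis=0)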

For example, here's a dataframe with 20 rows and 5 partitions:

import dask.dataframe
import numpy as np
import pandas as pd

df = dask.dataframe.from_pandas(pd.DataFrame({'A': np.arange(20)}), npartitions=5)

You can call the above function with any number of rows to get that many rows in the tail:

In [4]: get_last_n(4, df)
Out[4]:
     A
16  16
17  17
18  18
19  19

In [5]: get_last_n(10, df)
Out[5]:
     A
10  10
11  11
12  12
13  13
14  14
15  15
16  16
17  17
18  18
19  19

Requesting more rows than are in the dataframe just computes the whole dataframe:

In [6]: get_last_n(1000, df)
Out[6]:
     A
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
10  10
11  11
12  12
13  13
14  14
15  15
16  16
17  17
18  18
19  19

Note that this requests the data iteratively, one partition per compute call, so it may be very inefficient if your graph is complex and involves lots of shuffles.
