Dask DataFrame shape attribute gives the wrong shape

Posted on 2025-01-16 10:45:20


I'm trying to find the shape of a dataframe that is a subset of a larger Dask dataframe. Instead of the correct number of rows, I get a wrong value.

In the example below, I store the first 3 rows in a new dataframe, but shape[0] returns 4 rather than 3. Is there any way to solve this?

import pandas as pd
import dask.dataframe as dd

data = {'Name': ['Tom', 'nick', 'nick', 'krish', 'jack', 'jack'], 'Age': [20, 21, 21, 19, 18, 18]}
df = pd.DataFrame(data)
ddf = dd.from_pandas(df, npartitions=5)
print(ddf.shape[0].compute())  # --> Outputs 6

# Only selecting 3 rows
only_3 = ddf.loc[:3, :]
print(only_3.shape[0].compute())  # --> Outputs 4 (instead of 3)

EDIT:

How did I miss that? Apologies for the bad example.

I am working with real data of about 24,700,000 rows, read from a CSV file into a Dask dataframe with 23 partitions. I created a sample Dask dataframe by applying .loc[:100,:] to the original Dask dataframe, but when I checked its shape, I got 2323 as the number of rows.

How was this number calculated? How is the data distributed among the partitions?


为你拒绝所有暧昧 2025-01-23 10:45:20


The reason you see a different number of rows is that .loc selects up to and including the given label. So this line

only_3 = ddf.loc[:3,:] # this will select 4 rows

selects 4 rows: those with index labels 0, 1, 2, and 3.

This follows the pandas API, whose documentation describes label slices as follows:

A slice object with labels 'a':'f' (Note that contrary to usual Python slices, both the start and the stop are included, when present in the index! See Slicing with labels and Endpoints are inclusive.)

Hence, your code is correct in principle; just be aware of this pandas-specific label-slicing behavior.
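
To make the inclusive endpoint concrete, here is a minimal pandas-only sketch using the data from the question. It compares the label-based slice with two position-based alternatives (.iloc and .head); the variable names by_label, by_position and first_three are only illustrative.

```python
import pandas as pd

data = {
    "Name": ["Tom", "nick", "nick", "krish", "jack", "jack"],
    "Age": [20, 21, 21, 19, 18, 18],
}
df = pd.DataFrame(data)  # default RangeIndex with labels 0..5

# Label-based slice: both endpoints are included, so labels 0, 1, 2, 3.
by_label = df.loc[:3, :]

# Position-based slice: the stop is excluded, so positions 0, 1, 2.
by_position = df.iloc[:3, :]

# head(n) returns the first n rows by position.
first_three = df.head(3)

print(len(by_label), len(by_position), len(first_three))  # 4 3 3
```

On a Dask dataframe, row selection with .iloc is not supported as far as I know, so ddf.head(3) is probably the closest equivalent. By default it only looks at the first partition, so passing npartitions=-1 may be needed when the first partition holds fewer than 3 rows.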

Update: if the Dask dataframe is constructed by reading a CSV file (or in any other way that does not produce a unique index), each partition has its own index.

This means that calling .loc[:3] yields at most 4 rows from every partition. For example, if there are 5 partitions and each has 10 rows, calling .loc[:4].compute() yields a dataframe with 25 rows (thanks to @darthbith for the correction). The same arithmetic explains the number in the question: .loc[:100,:] selects 101 rows (labels 0 through 100) from each of the 23 partitions, and 23 × 101 = 2323.
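
The per-partition behavior can be emulated with plain pandas. The sketch below does not use Dask: it assumes each partition is an ordinary pandas dataframe whose index restarts at 0, and that the label slice is applied to every partition separately. The partition size of 200 rows is a made-up value; any size above 100 gives the same count.

```python
import pandas as pd

N_PARTITIONS = 23         # number of partitions reported in the question
ROWS_PER_PARTITION = 200  # hypothetical size, chosen only for illustration

# Each emulated partition gets its own RangeIndex starting at 0.
partitions = [
    pd.DataFrame({"value": range(ROWS_PER_PARTITION)})
    for _ in range(N_PARTITIONS)
]

# Apply the label slice to every partition separately, then concatenate.
selected = pd.concat([part.loc[:100, :] for part in partitions])

print(len(selected))  # 23 partitions * 101 rows (labels 0..100) = 2323
```

Every label from 0 to 100 appears 23 times in the result, once per partition, which is why the row count is a multiple of the partition count.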

If this is not desirable, there is a way to generate a unique index for every row in the dask dataframe, see this answer.
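
The linked answer is not reproduced here, so the following is only a pandas-only sketch of one common idea, not necessarily the method that answer describes: shift each partition's index by the number of rows in all preceding partitions. The partition sizes (4, 3, 5) are made up for illustration. In Dask, the partition lengths could presumably be gathered with something like ddf.map_partitions(len).compute().

```python
import pandas as pd

# Three hypothetical partitions of uneven size, each indexed from 0.
partitions = [pd.DataFrame({"value": range(n)}) for n in (4, 3, 5)]

# Offset of a partition = total number of rows in the partitions before it.
lengths = [len(part) for part in partitions]
offsets = [sum(lengths[:i]) for i in range(len(lengths))]

# Shift every partition's index by its offset to get globally unique labels.
reindexed = [
    part.set_index(part.index + offset)
    for part, offset in zip(partitions, offsets)
]
combined = pd.concat(reindexed)

print(combined.index.is_unique)   # True
print(len(combined.loc[:3, :]))   # 4 (labels 0, 1, 2, 3 occur once each)
```

With a unique, increasing index, .loc[:3] returns labels 0 through 3 exactly once, so the inclusive endpoint is the only remaining difference from a positional slice.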
