Dask DataFrame shape attribute gives the wrong shape
I'm trying to find the shape of a subset dataframe of a larger dask dataframe, but instead of getting the right shape (number of rows), I'm getting a wrong value.
In the example, I stored the first 3 rows into a new dataframe. When I try to find shape[0], the output is 4 rather than 3. Is there any way to solve this issue?
import pandas as pd
import dask.dataframe as dd

data = {'Name': ['Tom', 'nick', 'nick', 'krish', 'jack', 'jack'],
        'Age': [20, 21, 21, 19, 18, 18]}
df = pd.DataFrame(data)
ddf = dd.from_pandas(df, npartitions=5)
print(ddf.shape[0].compute())  # --> Outputs 6
# Only selecting 3 rows
only_3 = ddf.loc[:3, :]
print(only_3.shape[0].compute())  # --> Outputs 4 (instead of 3)
EDIT:
How did I miss that? Apologies for the bad example.
I was working on real data of about 24,700,000 rows stored in a dask dataframe (23 partitions) read from a csv file. I created a sample dask dataframe by indexing the original dask dataframe with .loc[:100,:], but when I tried to find its shape, I got 2323 as the number of rows.
How was this number calculated? How is the data distributed among the partitions?
1 Answer
The reason you observe a different number of rows is that .loc will select up to and including the index label provided. So ddf.loc[:3,:] selects 4 rows: those with index 0, 1, 2, and 3.
This is based on the pandas API, so your code appears to be correct in principle; just take note of this particular pandas-specific index slicing syntax.
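To make the inclusive behaviour concrete, here is a minimal sketch reusing the toy data from the question. It contrasts label-based .loc, which includes the end label, with .head(), which counts rows; a single partition is used only to keep the output predictable.

import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'Name': ['Tom', 'nick', 'nick', 'krish', 'jack', 'jack'],
                   'Age': [20, 21, 21, 19, 18, 18]})
ddf = dd.from_pandas(df, npartitions=1)  # one partition keeps head() predictable

print(ddf.loc[:3].shape[0].compute())  # 4 -- labels 0, 1, 2 and 3 are all included
print(len(ddf.head(3)))                # 3 -- head() counts rows, not index labels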
Update: if the dask dataframe is constructed by reading a csv file (or in another way that does not generate a unique index), then each partition will have its own index. That means that calling .loc[:3] will yield at most 4 rows from every partition. For example, if there are 5 partitions and each has 10 rows, then calling .loc[:4].compute() will yield a dataframe with 25 rows (thanks to @darthbith for the correction). This is also where your 2323 comes from: each of the 23 partitions contributes the rows with labels 0 through 100, i.e. 101 rows, and 23 × 101 = 2323. If this is not desirable, there is a way to generate a unique index for every row in the dask dataframe, see this answer.
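As a rough illustration of the per-partition indexing (a sketch, not part of the original answer), the snippet below reads a csv with dask and inspects each partition's index; 'your_file.csv' is a placeholder path. Because the divisions are unknown, dask cannot tell where one partition's labels end and the next begin, which is why a label slice like .loc[:100] is applied to every partition.

import dask.dataframe as dd

ddf = dd.read_csv('your_file.csv')  # placeholder path
print(ddf.known_divisions)          # False -- index boundaries are unknown
# Each partition carries its own RangeIndex starting at 0:
for i in range(ddf.npartitions):
    part = ddf.get_partition(i).compute()
    print(i, part.index.min(), part.index.max())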