Split a column in a Dask DataFrame into n columns
In a column of a Dask DataFrame, I have strings like this:
column_name_1 | column_name_2 |
---|---|
a^b^c | j |
e^f^g | k^l |
h^i | m |
I need to split these strings into columns in the same DataFrame, like this:
column_name_1 | column_name_2 | column_name_1_1 | column_name_1_2 | column_name_1_3 | column_name_2_1 | column_name_2_2 |
---|---|---|---|---|---|---|
a^b^c | j | a | b | c | j | |
e^f^g | k^l | e | f | g | k | l |
h^i | m | h | i | | m | |
I cannot figure out how to do this without knowing in advance how many occurrences of the delimiter there are in the data. Also, there are dozens of columns in the DataFrame that should be left alone, so I need to be able to specify which columns to split.
My best effort includes something like:

```python
df[["column_name_1_1", "column_name_1_2", "column_name_1_3"]] = df["column_name_1"].str.split("^", n=2, expand=True)
```

But it fails with:

```
ValueError: The columns in the computed data do not match the columns in the provided metadata
```
Here are 2 solutions working without `stack`, but with a loop over the selected column names:
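A sketch of the first approach, shown with plain pandas (the list of columns to split is an assumption for illustration; on a Dask DataFrame the expanding split needs a known split count, as the next answer explains):

```python
import pandas as pd

df = pd.DataFrame({
    "column_name_1": ["a^b^c", "e^f^g", "h^i"],
    "column_name_2": ["j", "k^l", "m"],
})

# Split only the listed columns; all other columns are left untouched.
for col in ["column_name_1", "column_name_2"]:
    parts = df[col].str.split("^", expand=True)                # width inferred from the data
    parts.columns = [f"{col}_{i + 1}" for i in parts.columns]  # column_name_1_1, ...
    df = df.join(parts)
```

Or modify another solution, concatenating all the split frames in one pass (likewise a sketch under the same assumptions):

```python
split_cols = ["column_name_1", "column_name_2"]
df = pd.concat(
    [df] + [
        df[col].str.split("^", expand=True).rename(columns=lambda i: f"{col}_{i + 1}")
        for col in split_cols
    ],
    axis=1,
)
```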
Unfortunately, using `dask.dataframe.Series.str.split` with `expand=True` and an unknown number of splits is not yet supported in Dask; the following returns a `NotImplementedError`:
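For example (a minimal reproduction; the sample data mirrors the question, and `npartitions=2` is an arbitrary choice):

```python
import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(
    pd.DataFrame({
        "column_name_1": ["a^b^c", "e^f^g", "h^i"],
        "column_name_2": ["j", "k^l", "m"],
    }),
    npartitions=2,
)

# Dask cannot build the lazy result without knowing the output width up front,
# so this raises NotImplementedError at graph-construction time:
ddf["column_name_1"].str.split("^", expand=True)
```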
Usually, when a pandas equivalent has not yet been implemented in Dask, `map_partitions` can be used to apply a Python function to each DataFrame partition. In this case, however, Dask would still need to know how many columns to expect in order to lazily produce a Dask DataFrame, as specified with the `meta` argument. This makes using Dask for this task challenging.

Relatedly, the `ValueError` occurs because `column_name_2` requires only 1 split, so the computed data has 2 columns, while Dask expects the 3 columns implied by `n=2`.

Here is one solution (building from @Fontanka16's answer) if you do know the number of splits ahead of time:
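A sketch of that approach, reusing `ddf` from the snippet above (the split counts, `n=2` and `n=1`, are what this sample data happens to need):

```python
# n splits produce n + 1 output columns.
ddf[["column_name_1_1", "column_name_1_2", "column_name_1_3"]] = \
    ddf["column_name_1"].str.split("^", n=2, expand=True)
ddf[["column_name_2_1", "column_name_2_2"]] = \
    ddf["column_name_2"].str.split("^", n=1, expand=True)

# Caveat: if some partition has no row with the full number of delimiters,
# the computed partition can be narrower than the metadata and the same
# ValueError can resurface at compute time.
result = ddf.compute()
```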