One-hot encoding of array values in a DataFrame in Python

Posted 2025-01-12 18:40:57

I am trying to one-hot encode these clustered data frames (grouped by length). I have been trying to use sklearn's encoder, but it seems to treat each individual row as one category instead of multiple categories.

Example input:

 ID                    trace  length
 3              [A, B, C, C]       4
 4           [A, B, C, C, D]       5
 5        [A, B, C, C, D, E]       6
 24             [A, B, C, C]       4
 25          [A, B, C, C, D]       5
 ...                     ...     ...

Expected output:

ID     A  B  C  D  E    length
3      1  1  1  0  0         4
4      1  1  1  1  0         5
5      1  1  1  1  1         6
24     1  1  1  0  0         4
25     1  1  1  1  0         5
.... ..... .. ......
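
To make the target explicit, the requested transformation can be spelled out in plain Python. This is only a sketch: the rows are copied from the example input, and the names `rows`, `labels`, and `encode` are illustrative, not part of any library.

```python
# Rows copied from the example input: (ID, trace, length).
rows = [
    (3, ["A", "B", "C", "C"], 4),
    (4, ["A", "B", "C", "C", "D"], 5),
    (5, ["A", "B", "C", "C", "D", "E"], 6),
]

# The output columns are the sorted set of labels seen in any trace.
labels = sorted({label for _, trace, _ in rows for label in trace})


def encode(trace, labels):
    """Return 1 for each label present in the trace, else 0 (repeats are ignored)."""
    present = set(trace)
    return [1 if label in present else 0 for label in labels]


encoded = [(row_id, encode(trace, labels), length) for row_id, trace, length in rows]
for row_id, flags, length in encoded:
    print(row_id, flags, length)
```

Each row yields one indicator per label, so a repeated label such as the second `C` does not change the result.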

眸中客 2025-01-19 18:40:57

IIUC, and if the trace column contains lists, you could do:

(df.drop(columns='trace')
   .join(df['trace']
         .apply('|'.join)
         .str.get_dummies()
        )
 )

Or, to modify df in place:

df = (df.join(df.pop('trace')
              .apply('|'.join)
              .str.get_dummies())
      )

Or using explode and pivot_table:

(df.explode('trace')
   .assign(x=1)
   .pivot_table(index=['ID', 'length'], columns='trace', values='x', aggfunc='first')
   .fillna(0).astype(int)
   .reset_index()
 )

Output:

   ID  length  A  B  C  D  E
0   3       4  1  1  1  0  0
1   4       5  1  1  1  1  0
2   5       6  1  1  1  1  1
3  24       4  1  1  1  0  0
4  25       5  1  1  1  1  0
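
Note that the `'|'.join` approach relies on no label containing a `|` character. A separator-free variant is sketched below; it assumes the frame has the `ID`, `trace`, and `length` columns from the question, and the sample frame is rebuilt here only for illustration.

```python
import pandas as pd

# Sample frame rebuilt from the question's example input.
df = pd.DataFrame({
    "ID": [3, 4, 5, 24, 25],
    "trace": [list("ABCC"), list("ABCCD"), list("ABCCDE"),
              list("ABCC"), list("ABCCD")],
    "length": [4, 5, 6, 4, 5],
})

# One row per list element; the original row index is repeated.
exploded = df["trace"].explode()

# Indicator columns per element, collapsed back to one row per original index.
dummies = pd.get_dummies(exploded, dtype=int).groupby(level=0).max()

result = df.drop(columns="trace").join(dummies)
print(result)
```

Taking the per-row maximum means a label that appears twice in a list still maps to 1.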
