LabelEncoding in pandas on a column with lists of strings across rows

Posted on 2025-01-09 20:18:25

I would like to LabelEncode a column in pandas where each row contains a list of strings. Since the same string carries the same meaning across rows, the encoding should respect that and ideally map each string to a unique number. Imagine:

import pandas as pd

df = pd.DataFrame({
    'A': [['OK', 'NG', 'Repair', 'Peace'],
          ['Sky', 'NG', 'Fixed', 'Conflict'],
          ['Crossed', 'OK', 'Engine', 'Peace'],
          ['OK', 'Beats', 'RPi', 'Country']]
})

# df
                              A
0       [OK, NG, Repair, Peace]
1    [Sky, NG, Fixed, Conflict]
2  [Crossed, OK, Engine, Peace]
3     [OK, Beats, RPi, Country]

When I do the following:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['LabelEncodedA'] = df['A'].apply(le.fit_transform)

it returns:

                              A LabelEncodedA
0       [OK, NG, Repair, Peace]  [1, 0, 3, 2]
1    [Sky, NG, Fixed, Conflict]  [1, 3, 2, 0]
2  [Crossed, OK, Engine, Peace]  [0, 2, 1, 3]
3     [OK, Beats, RPi, Country]  [2, 0, 3, 1]

This is not the intended result: each row is LabelEncoded in isolation, so a string such as 'OK' in the first row is not encoded to the same number as in the third or fourth row. Ideally I would like them encoded globally across rows. Perhaps one way would be to build a corpus out of that column and, using tokenization or LabelEncoding, obtain a mapping to encode the lists manually. How can I convert a pandas column containing lists of strings into a corpus text? Or are there better approaches?

Expected result (hypothetical):

                              A LabelEncodedA
0       [OK, NG, Repair, Peace]  [0, 1, 2, 3]
1    [Sky, NG, Fixed, Conflict]  [4, 1, 5, 6]
2  [Crossed, OK, Engine, Peace]  [7, 0, 8, 9]
3     [OK, Beats, RPi, Country]  [0, 10, 11, 12]
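The manual-mapping idea hinted at in the question can be sketched without any extra library: flatten the column into one "corpus" of strings, assign each new string the next integer in order of first appearance, then encode every row's list with that shared dictionary. This is a minimal illustration, not an official API; note it keeps a repeated string's original code (e.g. 'Peace' in row 2 stays 3), so the numbers differ slightly from the hypothetical table above.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [['OK', 'NG', 'Repair', 'Peace'],
          ['Sky', 'NG', 'Fixed', 'Conflict'],
          ['Crossed', 'OK', 'Engine', 'Peace'],
          ['OK', 'Beats', 'RPi', 'Country']]
})

# Build a vocabulary over the flattened column: each string gets the
# next integer the first time it is seen, shared across all rows.
mapping = {}
for row in df['A']:
    for s in row:
        if s not in mapping:
            mapping[s] = len(mapping)

# Encode every list with the same global mapping.
df['LabelEncodedA'] = df['A'].apply(lambda row: [mapping[s] for s in row])
# Row 0 becomes [0, 1, 2, 3]; 'OK' maps to 0 in every row it appears.
print(df)
```

A `LabelEncoder` fitted once on the flattened list would work the same way, except its codes follow sorted order rather than order of first appearance.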

Comments (1)

残疾 — 2025-01-16 20:18:25


One approach would be to explode the column, then factorize to encode the values as a single categorical variable, then group the encoded values by the original index and aggregate them back into lists:

a = df['A'].explode()                         # one string per row, original index preserved
a[:] = a.factorize()[0]                       # global integer codes, in order of first appearance
df['Encoded'] = a.groupby(level=0).agg(list)  # regroup codes into one list per row

Result

                              A         Encoded
0       [OK, NG, Repair, Peace]    [0, 1, 2, 3]
1    [Sky, NG, Fixed, Conflict]    [4, 1, 5, 6]
2  [Crossed, OK, Engine, Peace]    [7, 0, 8, 3]
3     [OK, Beats, RPi, Country]  [0, 9, 10, 11]
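A small addition to the answer's approach (not part of the original): `factorize` also returns the array of unique values alongside the codes, so the encoding is trivially reversible. A sketch, with a hypothetical `Decoded` column for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [['OK', 'NG', 'Repair', 'Peace'],
          ['Sky', 'NG', 'Fixed', 'Conflict'],
          ['Crossed', 'OK', 'Engine', 'Peace'],
          ['OK', 'Beats', 'RPi', 'Country']]
})

a = df['A'].explode()
codes, uniques = a.factorize()  # codes: int per string; uniques: string per code
a[:] = codes
df['Encoded'] = a.groupby(level=0).agg(list)

# uniques[i] recovers the string for code i, so decoding is a lookup.
df['Decoded'] = df['Encoded'].apply(lambda row: [uniques[i] for i in row])
```

Round-tripping this way gives back the original `A` column, which is a handy sanity check on the encoding.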