LabelEncoding in pandas on a column with lists of strings across rows
I would like to LabelEncode a column in pandas where each row contains a list of strings. Since the same string carries the same meaning across rows, the encoding should respect that and map each distinct string to one unique number. Imagine:
import pandas as pd

df = pd.DataFrame({
    'A': [['OK', 'NG', 'Repair', 'Peace'],
          ['Sky', 'NG', 'Fixed', 'Conflict'],
          ['Crossed', 'OK', 'Engine', 'Peace'],
          ['OK', 'Beats', 'RPi', 'Country']]
})
# df
A
0 [OK, NG, Repair, Peace]
1 [Sky, NG, Fixed, Conflict]
2 [Crossed, OK, Engine, Peace]
3 [OK, Beats, RPi, Country]
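When I do the following:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# apply() calls fit_transform once per row, so the encoder is refitted on each list
df['LabelEncodedA'] = df['A'].apply(le.fit_transform)
it returns: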
A LabelEncodedA
0 [OK, NG, Repair, Peace] [1, 0, 3, 2]
1 [Sky, NG, Fixed, Conflict] [3, 2, 1, 0]
2 [Crossed, OK, Engine, Peace] [0, 2, 1, 3]
3 [OK, Beats, RPi, Country] [2, 0, 3, 1]
This is not the intended result. Each row is LabelEncoded in isolation, so a string such as 'OK' in the first row does not get the same code as 'OK' in the third or fourth row. Ideally I would like the strings encoded globally across rows. Perhaps one way would be to build a corpus out of that column, use tokenization or LabelEncoding to obtain a mapping, and then encode the lists manually with it? In that case, how do I convert a pandas column containing lists of strings into a corpus? Or are there better approaches?
Expected result (hypothetical):
A LabelEncodedA
0 [OK, NG, Repair, Peace] [0, 1, 2, 3]
1 [Sky, NG, Fixed, Conflict] [4, 1, 5, 6]
2 [Crossed, OK, Engine, Peace] [7, 0, 8, 3]
3 [OK, Beats, RPi, Country] [0, 9, 10, 11]
Comments (1)
One approach would be to explode the column, then factorize to encode the exploded values as a categorical variable, then group the encoded values by the original row index and aggregate using list.
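A minimal sketch of those three steps, assuming the df from the question; the variable names exploded, codes and uniques and the output column name LabelEncodedA are chosen here for illustration and are not taken from the answer:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [['OK', 'NG', 'Repair', 'Peace'],
          ['Sky', 'NG', 'Fixed', 'Conflict'],
          ['Crossed', 'OK', 'Engine', 'Peace'],
          ['OK', 'Beats', 'RPi', 'Country']]
})

# Step 1: one string per row; the original row label is repeated for each list item.
exploded = df['A'].explode()

# Step 2: factorize assigns integer codes across the whole column,
# numbering distinct strings in order of first appearance.
codes, uniques = pd.factorize(exploded)

# Step 3: regroup the codes by the original row label and collect them into lists.
df['LabelEncodedA'] = (
    pd.Series(codes, index=exploded.index)
      .groupby(level=0)
      .agg(list)
)

print(df)
```

Because the codes are computed once over the exploded column, every occurrence of a string shares one code regardless of the row it sits in. The uniques array keeps the code-to-string lookup, so uniques[codes] recovers the original strings.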
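Result (each distinct string gets one code, numbered in order of first appearance):
A LabelEncodedA
0 [OK, NG, Repair, Peace] [0, 1, 2, 3]
1 [Sky, NG, Fixed, Conflict] [4, 1, 5, 6]
2 [Crossed, OK, Engine, Peace] [7, 0, 8, 3]
3 [OK, Beats, RPi, Country] [0, 9, 10, 11]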
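The mapping idea raised in the question can also be written out directly. The sketch below is one possible way to do it, not part of the answer above: it builds a plain dictionary from the flattened column and applies it to each list. The names mapping and encode are hypothetical helpers introduced for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [['OK', 'NG', 'Repair', 'Peace'],
          ['Sky', 'NG', 'Fixed', 'Conflict'],
          ['Crossed', 'OK', 'Engine', 'Peace'],
          ['OK', 'Beats', 'RPi', 'Country']]
})

# Flatten the column into one "corpus" of strings.
corpus = [s for row in df['A'] for s in row]

# dict.fromkeys drops duplicates while keeping first-appearance order,
# so enumerate() yields one stable code per distinct string.
mapping = {s: i for i, s in enumerate(dict.fromkeys(corpus))}

def encode(row):
    # Hypothetical helper: look up each string in the global mapping.
    return [mapping[s] for s in row]

df['LabelEncodedA'] = df['A'].apply(encode)

print(mapping)
print(df)
```

Keeping the mapping as an explicit dictionary makes it easy to reuse on later data. A string that was not in the original column raises a KeyError in this sketch, which may or may not be the behaviour you want. A single LabelEncoder fitted once on the flattened list would work similarly, except that its codes follow sorted order of the strings instead of first appearance.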