Numpy group-by for target encoding (a.k.a. mean encoding)
I am trying to target-encode the categorical columns of a feature array X using a 0-1 target array y, i.e. replace each level in a feature column x_i with the mean of the target (the proportion of 1's) over the rows at that level.
The following code is likely inefficient because it uses two nested loops to mimic the group-by. Is there any room to improve this implementation (while avoiding the slow pandas group-by)? Thank you.
import numpy as np

np.random.seed(9)
rows, cols = 10_000, 500
X = np.random.choice(['a', 'b', 'c', 'd', 'e', 'f', 'g'], size=(rows, cols))
y = np.random.choice([0, 1], size=(rows, 1))

# learn encoding: for each string column, map every level to the mean of y
maps_num = {}
for colum in range(X.shape[1]):
    c = X[:, colum]
    if c.dtype.kind == "U":
        unique = np.unique(c)
        tmap_num = {}
        for uni in unique:
            tmap_num[uni] = y[c == uni].mean()
        maps_num[str(colum)] = tmap_num

# apply encoding
# X stays a string array, so the means are stored as text; widen it so they fit
X = X.astype('<U32')
for col, tmap in maps_num.items():
    vals = np.full(X.shape[0], np.nan)
    for val, mean_target in tmap.items():
        vals[X[:, int(col)] == val] = mean_target
    X[:, int(col)] = vals
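One possible way to drop the inner loop over levels is to let `np.unique(..., return_inverse=True)` assign an integer code to every row and then compute the per-level sums and counts with `np.bincount`. The sketch below is only an illustration of that idea, not a benchmarked answer: the function names `target_encode_column` and `target_encode` are made up for this example, it writes the result into a separate float array instead of back into the string array, and it still loops once over the columns.

```python
import numpy as np

def target_encode_column(c, y):
    """Replace each level of the 1-D array c with the mean of y over that level."""
    # inverse[i] is the position in `levels` of the level found at row i
    levels, inverse = np.unique(c, return_inverse=True)
    inverse = inverse.ravel()
    # per-level sum of the target and per-level row count
    sums = np.bincount(inverse, weights=y, minlength=levels.size)
    counts = np.bincount(inverse, minlength=levels.size)
    means = sums / counts  # counts are never 0: every level occurs in c
    return means[inverse], dict(zip(levels.tolist(), means.tolist()))

def target_encode(X, y):
    """Encode every column of X; returns a float array and the learned maps."""
    y = np.asarray(y, dtype=float).ravel()
    encoded = np.empty(X.shape, dtype=float)
    maps = {}
    for j in range(X.shape[1]):
        encoded[:, j], maps[j] = target_encode_column(X[:, j], y)
    return encoded, maps

# Tiny hypothetical data set to check the behaviour by hand
X_small = np.array([['a', 'x'], ['b', 'x'], ['a', 'y'], ['b', 'y']])
y_small = np.array([[1], [0], [0], [0]])
encoded_small, maps_small = target_encode(X_small, y_small)
```

In the small example, level `a` covers targets 1 and 0, so it is encoded as 0.5, while `b` covers only zeros and is encoded as 0.0. Whether this is actually faster than the nested loops on the 10,000 x 500 array would need to be timed.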