Numpy groupby for target encoding (a.k.a. mean encoding)

Posted on 2025-02-06 18:58:10

I am trying to target-encode the categorical columns of a feature array X using a 0-1 target array y, i.e. replace each level of a feature column x_i with the mean of the target for that level (for a 0-1 target, the fraction of 1's).
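To make the goal concrete, here is a minimal sketch on a single toy column. The names `col`, `target`, `level_means` and `encoded` and the data are made up for illustration only; they are not part of the code in question.

```python
import numpy as np

# Hypothetical toy data: one categorical column and a 0-1 target.
col = np.array(["a", "b", "a", "b", "a"])
target = np.array([1, 0, 0, 0, 1])

# Mean of the target per level, i.e. the fraction of 1's among rows with that level.
level_means = {lvl: target[col == lvl].mean() for lvl in np.unique(col)}

# Replace every entry of the column with the mean of its level.
encoded = np.array([level_means[v] for v in col])
```

Here level "a" appears in three rows with targets 1, 0, 1, so it is encoded as 2/3, and level "b" appears in two rows with targets 0, 0, so it is encoded as 0.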

The following code is likely inefficient because it uses two nested loops to mimic a group-by. Is there any room to improve this implementation (while avoiding the slow pandas group-by)? Thank you.

import numpy as np

np.random.seed(9)
rows, cols = 10_000, 500
X = np.random.choice(['a', 'b', 'c', 'd', 'e', 'f', 'g'], size=(rows, cols))
y = np.random.choice([0, 1], size=(rows, 1))

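# learn encoding: per column, map each level to the mean of y for that level
maps_num = {}
for colum in range(X.shape[1]):
    c = X[:, colum]
    if c.dtype.kind == "U":  # only encode string (categorical) columns
        unique = np.unique(c)
        tmap_num = {}
        for uni in unique:
            tmap_num[uni] = y[c == uni].mean()
        maps_num[str(colum)] = tmap_num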

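# apply encoding: overwrite each level with its learned target mean
X = X.astype('<U32')  # widen the string dtype so the formatted means fit
for col, tmap in maps_num.items():
    vals = np.full(X.shape[0], np.nan)
    for val, mean_target in tmap.items():
        vals[X[:, int(col)] == val] = mean_target
    X[:, int(col)] = vals  # floats are stored as strings in the string array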
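
One possible direction, sketched here under assumptions rather than as a measured answer, is to replace the inner loop over levels with `np.unique(..., return_inverse=True)` and `np.bincount`. The function name `target_encode` is hypothetical, the sketch assumes every column is categorical, and it returns a separate float array instead of writing strings back into X. It still loops over columns and has not been benchmarked against the loops above or against pandas.

```python
import numpy as np

def target_encode(X, y):
    """Return a float array where each entry is the mean of y for its column level."""
    y = np.asarray(y, dtype=float).ravel()
    encoded = np.empty(X.shape, dtype=float)
    for j in range(X.shape[1]):
        # codes[i] is the index of X[i, j] within the sorted unique levels.
        levels, codes = np.unique(X[:, j], return_inverse=True)
        codes = codes.ravel()  # keep codes 1-D regardless of NumPy version
        # Per-level sum of y and per-level row count, both in one pass.
        sums = np.bincount(codes, weights=y, minlength=len(levels))
        counts = np.bincount(codes, minlength=len(levels))
        # Look up the per-level mean for every row of the column.
        encoded[:, j] = (sums / counts)[codes]
    return encoded

# Small illustrative run (sizes chosen arbitrarily for the sketch).
rng = np.random.default_rng(9)
X = rng.choice(list("abcdefg"), size=(1_000, 5))
y = rng.integers(0, 2, size=(1_000, 1))
X_enc = target_encode(X, y)
```

Every level returned by `np.unique` occurs at least once in its column, so `counts` has no zeros and the division is safe.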
