Group by and count columns with multiple values

Posted on 2025-02-02 08:13:58

Given this dataframe:

import pandas as pd

df = pd.DataFrame({
    "names": [["Kevin, Jack"], ["Antoine, Mary, Johanne, Iv"], ["Ali"]],
    "commented": [["Kevin, Antoine, Iv"], ["Antoine, Mary, Ali"], ["Mary, Jack"]],
}, index=["1", "2", "3"])

which looks like this:

    names   commented
1   [Kevin, Jack]   [Kevin, Antoine, Iv]
2   [Antoine, Mary, Johanne, Iv]    [Antoine, Mary, Ali]
3   [Ali]   [Mary, Jack]

I want to get a new dataframe that counts, for each person in names, how many times they commented on each other person. Something like:

         Kevin  Jack  Antoine  Mary  Johanne  Iv  Ali
Kevin        1     0        1     0        0   1    0
Jack         1     0        1     0        0   1    0
Antoine      0     0        1     1        0   0    1
Mary         0     0        1     1        0   0    1
Johanne      0     0        1     1        0   0    1
Iv           0     0        1     1        0   0    1
Ali          0     1        0     1        0   0    0

This dataframe might be too small for the result to make sense, but my original dataframe has 100k rows, so the counts will go higher than 1.

I've looked at various options using pivot_table and several variations of groupby, but I can't seem to figure this out.

df.pivot_table(index = 'names', columns= 'commented', aggfunc= 'count')

df.groupby('names').commented.apply(list).reset_index()
df.explode('names')['commented'].value_counts()

df.set_index('names').apply(pd.Series.explode).reset_index()

Almost all solutions I tried give me the error: TypeError: unhashable type: 'list'
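
For context, here is a minimal sketch of where that error most likely comes from: every cell in both columns is a Python list, and lists cannot be hashed, which grouping and pivoting on a column need to do. The helper name split_names is hypothetical (it is not part of pandas); it only shows one way to turn a cell like ["Kevin, Jack"] into a flat list of names.

```python
import pandas as pd

df = pd.DataFrame({
    "names": [["Kevin, Jack"], ["Antoine, Mary, Johanne, Iv"], ["Ali"]],
    "commented": [["Kevin, Antoine, Iv"], ["Antoine, Mary, Ali"], ["Mary, Jack"]],
}, index=["1", "2", "3"])

# Each cell is a list, and hash() rejects lists with a TypeError.
cell = df.loc["1", "names"]
try:
    hash(cell)
    cell_is_hashable = True
except TypeError:
    cell_is_hashable = False

# Hypothetical helper: flatten ["Kevin, Jack"] into ["Kevin", "Jack"].
# It also leaves an already-split list such as ["Kevin", "Jack"] unchanged.
def split_names(cell):
    return [name.strip() for item in cell for name in item.split(",")]

flat_names = df["names"].apply(split_names)
print(cell_is_hashable)
print(flat_names)
```

Once the cells are flat lists, explode can turn each name into its own row, and the resulting strings are hashable.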

Comments (3)

情痴 2025-02-09 08:13:58

You can try exploding the list of strings into rows and then using pandas.crosstab:

df = (df.explode(df.columns.tolist())
      .apply(lambda col: col.str.split(', '))
      .explode('names')
      .explode('commented'))

out = pd.crosstab(df['names'], df['commented'])
print(df)

     names commented
1    Kevin     Kevin
1    Kevin   Antoine
1    Kevin        Iv
1     Jack     Kevin
1     Jack   Antoine
1     Jack        Iv
2  Antoine   Antoine
2  Antoine      Mary
2  Antoine       Ali
2     Mary   Antoine
2     Mary      Mary
2     Mary       Ali
2  Johanne   Antoine
2  Johanne      Mary
2  Johanne       Ali
2       Iv   Antoine
2       Iv      Mary
2       Iv       Ali
3      Ali      Mary
3      Ali      Jack

print(out)

commented  Ali  Antoine  Iv  Jack  Kevin  Mary
names
Ali          0        0   0     1      0     1
Antoine      1        1   0     0      0     1
Iv           1        1   0     0      0     1
Jack         0        1   1     0      1     0
Johanne      1        1   0     0      0     1
Kevin        0        1   1     0      1     0
Mary         1        1   0     0      0     1
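
The crosstab output has no Johanne column because nobody commented on Johanne, while the table in the question lists every person on both axes. The sketch below is one possible way to get that square shape, assuming that is what is wanted: it repeats the split, explode and crosstab steps and then reindexes both axes over the union of names. The helper split_names and the variable names long, counts, people and square are illustrative choices, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "names": [["Kevin, Jack"], ["Antoine, Mary, Johanne, Iv"], ["Ali"]],
    "commented": [["Kevin, Antoine, Iv"], ["Antoine, Mary, Ali"], ["Mary, Jack"]],
}, index=["1", "2", "3"])

# Hypothetical helper: flatten ["Kevin, Jack"] into ["Kevin", "Jack"].
def split_names(cell):
    return [name.strip() for item in cell for name in item.split(",")]

# One row per (name, commented) pair.
long = (
    df.assign(
        names=df["names"].apply(split_names),
        commented=df["commented"].apply(split_names),
    )
    .explode("names")
    .explode("commented")
    .reset_index(drop=True)
)

# Count how often each pair occurs.
counts = pd.crosstab(long["names"], long["commented"])

# Put every person on both axes, filling pairs that never occur with 0.
people = sorted(set(long["names"]) | set(long["commented"]))
square = counts.reindex(index=people, columns=people, fill_value=0)
print(square)
```

Because crosstab counts occurrences, pairs that repeat across rows in a larger dataframe would show values above 1.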

水水月牙 2025-02-09 08:13:58

In your sample input, each element in the names and commented columns is a list with only one element (a string). I'm not sure whether that is also the case in your real data.

You can split each string by the comma, and then explode and pivot the dataframe:

split = lambda x: x[0].split(", ")
(
    df.assign(
        names=df["names"].apply(split),
        commented=df["commented"].apply(split),
        dummy=1
    )
    .explode("names")
    .explode("commented")
    .pivot_table(index="names", columns="commented", values="dummy", aggfunc="count", fill_value=0)
)
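
If the real data turns out to hold proper lists of names rather than one comma-joined string, the splitting step can probably be skipped. The sketch below uses a small made-up dataframe (df2 is hypothetical, not the asker's data) in which one pair repeats, to show counts above 1. It also swaps the dummy column and pivot_table for groupby().size().unstack(), which is a different way to reach the same kind of table.

```python
import pandas as pd

# Hypothetical data: cells are already lists of names,
# and the pair (Kevin, Antoine) appears in both rows.
df2 = pd.DataFrame({
    "names": [["Kevin", "Jack"], ["Kevin"]],
    "commented": [["Antoine"], ["Antoine", "Mary"]],
}, index=["1", "2"])

# One row per (name, commented) pair.
long = df2.explode("names").explode("commented")

# Count each pair, then spread the commented names into columns.
counts = (
    long.groupby(["names", "commented"])
    .size()
    .unstack(fill_value=0)
)
print(counts)
```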

土豪我们做朋友吧 2025-02-09 08:13:58

Here is another way, using str.get_dummies():

(df.assign(names = df['names'].str[0].str.split(', '))
.explode('names')
.set_index('names')
.squeeze()
.str[0]
.str.get_dummies(sep=', '))

Output:

         Ali  Antoine  Iv  Jack  Kevin  Mary
names                                       
Kevin      0        1   1     0      1     0
Jack       0        1   1     0      1     0
Antoine    1        1   0     0      0     1
Mary       1        1   0     0      0     1
Johanne    1        1   0     0      0     1
Iv         1        1   0     0      0     1
Ali        0        0   0     1      0     1
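
One caveat for larger data: str.get_dummies() returns 0/1 indicators with one output row per exploded name, so a person who appears in several source rows gets several output rows. A possible follow-up, sketched below on made-up data (df3 is hypothetical), is to sum the indicator rows per name to get totals.

```python
import pandas as pd

# Hypothetical data: Kevin appears in two source rows.
df3 = pd.DataFrame({
    "names": [["Kevin, Jack"], ["Kevin"]],
    "commented": [["Antoine, Iv"], ["Antoine, Mary"]],
}, index=["1", "2"])

# Same steps as above: one indicator row per exploded name.
indicators = (
    df3.assign(names=df3["names"].str[0].str.split(", "))
    .explode("names")
    .set_index("names")["commented"]
    .str[0]
    .str.get_dummies(sep=", ")
)

# Add up the indicator rows that share the same name.
totals = indicators.groupby(level=0).sum()
print(totals)
```

Selecting the commented column explicitly here replaces the squeeze() call, which relies on there being exactly one column left after set_index.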