Pandas column aggregation for duplicate values using a custom function

Posted on 2025-01-26 22:06:37


I have a dataframe and I want to aggregate identical ids within a column, i.e. count how many times each id occurs.

X_train['freq_qd1'] = X_train.groupby('qid1')['qid1'].transform('count')

X_train['freq_qd2'] = X_train.groupby('qid2')['qid2'].transform('count')

I understand the code above, but I want to build a custom function that I can apply to multiple columns.

I have attached a snapshot of the dataframe for reference. On this dataframe I tried to apply a custom function to qid1 and qid2, using the code below:

def frequency(qid):
    freq = []
    for i in str(qid):
        if i not in freq:
            freq.append(i)
            ids = set()
        if i not in ids:
            ids.add(i)
            freq.append(ids)
    return freq


def extract_simple_feat(fe):
    fe['question1'] = fe['question1'].fillna(' ')
    fe['question2'] = fe['question2'].fillna(' ')
    fe['qid1'] = fe['qid1']
    fe['qid2'] = fe['qid2']

    token_feat = fe.apply(lambda x: get_simple_features(x['question1'],
                                                        x['question2']), axis=1)

    fe['q1_len'] = list(map(lambda x: x[0], token_feat))
    fe['q2_len'] = list(map(lambda x: x[1], token_feat))
    fe['freq_qd1'] = fe.apply(lambda x: frequency(x['qid1']), axis=1)
    fe['freq_qd2'] = fe.apply(lambda x: frequency(x['qid2']), axis=1)
    fe['q1_n_words'] = list(map(lambda x: x[2], token_feat))
    fe['q2_n_words'] = list(map(lambda x: x[3], token_feat))
    fe['word_common'] = list(map(lambda x: x[4], token_feat))
    fe['word_total'] = list(map(lambda x: x[5], token_feat))
    fe['word_share'] = list(map(lambda x: x[6], token_feat))

    return fe


X_train = extract_simple_feat(X_train)

After applying my own implementation I am not getting the desired result. I am attaching a snapshot of the result I got.

The desired result is below:

Can someone help me? I am really stuck and cannot work out how to fix it.

Here's a small text input:

qid1     qid2 
  23       24
  25       26
  27       28
  318830   318831
  359558   318831
  384105   318831
  413505   318831
  451953   318831
  530151   318831

I want the aggregated output to be:

qid1      qid2    freq_qid1  freq_qid2
 23        24        1          1
 25        26        1          1
 27        28        1          1
 318830    318831    1          6
 359558              1          6
 384105              1          6
 413505              1          6
 451953              1          6
 530151              1          6


少女情怀诗 2025-02-02 22:06:37

Given: (I added an extra row for an edge case)

     qid1    qid2
0      23      24
1      25      26
2      27      28
3  318830  318831
4  359558  318831
5  384105  318831
6  413505  318831
7  451953  318831
8  530151  318831
9  495894    4394

Doing:

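def get_freqs(df, cols):
    temp_df = df.copy()
    for col in cols:
        # Count the rows sharing each id before any value is blanked
        temp_df['freq_' + col] = temp_df.groupby(col)[col].transform('count')
        # Cast to object first: recent pandas versions warn or raise when a
        # string is written into an integer column
        temp_df[col] = temp_df[col].astype(object)
        temp_df.loc[temp_df[col].duplicated(), col] = ''
    return temp_df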

df = get_freqs(df, ['qid1', 'qid2'])
print(df)

Output:

     qid1    qid2  freq_qid1  freq_qid2
0      23      24          1          1
1      25      26          1          1
2      27      28          1          1
3  318830  318831          1          6
4  359558                  1          6
5  384105                  1          6
6  413505                  1          6
7  451953                  1          6
8  530151                  1          6
9  495894    4394          1          1
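
For context on why the attempt in the question could not produce these counts: with `axis=1` the lambda receives one row at a time, so `frequency` only ever sees a single id, and `for i in str(qid)` then walks over the digits of that id rather than over the column. A count needs the whole column. Separately, blanking repeated ids with `''` leaves the id columns holding a mix of integers and strings. The sketch below is one possible column-wise variant that keeps the ids numeric and does the blanking only in a display copy. It assumes the sample frame above; `add_freqs` and `blank_repeats` are hypothetical helper names, not pandas functions.

```python
import pandas as pd

df = pd.DataFrame({
    "qid1": [23, 25, 27, 318830, 359558, 384105, 413505, 451953, 530151, 495894],
    "qid2": [24, 26, 28, 318831, 318831, 318831, 318831, 318831, 318831, 4394],
})

def add_freqs(df, cols):
    """Return a copy with one freq_<col> column per id column; ids stay untouched."""
    out = df.copy()
    for col in cols:
        # value_counts() maps id -> number of rows; map() looks up each row's id
        out["freq_" + col] = out[col].map(out[col].value_counts())
    return out

def blank_repeats(df, cols):
    """Display-only copy: every repeat of an id after its first row becomes ''."""
    shown = df.copy()
    for col in cols:
        # Cast to object so the column may hold both integers and ''
        shown[col] = shown[col].astype(object).mask(shown[col].duplicated(), "")
    return shown

counted = add_freqs(df, ["qid1", "qid2"])               # numeric, safe for modelling
display = blank_repeats(counted, ["qid1", "qid2"])      # for printing only
print(display)
```

Keeping `counted` and `display` apart means later feature code can still treat qid1 and qid2 as integers, while the printed frame matches the layout asked for in the question.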

If I wanted to cover more of what you're doing, including the length and word-count features...

Given:

   id  qid1  qid2                       question1                question2  is_duplicate
0   0     1     2            Why is the sky blue?  Why isn't the sky blue?             0
1   1     3     4  Why is the sky blue and green?  Why isn't the sky pink?             0
2   2     5     6                   Where are we?     Moon landing a hoax?             0
3   3     7     8                      Am I real?    Chickens aren't real.             0
4   4     9    10     If this Fake, surely it is?     Oops I did it again.             0

Doing:

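def do_stuff(df):
    t_df = df.copy()
    qids = [x for x in t_df.columns if 'qid' in x]
    questions = [x for x in t_df.columns if 'question' in x]
    for col in qids:
        # Count the rows sharing each id before any value is blanked
        t_df['freq_' + col] = t_df.groupby(col)[col].transform('count')
        # Cast to object first: recent pandas versions warn or raise when a
        # string is written into an integer column
        t_df[col] = t_df[col].astype(object)
        t_df.loc[t_df[col].duplicated(), col] = ''
    for i, col in enumerate(questions):
        t_df[f'q{i+1}_len'] = t_df[col].str.len()
        # Number of space-separated pieces in each question
        t_df[f'q{i+1}_n_words'] = t_df[col].str.split(' ').str.len()
    return t_df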

df = do_stuff(df)
print(df)

Output:

   id qid1 qid2                       question1                question2  is_duplicate  freq_qid1  freq_qid2  q1_len  q1_n_words  q2_len  q2_n_words
0   0    1    2            Why is the sky blue?  Why isn't the sky blue?             0          1          1      20           5      23           5
1   1    3    4  Why is the sky blue and green?  Why isn't the sky pink?             0          1          1      30           7      23           5
2   2    5    6                   Where are we?     Moon landing a hoax?             0          1          1      13           3      20           4
3   3    7    8                      Am I real?    Chickens aren't real.             0          1          1      10           3      21           3
4   4    9   10     If this Fake, surely it is?     Oops I did it again.             0          1          1      27           6      20           5
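
One caveat on the word counts: the question fills missing questions with `' '`, and splitting on a literal space counts the empty strings around that space as words. The sketch below compares `str.split(' ')` with the argument-free `str.split()`, which splits on runs of whitespace and drops the empty pieces. The sample strings are made up for illustration, and the Series is built with `dtype=object` so that plain Python string semantics apply.

```python
import pandas as pd

# Made-up samples: a normal question, a double space, a filled-in ' ', and ''
questions = pd.Series(
    ["Why is the sky blue?", "Two  spaces here", " ", ""],
    dtype=object,
)

# Splitting on a literal space keeps the empty strings next to each space
by_space = questions.str.split(" ").str.len()

# Splitting with no argument treats any whitespace run as one separator
by_whitespace = questions.str.split().str.len()

print(pd.DataFrame({
    "question": questions,
    "split(' ')": by_space,
    "split()": by_whitespace,
}))
```

With these inputs a question that was filled with `' '` counts as 2 words under `split(' ')` and 0 under `split()`, so which one to use depends on whether an empty question should contribute to features such as word_total.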
