An efficient way to manipulate pandas dataframe values


I am dealing with 2 huge dataframes, and for each unique id in the first one I need to retrieve the most frequent value associated with that id in the second one. I will explain it better with an example.

Suppose I have two dataframes: the first one will be called df_id, the second one df_values.

df_values:

    ids values
0   id1 10
1   id2 20
2   id1 10
3   id1 30
4   id2 40
...

df_id:

    ids desc
0   id1 a product
1   id2 a product
2   id3 a product

In the real dataframes I have additional columns, but for the sake of clarity I omitted them.

Now, df_id contains references to all the ids I need, while df_values contains the (multiple) values associated with each id.

My goal is to create a dictionary that, for each distinct id, reports the most frequent value. If an id does not appear in df_values, I write None instead of a value.

values_dict

{'id1': 10, 'id2': 20, 'id3': None}
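
For reference, here is one way these sample frames could be constructed (a hypothetical reconstruction: the dtypes and the omitted extra columns are not specified in the question):

import pandas as pd

# Hypothetical reconstruction of the sample frames shown above
df_values = pd.DataFrame({'ids': ['id1', 'id2', 'id1', 'id1', 'id2'],
                          'values': [10, 20, 10, 30, 40]})
df_id = pd.DataFrame({'ids': ['id1', 'id2', 'id3'],
                      'desc': ['a product', 'a product', 'a product']})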

My attempt to solve this was something like this:

from collections import Counter
import pandas as pd

def Most_Common(lst):
    # Return the single most frequent element of lst
    data = Counter(lst)
    return data.most_common(1)[0][0]

dict_val = {}

# Loop over every unique id, filtering df_values down to that id each time
for ar in pd.unique(df_id['ids']):
    df_art = df_values.loc[df_values['ids'] == ar]
    print("Done", ar)
    val = Most_Common(df_art['values']) if not df_art.empty else None
    dict_val[ar] = val

The code of the Most_Common function is from this answer.

However, this solution does not seem to scale well when my dataframes are VERY large, which is my case. I even tried to use the multiprocessing library:

from collections import Counter
import multiprocessing as mp
import numpy as np
import pandas as pd

dict_val = {}

# Split the unique ids into 5 chunks, one per worker process
unique_ids = pd.unique(df_id["ids"])
unique_ids = np.array_split(unique_ids, 5)

def register_value(ids):
    # Most_Common is the same helper as above
    for ar in ids:
        df_art = df_values.loc[df_values['ids'] == ar]
        print("Done", ar)
        val = Most_Common(df_art['values']) if not df_art.empty else None
        dict_val[ar] = val

with mp.Pool(processes=5) as p:
    p.map(register_value, unique_ids)

But the code was still running after about 40 minutes... maybe I have done something wrong. Is there an efficient solution, one that can also be applied in the multiprocessing case? Thank you.
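
A side note on the pool attempt above: each worker process operates on its own copy of dict_val, so writes made inside register_value never reach the parent process, and the dict stays empty no matter how long the workers run. Below is a minimal sketch of the usual remedy, collecting results through map's return value instead of mutating shared state; it assumes df_values, df_id and Most_Common are defined at module level as above, and a fork-based start method so the workers inherit them:

import multiprocessing as mp
import numpy as np
import pandas as pd

def register_value(ids):
    # Build (id, most frequent value) pairs and return them to the parent
    pairs = []
    for ar in ids:
        vals = df_values.loc[df_values['ids'] == ar, 'values']
        pairs.append((ar, Most_Common(vals) if not vals.empty else None))
    return pairs

if __name__ == '__main__':
    chunks = np.array_split(pd.unique(df_id['ids']), 5)
    with mp.Pool(processes=5) as p:
        # Merge the per-chunk pair lists into a single dict in the parent
        dict_val = dict(pair for chunk in p.map(register_value, chunks)
                        for pair in chunk)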


Answer by 尝蛊:


My suggestion:

  1. Find the most frequent value in df_values, grouped by ids. If there is more than one most frequent value, take the first one:

    most_freq = df_values.groupby('ids').agg(lambda x: pd.Series.mode(x)[0])['values'].to_dict()
    
  2. Create a dictionary from the ids in df_id and assign None to each id:

    dict_val = dict.fromkeys(df_id['ids'])
    
  3. Update the second dictionary with the first one. Every None whose id appears in df_values gets replaced by that id's most frequent value:

    dict_val.update(most_freq)
    

Output:

{'id1': 10, 'id2': 20, 'id3': None}

It's only a sample, so it's hard to guess how much it improves the performance. It should be much faster, as we aren't iterating over all the elements in a Python loop. Please check and let me know.
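
Putting the three steps together on the sample data from the question, as a minimal self-contained sketch (the frames are rebuilt here just for illustration):

    import pandas as pd

    # The sample frames from the question, rebuilt for illustration
    df_values = pd.DataFrame({'ids': ['id1', 'id2', 'id1', 'id1', 'id2'],
                              'values': [10, 20, 10, 30, 40]})
    df_id = pd.DataFrame({'ids': ['id1', 'id2', 'id3'],
                          'desc': ['a product', 'a product', 'a product']})

    # Step 1: most frequent value per id; mode() sorts ties, [0] takes the first
    most_freq = df_values.groupby('ids').agg(lambda x: pd.Series.mode(x)[0])['values'].to_dict()

    # Step 2: every id from df_id mapped to None
    dict_val = dict.fromkeys(df_id['ids'])

    # Step 3: overwrite None wherever a most frequent value exists
    dict_val.update(most_freq)

    print(dict_val)  # {'id1': 10, 'id2': 20, 'id3': None}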
