Efficient way to manipulate pandas dataframe values
I am dealing with 2 huge dataframes and I need to retrieve, for each unique id in the first dataframe, the most frequent value associated with it in the second one. I will explain it better with an example.
Suppose I have two dataframes; the first one will be called df_id, the second one df_values.
df_values:
ids values
0 id1 10
1 id2 20
2 id1 10
3 id1 30
4 id2 40
...
df_id:
ids desc
0 id1 a product
1 id2 a product
2 id3 a product
In the real dataframes I have additional columns, but for the sake of clarity I omitted them.
Now, df_id contains references to all the ids I need, while df_values contains the (multiple) values associated with each id.
My goal is to create a dictionary that, for each distinct id, reports the most frequent value. If an id does not appear in df_values, I write None instead of a value.
values_dict
{'id1': 10, 'id2': 20, 'id3': None}
My attempt to solve this was something like this:
from collections import Counter

import numpy as np
import pandas as pd

def Most_Common(lst):
    data = Counter(lst)
    return data.most_common(1)[0][0]

dict_val = {}
for ar in pd.unique(df_id['ids']):
    df_art = df_values.loc[df_values['ids'] == ar]
    print("Done", ar)
    val = Most_Common(df_art['values']) if not df_art.empty else None
    dict_val[ar] = val
The Most_Common function comes from this answer.
However, this solution does not seem to scale well when my dataframes are VERY large, which is my case. I even tried to use the multiprocessing library:
from collections import Counter
import multiprocessing as mp

import numpy as np

dict_val = {}
unique_ids = pd.unique(df_id["ids"])
unique_ids = np.array_split(unique_ids, 5)

def register_value(ids):
    for ar in ids:
        df_art = df_values.loc[df_values['ids'] == ar]
        print("Done", ar)
        val = Most_Common(df_art['values']) if not df_art.empty else None
        dict_val[ar] = val

with mp.Pool(processes = 5) as p:
    p.map(register_value, unique_ids)
But the code still runs for about 40 minutes... maybe I have done something wrong. Is there an efficient solution, one that can also be applied with multiprocessing? Thank you.
My suggestion:
Find most frequent value in df_values, grouped by ids. If there is more than one most frequent value, take the first one:
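A sketch of this step, assuming the ids/values column names from the question (Series.mode returns the tied most-frequent values in ascending order, so [0] picks the first):

```python
import pandas as pd

df_values = pd.DataFrame({
    'ids': ['id1', 'id2', 'id1', 'id1', 'id2'],
    'values': [10, 20, 10, 30, 40],
})

# One groupby pass over df_values instead of one filter per id
most_freq = (df_values.groupby('ids')['values']
             .agg(lambda s: s.mode()[0])
             .to_dict())
print(most_freq)  # {'id1': 10, 'id2': 20}
```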
Create a dictionary from the ids in df_id and assign None to each key:
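For instance, using the df_id from the question:

```python
import pandas as pd

df_id = pd.DataFrame({
    'ids': ['id1', 'id2', 'id3'],
    'desc': ['a product', 'a product', 'a product'],
})

# Every id starts out mapped to None
values_dict = dict.fromkeys(df_id['ids'], None)
print(values_dict)  # {'id1': None, 'id2': None, 'id3': None}
```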
Update the second dictionary with the first one. All None values will be replaced with the most frequent value:
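Continuing the sketch, with the two dictionaries produced by the previous steps hard-coded here for illustration:

```python
# Results of the two steps above
most_freq = {'id1': 10, 'id2': 20}
values_dict = {'id1': None, 'id2': None, 'id3': None}

# dict.update overwrites the None placeholder for every id that has values;
# ids absent from df_values (here id3) keep None
values_dict.update(most_freq)
print(values_dict)  # {'id1': 10, 'id2': 20, 'id3': None}
```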
Output:
{'id1': 10, 'id2': 20, 'id3': None}
It's only a sample, so it's hard to guess how much it will improve performance. It should be much faster, since we aren't iterating over all the elements. Please check and let me know.