What is the fastest way to create a DataFrame of combinations in a vectorized manner?
How can I create the master DataFrame through some vectorised process? If that is not possible, what is the most time-efficient method (I am not concerned about memory) to perform this operation?
Can the for-loop be replaced with something more efficient?
As you can see, the number of combinations grows very quickly, so I need a fast way to produce this DataFrame.
Please see a minimal reproducible example below:
%%time
import pandas as pd
import string
import numpy as np
from itertools import combinations

# create dummy data
cols = list(string.ascii_uppercase)
dummy = pd.DataFrame()
for col in cols:
    dummy = dummy.append([[col, 0] + np.random.randint(2, 100, size=(1, 10)).tolist()[0]])
    dummy = dummy.append([[col, 1] + np.random.randint(2, 100, size=(1, 10)).tolist()[0]])
    dummy = dummy.append([[col, 2] + np.random.randint(2, 100, size=(1, 10)).tolist()[0]])
dummy.columns=['name', 'id', 'v1', 'v2', 'v3', 'v4', 'v5', 'v1', 'v6', 'v7', 'v8', 'v9']

# create all possible unique combinations
combos = list(combinations(cols, 2))

# generate DataFrame with all combinations
master = pd.DataFrame()
for i, combo in enumerate(combos):
    A = dummy[dummy.name == combo[0]]
    B = dummy[dummy.name == combo[1]]
    joined = pd.merge(A, B, on=["id"], suffixes=('_A', '_B'))
    joined = joined.sort_values("id")
    joined['pair_id'] = i
    master = pd.concat([master, joined])
Output:
CPU times: total: 1.8 s
Wall time: 1.8 s
Thanks!
Comments (1)
Since your data has a regular structure, you can drop down to NumPy to take advantage of vectorized operations.
Original code: 4 s. New code: 7 ms.
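
The answer's own code is not reproduced here, so the following is only a minimal sketch of one way the NumPy idea could look, not necessarily what the answerer wrote. It assumes every name has exactly the same set of ids, and it uses unique value column names v1 to v10 (the question's column list repeats v1). The helper name build_master is hypothetical. The values are reshaped into a (names, ids, values) array, and np.triu_indices picks all name pairs in the same order as itertools.combinations.

```python
import string

import numpy as np
import pandas as pd

# Dummy data similar to the question's, built without the deprecated DataFrame.append.
# Assumption: unique value column names v1..v10 (the question repeats 'v1').
rng = np.random.default_rng(0)
name_list = list(string.ascii_uppercase)
value_cols = [f"v{k}" for k in range(1, 11)]
dummy = pd.DataFrame(
    rng.integers(2, 100, size=(len(name_list) * 3, len(value_cols))),
    columns=value_cols,
)
dummy.insert(0, "id", np.tile(np.arange(3), len(name_list)))
dummy.insert(0, "name", np.repeat(name_list, 3))


def build_master(df, value_cols):
    """Hypothetical helper: pair every two names on 'id' without a Python loop.

    Assumes each name appears with exactly the same set of ids.
    """
    ordered = df.sort_values(["name", "id"])
    n_ids = ordered["id"].nunique()
    if len(ordered) % n_ids != 0:
        raise ValueError("every name must have the same set of ids")
    n_names = len(ordered) // n_ids
    names = ordered["name"].to_numpy()[::n_ids]  # one entry per name, sorted
    ids = ordered["id"].to_numpy()[:n_ids]       # ids of the first name, sorted

    # Shape (n_names, n_ids, n_values): one block of rows per name.
    values = ordered[value_cols].to_numpy().reshape(n_names, n_ids, len(value_cols))

    # All index pairs (a, b) with a < b, in the same order as combinations().
    a_idx, b_idx = np.triu_indices(n_names, k=1)
    n_pairs = len(a_idx)

    # Fancy indexing selects the blocks for each pair; flatten to rows.
    left = pd.DataFrame(
        values[a_idx].reshape(n_pairs * n_ids, -1),
        columns=[f"{c}_A" for c in value_cols],
    )
    right = pd.DataFrame(
        values[b_idx].reshape(n_pairs * n_ids, -1),
        columns=[f"{c}_B" for c in value_cols],
    )
    left.insert(0, "id", np.tile(ids, n_pairs))
    left.insert(0, "name_A", np.repeat(names[a_idx], n_ids))
    right.insert(0, "name_B", np.repeat(names[b_idx], n_ids))

    master = pd.concat([left, right], axis=1)
    master["pair_id"] = np.repeat(np.arange(n_pairs), n_ids)
    return master


master = build_master(dummy, value_cols)
```

With unique column names, the result should have the same column order as the merge-based loop in the question (name_A, id, the _A values, name_B, the _B values, pair_id). Timings depend on the machine and pandas version, so it is worth measuring both versions on your own data.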