What is the fastest vectorised way to create a DataFrame of combinations?

Posted on 2025-01-23 14:41:16

How can I create the master DataFrame through some vectorised process? If that is not possible, what is the most time-efficient way to do this? Memory usage is not a concern.

Can the for-loop be replaced with something more efficient?

As you can see, the number of combinations grows very quickly, so I need a fast way to build this DataFrame.
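
To make that growth concrete: the number of unordered pairs of n names is n(n-1)/2, so the loop body runs 325 times for the 26 names used here and grows quadratically from there. A minimal sketch using the standard library's `math.comb`; the helper name `pair_count` and the larger sizes are illustrative assumptions, not part of the original post:

```python
from math import comb


def pair_count(n_names: int) -> int:
    """Number of unordered pairs of n_names items, i.e. n * (n - 1) / 2."""
    return comb(n_names, 2)


# 26 is the case in this post; 100 and 1000 are hypothetical sizes.
for n in (26, 100, 1000):
    print(n, pair_count(n))
```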

Please see the minimal reproducible example below:

%%time

import pandas as pd
import string
import numpy as np
from itertools import combinations

# create dummy data: three rows (id 0, 1, 2) per name, ten random values each
cols = list(string.ascii_uppercase)
rows = []
for col in cols:
    for id_ in range(3):
        # DataFrame.append was removed in pandas 2.0, so collect the rows in a list first
        rows.append([col, id_] + np.random.randint(2, 100, size=10).tolist())
# note: 'v1' is listed twice, so the frame has a duplicate column label
dummy = pd.DataFrame(rows, columns=['name', 'id', 'v1', 'v2', 'v3', 'v4', 'v5', 'v1', 'v6', 'v7', 'v8', 'v9'])

# create all possible unique combinations
combos = list(combinations(cols, 2))

# generate DataFrame with all combinations
master = pd.DataFrame()
for i, combo in enumerate(combos):
    A = dummy[dummy.name == combo[0]]
    B = dummy[dummy.name == combo[1]]
    joined = pd.merge(A, B, on=["id"], suffixes=('_A', '_B'))
    joined = joined.sort_values("id")
    joined['pair_id'] = i
    master = pd.concat([master, joined])

Output:

CPU times: total: 1.8 s
Wall time: 1.8 s

Thanks!
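
One way the loop in the question could be replaced while staying in pandas is a single self-merge on `id`, followed by a filter that keeps each unordered pair once. The sketch below rests on assumptions that are not in the original post: the helper names `make_dummy` and `all_pairs_by_merge` are hypothetical, the value columns are given unique labels `v1` to `v10`, and the random data is seeded. The self-merge builds every ordered pairing before filtering, so it uses more memory than the loop, which the question says is acceptable. No timing is claimed here.

```python
import string

import numpy as np
import pandas as pd


def make_dummy(seed: int = 0) -> pd.DataFrame:
    """Dummy data shaped like the question's: 26 names x 3 ids x 10 values."""
    rng = np.random.default_rng(seed)
    names = list(string.ascii_uppercase)
    value_cols = [f"v{k}" for k in range(1, 11)]  # assumption: unique labels
    rows = [
        [name, id_] + rng.integers(2, 100, size=10).tolist()
        for name in names
        for id_ in range(3)
    ]
    return pd.DataFrame(rows, columns=["name", "id"] + value_cols)


def all_pairs_by_merge(dummy: pd.DataFrame) -> pd.DataFrame:
    """Pair every two names on matching id with one merge instead of a loop."""
    # Self-merge on id: yields every ordered (name_A, name_B) pairing,
    # including A == B and the mirrored B/A rows.
    joined = dummy.merge(dummy, on="id", suffixes=("_A", "_B"))
    # Keep each unordered pair once (same pairs as itertools.combinations
    # applied to the sorted names).
    joined = joined[joined["name_A"] < joined["name_B"]]
    joined = joined.sort_values(["name_A", "name_B", "id"], ignore_index=True)
    # Number the pairs 0..n_pairs-1 in sorted (name_A, name_B) order.
    joined["pair_id"] = joined.groupby(["name_A", "name_B"], sort=True).ngroup()
    return joined


dummy = make_dummy()
master = all_pairs_by_merge(dummy)
print(master.shape)  # (975, 24): 325 pairs x 3 ids
```

The result has the same column layout as the loop's output (`name_A`, `id`, the `_A` values, `name_B`, the `_B` values, `pair_id`), so it can be compared against the loop with `pd.testing.assert_frame_equal` after resetting the index.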

Comments (1)

奈何桥上唱咆哮 2025-01-30 14:41:16

Since your data has a regular structure (every name has the same ids and the same value columns), you can drop down to NumPy to take advantage of vectorized operations.

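# imports repeated here so the snippet runs on its own
import string
from itertools import combinations

import numpy as np
import pandas as pd

names = list(string.ascii_uppercase)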
ids = [0, 1, 2]
columns = pd.Series(["v1", "v2", "v3", "v4", "v5", "v1", "v6", "v7", "v8", "v9"])

# Generate the random data
data = np.random.randint(2, 100, (len(names), len(ids), len(columns)))

# Pair data for every 2-combination of names
arr = [np.hstack([data[i], data[j]]) for i,j in combinations(range(len(names)), 2)]

# Assembling the data to final dataframe
idx = pd.MultiIndex.from_tuples([
    (p,a,b,i) for p, (a, b) in enumerate(combinations(names,2)) for i in ids
], names=["pair_id", "name_A", "name_B", "id"])
cols = pd.concat([columns + "_A", columns + "_B"])

master = pd.DataFrame(np.vstack(arr), index=idx, columns=cols)

Original code: 4s. New code: 7ms
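
The answer's snippet generates fresh random data and still loops over the pairs in a Python list comprehension. As a rough sketch of how the same idea might be applied to an existing DataFrame with no Python-level loop over pairs, the code below reshapes the value columns into a 3-D array and pairs the blocks with index arrays from `np.triu_indices`. Several things here are assumptions rather than part of the answer: the function name `pair_rows_vectorised` is hypothetical, every name must have the same set of ids, the value columns must be numeric, and the value column labels must be unique (here `v1` to `v10`). No timing is claimed for this variant.

```python
import string

import numpy as np
import pandas as pd


def pair_rows_vectorised(dummy: pd.DataFrame, value_cols: list) -> pd.DataFrame:
    """Pair the rows of every two names on matching id using NumPy indexing."""
    # Group rows by name, then id; assumes every name has the same ids.
    ordered = dummy.sort_values(["name", "id"])
    names = np.asarray(ordered["name"].unique())
    ids = np.asarray(ordered["id"].unique())

    # Shape: (n_names, n_ids, n_values)
    data = ordered[value_cols].to_numpy().reshape(
        len(names), len(ids), len(value_cols)
    )

    # Index arrays for every pair i < j, in the same order as
    # itertools.combinations(range(n_names), 2).
    i, j = np.triu_indices(len(names), k=1)

    # Fancy indexing pairs all blocks at once:
    # shape (n_pairs, n_ids, 2 * n_values)
    paired = np.concatenate([data[i], data[j]], axis=2)

    idx = pd.MultiIndex.from_arrays(
        [
            np.repeat(np.arange(len(i)), len(ids)),
            np.repeat(names[i], len(ids)),
            np.repeat(names[j], len(ids)),
            np.tile(ids, len(i)),
        ],
        names=["pair_id", "name_A", "name_B", "id"],
    )
    cols = [f"{c}_A" for c in value_cols] + [f"{c}_B" for c in value_cols]
    return pd.DataFrame(
        paired.reshape(-1, 2 * len(value_cols)), index=idx, columns=cols
    )


# Demo data shaped like the question's (seeded, unique column labels).
rng = np.random.default_rng(0)
value_cols = [f"v{k}" for k in range(1, 11)]
rows = [
    [name, id_] + rng.integers(2, 100, size=10).tolist()
    for name in string.ascii_uppercase
    for id_ in range(3)
]
dummy = pd.DataFrame(rows, columns=["name", "id"] + value_cols)

master = pair_rows_vectorised(dummy, value_cols)
print(master.shape)  # (975, 20): 325 pairs x 3 ids, 10 + 10 value columns
```

As in the answer, the pair and name information lives in the index; calling `master.reset_index()` turns it into ordinary columns if a flat layout like the question's is preferred.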
