Is it possible to speed up appending files in pandas based on the generated combinations?



I need some help figuring out a way to increase the performance of the Python code below.

I am calculating the number of duplicates between a group of Excel files based on certain combinations. To achieve this, I have used a Python script that generates all possible combinations in which 10 files can be grouped into 5 groups. Then, based on the generated combinations, the script appends those files and calculates the number of duplicates in each group.

Example:

Sample output of the entire script (the combinations of 10 files grouped into 5 groups are only a sample):

The first part of my script generates all possible combinations in which the 10 Excel files can be grouped together into 5 groups, for example:

Combination 1: (File 1) (File 2) (File 3) (File 4) (File 5 File 6 File 7 File 8 File 9 File 10)
Combination 2: (File 1 File 2) (File 3) (File 4) (File 5) (File 6 File 7 File 8 File 9 File 10)
Combination 3: etc...
def sorted_k_partitions(seq, k):
    """Yield every way to split `seq` into exactly `k` non-empty groups."""
    n = len(seq)
    groups = []  # a list of lists, currently empty

    def generate_partitions(i):
        if i >= n:
            yield list(map(tuple, groups))
        else:
            if n - i > k - len(groups):
                for group in groups:
                    group.append(seq[i])
                    yield from generate_partitions(i + 1)
                    group.pop()

            if len(groups) < k:
                groups.append([seq[i]])
                yield from generate_partitions(i + 1)
                groups.pop()

    result = generate_partitions(0)
    return result
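
For reference, this is a minimal sketch of how that generator can be called; the file names below are placeholder assumptions (file1.xlsx to file10.xlsx), not the real files:

# Minimal usage sketch: list every way the 10 files can be split into 5 groups.
# The file names here are placeholders for illustration only.
files = [f"file{i}.xlsx" for i in range(1, 11)]

partitions = list(sorted_k_partitions(files, 5))
print(len(partitions))   # how many 5-group combinations exist for 10 files
print(partitions[0])     # one combination: a list of 5 tuples of file names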

The second part of my code appends the files and calculates the total number of duplicates for each combination:

import pandas as pd

grouplist = pd.DataFrame()  # collects one result row per combination
# seq is the list of the 10 Excel file paths

for k in 1, 5:  # k is the number of groups I want to distribute the files into
    for groups in sorted_k_partitions(seq, k):  # sorted_k_partitions yields the combinations
        groupnumberofduplicates = 0
        for group in groups:  # groups is one combination (e.g. Combination 1); group is one group within it
            ids = pd.DataFrame()
            for file in group:
                ids = ids.append(pd.read_excel(file))
            numberofduplicates = ids.duplicated().sum()
            groupnumberofduplicates = groupnumberofduplicates + numberofduplicates
        grouplist = grouplist.append({"k": k, "groupname": groups, "groupnumberofduplicates": groupnumberofduplicates}, ignore_index=True)

Taking into consideration that each file has 1 column with 100,000 rows, the script is really slow, but it is working. It gets slow in the second part of the code shown above, mainly because there are many combinations and the script goes into each one to get the number of duplicates. The second reason is that the number of rows in each file is huge. Is there a way to make this faster?
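
For concreteness, here is a minimal sketch of one direction I have in mind, assuming all 10 files fit in memory so each file is read only once and the cached frames are reused for every combination (the frames dict and results list are illustrative, not part of my current script):

# Sketch of one possible speed-up, assuming every file fits in memory:
# read each Excel file once up front, then reuse the cached frames for
# every combination instead of calling pd.read_excel inside the loops.
import pandas as pd

frames = {file: pd.read_excel(file) for file in seq}  # one read per file

results = []
for k in 1, 5:
    for groups in sorted_k_partitions(seq, k):
        groupnumberofduplicates = 0
        for group in groups:
            ids = pd.concat([frames[file] for file in group])  # reuse cached frames
            groupnumberofduplicates += ids.duplicated().sum()
        results.append({"k": k,
                        "groupname": groups,
                        "groupnumberofduplicates": groupnumberofduplicates})

grouplist = pd.DataFrame(results)  # build the result frame once at the end

This keeps the per-combination work down to concatenating the cached frames and counting duplicates, but the number of combinations itself is unchanged.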

Thank you, and I really appreciate anyone who can help.

Update: the number of rows is no longer an issue for the speed; I found a way to fix that. The loops are now the only reason for the long processing time.
