当前位置：文江博客话题详情

Python optimization dataframe

如何制作一个巨大的循环以检查数据帧中的速度更快地在Mac中运行

发布于 2025-02-08 16:34:23 字数 1288 浏览 2 评论 0 原文

我需要通过检查具有11 mil行的数据框中的三个条件来计算一个巨大的表值（157954行和365列）。您是否有任何方法可以加快计算时间，现在花费了10个多小时？

我总共有367个站。

for station in stations:
    no_pickup_array = []
    for time_point in data_matrix['Timestamp']:
        time_point_2 = time_point + timedelta(minutes=15)
        no_pickup = len(dataframe[(time_point <= dataframe["departure"]) & (dataframe["departure"] < time_point_2) 
                        & (dataframe['departure_name'] == station)])
        no_pickup_array.append(no_pickup)
    print(f"Station name: {station}")
    data_matrix[station] = no_pickup_array

感谢您的任何帮助。

@ to all：谢谢您的评论，我为我的问题添加了更多信息。

每行数据框架都是每辆租赁自行车的信息。我想创建一个矩阵，每个车站都会在每15分钟的间隔内拾取自行车的数量。然后，我还想计算平均速度，平均时间。

@jérômeRichard的解决方案可以减少计算数量，但是我仍然很难理解和实施索引步骤并应用对数搜索或二进制搜索。

 index = {name: df for name, df.sort_values('departure')['departure'].to_numpy() in dataframe.groupby('departure_name')}
# code @Jérôme Richard recommended

原文

I need to calculate a huge table value (157954 rows and 365 columns) by checking three conditions in a dataframe with 11 mil rows. Do you have any way to speed up the calculation, which is taking more than 10 hours now?

I have 367 stations in total.

for station in stations:
    no_pickup_array = []
    for time_point in data_matrix['Timestamp']:
        time_point_2 = time_point + timedelta(minutes=15)
        no_pickup = len(dataframe[(time_point <= dataframe["departure"]) & (dataframe["departure"] < time_point_2) 
                        & (dataframe['departure_name'] == station)])
        no_pickup_array.append(no_pickup)
    print(f"Station name: {station}")
    data_matrix[station] = no_pickup_array

I appreciate any of your help.

@ To all: Thank you for your comments, I add more info for my problem.

Each row of dataframe is info of each renting bike. I want to create a matrix with number of bikes picked up at each station for each 15 minutes interval. Then I also want to calculate the average speed, average time,.. as well.

The solution from @Jérôme Richard could reduce the number of calculations, but I still struggle to understand and implement indexing steps and apply logarithmic search or binary search.

 index = {name: df for name, df.sort_values('departure')['departure'].to_numpy() in dataframe.groupby('departure_name')}
# code @Jérôme Richard recommended

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

清醇 2025-02-15 16:34:23

主要问题是 no_pickup 分配表达式的右手侧面，该表达式算法效率低下，因为它可以进行 linear search ，而a 对数搜索是可能的。

首先要做的是 dataframe 进行 groupby ，以构建 index 启用具有给定名称的数据框子集。然后，您可以通过出发进行 sort 每个数据帧子集，以便能够执行二进制搜索使您能够知道适合条件的项目数量。

可以使用类似的内容构建索引：

index = {name: df for name, df.sort_values('departure')['departure'].to_numpy() in dataframe.groupby('departure_name')}

最后，您可以使用两个 np.searchsorted on index [station] ：一个知道起始索引和一个知道结局索引。您可以通过简单的两个减法来获得长度。

请注意，您可能需要进行一些调整，因为我不确定以上代码在您的数据集中是否有效，但是如果没有代码生成输入的示例，就很难知道。

The main problem is the right-hand-side of the no_pickup assignment expression which is algorithmically inefficient because it makes a linear search while a logarithmic search is possible.

The first thing to do is to do a groupby of dataframe so to build an index enabling to fetch the dataframe subset having a given name. Then, you can sort each dataframe subset by departure so to be able to perform a binary search enabling you to know the number of item fitting the condition.

The index can be built with something like:

index = {name: df for name, df.sort_values('departure')['departure'].to_numpy() in dataframe.groupby('departure_name')}

Finally, you can do the binary search with two np.searchsorted on index[station]: one to know the starting index and one to know the ending index. You can get the length with a simple subtraction of the two.

Note that you may need some tweak since I am not sure the above code will works on your dataset but it is hard to know without an example of code generating the inputs.

回复收藏 0 原文

溺ぐ爱和你が 2025-02-15 16:34:23

您正在用布尔值（将是零或一个，因此您只会获得第一个或第二个元素的长度），而不是数字。它将像这样进行评估：

len(dataframe[(time_point <= dataframe["departure"]) & (dataframe["departure"] < time_point_2) & (dataframe['departure_name'] == station)])
len(dataframe[True & False & True]) # let's just say the variables work out like this
len(dataframe[False])
len(dataframe[0])

这可能不是您所追求的行为。（让我知道您在评论中要做什么，我将尝试提供更多帮助。）

专门的代码速度，＆amp; 是在Python中的“和”布尔运算符以和，或以及而不是而写出。在这里使用和将加快您的代码，因为Python仅评估需要的部分布尔表达式，例如

from time import sleep
def slow_function():
    sleep(3)
    return False

# This line doesn't take 3 seconds to run as you may expect.
# Python sees "False and" and is smart enough to realize that whatever comes after is irrelevant. 
# No matter what comes after "False and", it's never going to make the first half True.
# So, python doesn't bother evaluating it, and saves 3 seconds in the process.
False and slow_function()

# Some more examples that show python doesn't evaluate the right half unless it needs to
False and print("hi")
False and asdfasdfasdf
False and 42/0

# The same does not happen here. These are bitwise operators, expected to be applied to numbers.
# While it does produce the correct result for boolean inputs, it's going to be slower,
# since it can't rely on the same optimization.
False & slow_function()

# Of course, both of these still take 3 seconds, since the right half has to be evaluated either way.
True and slow_function()
True & slow_function()

You're indexing the dataframe list with a boolean (which will be zero or one, so you're only ever going to get the length of the first or second element) instead of a number. It's going to get evaluated like so:

len(dataframe[(time_point <= dataframe["departure"]) & (dataframe["departure"] < time_point_2) & (dataframe['departure_name'] == station)])
len(dataframe[True & False & True]) # let's just say the variables work out like this
len(dataframe[False])
len(dataframe[0])

This probably isn't the behavior you're after. (let me know what you're trying to do in a comment and I'll try to help out more.)

In terms of code speed specifically, & is bitwise "AND", in python the boolean operators are written out as and, or, and not. Using and here would speed up your code, since python only evaluates parts of boolean expressions where they're needed, e.g.

from time import sleep
def slow_function():
    sleep(3)
    return False

# This line doesn't take 3 seconds to run as you may expect.
# Python sees "False and" and is smart enough to realize that whatever comes after is irrelevant. 
# No matter what comes after "False and", it's never going to make the first half True.
# So, python doesn't bother evaluating it, and saves 3 seconds in the process.
False and slow_function()

# Some more examples that show python doesn't evaluate the right half unless it needs to
False and print("hi")
False and asdfasdfasdf
False and 42/0

# The same does not happen here. These are bitwise operators, expected to be applied to numbers.
# While it does produce the correct result for boolean inputs, it's going to be slower,
# since it can't rely on the same optimization.
False & slow_function()

# Of course, both of these still take 3 seconds, since the right half has to be evaluated either way.
True and slow_function()
True & slow_function()

回复收藏 0 原文

~没有更多了~