我需要通过检查具有11 mil行的数据框中的三个条件来计算一个巨大的表值(157954行和365列)。您是否有任何方法可以加快计算时间,现在花费了10个多小时?
我总共有367个站。
for station in stations:
no_pickup_array = []
for time_point in data_matrix['Timestamp']:
time_point_2 = time_point + timedelta(minutes=15)
no_pickup = len(dataframe[(time_point <= dataframe["departure"]) & (dataframe["departure"] < time_point_2)
& (dataframe['departure_name'] == station)])
no_pickup_array.append(no_pickup)
print(f"Station name: {station}")
data_matrix[station] = no_pickup_array
感谢您的任何帮助。
@ to all:谢谢您的评论,我为我的问题添加了更多信息。
每行数据框架都是每辆租赁自行车的信息。我想创建一个矩阵,每个车站都会在每15分钟的间隔内拾取自行车的数量。然后,我还想计算平均速度,平均时间。
@jérômeRichard的解决方案可以减少计算数量,但是我仍然很难理解和实施索引步骤并应用对数搜索或二进制搜索。
index = {name: df for name, df.sort_values('departure')['departure'].to_numpy() in dataframe.groupby('departure_name')}
# code @Jérôme Richard recommended
I need to calculate a huge table value (157954 rows and 365 columns) by checking three conditions in a dataframe with 11 mil rows. Do you have any way to speed up the calculation, which is taking more than 10 hours now?

I have 367 stations in total.
for station in stations:
no_pickup_array = []
for time_point in data_matrix['Timestamp']:
time_point_2 = time_point + timedelta(minutes=15)
no_pickup = len(dataframe[(time_point <= dataframe["departure"]) & (dataframe["departure"] < time_point_2)
& (dataframe['departure_name'] == station)])
no_pickup_array.append(no_pickup)
print(f"Station name: {station}")
data_matrix[station] = no_pickup_array
I appreciate any of your help.
@ To all: Thank you for your comments, I add more info for my problem.

Each row of dataframe is info of each renting bike. I want to create a matrix with number of bikes picked up at each station for each 15 minutes interval. Then I also want to calculate the average speed, average time,.. as well.
The solution from @Jérôme Richard could reduce the number of calculations, but I still struggle to understand and implement indexing steps and apply logarithmic search or binary search.
index = {name: df for name, df.sort_values('departure')['departure'].to_numpy() in dataframe.groupby('departure_name')}
# code @Jérôme Richard recommended
发布评论
评论(2)
主要问题是
no_pickup
分配表达式的右手侧面,该表达式算法效率低下,因为它可以进行 linear search ,而a 对数搜索是可能的。首先要做的是
dataframe
进行groupby
,以构建 index 启用具有给定名称的数据框子集。然后,您可以通过出发进行 sort 每个数据帧子集,以便能够执行二进制搜索使您能够知道适合条件的项目数量。可以使用类似的内容构建索引:
最后,您可以使用两个
np.searchsorted
onindex [station]
:一个知道起始索引和一个知道结局索引。您可以通过简单的两个减法来获得长度。请注意,您可能需要进行一些调整,因为我不确定以上代码在您的数据集中是否有效,但是如果没有代码生成输入的示例,就很难知道。
The main problem is the right-hand-side of the
no_pickup
assignment expression which is algorithmically inefficient because it makes a linear search while a logarithmic search is possible.The first thing to do is to do a
groupby
ofdataframe
so to build an index enabling to fetch the dataframe subset having a given name. Then, you can sort each dataframe subset by departure so to be able to perform a binary search enabling you to know the number of item fitting the condition.The index can be built with something like:
Finally, you can do the binary search with two
np.searchsorted
onindex[station]
: one to know the starting index and one to know the ending index. You can get the length with a simple subtraction of the two.Note that you may need some tweak since I am not sure the above code will works on your dataset but it is hard to know without an example of code generating the inputs.
您正在用布尔值(将是零或一个,因此您只会获得第一个或第二个元素的长度),而不是数字。它将像这样进行评估:
这可能不是您所追求的行为。 (让我知道您在评论中要做什么,我将尝试提供更多帮助。)
专门的代码速度,
&amp;
是在Python中的“和”布尔运算符以和
,或
以及而不是
而写出。在这里使用和
将加快您的代码,因为Python仅评估需要的部分布尔表达式,例如You're indexing the dataframe list with a boolean (which will be zero or one, so you're only ever going to get the length of the first or second element) instead of a number. It's going to get evaluated like so:
This probably isn't the behavior you're after. (let me know what you're trying to do in a comment and I'll try to help out more.)
In terms of code speed specifically,
&
is bitwise "AND", in python the boolean operators are written out asand
,or
, andnot
. Usingand
here would speed up your code, since python only evaluates parts of boolean expressions where they're needed, e.g.