Create a lap time column given total elapsed time

Posted 2025-02-10 23:38:35

I have four different people who each ran laps, and I need to calculate the lap time for each lap for each person. I am given the total elapsed time from the very beginning, measured at the end of each lap. What kind of PySpark/SQL/Pandas syntax could I use to calculate the lap times efficiently?

Example:

Each row represents one person running one lap.

PersonID | Total Time Elapsed (sec) | Lap Time (sec)
-------- | ------------------------ | --------------
1        | 200                      | 200
1        | 300                      | 100
1        | 550                      | 250
2        | 100                      | 100
2        | 150                      | 50
2        | 250                      | 100
3        | 150                      | 150
3        | 500                      | 350
4        | 100                      | 100
4        | 300                      | 200
4        | 350                      | 50
4        | 460                      | 110

I need to create the third column, Lap Time, using only the first two columns. I could write a for loop that would eventually work, but what is the best/most efficient PySpark/SQL/Pandas way to get there?

Comments (2)

还不是爱你 2025-02-17 23:38:35

Window functions could do it.

  • PySpark:

    from pyspark.sql import SparkSession, functions as F, Window as W

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, 200), (1, 300), (1, 550), (2, 100), (2, 150), (2, 250),
         (3, 150), (3, 500), (4, 100), (4, 300), (4, 350), (4, 460)],
        ['PersonID', 'Total_Time_Elapsed'])

    # Per-person window, ordered by the cumulative elapsed time
    w = W.partitionBy('PersonID').orderBy('Total_Time_Elapsed')
    df = df.withColumn(
        'Lap_Time',
        F.coalesce(
            # current total minus previous total within the window...
            F.col('Total_Time_Elapsed') - F.lag('Total_Time_Elapsed').over(w),
            # ...falling back to the total itself for each person's first lap
            'Total_Time_Elapsed'))
    df.show()
    # +--------+------------------+--------+
    # |PersonID|Total_Time_Elapsed|Lap_Time|
    # +--------+------------------+--------+
    # |       1|               200|     200|
    # |       1|               300|     100|
    # |       1|               550|     250|
    # |       2|               100|     100|
    # |       2|               150|      50|
    # |       2|               250|     100|
    # |       3|               150|     150|
    # |       3|               500|     350|
    # |       4|               100|     100|
    # |       4|               300|     200|
    # |       4|               350|      50|
    # |       4|               460|     110|
    # +--------+------------------+--------+
    
  • SQL:

    SELECT
        PersonID,
        Total_Time_Elapsed,
        COALESCE(
            Total_Time_Elapsed - LAG(Total_Time_Elapsed) OVER (PARTITION BY PersonID ORDER BY Total_Time_Elapsed),
            Total_Time_Elapsed) AS Lap_Time
    FROM df
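
    For completeness: to run the SQL version from a PySpark session, the DataFrame first has to be registered as a temporary view. A minimal sketch, assuming the df built in the PySpark snippet above, with sql_query a hypothetical Python string holding the SELECT statement:

        # Register the DataFrame under the name `df` referenced in the FROM clause
        df.createOrReplaceTempView('df')
        # sql_query is assumed to hold the SELECT statement shown above
        spark.sql(sql_query).show()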
    
居里长安 2025-02-17 23:38:35

If you are looking for a pandas solution, one of the approaches could be:

import pandas as pd

data = [[1, 200], [1, 300], [1, 550], [2, 100], [2, 150], [2, 250],
        [3, 150], [3, 500], [4, 100], [4, 300], [4, 350], [4, 460]]
df = pd.DataFrame(data, columns=['PersonID', 'Total Time Elapsed (sec)'])
print(df.head(50))

# Perform groupby on PersonID
grouped_df = df.groupby(by='PersonID')
series_list = []

for name, grp in grouped_df:
    result = grp['Total Time Elapsed (sec)'].diff()  # Difference between current and previous row
    result = result.fillna(grp['Total Time Elapsed (sec)'])  # First lap has no previous row; use the elapsed value itself
    series_list.extend(result.values)

df['Lap Time (sec)'] = series_list
df['Lap Time (sec)'] = df['Lap Time (sec)'].astype(int)  # diff() yields floats; cast back to int
print(df.head(50))
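
For what it's worth, the explicit loop can be avoided: groupby().diff() already operates within each group, so the same column can be built in one vectorized expression. A minimal sketch, assuming the same df as above:

# Vectorized variant: diff() runs per PersonID group, and the first lap of
# each person (NaN after diff) falls back to that row's total elapsed time.
df['Lap Time (sec)'] = (
    df.groupby('PersonID')['Total Time Elapsed (sec)']
      .diff()
      .fillna(df['Total Time Elapsed (sec)'])
      .astype(int)
)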