Find the time difference after subtracting weekends

Posted on 2025-02-11 08:47:27

ID | START_TIME_             | END_TIME_               | TIME_DIFF | TIME_SPENT_WORKING_DAYS
1  | 2022-04-18 09:58:38.906 | 2022-04-18 12:50:27.204 | 10308.298 |
2  | 2022-04-18 21:26:25.671 | 2022-04-18 21:26:36.313 | 10.642    |
3  | 2022-04-18 21:27:12.388 | 2022-04-18 21:27:27.227 | 14.839    |
4  | 2022-04-19 09:57:40.453 | 2022-04-19 09:59:07.612 | 87.159    |
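
(TIME_DIFF is in seconds: for example, for ID 1, 2022-04-18 12:50:27.204 - 2022-04-18 09:58:38.906 = 2 h 51 min 48.298 s = 10308.298 s.)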

TIME_DIFF = END_TIME_ - START_TIME_
TIME_SPENT_WORKING_DAYS = the time difference with the weekends subtracted.

I was able to calculate the difference between START_TIME_ and END_TIME_ to get TIME_DIFF, but can someone help me get the working-day duration by removing the weekends (Saturday/Sunday, i.e. 48 hours)?
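
A minimal sketch of that first step, assuming the data is already loaded into a Spark DataFrame df with the string columns shown above (this snippet is not part of the original question):

from pyspark.sql import functions as F

# Assumes df has string columns START_TIME_ and END_TIME_ as in the table above.
# Casting a timestamp to double yields epoch seconds including the millisecond fraction.
df = df.withColumn(
    'TIME_DIFF',
    F.to_timestamp('END_TIME_').cast('double') - F.to_timestamp('START_TIME_').cast('double')
)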

Comments (1)

仄言 2025-02-18 08:47:27

Input:

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(1, '2022-04-18 09:58:38.906',  '2022-04-18 12:50:27.204'),
     (2, '2022-04-18 21:26:25.671',  '2022-04-18 21:26:36.313'),
     (3, '2022-04-18 21:27:12.388',  '2022-04-18 21:27:27.227'),
     (4, '2022-04-19 09:57:40.453',  '2022-04-29 09:59:07.612')],
    ['ID', 'START_TIME_', 'END_TIME_'])
  • Method using higher-order functions aggregate and filter:

    # For each calendar date between to_date(START_TIME_) and to_date(END_TIME_) that is a
    # weekday (weekday() returns 0 = Monday .. 6 = Sunday, so "< 5" keeps Mon-Fri), accumulate
    # the seconds of the [START_TIME_, END_TIME_] interval that fall on that date.
    df = df.withColumn(
        'TIME_SPENT_WORKING_DAYS',
        F.aggregate(
            F.expr("filter(sequence(to_date(START_TIME_), to_date(END_TIME_)), x -> weekday(x) < 5)"),
            F.lit(0).cast('long'),
            lambda acc, d: acc + F.least(F.date_add(d, 1), F.to_timestamp('END_TIME_')).cast('long')
                           - F.greatest(d, F.to_timestamp('START_TIME_')).cast('long')
        )    
    )
    df.show(truncate=0)
    # +---+-----------------------+-----------------------+-----------------------+
    # |ID |START_TIME_            |END_TIME_              |TIME_SPENT_WORKING_DAYS|
    # +---+-----------------------+-----------------------+-----------------------+
    # |1  |2022-04-18 09:58:38.906|2022-04-18 12:50:27.204|10309                  |
    # |2  |2022-04-18 21:26:25.671|2022-04-18 21:26:36.313|11                     |
    # |3  |2022-04-18 21:27:12.388|2022-04-18 21:27:27.227|15                     |
    # |4  |2022-04-19 09:57:40.453|2022-04-29 09:59:07.612|691287                 |
    # +---+-----------------------+-----------------------+-----------------------+
    

    Syntax for Spark 2.4+

    df = df.withColumn(
        'TIME_SPENT_WORKING_DAYS',
        F.expr("""
            aggregate(
                filter(sequence(to_date(START_TIME_), to_date(END_TIME_)), x -> weekday(x) < 5),
                cast(0 as long),
                (acc, d) -> acc + cast(least(date_add(d, 1), to_timestamp(END_TIME_)) as long)
                            - cast(greatest(d, to_timestamp(START_TIME_)) as long)
            )
        """)
    )
    
  • Method using explode and groupBy:

    # One exploded row per calendar date between to_date(START_TIME_) and to_date(END_TIME_);
    # dayofweek() returns 1 = Sunday .. 7 = Saturday, so ~isin([1, 7]) marks Mon-Fri days.
    # Summing the per-day overlaps gives TIME_DIFF; weighting by the flag gives TIME_SPENT_WORKING_DAYS.
    df = df.withColumn('day', F.explode(F.sequence(F.to_date('START_TIME_'), F.to_date('END_TIME_'))))
    workday = (~F.dayofweek('day').isin([1, 7])).cast('long')
    seconds = F.least(F.date_add('day', 1), F.to_timestamp('END_TIME_')).cast('long') \
              - F.greatest('day', F.to_timestamp('START_TIME_')).cast('long')
    df = df.groupBy('ID', 'START_TIME_', 'END_TIME_').agg(
        F.sum(seconds).alias('TIME_DIFF'),
        F.sum(seconds * workday).alias('TIME_SPENT_WORKING_DAYS')
    )
    
    df.show(truncate=0)
    # +---+-----------------------+-----------------------+---------+-----------------------+
    # |ID |START_TIME_            |END_TIME_              |TIME_DIFF|TIME_SPENT_WORKING_DAYS|
    # +---+-----------------------+-----------------------+---------+-----------------------+
    # |1  |2022-04-18 09:58:38.906|2022-04-18 12:50:27.204|10309    |10309                  |
    # |2  |2022-04-18 21:26:25.671|2022-04-18 21:26:36.313|11       |11                     |
    # |3  |2022-04-18 21:27:12.388|2022-04-18 21:27:27.227|15       |15                     |
    # |4  |2022-04-19 09:57:40.453|2022-04-29 09:59:07.612|864087   |691287                 |
    # +---+-----------------------+-----------------------+---------+-----------------------+
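
As a quick sanity check on row 4 above (a standalone sketch in plain Python, not part of the original answer, with the timestamps truncated to whole seconds the way cast(... as long) does): the range 2022-04-19 to 2022-04-29 contains one weekend (Saturday 23 and Sunday 24 April), so 2 × 86400 seconds are excluded, and the per-day accumulation reproduces 864087 total seconds and 691287 working-day seconds.

from datetime import datetime, timedelta

# Row 4, with fractional seconds dropped (matching cast(timestamp as long) above).
start = datetime(2022, 4, 19, 9, 57, 40)
end = datetime(2022, 4, 29, 9, 59, 7)

total = int((end - start).total_seconds())             # TIME_DIFF
working = 0
day = start.date()
while day <= end.date():
    if day.weekday() < 5:                               # Monday=0 .. Friday=4
        day_start = datetime.combine(day, datetime.min.time())
        day_end = day_start + timedelta(days=1)
        working += int((min(day_end, end) - max(day_start, start)).total_seconds())
    day += timedelta(days=1)

print(total, working)  # 864087 691287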
    