How to aggregate timestamp data in Spark to a smaller time frame

I'm working on a project using New York taxi data. The data contains records of the pickup location (PULocationID) and the timestamp (tpep_pickup_datetime) for each pick-up.

[sample of the data]

I want to aggregate the data hourly for each location, i.e. an hourly count of pick-ups per location.

Comments (1)

月牙弯弯 2025-02-19 00:35:49

The information you provided is a bit sparse, but from what I understand, these could be possible aggregation options.

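For context, the outputs below can be reproduced with a small hand-built DataFrame. This is only an illustrative sketch (the values are made up, not the real NYC taxi data), assuming a local SparkSession:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Illustrative sample with only the two relevant columns.
df = spark.createDataFrame(
    [
        (238, '2020-01-01 00:15:00'),
        (238, '2020-01-01 01:05:00'),
        (238, '2020-01-01 01:45:00'),
        (238, '2020-01-01 02:10:00'),
        (238, '2020-01-01 02:30:00'),
        (193, '2020-01-01 02:20:00'),
        (7,   '2020-01-01 00:40:00'),
    ],
    ['PULocationID', 'tpep_pickup_datetime'],
).withColumn('tpep_pickup_datetime', F.to_timestamp('tpep_pickup_datetime'))
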
Using date_trunc

from pyspark.sql import functions as F

# Truncate each pickup timestamp down to the start of its hour,
# then count pick-ups per (hour, PULocationID) pair.
df = df.groupBy(
        F.date_trunc('hour', 'tpep_pickup_datetime').alias('hour'),
        'PULocationID',
    ).count()

df.show()
# +-------------------+------------+-----+
# |               hour|PULocationID|count|
# +-------------------+------------+-----+
# |2020-01-01 00:00:00|         238|    1|
# |2020-01-01 02:00:00|         238|    2|
# |2020-01-01 02:00:00|         193|    1|
# |2020-01-01 01:00:00|         238|    2|
# |2020-01-01 00:00:00|           7|    1|
# +-------------------+------------+-----+
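With this approach the hour column is a plain timestamp, so the result can be sorted or joined directly. For example (a minimal sketch):

df.orderBy('PULocationID', 'hour').show()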

Using window

from pyspark.sql import functions as F

# Bucket each pickup into a 1-hour tumbling window,
# then count pick-ups per (window, PULocationID) pair.
df = df.groupBy(
        F.window('tpep_pickup_datetime', '1 hour').alias('hour'),
        'PULocationID',
    ).count()

df.show(truncate=0)
# +------------------------------------------+------------+-----+
# |hour                                      |PULocationID|count|
# +------------------------------------------+------------+-----+
# |[2020-01-01 02:00:00, 2020-01-01 03:00:00]|238         |2    |
# |[2020-01-01 01:00:00, 2020-01-01 02:00:00]|238         |2    |
# |[2020-01-01 00:00:00, 2020-01-01 01:00:00]|238         |1    |
# |[2020-01-01 02:00:00, 2020-01-01 03:00:00]|193         |1    |
# |[2020-01-01 00:00:00, 2020-01-01 01:00:00]|7           |1    |
# +------------------------------------------+------------+-----+
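
Since F.window produces a struct column with start and end fields, one way to flatten it back to a plain timestamp is to select the start field (a sketch; the column name hour_start is my own choice):

df = df.select(
    F.col('hour.start').alias('hour_start'),
    'PULocationID',
    'count',
)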