How do I create a Spark DataFrame with a timestamp?



How can I create this Spark DataFrame with a timestamp data type in one step using Python? Here is how I do it in two steps, using Spark 3.1.2:

from pyspark.sql.functions import lit, to_timestamp
from pyspark.sql.types import StructType, StructField, TimestampType, LongType

schema_sdf = StructType([
    StructField("ts", TimestampType(), True),
    StructField("myColumn", LongType(), True),
])

sdf = spark.createDataFrame([(to_timestamp(lit("2022-06-29 12:01:19.000")), 0)], schema=schema_sdf)
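(For comparison, one way to get a timestamp column in a single createDataFrame call is to pass Python datetime objects in the data rather than Column expressions. This is a minimal sketch, assuming a live SparkSession named spark and the schema_sdf defined above:)

import datetime

# datetime.datetime values satisfy TimestampType directly, so no cast is needed
sdf = spark.createDataFrame(
    [(datetime.datetime(2022, 6, 29, 12, 1, 19), 0)],
    schema=schema_sdf,
)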


几味少女 2025-02-18 10:42:50


PySpark does not automatically interpret timestamp values from strings. I mostly use the following syntax to create the DataFrame and then cast the column type to timestamp:

from pyspark.sql import functions as F

sdf = spark.createDataFrame([("2022-06-29 12:01:19.000", 0 )], ["ts", "myColumn"])
sdf = sdf.withColumn("ts", F.col("ts").cast("timestamp"))

sdf.printSchema()
# root
#  |-- ts: timestamp (nullable = true)
#  |-- myColumn: long (nullable = true)

The long type was inferred automatically, but for the timestamp we needed an explicit cast.
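If a single statement is the goal, the create-and-cast above can simply be chained. A minimal sketch, under the same assumption of a live spark session:

from pyspark.sql import functions as F

# build from a plain string, then cast within the same expression chain
sdf = spark.createDataFrame(
    [("2022-06-29 12:01:19.000", 0)], ["ts", "myColumn"]
).withColumn("ts", F.col("ts").cast("timestamp"))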

On the other hand, even without casting, you can use functions that expect a timestamp as input:

sdf = spark.createDataFrame([("2022-06-29 12:01:19.000", 0 )], ["ts", "myColumn"])
sdf.printSchema()
# root
#  |-- ts: string (nullable = true)
#  |-- myColumn: long (nullable = true)

sdf.selectExpr("extract(year from ts)").show()
# +---------------------+
# |extract(year FROM ts)|
# +---------------------+
# |                 2022|
# +---------------------+
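This works because Spark SQL implicitly coerces the string when a function requires a date or timestamp, and the DataFrame API behaves the same way. A small sketch reusing the sdf just defined, which should also print 2022:

from pyspark.sql import functions as F

# year() implicitly casts the string column to a date before extracting
sdf.select(F.year("ts").alias("year")).show()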