How to extract and filter in PySpark?

Posted on 2025-02-11 06:55:13

Here's the data schema

df = spark.read.parquet(link)
df.printSchema()

# root
#  |-- user_id: long
#  |-- date: string
#  |-- totals: struct
#  |    |-- time: long
#  |    |-- views: long
#  |-- clicks: array
#  |    |-- element: struct
#  |    |    |-- clicknumber: long
#  |    |    |-- eventinfo: struct
#  |    |    |    |-- eventlabel: string
#  |    |    |    |-- eventaction: string
#  |    |    |-- item: array
#  |    |    |    |-- element: struct
#  |    |    |    |    |-- itembrand: string
#  |    |    |    |    |-- itemprice: long

What I'm struggling to do is create a PySpark DataFrame that contains the date and the eventaction, where the date is between 3/05/2000 and 4/09/2009 and the brand is "Stihl".

I have tried the following, but without any results.

df.select(['date', explode('clicks.eventinfo.eventaction'), 
            explode('clicks.item')])

df.filter(df.clicks.item.itembrand =='Stihl')

空名 2025-02-18 06:55:13

Try this. First explode each array, then filter and select.

from pyspark.sql import functions as F

# One row per click, then one row per item within each click.
df = df.withColumn('clicks', F.explode('clicks'))
df = df.withColumn('item', F.explode('clicks.item'))
# `date` is a string column, so this comparison assumes ISO 'yyyy-MM-dd' values.
df = df.filter(
    F.col('date').between('2000-03-05', '2009-04-09') &
    (F.col('item.itembrand') == 'Stihl')
).select('date', 'clicks.eventinfo.eventaction')
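
To see what the two explode steps do to a row, here is a rough plain-Python model of the same logic. It is only a sketch and does not use Spark: the sample `rows` and the helper `explode_and_filter` are made up for illustration, and the dates are assumed to be stored as ISO 'yyyy-MM-dd' strings so that string comparison orders them correctly.

```python
# Hypothetical sample rows shaped like the schema in the question.
rows = [
    {
        "date": "2005-06-01",
        "clicks": [
            {
                "eventinfo": {"eventlabel": "a", "eventaction": "view"},
                "item": [
                    {"itembrand": "Stihl", "itemprice": 100},
                    {"itembrand": "Other", "itemprice": 50},
                ],
            },
            {
                "eventinfo": {"eventlabel": "b", "eventaction": "buy"},
                "item": [{"itembrand": "Other", "itemprice": 20}],
            },
        ],
    },
    {
        "date": "2010-01-01",  # outside the requested date range
        "clicks": [
            {
                "eventinfo": {"eventlabel": "c", "eventaction": "view"},
                "item": [{"itembrand": "Stihl", "itemprice": 100}],
            }
        ],
    },
]


def explode_and_filter(rows, start, end, brand):
    """Return (date, eventaction) pairs for items of the given brand."""
    result = []
    for row in rows:
        # Outer loop models exploding `clicks`: one output row per click.
        for click in row["clicks"] or []:
            # Inner loop models exploding `clicks.item`: one row per item.
            for item in click["item"] or []:
                in_range = start <= row["date"] <= end  # string comparison
                if in_range and item["itembrand"] == brand:
                    result.append((row["date"], click["eventinfo"]["eventaction"]))
    return result


result = explode_and_filter(rows, "2000-03-05", "2009-04-09", "Stihl")
print(result)
```

Because each matching item yields its own row, a click containing two "Stihl" items would produce the same (date, eventaction) pair twice. In the PySpark version, calling `.distinct()` on the result is one way to drop such repeats if they are not wanted.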
