Bigquery:正则表达式提取后的 COUNT/GROUP BY
我是 bigquery 的新手。我有一个表 MOVIES
,其架构如下:
[SchemaField('movieId', 'INTEGER', 'NULLABLE', None, ()),
SchemaField('title', 'STRING', 'NULLABLE', None, ()),
SchemaField('genres', 'STRING', 'NULLABLE', None, ())]
电影的标题如下所示: Pharaoh's Army (1995)
我想看看大多数电视剧是在哪一年拍摄的。我已经设法从标题中提取年份,并且只保留像这样的戏剧电影:
q4 = """
select
movies.title,
regexp_extract(title, r'\((\d{4})\)') as year,
from assignment-344206.movie_lens_20M.MOVIES movies
where movies.genres = "Drama"
order by year desc
limit 5
"""
query_job_4 = client.query(q4)
query_job_4.to_dataframe()
现在我需要按年份分组,然后对每个组进行计数,以找到戏剧最多的年份。我尝试了这个:
q4 = """
select
movies.title,
regexp_extract(title, r'\((\d{4})\)') as year,
count(year) as nb_per_year
from assignment-344206.movie_lens_20M.MOVIES movies
where movies.genres = "Drama"
group by year
order by year desc
limit 5
"""
但我收到以下错误 BadRequest: 400 Unrecognized name:year at [5:9]
我不明白。如果没有 group by
,名称 year
会按 order byyear desc
被识别,那么为什么它现在不能识别呢?
I'm quite a newbie in bigquery. I have a table MOVIES
with the following schema :
[SchemaField('movieId', 'INTEGER', 'NULLABLE', None, ()),
SchemaField('title', 'STRING', 'NULLABLE', None, ()),
SchemaField('genres', 'STRING', 'NULLABLE', None, ())]
The title of the movies looks like this : Pharaoh's Army (1995)
I want to see in which year most drama were filmed. I already managed to extract the year from the title and only keeping movies that are dramas like so :
q4 = """
select
movies.title,
regexp_extract(title, r'\((\d{4})\)') as year,
from assignment-344206.movie_lens_20M.MOVIES movies
where movies.genres = "Drama"
order by year desc
limit 5
"""
query_job_4 = client.query(q4)
query_job_4.to_dataframe()
Now I need to group by year and then count each group to find the year with the most dramas. I tried this :
q4 = """
select
movies.title,
regexp_extract(title, r'\((\d{4})\)') as year,
count(year) as nb_per_year
from assignment-344206.movie_lens_20M.MOVIES movies
where movies.genres = "Drama"
group by year
order by year desc
limit 5
"""
But I get the following error BadRequest: 400 Unrecognized name: year at [5:9]
which I don't understand. Without the group by
the name year
was recognized in order by year desc
so why doesn't it recognize it now ?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
为了补充 Martin Weitzmann 的答案,如果您决定使用
WITH
,下面是此类场景的代码/脚本实现。with
可以用作最终查询的单个表达式中的临时表。有关with
行为的更多详细信息,请访问 with 子句 页面。查询:此查询显示
with
在您的案例中的用法。您可以在 BigQuery UI 上运行它。代码:将 QUERY 替换为上述查询
输出
To complement Martin Weitzmann answer, if you decide to go using
WITH
, below its a code/script implementation of such scenario.with
can be use as a temp table in a single expression for your final query. For more details aboutwith
behaviour can be found on with clause page.query: This query shows the usage of
with
on your case. You can run this on the BigQuery UI.code: Replace QUERY with above query
output
聚合器
COUNT()
无法知道year
,因为您只是在上面的行中将其创建为类别/组。ORDER BY
在分组/聚合后执行,因此它可以知道字段year
。粗略的执行顺序是
(像
UNION
这样的设置操作仅适用于查询结果 - 不确定在哪里列出它们 - 它们要么是第一个,要么是最后一个,具体取决于你如何看待它)根据您的
year
中是否有 NULL 值,您可以尝试COUNT(*)
或COUNT(title)
甚至 < code>COUNT(regexp_extract(title, r'\((\d{4})\)'))如果您不需要额外的正则表达式,但需要字段
year
,您可以在 CTE 中准备表格WITH
并在引用 CTE 的查询中进行分组。The aggregator
COUNT()
can't knowyear
because you're only creating it in the line above as a category/group.ORDER BY
gets executed after grouping/aggregations so it can know the fieldyear
.Rough order of execution is afaik
(set operations like
UNION
only work with query results - not sure where to list them - either they're first or last depending on how you look at it)Depending on whether there are NULL values in your
year
or not you could tryCOUNT(*)
orCOUNT(title)
or evenCOUNT(regexp_extract(title, r'\((\d{4})\)'))
If you don't want the extra regex but you need the field
year
, you can prepare the table in a CTEWITH
and do the grouping in a query referencing the CTE.