Bigquery：正则表达式提取后的 COUNT/GROUP BY

发布于 2025-01-14 19:23:05 字数 1117 浏览 1 评论 0原文

我是 bigquery 的新手。我有一个表 MOVIES ，其架构如下：

[SchemaField('movieId', 'INTEGER', 'NULLABLE', None, ()),
 SchemaField('title', 'STRING', 'NULLABLE', None, ()),
 SchemaField('genres', 'STRING', 'NULLABLE', None, ())]

电影的标题如下所示： Pharaoh's Army (1995)

我想看看大多数电视剧是在哪一年拍摄的。我已经设法从标题中提取年份，并且只保留像这样的戏剧电影：

q4 = """
select 
  movies.title,
  regexp_extract(title, r'\((\d{4})\)') as year,
from assignment-344206.movie_lens_20M.MOVIES movies
where movies.genres = "Drama"
order by year desc
limit 5
"""
query_job_4 = client.query(q4)
query_job_4.to_dataframe()

现在我需要按年份分组，然后对每个组进行计数，以找到戏剧最多的年份。我尝试了这个：

q4 = """
select 
  movies.title,
  regexp_extract(title, r'\((\d{4})\)') as year,
  count(year) as nb_per_year
from assignment-344206.movie_lens_20M.MOVIES movies
where movies.genres = "Drama"
group by year
order by year desc
limit 5
"""

但我收到以下错误 BadRequest: 400 Unrecognized name:year at [5:9] 我不明白。如果没有 group by，名称 year 会按 order byyear desc 被识别，那么为什么它现在不能识别呢？

原文

I'm quite a newbie in bigquery. I have a table MOVIES with the following schema :

[SchemaField('movieId', 'INTEGER', 'NULLABLE', None, ()),
 SchemaField('title', 'STRING', 'NULLABLE', None, ()),
 SchemaField('genres', 'STRING', 'NULLABLE', None, ())]

The title of the movies looks like this : Pharaoh's Army (1995)

I want to see in which year most drama were filmed. I already managed to extract the year from the title and only keeping movies that are dramas like so :

q4 = """
select 
  movies.title,
  regexp_extract(title, r'\((\d{4})\)') as year,
from assignment-344206.movie_lens_20M.MOVIES movies
where movies.genres = "Drama"
order by year desc
limit 5
"""
query_job_4 = client.query(q4)
query_job_4.to_dataframe()

Now I need to group by year and then count each group to find the year with the most dramas. I tried this :

q4 = """
select 
  movies.title,
  regexp_extract(title, r'\((\d{4})\)') as year,
  count(year) as nb_per_year
from assignment-344206.movie_lens_20M.MOVIES movies
where movies.genres = "Drama"
group by year
order by year desc
limit 5
"""

But I get the following error BadRequest: 400 Unrecognized name: year at [5:9] which I don't understand. Without the group bythe name year was recognized in order by year descso why doesn't it recognize it now ?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

迷路的信 2025-01-21 19:23:05

为了补充 Martin Weitzmann 的答案，如果您决定使用 WITH，下面是此类场景的代码/脚本实现。 with 可以用作最终查询的单个表达式中的临时表。有关 with 行为的更多详细信息，请访问 with 子句页面。

查询：此查询显示 with 在您的案例中的用法。您可以在 BigQuery UI 上运行它。

with movies as (
    select 
        title,
        SAFE_CAST(regexp_extract(title, r'\((\d{4})\)') AS INT64) as year,
        genres
        from `projectid.datasetid.table`
)
select count(1) as movies_per_year, year
from movies
where genres = "Drama"
group by year
order by year desc

代码：将 QUERY 替换为上述查询

from google.cloud import bigquery

client = bigquery.Client()

query = """ QUERY """

query_job = client.query(query)
df = query_job.to_dataframe()
print(df.head())

输出

year | movier per year
2008 | 1
2003 | 1
2001 | 1
1994 | 2
1993 | 1
  ...

To complement Martin Weitzmann answer, if you decide to go using WITH, below its a code/script implementation of such scenario. with can be use as a temp table in a single expression for your final query. For more details about with behaviour can be found on with clause page.

query: This query shows the usage of with on your case. You can run this on the BigQuery UI.

with movies as (
    select 
        title,
        SAFE_CAST(regexp_extract(title, r'\((\d{4})\)') AS INT64) as year,
        genres
        from `projectid.datasetid.table`
)
select count(1) as movies_per_year, year
from movies
where genres = "Drama"
group by year
order by year desc

code: Replace QUERY with above query

from google.cloud import bigquery

client = bigquery.Client()

query = """ QUERY """

query_job = client.query(query)
df = query_job.to_dataframe()
print(df.head())

output

year | movier per year
2008 | 1
2003 | 1
2001 | 1
1994 | 2
1993 | 1
  ...

回复收藏 0 原文

七度光 2025-01-21 19:23:05

聚合器 COUNT() 无法知道 year，因为您只是在上面的行中将其创建为类别/组。 ORDER BY 在分组/聚合后执行，因此它可以知道字段year。

粗略的执行顺序是

FROM 输入表
JOIN 表记录
WHERE 字段过滤
GROUP BY 聚合字段
HAVING 聚合过滤器
WINDOW 函数
QUALIFY 窗口字段过滤
DISTINCT 结果字段
ORDER BY 结果字段
LIMIT 和 OFFSET 结果行
SELECT （输出/打印）输出字段

(像 UNION 这样的设置操作仅适用于查询结果 - 不确定在哪里列出它们 - 它们要么是第一个，要么是最后一个，具体取决于你如何看待它）

根据您的 year 中是否有 NULL 值，您可以尝试 COUNT(*) 或 COUNT(title) 甚至 < code>COUNT(regexp_extract(title, r'\((\d{4})\)'))

如果您不需要额外的正则表达式，但需要字段 year ，您可以在 CTE 中准备表格WITH 并在引用 CTE 的查询中进行分组。

回复收藏 0 原文

~没有更多了~

关于作者

惟欲睡

暂无简介

文章

27 人气

关注发私信

友情链接

文江博客

Bigquery：正则表达式提取后的 COUNT/GROUP BY

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（2）

关于作者

相关话题

热门标签

推荐作者

╰ゝ天使的微笑

少女净妖师

朱洁

觉浅

滥情空心

hl1314520

友情链接

Bigquery：正则表达式提取后的 COUNT/GROUP BY

如果你对这篇内容有疑问，欢迎到本站社区发帖提问 参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（2）

关于作者

相关话题

热门标签

推荐作者

╰ゝ天使的微笑

少女净妖师

朱洁

觉浅

滥情空心

hl1314520

友情链接

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。