Bigquery:正则表达式提取后的 COUNT/GROUP BY

发布于 2025-01-14 19:23:05 字数 1117 浏览 1 评论 0原文

我是 bigquery 的新手。我有一个表 MOVIES ,其架构如下:

[SchemaField('movieId', 'INTEGER', 'NULLABLE', None, ()),
 SchemaField('title', 'STRING', 'NULLABLE', None, ()),
 SchemaField('genres', 'STRING', 'NULLABLE', None, ())]

电影的标题如下所示: Pharaoh's Army (1995)

我想看看大多数电视剧是在哪一年拍摄的。我已经设法从标题中提取年份,并且只保留像这样的戏剧电影:

q4 = """
select 
  movies.title,
  regexp_extract(title, r'\((\d{4})\)') as year,
from assignment-344206.movie_lens_20M.MOVIES movies
where movies.genres = "Drama"
order by year desc
limit 5
"""
query_job_4 = client.query(q4)
query_job_4.to_dataframe()

现在我需要按年份分组,然后对每个组进行计数,以找到戏剧最多的年份。我尝试了这个:

q4 = """
select 
  movies.title,
  regexp_extract(title, r'\((\d{4})\)') as year,
  count(year) as nb_per_year
from assignment-344206.movie_lens_20M.MOVIES movies
where movies.genres = "Drama"
group by year
order by year desc
limit 5
"""

但我收到以下错误 BadRequest: 400 Unrecognized name:year at [5:9] 我不明白。如果没有 group by,名称 year 会按 order byyear desc 被识别,那么为什么它现在不能识别呢?

I'm quite a newbie in bigquery. I have a table MOVIES with the following schema :

[SchemaField('movieId', 'INTEGER', 'NULLABLE', None, ()),
 SchemaField('title', 'STRING', 'NULLABLE', None, ()),
 SchemaField('genres', 'STRING', 'NULLABLE', None, ())]

The title of the movies looks like this : Pharaoh's Army (1995)

I want to see in which year most drama were filmed. I already managed to extract the year from the title and only keeping movies that are dramas like so :

q4 = """
select 
  movies.title,
  regexp_extract(title, r'\((\d{4})\)') as year,
from assignment-344206.movie_lens_20M.MOVIES movies
where movies.genres = "Drama"
order by year desc
limit 5
"""
query_job_4 = client.query(q4)
query_job_4.to_dataframe()

Now I need to group by year and then count each group to find the year with the most dramas. I tried this :

q4 = """
select 
  movies.title,
  regexp_extract(title, r'\((\d{4})\)') as year,
  count(year) as nb_per_year
from assignment-344206.movie_lens_20M.MOVIES movies
where movies.genres = "Drama"
group by year
order by year desc
limit 5
"""

But I get the following error BadRequest: 400 Unrecognized name: year at [5:9] which I don't understand. Without the group bythe name year was recognized in order by year descso why doesn't it recognize it now ?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

迷路的信 2025-01-21 19:23:05

为了补充 Martin Weitzmann 的答案,如果您决定使用 WITH,下面是此类场景的代码/脚本实现。 with 可以用作最终查询的单个表达式中的临时表。有关 with 行为的更多详细信息,请访问 with 子句 页面。

查询:此查询显示 with 在您的案例中的用法。您可以在 BigQuery UI 上运行它。

with movies as (
    select 
        title,
        SAFE_CAST(regexp_extract(title, r'\((\d{4})\)') AS INT64) as year,
        genres
        from `projectid.datasetid.table`
)
select count(1) as movies_per_year, year
from movies
where genres = "Drama"
group by year
order by year desc

代码:将 QUERY 替换为上述查询

from google.cloud import bigquery

client = bigquery.Client()

query = """ QUERY """

query_job = client.query(query)
df = query_job.to_dataframe()
print(df.head())

输出

year | movier per year
2008 | 1
2003 | 1
2001 | 1
1994 | 2
1993 | 1
  ...

To complement Martin Weitzmann answer, if you decide to go using WITH, below its a code/script implementation of such scenario. with can be use as a temp table in a single expression for your final query. For more details about with behaviour can be found on with clause page.

query: This query shows the usage of with on your case. You can run this on the BigQuery UI.

with movies as (
    select 
        title,
        SAFE_CAST(regexp_extract(title, r'\((\d{4})\)') AS INT64) as year,
        genres
        from `projectid.datasetid.table`
)
select count(1) as movies_per_year, year
from movies
where genres = "Drama"
group by year
order by year desc

code: Replace QUERY with above query

from google.cloud import bigquery

client = bigquery.Client()

query = """ QUERY """

query_job = client.query(query)
df = query_job.to_dataframe()
print(df.head())

output

year | movier per year
2008 | 1
2003 | 1
2001 | 1
1994 | 2
1993 | 1
  ...
七度光 2025-01-21 19:23:05

聚合器 COUNT() 无法知道 year,因为您只是在上面的行中将其创建为类别/组。 ORDER BY 在分组/聚合后执行,因此它可以知道字段year

粗略的执行顺序是

  1. FROM 输入表
  2. JOIN 表记录
  3. WHERE 字段过滤
  4. GROUP BY 聚合字段
  5. HAVING 聚合过滤器
  6. WINDOW 函数
  7. QUALIFY 窗口字段过滤
  8. DISTINCT 结果字段
  9. ORDER BY 结果字段
  10. LIMIT 和 OFFSET 结果行
  11. SELECT (输出/打印)输出字段

(像 UNION 这样的设置操作仅适用于查询结果 - 不确定在哪里列出它们 - 它们要么是第一个,要么是最后一个,具体取决于你如何看待它)

根据您的 year 中是否有 NULL 值,您可以尝试 COUNT(*)COUNT(title) 甚至 < code>COUNT(regexp_extract(title, r'\((\d{4})\)'))

如果您不需要额外的正则表达式,但需要字段 year ,您可以在 CTE 中准备表格WITH 并在引用 CTE 的查询中进行分组。

The aggregator COUNT() can't know year because you're only creating it in the line above as a category/group. ORDER BY gets executed after grouping/aggregations so it can know the field year.

Rough order of execution is afaik

  1. FROM input tables
  2. JOINs on table records
  3. WHERE fields filter
  4. GROUP BY aggregate fields
  5. HAVING aggregates filter
  6. WINDOW functions
  7. QUALIFY window fields filter
  8. DISTINCT result fields
  9. ORDER BY result fields
  10. LIMIT and OFFSET result rows
  11. SELECT (output/print) output fields

(set operations like UNION only work with query results - not sure where to list them - either they're first or last depending on how you look at it)

Depending on whether there are NULL values in your year or not you could try COUNT(*) or COUNT(title) or even COUNT(regexp_extract(title, r'\((\d{4})\)'))

If you don't want the extra regex but you need the field year, you can prepare the table in a CTE WITH and do the grouping in a query referencing the CTE.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文