在AWS Athena中,如何离开加入,右边只有1行,高性能?

发布于 2025-01-25 10:11:47 字数 1631 浏览 0 评论 0原文

我有2张表:

regex_table:包含正则图案

IDREGEX_PATTERN
1'HEL。*'
2'1.*'

string_table:包含字符串

ID字符串
1Hello string 1 Hello
2地狱,

预期的结果应包含每个Regex模式的1个匹配的字符串,例如

REGEX_PARTTNN字符串
'Hel。*'Hello'1
.*'

每个REGEX_PATTERN只需要一个字符串,因此在此示例中,地狱hello可以是'Hel。*'hel。*'

我需要在AWS Athena(引擎盖下的Presto)上进行此操作,以便

我尝试过的巨型数据集:

  1. 使用Row_number()
select * 
from (
  select *, 
    row_number() over (
      partition by regex_pattern
      order by string
    ) as rn
  from regex_table
    left join string_table on regexp_like(string, regex_pattern)
)
where rn = 1

此工作,但是太慢了,因为连接必须检查字符串表的所有字符串,而仅需要一个匹配的字符串,其余的可以跳过,就像break ruby​​或python

  1. 使用相关子查询
select regex_pattern, 
  (
    select string 
    from string_table 
    where regexp_like(string, regex_pattern) 
    limit 1
  )
from regex_table

中,但在给定上下文中接收到的错误不支持

是否有更好的方法左JOIN在雅典娜中解决此问题?

目前,我必须在python中进行此匹配,以利用break循环在找到匹配的字符串时突破looping string_table。在SQL中可以做同样的事情吗?

I have 2 tables:

regex_table: contains the regex patterns

idregex_pattern
1'hel.*'
2'1.*'

string_table: contains strings

idstring
1hello
2hell

The expected result should contain 1 matched string for each regex pattern, like

regex_patternstring
'hel.*'hello
'1.*'

each regex_pattern only needs one string, so in this example, both hell and hello can be a match for 'hel.*'

I need to do this on AWS Athena (presto under the hood) for a giant data set

I've tried:

  1. use row_number()
select * 
from (
  select *, 
    row_number() over (
      partition by regex_pattern
      order by string
    ) as rn
  from regex_table
    left join string_table on regexp_like(string, regex_pattern)
)
where rn = 1

this works, but too slow because the join has to check all the strings of string table while only one matched string is needed, the rest can be skipped, just like a break loop in ruby or python

  1. Use Correlated subquery
select regex_pattern, 
  (
    select string 
    from string_table 
    where regexp_like(string, regex_pattern) 
    limit 1
  )
from regex_table

but received error Correlated subquery in given context is not supported

Is there a better way than left join to solve this problem in athena?

Currently I have to do this matching in python to take advantage of the break loop to break out of looping string_table when a matched string is found. Is it possible to do the same in sql?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。
列表为空,暂无数据
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文