在AWS Athena中，如何离开加入，右边只有1行，高性能？

发布于 2025-01-25 10:11:47 字数 1631 浏览 0 评论 0原文

我有2张表：

regex_table：包含正则图案

ID	REGEX_PATTERN
1'HEL	。*'
2'1	.*'

string_table：包含字符串

ID	字符串
1	Hello string 1 Hello
2	地狱，

预期的结果应包含每个Regex模式的1个匹配的字符串，例如

REGEX_PARTTNN	字符串
'Hel。*'	Hello'1
.*'

每个REGEX_PATTERN只需要一个字符串，因此在此示例中，地狱和hello可以是'Hel。*'hel。*'

我需要在AWS Athena（引擎盖下的Presto）上进行此操作，以便

我尝试过的巨型数据集：

使用Row_number（）

select * 
from (
  select *, 
    row_number() over (
      partition by regex_pattern
      order by string
    ) as rn
  from regex_table
    left join string_table on regexp_like(string, regex_pattern)
)
where rn = 1

此工作，但是太慢了，因为连接必须检查字符串表的所有字符串，而仅需要一个匹配的字符串，其余的可以跳过，就像break ruby或python

使用相关子查询

select regex_pattern, 
  (
    select string 
    from string_table 
    where regexp_like(string, regex_pattern) 
    limit 1
  )
from regex_table

中，但在给定上下文中接收到的错误不支持

是否有更好的方法左JOIN在雅典娜中解决此问题？

目前，我必须在python中进行此匹配，以利用break循环在找到匹配的字符串时突破looping string_table。在SQL中可以做同样的事情吗？

原文

I have 2 tables:

regex_table: contains the regex patterns

id	regex_pattern
1	'hel.*'
2	'1.*'

string_table: contains strings

id	string
1	hello
2	hell

The expected result should contain 1 matched string for each regex pattern, like

regex_pattern	string
'hel.*'	hello
'1.*'

each regex_pattern only needs one string, so in this example, both hell and hello can be a match for 'hel.*'

I need to do this on AWS Athena (presto under the hood) for a giant data set

I've tried:

use row_number()

select * 
from (
  select *, 
    row_number() over (
      partition by regex_pattern
      order by string
    ) as rn
  from regex_table
    left join string_table on regexp_like(string, regex_pattern)
)
where rn = 1

this works, but too slow because the join has to check all the strings of string table while only one matched string is needed, the rest can be skipped, just like a break loop in ruby or python

Use Correlated subquery

select regex_pattern, 
  (
    select string 
    from string_table 
    where regexp_like(string, regex_pattern) 
    limit 1
  )
from regex_table

but received error Correlated subquery in given context is not supported

Is there a better way than left join to solve this problem in athena?

Currently I have to do this matching in python to take advantage of the break loop to break out of looping string_table when a matched string is found. Is it possible to do the same in sql?

分享到QQ

分享到微博