在AWS Athena中,如何离开加入,右边只有1行,高性能?
我有2张表:
regex_table:包含正则图案
ID | REGEX_PATTERN |
---|---|
1'HEL | 。*' |
2'1 | .*' |
string_table:包含字符串
ID | 字符串 |
---|---|
1 | Hello string 1 Hello |
2 | 地狱, |
预期的结果应包含每个Regex模式的1个匹配的字符串,例如
REGEX_PARTTNN | 字符串 |
---|---|
'Hel。*' | Hello'1 |
.*' |
每个REGEX_PATTERN只需要一个字符串,因此在此示例中,地狱
和hello
可以是'Hel。*'hel。*'
我需要在AWS Athena(引擎盖下的Presto)上进行此操作,以便
我尝试过的巨型数据集:
- 使用Row_number()
select *
from (
select *,
row_number() over (
partition by regex_pattern
order by string
) as rn
from regex_table
left join string_table on regexp_like(string, regex_pattern)
)
where rn = 1
此工作,但是太慢了,因为连接必须检查字符串表的所有字符串,而仅需要一个匹配的字符串,其余的可以跳过,就像break
ruby或python
- 使用相关子查询
select regex_pattern,
(
select string
from string_table
where regexp_like(string, regex_pattern)
limit 1
)
from regex_table
中,但在给定上下文中接收到的错误不支持
是否有更好的方法左JOIN
在雅典娜中解决此问题?
目前,我必须在python中进行此匹配,以利用break
循环在找到匹配的字符串时突破looping string_table。在SQL中可以做同样的事情吗?
I have 2 tables:
regex_table: contains the regex patterns
id | regex_pattern |
---|---|
1 | 'hel.*' |
2 | '1.*' |
string_table: contains strings
id | string |
---|---|
1 | hello |
2 | hell |
The expected result should contain 1 matched string for each regex pattern, like
regex_pattern | string |
---|---|
'hel.*' | hello |
'1.*' |
each regex_pattern only needs one string, so in this example, both hell
and hello
can be a match for 'hel.*'
I need to do this on AWS Athena (presto under the hood) for a giant data set
I've tried:
- use row_number()
select *
from (
select *,
row_number() over (
partition by regex_pattern
order by string
) as rn
from regex_table
left join string_table on regexp_like(string, regex_pattern)
)
where rn = 1
this works, but too slow because the join has to check all the strings of string table while only one matched string is needed, the rest can be skipped, just like a break
loop in ruby or python
- Use Correlated subquery
select regex_pattern,
(
select string
from string_table
where regexp_like(string, regex_pattern)
limit 1
)
from regex_table
but received error Correlated subquery in given context is not supported
Is there a better way than left join
to solve this problem in athena?
Currently I have to do this matching in python to take advantage of the break
loop to break out of looping string_table when a matched string is found. Is it possible to do the same in sql?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
data:image/s3,"s3://crabby-images/d5906/d59060df4059a6cc364216c4d63ceec29ef7fe66" alt="扫码二维码加入Web技术交流群"
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论