字符串列值的功能比较红移
我正在使用对于列字符串案例不敏感的比较。更好地处理我所做的字符串尺寸
select *
from testset t
join predictions p
on p.id = c.id and
c.text ilike '%'+p.text+'%'
union
select *
from testset t
join predictions p
on p.id = c.id and
p.text ilike '%'+c.text+'%'
,以考虑长度(c)>长度(P)和Viceversa。顺便说一句,类似物具有多个限制,例如
p.text = "TOKEN1 TOKEN2 TOKEN3"
c.text = "TOKEN1 TOKEN3"
它无法正常工作。 然后我在想使用 redshift函数函数),但我不确定如何支持Levenshtein距离,字符串相似性(带有阈值)等诸如该功能(以及可能是可能的),使用 udf python函数。
我现在的看法
create or replace function f_compare(a VARCHAR, b VARCHAR) returns float IMMUTABLE as $$
def diff(strL, strR):
from difflib import SequenceMatcher
ratio = SequenceMatcher(None, strL, strR).ratio()
return ratio
return diff(a,b)
return f_compare(a,b)
$$ LANGUAGE plpythonu;
with samples as (
select
cast('TOKEN1 TOKEN3' as VARCHAR) as name,
cast('TOKEN1 TOKEN2 TOKEN3' as VARCHAR) as name1
)
select f_compare(name, name1) from samples
I'm using ilike for column string case insensitive comparison in Redshift. To better handle the string size I do
select *
from testset t
join predictions p
on p.id = c.id and
c.text ilike '%'+p.text+'%'
union
select *
from testset t
join predictions p
on p.id = c.id and
p.text ilike '%'+c.text+'%'
in order to consider length(c) > length(p) and viceversa. By the way, the like has several limitations, like when
p.text = "TOKEN1 TOKEN2 TOKEN3"
c.text = "TOKEN1 TOKEN3"
it will not work.
I was thinking then to use a Redshift function (or python function), but I'm not sure how I can support things like Levenshtein distance, string similarity (with threshold), etc. in that function (and if it is possibile), using available libraries for UDF Python Functions.
My take looks like right now
create or replace function f_compare(a VARCHAR, b VARCHAR) returns float IMMUTABLE as $
def diff(strL, strR):
from difflib import SequenceMatcher
ratio = SequenceMatcher(None, strL, strR).ratio()
return ratio
return diff(a,b)
return f_compare(a,b)
$ LANGUAGE plpythonu;
with samples as (
select
cast('TOKEN1 TOKEN3' as VARCHAR) as name,
cast('TOKEN1 TOKEN2 TOKEN3' as VARCHAR) as name1
)
select f_compare(name, name1) from samples
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
您可以使用Python库,而不是编写自己的图书馆。 thefuzz 很受欢迎。
UDF
查询,
您必须将该库导入到红移群集中。从 github 下载fuzzywuzzy repo。
You can use the python libraries instead of writing your own. TheFuzz is a pretty popular one.
UDF
Query
You'll have to import that library into your redshift cluster though. Download the fuzzywuzzy repo from github, zip it, upload it to your S3 and create a library using it.