Function comparison of string column values in Redshift

Posted 2025-02-03 12:35:56 · 1,516 characters · 2 views · 0 comments

I'm using ilike for case-insensitive comparison of string columns in Redshift. To better handle differing string sizes I do

-- Redshift concatenates strings with ||, not +
select *
    from testset c
    join predictions p
    on p.id = c.id and
    c.text ilike '%' || p.text || '%'

union

select *
    from testset c
    join predictions p
    on p.id = c.id and
    p.text ilike '%' || c.text || '%'

in order to consider both length(c) > length(p) and vice versa. However, like/ilike has several limitations; for example, when

p.text = "TOKEN1 TOKEN2 TOKEN3"
c.text = "TOKEN1 TOKEN3"

it will not work.
I was then thinking of using a Redshift UDF written in Python, but I'm not sure whether, and how, I can support things like Levenshtein distance or string similarity with a threshold in such a function, given the libraries available to Python UDFs.
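Levenshtein itself needs no external library: a short dynamic-programming implementation can be pasted straight into a plpythonu UDF body. A minimal sketch (the function names here are mine, not Redshift built-ins):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, computed one row at a time.
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def similarity(a, b):
    # Normalised similarity in [0, 1]; compare against a threshold in SQL.
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / float(longest)

print(levenshtein('TOKEN1 TOKEN3', 'TOKEN1 TOKEN2 TOKEN3'))  # → 7
print(similarity('TOKEN1 TOKEN3', 'TOKEN1 TOKEN2 TOKEN3'))   # → 0.65
```

Wrapped in a CREATE OR REPLACE FUNCTION ... LANGUAGE plpythonu, a predicate like similarity(a, b) >= 0.8 would give the thresholded matching asked about above.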

Right now my attempt looks like this:

create or replace function f_compare(a VARCHAR, b VARCHAR) returns float IMMUTABLE as $$
    from difflib import SequenceMatcher

    def diff(strL, strR):
        # Similarity in [0, 1] based on the longest matching subsequences
        return SequenceMatcher(None, strL, strR).ratio()

    return diff(a, b)
$$ LANGUAGE plpythonu;

with samples as (
    select 
    cast('TOKEN1 TOKEN3' as VARCHAR) as name,
    cast('TOKEN1 TOKEN2 TOKEN3' as VARCHAR) as name1
)

select f_compare(name, name1) from samples
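As a sanity check outside the cluster, the comparison f_compare performs can be reproduced with plain difflib (standard library, so it also works inside plpythonu):

```python
from difflib import SequenceMatcher

# Same computation as the f_compare UDF, on the sample row above.
ratio = SequenceMatcher(None, 'TOKEN1 TOKEN3', 'TOKEN1 TOKEN2 TOKEN3').ratio()
print(round(ratio, 3))  # → 0.788, i.e. 2 * 13 matching chars / 33 total chars
```

A predicate in the style of f_compare(c.text, p.text) > 0.7 would therefore accept this pair, which the ilike approach rejects.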

Comments (1)

隔纱相望 2025-02-10 12:35:56


You can use existing Python libraries instead of writing your own. TheFuzz is a pretty popular one.

It uses Levenshtein Distance to calculate the differences between sequences in a simple-to-use package.

UDF

CREATE FUNCTION fuzzy_test (string_a TEXT, string_b TEXT) RETURNS FLOAT IMMUTABLE
AS
$$
  from fuzzywuzzy import fuzz
  return fuzz.ratio(string_a, string_b)
$$ LANGUAGE plpythonu;

Query

SELECT fuzzy_test('brooklyn bridge', 'brooklin bridge');
-- Output
-- 93
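The 93 can be reproduced with the standard library alone: fuzz.ratio is essentially a rounded percentage of difflib's SequenceMatcher ratio (thefuzz may delegate to python-Levenshtein when that package is installed, which can differ slightly on some inputs):

```python
from difflib import SequenceMatcher

# Rounded percentage similarity, mirroring what fuzz.ratio reports.
score = int(round(100 * SequenceMatcher(None, 'brooklyn bridge',
                                        'brooklin bridge').ratio()))
print(score)  # → 93
```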

You'll have to import that library into your Redshift cluster, though. Download the fuzzywuzzy repo from GitHub, zip it, upload the zip to your S3, and create a library from it.

CREATE LIBRARY fuzzywuzzy LANGUAGE plpythonu FROM 's3://<bucket_name>/fuzzywuzzy.zip' CREDENTIALS 'aws_access_key_id=<access key id>;aws_secret_access_key=<secret key>'