Function comparison of string column values in Redshift

Posted 2025-02-03 12:35:56 · 1,516 characters · 2 views · 0 comments

I'm using ilike for case-insensitive comparison of string columns in Redshift. To better handle differing string sizes I do

-- Redshift concatenates strings with ||, not +
select *
    from testset c
    join predictions p
    on p.id = c.id and
    c.text ilike '%' || p.text || '%'

union

select *
    from testset c
    join predictions p
    on p.id = c.id and
    p.text ilike '%' || c.text || '%'

in order to consider both length(c) > length(p) and vice versa. However, like/ilike has several limitations; for example, when

p.text = "TOKEN1 TOKEN2 TOKEN3"
c.text = "TOKEN1 TOKEN3"

it will not work.
I was then thinking of using a Redshift UDF written in Python, but I'm not sure whether, and how, I can support things like Levenshtein distance or string similarity with a threshold in such a function, given the libraries available to Python UDFs.
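Levenshtein itself needs no external library: a short dynamic-programming implementation can be pasted straight into a plpythonu UDF body. A minimal sketch (the function names here are mine, not Redshift built-ins):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, computed one row at a time.
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def similarity(a, b):
    # Normalised similarity in [0, 1]; compare against a threshold in SQL.
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / float(longest)

print(levenshtein('TOKEN1 TOKEN3', 'TOKEN1 TOKEN2 TOKEN3'))  # → 7
print(similarity('TOKEN1 TOKEN3', 'TOKEN1 TOKEN2 TOKEN3'))   # → 0.65
```

Wrapped in a CREATE OR REPLACE FUNCTION ... LANGUAGE plpythonu, a predicate like similarity(a, b) >= 0.8 would give the thresholded matching asked about above.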

Right now my attempt looks like this:

create or replace function f_compare(a VARCHAR, b VARCHAR) returns float IMMUTABLE as $$
    from difflib import SequenceMatcher

    def diff(strL, strR):
        # Similarity in [0, 1] based on the longest matching subsequences
        return SequenceMatcher(None, strL, strR).ratio()

    return diff(a, b)
$$ LANGUAGE plpythonu;

with samples as (
    select 
    cast('TOKEN1 TOKEN3' as VARCHAR) as name,
    cast('TOKEN1 TOKEN2 TOKEN3' as VARCHAR) as name1
)

select f_compare(name, name1) from samples
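As a sanity check outside the cluster, the comparison f_compare performs can be reproduced with plain difflib (standard library, so it also works inside plpythonu):

```python
from difflib import SequenceMatcher

# Same computation as the f_compare UDF, on the sample row above.
ratio = SequenceMatcher(None, 'TOKEN1 TOKEN3', 'TOKEN1 TOKEN2 TOKEN3').ratio()
print(round(ratio, 3))  # → 0.788, i.e. 2 * 13 matching chars / 33 total chars
```

A predicate in the style of f_compare(c.text, p.text) > 0.7 would therefore accept this pair, which the ilike approach rejects.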

Comments (1)

隔纱相望 2025-02-10 12:35:56


You can use existing Python libraries instead of writing your own. TheFuzz is a pretty popular one.

It uses Levenshtein Distance to calculate the differences between sequences in a simple-to-use package.

UDF

CREATE FUNCTION fuzzy_test (string_a TEXT, string_b TEXT) RETURNS FLOAT IMMUTABLE
AS
$$
  from fuzzywuzzy import fuzz
  return fuzz.ratio(string_a, string_b)
$$ LANGUAGE plpythonu;

Query

SELECT fuzzy_test('brooklyn bridge', 'brooklin bridge');
-- Output
-- 93
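The 93 can be reproduced with the standard library alone: fuzz.ratio is essentially a rounded percentage of difflib's SequenceMatcher ratio (thefuzz may delegate to python-Levenshtein when that package is installed, which can differ slightly on some inputs):

```python
from difflib import SequenceMatcher

# Rounded percentage similarity, mirroring what fuzz.ratio reports.
score = int(round(100 * SequenceMatcher(None, 'brooklyn bridge',
                                        'brooklin bridge').ratio()))
print(score)  # → 93
```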

You'll have to import that library into your Redshift cluster, though. Download the fuzzywuzzy repo from GitHub, zip it, upload the zip to your S3, and create a library from it.

CREATE LIBRARY fuzzywuzzy LANGUAGE plpythonu FROM 's3://<bucket_name>/fuzzywuzzy.zip' CREDENTIALS 'aws_access_key_id=<access key id>;aws_secret_access_key=<secret key>'