如何使用Snowflake SQL找到单列中字符串的变化？

发布于 2025-02-09 08:42:30 字数 1177 浏览 1 评论 0原文

说我有这样的桌子：

Person1	Person2
Dave	Fred
Dave Dave	Dave
Dave Dave	Dave Dave Dave Dave
Dave	Dave Dave
Dave	Dave Dave Dave Dave
Dave Dave	Jeff

在专栏中“ Person1”中显然是最受欢迎的输入，所以我想在SQL（Snowflake）中产生“相似性得分”或“列中的变化”分数。

相反，对于列“ Person2”，字符串之间存在更大的变化，因此相似度得分将较低，或列内列内的变化。因此，您可能最终会以相似的分数输出为：'person1'：0.9，'person2'：0.4。

如果这只是行明确的Levenshtein距离（LD），我该如何在这些方面推出编辑距离以获得每列的分数？目前，我只能看到如何在'person1'和'person2'之间获得LD，而不是在“ Person1”和“ Person2”之间获得LD。

非常感谢

原文

Say I have a table like this:

Person1	Person2
Dave	Fred
Dave	Dave
Dave	Mike
Fred	Dave
Dave	Mike
Dave	Jeff

In column 'Person1' clearly Dave is the most popular input, so I'd like to produce a 'similarity score' or 'variation within column' score that would reflect that in SQL (Snowflake).

In contrast, for the column 'Person2' there is more variation between the strings and so the similarity score would be lower, or variation within column higher. So you might end up with a similarity score output as something like: 'Person1': 0.9, 'Person2': 0.4.

If this is just row-wise Levenshtein Distance (LD), how can I push EDITDISTANCE across these to get a score for each column please? At the moment I can only see how to get the LD between 'Person1' and 'Person2', rather than within 'Person1' and 'Person2'.

Many thanks

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

伴我老 2025-02-16 08:42:30

您提出的0.9和0.4的值似乎是相同性的比率，因此可以通过计数和比率_OF_REPORT像这样计算：


with data(person1, person2) as (
    select * from values
        ('Dave','Fred'),
        ('Dave','Dave'),
        ('Dave','Mike'),
        ('Fred','Dave'),
        ('Dave','Mike'),
        ('Dave','Jeff')
), p1 as (
    select 
      person1 
      ,count(*) as c_p1
      ,ratio_to_report(c_p1) over () as q
    from data
    group by 1
    qualify row_number() over(order by c_p1 desc) = 1
), p2 as (
    select 
      person2 
      ,count(*) as c_p2
      ,ratio_to_report(c_p2) over () as q
    from data
    group by 1
    qualify row_number() over(order by c_p2 desc) = 1
)
select 
    p1.q as p1_same,
    p2.q as p2_same
from p1
cross join p2
;

给予：

p1_same	p2_same
0.8333333 0.3333333	0.333333 0.333333

EditDistance：

因此，使用完整的交叉接口，我们可以计算 editdistance 所有值的，并找到总数的比率：

with data(person1, person2) as (
    select * from values
        ('Dave','Fred'),
        ('Dave','Dave'),
        ('Dave','Mike'),
        ('Fred','Dave'),
        ('Dave','Mike'),
        ('Dave','Jeff')
), combo as (
select 
    editdistance(da.person1, db.person1) as p1_dist    
    ,editdistance(da.person2, db.person2) as p2_dist
from data as da
cross join data as db
)
select count(*) as c
    ,sum(p1_dist) as s_p1_dist
    ,sum(p2_dist) as s_p2_dist
    ,c / s_p1_dist as p1_same
    ,c / s_p2_dist as p2_same 
from combo
 ;

但是所需的结果不符。

与

给定的editDistance给出了相同的零的结果，差的差异为差异，这些比例之间，能够平均如此。

select 
     avg(JAROWINKLER_SIMILARITY(da.person1, db.person1)/100) as p1_dist    
    ,avg(JAROWINKLER_SIMILARITY(da.person2, db.person2)/100) as p2_dist
from data as da
cross join data as db;

sql -reference/functions/jarowinkler_similarity.html“ rel =“ nofollow noreferrer	”> jarowinklet相似
性结果已经在0-100	之间缩放在0-100

You proposed values of 0.9 and 0.4 seem like ratio's of sameness, so that can be calculated with a count and ratio_of_report like so:


with data(person1, person2) as (
    select * from values
        ('Dave','Fred'),
        ('Dave','Dave'),
        ('Dave','Mike'),
        ('Fred','Dave'),
        ('Dave','Mike'),
        ('Dave','Jeff')
), p1 as (
    select 
      person1 
      ,count(*) as c_p1
      ,ratio_to_report(c_p1) over () as q
    from data
    group by 1
    qualify row_number() over(order by c_p1 desc) = 1
), p2 as (
    select 
      person2 
      ,count(*) as c_p2
      ,ratio_to_report(c_p2) over () as q
    from data
    group by 1
    qualify row_number() over(order by c_p2 desc) = 1
)
select 
    p1.q as p1_same,
    p2.q as p2_same
from p1
cross join p2
;

giving:

P1_SAME	P2_SAME
0.833333	0.333333

Editdistance:

So using a full cross join, we can calculate the editdistance of all values, and find the ratio of this to the total count:

with data(person1, person2) as (
    select * from values
        ('Dave','Fred'),
        ('Dave','Dave'),
        ('Dave','Mike'),
        ('Fred','Dave'),
        ('Dave','Mike'),
        ('Dave','Jeff')
), combo as (
select 
    editdistance(da.person1, db.person1) as p1_dist    
    ,editdistance(da.person2, db.person2) as p2_dist
from data as da
cross join data as db
)
select count(*) as c
    ,sum(p1_dist) as s_p1_dist
    ,sum(p2_dist) as s_p2_dist
    ,c / s_p1_dist as p1_same
    ,c / s_p2_dist as p2_same 
from combo
 ;

But given editdistance gives a result of zero for same and positive value for difference, the scaling of these does not align with the desired result...

JAROWINKLER_SIMILARITY:

Given the Jarowinklet similarity result is already scaled between 0 - 100, it makes more sense to be able to average this..

select 
     avg(JAROWINKLER_SIMILARITY(da.person1, db.person1)/100) as p1_dist    
    ,avg(JAROWINKLER_SIMILARITY(da.person2, db.person2)/100) as p2_dist
from data as da
cross join data as db;