如何使用Snowflake SQL找到单列中字符串的变化?
说我有这样的桌子:
Person1 | Person2 |
---|---|
Dave | Fred |
Dave Dave | Dave |
Dave Dave | Dave Dave Dave Dave |
Dave | Dave Dave |
Dave | Dave Dave Dave Dave |
Dave Dave | Jeff |
在专栏中“ Person1”中显然是最受欢迎的输入,所以我想在SQL(Snowflake)中产生“相似性得分”或“列中的变化”分数。
相反,对于列“ Person2”,字符串之间存在更大的变化,因此相似度得分将较低,或列内列内的变化。因此,您可能最终会以相似的分数输出为:'person1':0.9,'person2':0.4。
如果这只是行明确的Levenshtein距离(LD),我该如何在这些方面推出编辑距离以获得每列的分数?目前,我只能看到如何在'person1'和'person2'之间获得LD,而不是在“ Person1”和“ Person2”之间获得LD。
非常感谢
Say I have a table like this:
Person1 | Person2 |
---|---|
Dave | Fred |
Dave | Dave |
Dave | Mike |
Fred | Dave |
Dave | Mike |
Dave | Jeff |
In column 'Person1' clearly Dave is the most popular input, so I'd like to produce a 'similarity score' or 'variation within column' score that would reflect that in SQL (Snowflake).
In contrast, for the column 'Person2' there is more variation between the strings and so the similarity score would be lower, or variation within column higher. So you might end up with a similarity score output as something like: 'Person1': 0.9, 'Person2': 0.4.
If this is just row-wise Levenshtein Distance (LD), how can I push EDITDISTANCE across these to get a score for each column please? At the moment I can only see how to get the LD between 'Person1' and 'Person2', rather than within 'Person1' and 'Person2'.
Many thanks
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
您提出的0.9和0.4的值似乎是相同性的比率,因此可以通过计数和比率_OF_REPORT像这样计算:
给予:
EditDistance:
因此,使用完整的交叉接口,我们可以计算 editdistance 所有值的,并找到总数的比率:
但是 所需的结果不符。
与
给定的editDistance给出了相同的零的结果,差的差异为差异,这些比例 之间,能够平均如此。
You proposed values of 0.9 and 0.4 seem like ratio's of sameness, so that can be calculated with a count and ratio_of_report like so:
giving:
Editdistance:
So using a full cross join, we can calculate the editdistance of all values, and find the ratio of this to the total count:
But given editdistance gives a result of zero for same and positive value for difference, the scaling of these does not align with the desired result...
JAROWINKLER_SIMILARITY:
Given the Jarowinklet similarity result is already scaled between 0 - 100, it makes more sense to be able to average this..