如何优化从 SQL 中的大表检索最低编辑距离?

发布于 2024-09-02 07:03:25 字数 1679 浏览 8 评论 0原文

我在优化我正在进行的编辑距离计算时遇到了麻烦。我需要执行以下操作:

  1. 获取源字符串的最小距离的记录以及源字符串的修剪版本

  2. 选择距离最小的记录
  3. 如果最小距离相等(原始与修剪过的),则选择距离最小的修剪过的记录
  4. 如果仍有多个记录属于上述两类,则选择出现频率最高的一条

这是我的工作版本:

DECLARE @Results TABLE
(
    ID int,
    [Name] nvarchar(200), 
    Distance int, 
    Frequency int, 
    Trimmed bit
)


INSERT INTO @Results
    SELECT ID, 
           [Name], 
           (dbo.Levenshtein(@Source, [Name])) As Distance,
           Frequency, 
           'False' As Trimmed
    FROM
           MyTable

INSERT INTO @Results
    SELECT ID, 
           [Name], 
           (dbo.Levenshtein(@SourceTrimmed, [Name])) As Distance,
           Frequency, 
           'True' As Trimmed
    FROM
           MyTable

SET @ResultID = (SELECT TOP 1 ID FROM @Results ORDER BY Distance, Trimmed, Frequency)
SET @Result = (SELECT TOP 1 [Name] FROM @Results ORDER BY Distance, Trimmed, Frequency)
SET @ResultDist = (SELECT TOP 1 Distance FROM @Results ORDER BY Distance, Trimmed, Frequency)
SET @ResultTrimmed = (SELECT TOP 1 Trimmed FROM @Results ORDER BY Distance, Trimmed, Frequency)

我相信我在这里需要做的是..

  1. 不要将结果哑化到临时表中
  2. 仅从“MyTable”中进行 1 个选择
  3. 从初始 select 语句中将结果设置为正确。 (因为 select 将设置变量,并且您可以在一个 select 语句中设置多个变量)

我知道必须有一个很好的实现,但我无法弄清楚......这是我所得到的:

SELECT top 1 @ResultID = ID, 
             @Result = [Name], 
            (dbo.Levenshtein(@Source, [Name])) As distOrig,
             (dbo.Levenshtein(@SourceTrimmed, [Name])) As distTrimmed,
             Frequency
FROM
    MyTable
WHERE /* ... yeah I'm lost */
ORDER BY distOrig, distTrimmed, Frequency 

有什么想法吗?

I'm having troubles optimizing this Levenshtein Distance calculation I'm doing. I need to do the following:

  1. Get the record with the minimum distance for the source string as well as a trimmed version of the source string
  2. Pick the record with the minimum distance
  3. If the min distances are equal (original vs trimmed), choose the trimmed one with the lowest distance
  4. If there are still multiple records that fall under the above two categories, pick the one with the highest frequency

Here's my working version:

DECLARE @Results TABLE
(
    ID int,
    [Name] nvarchar(200), 
    Distance int, 
    Frequency int, 
    Trimmed bit
)


INSERT INTO @Results
    SELECT ID, 
           [Name], 
           (dbo.Levenshtein(@Source, [Name])) As Distance,
           Frequency, 
           'False' As Trimmed
    FROM
           MyTable

INSERT INTO @Results
    SELECT ID, 
           [Name], 
           (dbo.Levenshtein(@SourceTrimmed, [Name])) As Distance,
           Frequency, 
           'True' As Trimmed
    FROM
           MyTable

SET @ResultID = (SELECT TOP 1 ID FROM @Results ORDER BY Distance, Trimmed, Frequency)
SET @Result = (SELECT TOP 1 [Name] FROM @Results ORDER BY Distance, Trimmed, Frequency)
SET @ResultDist = (SELECT TOP 1 Distance FROM @Results ORDER BY Distance, Trimmed, Frequency)
SET @ResultTrimmed = (SELECT TOP 1 Trimmed FROM @Results ORDER BY Distance, Trimmed, Frequency)

I believe what I need to do here is to..

  1. Not dumb the results to a temporary table
  2. Do only 1 select from `MyTable`
  3. Setting the results right in the select from the initial select statement. (Since select will set variables and you can set multiple variables in one select statement)

I know there has to be a good implementation to this but I can't figure it out... this is as far as I got:

SELECT top 1 @ResultID = ID, 
             @Result = [Name], 
            (dbo.Levenshtein(@Source, [Name])) As distOrig,
             (dbo.Levenshtein(@SourceTrimmed, [Name])) As distTrimmed,
             Frequency
FROM
    MyTable
WHERE /* ... yeah I'm lost */
ORDER BY distOrig, distTrimmed, Frequency 

Any ideas?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

在风中等你 2024-09-09 07:03:25

我认为您的尝试与您所说的工作代码不同,因为工作代码首先按距离排序,无论是原始距离还是修剪后的距离。您的尝试首先按原始距离排序,然后进行修剪。

我不确定我完全理解您想要做什么,但是以下内容是否满足您的需要?

SELECT TOP 1
    @ResultId = ID,
    @Result = [Name],
    @ResultDist = distOrig,
    @ResultTrimmed = distTrimmed
FROM (
    SELECT
        ID, [Name], 
        dbo.Levenshtein(@Source, [Name]) As distOrig,
        dbo.Levenshtein(@SourceTrimmed, [Name])) As distTrimmed,
        Frequency
    FROM MyTable
) AS T
ORDER BY
    CASE WHEN distOrig > distTrimmed THEN distOrig ELSE distTrimmed END, -- Distance
    CASE WHEN distOrig > distTrimmed THEN 1 ELSE 0 END,                  -- Trimmed
    Frequency                                                            -- Frequency

I think your attempt differs from the code that you say works in that the working code orders by distance first, whether or not that is original or trimmed distance. Your attempt orders by original distance first, then trimmed.

I'm not sure I understand what you're trying to do entirely, but does the following do what you need?

SELECT TOP 1
    @ResultId = ID,
    @Result = [Name],
    @ResultDist = distOrig,
    @ResultTrimmed = distTrimmed
FROM (
    SELECT
        ID, [Name], 
        dbo.Levenshtein(@Source, [Name]) As distOrig,
        dbo.Levenshtein(@SourceTrimmed, [Name])) As distTrimmed,
        Frequency
    FROM MyTable
) AS T
ORDER BY
    CASE WHEN distOrig > distTrimmed THEN distOrig ELSE distTrimmed END, -- Distance
    CASE WHEN distOrig > distTrimmed THEN 1 ELSE 0 END,                  -- Trimmed
    Frequency                                                            -- Frequency
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文