MySQL SELECT DISTINCT 具有变异容忍度

发布于 2024-11-15 15:06:01 字数 486 浏览 2 评论 0原文

在我的数据库中,我有很多非常相似但不相同的条目。例如,只有两个字符可能不同,例如:

Row1:“天气很好,请参阅 http://xyz56.com "

Row2: "天气很好,参见http://xyz31.com"

我想去掉这些部分重复并只收到一个结果对于这两行。无论是哪个,我建议使用第一个出现的。

我可以利用 MySQL 的任何功能来有效地完成此操作吗?我的第一个想法是提取更多数据并对字符串进行比较,如果匹配字符超过某个阈值而不是忽略它。缺点是我永远不会知道我必须从数据库中提取多少条目,而且效率也很低,因为我必须将每一行与所有其他行进行比较 (O(n²))。

更新: 更具体地说明用例:方差的位置并不总是位于字符串的末尾,而且变化的字符也可能不止 2 个。每行的字符串长度各不相同。

in my database I have a lot of entries that are very similar but not identical. For instance just two characters might be different such as:

Row1: "The weather is nice, see http://xyz56.com"

Row2: "The weather is nice, see http://xyz31.com"

I would like to get rid of these partial duplicates and just receive one result for these two rows. It does not matter which one it is, I would suggest to use the first one that appears.

Is there any feature I could utilize from MySQL to do this efficiently? My first thought was to pull more data and do a comparism on the string, if the matching characters are over some threshold than ignore it. Downside is I will never know how many entries I have to pull from the database and it also is quiet inefficient since I have to compare each row with all the other rows (O(n²)).

Update:
To be more specific on the use cases: The position of the variance is not always at the end of the string and it might also be more than just 2 characters that change. The string length varies with each row.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

や莫失莫忘 2024-11-22 15:06:01

我的建议是使用 Levenshtein 距离,这是字符串相似性的度量。要让 MySQL 直接计算此值,您必须在存储过程中实现它,示例如下: http://www.artfulsoftware.com/infotree/queries.php#552

PHP 和 Java 也有常见的实现。

My suggestion would be to use Levenshtein distance, which is a measure for string similarity. To get MySQL to compute this directly, you will have to implement it in a stored procedure, an example is over here: http://www.artfulsoftware.com/infotree/queries.php#552.

There are also common implementations for PHP and Java.

平生欢 2024-11-22 15:06:01

您可以使用 SOUNDEX。

SOUNDEX(str)
Returns a soundex string from str. Two strings that sound almost the same should have identical soundex strings. A standard soundex string is four characters long, but the SOUNDEX() function returns an arbitrarily long string. You can use SUBSTRING() on the result to get a standard soundex string. All non-alphabetic characters in str are ignored. All international alphabetic characters outside the A-Z range are treated as vowels.

mysql> SELECT SOUNDEX('Hello');
+---------------------------------------------------------+
| SOUNDEX('Hello')                                        |
+---------------------------------------------------------+
| H400                                                    |
+---------------------------------------------------------+
1 row in set (0.00 sec)

来源: http://www.tutorialspoint.com/mysql/mysql -string-functions.htm#operator_sounds-like

例如,在 Oracle PL/SQL 中,您的字符串具有相同的 SOUNDEX 并且 SOUNDEX 无法区分:

select soundex ('The weather is nice, see http://xyz56.com') from dual;

SOUNDEX('THEWEATHERISNICE,SEEHTTP://XYZ56.COM')
-----------------------------------------------
T362                                           
1 row selected.

select soundex ('The weather is nice, see http://xyz31.com') from dual;

SOUNDEX('THEWEATHERISNICE,SEEHTTP://XYZ31.COM')
-----------------------------------------------
T362                                           
1 row selected.

You can use SOUNDEX.

SOUNDEX(str)
Returns a soundex string from str. Two strings that sound almost the same should have identical soundex strings. A standard soundex string is four characters long, but the SOUNDEX() function returns an arbitrarily long string. You can use SUBSTRING() on the result to get a standard soundex string. All non-alphabetic characters in str are ignored. All international alphabetic characters outside the A-Z range are treated as vowels.

mysql> SELECT SOUNDEX('Hello');
+---------------------------------------------------------+
| SOUNDEX('Hello')                                        |
+---------------------------------------------------------+
| H400                                                    |
+---------------------------------------------------------+
1 row in set (0.00 sec)

Source: http://www.tutorialspoint.com/mysql/mysql-string-functions.htm#operator_sounds-like

For example, in Oracle PL/SQL your strings have the same SOUNDEX and are SOUNDEX indistinguishable:

select soundex ('The weather is nice, see http://xyz56.com') from dual;

SOUNDEX('THEWEATHERISNICE,SEEHTTP://XYZ56.COM')
-----------------------------------------------
T362                                           
1 row selected.

select soundex ('The weather is nice, see http://xyz31.com') from dual;

SOUNDEX('THEWEATHERISNICE,SEEHTTP://XYZ31.COM')
-----------------------------------------------
T362                                           
1 row selected.
桜花祭 2024-11-22 15:06:01
SELECT * FROM test GROUP BY SUBSTR(mytext, 1, 10);

SELECT * FROM test GROUP BY SUBSTR(mytext, 1, 10);

还不是爱你 2024-11-22 15:06:01

MySQL 的 Levenshtein 距离算法:

请参阅:MySQL 的 Levenshtein 距离的实现/模糊搜索?

编辑距离
两个字符串之间的编辑距离是指将一个字符串转换为另一个字符串所需的最少操作次数,其中一次操作可以是插入、删除或替换一个字符。 Jason Rust 在 http://www.codejanitor.com/wp/ 发布了这个 MySQL 算法。

CREATE FUNCTION levenshtein( s1 VARCHAR(255), s2 VARCHAR(255) ) 
  RETURNS INT 
  DETERMINISTIC 
  BEGIN 
    DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT; 
    DECLARE s1_char CHAR; 
    -- max strlen=255 
    DECLARE cv0, cv1 VARBINARY(256); 
    SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0; 
    IF s1 = s2 THEN 
      RETURN 0; 
    ELSEIF s1_len = 0 THEN 
      RETURN s2_len; 
    ELSEIF s2_len = 0 THEN 
      RETURN s1_len; 
    ELSE 
      WHILE j <= s2_len DO 
        SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1; 
      END WHILE; 
      WHILE i <= s1_len DO 
        SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1; 
        WHILE j <= s2_len DO 
          SET c = c + 1; 
          IF s1_char = SUBSTRING(s2, j, 1) THEN  
            SET cost = 0; ELSE SET cost = 1; 
          END IF; 
          SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost; 
          IF c > c_temp THEN SET c = c_temp; END IF; 
            SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1; 
            IF c > c_temp THEN  
              SET c = c_temp;  
            END IF; 
            SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1; 
        END WHILE; 
        SET cv1 = cv0, i = i + 1; 
      END WHILE; 
    END IF; 
    RETURN c; 
  END; 

辅助函数:

CREATE FUNCTION levenshtein_ratio( s1 VARCHAR(255), s2 VARCHAR(255) ) 
  RETURNS INT 
  DETERMINISTIC 
  BEGIN 
    DECLARE s1_len, s2_len, max_len INT; 
    SET s1_len = LENGTH(s1), s2_len = LENGTH(s2); 
    IF s1_len > s2_len THEN  
      SET max_len = s1_len;  
    ELSE  
      SET max_len = s2_len;  
    END IF; 
    RETURN ROUND((1 - LEVENSHTEIN(s1, s2) / max_len) * 100); 
  END; 

Levenshtein 距离算法:Oracle PL/SQL 实现

来源:http://www.merriampark.com/ ldplsql.htm

CREATE OR REPLACE FUNCTION ld -- Levenshtein distance
  (p_source_string   IN VARCHAR2,
   p_target_string   IN VARCHAR2)
  RETURN                NUMBER
  DETERMINISTIC
AS
  v_length_of_source    NUMBER := NVL (LENGTH (p_source_string), 0);
  v_length_of_target    NUMBER := NVL (LENGTH (p_target_string), 0);
  TYPE mytabtype IS     TABLE OF NUMBER INDEX BY BINARY_INTEGER;
  column_to_left        mytabtype;
  current_column        mytabtype;
  v_cost                NUMBER := 0;
BEGIN
  IF v_length_of_source = 0 THEN
    RETURN v_length_of_target;
  ELSIF v_length_of_target = 0 THEN
    RETURN v_length_of_source;
  ELSE
    FOR j IN 0 .. v_length_of_target LOOP
      column_to_left(j) := j;
    END LOOP;
    FOR i IN 1.. v_length_of_source LOOP
      current_column(0) := i;
      FOR j IN 1 .. v_length_of_target LOOP
        IF SUBSTR (p_source_string, i, 1) =
           SUBSTR (p_target_string, j, 1)
        THEN v_cost := 0;
        ELSE v_cost := 1;
        END IF;
        current_column(j) := LEAST (current_column(j-1) + 1,
                                    column_to_left(j) + 1,
                                    column_to_left(j-1) + v_cost);
      END LOOP;
      FOR j IN 0 .. v_length_of_target  LOOP
        column_to_left(j) := current_column(j);
      END LOOP;
    END LOOP;
  END IF;
  RETURN current_column(v_length_of_target);
END ld;

如果您假设有一个名为 EMPLOYEES 的表,其中包含 VARCHAR2 类型的名为 FIRST_NAME 的列,您可以轻松找到记录其中编辑距离 = 1,如下所示:

SELECT *
  FROM employees alfa
 WHERE EXISTS
          (SELECT 'X'
             FROM employees beta
            WHERE ld (beta.first_name, alfa.first_name) = 1);

通过此查询,您可以在结果集的每一行中显示编辑距离 = 1 的first_name 的列表:

SELECT a.first_name, b.first_name
  FROM    employees a
       INNER JOIN
          employees b
       ON ld (b.first_name, a.first_name) = 1;

示例:

SELECT DISTINCT a.first_name, b.first_name
  FROM    employees a
       INNER JOIN
          employees b
       ON ld (b.first_name, a.first_name) <= 2
          AND ld (b.first_name, a.first_name) > 0;

FIRST_NAME;FIRST_NAME_1
Jean;John
Nancy;Vance
Alana;Allan
Alana;Clara
Ellen;Eleni
John;Jean
Daniel;Danielle
Danielle;Daniel
Shelley;Shelli
Sundita;Nandita
Lisa;Luis
Stephen;Steven
Nanette;Janette
Diana;Alana
TJ;Ki
Luis;Lisa
Sarath;Sarah
Louise;Luis
Ki;TJ
Allan;Ellen
Luis;Louise
Den;Lex
Clara;Alana
Matthew;Mattea
Shelli;Shelley
Sarah;Sarath
Girard;Gerald
Vance;Nancy
Mattea;Martha
Allan;Alana
Nandita;Sundita
Ellen;Allan
Jean;Den
Eleni;Ellen
Gerald;Girard
Lex;Den
Janette;Nanette
Steven;Stephen
Mattea;Matthew
Den;Jean
Martha;Mattea
Alana;Diana

Levenshtein Distance Algorithm for MySQL:

Please see: Implementation of Levenshtein distance for mysql/fuzzy search?

Levenshtein distance
The Levenshtein distance between two strings is the minimum number of operations needed to transform one string into the other, where an operation may be insertion, deletion or substitution of one character. Jason Rust published this MySQL algorithm for it at http://www.codejanitor.com/wp/.

CREATE FUNCTION levenshtein( s1 VARCHAR(255), s2 VARCHAR(255) ) 
  RETURNS INT 
  DETERMINISTIC 
  BEGIN 
    DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT; 
    DECLARE s1_char CHAR; 
    -- max strlen=255 
    DECLARE cv0, cv1 VARBINARY(256); 
    SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0; 
    IF s1 = s2 THEN 
      RETURN 0; 
    ELSEIF s1_len = 0 THEN 
      RETURN s2_len; 
    ELSEIF s2_len = 0 THEN 
      RETURN s1_len; 
    ELSE 
      WHILE j <= s2_len DO 
        SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1; 
      END WHILE; 
      WHILE i <= s1_len DO 
        SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1; 
        WHILE j <= s2_len DO 
          SET c = c + 1; 
          IF s1_char = SUBSTRING(s2, j, 1) THEN  
            SET cost = 0; ELSE SET cost = 1; 
          END IF; 
          SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost; 
          IF c > c_temp THEN SET c = c_temp; END IF; 
            SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1; 
            IF c > c_temp THEN  
              SET c = c_temp;  
            END IF; 
            SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1; 
        END WHILE; 
        SET cv1 = cv0, i = i + 1; 
      END WHILE; 
    END IF; 
    RETURN c; 
  END; 

Helper function:

CREATE FUNCTION levenshtein_ratio( s1 VARCHAR(255), s2 VARCHAR(255) ) 
  RETURNS INT 
  DETERMINISTIC 
  BEGIN 
    DECLARE s1_len, s2_len, max_len INT; 
    SET s1_len = LENGTH(s1), s2_len = LENGTH(s2); 
    IF s1_len > s2_len THEN  
      SET max_len = s1_len;  
    ELSE  
      SET max_len = s2_len;  
    END IF; 
    RETURN ROUND((1 - LEVENSHTEIN(s1, s2) / max_len) * 100); 
  END; 

Levenshtein Distance Algorithm: Oracle PL/SQL Implementation

SOURCE: http://www.merriampark.com/ldplsql.htm

CREATE OR REPLACE FUNCTION ld -- Levenshtein distance
  (p_source_string   IN VARCHAR2,
   p_target_string   IN VARCHAR2)
  RETURN                NUMBER
  DETERMINISTIC
AS
  v_length_of_source    NUMBER := NVL (LENGTH (p_source_string), 0);
  v_length_of_target    NUMBER := NVL (LENGTH (p_target_string), 0);
  TYPE mytabtype IS     TABLE OF NUMBER INDEX BY BINARY_INTEGER;
  column_to_left        mytabtype;
  current_column        mytabtype;
  v_cost                NUMBER := 0;
BEGIN
  IF v_length_of_source = 0 THEN
    RETURN v_length_of_target;
  ELSIF v_length_of_target = 0 THEN
    RETURN v_length_of_source;
  ELSE
    FOR j IN 0 .. v_length_of_target LOOP
      column_to_left(j) := j;
    END LOOP;
    FOR i IN 1.. v_length_of_source LOOP
      current_column(0) := i;
      FOR j IN 1 .. v_length_of_target LOOP
        IF SUBSTR (p_source_string, i, 1) =
           SUBSTR (p_target_string, j, 1)
        THEN v_cost := 0;
        ELSE v_cost := 1;
        END IF;
        current_column(j) := LEAST (current_column(j-1) + 1,
                                    column_to_left(j) + 1,
                                    column_to_left(j-1) + v_cost);
      END LOOP;
      FOR j IN 0 .. v_length_of_target  LOOP
        column_to_left(j) := current_column(j);
      END LOOP;
    END LOOP;
  END IF;
  RETURN current_column(v_length_of_target);
END ld;

If you suppose to have a table called EMPLOYEES with a column named FIRST_NAME of type VARCHAR2, you can find easily the records which have Levenshtein Distance = 1 as follows:

SELECT *
  FROM employees alfa
 WHERE EXISTS
          (SELECT 'X'
             FROM employees beta
            WHERE ld (beta.first_name, alfa.first_name) = 1);

With this query you can show, in every row of the result set, the list of the first_name with Levenshtein Distance = 1:

SELECT a.first_name, b.first_name
  FROM    employees a
       INNER JOIN
          employees b
       ON ld (b.first_name, a.first_name) = 1;

An example:

SELECT DISTINCT a.first_name, b.first_name
  FROM    employees a
       INNER JOIN
          employees b
       ON ld (b.first_name, a.first_name) <= 2
          AND ld (b.first_name, a.first_name) > 0;

FIRST_NAME;FIRST_NAME_1
Jean;John
Nancy;Vance
Alana;Allan
Alana;Clara
Ellen;Eleni
John;Jean
Daniel;Danielle
Danielle;Daniel
Shelley;Shelli
Sundita;Nandita
Lisa;Luis
Stephen;Steven
Nanette;Janette
Diana;Alana
TJ;Ki
Luis;Lisa
Sarath;Sarah
Louise;Luis
Ki;TJ
Allan;Ellen
Luis;Louise
Den;Lex
Clara;Alana
Matthew;Mattea
Shelli;Shelley
Sarah;Sarath
Girard;Gerald
Vance;Nancy
Mattea;Martha
Allan;Alana
Nandita;Sundita
Ellen;Allan
Jean;Den
Eleni;Ellen
Gerald;Girard
Lex;Den
Janette;Nanette
Steven;Stephen
Mattea;Matthew
Den;Jean
Martha;Mattea
Alana;Diana
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文