如何在 MySQL 的两个不同表中选择相似的行（可能吗？）

发布于 2024-09-12 02:09:25 字数 785 浏览 3 评论 0原文

我有两张桌子，我想获取所有拼写相似的学校或同一所学校的学校。例如：

my_table_a

学校

：

Olde School        
New School    
Other, C.S. School   
Main School
Too Cool for School

my_table_b:

School

Old School
New ES    
Other School 
Main School  
Hardknocks School

是否可以编写一个 SELECT 查询来查找两个表中拼写相似的学校。有没有办法在列上使用 LIKE 或通配符？

例如：

SELECT  my_table_a.school, my_table_b.school
FROM ` my_table_a` ,  my_table_b
WHERE  my_table_a.directory_school_name_09_10 LIKE  my_table_b.school

我用真实的表尝试了上述语句，我只是得到了“=”会产生的结果。

基本上，我想获取每个表列中的前 4 所学校。（当然，在现实世界中，我不会知道前 4 所学校是相似的）。

我想做的事情可能吗？

原文

I've got two tables and I'd like to grab all of the schools where there is a similarly spelled school or the same school. For example:

my_table_a:

School

Olde School        
New School    
Other, C.S. School   
Main School
Too Cool for School

my_table_b:

School

Old School
New ES    
Other School 
Main School  
Hardknocks School

Is it possible to write a SELECT query that will find the similarly spelled schools in the two tables. Is there a way to use LIKE or wildcards on columns?

Something such as:

SELECT  my_table_a.school, my_table_b.school
FROM ` my_table_a` ,  my_table_b
WHERE  my_table_a.directory_school_name_09_10 LIKE  my_table_b.school

I tried the above statement with my real tables and I simply got what '=' would have produced.

Basically, I want to grab the first 4 schools in each table's column. (Of course, in the real world, I won't know that the first 4 schools are similar).

Is what I'm trying to do even possible?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

风柔一江水 2024-09-19 02:09:25

对于 Levenshtein Distance 算法的 UDF 实现，您可能需要查看“codejanitor.com：Levenshtein 距离作为 MySQL 存储函数”：

CREATE FUNCTION LEVENSHTEIN (s1 VARCHAR(255), s2 VARCHAR(255))
RETURNS INT
DETERMINISTIC
BEGIN
  DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
  DECLARE s1_char CHAR;
  DECLARE cv0, cv1 VARBINARY(256);
  SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
  IF s1 = s2 THEN
    RETURN 0;
  ELSEIF s1_len = 0 THEN
    RETURN s2_len;
  ELSEIF s2_len = 0 THEN
    RETURN s1_len;
  ELSE
    WHILE j <= s2_len DO
      SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
    END WHILE;
    WHILE i <= s1_len DO
      SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
      WHILE j <= s2_len DO
        SET c = c + 1;
        IF s1_char = SUBSTRING(s2, j, 1) THEN SET cost = 0; ELSE SET cost = 1; END IF;
        SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
        IF c > c_temp THEN SET c = c_temp; END IF;
        SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
        IF c > c_temp THEN SET c = c_temp; END IF;
        SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
      END WHILE;
      SET cv1 = cv0, i = i + 1;
    END WHILE;
  END IF;
  RETURN c;
END

现在让我们使用您在问题中提供的数据构建一个测试用例：

CREATE TABLE table_a (name varchar(20));
CREATE TABLE table_b (name varchar(20));

INSERT INTO table_a VALUES('Olde School');      
INSERT INTO table_a VALUES('New School');
INSERT INTO table_a VALUES('Other, C.S. School');
INSERT INTO table_a VALUES('Main School');
INSERT INTO table_a VALUES('Too Cool for School');

INSERT INTO table_b VALUES('Old School');
INSERT INTO table_b VALUES('New ES');
INSERT INTO table_b VALUES('Other School');
INSERT INTO table_b VALUES('Main School');
INSERT INTO table_b VALUES('Hardknocks School');

然后：

SELECT     *
FROM       table_a a
LEFT JOIN  table_b b ON (a.name = b.name);

显然返回学校名称完全匹配的匹配项：

+---------------------+-------------+
| name                | name        |
+---------------------+-------------+
| Olde School         | NULL        |
| New School          | NULL        |
| Other, C.S. School  | NULL        |
| Main School         | Main School |
| Too Cool for School | NULL        |
+---------------------+-------------+
5 rows in set (0.00 sec)

现在我们可以尝试使用 LEVENSHTEIN 函数来返回编辑距离等于或小于 2 个字符的学校名称：

SELECT     *
FROM       table_a a
LEFT JOIN  table_b b ON (LEVENSHTEIN(a.name, b.name) <= 2);

+---------------------+-------------+
| name                | name        |
+---------------------+-------------+
| Olde School         | Old School  |
| New School          | NULL        |
| Other, C.S. School  | NULL        |
| Main School         | Main School |
| Too Cool for School | NULL        |
+---------------------+-------------+
5 rows in set (0.08 sec)

现在使用 < = 3 作为编辑距离阈值：

SELECT     *
FROM       table_a a
LEFT JOIN  table_b b ON (LEVENSHTEIN(a.name, b.name) <= 3);

我们得到以下结果：

+---------------------+--------------+
| name                | name         |
+---------------------+--------------+
| Olde School         | Old School   |
| Olde School         | Other School |
| New School          | Old School   |
| Other, C.S. School  | NULL         |
| Main School         | Main School  |
| Too Cool for School | NULL         |
+---------------------+--------------+
6 rows in set (0.06 sec)

注意这次 Olde School 也匹配 Other School 和 New School< /code> 也匹配 Old School 。这些可能是误报，并且表明定义阈值对于避免不正确的匹配非常重要。

解决此问题的一种常见技术是在应用阈值时考虑字符串的长度。事实上，我引用的网站为此实现还提供了一个LEVENSHTEIN_RATIO函数，该函数返回基于字符串长度的编辑差异的比率（以百分比形式）。

For a UDF implementation of the Levenshtein Distance algorithm you may want to check out "codejanitor.com: Levenshtein Distance as a MySQL Stored Function":

CREATE FUNCTION LEVENSHTEIN (s1 VARCHAR(255), s2 VARCHAR(255))
RETURNS INT
DETERMINISTIC
BEGIN
  DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
  DECLARE s1_char CHAR;
  DECLARE cv0, cv1 VARBINARY(256);
  SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
  IF s1 = s2 THEN
    RETURN 0;
  ELSEIF s1_len = 0 THEN
    RETURN s2_len;
  ELSEIF s2_len = 0 THEN
    RETURN s1_len;
  ELSE
    WHILE j <= s2_len DO
      SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
    END WHILE;
    WHILE i <= s1_len DO
      SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
      WHILE j <= s2_len DO
        SET c = c + 1;
        IF s1_char = SUBSTRING(s2, j, 1) THEN SET cost = 0; ELSE SET cost = 1; END IF;
        SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
        IF c > c_temp THEN SET c = c_temp; END IF;
        SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
        IF c > c_temp THEN SET c = c_temp; END IF;
        SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
      END WHILE;
      SET cv1 = cv0, i = i + 1;
    END WHILE;
  END IF;
  RETURN c;
END

Now let's build a test case, using the data you provided in your question:

CREATE TABLE table_a (name varchar(20));
CREATE TABLE table_b (name varchar(20));

INSERT INTO table_a VALUES('Olde School');      
INSERT INTO table_a VALUES('New School');
INSERT INTO table_a VALUES('Other, C.S. School');
INSERT INTO table_a VALUES('Main School');
INSERT INTO table_a VALUES('Too Cool for School');

INSERT INTO table_b VALUES('Old School');
INSERT INTO table_b VALUES('New ES');
INSERT INTO table_b VALUES('Other School');
INSERT INTO table_b VALUES('Main School');
INSERT INTO table_b VALUES('Hardknocks School');

Then:

SELECT     *
FROM       table_a a
LEFT JOIN  table_b b ON (a.name = b.name);

Obviously returns a match where the school names match exactly:

+---------------------+-------------+
| name                | name        |
+---------------------+-------------+
| Olde School         | NULL        |
| New School          | NULL        |
| Other, C.S. School  | NULL        |
| Main School         | Main School |
| Too Cool for School | NULL        |
+---------------------+-------------+
5 rows in set (0.00 sec)

Now we can try to use the LEVENSHTEIN function to return school names that have an edit distance of 2 characters or less:

SELECT     *
FROM       table_a a
LEFT JOIN  table_b b ON (LEVENSHTEIN(a.name, b.name) <= 2);

+---------------------+-------------+
| name                | name        |
+---------------------+-------------+
| Olde School         | Old School  |
| New School          | NULL        |
| Other, C.S. School  | NULL        |
| Main School         | Main School |
| Too Cool for School | NULL        |
+---------------------+-------------+
5 rows in set (0.08 sec)

Now using <= 3 as an edit distance threshold:

SELECT     *
FROM       table_a a
LEFT JOIN  table_b b ON (LEVENSHTEIN(a.name, b.name) <= 3);

We get the following result:

+---------------------+--------------+
| name                | name         |
+---------------------+--------------+
| Olde School         | Old School   |
| Olde School         | Other School |
| New School          | Old School   |
| Other, C.S. School  | NULL         |
| Main School         | Main School  |
| Too Cool for School | NULL         |
+---------------------+--------------+
6 rows in set (0.06 sec)

Note how this time Olde School also matched Other School, and New School matched Old School as well. These are probably false positive, and shows that defining the threshold is very important to avoid incorrect matches.

One common technique to tackle this problem is to take into consideration the length of the strings when applying a threshold. In fact, the site that I cited for this implementation also provides a LEVENSHTEIN_RATIO function which returns the ratio (as a percentage) of the edit difference based on the length of the strings.

回复收藏 0 原文

好多鱼好多余 2024-09-19 02:09:25

您可以尝试比较调用 SOUNDEX。

或者您可以比较编辑距离。

回复收藏 0 原文

ぇ气 2024-09-19 02:09:25

这个问题相当广泛。什么才算相似？不管怎样，虽然我不能给你一个确切的答案，但我可以说你可能想使用全文搜索： http://dev.mysql.com/doc/refman/5.1/en/fulltext-search.html 或使用某种使用 LIKE 通配符运算符的技术: %

... LIKE "new%school"

也会匹配“new school”和“new darn school”。

This question is pretty broad. What qualifies as similar? Anyway, while I can't give you an exact answer, I can say that you'll probably want to use a fulltext search: http://dev.mysql.com/doc/refman/5.1/en/fulltext-search.html or use some technique using the LIKE wildcard operator: %

... LIKE "new%school"

will match "new school" and "new darn school" too.

回复收藏 0 原文

葬シ愛 2024-09-19 02:09:25

你可以尝试这个，

select test1.test1name,test2.test2name from test1  join test2 on
soundex(SUBSTRING_INDEX(test1.test1name, ' ',1))=
soundex(SUBSTRING_INDEX(test2.test2name, ' ',1))

只需根据你的表名和字段名输入即可。

You can try this one

select test1.test1name,test2.test2name from test1  join test2 on
soundex(SUBSTRING_INDEX(test1.test1name, ' ',1))=
soundex(SUBSTRING_INDEX(test2.test2name, ' ',1))

just put the table name and field name according to your one.

回复收藏 0 原文

~没有更多了~