MySQL SELECT DISTINCT 具有变异容忍度
在我的数据库中,我有很多非常相似但不相同的条目。例如,只有两个字符可能不同,例如:
Row1:“天气很好,请参阅 http://xyz56.com "
Row2: "天气很好,参见http://xyz31.com"
我想去掉这些部分重复并只收到一个结果对于这两行。无论是哪个,我建议使用第一个出现的。
我可以利用 MySQL 的任何功能来有效地完成此操作吗?我的第一个想法是提取更多数据并对字符串进行比较,如果匹配字符超过某个阈值而不是忽略它。缺点是我永远不会知道我必须从数据库中提取多少条目,而且效率也很低,因为我必须将每一行与所有其他行进行比较 (O(n²))。
更新: 更具体地说明用例:方差的位置并不总是位于字符串的末尾,而且变化的字符也可能不止 2 个。每行的字符串长度各不相同。
in my database I have a lot of entries that are very similar but not identical. For instance just two characters might be different such as:
Row1: "The weather is nice, see http://xyz56.com"
Row2: "The weather is nice, see http://xyz31.com"
I would like to get rid of these partial duplicates and just receive one result for these two rows. It does not matter which one it is, I would suggest to use the first one that appears.
Is there any feature I could utilize from MySQL to do this efficiently? My first thought was to pull more data and do a comparism on the string, if the matching characters are over some threshold than ignore it. Downside is I will never know how many entries I have to pull from the database and it also is quiet inefficient since I have to compare each row with all the other rows (O(n²)).
Update:
To be more specific on the use cases: The position of the variance is not always at the end of the string and it might also be more than just 2 characters that change. The string length varies with each row.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
我的建议是使用 Levenshtein 距离,这是字符串相似性的度量。要让 MySQL 直接计算此值,您必须在存储过程中实现它,示例如下: http://www.artfulsoftware.com/infotree/queries.php#552。
PHP 和 Java 也有常见的实现。
My suggestion would be to use Levenshtein distance, which is a measure for string similarity. To get MySQL to compute this directly, you will have to implement it in a stored procedure, an example is over here: http://www.artfulsoftware.com/infotree/queries.php#552.
There are also common implementations for PHP and Java.
您可以使用 SOUNDEX。
来源: http://www.tutorialspoint.com/mysql/mysql -string-functions.htm#operator_sounds-like
例如,在 Oracle PL/SQL 中,您的字符串具有相同的 SOUNDEX 并且 SOUNDEX 无法区分:
You can use SOUNDEX.
Source: http://www.tutorialspoint.com/mysql/mysql-string-functions.htm#operator_sounds-like
For example, in Oracle PL/SQL your strings have the same SOUNDEX and are SOUNDEX indistinguishable:
MySQL 的 Levenshtein 距离算法:
请参阅:MySQL 的 Levenshtein 距离的实现/模糊搜索?
编辑距离
两个字符串之间的编辑距离是指将一个字符串转换为另一个字符串所需的最少操作次数,其中一次操作可以是插入、删除或替换一个字符。 Jason Rust 在 http://www.codejanitor.com/wp/ 发布了这个 MySQL 算法。
辅助函数:
Levenshtein 距离算法:Oracle PL/SQL 实现
来源:http://www.merriampark.com/ ldplsql.htm
如果您假设有一个名为 EMPLOYEES 的表,其中包含 VARCHAR2 类型的名为 FIRST_NAME 的列,您可以轻松找到记录其中编辑距离 = 1,如下所示:
通过此查询,您可以在结果集的每一行中显示编辑距离 = 1 的first_name 的列表:
示例:
Levenshtein Distance Algorithm for MySQL:
Please see: Implementation of Levenshtein distance for mysql/fuzzy search?
Levenshtein distance
The Levenshtein distance between two strings is the minimum number of operations needed to transform one string into the other, where an operation may be insertion, deletion or substitution of one character. Jason Rust published this MySQL algorithm for it at http://www.codejanitor.com/wp/.
Helper function:
Levenshtein Distance Algorithm: Oracle PL/SQL Implementation
SOURCE: http://www.merriampark.com/ldplsql.htm
If you suppose to have a table called EMPLOYEES with a column named FIRST_NAME of type VARCHAR2, you can find easily the records which have Levenshtein Distance = 1 as follows:
With this query you can show, in every row of the result set, the list of the first_name with Levenshtein Distance = 1:
An example: