编辑替代方案
我有一大堆查询并使用levenshtein来计算拼写错误,现在levenshtein导致mysql占用完整的CPU时间。 我的查询是 UNION 语句中的全文搜索 + levenshtein。 sql1 是我当前的查询,sql2 只是全文搜索,速度快且不会使用太多 cpu 时间,最后一个是 leventhein ,它将达到峰值!
你们中的任何人还有其他方法来解决拼写错误吗? 请不要回答标准化数据,我已经想到了这一点,但不适用于我的数据,因为我无法预先进行匹配/计算并创建带有索引的单独表。
$sql1 = "(SELECT * FROM ci_sanctions_properties WHERE prop_type='LASTNAME' AND prop_value!='' AND MATCH(prop_value) AGAINST ('+usama bin laden' IN BOOLEAN MODE)) UNION (SELECT s.* FROM (SELECT levenshtein(prop_value, 'usama bin laden') AS dist, sanction_id, prop_type, prop_value FROM ci_sanctions_properties WHERE prop_type='LASTNAME' AND prop_value!='') s WHERE dist < 3) ORDER BY sanction_id";
$sql2 = "SELECT * FROM ci_sanctions_properties WHERE prop_type='LASTNAME' AND prop_value!='' AND MATCH(prop_value) AGAINST ('+usama bin laden' IN BOOLEAN MODE) ORDER BY sanction_id";
$sql3 = "SELECT s.* FROM (SELECT levenshtein(prop_value, 'usama bin laden') AS dist, sanction_id, prop_type, prop_value FROM ci_sanctions_properties WHERE prop_type='LASTNAME' AND prop_value!='') s WHERE dist < 3";
i have a big set of queries and use levenshtein to calculate typos, now levenshtein causes mysql to take full cpu time.
My query is a fulltext search + levenshtein in a UNION statement.
sql1 is my current query, sql2 is only fulltext search which is fast and doesnt use too much cpu time, the last one the leventhein one which will peak!
Any of you have an alternative way to get typos as well?
Please don't answer normalize data, I have thought of that, but is not applicable to my data, as I cannot pre-make the matches/calculations and create a separate table with indexes.
$sql1 = "(SELECT * FROM ci_sanctions_properties WHERE prop_type='LASTNAME' AND prop_value!='' AND MATCH(prop_value) AGAINST ('+usama bin laden' IN BOOLEAN MODE)) UNION (SELECT s.* FROM (SELECT levenshtein(prop_value, 'usama bin laden') AS dist, sanction_id, prop_type, prop_value FROM ci_sanctions_properties WHERE prop_type='LASTNAME' AND prop_value!='') s WHERE dist < 3) ORDER BY sanction_id";
$sql2 = "SELECT * FROM ci_sanctions_properties WHERE prop_type='LASTNAME' AND prop_value!='' AND MATCH(prop_value) AGAINST ('+usama bin laden' IN BOOLEAN MODE) ORDER BY sanction_id";
$sql3 = "SELECT s.* FROM (SELECT levenshtein(prop_value, 'usama bin laden') AS dist, sanction_id, prop_type, prop_value FROM ci_sanctions_properties WHERE prop_type='LASTNAME' AND prop_value!='') s WHERE dist < 3";
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
如果您只依赖 MySQL,那么没有一个简单的解决方案。
通常,这是使用专门的 ngram 索引来解决的,用于快速候选者查找过滤,然后仅对 10-50 个候选者计算 levensthein,这比计算所有对的 levensthein 更快。
像 Solr/Lucene 这样的专业全文搜索引擎已经内置了这个功能。PostgreSQL
有 pg_trgm contrib 模块 (http://www.postgresql.org/docs/9.0/static/pgtrgm.html),它的工作方式就像一个魅力。
您甚至可以使用全文索引在 MySQL 中模拟这一点,但您必须从所有文档中收集单词,将其转换为 ngram,在其上创建全文索引,并将它们全部组合在一起以进行快速查找。这会带来各种冗余、同步问题……不值得你花时间。
If you are tied only to MySQL there is not an easy solution.
Usually this is solved using specialized ngram indexing for fast candidate lookup filtering and then calculating levensthein only on like 10-50 candidates which is faster that calculating levensthein for all pairs.
Specialized fulltext search engines like Solr/Lucene have this built in.
PostgreSQL has pg_trgm contrib module (http://www.postgresql.org/docs/9.0/static/pgtrgm.html) which works like a charm.
You can even simulate this in MySQL using fulltext indexing, but you have to collect words from all your documents convert them to ngrams, create fulltext indexes on them, and hack them all together for fast lookup. Which brings all sorts of trouble with redundancy, sync...not worth your time.