给定一串单词:如何在忽略标点符号的情况下查找 MySQL 中 varchar(1000) 列中的每个单词(不区分大小写)?
我有两个数据库 a
和 b
,其中包含科学论文的标题。我想将这些数据库合并到一个数据库c
。
a
可能包含b
中没有的标题,反之亦然。- 标题可能同时存在于数据库
a
和b
中。 - 字母和标点符号的大小写可能不匹配:
- “这是一个标题。” VS。 “这是一个标题”
- “这是——又一个——标题。” VS。 “这是另一个标题”
- “基于 bla 的 k 均值算法。” VS“基于bla的k均值算法”
首先,我想到使用 MySQL 内部的 levenstein 距离函数来匹配两个数据库中的相同标题,但查看数百万行,我不知道这是否会表现得足够好。然后我想到了全文搜索来匹配标题,但据我所知,全文搜索与常见单词不匹配,因此匹配在实际上不同的相似标题上表现不佳。
因此,我不需要 100% 的匹配过程。但我希望利率尽可能高。有什么建议吗?
I have two database a
and b
with titles of scientific papers in them. I want to merge those databases to one single database c
.
- It is possible that
a
contains titles which are not inb
and vice versa. - It is possible that a title is in both databases
a
andb
. - It is possible that the cases of the letters and the punctuation do not match:
- "This is a Title." VS. "this is a title"
- "This is - yet another - title." VS. "This is yet another title"
- "The k-mean algorithm based on bla." VS "The k mean Algorithm based on bla"
First I thought of using a levenstein distance function inside of MySQL to match the same titles in both databases, but looking at millions of rows I don't know if this would perform well enough. Then I thought of a fulltext search to match the titles but as far as I know fulltext searches do not match common words so the matchings would not perform well on similar titles which are in fact different.
I do not need a 100 % in the matching procedure as a result. But I want to have the rate as high as possible. Any advice?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
我想到的一个想法是创建一个搜索列,其中包含不带任何标点符号且小写的文本,并进行比较。
如果您使用与 mySQL 相关的其他语言或平台,那么在那里进行规范化可能是最简单的 - 我想不出本地 mySQL 函数可以去除标点符号等。这当然是可能的,但可能只能使用一组极其复杂的
REPLACE()
调用。One idea that comes to mind is to create a search column containing the text without any punctuation and in lower case, and to compare that.
If you are using some other language or platform in connection with mySQL, it might be easiest to do the normalization there - I can't think of a native mySQL function to strip punctuation and such. It's surely possible but maybe only using a hellishly complex set of
REPLACE()
calls.