SQL Server Express: something better than LIKE for querying similar names
I need a special operator that's maybe a bit better than LIKE to query for "similar" values.
THE SCENARIO:
I have a table of students, and I have a table of lessons. The table of lessons was imported from other software, so the StudentID column is null on the imported rows. So we need the user to manually select the appropriate student row for each lesson, and then the StudentID column can be populated so they're properly synced. Both tables contain first and last names, but a lot of them are very likely to be misspelled.
After importing the rows, I would like to present the user with the names from the student rows where the names are "top five most similar" to the values stored in each lesson row. In fact I'd like to present them in descending order from most-to-least similar.
A query containing the LIKE operator doesn't quite cut it because it requires that specific text exist within the column, and it doesn't return a "similarity score".
It is my understanding (from non-technical articles) that the US Post Office has this issue very well handled... People misspell names and street names all the time, but their algorithm for "find similar" is very effective.
I know the algorithm could vary from one solution to the next. For example, I've read in those articles that some algorithms consider phonetics, others consider the count of vowels and consonants, while others consider that "T" sounds like "P" when spoken over the phone.
I COULD load every record into my app code and write my own algorithm in C#, VB.NET or whatever, but there are lots of problems with that, including performance. I'd rather accomplish this within the query, so I'm looking for alternatives.
I'm using SQL Server Express but I'm sure the solution applies to other database platforms.
Comments (2)
SQL Server supports the SOUNDEX() function, but it only works for similar-sounding names, and not too well at that, at least if you handle non-English text. You could write your own function in C# or VB.NET, implementing whatever algorithm fits your needs, and import it into SQL Server as a scalar function.
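To give a rough idea of what such a custom function might compute, here is a minimal sketch of a Levenshtein (edit distance) based similarity score, with a "top five, most similar first" ranking on top of it. It is written in Python only to keep it short; the same logic could be ported to C# or VB.NET. The function names (`levenshtein`, `similarity`, `top_matches`) and the sample student list are made up for illustration, and edit distance is just one possible choice of algorithm, not necessarily the best one for names.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character inserts, deletes, or substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Score between 0 and 1; 1.0 means identical after trimming and lower-casing."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def top_matches(lesson_name, students, limit=5):
    """Return up to `limit` (score, student_id, name) tuples, most similar first."""
    scored = [(similarity(lesson_name, name), student_id, name)
              for student_id, name in students]
    scored.sort(key=lambda row: row[0], reverse=True)
    return scored[:limit]


# Hypothetical sample data: (StudentID, "First Last")
students = [(1, "Jon Smith"), (2, "Joan Smythe"), (3, "Maria Garcia"), (4, "John Smith")]

for score, student_id, name in top_matches("Jhon Smith", students):
    print(f"{score:.2f}  {student_id}  {name}")
```

Note that plain edit distance counts a swapped pair of letters ("Jhon" vs "John") as two edits, so a variant that handles transpositions, or a phonetic comparison, may rank misspelled names better.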
SQL FreeText might work for you:
http://msdn.microsoft.com/en-us/library/ms176078.aspx
It searches against a thesaurus, although I'm not sure how well it does with names.
It's very easy to implement, however.