String pattern matching for Limited, Ltd, Incorporated, Inc, etc.
We're doing a LOT of work towards trying to reconcile about 1,000 duplicate manufacturer names and 1,000,000 duplicate part numbers. One thing that has come up is how to "match" things like "Limited" vs. "Ltd." vs. "Ltd".
The purpose is for the application to reconcile these matched items into a standard format. So:
ACME Ltd.
ACME Limited
ACME Ltd
Should all be reconciled into ACME Ltd.
This will also be used to prevent entering additional duplicates in the future.
Any suggestions on how to accomplish this pattern matching in SQL Server? Any known algorithms to find items with mapped equivalencies, etc...?
Thanks!
Eric.
Comments (2)
How about a table that lists what you want in one column and variations in the next?
Then, if you find a match on the second column, you change it to the first. It may take several iterations, as you find other alternatives.
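To make the two-column idea concrete, here is a rough sketch in Python using an in-memory SQLite table as a stand-in for a SQL Server table. The table name `suffix_map`, its columns, the `reconcile` helper, and the choice of "Ltd" and "Inc" as the standard forms are all assumptions made for illustration; the same lookup could be written as a join in T-SQL.

```python
import sqlite3

# Hypothetical mapping rows: (variation, standard form).
# Variations are stored in lower case so the lookup is case-insensitive.
SUFFIX_VARIATIONS = [
    ("limited", "Ltd"),
    ("ltd.", "Ltd"),
    ("ltd", "Ltd"),
    ("incorporated", "Inc"),
    ("inc.", "Inc"),
    ("inc", "Inc"),
]


def reconcile(names, variations):
    """Map each name to its standard form by looking up its last word."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE suffix_map ("
        "variation TEXT PRIMARY KEY, standard TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO suffix_map VALUES (?, ?)", variations)

    result = {}
    for name in names:
        trimmed = name.strip()
        # Split off the last word; 'head' is empty for one-word names.
        head, _, last = trimmed.rpartition(" ")
        row = conn.execute(
            "SELECT standard FROM suffix_map WHERE variation = ?",
            (last.lower(),),
        ).fetchone()
        # Only rewrite when the last word is a known variation
        # and there is something in front of it.
        if row and head:
            result[name] = f"{head.rstrip()} {row[0]}"
        else:
            result[name] = trimmed
    conn.close()
    return result
```

Calling `reconcile(["ACME Ltd.", "ACME Limited", "ACME Ltd"], SUFFIX_VARIATIONS)` maps all three names to "ACME Ltd". When a new variation turns up, it only needs one more row in the table, which is the "several iterations" part.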
Using SQL Server Full Text Search you can use synonyms:
In your case you could add a section to the thesaurus file that treats "Limited", "Ltd." and "Ltd" as synonyms.
Here is a link that goes into more detail on how to modify the thesaurus file. This may work for what you are trying to do...
SQL Server also offers some limited pattern matching by using LIKE. I would recommend looking over the options it offers to see if they will be sufficient for your needs. If LIKE is insufficient you can always look at creating a CLR stored procedure or UDFs that will allow you to use regular expressions. This will allow you to match MUCH more complex patterns...
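As an illustration of what the regular-expression route could look like, here is a small Python sketch; a CLR function would apply the same kind of pattern from .NET. The names `SUFFIX_PATTERNS` and `normalize_suffix`, and the choice of "Ltd" and "Inc" as the standard forms, are assumptions for this example only.

```python
import re

# Hypothetical rules: each pattern matches a company suffix at the end of
# the name, with an optional trailing period, ignoring case.
SUFFIX_PATTERNS = [
    (re.compile(r"\b(?:limited|ltd)\.?$", re.IGNORECASE), "Ltd"),
    (re.compile(r"\b(?:incorporated|inc)\.?$", re.IGNORECASE), "Inc"),
]


def normalize_suffix(name):
    """Collapse whitespace and rewrite a trailing suffix to its standard form."""
    cleaned = " ".join(name.split())
    for pattern, standard in SUFFIX_PATTERNS:
        cleaned, count = pattern.subn(standard, cleaned)
        if count:
            break  # a name ends in at most one suffix
    return cleaned
```

The `\b` word boundary keeps a name such as "Unlimited" from being treated as ending in "Limited", which is the sort of case plain LIKE wildcards handle poorly.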