智能数据库搜索

发布于 2024-07-08 16:44:12 字数 568 浏览 14 评论 0原文

问题是有一个包含大约 20k 客户记录的数据库,我想尽最大努力避免重复条目。 数据库是Microsoft SQL Server 2005,维护该数据库的应用程序是Microsoft Dynamics/SL。 我正在创建一个与该数据库交互的 ASP.NET Web 服务。 我的服务可以将客户记录插入数据库、从中读取记录或修改这些记录。 无论是在我的网络服务中,还是通过 MS Dynamics,或者在 Sql Server 中,我想在用户确认添加新记录之前给出可能的匹配列表。

所以用户会提交一条记录,如果它看起来是唯一的,该记录将保存并返回一个新的ID。 如果存在可能的重复,用户可以重新提交并确认:“是的,我看到可能的重复,这是一条新记录,我想提交它”。

如果只是标点符号或空格,这很容易(例如,如果您输入“Company, Inc.”并且数据库中有“Company Inc.”,但是如果有细微的变化(例如“Company Corp.”)怎么办? ”而不是“Company Inc”,或者如果存在拼写错误,例如“Cmpany, Inc”。是否有可能返回列表中类似的记录?如果绝对不可能,我将处理我的问题如果由于重复而需要合并记录,它只会导致以后做更多的工作。

The issue is there is a database with around 20k customer records and I want to make a best effort to avoid duplicate entries. The database is Microsoft SQL Server 2005, the application that maintains that database is Microsoft Dynamics/SL. I am creating an ASP.NET webservice that interacts with that database. My service can insert customer records into the database, read records from it, or modify those records. Either in my webservice, or through MS Dynamics, or in Sql Server, I would like to give a list of possible matches before a user confirms a new record add.

So the user would submit a record, if it seems to be unique, the record will save and return a new ID. If there are possible duplications, the user can then resubmit with a confirmation saying, "yes, I see the possible duplicates, this is a new record, and I want to submit it".

This is easy if it is just a punctuation or space thing (such as if you are entering "Company, Inc." and there is a "Company Inc" in the database, But what if there is slight changes such as "Company Corp." instead of "Company Inc" or if there is a fat fingered misspelling, such as "Cmpany, Inc." Is it even possible to return records like that in the list? If it's absolutely not possible, I'll deal with what I have. It just causes more work later on, if records need to be merged due to duplications.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

雄赳赳气昂昂 2024-07-15 16:44:12

哪种算法最适合的具体细节你很大程度上取决于你的领域,所以我建议尝试一些不同的领域 - 你甚至可能需要结合一些以获得最佳结果。 缩写,尤其是特定领域的缩写,可能也需要进行预处理或标准化。

对于名称,您可能最好使用 语音算法 - 该算法考虑了发音。 这些将使史密斯和施密特得分接近,因为他们在说这些话时很容易混淆。 Double Metaphone 是一个不错的首选。

对于胖指法,您可能最好使用 编辑距离 算法 - 它给出了2个词之间的“差异”。 这些将使史密斯和斯莫斯的分数接近——尽管两者可能会被拼音搜索漏掉。

T-SQL 有 SOUNDEX 和 DIFFERENCE - 但它们相当差。 Levenshtein 变体是规范的选择,但还有其他不错的选择 - 其中大多数都相当简单如果您找不到适当许可的实现,请在 C# 中实现

所有这些都将比 T-SQL 更容易从 C# 编码/使用(尽管我确实发现 严重滥用 T-SQL 中的双变音位,但可能在 SQL 中起作用)。

虽然这个 示例位于 Access 中(而且我从未真正查看过代码,或使用实现)所包含的演示文稿很好地了解了您最终可能需要做什么。 该代码可能值得一看,也许是从 VBA 移植的。

The specifics of which algorithm will work best for you depends greatly on your domain, so I'd suggest experimenting with a few different ones - you may even need to combine a few to get optimal results. Abbreviations, especially domain specific ones, may need to be preprocessed or standardized as well.

For the names, you'd probably be best off with a phonetic algorithm - which takes into account pronunciation. These will score Smith and Schmidt close together, as they are easy to confuse when saying the words. Double Metaphone is a good first choice.

For fat fingering, you'd probably be better off with an edit distance algorithm - which gives a "difference" between 2 words. These would score Smith and Smoth close together - even though the 2 may slip through the phonetic search.

T-SQL has SOUNDEX and DIFFERENCE - but they are pretty poor. A Levenshtein variant is the canonical choice, but there's other good choices - most of which are fairly easy to implement in C#, if you can't find a suitably licensed implementation.

All of these are going to be much easier to code/use from C# than T-SQL (though I did find double metaphone in a horrendous abuse of T-SQL that may work in SQL).

Though this example is in Access (and I've never actually looked at the code, or used the implementation) the included presentation gives a fairly good idea of what you'll probably end up needing to do. The code is probably worth a look, and perhaps a port from VBA.

遗弃M 2024-07-15 16:44:12

查看 SQL Server 中的 SOUNDEXing。 我相信它会给您带来您正在寻找的可能匹配的模糊性。

SOUNDEX @ MSDN

SOUNDEX @ 维基百科

Look into SOUNDEXing within SQL Server. I believe it will give you the fuzziness of probable matches that you're looking for.

SOUNDEX @ MSDN

SOUNDEX @ Wikipedia

听,心雨的声音 2024-07-15 16:44:12

如果可以将 Lucene.NET 集成到您的解决方案中,您一定要尝试一下。

If it's possible to integrate Lucene.NET into your solutionm you should definetly try it out.

网白 2024-07-15 16:44:12

您可以尝试使用带有 FreeText(或 FreeTextTable)函数的全文搜索来尝试查找可能的匹配项。

You could try using Full Text Search with FreeText (or FreeTextTable) functions to try to find possible matches.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文