Name-matching dictionary for finding first and last name variants

Posted 2024-11-05 15:07:56

I have an application that stores and tracks visitors. Visitors are created in the system by schedulers (users) as needed when they set up a visit. The problem is that, most of the time, the only meaningful identifiers for a visitor are as follows:

  • First Name
  • Last Name
  • Company Name

The risk of duplicate records for the same person is inherent: a scheduler may enter a new visitor record instead of searching the system for somebody who already exists under that name.

When somebody enters a visitor with the same name as an existing one, I display a warning dialog with suggestions of who this person could be, but even that is not good enough.

I could enter 'Jim Jones' and this person may already exist in the system as 'James Jones' or 'Jimmy Jones'. I see there are name recognition software packages available, but they are expensive and certainly heavier than what I am looking for.

Would anybody know where to find a free or open source dictionary file that I can access programmatically to find potential name variants? Software or an online service would be nice, but even just a data dump or a simple text file might do.

I know even this will not prevent duplicate visitor records. I am just trying to keep them to a minimum, so it is not a critical feature.


Comments (1)

风吹雨成花 2024-11-12 15:07:56

Check out the Moby project (http://icon.shef.ac.uk/Moby/mwords.html) for common first and last names. You can precompute keys for similar names using phonetic algorithms such as Metaphone and Soundex, and use those keys to identify potential matches. You also mention company names, which are a bit harder to manage since they can be made up of lots of things. For those, maybe check out the 12-dicts word list (http://wordlist.sourceforge.net/): the 2+2lemma list provided in that package groups multiple forms that share a common root, which can be used in conjunction with a similar-spelling solution to provide improved results.
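To make the precomputation idea concrete, here is a minimal sketch in Python, assuming the standard American Soundex rules. The `NICKNAMES` table, the `name_key`, `build_index` and `candidates` helpers, and the sample records are all hypothetical and made up for illustration; in practice the nickname table would be loaded from whatever name list you settle on. Note that a phonetic code alone does not link 'Jim' to 'James' (they get different Soundex codes), which is why the sketch maps nicknames to a canonical form first.

```python
from collections import defaultdict

# Standard American Soundex digit groups.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex code, or "" if there are no A-Z letters."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (result + "000")[:4]


# Hypothetical, hand-written nickname table (illustration only).
NICKNAMES = {"jim": "james", "jimmy": "james"}


def name_key(first: str, last: str) -> tuple:
    """Canonicalize the first name, then key on the Soundex of both names."""
    canonical = NICKNAMES.get(first.strip().lower(), first)
    return soundex(canonical), soundex(last)


def build_index(records):
    """Precompute a key -> [(first, last), ...] index over existing visitors."""
    index = defaultdict(list)
    for first, last in records:
        index[name_key(first, last)].append((first, last))
    return index


def candidates(index, first, last):
    """Return existing visitors whose key matches the name being entered."""
    return index.get(name_key(first, last), [])


existing = [("Jimmy", "Jones"), ("James", "Jones"), ("Robert", "Smith")]
index = build_index(existing)
print(candidates(index, "Jim", "Johns"))
```

The index is built once and each lookup is a single dictionary access, so the check can run every time a scheduler types a new visitor name. The matches are only suggestions for the warning dialog, since Soundex will also group names that are not actually variants of each other.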
