Matching data "fuzzily" based on multiple inputs
I have a search and matching problem:
Inputs
My database holds thousands of company names, along with some other matching characteristics: a few columns of numerical data and a few columns of other text that help identify each specific company.
A prospective client supplies about 500 company names, each with sparsely populated values for the additional characteristics mentioned above.
Current Process
In the past, the process has been manual: for each name given by the client, I search the database for a name "like" the one reported to me, then verify that the additional characteristics match up. The main issue is that the reported names are not identical to the stored ones. They often contain abbreviations or only part of the name stored in my database, and the additional characteristics may be incomplete or only partially matching as well.
Automation
I want to automate this process since it happens frequently. The ideal solution would take one company from the client list, along with whatever additional characteristics were filled in for it, and return the top 5 matches from my database.
I've never used Lucene or Sphinx, but they seem to be more document-driven. Is there a way to format these inputs so that those libraries work for this problem? If not, what other software tools would work?
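To make the desired lookup concrete, here is a minimal sketch in plain Python, assuming the database rows and the client entries are available as dictionaries. It is not a recommendation of a particular tool: the names `name_similarity`, `feature_agreement`, `top_matches`, the `name` key, and the 0.7 weighting are all hypothetical choices for illustration. It scores each row by a standard-library string-similarity ratio on the name plus the fraction of filled-in client features that agree, and returns the best `k` rows.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Similarity ratio in [0, 1] between two names, ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def feature_agreement(query, record):
    """Fraction of the client's filled-in features that equal the record's values.

    Features the client left empty (None) are skipped rather than penalised.
    """
    filled = {k: v for k, v in query.items() if k != "name" and v is not None}
    if not filled:
        return 0.0
    hits = sum(1 for k, v in filled.items() if record.get(k) == v)
    return hits / len(filled)


def top_matches(query, records, k=5, name_weight=0.7):
    """Return the k highest-scoring (score, record) pairs for one client entry.

    name_weight is an assumed tuning knob, not a value taken from the question.
    """
    scored = []
    for rec in records:
        score = (name_weight * name_similarity(query["name"], rec["name"])
                 + (1 - name_weight) * feature_agreement(query, rec))
        scored.append((score, rec))
    # Sort on the score only, so records (dicts) are never compared to each other.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

A linear scan like this is workable for a few thousand rows and 500 client names. Exact equality on the extra features is the simplest possible check, and a real version would probably want tolerance for numeric columns and partial matching for text columns.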
Comments (1)
To Lucene, a 'document' can easily be a row in a table, and I think you will like the fuzzy~ search and search hit scoring capabilities.
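As a rough illustration of the fuzzy~ syntax mentioned above, the following sketch builds a query string in the classic Lucene query-parser syntax from one client entry. It only produces text and does not call Lucene itself. The helper `lucene_fuzzy_query` and the field names `name` and `city` are hypothetical, and lowercasing the terms assumes the index was built with a lowercasing analyzer.

```python
import re


def lucene_fuzzy_query(name, exact_fields=None, name_field="name"):
    """Build a classic Lucene query-parser string for one client entry.

    Each word of the company name becomes a fuzzy term (word~); every
    filled-in extra field becomes a quoted phrase clause. Field names are
    assumptions about how the table rows were indexed.
    """
    # Keep only word characters so query-syntax symbols in names are dropped.
    terms = re.findall(r"\w+", name.lower())
    clauses = []
    if terms:
        fuzzy = " ".join(f"{t}~" for t in terms)
        clauses.append(f"{name_field}:({fuzzy})")
    for field, value in (exact_fields or {}).items():
        if value is None:
            continue  # the client left this feature empty
        # Escape backslashes and double quotes inside the quoted phrase.
        escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
        clauses.append(f'{field}:"{escaped}"')
    return " ".join(clauses)


print(lucene_fuzzy_query("Acme Corp.", {"city": "Boston"}))
# prints: name:(acme~ corp~) city:"Boston"
```

With the query parser's default OR operator, every clause is optional, so rows matching more clauses score higher. Taking the five best-scoring hits then gives the top-5 list asked for in the question.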