Heuristic to predict name or company
Problem
We are receiving strings that may represent either a company name or a person's name. We need a heuristic to decide which.
Initial thoughts
Use an XML doc with either a <Commercial>String</Commercial> or a <Personal>String</Personal> node, and score matching strings +1 (sorry, I don't know how to format XML in SO)
Can't just check for proper nouns, e.g. Bob's Company is a company, whereas Bob Compton is a person's name
Need to return a confidence level in some format. I can't think of how to do it as a percentage; all I can think of is adding to an integer whenever a match is found
Possible Commercial markers (all will be converted to lower case): co, co., inc, inc., etc. (plus verbose versions of each)
I can get an English name list online
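The +1 scoring idea above could look roughly like the following. This is only a sketch: the Python sets stand in for the XML lists, and `COMMERCIAL_MARKERS`, `PERSONAL_NAMES`, `tokenize`, and `score` are hypothetical names with placeholder data, not part of any existing code.

```python
# Markers from the question plus their "verbose versions"; extend as needed.
COMMERCIAL_MARKERS = {"co", "inc", "company", "incorporated"}
# Placeholder for a real English name list (hypothetical sample data).
PERSONAL_NAMES = {"bob", "compton", "fred", "mary"}


def tokenize(text):
    """Lowercase, split on whitespace, drop surrounding '.'/',' and a possessive 's."""
    tokens = []
    for raw in text.lower().split():
        token = raw.strip(".,")
        if token.endswith("'s"):
            token = token[:-2]
        if token:
            tokens.append(token)
    return tokens


def score(text):
    """Return (commercial_hits, personal_hits), adding +1 per matching token."""
    tokens = tokenize(text)
    commercial = sum(1 for t in tokens if t in COMMERCIAL_MARKERS)
    personal = sum(1 for t in tokens if t in PERSONAL_NAMES)
    return commercial, personal
```

With this sample data, "Bob's Company" scores one hit on each side, which shows why the name list alone cannot decide the case and why the commercial markers need to be counted separately.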
Question
Has anyone run into this kind of domain problem before? What methods did you use? Is there any flashy way of solving this?
Thank you.
Comments (2)
I haven't done this before, but some other thoughts:
Check for non-proper nouns (e.g. "and", "the", "piping"). In fact, if you have an English dictionary and a names list, any word that is not a name could be a good pointer to a company name.
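A rough sketch of that check, assuming both word lists are already loaded into sets. `DICTIONARY_WORDS`, `KNOWN_NAMES`, and `company_pointers` are made-up names, and the tiny sets are placeholder data only.

```python
# Placeholder data; a real run would load a full dictionary and a names list.
DICTIONARY_WORDS = {"and", "the", "piping", "sons", "mark", "rose"}
KNOWN_NAMES = {"smith", "mark", "rose", "bob", "compton"}


def company_pointers(text):
    """Return the dictionary words in text that are not also known names."""
    words = [w.strip(".,") for w in text.lower().split()]
    return [w for w in words if w in DICTIONARY_WORDS and w not in KNOWN_NAMES]
```

Words such as "mark" or "rose" appear in both lists, so they are deliberately not counted as pointers to a company.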
A big problem is that some companies are simply named after one or more people. "Fred Meyer", "J.C. Penney", and "Lockheed Martin" are examples of companies that look just like human names. There's likely no really good way around this (probably nothing easy anyway). If you can categorize first and last names, a double last name or a last name only might be a good reason to lower the certainty.
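One way to express the "lower the certainty" idea, purely as an illustration. The sets and the function name `person_certainty` are hypothetical, and real first and last name lists overlap heavily (e.g. "Martin"), so the split below is a simplifying assumption.

```python
# Placeholder lists; real data would have many names in both sets.
FIRST_NAMES = {"fred", "bob"}
LAST_NAMES = {"meyer", "compton", "lockheed", "martin"}


def person_certainty(text):
    """Return 0 if any token is not a known name, 2 for a name with a
    first name in it, and 1 when only last names are present."""
    tokens = text.lower().split()
    if not tokens:
        return 0
    if any(t not in FIRST_NAMES and t not in LAST_NAMES for t in tokens):
        return 0
    certainty = 2
    # Double last name or last name only: lower the certainty.
    if not any(t in FIRST_NAMES for t in tokens):
        certainty -= 1
    return certainty
```

Note that "Fred Meyer" still gets the full score here, which matches the point above: a company named exactly like a person cannot be caught by this check alone.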
I would agree with your integer idea. Unless you can do some very broad and very thorough testing, your percentages would probably be meaningless. I would probably run all the tests (returning name, company, or unknown) and compare the results, adding up an integer based on consistency in results.
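A minimal sketch of combining the individual tests this way, assuming each test returns one of the strings "name", "company", or "unknown". The function name `combine` and the choice to use the vote margin as the integer are my own assumptions about what "consistency" could mean.

```python
from collections import Counter


def combine(results):
    """Combine per-test verdicts into (verdict, integer confidence).

    The confidence is the margin between "company" and "name" votes,
    so it grows as the tests agree; "unknown" votes are ignored.
    """
    votes = Counter(r for r in results if r != "unknown")
    company, name = votes["company"], votes["name"]
    if company == name:
        # No votes at all, or an even split: nothing to report.
        return ("unknown", 0)
    verdict = "company" if company > name else "name"
    return (verdict, abs(company - name))
```

For example, `combine(["company", "company", "unknown"])` gives `("company", 2)`, while one "company" and one "name" cancel out to `("unknown", 0)`.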
Can you compare to a database of known company names?
E.g. in the UK: http://wck2.companieshouse.gov.uk
Of course, this doesn't help if the string is actually someone's name and a company happens to share it.
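As a sketch only, a lookup like this could work against a locally stored list of known company names. `KNOWN_COMPANIES`, `normalize`, and `is_known_company` are hypothetical, the set is sample data, and no real registry API is being called here.

```python
import re

# Hypothetical sample; entries are stored in normalized form.
KNOWN_COMPANIES = {"fred meyer", "lockheed martin", "jc penney"}


def normalize(name):
    """Case-fold, drop punctuation, and collapse runs of whitespace."""
    cleaned = re.sub(r"[^\w\s]", "", name.casefold())
    return " ".join(cleaned.split())


def is_known_company(name, registry=KNOWN_COMPANIES):
    """Return True if the normalized name appears in the registry."""
    return normalize(name) in registry
```

A hit only says that a company with that name exists, so it is probably best treated as one more vote rather than a final answer.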