Best way to detect first and last names in a blob of text
I'm working on a program that does OCR on a US business card and tries to return information like first name, last name, etc. The challenge is how to do that.
So far I've built the following data files:
first_names.txt (Contains 23k+ first names)
last_names.txt (Contains 86k+ last names)
job_title.txt (Contains 500+ job titles)
us_cities.txt (Contains 10k+ US cities)
states_full.txt (Contains full names of all US states)
states_abv.txt (Contains all US state abbreviations)
My goal is to tokenize the OCR data by spaces and award a "weight" to each string based on how likely it is to be a certain type of data.
For example, a string earlier in the text blob is more likely to be the name, company, or title. Likewise, if a string is found in first_names.txt or last_names.txt, then it will have more weight towards first/last name.
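To make the weighting idea concrete, here is a minimal sketch in Python (the language doesn't matter here). The function name `score_tokens`, the `position_bonus` parameter, and the linear decay are all assumptions for illustration, and the tiny in-memory sets stand in for first_names.txt and last_names.txt.

```python
def score_tokens(tokens, first_names, last_names, position_bonus=0.5):
    """Return a list of (token, weights) pairs for name detection."""
    scored = []
    n = len(tokens)
    for i, tok in enumerate(tokens):
        key = tok.strip(",.").lower()
        # Position bonus: largest for the first token, shrinking linearly.
        early = position_bonus * (1 - i / n)
        weights = {"first_name": 0.0, "last_name": 0.0}
        if key in first_names:
            weights["first_name"] = 1.0 + early
        if key in last_names:
            weights["last_name"] = 1.0 + early
        scored.append((tok, weights))
    return scored


# Hypothetical stand-ins for the real data files.
FIRST = {"john", "anne", "marie"}
LAST = {"smith", "marie"}

scored = score_tokens("John Smith Acme Corp".split(), FIRST, LAST)
for token, weights in scored:
    print(token, weights)
```

Tokens that appear in neither list keep a weight of zero, so position alone never turns an unknown word into a name in this sketch.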
This approach sounds OK in theory, but I'm wondering about the best way to implement it from a programming perspective (PHP, not that the language matters). The tricky part is that some tokens' weights are relative to other tokens. For example:
- If a token seems likely to be a first name, then it is likely that the next token is a last name.
- Some tokens are related to each other, but if the text is split on spaces, I'm not sure how to relate them. For example, "Anne Marie, FL" would be treated as three tokens: "Anne", "Marie", and "FL". Worse yet, "Anne" and "Marie" would both gain weight towards being a first name. Now, if weight is also awarded based on position, an earlier string with first-name weight could win, freeing these strings up to be detected as a city.
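One possible way to relate tokens that were split apart is to look for multi-word phrases before scoring single words. The sketch below is only an illustration in Python: it finds a state abbreviation and then tries the longest run of preceding words against the city list, so "Anne Marie" is tried before "Marie". The function name `find_city_state`, the `max_words` limit, and the sample sets are assumptions, not part of my actual data.

```python
def find_city_state(tokens, cities, state_abbrevs, max_words=3):
    """Return (start, end, city, state) for the first 'City, ST' span, else None.

    start/end are token indexes, with end exclusive.
    """
    clean = [t.strip(",.") for t in tokens]
    for j, tok in enumerate(clean):
        if tok not in state_abbrevs:
            continue
        # Try the longest candidate city first so "Anne Marie" beats "Marie".
        for width in range(min(max_words, j), 0, -1):
            candidate = " ".join(clean[j - width:j]).lower()
            if candidate in cities:
                return (j - width, j + 1, candidate, tok)
    return None


# Hypothetical stand-ins for us_cities.txt and states_abv.txt.
CITIES = {"anne marie", "marie"}
STATES = {"FL", "GA"}

tokens = "Jane Doe 12 Main St Anne Marie, FL".split()
span = find_city_state(tokens, CITIES, STATES)
print(span)
```

Tokens inside the returned span could then be excluded from first/last name scoring, which would keep "Anne" and "Marie" from competing with the real name earlier on the card.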
I know there are a lot of smart people out there, so maybe someone has an idea on this one!
Comments (1)
It's helpful to know the exceptions (e.g. a town named Mary Sue), but end users should be pleased if your software can handle the most likely cases. Names can be sorted by relative frequency of occurrence in each category: personal name, company name, city name. For companies, the number of employees can be used to calculate relative likelihood. For cities, population.
Do you already have rules to check the relative position of the line containing each token?
There are certainly quite a few business card formats, but if you have several hundred sample business cards you should be able to identify some common format rules. Having just a few rules could help immensely. One rule might be "80% of all cards have the address beneath the personal name and company name." Your sample of business cards may not be truly representative of all possible business cards, all languages, etc., but it's a start. Even a few 50% and 80% rules could simplify your task.
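As a rough illustration of how such percentages could be measured, the Python sketch below counts how many hand-labeled sample cards satisfy a rule. The card representation (a dict mapping each field to its line index, counted from the top) and the names `rule_support` and `address_below_names` are assumptions made up for this example, and the five sample cards are invented data.

```python
def rule_support(cards, rule):
    """Fraction of sample cards for which rule(card) is true."""
    if not cards:
        return 0.0
    return sum(1 for card in cards if rule(card)) / len(cards)


def address_below_names(card):
    """True if the address line sits below both the personal and company name."""
    return (card["address"] > card["personal_name"]
            and card["address"] > card["company_name"])


# Invented sample: each value is the line index of that field on the card.
SAMPLE_CARDS = [
    {"personal_name": 0, "company_name": 2, "address": 3},
    {"personal_name": 1, "company_name": 0, "address": 4},
    {"personal_name": 0, "company_name": 1, "address": 2},
    {"personal_name": 2, "company_name": 0, "address": 3},
    {"personal_name": 3, "company_name": 4, "address": 0},  # address on top
]

support = rule_support(SAMPLE_CARDS, address_below_names)
print(f"address below names: {support:.0%}")
```

A measured support value like this could then be used directly as the weight of the rule when scoring a new card.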
You can probably think up several rules using a ridiculous example.
is more likely than
That suggests we can consider the Y-position of personal and company names relative to postal codes. Although personal name, job title, and company name may appear in any of several orders, postal codes are likely to be located below company names. Postal codes will also be close to city names, and so on.
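A positional rule like this might be sketched as follows in Python, assuming the OCR output is already split into lines ordered top to bottom. The regex for US ZIP codes (five digits, optional four-digit extension), the choice to keep the last matching line, and the name `split_at_postal_code` are assumptions for illustration only. A five-digit street number would also match, which is one reason this sketch keeps the last match rather than the first.

```python
import re

# US ZIP code: five digits, optionally followed by a hyphen and four digits.
ZIP_RE = re.compile(r"\b\d{5}(?:-\d{4})?\b")


def split_at_postal_code(lines):
    """Return (lines_above, zip_index); zip_index is None if no ZIP is found."""
    zip_index = None
    for i, line in enumerate(lines):
        if ZIP_RE.search(line):
            zip_index = i  # keep the last matching line
    if zip_index is None:
        return list(lines), None
    return lines[:zip_index], zip_index


card_lines = [
    "Samantha Jones",
    "Vice President",
    "Acme Widgets Inc.",
    "42 Main St",
    "Springfield, IL 62701",
]

above, zip_index = split_at_postal_code(card_lines)
print(zip_index, above)
```

Lines in `above` are the candidates for personal name, job title, and company name. The city is then most likely on the ZIP line itself or the line just before it.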
Although a word like "Samantha" could be part of a personal name, a street name, or a company name, it's most likely a personal name. You should be able to find databases that list the relative frequency of birth names, the population of towns with the name "Samantha", and the number of registered corporations with the name "Samantha". Even partial databases would be helpful to establish some reasonable guesstimates of likelihood.
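Here is a small Python sketch of turning such counts into rough likelihoods. The numbers are invented placeholders, not real statistics, and the function name `category_likelihoods` is hypothetical. Comparing births, populations, and company counts on one scale is itself an assumption, so you would probably need to rescale each source first.

```python
def category_likelihoods(word, counts_by_category):
    """Normalize raw counts for `word` into shares that sum to 1."""
    key = word.lower()
    raw = {cat: counts.get(key, 0) for cat, counts in counts_by_category.items()}
    total = sum(raw.values())
    if total == 0:
        # Unknown word: no evidence for any category.
        return {cat: 0.0 for cat in raw}
    return {cat: n / total for cat, n in raw.items()}


# Invented placeholder counts, for illustration only.
COUNTS = {
    "personal_name": {"samantha": 900},
    "company_name": {"samantha": 50},
    "city_name": {"samantha": 50},
}

likelihoods = category_likelihoods("Samantha", COUNTS)
print(likelihoods)
```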
Other possible rules: