Best way to detect first and last names from a text blob

Posted on 2024-12-17 05:10:41

I'm working on a program that does OCR on a US business card and tries to return information like first name, last name, etc. The challenge is how to do that.

So far I've built the following data files:

first_names.txt  (Contains 23k+ first names)
last_names.txt (Contains 86k+ last names)
job_title.txt (Contains 500+ job titles)
us_cities.txt (Contains 10k+ US cities)
states_full.txt (Contains full names of all US states)
states_abv.txt  (Contains all US state abbreviations)

My goal is to tokenize the OCR data on whitespace and award a "weight" to each string based on the likelihood of it being a certain type of data.

For example, a string earlier in the text blob is more likely to be the name, company, or title. Likewise, if a string is found in first_names.txt or last_names.txt, then it will have more weight towards first/last name.
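A minimal Python sketch of this weighting idea (the language is arbitrary here). The tiny name sets, the `score_tokens` function, and the size of the position bonus are all assumptions made for illustration; real code would load first_names.txt and last_names.txt into sets and tune the weights.

```python
from collections import defaultdict

# Hypothetical stand-ins for first_names.txt / last_names.txt.
FIRST_NAMES = {"john", "anne", "marie"}
LAST_NAMES = {"smith", "marie"}


def score_tokens(text):
    """Return a list of (token, {label: weight}) pairs, one per token."""
    tokens = text.split()
    scores = []
    for index, raw in enumerate(tokens):
        token = raw.strip(",.").lower()
        weights = defaultdict(float)
        # Dictionary evidence: one point per list the token appears in.
        if token in FIRST_NAMES:
            weights["first_name"] += 1.0
        if token in LAST_NAMES:
            weights["last_name"] += 1.0
        # Position evidence (assumed form): earlier tokens get up to +0.5,
        # applied only to labels that already have dictionary support.
        if weights:
            bonus = 0.5 * (1 - index / len(tokens))
            for label in weights:
                weights[label] += bonus
        scores.append((raw, dict(weights)))
    return scores
```

Calling `score_tokens("Anne Marie, FL")` shows the problem described next: both "Anne" and "Marie," pick up name weight even though the line is a city and state.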

This approach sounds OK in theory, but I'm wondering about the best way to approach it from a programming perspective. (I'm using PHP, not that the language matters.) The tricky part is that some tokens' weights are relative to other tokens. For example:

If a token seems likely to be a first name, then it is likely that the next token is a last name.
Some tokens are related to each other, but if things are exploded by spaces, I'm not sure how to relate them. For example, "Anne Marie, FL" would be considered three tokens: "Anne", "Marie", and "FL". Worse yet, "Anne" and "Marie" would gain weight towards being a first name. Now, if weight is also awarded based on position, a previous string with first-name weight could win, freeing these strings up to be detected as a city.
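One possible way to relate such tokens, sketched in Python, is to look for a "City, ST" pattern on the whole line before scoring individual tokens, and to try the longest multi-word candidate first. The sample sets and the `find_city_state` helper are hypothetical; they stand in for us_cities.txt and states_abv.txt.

```python
import re

# Hypothetical samples standing in for us_cities.txt and states_abv.txt.
CITIES = {"anne marie", "springfield"}
STATE_ABBREVIATIONS = {"FL", "IL"}

# A comma followed by a two-letter uppercase code, e.g. ", FL".
STATE_AFTER_COMMA = re.compile(r",\s*([A-Z]{2})\b")


def find_city_state(line):
    """Return (city, state) if the line holds a known 'City, ST' pair, else None."""
    match = STATE_AFTER_COMMA.search(line)
    if not match or match.group(1) not in STATE_ABBREVIATIONS:
        return None
    words = line[:match.start()].split()
    # Try the longest run of words before the comma first,
    # so "Anne Marie" is tested before "Marie".
    for start in range(len(words)):
        candidate = " ".join(words[start:])
        if candidate.lower() in CITIES:
            return candidate, match.group(1)
    return None
```

Tokens consumed by a city/state match could then be excluded from name scoring, so "Anne" and "Marie" never compete as first names on that line.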

I know there's a lot of smart people out there, so maybe someone has an idea on this one!

挖鼻大婶 2024-12-24 05:10:41

It's helpful to know the exceptions (e.g. a town named Mary Sue), but end users should be pleased if your software can handle the most likely cases. Names can be sorted by relative frequency of occurrence in each category: personal name, company name, city name. For companies, the number of employees can be used to calculate relative likelihood. For cities, population.

Do you already have rules to check the relative position of the line containing each token?

There are certainly quite a few business card formats, but if you have several hundred sample business cards you should be able to identify some common format rules. Having just a few rules could help immensely. One rule might be "80% of all cards have the address beneath the personal name and company name." Although your sample of business cards may not be truly representative of all possible business cards, all languages, etc., it's a start. Even a few 50% and 80% rules could simplify your task.

You can probably think up several rules using a ridiculous example.

John Smith
Chief Operating Officer
Acme Inc.
123 Main Street
Somewhere, XZ 01010

is more likely than

Somewhere, XZ
01010
John Smith
Acme Inc.
Chief Operating Officer
123 Main Street

That suggests we can consider the Y-position of personal and company names relative to postal codes. Although personal name, job title, and company name may appear in any of several orders, postal codes are likely to be located below company names. Postal codes will also be close to city names, and so on.
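As a rough Python sketch of such ordering rules: given the labels guessed for each line from top to bottom, a layout can be scored by how many "X appears above Y" rules it satisfies. The rule list, the weights, and the `layout_score` function are illustrative assumptions, not measured frequencies.

```python
# (upper_label, lower_label, weight): upper is expected to appear above lower.
# The weights are illustrative guesses, not measured from real cards.
ORDER_RULES = [
    ("name", "title", 0.9),
    ("name", "postal_code", 0.8),
    ("company", "postal_code", 0.8),
]


def layout_score(labels):
    """Add each rule's weight if the labeled lines satisfy it, subtract if not."""
    position = {label: index for index, label in enumerate(labels)}
    score = 0.0
    for upper, lower, weight in ORDER_RULES:
        # Rules only apply when both labels are present on the card.
        if upper in position and lower in position:
            if position[upper] < position[lower]:
                score += weight
            else:
                score -= weight
    return score
```

With these assumed rules, the first example card above (name, title, company, address, postal code) scores higher than the second, where the postal code comes first. When several labelings of a card are plausible, the one with the highest score could be preferred.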

Although a word like "Samantha" could be part of a personal name, a street name, or a company name, it's most likely a personal name. You should be able to find databases that list the relative frequency of birth names, the population of towns with the name "Samantha", and the number of registered corporations with the name "Samantha." Even partial databases would be helpful to establish some reasonable guesstimates of likelihood.
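A small Python sketch of turning such counts into relative likelihoods. The numbers in `COUNTS` are made up purely for illustration, and `category_likelihoods` is a hypothetical helper; real values would come from whatever birth-name, population, and corporate-registry data you can find, and the three kinds of counts may need rescaling before they are comparable.

```python
# Made-up counts for illustration only; not real statistics.
COUNTS = {
    "samantha": {"person": 500_000, "city": 1_200, "company": 300},
}


def category_likelihoods(word):
    """Normalize per-category counts for a word into shares that sum to 1."""
    counts = COUNTS.get(word.lower())
    if not counts:
        return {}  # No data: the caller falls back to other evidence.
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}
```

The resulting shares can be used as the dictionary part of a token's weight instead of a flat "found in the list" bonus.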

Other possible rules:

A mix of letters and digits, 5 to 7 characters long, at the end of a line (for left-to-right text) or on its own line is likely to be a postal code.
"Inc", "Ltd", "Corp" and other abbreviations should increase the likelihood that a line is identified as a company name.
A personal name is likely to be located above a title. (Maybe 85% - 95% of the time?)
Phone numbers follow a somewhat limited number of patterns, and tend to include characters not found in postal codes: "(" ")" "."
Websites follow common patterns. Even if there's someone whose legal name is "CarolGreen.com", she probably wouldn't be surprised if her name was recognized as a website.
The "@" symbol is almost certainly part of an email address. If an email address appears at all, it is likely located on some line beneath the personal name.
Some information may be absent. The card may not list a website. There may be a phone number, but not a street address. The person may not have a title. A personal business card may not have a company name. It's most likely that at least one line will be a personal name.
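Several of these rules can be sketched as regular expressions applied to each line, in Python for illustration. The patterns, their order, and the `classify_line` function are assumptions; the postal-code pattern is narrowed to US ZIP codes because the question is about US cards, and the company suffix list is only a sample.

```python
import re

# Ordered rules: the first pattern that matches decides the label.
# Email comes before website so "john@acme.com" is not taken for a site.
RULES = [
    ("email", re.compile(r"\S+@\S+\.\S+")),
    ("website", re.compile(r"\b(?:https?://|www\.)\S+|\b\S+\.(?:com|net|org)\b", re.I)),
    ("phone", re.compile(r"\(?\d{3}\)?[ .-]?\d{3}[ .-]\d{4}")),
    ("company", re.compile(r"\b(?:Inc|Ltd|Corp|LLC)\b", re.I)),
    # US ZIP or ZIP+4 at the end of the line (assumption: US cards only).
    ("postal_code", re.compile(r"\b\d{5}(?:-\d{4})?$")),
]


def classify_line(line):
    """Return the label of the first matching rule, or "unknown"."""
    stripped = line.strip()
    for label, pattern in RULES:
        if pattern.search(stripped):
            return label
    return "unknown"
```

Lines that come back as "unknown", such as "John Smith" or "123 Main Street", are the ones left for the name, title, and address heuristics. Returning a single label keeps the sketch short; a fuller version would return a weight per label so that these rules combine with dictionary and position evidence.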
