Detect/parse mailing addresses in text

Posted 2024-10-19 08:05:12

Are there any open source or commercial libraries that can detect mailing addresses in text, the way Apple's Mail app underlines addresses on the Mac/iPhone?

I've been doing a little online research, and the ideas seem to be to use Google, regex, or a full-on NLP package such as Stanford's NLP, which are usually pretty massive. I doubt the iPhone has a 500MB NLP package in there, or connects to Google every time you read an email. That makes me believe there should be an easier way. Too bad UIDataDetectors is not open source.

I know this question has been asked before, but there were no conclusive answers, so here's my try.

Comments (4)

伴随着你 2024-10-26 08:05:12

As for Python, you can try Pyap:
https://pypi.python.org/pypi/pyap

It currently supports US and Canadian addresses.
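
For illustration, here is a minimal sketch of how Pyap might be called. It assumes the package is installed (`pip install pyap`) and uses its `parse` function with `country="US"`; the helper name `find_us_addresses` and the sample sentence are made up for this example, and returning an empty list when the package is missing is just a convenience of the sketch.

```python
def find_us_addresses(text: str) -> list[str]:
    """Return the full text of each US-style address that pyap detects."""
    try:
        import pyap  # third-party package, not part of the standard library
    except ImportError:
        # Assumption for this sketch only: behave as "nothing found"
        # when pyap is not installed instead of raising.
        return []
    # pyap.parse returns a list of Address objects; full_address holds
    # the matched text for each one.
    return [address.full_address for address in pyap.parse(text, country="US")]


sample = (
    "Send the parcel to 225 E. John Carpenter Freeway, "
    "Suite 1500, Irving, Texas 75062 before Friday."
)
for found in find_us_addresses(sample):
    print(found)
```

Each detected address is printed as a plain string; the Address objects also expose parsed components if you need more than the raw text.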

雨落□心尘 2024-10-26 08:05:12

Parsing addresses isn't a science. At my office we have been dealing with address parsing for years, and the problem is that there aren't any rules about what constitutes a valid address. We use the USPS address database for cleaning addresses, which is actually pretty fast and way more accurate than anything we were able to build on our own. It gets us 98% accuracy, whereas before we got about 90% of addresses cleaned.

The bigger problem with address parsing tends to be that people don't enter addresses the same way. The same address might appear in any of the following forms.

128 E Beaumont St
128 East Beaumont Street
128 E Bmt St
128 Beaumont Street
128 Highway 88

The third one looks totally wrong, but people will type that sometimes. Sometimes a street is also a highway. There are a bunch of possibilities. Just try to catch 90% and accept that this is about as good as it gets for address parsing.
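
To make the variation problem concrete, here is a rough sketch of naive normalization applied to the forms listed above. The abbreviation table is a tiny hypothetical one written for this example (real standardization data, such as USPS Publication 28, is far larger), and `normalize_street` is an illustrative helper, not anything from a library.

```python
import re

# Hypothetical, deliberately tiny abbreviation table for illustration.
# Note that "st" is ambiguous in real data (Street vs. Saint).
ABBREVIATIONS = {
    "e": "east", "w": "west", "n": "north", "s": "south",
    "st": "street", "ave": "avenue", "rd": "road", "hwy": "highway",
}


def normalize_street(line: str) -> str:
    """Lowercase, drop punctuation, and expand known abbreviations."""
    tokens = re.findall(r"[a-z0-9]+", line.lower())
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


variants = [
    "128 E Beaumont St",
    "128 East Beaumont Street",
    "128 E Bmt St",
    "128 Beaumont Street",
    "128 Highway 88",
]
for variant in variants:
    print(f"{variant!r:30} -> {normalize_street(variant)!r}")
```

Only the first two forms collapse to the same string. The ad hoc abbreviation, the missing directional, and the highway alias all survive normalization, which is why comparing against an authoritative database tends to beat hand-written rules.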

孤云独去闲 2024-10-26 08:05:12

As Drew mentioned, you can actually get extremely high accuracy by extracting the addresses and then comparing them against the USPS data. Getting a DVD from the USPS yearly will certainly work, but it doesn't account for addresses that change in the meantime. For that, you would want a more up-to-date version. The USPS publishes its updated address data (in a proprietary format) monthly, so that would be a good source of authoritative addresses.

On top of that, using an address validation service (after you extract the address data) will standardize the addresses for you and then check them for deliverability and/or vacancy status. As Drew mentioned, the same address can be written in many different ways that still work. However, the USPS will always use the standardized format.

In order to do what you are looking for programmatically, you'll definitely want an API, although list processing services are also available.

SmartyStreets has a free address validation API called LiveAddress that will standardize, verify, and then validate any US postal address. In the interest of full disclosure, I'm the founder of SmartyStreets.

绝不放开 2024-10-26 08:05:12

Extractiv provides commercial NLP powered by Language Computer Corporation that can parse entities and relations in either uploaded documents or web crawls. The former service uses a REST API. I dropped this URL in, and it extracted 4 of the 5 addresses. Note that having them strung together like that makes them especially difficult to extract.

Search for "address" in this JSON output:
http://rest.extractiv.com/extractiv/?url=https://stackoverflow.com/questions/5099684/detect-parse-mailing-addresses-in-text&output_format=json

One of them:

{
  "id": 11,
  "len": 17,
  "offset": 1557,
  "text": "128 E Beaumont St",
  "type": "ADDRESS"
},

(Note: if you use the HTML output, which is more for demos, it filters out non-sentence content, which is why I showed the JSON instead).
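
As a rough sketch of how output like this could be consumed, the snippet below filters entities by `type` and uses `offset` and `len` to compute each entity's character span. The wrapping `"entities"` key and the second, non-address entry are assumptions made up for this example; only the first entity's fields come from the output shown above.

```python
import json

# Hypothetical payload: the "entities" wrapper and the second entry are
# assumptions; the first entity mirrors the one shown above.
raw = """
{
  "entities": [
    {"id": 11, "len": 17, "offset": 1557,
     "text": "128 E Beaumont St", "type": "ADDRESS"},
    {"id": 12, "len": 4, "offset": 1421,
     "text": "USPS", "type": "ORGANIZATION"}
  ]
}
"""


def address_entities(payload: dict) -> list[dict]:
    """Keep only the entities tagged as addresses."""
    return [e for e in payload.get("entities", []) if e.get("type") == "ADDRESS"]


addresses = address_entities(json.loads(raw))
for entity in addresses:
    # offset/len describe where the entity sits in the source text.
    start = entity["offset"]
    end = start + entity["len"]
    print(entity["text"], (start, end))
```

Keeping the span alongside the text is what lets a client underline the address in place, as the Mail app does in the original question.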

Disclaimer: I work at Extractiv.

Update:
Extractiv is no more.
