How do I sanitize street numbers out of postal addresses using Java?
To ensure data privacy, I have to publish a list of addresses after removing the street numbers.
So, for example:
1600 Amphitheatre Parkway, Mountain View, CA
needs to be published as
Amphitheatre Parkway, Mountain View, CA
What's the best way to do this in Java? Does this require regex?
Comments (4)
EDIT: How about...
or JavaScript...
My original suggestion was (JavaScript)...
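For comparison, a minimal Java sketch of the regex route might look like the following. The class name, the pattern, and the number formats it accepts are assumptions for illustration, not the code from this answer. It only strips a house number that leads the string, so it does nothing for inputs such as "Apt 4, 1600 Amphitheatre Parkway", and it will wrongly strip a leading number that is part of the street name.

```java
import java.util.regex.Pattern;

public class StreetNumberStripper {
    // Hypothetical pattern: a leading run of digits, an optional single
    // letter suffix ("221B"), an optional range ("12-14"), then whitespace.
    // Ordinals such as "5th Avenue" do not match, because the letters that
    // follow the digits are not followed directly by whitespace.
    private static final Pattern LEADING_NUMBER =
            Pattern.compile("^\\s*\\d+[A-Za-z]?(?:\\s*-\\s*\\d+[A-Za-z]?)?\\s+");

    /** Returns the address with a leading house number removed, if one is found. */
    public static String strip(String address) {
        return LEADING_NUMBER.matcher(address).replaceFirst("");
    }

    public static void main(String[] args) {
        System.out.println(strip("1600 Amphitheatre Parkway, Mountain View, CA"));
    }
}
```

Addresses that do not start with a number pass through unchanged, which is the safer failure mode for a list that has to be reviewed before publication.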
This is a technically difficult problem to solve. But I don't think that matters.
You say you want to strip out the street number from the address to ensure data privacy. How in the world do you think that ensures privacy? I mean, it might give a little privacy to those who live on a street with a few thousand homes, but on a medium street it narrows it down to a few hundred people; on a small street there are maybe a few choices and on some rural roads it may tell you exactly which house the address corresponds to.
This is not sanitization.
The problem is then compounded greatly if you are associating any other data with that address.
One possibility is to use a CASS system, which typically parses the address and returns the result as XML. Then you can easily grab the street name, city, and state, ignoring the street number.
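As a rough sketch of that idea in Java, assuming the service returns a flat XML document: the element names used here (`address`, `number`, `street`, `city`, `state`) are hypothetical, and a real CASS vendor's response schema will differ, so check its documentation before relying on any of them.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class ParsedAddressXml {
    // Reads the text of the first child element with the given tag name.
    private static String text(Element parent, String tag) {
        return parent.getElementsByTagName(tag).item(0).getTextContent().trim();
    }

    /** Rebuilds a publishable address from parsed XML, never reading the number element. */
    public static String withoutNumber(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element root = doc.getDocumentElement();
            // Hypothetical element names; only these three are copied out.
            return text(root, "street") + ", " + text(root, "city") + ", " + text(root, "state");
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not read parsed address", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<address><number>1600</number><street>Amphitheatre Parkway</street>"
                + "<city>Mountain View</city><state>CA</state></address>";
        System.out.println(withoutNumber(xml));
    }
}
```

Because the output is assembled from named fields rather than by cutting characters out of the original string, a number that appears in an unexpected position never reaches the published list.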
Natchy, I work for an address verification company called SmartyStreets, and parsing street addresses is our area of expertise. I'll reinforce what pkananen and Mark have said: this is far beyond the capabilities of regular expressions, and in any case, data privacy aside, your current approach is less effective than others.
The USPS authorizes certain vendors of address parsers to use its official data and return certified results, specifically "CASS-Certified" ones. CASS is usually associated with mailings, but it extends well into the realm of what you need to do. There are APIs (for point-of-entry use) and batch services (such as uploading a list) that will validate and componentize an address.
When an address is broken into components, it's very easy to use only the pieces you actually need. You'll also verify that the address exists, is complete, accurate, and will serve your purposes.
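To illustrate picking pieces from a componentized address, here is a small Java sketch that publishes only an allow-list of fields. The component keys (`number`, `street`, `city`, `state`) and the class name are assumptions made for this example and do not come from any particular vendor's API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class AddressPublisher {
    // Allow-list: only these hypothetical component keys are ever published.
    private static final List<String> PUBLISHED_FIELDS =
            Arrays.asList("street", "city", "state");

    /** Joins the allowed, non-blank components with ", " in allow-list order. */
    public static String publish(Map<String, String> components) {
        List<String> parts = new ArrayList<>();
        for (String field : PUBLISHED_FIELDS) {
            String value = components.get(field);
            if (value != null && !value.trim().isEmpty()) {
                parts.add(value.trim());
            }
        }
        return String.join(", ", parts);
    }

    public static void main(String[] args) {
        Map<String, String> components = new LinkedHashMap<>();
        components.put("number", "1600");
        components.put("street", "Amphitheatre Parkway");
        components.put("city", "Mountain View");
        components.put("state", "CA");
        System.out.println(publish(components));
    }
}
```

An allow-list is used rather than a deny-list so that any extra component a parser returns, such as a unit or suite number, stays out of the published output by default.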
For example, on LiveAddress' API page (which you can use as a springboard for your own research), you can see how it works and, from the docs, see that you can pick and choose which pieces of the addresses you want to display or store. (Funny thing! Our default sample address on that page is also Google's address in Mountain View, CA.)
If you have any further questions about parsing addresses, I'll be happy to personally help you.