Extracting address/contact info from a text block containing a name and address?
I have a block of text that includes a name (possibly a company name), an address, and possibly an email address. I want to extract the street address from it, and preferably the name along with it.
This data is siphoned from multiple sources, so I have no idea what the actual formatting will be. It could be something like this:
Company name, [email protected]
ATTN John Doe
care of Company Name
123 Street St
New York, NY 12345
US
123-456-7890
But any of those lines could be rearranged or missing (the phone number could come first, there might be no ATTN or c/o line, etc.). Also, this could be from any country.
The goal is to a) plug the address into the Google Maps API, and b) create a contact with as much information as possible.
Here is a random idea I had:
- Take any line with an email address (can be found with a regex easily), store the email address and remove the line from further consideration.
- Take any line with a phone number (digits only, and [-+()]), store that number, and remove the line from further consideration.
- Take the last three lines and consider those the street address - plug them into Google Maps and hope for the best.
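The three steps above can be sketched in Python roughly as follows. This is only an illustration of the heuristic as described, not a robust parser: the function names `split_contact_block` and `geocode_url`, both regexes, the seven-digit threshold for phone lines, and the sample email address are all assumptions made for the example. The URL is the Google Geocoding API endpoint with its `address` and `key` query parameters; the sketch only builds the request URL and does not send it.

```python
import re
from urllib.parse import urlencode

# Deliberately loose patterns (assumptions for this sketch, not validated formats).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# A "phone line" consists only of digits, whitespace and - + ( ) . characters.
PHONE_LINE_RE = re.compile(r"^[\d\s()+.\-]+$")


def split_contact_block(text):
    """Apply the three-step heuristic: emails, then phones, then last three lines."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    emails, phones, rest = [], [], []
    for line in lines:
        found = EMAIL_RE.findall(line)
        if found:
            # Step 1: store the email(s) and drop the whole line.
            emails.extend(found)
            continue
        # Step 2: require at least 7 digits so a bare postal code is not
        # mistaken for a phone number (the threshold is an arbitrary assumption).
        if PHONE_LINE_RE.match(line) and sum(c.isdigit() for c in line) >= 7:
            phones.append(line)
            continue
        rest.append(line)
    # Step 3: treat the last three remaining lines as the street address.
    return {
        "emails": emails,
        "phones": phones,
        "address_lines": rest[-3:],
        "other_lines": rest[:-3],
    }


def geocode_url(address_lines, api_key):
    """Build (but do not send) a Geocoding API request URL for the address."""
    query = urlencode({"address": ", ".join(address_lines), "key": api_key})
    return "https://maps.googleapis.com/maps/api/geocode/json?" + query


sample = """Company name, info@example.com
ATTN John Doe
care of Company Name
123 Street St
New York, NY 12345
US
123-456-7890"""

parsed = split_contact_block(sample)
print(parsed["address_lines"])
print(geocode_url(parsed["address_lines"], "YOUR_KEY"))
```

Note that dropping the whole email line also discards the company name that shares it in this sample, and that the fixed three-line window fails whenever an address is shorter or longer than that, which is exactly the fragility the question is asking about.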
Obviously, that's a lot of juju magic. Is there a smarter approach? Are there any libraries that have good regexes for finding street addresses from different countries?
Comments (1)
It depends on your source. If you have control over how the data arrives from the source, you can impose some formatting there.