General Address Parser for Freeform Text

Posted 2024-07-24 09:49:37

We have a program that displays map data (think Google Maps, but with much more interactivity and custom layers for our clients).

We allow navigation via a set of combo boxes, each of which populates the next (e.g., choose Country: Canada, and the Province field is filled in. Select Ontario, and a list of counties/regions is filled in. Select a county/region, and the list of cities is filled in, and so on).

While this guarantees accurate addresses, it's a pain for users who don't know where a street address or a city is located (e.g., which county/region is Kitchener in?).

So we are looking at trying to do an address parser with a freeform text field.

The user could enter something like this (similar to Google Maps, Bing Maps, etc...):
22 Main St, Kitchener, On

And we could compartmentalize it into sections and do lookups on the data and get to the point they are looking for (or suggest alternatives).

The problem is how to compartmentalize the information properly. How do we break the input into sections and find possible matches? Obviously, we can't count on users entering data in the format we expect. A follow-up question is how to present the results if we don't find an exact match, or if we find multiple exact matches (two cities with the same street name in different counties, for example).

We have a ton of data available in the mapping data (mostly MapInfo TAB format), so we can do quick scans of street names, cities, states, etc. But I'm not sure of the best way to approach this problem. Sure, using Google Maps would be nice, but most of our clients are on closed networks where outside access is usually not allowed, and most aren't willing to rely on Google Maps (since it doesn't contain as much information as they need, such as custom map layers). They could, obviously, go to Google, get the proper location, and then move to our software, but this would be time-consuming, and the speed of the process can be quite important.

缘字诀 2024-07-31 09:49:37

This is essentially an instance of the Named Entity Recognition (NER) problem. See NER on Wikipedia.

The best way to approach this is to parse the address with a language transducer that identifies the various constructs; the approach is similar to using regular expressions with a finite state machine.

I've had great success with GATE, a Java NLP and machine learning framework; its transducer library is called JAPE. Try it out in their GUI first, then write some Java code around it!

Their built-in examples should get you started with the basics, and you can then extend them as needed. Essentially, it compartmentalizes text into components using the rules and the rule engine, so something like

Xyz, Blah St,
Foo City, 11110, CA

would be translated to,

Place: Xyz
Street: Blah St
City: Foo
...

And then you can use your database of locations to do matches.
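
To make the compartmentalizing step concrete, here is a minimal sketch in plain Python. It is not JAPE and does not use any GATE API; it only imitates the idea of rules that label chunks of text. The function name `parse_address`, the suffix list, the region list and the labels are all assumptions made up for illustration, and a real rule set would need far more patterns.

```python
import re

# Hypothetical, deliberately tiny rule set (assumed for illustration only).
STREET_SUFFIXES = {"st", "street", "ave", "avenue", "rd", "road", "blvd", "dr", "drive"}
REGIONS = {"on", "qc", "bc", "ab", "ca", "ny"}  # a few province/state abbreviations
# US ZIP (12345 or 12345-6789) or Canadian postal code (N0B 1E1 / N0B1E1).
POSTAL = re.compile(r"^(\d{5}(-\d{4})?|[A-Za-z]\d[A-Za-z] ?\d[A-Za-z]\d)$")
NUMBERED_STREET = re.compile(r"^(\d+[A-Za-z]?)\s+(.+)$")


def parse_address(text):
    """Label the comma/newline-separated chunks of a free-form address."""
    chunks = [c.strip() for c in re.split(r"[,\n]+", text) if c.strip()]
    result = {}
    for chunk in chunks:
        last_word = chunk.split()[-1].lower().rstrip(".")
        if POSTAL.match(chunk):
            result["postal"] = chunk
        elif chunk.lower() in REGIONS:
            result["region"] = chunk.upper()
        elif last_word in STREET_SUFFIXES:
            m = NUMBERED_STREET.match(chunk)
            if m:
                # Leading civic number, e.g. "22 Main St".
                result["number"], result["street"] = m.group(1), m.group(2)
            else:
                result["street"] = chunk
        elif "street" in result and "city" not in result:
            # The first unlabelled chunk after the street is assumed to be the city.
            result["city"] = re.sub(r"\s+city$", "", chunk, flags=re.IGNORECASE)
        else:
            # Anything else before the street is kept as a place name.
            result.setdefault("place", chunk)
    return result


if __name__ == "__main__":
    print(parse_address("22 Main St, Kitchener, On"))
    print(parse_address("Xyz, Blah St,\nFoo City, 11110, CA"))
```

The rules here are order-dependent and easy to fool (a city whose name matches a region abbreviation, for instance), which is exactly why a rule engine with dictionary support is attractive for real data.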

Besides rules, GATE also supports dictionary (gazetteer) lookups that JAPE rules can match against. So if you already have "Blah St" in your database and it has two parents, city Foo and city Bar, you just disambiguate by parsing the next line.
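
The dictionary-based disambiguation can be sketched like this, again in plain Python rather than GATE. The `GAZETTEER` contents and the function names are hypothetical; in practice the lookup table would be built from your own street and city tables. The three outcomes also map onto the question of what to show when there is no exact match or more than one.

```python
# Hypothetical gazetteer: street name -> list of (city, region) parents.
# Assumed data for illustration; a real one would come from the map tables.
GAZETTEER = {
    "main st": [("Kitchener", "ON"), ("Cambridge", "ON"), ("Foo", "CA")],
    "blah st": [("Foo", "CA"), ("Bar", "CA")],
}


def candidates(street, city=None, region=None):
    """Return the parents of `street` that agree with whatever else was parsed."""
    matches = GAZETTEER.get(street.lower(), [])
    if city:
        matches = [m for m in matches if m[0].lower() == city.lower()]
    if region:
        matches = [m for m in matches if m[1].lower() == region.lower()]
    return matches


def resolve(street, city=None, region=None):
    """Classify the lookup as 'exact', 'ambiguous' or 'none'."""
    found = candidates(street, city, region)
    if len(found) == 1:
        return "exact", found
    if found:
        return "ambiguous", found  # offer these to the user as a pick list
    return "none", []


if __name__ == "__main__":
    print(resolve("Blah St"))              # street alone is not enough
    print(resolve("Blah St", city="Foo"))  # the next line settles it
```

An 'ambiguous' result can go straight into a suggestion list, and a 'none' result is the place to fall back on fuzzy matching or on the existing combo boxes.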

Edit: GATE includes a tool called ANNIE, an information extraction system that you can experiment with to identify addresses. It uses some built-in JAPE rules that you can build upon.

静谧幽蓝 2024-07-31 09:49:37

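By the way, have you seen the new API endpoint that SmartyStreets is experimenting with? It extracts addresses from text, validates them, and breaks them into components.

See another Stack Overflow post for more detail. I work at SmartyStreets and helped develop it, so I can tell you this is a very hard problem, even though it looks simple on the surface.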

贱贱哒 2024-07-31 09:49:37

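Simson Garfinkel made a nifty address book for NeXTstep (later compiled and updated for Mac OS X and submitted to an Apple Design competition). It has since been made open source and is available from his website below:

http://simson.net/ref/sbook5/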

雪若未夕 2024-07-31 09:49:37

Geocoder.ca cleans up, standardizes and geocodes location address strings. It also appends the postal code, time zone and area code.

For example:
https://geocoder.ca/22%20Main%20St,%20Kitchener,%20On?geoit=xml

<geodata>
    <latt>43.286272</latt>
    <longt>-80.445823</longt>
    <postal>N0B1E1</postal>
    <Dissemination_Area>
        <dauid>35300802</dauid>
        <adauid>35300042</adauid>
    </Dissemination_Area>
    <AreaCode>226,519</AreaCode>
    <TimeZone>America/Toronto</TimeZone>
    <standard>
        <stnumber>22</stnumber>
        <staddress>Main ST</staddress>
        <city>Kitchener</city>
        <prov>ON</prov>
        <confidence>0.7</confidence>
    </standard>
</geodata>
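
If you want to consume a response like this, a minimal sketch using Python's standard `xml.etree.ElementTree` could look as follows. It works on a saved copy of the sample response above and makes no network call (a closed network could not reach the hosted service anyway). The function name `parse_geodata` and the output keys are assumptions; the element names are taken only from the sample, so treat them as unverified for other responses.

```python
import xml.etree.ElementTree as ET

# Saved copy of the sample response shown above.
SAMPLE = """<geodata>
<latt>43.286272</latt>
<longt>-80.445823</longt>
<postal>N0B1E1</postal>
<Dissemination_Area><dauid>35300802</dauid><adauid>35300042</adauid></Dissemination_Area>
<AreaCode>226,519</AreaCode>
<TimeZone>America/Toronto</TimeZone>
<standard>
<stnumber>22</stnumber><staddress>Main ST</staddress><city>Kitchener</city><prov>ON</prov><confidence>0.7</confidence></standard>
</geodata>"""


def parse_geodata(xml_text):
    """Pull the coordinates and the standardized address out of one response."""
    root = ET.fromstring(xml_text)
    std = root.find("standard")  # standardized address components
    return {
        "lat": float(root.findtext("latt")),
        "lon": float(root.findtext("longt")),
        "postal": root.findtext("postal"),
        "number": std.findtext("stnumber"),
        "street": std.findtext("staddress"),
        "city": std.findtext("city"),
        "province": std.findtext("prov"),
        "confidence": float(std.findtext("confidence")),
    }


if __name__ == "__main__":
    print(parse_geodata(SAMPLE))
```

The `confidence` value is worth keeping: a low score is a natural trigger for showing the user alternatives instead of jumping straight to the point.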
