Algorithms for recognizing physical addresses on a web page

Posted 2024-07-09 14:35:30

What are the best algorithms for recognizing structured data on an HTML page?

For example, Google recognizes a home or company address in an email and offers a map to that address.

Comments (9)

書生途 2024-07-16 14:35:30

A named-entity extraction framework such as GATE has at least tackled the information extraction problem for locations, assisted by a gazetteer of known places to help resolve common issues. Unless the pages were machine-generated from a common source, you're going to find regular expressions a bit weak for the job.
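To make the gazetteer idea concrete, here is a minimal sketch of a greedy longest-match lookup against a list of known place names. It is not how GATE is implemented; the `GAZETTEER` contents and the `find_places` helper are hypothetical and only illustrate the principle.

```python
import re

# Hypothetical gazetteer; a real one would hold many thousands of entries.
GAZETTEER = {
    ("new", "york"),
    ("york",),
    ("san", "francisco"),
    ("paris",),
}
MAX_LEN = max(len(entry) for entry in GAZETTEER)


def find_places(text):
    """Return (token_index, place_name) pairs using greedy longest match."""
    tokens = re.findall(r"[A-Za-z]+", text)
    lowered = [t.lower() for t in tokens]
    found = []
    i = 0
    while i < len(tokens):
        match_len = 0
        # Try the longest window first so "New York" wins over "York".
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            if tuple(lowered[i:i + n]) in GAZETTEER:
                match_len = n
                break
        if match_len:
            found.append((i, " ".join(tokens[i:i + match_len])))
            i += match_len
        else:
            i += 1
    return found


print(find_places("Our offices are in New York and Paris."))
```

A plain lookup like this cannot tell the city "Paris" from a person named Paris; resolving that kind of ambiguity is where a full NER framework earns its keep.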

老旧海报 2024-07-16 14:35:30

If you have the markup itself, and not just the text from the page, I second the Beautiful Soup suggestion made elsewhere in this thread. In particular, the address tag should provide the lowest of low-hanging fruit. Also look into the adr microformat. I'd only fall back to regexes if the first two didn't pull enough info, or if the page didn't have the markup needed to look for them.
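As a rough illustration of the markup-first approach, the sketch below collects the text of `<address>` elements and of elements carrying the `adr` microformat class. To stay dependency-free it uses the standard library's `html.parser` rather than Beautiful Soup, and it assumes reasonably well-nested markup; the `AddressExtractor` class and `extract_addresses` helper are hypothetical names.

```python
from html.parser import HTMLParser


class AddressExtractor(HTMLParser):
    """Collect text inside <address> elements and elements with class "adr"."""

    # Void elements never get a closing tag, so they must not affect depth.
    VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
            "link", "meta", "source", "track", "wbr"}

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth inside the current address block
        self.parts = []     # text fragments of the current address block
        self.addresses = []

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            if self.depth and tag == "br":
                self.parts.append(", ")  # treat a line break as a separator
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1
        elif tag == "address" or "adr" in classes:
            self.depth = 1
            self.parts = []

    def handle_endtag(self, tag):
        if tag in self.VOID or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            # Collapse runs of whitespace before storing the address text.
            self.addresses.append(" ".join("".join(self.parts).split()))

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def extract_addresses(html):
    parser = AddressExtractor()
    parser.feed(html)
    parser.close()
    return parser.addresses


page = (
    "<p>Visit us:</p>"
    "<address>1 Main St<br>Springfield</address>"
    '<div class="adr"><span class="street-address">2 Oak Ave</span> '
    '<span class="locality">Shelbyville</span></div>'
)
print(extract_addresses(page))
```

The depth counter breaks down on unclosed tags, which is exactly the kind of mess a tolerant parser such as Beautiful Soup is meant to absorb.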

慢慢从新开始 2024-07-16 14:35:30

If you also have to handle international addresses, you're in for a world of headaches; international address formats are amazingly varied.

路弥 2024-07-16 14:35:30

I'd guess that Google takes a two-step approach to the problem (at least that's what I would do). First they use some fairly general search pattern to pick out everything that could be an address, and then they use their map database to look up each string and see whether they get any matches. If they do, it's probably an address; if they don't, it probably isn't. If you can use a map database in your code, that will probably make your life easier.

Unless you can limit the geographic location of the addresses, I'm guessing that it's pretty much impossible to identify a string as an address just by parsing it, simply due to the huge variation of address formats used around the world.
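The two-step idea can be sketched as follows. This is only a guess at the general shape, not Google's actual method: the `CANDIDATE` pattern is deliberately loose, and `KNOWN_STREETS` with its `lookup` function is a hypothetical stand-in for a real map database or geocoding service.

```python
import re

# Step 1: a loose pattern, a house number followed by 1-4 capitalised words.
CANDIDATE = re.compile(r"\b\d{1,5}\s+(?:[A-Z][a-z]+\s?){1,4}")

# Hypothetical stand-in for a map database.
KNOWN_STREETS = {"main street", "oak avenue"}


def lookup(candidate):
    """Pretend geocoder: True if the street part of the candidate is known."""
    street = candidate.split(None, 1)[1].strip().lower()
    return street in KNOWN_STREETS


def find_addresses(text, validate=lookup):
    """Step 1 over-generates candidates; step 2 keeps those the lookup confirms."""
    candidates = [m.group().strip() for m in CANDIDATE.finditer(text)]
    return [c for c in candidates if validate(c)]


text = "Meet me at 12 Main Street tomorrow, not at 2024 Summer Olympics."
print(find_addresses(text))
```

Passing the validator as a parameter keeps the pattern matching separate from the lookup, so the in-memory set could be swapped for a call to whatever geocoder you have access to.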

月棠 2024-07-16 14:35:30

Do not use regular expressions on the raw HTML. Use an existing HTML parser; in Python, for example, I strongly recommend BeautifulSoup. That holds even if you then use a regular expression to parse the contents of the HTML elements BeautifulSoup grabs.

If you do it with your own regexes, you not only have to worry about finding the data you require, you also have to worry about things like invalid HTML and lots of other very non-obvious problems you'll stumble over.

江湖彼岸 2024-07-16 14:35:30

What you're asking is really quite a hard problem if you want to get it perfect. While a simple regexp will get it mostly right most of the time, writing one that gets it exactly right every time is fiendishly hard. There are plenty of strange corner cases, and in several cases there is no single unambiguous answer. Most web sites that I've seen do a pretty bad job handling all but the simplest URLs.

If you want to go down the regexp route, your best bet is probably to check out the source code of
http://metacpan.org/pod/Regexp::Common::URI::http

世界和平 2024-07-16 14:35:30

Again, regular expressions should do the trick.

Because of the wide variety of addresses, you can only guess whether a string is an address by using an expression like "(number), (name) Street|Boulevard|Main", etc.

You could also look into some Firefox extensions that aim to map addresses found in text, to see how they work.
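A pattern along the lines of "(number), (name) Street|Boulevard|..." might look like the sketch below. The list of street types is a small assumed sample biased toward US-style addresses, and the `guess_addresses` helper is a hypothetical name; as noted, a match is only a guess.

```python
import re

# Assumed sample of street-type suffixes; a real list would be much longer.
STREET_TYPES = r"(?:Street|St|Boulevard|Blvd|Avenue|Ave|Road|Rd)"

ADDRESS_GUESS = re.compile(
    r"\b(?P<number>\d{1,5}),?\s+"            # house number, optional comma
    r"(?P<name>(?:[A-Z][\w'-]*\s+){1,3}?)"   # 1-3 capitalised words, lazily
    r"(?P<type>" + STREET_TYPES + r")\b\.?"  # street type, optional period
)


def guess_addresses(text):
    """Return (number, name, type) tuples for strings that look like addresses."""
    return [
        (m.group("number"), m.group("name").strip(), m.group("type"))
        for m in ADDRESS_GUESS.finditer(text)
    ]


sample = "Send it to 221 Baker Street or 1600, Pennsylvania Ave. by Friday."
print(guess_addresses(sample))
```

The pattern misses anything without a recognised suffix and will happily match non-addresses that follow the same shape, so treat its output as candidates rather than answers.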

没有伤那来痛 2024-07-16 14:35:30
  1. It depends on your requirements.

For email and contact details, regex is more than enough.
For addresses, regex alone will not help. Think about NLP (NER) and POS tagging.
For finding people-related information, you can't do anything without NER.

  • If you need information such as paragraphs, get the contents by using the tags.
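For the email case, a simple pattern is usually enough in practice. The sketch below is intentionally loose and does not attempt full RFC 5322 coverage; the `EMAIL` pattern and `find_emails` helper are illustrative names, not taken from any library.

```python
import re

# Local part, "@", then at least two dot-separated domain labels.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def find_emails(text):
    """Return every substring of text that looks like an email address."""
    return EMAIL.findall(text)


print(find_emails("Contact alice@example.com or bob.smith@mail.example.org."))
```

Requiring a character after each dot in the domain keeps a sentence-ending period out of the match.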