Using regular expressions to get an address out of a paragraph

Posted on 2024-12-08 01:50:23

Alright, this one's a bit of a pain. I'm doing some scraping with Python, trying to get an address out of a few lines of poorly tagged HTML. Here's a sample of the format:

256-555-5555<br/>
1234 Fake Ave S<br/>
Gotham (Lower Ward)<br/>

I'd like to retrieve only 1234 Fake Ave S, Gotham. Any ideas? I've been writing regexes all night and now my brain is mush...

Edit:
More detail about the possible ways the data can arrive: sometimes the first line will be there, sometimes not. All of the addresses I have seen contain Ave, Way, or St, although I would prefer not to use that as a factor in the selection, as I am not certain that will always be the case. The second and third lines are always present.

What I had in mind was something that

  1. Selects everything on the second-to-last line (the second line if there are three lines, or the first line if there are only two because there is no phone number).
  2. Selects everything on the last line that isn't in parentheses.
  3. Combines the second-to-last line and the last line, adding a ", " between the two.
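
The three steps above can be sketched roughly as follows. This is only a minimal illustration, assuming the fragment always uses `<br>`-style separators and that the address occupies the last two non-empty lines; `extract_address` is a hypothetical helper name, not part of Scrapy.

```python
import re

def extract_address(fragment):
    """Hypothetical helper: return 'street, city' from a <br/>-separated fragment."""
    # Split on <br>, <br/> or <br /> and drop the empty pieces.
    lines = [part.strip() for part in re.split(r'<br\s*/?>', fragment)]
    lines = [line for line in lines if line]
    if len(lines) < 2:
        return None  # not enough lines to hold a street and a city

    # Step 1: the second-to-last line is the street, whether or not
    # a phone line precedes it.
    street, city = lines[-2], lines[-1]
    # Step 2: drop anything in parentheses from the last line.
    city = re.sub(r'\s*\(.*?\)', '', city).strip()
    # Step 3: join the two parts with ", ".
    return f"{street}, {city}"

sample = "256-555-5555<br/>\n1234 Fake Ave S<br/>\nGotham (Lower Ward)<br/>\n"
print(extract_address(sample))
```

Because the lines are counted from the end, the same function handles fragments with and without the leading phone line.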

I'm using Scrapy to acquire the HTML code. The address is all in the same div, and I want to use regex to break the data up further into the appropriate sections. How to do that is what I'm unable to figure out.

Edit2:

As per Ofir's comment, I should mention that I have already made expressions to isolate the phone number and parentheses section.

Phone (or possible email or website):

((1[-. ])?[0-9]{3}[-. ])?\(?([0-9]{3}[-. ][A?([0-9]{4})|([\w\.-]+@[\w\.-]+)|(www.+)|([\w\.-]*(?:com|net|org|us))

parentheses:

\((.*?)\)

I'm not sure how to use those to construct an everything-but-these statement.

Comments (3)

孤独患者 2024-12-15 01:50:23

In your case it may be easier to focus on what you don't want:

  • HTML tags (<br/>)
  • phone numbers
  • everything in parentheses

Each of these can be matched with a simple regular expression, which makes it easy to construct one that matches the rest (presumably the address).
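
One way to read this suggestion is to delete the unwanted pieces and keep whatever remains. The sketch below assumes that reading; `strip_unwanted` is a hypothetical name, and the phone pattern is a deliberately simplified stand-in for the longer expression in the question, so it would also remove any digit group in an address that happens to look like a phone number.

```python
import re

# Simplified, illustrative patterns (assumptions, not the asker's exact regexes).
TAG = re.compile(r'<br\s*/?>')
PHONE = re.compile(r'(?:1[-. ])?(?:[0-9]{3}[-. ])?[0-9]{3}[-. ][0-9]{4}')
PARENS = re.compile(r'\(.*?\)')

def strip_unwanted(fragment):
    """Hypothetical helper: remove parentheses and phone numbers, join the rest."""
    text = PARENS.sub('', fragment)   # drop "(Lower Ward)" and similar
    text = PHONE.sub('', text)        # drop the phone number, if any
    # Use the <br/> tags as separators, then discard the empty pieces.
    parts = [part.strip() for part in TAG.split(text)]
    return ', '.join(part for part in parts if part)

sample = "256-555-5555<br/>\n1234 Fake Ave S<br/>\nGotham (Lower Ward)<br/>\n"
print(strip_unwanted(sample))
```

Emails and websites on the first line would need their own patterns removed in the same way.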

日裸衫吸 2024-12-15 01:50:23

This attempts to isolate the last two lines out of the string:

>>> import re
>>> s = """256-555-5555<br/>
... 1234 Fake Ave S<br/>
... Gotham (Lower Ward)<br/>
... """
>>> m = re.search(r'((?!<br/>).*)<br/>\n((?!<br/>).*)<br/>$', s)
>>> print(m.group(1))
1234 Fake Ave S

Trimming the parentheses is probably best left to a separate line of code, rather than complicating the regular expression further.
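
A possible version of that separate step is sketched below. It uses a slightly simplified pattern for the last two lines and assumes the parenthesised note only ever appears on the final line; the variable names are illustrative.

```python
import re

s = """256-555-5555<br/>
1234 Fake Ave S<br/>
Gotham (Lower Ward)<br/>
"""

# Capture the last two <br/>-terminated lines of the string.
m = re.search(r'(.*)<br/>\n(.*)<br/>$', s)
street, city = m.group(1), m.group(2)

# Separate step: trim the parenthesised part from the last line.
city = re.sub(r'\s*\([^)]*\)', '', city).strip()

address = f"{street}, {city}"
print(address)
```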

淤浪 2024-12-15 01:50:23

As far as I understand your problem, I think you are going about it the wrong way.

Regexes are not a magic tool that can extract pertinent data from a jumble of undifferentiated text. A regex can only extract data from text that has, alongside its variable parts, a minimum of stable structure acting as anchors relative to which the variable parts can be located.

In your processing, it seems that you first isolated the part containing a possible phone number followed by an address on one or two lines. But in doing so you lost information: what comes before and after is anchoring information, and you shouldn't try to find something in what remains after that information has been eliminated.

Moreover, I presume that you don't only want to capture a phone number and an address: you may want to extract other pieces of information lying before and after this section. With a well-formed regex, you could capture all the pieces in one shot.
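
As a rough illustration of capturing several pieces in one pass, the sketch below uses only the `<br/>` separators and the end of the fragment as anchors, since the surrounding page markup is not shown in the question and none is assumed here. The group names (`phone`, `street`, `city`, `note`) and the name `RECORD` are hypothetical.

```python
import re

RECORD = re.compile(
    r'(?:(?P<phone>[^<\n]+)<br/>\s*)?'      # optional first line (phone, email or website)
    r'(?P<street>[^<\n]+)<br/>\s*'          # second-to-last line
    r'(?P<city>[^<(\n]+?)\s*'               # last line, up to any parenthesis
    r'(?:\((?P<note>[^)]*)\))?\s*<br/>\s*$' # optional "(...)" note, then the final <br/>
)

three_lines = "256-555-5555<br/>\n1234 Fake Ave S<br/>\nGotham (Lower Ward)<br/>\n"
two_lines = "1234 Fake Ave S<br/>\nGotham (Lower Ward)<br/>\n"

for fragment in (three_lines, two_lines):
    m = RECORD.search(fragment)
    # phone is None when the first line is missing
    print(m.group('phone'), '|', m.group('street'), '|', m.group('city'), '|', m.group('note'))
```

With more of the surrounding HTML available, the same idea could be extended with further named groups anchored on that markup.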

So please post more of the text, with enough characters before and after this limited section to allow a correct and simpler regex strategy for catching all the data you want. triplee has already asked you for that and you didn't provide it. Why?
