Using regular expressions to get an address out of a paragraph

Posted 2024-12-08 01:50:23

Alright, this one's a bit of a pain. I'm doing some scraping with Python, trying to get an address out of a few lines of poorly tagged HTML. Here's a sample of the format:

256-555-5555<br/>
1234 Fake Ave S<br/>
Gotham (Lower Ward)<br/>

I'd like to retrieve only 1234 Fake Ave S, Gotham. Any ideas? I've been doing regexes all night and now my brain is mush...

Edit:
More detail about the possible scenarios for how the data will arrive. Sometimes the first line will be there, sometimes not. All of the addresses I have seen contain Ave, Way, or St, although I would prefer not to use that as a factor in the selection, as I am not certain they always will. The second and third lines are always present.

What I had in mind was something that

Selects everything on the second-to-last line (so the second line if there are three lines, or the first line if there are only two because there is no phone number).
Selects everything on the last line that isn't in parentheses.
Combines the second-to-last line and the last line, adding a ", " between the two.

I'm using Scrapy to acquire the HTML code. The address is all in the same div, and I want to use regex to break the data up further into the appropriate sections. How to do that is what I can't figure out.

Edit2:

As per Ofir's comment, I should mention that I have already made expressions to isolate the phone number and parentheses section.

Phone (or possible email or website):

((1[-. ])?[0-9]{3}[-. ])?\(?([0-9]{3}[-. ][A?([0-9]{4})|([\w\.-]+@[\w\.-]+)|(www.+)|([\w\.-]*(?:com|net|org|us))

parentheses:

\((.*?)\)

I'm not sure how to use those to construct an everything-but-these statement.

孤独患者 2024-12-15 01:50:23

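In your case it may be easier to focus on what you don't want:

HTML tags (<br/>)
phone numbers
everything in parentheses

Each of these can be matched easily with a simple regular expression, which makes it easy to construct an expression that matches the rest (presumably the address).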
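
As a rough illustration of that idea, here is a minimal sketch that strips the unwanted parts and joins whatever remains. The function name extract_address and the three patterns are hypothetical, simplified stand-ins rather than the expressions from the question, and this sketch only handles phone numbers on the first line, not emails or websites.

```python
import re

# Hypothetical, simplified patterns (assumptions, not the asker's originals).
TAG = re.compile(r'<[^>]+>')                                   # any HTML tag
PHONE = re.compile(r'(?:1[-. ])?(?:\(?\d{3}\)?[-. ])?\d{3}[-. ]\d{4}')
PARENS = re.compile(r'\s*\([^)]*\)')                           # "(Lower Ward)"

def extract_address(html):
    """Drop tags, phone numbers and parenthesised text; join what is left."""
    parts = []
    # Replace each tag with a newline so every <br/> ends a line.
    for line in TAG.sub('\n', html).splitlines():
        line = PARENS.sub('', PHONE.sub('', line)).strip()
        if line:  # skip lines that are empty after removal
            parts.append(line)
    return ', '.join(parts)

sample = "256-555-5555<br/>\n1234 Fake Ave S<br/>\nGotham (Lower Ward)<br/>\n"
print(extract_address(sample))  # 1234 Fake Ave S, Gotham
```

Because the phone line simply becomes empty and is skipped, the same function should work whether or not the first line is present.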

日裸衫吸 2024-12-15 01:50:23

This attempts to isolate the last two lines out of the string:

>>> import re
>>> s = """256-555-5555<br/>
... 1234 Fake Ave S<br/>
... Gotham (Lower Ward)<br/>
... """
>>> # group(1) is the second-to-last line, group(2) the last line
>>> m = re.search(r'((?!<br/>).*)<br/>\n((?!<br/>).*)<br/>$', s)
>>> print(m.group(1))
1234 Fake Ave S

Trimming the parentheses is probably best left to a separate line of code, rather than complicating the regular expression further.
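
A minimal sketch of that separate step, assuming the lines are always separated by some form of <br/> and that at least two non-empty lines are present. The function name last_two_lines_address is hypothetical; it splits instead of using the search above, then trims the parentheses and joins the two lines with ", ".

```python
import re

def last_two_lines_address(html):
    """Join the last two <br/>-separated lines, minus any parenthesised note."""
    # Split on <br/>, also tolerating <br> and <br />, and drop empty pieces.
    lines = [p.strip() for p in re.split(r'<br\s*/?>', html) if p.strip()]
    # Assumes at least two lines remain; raises IndexError otherwise.
    street, city = lines[-2], lines[-1]
    # Trim a parenthesised section such as "(Lower Ward)" from the last line.
    city = re.sub(r'\s*\(.*?\)', '', city).strip()
    return street + ', ' + city

s = """256-555-5555<br/>
1234 Fake Ave S<br/>
Gotham (Lower Ward)<br/>
"""
print(last_two_lines_address(s))  # 1234 Fake Ave S, Gotham
```

Counting from the end of the list means an optional phone, email or website line at the top never has to be matched at all.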
