BeautifulSoup parsing: more details

Posted on 2024-12-01 18:16:41

I already asked a question, but it seems my explanation was not clear, so I am asking again with more detail.

<h2 class="sectionTitle">
CORPORATE HEADQUARTERS  </h2>
277 Park Avenue<br />
New York, New York 10172
<br /><br />United States<br /><br />

I would like to extract only "New York, New York", without the postal code 10172.

And here is another question:

<h2 class="sectionTitle">
BACKGROUND</h2>
He graduated Blabala 
</span>

I would like to extract only "He graduated Blabala".

I have spent a few days on this and feel like I am going crazy. Please help me. Thank you in advance for your kind help.

Comments (2)

许仙没带伞 2024-12-08 18:16:41

You still need more detail to write a good regex.

For example, if you want to extract the second line under "CORPORATE HEADQUARTERS" without its postal code, and you can assume the postal code is always present, it can be written like this:

>>> import re
>>> html = '''
... <h2 class="sectionTitle">
... CORPORATE HEADQUARTERS  </h2>
... 277 Park Avenue<br />
... New York, New York 10172
... <br /><br />United States<br /><br />
... 
... <h2 class="sectionTitle">
... BACKGROUND</h2>
... He graduated Blabala
... </span>
... '''
>>> re.search(r'(?s)<h2 class="sectionTitle">\s*CORPORATE HEADQUARTERS\s*</h2>.*?<br />([^<>]+) \d+', html).group(1).strip()
'New York, New York'
>>> re.search(r'(?s)<h2 class="sectionTitle">\s*BACKGROUND\s*</h2>([^<>]+)', html).group(1).strip()
'He graduated Blabala'
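
The one-liners above raise AttributeError when the pattern does not match, because re.search returns None. A minimal sketch of a more defensive variant follows. The helper name headquarters_city and the assumption that the postal code is a US ZIP code (five digits, optionally followed by a hyphen and four more) are mine, not part of the question.

```python
import re

# Assumption: the postal code, when present, is a US ZIP code (12345 or 12345-6789).
HQ_CITY = re.compile(
    r'<h2 class="sectionTitle">\s*CORPORATE HEADQUARTERS\s*</h2>'
    r'.*?<br />([^<>]+?)(?:\s+\d{5}(?:-\d{4})?)?\s*<',
    re.DOTALL,
)

def headquarters_city(html):
    """Return the city line without its ZIP code, or None if nothing matches."""
    match = HQ_CITY.search(html)
    return match.group(1).strip() if match else None

sample = '''
<h2 class="sectionTitle">
CORPORATE HEADQUARTERS  </h2>
277 Park Avenue<br />
New York, New York 10172
<br /><br />United States<br /><br />
'''

city = headquarters_city(sample)
missing = headquarters_city("<p>no headquarters section here</p>")
```

Here city is 'New York, New York' and missing is None, so the caller can test the result instead of catching an exception.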

酒浓于脸红 2024-12-08 18:16:41

You should use a combination of tag.contents with .split('\n') to split the text into lines, and .rsplit(' ', 1) to split off only the rightmost space-separated token (here, the postal code).
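
A rough sketch of that idea, under a few assumptions: the helper name text_lines_after is hypothetical, the built-in "html.parser" backend is used, and the code walks the heading's next_siblings instead of tag.contents, because in the posted snippet the wanted text sits after the <h2> and not inside it.

```python
from bs4 import BeautifulSoup, NavigableString

html = """
<h2 class="sectionTitle">
CORPORATE HEADQUARTERS  </h2>
277 Park Avenue<br />
New York, New York 10172
<br /><br />United States<br /><br />

<h2 class="sectionTitle">
BACKGROUND</h2>
He graduated Blabala
</span>
"""

def text_lines_after(soup, title):
    """Collect the non-empty text lines between the titled <h2> and the next <h2>."""
    for h2 in soup.find_all("h2", class_="sectionTitle"):
        if h2.get_text(strip=True) != title:
            continue
        lines = []
        for node in h2.next_siblings:
            if getattr(node, "name", None) == "h2":
                break  # reached the next section heading
            if isinstance(node, NavigableString):
                # Split each text node on newlines and drop blank pieces.
                lines.extend(s.strip() for s in str(node).split("\n") if s.strip())
        return lines
    return []

soup = BeautifulSoup(html, "html.parser")

hq_lines = text_lines_after(soup, "CORPORATE HEADQUARTERS")
# Assumption: the second line is "city, state ZIP"; drop the last space-separated token.
city = hq_lines[1].rsplit(" ", 1)[0]

background = text_lines_after(soup, "BACKGROUND")
```

With the sample markup, city comes out as 'New York, New York' and background as ['He graduated Blabala']. Indexing hq_lines[1] relies on the address always having the street on the first line, so real pages may need a length check first.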
