Beautiful Soup - Grab the string after the first specified tag
I'm trying to grab the string immediately after the opening <td>
tag. The following code works:
    webpage = urlopen(i).read()
    soup = BeautifulSoup(webpage)
    for elem in soup('td', text=re.compile(r".\.doc")):
        print elem.parent
when the HTML looks like this:
    <td>plan_49913.doc</td>
but not when the HTML looks like this:
    <td>plan_49913.doc<br />
    <font color="#990000">Document superseded by: </font><a href="/plans/Jan_2012.html">January 2012</a></td>
I've tried playing with attrs but can't get it to work. Basically, I just want to grab 'plan_49913.doc' from either version of the HTML.
Any advice would be greatly appreciated.
Thank you in advance.
~chrisK
Comments (2)
This works for me. Is there something I'm missing?
Also note that, according to the documentation, the tag name is ignored when you search by text. So you don't need to pass 'td': it is already being ignored, that is, any matching text will be returned, whichever tag it sits under.
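A minimal sketch of such a text-only search, assuming Beautiful Soup 4 (the `bs4` package) rather than the older release the question appears to use. The function name `doc_names` and the two sample constants are made up for illustration; the HTML itself is copied from the question. In Beautiful Soup 4 the `string` argument (spelled `text` in older releases) selects text nodes directly, so no tag name is passed:

```python
import re
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed

# The two HTML samples from the question.
SIMPLE = '<td>plan_49913.doc</td>'
SUPERSEDED = (
    '<td>plan_49913.doc<br />\n'
    '<font color="#990000">Document superseded by: </font>'
    '<a href="/plans/Jan_2012.html">January 2012</a></td>'
)

def doc_names(html):
    """Return every text node that matches the question's '.doc' pattern."""
    soup = BeautifulSoup(html, "html.parser")
    pattern = re.compile(r".\.doc")
    # string= matches the text nodes themselves, whatever tag encloses them.
    return [str(s).strip() for s in soup.find_all(string=pattern)]

for sample in (SIMPLE, SUPERSEDED):
    print(doc_names(sample))
```

Both samples should give just the file name, because the other text in the second cell ("Document superseded by: " and "January 2012") does not match the pattern. If you need the enclosing cell, each matched string still has a `.parent`.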
Just use the next property: it contains the next node, which is a text node. You can change the if clause to use a regex if you prefer.
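Something along these lines, sketched for Beautiful Soup 4, where the property is spelled `next_element`. The function name `leading_doc_names` and the sample constants are hypothetical, and the `endswith` check is only a stand-in for whatever condition you want in the if clause:

```python
from bs4 import BeautifulSoup, NavigableString  # assumption: Beautiful Soup 4

# The two HTML samples from the question.
SIMPLE = '<td>plan_49913.doc</td>'
SUPERSEDED = (
    '<td>plan_49913.doc<br />\n'
    '<font color="#990000">Document superseded by: </font>'
    '<a href="/plans/Jan_2012.html">January 2012</a></td>'
)

def leading_doc_names(html):
    """Collect the text that directly follows each opening <td> tag."""
    soup = BeautifulSoup(html, "html.parser")
    names = []
    for td in soup.find_all("td"):
        node = td.next_element  # whatever was parsed right after the opening <td>
        # Keep it only if it is text (not a nested tag) and looks like a .doc name.
        if isinstance(node, NavigableString) and node.strip().endswith(".doc"):
            names.append(str(node).strip())
    return names

for sample in (SIMPLE, SUPERSEDED):
    print(leading_doc_names(sample))
```

Because only the first node inside each cell is inspected, the trailing "Document superseded by" markup never comes into play. Swapping the `endswith` test for `re.search(r"\.doc$", node.strip())` gives the regex variant.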