BeautifulSoup and line breaks inside a table row?
Sample code:
from BeautifulSoup import BeautifulSoup, SoupStrainer
html='''<tr>
<td align="left">Foo<br />
Bar<br /></td>
</tr>'''
soup=BeautifulSoup(html)
rows=soup.findAll('tr')
print rows
print rows[0].text.encode("utf8")
I would like the output to be something like "Foo Bar"; an actual newline between the two lines would also be fine. Instead, the output I get is just "FooBar", with no whitespace between the two lines.
I'm very new to Python and BeautifulSoup. Can someone give me a hand?
Comments (2)
You can go one level further using cell = rows[0].find('td'), then inspect its children with cell.contents, filter out the elements you need, and join them with spaces.
Another option: use a regular expression to replace each <br /> with a space, then replace runs of consecutive whitespace with a single space. After that, the part you need is easy to extract.
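The regular-expression option could look roughly like the sketch below. It uses only the standard library, and the patterns are assumptions that fit simple markup like the sample in the question; they are not a general HTML parser.

```python
import re

html = '''<tr>
<td align="left">Foo<br />
Bar<br /></td>
</tr>'''

# Step 1: turn every <br>, <br/> or <br /> into a single space.
no_breaks = re.sub(r'<br\s*/?>', ' ', html, flags=re.IGNORECASE)

# Step 2: collapse runs of whitespace (newlines included) into one space.
collapsed = re.sub(r'\s+', ' ', no_breaks)

# Step 3: pull out the cell body. This pattern is an illustrative
# assumption and only suits flat, non-nested cells like the sample.
match = re.search(r'<td[^>]*>(.*?)</td>', collapsed)
cell_text = match.group(1).strip()

print(cell_text)  # Foo Bar
```

Doing the replacement on the raw HTML before any parsing keeps the word boundary that the <br /> tags would otherwise swallow.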
You might want to consider using lxml instead of BeautifulSoup. lxml lets you search for elements using XPath, which (I think) is easier than using BeautifulSoup's API.
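A minimal sketch of the XPath approach, assuming lxml is installed. The sample row is wrapped in <table> tags here, which is an addition not present in the question, so that the row has a proper parent element. The whitespace normalization with split() and join() is also an illustrative choice, not necessarily what the answer originally showed.

```python
import lxml.html

# The question's sample, wrapped in <table> (an assumption for this sketch).
html = '''<table><tr>
<td align="left">Foo<br />
Bar<br /></td>
</tr></table>'''

doc = lxml.html.fromstring(html)

cells = []
for td in doc.xpath('//td'):
    # text_content() concatenates all text inside the cell; the newline
    # that follows the first <br /> is kept as part of that text.
    raw = td.text_content()
    # Normalize any run of whitespace to a single space.
    cells.append(' '.join(raw.split()))

print(cells)  # ['Foo Bar']
```

Because the newline after the first <br /> survives in text_content(), the two words stay separated even though the tag itself contributes no text.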