Remove <br /> tags from a parsed Beautiful Soup list?

Posted on 2024-11-06 05:21:37

I'm currently getting into a for loop with all the rows I want:

page = urllib2.urlopen(pageurl)
soup = BeautifulSoup(page)
tables = soup.find("td", "bodyTd")
for row in tables.findAll('tr'):

At this point, I have my information, but the <br /> tags are ruining my output.

What's the cleanest way to remove these?

琴流音 2024-11-13 05:21:37

for e in soup.findAll('br'):
    e.extract()

少女情怀诗 2024-11-13 05:21:37

If you want to convert the <br /> tags to newlines, do something like this:

def text_with_newlines(elem):
    text = ''
    for e in elem.recursiveChildGenerator():
        if isinstance(e, basestring):
            text += e.strip()
        elif e.name == 'br':
            text += '\n'
    return text
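
The function above depends on basestring, which exists only in Python 2. As a rough sketch of the same idea for Python 3, the version below uses the standard library's html.parser instead of Beautiful Soup, so it is a stand-in technique rather than this answer's method. The names BrToNewline and br_to_newlines are hypothetical.

```python
from html.parser import HTMLParser


class BrToNewline(HTMLParser):
    """Collect text content, emitting a newline for every <br> tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing <br />, because the default
        # handle_startendtag delegates to handle_starttag.
        if tag == 'br':
            self.parts.append('\n')

    def handle_data(self, data):
        # Mirror the original answer: strip whitespace around each text node.
        self.parts.append(data.strip())


def br_to_newlines(html):
    """Return the text of an HTML fragment with <br> tags as newlines."""
    parser = BrToNewline()
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts)


result = br_to_newlines('<td>Some data<br />More data<br>End</td>')
```

Like the original function, this strips whitespace around every text node, so spacing between adjacent inline elements is lost.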

撩发小公举 2024-11-13 05:21:37

Replace the tags with a space before parsing. Beautiful Soup also accepts the string returned by .read() on the urlopen object, so this should work:

import re
import urllib2

page = urllib2.urlopen(pageurl)
page_text = page.read()
# Match <br>, <br/> and <br />; the original pattern '</br>' would not match <br />
new_text = re.sub(r'<br\s*/?>', ' ', page_text)
soup = BeautifulSoup(new_text)
tables = soup.find("td", "bodyTd")
for row in tables.findAll('tr'):
.....

The re.sub call replaces each br tag with a space.
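
A literal pattern only matches one spelling of the tag. As a minimal sketch, a single case-insensitive regular expression can cover <br>, <br/> and <br /> before the markup reaches the parser. The names BR_TAG and strip_br_tags are hypothetical, and the code is written for Python 3.

```python
import re

# Matches <br>, <br/>, <br /> and upper-case variants such as <BR>.
BR_TAG = re.compile(r'<br\s*/?>', re.IGNORECASE)


def strip_br_tags(html, replacement=' '):
    """Return html with every <br> variant replaced by `replacement`."""
    return BR_TAG.sub(replacement, html)


sample = 'Some data<br />More data<BR>Last<br/>'
cleaned = strip_br_tags(sample, '\n')
```

The cleaned string can then be handed to the parser in place of the raw page text. Passing '\n' as the replacement keeps the line breaks instead of collapsing them to spaces.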

甜心小果奶 2024-11-13 05:21:37

You could use some_string.replace('<br />', '\n') to replace the breaks with newlines.

>>> print 'Some data<br />More data<br />'.replace('<br />','\n')
Some data
More data

You might want to check out html5lib and lxml, which are both pretty great at parsing HTML. lxml is really fast, and html5lib is designed to be extremely robust.
