python [lxml] - cleaning out html tags
import sys

from lxml.html.clean import Cleaner


def clean(text):
    try:
        # Strip scripts, embedded objects, <meta>, page structure, <link> tags and styles;
        # remove_tags drops the listed tags but keeps their text content.
        cleaner = Cleaner(scripts=True, embedded=True, meta=True, page_structure=True,
                          links=True, style=True, remove_tags=['a', 'li', 'td'])
        cleaned = cleaner.clean_html(text)
        print(len(cleaned) - len(text))
        return cleaned
    except Exception:
        print('Error in clean_html')
        print(sys.exc_info())
        return text
I put together the above (ugly) code as my initial foray into Python land. I'm trying to use the lxml Cleaner to clean out a couple of HTML pages, so that in the end I'm left with just the text and nothing else. But try as I might, the above doesn't appear to work: I'm still left with a substantial amount of markup (and it doesn't appear to be broken HTML), particularly links, which aren't getting removed despite the args I use in remove_tags and links=True.
Any idea what's going on? Perhaps I'm barking up the wrong tree with lxml? I thought this was the way to go for HTML parsing in Python.
Comments (3)
The solution from David concatenates the text with no separator:
but this one helped me, since it concatenates the text the way I needed:
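As a rough sketch of what joining with a separator could look like (not necessarily the exact code this comment used), one option is to collect every text node with the XPath expression `//text()` and join the pieces yourself. The function name `text_with_separator`, the default newline separator, and the stripping of whitespace-only pieces are all assumptions made for illustration.

```python
import lxml.html


def text_with_separator(html_text, separator="\n"):
    """Return the text nodes of an HTML string, joined with `separator`.

    Hypothetical helper: the name and the whitespace handling are
    illustrative choices, not taken from the original comment.
    """
    document = lxml.html.document_fromstring(html_text)
    # "//text()" selects every text node in the document, in document order.
    pieces = document.xpath("//text()")
    # Drop whitespace-only nodes and trim the rest before joining.
    return separator.join(piece.strip() for piece in pieces if piece.strip())


if __name__ == "__main__":
    print(text_with_separator("<div><p>First</p><p>Second</p></div>"))
```

Note that `//text()` also picks up the contents of `<script>` and `<style>` elements, so those would need to be removed first if they are present.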
Not sure if this method existed around the time you made your question, but if you go through
That should return all the text content of the HTML document, minus all the markup.
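A minimal sketch of this idea, assuming the method meant here is lxml's `text_content()` called on a parsed document. The helper name `extract_text` and the sample HTML are made up for the example.

```python
import lxml.html


def extract_text(html_text):
    """Parse an HTML string and return its text with all tags dropped.

    Hypothetical helper name; assumes text_content() is the method the
    answer refers to.
    """
    document = lxml.html.document_fromstring(html_text)
    # text_content() concatenates the text of the element and all of its
    # descendants, with no separator between the pieces.
    return document.text_content()


if __name__ == "__main__":
    sample = '<html><body><p>Hello <a href="http://example.com">world</a>!</p></body></html>'
    print(extract_text(sample))
```

Because the pieces are joined without any separator, text from adjacent block elements runs together, which is what the previous comment is pointing out.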
I think you should check out Beautiful Soup. Use the advice from this article and strip the HTML elements in the following way:
Where page is your string of HTML. Should you need further clarification, you can check out the Dive into Python case study on HTML parsing.
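For illustration, here is one way stripping the markup with Beautiful Soup might look. This is a sketch using the current `bs4` package and its `get_text()` method, which may differ from the exact call the answer had in mind. The helper name `strip_tags` and the choice of the built-in `"html.parser"` backend are assumptions.

```python
from bs4 import BeautifulSoup


def strip_tags(page):
    """Return the text of the HTML string `page` without any tags.

    Hypothetical helper; uses bs4's get_text(), which may not be the
    exact call from the original answer.
    """
    # "html.parser" is Python's built-in parser, so no extra dependency
    # beyond bs4 itself is assumed.
    soup = BeautifulSoup(page, "html.parser")
    return soup.get_text()


if __name__ == "__main__":
    page = '<p>Hello <a href="#">world</a></p>'
    print(strip_tags(page))
```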