python [lxml] - cleaning out html tags

Posted on 2024-09-03 08:45:58

import sys

# Note: since lxml 5.2 this module ships separately as the lxml_html_clean package.
from lxml.html.clean import Cleaner


def clean(text):
    try:
        cleaner = Cleaner(scripts=True, embedded=True, meta=True, page_structure=True,
                          links=True, style=True, remove_tags=['a', 'li', 'td'])
        cleaned = cleaner.clean_html(text)
        # Report the change in length (negative means the output is shorter).
        print(len(cleaned) - len(text))
        return cleaned
    except Exception:
        print('Error in clean_html')
        print(sys.exc_info())
        return text

I put together the above (ugly) code as one of my first forays into Python land. I'm trying to use the lxml Cleaner to clean out a couple of HTML pages, so that in the end I am left with just the text and nothing else. But try as I might, the above doesn't appear to work: I'm still left with a substantial amount of markup (and it doesn't appear to be broken HTML), particularly links, which aren't getting removed despite the arguments I pass in remove_tags and links=True.

Any idea what's going on? Perhaps I'm barking up the wrong tree with lxml? I thought this was the way to go for HTML parsing in Python.

很酷不放纵 2024-09-10 08:45:58

David's solution concatenates the text with no separator:

   import lxml.html
   document = lxml.html.document_fromstring(html_string)
   # internally does: etree.XPath("string()")(document)
   print(document.text_content())

But this one helped me, because it joins the text the way I needed:

   from lxml import etree
   print("\n".join(etree.XPath("//text()")(document)))
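
To make the difference between the two approaches concrete, here is a minimal sketch. The sample page in `html_string` is made up for illustration, and skipping whitespace-only text nodes is an extra step that is not part of the answer above.

```python
import lxml.html

# Hypothetical sample page, used only for illustration.
html_string = (
    "<html><body><h1>Title</h1>"
    "<p>First <a href='#'>link</a> here.</p>"
    "<ul><li>one</li><li>two</li></ul></body></html>"
)

document = lxml.html.document_fromstring(html_string)

# text_content() glues every text node together with no separator.
joined = document.text_content()

# Collect the text nodes one by one, drop whitespace-only nodes,
# and put each remaining piece on its own line.
pieces = [t.strip() for t in document.xpath("//text()") if t.strip()]
separated = "\n".join(pieces)

print(joined)
print(separated)
```

With `text_content()` the heading and the paragraph run together, while the joined text nodes keep each fragment on its own line, including the link text that sits inside the paragraph.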

分開簡單 2024-09-10 08:45:58

I'm not sure whether this method existed when you asked your question, but if you run

document = lxml.html.document_fromstring(html_text)
raw_text = document.text_content()

it should return all the text content in the HTML document, minus all the markup.
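
One thing to watch for: `text_content()` also returns the text inside `<script>` and `<style>` elements. A possible way around this, sketched here with a made-up `html_text` sample, is to drop those elements with `etree.strip_elements` before extracting the text. This is an addition, not part of the answer above.

```python
import lxml.html
from lxml import etree

# Hypothetical sample page, used only for illustration.
html_text = (
    "<html><head><title>Demo</title><style>p {color: red}</style></head>"
    "<body><p>Hello</p><script>var x = 1;</script><p>World</p></body></html>"
)

document = lxml.html.document_fromstring(html_text)

# Remove <script> and <style> elements together with their content.
# with_tail=False keeps any text that follows a removed element.
etree.strip_elements(document, "script", "style", with_tail=False)

raw_text = document.text_content()
print(raw_text)
```

The tag names passed to `strip_elements` are a choice for this example; any other elements whose content should not count as page text could be added to the list.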

时光礼记 2024-09-10 08:45:58

I think you should check out Beautiful Soup. Use the advice from this article and strip the HTML elements in the following way:

# Beautiful Soup 4 (the bs4 package); the original answer used the older Beautiful Soup 3 import.
from bs4 import BeautifulSoup

''.join(BeautifulSoup(page, 'html.parser').find_all(string=True))

Here page is your string of HTML.

Should you need further clarification, you can check out the Dive into Python case study on HTML parsing.

Author: 浅沫记忆
