python [lxml] - stripping HTML tags

Posted 2024-09-03 08:45:58

import sys
from lxml.html.clean import Cleaner

def clean(text):
    try:
        cleaner = Cleaner(scripts=True, embedded=True, meta=True,
                          page_structure=True, links=True, style=True,
                          remove_tags=['a', 'li', 'td'])
        print(len(cleaner.clean_html(text)) - len(text))
        return cleaner.clean_html(text)
    except Exception:
        print('Error in clean_html')
        print(sys.exc_info())
        return text



I put together the above (ugly) code as my initial foray into Python. I'm trying to use the lxml Cleaner to clean out a couple of HTML pages so that in the end I'm left with just the text and nothing else. But try as I might, the above doesn't appear to work: I'm still left with a substantial amount of markup (and it doesn't appear to be broken HTML), particularly links, which aren't getting removed despite the arguments I pass in remove_tags and links=True.

Any idea what's going on? Am I perhaps barking up the wrong tree with lxml? I thought this was the way to go for HTML parsing in Python.

Comments (3)

很酷不放纵 2024-09-10 08:45:58


The solution from David concatenates the text with no separator:

   import lxml.html
   document = lxml.html.document_fromstring(html_string)
   # internally does: etree.XPath("string()")(document)
   print(document.text_content())

but this one helped me, concatenating the way I needed:

   from lxml import etree
   print("\n".join(etree.XPath("//text()")(document)))
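A quick side-by-side of the two approaches above, on a made-up html_string, shows the separator difference:

```python
import lxml.html
from lxml import etree

html_string = "<div><p>first</p><p>second</p></div>"  # illustrative input
document = lxml.html.document_fromstring(html_string)

print(document.text_content())                       # "firstsecond" -- no separator
print("\n".join(etree.XPath("//text()")(document)))  # "first" and "second" on separate lines
```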

分開簡單 2024-09-10 08:45:58


Not sure if this method existed around the time you asked your question, but if you go through

document = lxml.html.document_fromstring(html_text)
raw_text = document.text_content()

that should return all the text content in the HTML document, minus all the markup.
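If lxml isn't available, the same idea can be sketched with only the standard library's html.parser, which likewise collects text nodes while discarding markup (the input HTML and the TextExtractor class name are made up for illustration):

```python
# Dependency-free text extraction: collect every text node the parser sees.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called once per run of character data between tags.
        self.parts.append(data)

extractor = TextExtractor()
extractor.feed("<div><p>first</p><p>second</p></div>")
print("".join(extractor.parts))  # -> firstsecond
```

Note that, like text_content(), this keeps the contents of script and style elements, so those may need stripping first.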

时光礼记 2024-09-10 08:45:58


I think you should check out Beautiful Soup. Use the advice from this article and strip the HTML elements in the following way:

from bs4 import BeautifulSoup  # bs4 is the maintained successor to the old BeautifulSoup package

''.join(BeautifulSoup(page, "html.parser").find_all(string=True))

Where page is your string of HTML.

Should you need further clarification, you can check out the Dive Into Python case study on HTML parsing.
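Modern bs4 also exposes get_text(), which takes a separator and avoids the run-together-words problem of a plain join (assumes beautifulsoup4 is installed; the page string is made up for illustration):

```python
from bs4 import BeautifulSoup

page = "<div><p>first</p><p>second</p></div>"
# separator joins adjacent text nodes; strip=True trims whitespace around each one
print(BeautifulSoup(page, "html.parser").get_text(separator=" ", strip=True))  # -> first second
```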
