Removing HTML tags in the AppEngine Python environment (equivalent of Ruby's Sanitize)

Posted 2024-08-24 17:39:28, 432 characters, 13 views, 0 comments

I am looking for a Python module that strips HTML tags but keeps the text values. I tried BeautifulSoup earlier but couldn't figure out how to do this simple task with it. I also searched for other Python modules that do this, but they all seem to depend on other libraries that do not work well on AppEngine.

Below is sample code using Ruby's sanitize library; this is what I am after in Python:

require 'rubygems'
require 'sanitize'

html = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'

Sanitize.clean(html) # => 'foo'

Thanks for your suggestions.

-e

深陷 2024-08-31 17:39:28

>>> import BeautifulSoup
>>> html = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'
>>> bs = BeautifulSoup.BeautifulSoup(html)  
>>> bs.findAll(text=True)
[u'foo']

This gives you a list of (Unicode) strings. If you want to turn it into a single string, use ''.join(thatlist).
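
If depending on BeautifulSoup is itself a problem on AppEngine, roughly the same result can be had from the standard library alone. The following is a minimal sketch, assuming Python 3 (where the parser lives in html.parser; on Python 2 the module is named HTMLParser). The names TextExtractor and strip_tags are hypothetical helpers written for this example, not part of any library.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores every tag (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as &amp;
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text found between tags.
        self.parts.append(data)


def strip_tags(html):
    """Return the text of an HTML fragment with all tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)


html = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'
print(strip_tags(html))  # foo
```

Note that this sketch keeps the contents of script and style elements as text, so it is only a tag stripper, not a sanitizer for untrusted input.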

夏雨凉 2024-08-31 17:39:28

If you don't want to use a separate library, you can import the standard Django utils. For example:

from django.utils.html import strip_tags
html = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'
stripped = strip_tags(html)
print(stripped)
# prints: foo

It is also already available in Django templates, so you don't need anything else; just use the filter, like this:

{{ unsafehtml|striptags }}

By the way, this is one of the fastest ways.

冬天旳寂寞 2024-08-31 17:39:28

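Using lxml:

from lxml.html import fromstring

htmlstring = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'

# Parse the fragment into an element tree.
mySearchTree = fromstring(htmlstring)

# Print the text of each <a> element. To get the text of the whole
# fragment regardless of tag, use mySearchTree.text_content() instead.
for item in mySearchTree.cssselect('a'):
    print(item.text)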

蓝礼 2024-08-31 17:39:28

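#!/usr/bin/python

from xml.dom.minidom import parseString

def getText(el):
    # Recursively concatenate the text nodes under el, dropping the tags.
    ret = ''
    for child in el.childNodes:
        if child.nodeType == child.TEXT_NODE:
            ret += child.nodeValue
        else:
            ret += getText(child)
    return ret

html = '<b>this is <a href="http://foo.com/">a link </a> and some bold text  </b> followed by <img src="http://foo.com/bar.jpg" /> an image'
# minidom is an XML parser, so the fragment must be well-formed;
# wrap it in a single root element so it parses as one document.
dom = parseString('<root>' + html + '</root>')
print(getText(dom.documentElement))

Prints (shown here with runs of whitespace collapsed):

this is a link and some bold text followed by an image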
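
The string returned by getText keeps every space from the source, so words on either side of a removed tag can end up separated by several spaces. Below is a minimal sketch of one way to normalize that, assuming Python 3. The helper names get_text, collapse_whitespace and text_from_fragment are hypothetical and exist only for this example. Because minidom is an XML parser, this approach only works on well-formed markup.

```python
from xml.dom.minidom import parseString


def get_text(el):
    """Recursively concatenate all text nodes below el."""
    ret = ''
    for child in el.childNodes:
        if child.nodeType == child.TEXT_NODE:
            ret += child.nodeValue
        else:
            ret += get_text(child)
    return ret


def collapse_whitespace(text):
    """Replace every run of whitespace with a single space and trim the ends."""
    return ' '.join(text.split())


def text_from_fragment(fragment):
    """Strip tags from a well-formed fragment and normalize its whitespace."""
    # Wrap the fragment in one root element so it parses as a single document.
    dom = parseString('<root>' + fragment + '</root>')
    return collapse_whitespace(get_text(dom.documentElement))


html = ('<b>this is <a href="http://foo.com/">a link </a> and some bold text  </b>'
        ' followed by <img src="http://foo.com/bar.jpg" /> an image')
print(text_from_fragment(html))
# this is a link and some bold text followed by an image
```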

罪歌 2024-08-31 17:39:28

Late to the party, but you can use jinja2.Markup() and its striptags() method:

http://jinja.pocoo.org/docs/api/#jinja2.Markup.striptags

>>> from jinja2 import Markup
>>> Markup("<div>About</div>").striptags()
u'About'
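
For environments where Jinja2 is not available, the general idea behind striptags (drop the tags, normalize whitespace, decode entities) can be approximated with the standard library. This is a rough sketch assuming Python 3, not Jinja2's actual implementation; striptags_approx and the two regular expressions are hypothetical names introduced here. Regex-based stripping is fragile (for example, a ">" inside an attribute value will confuse it), so treat it as an illustration rather than a sanitizer.

```python
import re
from html import unescape

# Assumption: simple patterns that are good enough for tidy markup only.
_COMMENT_RE = re.compile(r'<!--.*?-->', re.DOTALL)
_TAG_RE = re.compile(r'<[^>]*>')


def striptags_approx(markup):
    """Approximate tag stripping: remove comments and tags,
    collapse whitespace, then decode HTML entities."""
    without_comments = _COMMENT_RE.sub('', markup)
    without_tags = _TAG_RE.sub('', without_comments)
    collapsed = ' '.join(without_tags.split())
    return unescape(collapsed)


print(striptags_approx("<div>About</div>"))  # About
```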
