Rails - strip_tags - not catching DOCTYPE?
Given an HTML email, I'm using the following to strip it down to just the text:
body = body.gsub(/\\r\\n?/, "\n");
body = body.gsub(/\\n\\n?/, "\n");
body = simple_format(body)
body = strip_tags(body)
But I'm now seeing that one tag gets past this:
<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\">
Which is output like so:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
Any ideas why?
I guess strip_tags, which looks like it's been deprecated, considers the doctype statement neither a tag nor a comment. You could strip it out separately:
string.gsub(/<!.*?$/,'')
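Note that the pattern above removes everything from `<!` to the end of that line, so it also drops comments and any markup that follows the doctype on the same line. A narrower variant, as a minimal sketch: the helper name `strip_doctype` is hypothetical (it is not part of Rails), and it assumes the declaration has no internal subset containing a `>`.

```ruby
# Hypothetical helper, not a Rails API: removes DOCTYPE declarations
# so they never reach the tag stripper.
def strip_doctype(html)
  # Match "<!DOCTYPE" (case-insensitive), then everything up to the first ">".
  # Assumption: the declaration contains no nested ">" (no internal subset).
  html.gsub(/<!DOCTYPE[^>]*>/i, "")
end

html = %(<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><p>Hello</p>)
puts strip_doctype(html)
```

The result can then be passed to strip_tags as before; markup that follows the declaration on the same line is left alone.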
I ended up using Hpricot to extract the text, and it worked great.
I'd recommend using Nokogiri for your parsing needs. It's very well supported, plenty fast, very flexible, and the basis of a lot of other HTML/XML-related gems. It has an Hpricot mode, though I'm not sure why anyone would need that, as Nokogiri's own syntax is more full-featured.
In particular, to strip tags from HTML, I'd recommend looking into Loofah. It can whitelist tags and offers several levels of cleansing.