Remove text from an HTML document using Ruby
There are lots of examples of how to strip HTML tags from a document using Ruby; Hpricot and Nokogiri both have inner_text methods that remove all the HTML for you easily and quickly.
What I am trying to do is the opposite: remove all the text from an HTML document, leaving just the tags and their attributes.
I considered looping through the document and setting inner_html to nil, but you would have to do this in reverse, because the first element (the root) has an inner_html containing the entire rest of the document. Ideally I'd have to start at the innermost elements and set inner_html to nil while moving up through the ancestors.
Does anyone know a neat little trick for doing this efficiently? I was thinking regexes might do it, but probably not as efficiently as an HTML tokenizer/parser would.
Comments (4)
This works too:
You can scan the string to create an array of "tokens", and then only select those that are html tags:
==Edit==
Or even better, just scan for html tags ;)
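Both ideas above can be sketched with plain `String#scan`. This is only an illustration: the regexes and the method names `tags_via_tokens` and `tags_only` are assumptions, and a pattern this simple will misfire on a `>` inside an attribute value, on comments, and on script contents.

```ruby
# Hypothetical helper: split the string into "tokens" (either a tag or a
# run of text), then keep only the tokens that look like tags.
def tags_via_tokens(html)
  tokens = html.scan(/<[^>]+>|[^<]+/)
  tokens.select { |token| token.start_with?('<') }.join
end

# Hypothetical helper: scan for the tags directly and skip the text entirely.
def tags_only(html)
  html.scan(/<[^>]+>/).join
end

html = '<p class="a">Hi <b>there</b></p>'
puts tags_via_tokens(html)  # prints <p class="a"><b></b></p>
puts tags_only(html)        # prints <p class="a"><b></b></p>
```

The second version does less work, since the text runs are never collected in the first place.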
To grab everything outside of the tags, you can use nokogiri like this:
Of course, this grabs
or