Removing text from an HTML document using Ruby

Posted on 2024-08-05 18:31:56

There are lots of examples of how to strip HTML tags from a document using Ruby; Hpricot and Nokogiri both have inner_text methods that remove all the HTML for you easily and quickly.

What I am trying to do is the opposite, remove all the text from an HTML document, leaving just the tags and their attributes.

I considered looping through the document and setting inner_html to nil, but you'd really have to do this in reverse: the first element (the root) has an inner_html containing the entire rest of the document, so ideally I'd start at the innermost element and set inner_html to nil while moving up through the ancestors.

Does anyone know a neat little trick for doing this efficiently? I was thinking regexes might do it, but probably not as efficiently as an HTML tokenizer/parser would.


Comments (4)

給妳壹絲溫柔 2024-08-12 18:31:56

This works too:

doc = Nokogiri::HTML(your_html)
doc.xpath("//text()").remove
清秋悲枫 2024-08-12 18:31:56

You can scan the string to create an array of "tokens", and then select only those that are HTML tags:

>> some_html
=> "<div>foo bar</div><p>I like <em>this</em> stuff <a href='http://foo.bar'> long time</a></p>"
>> some_html.scan(/<\/?[^>]+>|[\w\|`~!@#\$%^&*\(\)\-_\+=\[\]{}:;'",\.\/?]+|\s+/).select { |t| t =~ /<\/?[^>]+>/ }.join("")
=> "<div></div><p><em></em><a href='http://foo.bar'></a></p>"

==Edit==

Or even better, just scan for HTML tags ;)

>> some_html.scan(/<\/?[^>]+>/).join("")
=> "<div></div><p><em></em><a href='http://foo.bar'></a></p>"
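One caveat with the regex route, sketched below with made-up inputs: the pattern treats the first ">" after a "<" as the end of the tag, so an attribute value that itself contains ">" gets cut short. The helper name tags_only is hypothetical; the pattern is the one from the edit above.

```ruby
# The tag pattern from the answer above.
TAG_PATTERN = /<\/?[^>]+>/

# Hypothetical helper: keep only the substrings that look like tags.
def tags_only(html)
  html.scan(TAG_PATTERN).join
end

# Works as expected on simple markup.
simple = "<div>foo bar</div><p>I like <em>this</em> stuff</p>"
simple_result = tags_only(simple)
# simple_result == "<div></div><p><em></em></p>"

# An attribute value containing ">" ends the match too early,
# so the opening tag is truncated and the attribute is mangled.
tricky = '<a title="a > b">link</a>'
tricky_result = tags_only(tricky)
# tricky_result == '<a title="a ></a>'

puts simple_result
puts tricky_result
```

For markup you don't control, a parser-based approach avoids this class of problem.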
杯别 2024-08-12 18:31:56

To grab everything that is not inside a tag, you can use Nokogiri like this:

doc.search('//text()').text

Of course, that will grab stuff like the contents of <script> or <style> tags, so you could also remove blacklisted tags:

blacklist = ['title', 'script', 'style']
nodelist = doc.search('//text()')
blacklist.each do |tag|
  nodelist -= doc.search('//' + tag + '/text()')
end
nodelist.text

You could also whitelist if you preferred, but that's probably going to be more time-intensive:

whitelist = ['p', 'span', 'strong', 'i', 'b']  #The list goes on and on...
nodelist = Nokogiri::XML::NodeSet.new(doc)
whitelist.each do |tag|
  nodelist += doc.search('//' + tag + '/text()')
end
nodelist.text

You could also just build a huge XPath expression and do one search. I honestly don't know which way is faster, or if there is even an appreciable difference.

怎言笑 2024-08-12 18:31:56

I just came up with this, but @andre-r's solution is so much better!

#!/usr/bin/env ruby

require 'nokogiri'

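# Parse the markup, blank out the content of every text node,
# and return the serialized document.
def strip_text(html)
  Nokogiri(html).tap { |doc|
    doc.traverse do |node|
      node.content = nil if node.text?
    end
  }.to_s
end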

require 'test/unit'
require 'yaml'
class TestHTMLStripping < Test::Unit::TestCase
  def test_that_all_text_gets_stripped_from_the_document
    dirty, clean = YAML.load DATA
    assert_equal clean, strip_text(dirty)
  end
end
__END__
---
- |
  <!DOCTYPE html>
  <html xmlns='http://www.w3.org/1999/xhtml' xml:lang='en' lang='en'>
    <head>
        <meta http-equiv='Content-type'     content='text/html; charset=UTF-8' />
        <title>Test HTML Document</title>
        <meta http-equiv='content-language' content='en' />
    </head>
    <body>
        <h1>Test <abbr title='Hypertext Markup Language'>HTML</abbr> Document</h1>
        <div class='main'>
            <p>
                <strong>Test</strong> <abbr title='Hypertext Markup Language'>HTML</abbr> <em>Document</em>
            </p>
        </div>
    </body>
  </html>
- |
  <!DOCTYPE html>
  <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  <title></title>
  <meta http-equiv="content-language" content="en">
  </head>
  <body><h1><abbr title="Hypertext Markup Language"></abbr></h1><div class="main"><p><strong></strong><abbr title="Hypertext Markup Language"></abbr><em></em></p></div></body>
  </html>