Creating a text excerpt from HTML paragraphs in Rails

Posted 2025-01-12 02:29:24

I'm trying to extract an excerpt from an article (Markdown parsed to HTML) that includes only the plain text of its paragraphs. All HTML needs to be stripped, and line breaks, tabs and runs of whitespace need to be replaced by a single space.
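The whitespace requirement on its own can be sketched in plain Ruby. This is only an illustration: `normalize_whitespace` is a hypothetical helper name that does not appear in the post, and it assumes the HTML has already been stripped.

```ruby
# Hypothetical helper (not from the original post): collapse every run of
# whitespace (spaces, tabs, line breaks) into one space and trim both ends.
def normalize_whitespace(text)
  text.gsub(/\s+/, " ").strip
end

puts normalize_whitespace("The spice\textends\n   life. ")
```

In a Rails app, ActiveSupport's `String#squish` covers much the same ground, so a separate helper may not be needed there.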

My first step was creating a simple test:

describe "#from_html" do
  it "creates an excerpt from given HTML" do
    html = "<p>The spice extends <b>life</b>.<br>The spice    expands consciousness.</p>\n
           <ul><li>Skip me</li></ul>\n
           <p>The <i>spice</i> is vital to space travel.</p>"

    text = "The spice extends life. The spice expands consciousness. The spice is vital to space travel."

    expect(R::ExcerptHelper.from_html(html)).to eq(text)
  end
end

Then I started fiddling and came up with this:

def from_html(html)
  Nokogiri::HTML.parse(html).css("p").map{|node|
    node.children.map{|child|
      child.name == "br" ? child.replace(" ") : child
    } << " "
  }.join.strip.gsub(/\s+/, " ")
end

I'm a bit rusty on Rails, and this can probably be done much more efficiently and elegantly. I'm hoping for some pointers here.

Thanks in advance!

Approach 2

I turned to the sanitize method (thanks @max) and wrote a custom scrubber based on Rails::Html::PermitScrubber.

Approach 3

Realizing my source document is formatted as Markdown, I ventured forth by exploring a custom Redcarpet renderer.

See my answer for a complete example.

流绪微梦 2025-01-19 02:29:24

I ended up writing a custom Redcarpet renderer (inspired by Redcarpet::Render::StripDown), which seems the cleanest approach, with the least parsing and converting between formats.

module R::Markdown
  class ExcerptRenderer < Redcarpet::Render::Base
    # Methods where the first argument is the text content
    [
      # block-level calls
      :paragraph,

      # span-level calls
      :codespan, :double_emphasis,
      :emphasis, :underline, :raw_html,
      :triple_emphasis, :strikethrough,
      :superscript, :highlight, :quote,

      # footnotes
      :footnotes, :footnote_def, :footnote_ref,

      # low level rendering
      :entity, :normal_text
    ].each do |method|
      define_method method do |*args|
        args.first
      end
    end

    # Methods where content is replaced with an empty space
    [
      :autolink, :block_html
    ].each do |method|
      define_method method do |*|
        " "
      end
    end

    # Methods we are going to [snip]
    [
      :list, :image, :table, :block_code
    ].each do |method|
      define_method method do |*|
        " [#{method}] "
      end
    end

    # Other methods
    def link(link, title, content)
      content
    end

    def header(text, header_level)
      " #{text} "
    end

    def block_quote(quote)
      " “#{quote}” "
    end

    # Replace all whitespace with single space
    def postprocess(document)
      document.gsub(/\s+/, " ").strip
    end
  end
end

Then use it to render the Markdown:

extensions = {
  autolink:                     true,
  disable_indented_code_blocks: true,
  fenced_code_blocks:           true,
  lax_spacing:                  true,
  no_intra_emphasis:            true,
  strikethrough:                true,
  superscript:                  true,
  tables:                       true
}

markdown = Redcarpet::Markdown.new(R::Markdown::ExcerptRenderer, extensions)

markdown.render(md).html_safe
