I'm trying to extract an excerpt for an article (Markdown parsed to HTML) that includes only the plain text from paragraphs. All HTML needs to be stripped, and line breaks, tabs and runs of whitespace need to be replaced by a single space.
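The whitespace requirement on its own can be sketched in plain Ruby. The helper name `normalize_whitespace` is made up for illustration; in a Rails app, ActiveSupport's `String#squish` covers the same ground.

```ruby
# Hypothetical helper: collapse any run of whitespace (spaces, tabs,
# line breaks) into one space and trim both ends of the string.
def normalize_whitespace(text)
  text.gsub(/\s+/, " ").strip
end
```

This only handles whitespace; stripping the HTML itself still needs a parser.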
My first step was creating a simple test:
describe "#from_html" do
  it "creates an excerpt from given HTML" do
    html = "<p>The spice extends <b>life</b>.<br>The spice expands consciousness.</p>\n
            <ul><li>Skip me</li></ul>\n
            <p>The <i>spice</i> is vital to space travel.</p>"
    text = "The spice extends life. The spice expands consciousness. The spice is vital to space travel."
    expect(R::ExcerptHelper.from_html(html)).to eq(text)
  end
end
And started fiddling and came up with this:
def from_html(html)
  Nokogiri::HTML.parse(html).css("p").map { |node|
    node.children.map { |child|
      child.name == "br" ? child.replace(" ") : child
    } << " "
  }.join.strip.gsub(/\s+/, " ")
end
I'm a bit rusty on Rails, and this can probably be done much more efficiently and elegantly. I'm hoping for some pointers here.
Thanks in advance!
Approach 2
Turned to the sanitize method (thanks @max) and wrote a custom scrubber based on Rails::Html::PermitScrubber.
Approach 3
Realizing my source document is formatted as Markdown, I ventured forth by exploring a custom Redcarpet renderer.
See my answer for a complete example.
I ended up writing a custom Redcarpet renderer (inspired by Redcarpet::Render::StripDown), which seems the cleanest approach, with the least parsing and converting between formats.
And parse it: