Using XPath in Ruby to get the first few elements of an HTML fragment

Posted 2024-09-28 05:44:41

For a blog-like project, I want to take the first few paragraphs, headers, lists, or whatever else fits within a given number of characters from a Markdown-generated HTML fragment and display them as a summary.

So if I have

<h1>hello world</h1>
<p>Lets say these are 100 chars</p>
<ul>
    <li>some bla bla, 40 chars</li>
</ul>
<p>some other text</p>

Assume I want the summary to contain the text within the first 150 characters. It does not have to be exact: I could simply take the first 150 characters including tags, but that would probably leave artifacts at the tail that are harder to handle. The result should contain the h1, the first p, and the ul, but not the final p, which would be truncated. If the first element alone has more than 150 characters, I would take the full first element.

How can I get this? With XPath or a regex? I'm out of ideas on this one...

Edit

First, a big THANK YOU to everyone who replied!

While I got really great answers in this thread, I found it much easier to hook in before the Markdown interpreter runs: take the first n text blocks separated by \r\n\r\n and pass only those on for Markdown generation.

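  class String
    # Returns the leading text blocks (separated by \r\n\r\n) whose combined
    # length, separators included, stays within +length+ characters.
    def summarize_md(length)
      sum = ""
      split(/\r\n\r\n/).each do |block|
        break if sum.length + block.length > length
        sum += "#{block}\r\n\r\n"
      end
      sum
    end
  end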

While this code could probably be reduced to a one-liner, it is still much simpler and more CPU-friendly than any of the proposed solutions.
Anyway, since my question could be read as if the HTML were the starting point (and not the Markdown text), I'll accept the first answer... I hope that's fair.
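
The summarize_md approach splits only on \r\n\r\n and returns an empty string when the first block alone exceeds the limit, whereas the question asks for the full first element in that case. A minimal sketch of a variant that accepts both \n and \r\n line endings and always keeps the first block might look like the following. The name summarize_markdown and the choice to join blocks with \n\n are assumptions made for illustration, and unlike the original it counts only the block lengths, not the separators.

```ruby
# Hypothetical variant of summarize_md (not the original author's code):
# accepts \n or \r\n line endings and always keeps the first block,
# even if that block alone is longer than the limit.
def summarize_markdown(text, limit)
  blocks = text.split(/(?:\r?\n){2,}/) # blank lines separate blocks
  kept = []
  total = 0 # combined length of the kept blocks, separators excluded
  blocks.each do |block|
    break if !kept.empty? && total + block.length > limit
    kept << block
    total += block.length
  end
  kept.join("\n\n")
end

md = "# hello world\n\nFirst paragraph.\n\n* item\n\nLast paragraph."
puts summarize_markdown(md, 35)
```

With a limit of 35, the heading, the first paragraph, and the list item are printed, and the last paragraph is dropped because adding it would exceed the limit.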

你如我软肋 2024-10-05 05:44:42

How could I get this?

XSLT, of course!

This stylesheet:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:strip-space elements="*"/>
    <xsl:param name="pMaxLength" select="73"/>
    <xsl:template match="node()">
        <xsl:param name="pPrecedingLength" select="0"/>
        <xsl:variable name="vContent">
            <xsl:copy>
                <xsl:copy-of select="@*"/>
                <xsl:apply-templates select="node()[1]">
                    <xsl:with-param name="pPrecedingLength"
                                    select="$pPrecedingLength"/>
                </xsl:apply-templates>
            </xsl:copy>
        </xsl:variable>
        <xsl:variable name="vLength"
                      select="$pPrecedingLength + string-length($vContent)"/>
        <xsl:if test="$pMaxLength > $vLength and
                      (string-length($vContent) or not(node()))
                      or not($pPrecedingLength)">
            <xsl:copy-of select="$vContent"/>
            <xsl:apply-templates select="following-sibling::node()[1]">
                <xsl:with-param name="pPrecedingLength" select="$vLength"/>
            </xsl:apply-templates>
        </xsl:if>
    </xsl:template>
</xsl:stylesheet>

Output:

<html>
    <h1>hello world</h1>
    <p>Lets say these are 100 chars</p>
    <ul>
        <li>some bla bla, 40 chars</li>
    </ul>
</html>

傾旎 2024-10-05 05:44:42

For my uses I always wanted to strip tags because they could include all sorts of nastiness that would totally hose the display of the summary. They could also seriously skew the letter count, depending on the tags and whether they contain parameters.

I've used something like this many times.

require 'nokogiri'

html = %q{
<h1>hello world</h1>
<p>Lets say these are 100 chars</p>
<ul>
    <li>some bla bla, 40 chars</li>
</ul>
<p>some other text</p>
}

doc = Nokogiri::HTML(html)
# [0, 150] takes at most 150 characters; [0 .. 150] would take 151.
puts doc.content.gsub(/\n/, ' ').squeeze(' ').strip[0, 150]

Which outputs

hello world Lets say these are 100 chars some bla bla, 40 chars some other text

I'll leave it to you to figure out how to ignore or subtract the text from the final <p> tag, but looking up that tag and grabbing its content and then stripping it from the end of the string shouldn't be too hard.
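
One way to leave out the final <p> is to stop at element boundaries instead of cutting the string. The following is only a rough sketch under stated assumptions: it uses REXML, which ships with Ruby, rather than Nokogiri, so the fragment must be well-formed XML; the helper names leading_elements and element_text are hypothetical and not part of any library.

```ruby
require 'rexml/document'

# Hypothetical helper: collects all descendant text of an element
# and collapses runs of whitespace into single spaces.
def element_text(element)
  REXML::XPath.match(element, './/text()').map(&:value).join.gsub(/\s+/, ' ').strip
end

# Hypothetical helper: returns the leading top-level elements whose combined
# text fits within `limit` characters. The first element is always kept,
# even if it alone exceeds the limit.
def leading_elements(fragment, limit)
  # REXML needs a single root element, so the fragment is wrapped.
  # Assumption: the fragment is well-formed XML.
  root = REXML::Document.new("<root>#{fragment}</root>").root
  kept = []
  total = 0
  root.elements.each do |element|
    length = element_text(element).length
    break if !kept.empty? && total + length > limit
    kept << element
    total += length
  end
  kept
end

html = <<~HTML
  <h1>hello world</h1>
  <p>Lets say these are 100 chars</p>
  <ul>
      <li>some bla bla, 40 chars</li>
  </ul>
  <p>some other text</p>
HTML

summary = leading_elements(html, 70)
puts summary.map { |element| element_text(element) }.join(' ') # tag-free summary
puts summary.map(&:to_s).join("\n")                            # markup of the kept elements
```

Because the loop works on whole elements, a limit that falls inside the last paragraph simply excludes that paragraph, so nothing has to be subtracted from the end of the string afterwards.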

萌逼全场 2024-10-05 05:44:42

Using XPath is the most robust and flexible approach. Here's a sample app:

require 'nokogiri'

html = <<End
<h1>hello world</h1>
<p>Lets say these are 100 chars.......................................................................</p>
<ul>
    <li>some bla bla, 40 chars</li>
</ul>
<p>some other text</p>
End

LIMIT = 150
summary = ""

doc = Nokogiri::HTML.parse(html)
doc.xpath('//text()').each do |node|
  text = node.text
  break if summary.length + text.length >= LIMIT
  summary << text
end

puts summary
puts summary.length

The XPath //text() simply selects all the text nodes in the document. If you want to be more specific about which elements you are interested in, you can refine the expression.

因为看清所以看轻 2024-10-05 05:44:41

A pure XPath 1.0 solution:

substring(/*,1,150)

where the parent of the provided XHTML fragment is the top element (/* or /html).

A very exact XPath 2.0 solution exists:

   for $t in (//text())[not(sum((.| preceding::text())/string-length(.)) gt 150)]
     return
       ($t, '
')

Note: the XML document must be parsed in a mode that discards whitespace-only text nodes. Otherwise, string-length(.) must be replaced by string-length(normalize-space(.)).
