Using XPath in Ruby to get the first few elements of an HTML fragment
For a blog-like project, I want to take the first few paragraphs, headers, lists or whatever else fits within a given number of characters from a Markdown-generated HTML fragment, and display that as a summary.
So if I have
<h1>hello world</h1>
<p>Lets say these are 100 chars</p>
<ul>
<li>some bla bla, 40 chars</li>
</ul>
<p>some other text</p>
Assume I want the summary to consist of the text within the first 150 characters. It does not have to be exact: I could simply take the first 150 characters including tags and go on with that, but it would probably leave artifacts at the tail that are harder to handle. For the fragment above, it should give me the h1, the first p and the ul, but not the final p (which would be truncated). If the first element alone has more than 150 characters, I would take the full first element.
How could I do this? With XPath or a regex? I'm a bit out of ideas on this...
Edit
First I want to give a big THANK YOU to all of you who replied!
While I got really great answers in this thread, I actually found it much easier to hook in before the Markdown interpreter kicks in: take the first n text blocks separated by \r\n\r\n and pass just those on for Markdown generation.
class String
  def summarize_md length
    arr = self.split(/\r\n\r\n/)
    sum = ""
    arr.each do |ea|
      break if sum.length + ea.length > length
      sum = sum + "#{ea}\r\n\r\n"
    end
    sum
  end
end
While this could probably be reduced to a one-liner, it is still much simpler and more CPU-friendly than any of the proposed solutions.
Anyway, since my question could be read as if the HTML were the starting point (and not the Markdown text), I'll just give the answer to the first guy... I hope that's just...
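As a side note on the method above: it returns an empty string when the very first block is already longer than the limit, and it only recognizes Windows line endings. Below is a minimal sketch of a variant that always keeps the first block and accepts both \n\n and \r\n\r\n as separators. The name summarize_markdown and the sample text are made up for illustration; this is not the poster's original code.

```ruby
# Hypothetical helper, not the method from the post above.
def summarize_markdown(text, limit)
  # Split on blank lines, accepting both Unix and Windows line endings.
  blocks = text.split(/(?:\r?\n){2,}/)
  kept = []
  total = 0
  blocks.each do |block|
    # Always keep the first block; stop before a later one would overflow.
    break if !kept.empty? && total + block.length > limit
    kept << block
    total += block.length
  end
  kept.join("\n\n")
end

md = "# hello world\n\nfirst paragraph\n\n* a list item\n\nsome other text"
puts summarize_markdown(md, 45)
```

The result is still Markdown, so it can be handed to the Markdown renderer exactly like the full text.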
Comments (4)
XSLT, of course!
This stylesheet:
Output:
For my uses I always wanted to strip tags because they could include all sorts of nastiness that would totally hose the display of the summary. They could also seriously skew the letter count, depending on the tags and whether they contain parameters.
I've used something like this many times.
Which outputs
I'll leave it to you to figure out how to ignore or subtract the text from the final <p> tag, but looking up that tag, grabbing its content and then stripping it from the end of the string shouldn't be too hard.
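For illustration, the tag-stripping idea can be sketched in plain Ruby. This is only a rough sketch and not the code this answer refers to: it strips tags with a crude regex, which is tolerable for HTML you generated yourself from Markdown but is no substitute for a real HTML parser. The name plain_text_summary and the default limit of 150 are assumptions.

```ruby
# Hypothetical helper: crude regex-based tag stripping, acceptable for
# trusted Markdown output but not a replacement for a real HTML parser.
def plain_text_summary(html, limit = 150)
  text = html.gsub(/<[^>]*>/, ' ') # replace every tag with a space
             .gsub(/\s+/, ' ')     # collapse runs of whitespace
             .strip
  return text if text.length <= limit

  # Take at most `limit` characters, ending on a word boundary; fall back
  # to a hard cut when the first word alone is longer than the limit.
  cut = text[/\A.{0,#{limit}}(?=\s|\z)/m] || text[0, limit]
  cut.rstrip + '...'
end

html = "<h1>hello world</h1>\n<p>Lets say these are 100 chars</p>\n" \
       "<ul>\n<li>some bla bla, 40 chars</li>\n</ul>\n<p>some other text</p>"
puts plain_text_summary(html, 18)
```

Because the count is taken after the tags are gone, attributes and long tag names no longer skew the length.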
Using XPath is the most robust and flexible. Here's a sample app:
The XPath //text() simply selects all the text nodes in the document. If you want to be more specific about which elements you are interested in, you can.
A pure XPath 1.0 solution:
substring(/*,1,150)
where the parent of the provided XHTML fragment is the top element (/* or /html).
A very exact XPath 2.0 solution exists:
Do note: the XML document must be parsed in a mode that discards whitespace-only text nodes. Otherwise string-length(.) must be replaced by string-length(normalize-space(.)).
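The idea of keeping whole top-level elements while their cumulative text length stays within the limit can also be sketched in Ruby. This is a minimal sketch, not the XPath 2.0 expression itself. It assumes the fragment is well-formed XML (XHTML) and that REXML is available (it ships as a bundled gem with recent Ruby versions); summarize_fragment and text_of are hypothetical names. Whitespace-only text nodes are neutralized by stripping each element's text before counting.

```ruby
require 'rexml/document'

# Concatenate the values of all text nodes below a node (hypothetical helper).
def text_of(node)
  return node.value if node.is_a?(REXML::Text)
  return '' unless node.is_a?(REXML::Element)
  node.children.map { |child| text_of(child) }.join
end

# Keep whole top-level elements while their cumulative text length stays
# within `limit`; the first element is always kept, even if it is too long.
def summarize_fragment(fragment, limit = 150)
  # Wrap the fragment so that it has a single root element.
  doc = REXML::Document.new("<root>#{fragment}</root>")
  kept = []
  total = 0
  doc.root.elements.each do |el|
    length = text_of(el).strip.length
    break if !kept.empty? && total + length > limit
    kept << el.to_s
    total += length
  end
  kept.join("\n")
end

fragment = "<h1>hello world</h1>\n<p>Lets say these are 100 chars</p>\n" \
           "<ul>\n<li>some bla bla, 40 chars</li>\n</ul>\n<p>some other text</p>"
puts summarize_fragment(fragment, 65)
```

With the sample fragment and a limit of 65, the h1, the first p and the ul fit, and the final p is left out whole instead of being truncated.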