Jsoup: Extracting all HTML between two blocks in CSS-less HTML

Posted on 2024-12-03 07:42:33

What would be an optimal way, using Jsoup, to extract all HTML (either to a String, Document or Elements) between two blocks that conform to this pattern:

<strong>
 {any HTML could appear here, except for a <strong> pair}
</strong>

 ...
 {This is the HTML I need to extract. 
  any HTML could appear here, except for a <strong> pair}
 ... 

<strong>
 {any HTML could appear here, except for a <strong> pair}
</strong>

Using a regex this could be simple, if I apply it on the entire body.html():

(<strong>.+</strong>)(.+)(<strong>.+</strong>)
                       ^
                       +----- There I have my HTML content
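
For reference, here is a minimal sketch of how that regex idea might be run in Java. It makes a few assumptions that are not in the pattern above: `Pattern.DOTALL` is set so that `.` can span line breaks, the quantifiers are made reluctant so each match stops at the nearest tag, and the `<strong>` tags carry no attributes. The class and method names (`StrongRegexSketch`, `between`) are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StrongRegexSketch {
    // DOTALL lets '.' match newlines; the reluctant quantifiers (+?, *?)
    // stop at the nearest </strong> or <strong> instead of the last one.
    private static final Pattern BETWEEN = Pattern.compile(
            "<strong>.+?</strong>(.*?)<strong>.+?</strong>", Pattern.DOTALL);

    /** Returns the HTML between the first two <strong> blocks, or null if there is no match. */
    static String between(String html) {
        Matcher m = BETWEEN.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<strong>A</strong>\n<p>wanted</p> text\n<strong>B</strong>";
        System.out.println(between(html).trim()); // prints: <p>wanted</p> text
    }
}
```

Without `DOTALL` and the reluctant quantifiers, the greedy `.+` would either fail on multi-line input or run through to the last `</strong>` in the body.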

But as I learned from a similar challenge, performance could be improved (even if the code is slightly longer) by working on the DOM that Jsoup has already parsed. This time, however, neither Element.nextSibling() nor Element.nextElementSibling() can come to the rescue.

I searched Jsoup for something like jQuery's nextUntil, for example, but couldn't find anything similar.

Is it possible to come up with something better than the above regex-based approach?

小矜持 2024-12-10 07:42:33

I don't know if it's faster but maybe something like this will work:

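import java.util.List;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

Elements strongs = doc.select("strong");
Element f = strongs.first();
Element l = strongs.last();
// siblingElements() excludes f itself, so index into the parent's children (assumes f and l share a parent).
Elements children = f.parent().children();
List<Element> result = children.subList(f.elementSiblingIndex() + 1, l.elementSiblingIndex());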
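
One caveat with an element-only approach like the one above is that bare text between the two markers is dropped, since text nodes are not elements. A rough nextUntil-style sketch that keeps text nodes as well could walk `Node.nextSibling()` in a loop. It assumes both `<strong>` markers are siblings under the same parent and that `selectFirst` is available (jsoup 1.11.1 or later); `BetweenStrongSketch`, `nodesBetween` and `htmlBetween` are hypothetical names, not part of Jsoup.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

class BetweenStrongSketch {
    /** Collects every sibling node (elements and text) after the first <strong>, up to the next <strong>. */
    static List<Node> nodesBetween(Document doc) {
        List<Node> result = new ArrayList<>();
        Element first = doc.selectFirst("strong");
        if (first == null) {
            return result; // no start marker, nothing to extract
        }
        for (Node n = first.nextSibling(); n != null; n = n.nextSibling()) {
            if (n instanceof Element && ((Element) n).tagName().equals("strong")) {
                break; // reached the closing marker
            }
            result.add(n);
        }
        return result;
    }

    /** Joins the collected nodes back into an HTML string. */
    static String htmlBetween(Document doc) {
        StringBuilder sb = new StringBuilder();
        for (Node n : nodesBetween(doc)) {
            sb.append(n.outerHtml());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(
                "<div><strong>A</strong> text <p>wanted</p><strong>B</strong></div>");
        System.out.println(htmlBetween(doc));
    }
}
```

If the second `<strong>` is missing, this loop simply collects everything up to the end of the parent, so a caller may want to check for that case separately.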
