Jsoup: Extracting all HTML between two blocks in CSS-less HTML
What would be an optimal way, using Jsoup, to extract all HTML (either to a String, Document or Elements) between two blocks that conform to this pattern:
<strong>
{any HTML could appear here, except for a <strong> pair}
</strong>
...
{This is the HTML I need to extract.
any HTML could appear here, except for a <strong> pair}
...
<strong>
{any HTML could appear here, except for a <strong> pair}
</strong>
Using a regex this could be simple, if I apply it on the entire body.html():
(<strong>.+</strong>)(.+)(<strong>.+</strong>)
                      ^
                      +----- There I have my HTML content
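For illustration, a minimal sketch of how such a regex might be applied in Java. It differs from the pattern above in two assumed ways: reluctant quantifiers (`.*?`), so the match stops at the nearest `<strong>` pair, and `Pattern.DOTALL`, so `.` also matches line breaks. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StrongRegex {
    // Group 1 captures whatever sits between the first two <strong>...</strong> pairs.
    // DOTALL lets '.' span newlines; '.*?' keeps each part as short as possible.
    private static final Pattern BETWEEN = Pattern.compile(
            "<strong>.*?</strong>(.*?)<strong>.*?</strong>", Pattern.DOTALL);

    /** Returns the HTML between the first two <strong> pairs, or null if there is no match. */
    static String between(String bodyHtml) {
        Matcher m = BETWEEN.matcher(bodyHtml);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<strong>A</strong>middle <p>x</p>\n<strong>B</strong>";
        System.out.println(between(html));
    }
}
```

This works on the raw string only, so it assumes the `<strong>` tags carry no attributes and are written in lowercase.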
But as I learned from a similar challenge, performance could be improved (even if the code is slightly longer) if I use an already Jsoup-parsed DOM -- except that this time neither Element.nextSibling() nor Element.nextElementSibling() can come to the rescue.
I searched for something like jQuery's nextUntil in Jsoup, for example, but couldn't really find something similar.
Is it possible to come up with something better than the above regex-based approach?
Comments (1)
I don't know if it's faster, but maybe something like this will work:
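A minimal sketch of one sibling-walking approach, roughly in the spirit of jQuery's nextUntil. It assumes both `<strong>` blocks share the same parent; if no second `<strong>` follows, it simply collects every remaining sibling. The class and method names (`BetweenStrong`, `nextUntilStrong`, `htmlBetweenFirstTwoStrongs`) are hypothetical and not part of Jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

import java.util.ArrayList;
import java.util.List;

public class BetweenStrong {

    /** Collects the siblings after {@code start}, stopping before the next <strong> element. */
    static List<Node> nextUntilStrong(Element start) {
        List<Node> between = new ArrayList<>();
        // nextSibling() walks text nodes as well as elements, so loose text is kept.
        for (Node n = start.nextSibling(); n != null; n = n.nextSibling()) {
            if (n instanceof Element && ((Element) n).tagName().equals("strong")) {
                break;
            }
            between.add(n);
        }
        return between;
    }

    /** Returns the HTML that follows the first <strong> in the document, up to the next sibling <strong>. */
    static String htmlBetweenFirstTwoStrongs(String html) {
        Document doc = Jsoup.parse(html);
        // Disable pretty-printing so the output is not re-indented.
        doc.outputSettings().prettyPrint(false);
        Element first = doc.select("strong").first();
        if (first == null) {
            return "";
        }
        StringBuilder sb = new StringBuilder();
        for (Node n : nextUntilStrong(first)) {
            sb.append(n.outerHtml());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "<div><strong>A</strong>text <p>para</p><em>x</em><strong>B</strong></div>";
        System.out.println(htmlBetweenFirstTwoStrongs(html));
    }
}
```

If the two `<strong>` blocks sit at different depths of the tree, this sibling walk would not be enough and a full traversal would be needed instead.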