Extract text between two <hr> tags in CSS-less HTML
Using Jsoup, what would be an optimal approach to extract text whose pattern is known ([number]%%[number]) but which resides in an HTML page that uses neither CSS nor divs, spans, classes or identifiers of any kind (yup, an old HTML page that I have no control over)?
The only thing that consistently identifies that text segment (and is guaranteed to remain like that) is that its HTML always looks like this (within a larger body of HTML):
<hr>
2%%17
<hr>
(The numbers 2 and 17 are examples only. They could be any numbers and, in fact, they are the two variables that I need to reliably extract from that HTML page.)
If that text were within an enclosing and uniquely identifying <span> or <div>, I would have no problem extracting it using Jsoup. The problem is that this isn't the case, and the only way I can think of right now (which is not elegant at all) is to process the raw HTML through a regex.
Processing the raw HTML through a regex seems inefficient, however, because I have already parsed it into a DOM via Jsoup.
Suggestions?
Comments (1)
How about this?
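A minimal sketch of one DOM-based approach, which is not necessarily what this reply had in mind: walk the <hr> elements Jsoup has already parsed, look at the text node sitting directly between two of them, and apply the regex only to that small string instead of to the raw HTML. The class name HrPairExtractor and the method extractPair are made up for illustration, and the sketch assumes both numbers fit in an int and that the text is a direct sibling of the two <hr> tags.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

// Hypothetical helper class; the names are illustrative only.
class HrPairExtractor {
    // Matches text such as "2%%17", allowing surrounding whitespace/newlines.
    private static final Pattern PAIR = Pattern.compile("\\s*(\\d+)%%(\\d+)\\s*");

    // Returns the two numbers found between adjacent <hr> tags, if any.
    static Optional<int[]> extractPair(String html) {
        Document doc = Jsoup.parse(html);
        for (Element hr : doc.select("hr")) {
            // The node right after this <hr> must be plain text...
            Node next = hr.nextSibling();
            if (!(next instanceof TextNode)) {
                continue;
            }
            // ...and the node right after that text must be another <hr>.
            Node after = next.nextSibling();
            if (!(after instanceof Element) || !"hr".equals(after.nodeName())) {
                continue;
            }
            Matcher m = PAIR.matcher(((TextNode) next).getWholeText());
            if (m.matches()) {
                // Assumption: both numbers fit in an int.
                return Optional.of(new int[] {
                    Integer.parseInt(m.group(1)),
                    Integer.parseInt(m.group(2))
                });
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String html = "<html><body><p>before</p><hr>\n2%%17\n<hr><p>after</p></body></html>";
        extractPair(html).ifPresent(p -> System.out.println(p[0] + " and " + p[1]));
    }
}
```

The regex still does the final split, but it only ever sees the contents of one text node, so the parsing work Jsoup has already done is reused. If the real page wraps the text in another element (a <p> or <font>, say), the sibling checks would need to be adjusted accordingly.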