Extracting text between two <hr> tags in HTML with no CSS

Posted on 2024-12-02 21:22:45

Using Jsoup, what would be an optimal approach to extract text whose pattern is known ([number]%%[number]) but which resides in an HTML page that uses neither CSS nor divs, spans, classes, or any other kind of identifying markup (yup, an old HTML page over which I have no control)?

The only thing that consistently identifies that text segment (and is guaranteed to remain that way) is that its HTML always looks like this (within a larger body of HTML):

<hr>
2%%17
<hr>

(The numbers 2 and 17 are examples only. They could be any numbers and, in fact, these are the two variables that I need to reliably extract from that HTML page.)

If that text were within an enclosing and uniquely identifying <span> or <div>, I would have no problem extracting it using Jsoup. The problem is that this isn't the case and the only way I can think of right now (which is not elegant at all) is to process the raw HTML through a regex.

Processing the raw HTML through a regex seems inefficient, however, because I already have it parsed via Jsoup into a DOM.
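For reference, the raw-HTML regex fallback described above could look roughly like the sketch below. It uses only java.util.regex; the class name RawHtmlPairs and the method extract are made up for illustration. It assumes the <hr> tags carry no attributes and that the numbers fit in a long.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: name and shape are illustrative, not from the original post.
public class RawHtmlPairs {
    // <hr>, optional whitespace, digits, %%, digits, optional whitespace, <hr>.
    // The lookahead leaves the closing <hr> unconsumed so it can open the next match.
    // Assumption: <hr> has no attributes (only an optional trailing slash).
    private static final Pattern PAIR = Pattern.compile(
            "<hr\\s*/?>\\s*(\\d+)%%(\\d+)\\s*(?=<hr\\s*/?>)",
            Pattern.CASE_INSENSITIVE);

    /** Returns every [first, second] pair found between two <hr> tags. */
    public static List<long[]> extract(String html) {
        List<long[]> pairs = new ArrayList<>();
        Matcher m = PAIR.matcher(html);
        while (m.find()) {
            // Long.parseLong throws NumberFormatException if a number exceeds the long range.
            pairs.add(new long[] {Long.parseLong(m.group(1)), Long.parseLong(m.group(2))});
        }
        return pairs;
    }

    public static void main(String[] args) {
        String html = "<p>intro</p>\n<hr>\n2%%17\n<hr>\n<p>footer</p>";
        for (long[] pair : extract(html)) {
            System.out.println(pair[0] + " and " + pair[1]); // prints "2 and 17"
        }
    }
}
```

This works on the unparsed string, so it scans the page a second time when a Jsoup DOM already exists, which is the inefficiency the question is worried about.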

Suggestions?

Comments (1)

七堇年 2024-12-09 21:22:45

How about this?

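import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

// url is assumed to be a String holding the page address.
Document document = Jsoup.connect(url).get();
Pattern pattern = Pattern.compile("(\\d+%%\\d+)");

for (Element hr : document.select("hr")) {
    Node next = hr.nextSibling();
    if (next == null) {
        continue; // this <hr> is the last node in its parent, nothing follows it
    }
    Matcher matcher = pattern.matcher(next.toString());

    while (matcher.find()) {
        System.out.println(matcher.group(1)); // <-- There, your data.
    }
}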
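
A possible refinement, sketched under a few assumptions: the loop above accepts whatever node follows each <hr>, whereas the question says the text always sits between two <hr> tags and that the two numbers are needed separately. The variant below checks both neighbours and parses the numbers as ints. The class name HrPairExtractor and the method extract are hypothetical; it parses an HTML string with Jsoup.parse so it can be tried offline, and it needs the Jsoup jar on the classpath.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

// Hypothetical helper: name and shape are illustrative, not part of Jsoup.
public class HrPairExtractor {
    private static final Pattern PAIR = Pattern.compile("(\\d+)%%(\\d+)");

    /** Returns [first, second] for every text node that sits directly between two <hr> elements. */
    public static List<int[]> extract(String html) {
        Document doc = Jsoup.parse(html);
        List<int[]> pairs = new ArrayList<>();
        for (Element hr : doc.select("hr")) {
            Node text = hr.nextSibling();
            if (!(text instanceof TextNode)) {
                continue; // nothing, or an element, follows this <hr>
            }
            Node closing = text.nextSibling();
            if (!(closing instanceof Element) || !"hr".equals(closing.nodeName())) {
                continue; // the text is not closed off by a second <hr>
            }
            Matcher m = PAIR.matcher(((TextNode) text).text().trim());
            if (m.matches()) {
                // Assumption: both numbers fit in an int; otherwise parseInt throws.
                pairs.add(new int[] {Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2))});
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (int[] pair : extract("<p>intro</p><hr>\n2%%17\n<hr><p>footer</p>")) {
            System.out.println(pair[0] + " and " + pair[1]);
        }
    }
}
```

The regex here only ever sees the short text node, not the whole page, so the DOM that Jsoup already built does the work of locating the candidate text.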