Parsing out all HTML tags/non-text; Java

Posted on 2024-12-20 23:46:24

What is the best way to take the HTML from a webpage, strip all of the HTML tags, JavaScript code, and anything else that isn't text to be displayed, and finally return that text with separators between the pieces that were wrapped in different HTML tags?

First I tried using jsoup:

Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/Main_Page").get();
String html = doc.body().text();

This is good for taking out all the non-text, but the result has no separators between the pieces of text.

I'm currently trying to use regex like:

html.replaceAll("\\<.*?\\>", "")

But I'm not really familiar with regex, and I'm having trouble stripping out the JavaScript. This method does, however, keep the newlines, which I can use to tell apart the separate groups of text that came from different tags.

I was just wondering if there was some easy way of doing this before I try more regex to get it to work.

Thanks

看春风乍起 2024-12-27 23:46:24

It looks like jsoup doesn't provide an immediately obvious way to do that, so I made a quick hack by editing the source code and adding the method text_mod() to Element. There are limitations to this approach, but if you find it useful, you can download the modified jar at http://ge.tt/9PAMpzA.

Here's the addition:

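// Added inside jsoup's Element class (a patch to the library source, not public API).
public String text_mod() {
    StringBuilder sb = new StringBuilder();
    text_mod(sb);
    // Collapse runs of newlines so empty nodes don't leave blank lines.
    return sb.toString().trim().replaceAll("\n+", "\n");
}

// appendWhitespaceIfBr and appendNormalisedText are helpers Element already had.
private void text_mod(StringBuilder accum) {
    appendWhitespaceIfBr(this, accum);

    for (Node child : childNodes) {
        if (child instanceof TextNode) {
            TextNode textNode = (TextNode) child;
            appendNormalisedText(accum, textNode);
        } else if (child instanceof Element) {
            Element element = (Element) child;
            // Block-level check, disabled in favor of the unconditional newline below:
            // if (accum.length() > 0 && element.isBlock() && !TextNode.lastCharIsWhitespace(accum))
            //     accum.append("\n");
            element.text_mod(accum);
        }
        // Emit a newline after every child node, text or element.
        accum.append("\n");
    }
}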

For example, try this:

import org.jsoup.Jsoup;

public class Test {
    public static void main(String[] args){
        String html = "<html><head><title>HTML</title></head>"
              + "<body><p>Paragraph 1.</p><p>Paragraph 2.</p></body></html>";
        System.out.println(Jsoup.parse(html).body().text_mod());
    }
}

I get

Paragraph 1.
Paragraph 2.

寄风 2024-12-27 23:46:24

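Regular expressions generally don't work on arbitrary HTML, because a regular expression cannot fully parse HTML (the technical reason is known as the pumping lemma, which isn't important for the task at hand).

I'd suggest starting with an XML parser (assuming your HTML isn't doing anything too strange) and then looking through the parse tree for the data inside displayable tags. XPath expressions would be very useful here.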
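
As a rough sketch of that suggestion, here is one way it might look with only the JDK's built-in XML parser and XPath support. It assumes the page is well-formed XHTML with no namespace declaration and no DOCTYPE; real-world HTML often is not, and would need an HTML-tolerant parser first. The class name XPathTextExtractor and the XPath expression are my own illustration, not something from the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: collects each displayable text node as its own piece.
public class XPathTextExtractor {
    // Text nodes under <body> that are not inside <script> or <style>
    // and that contain something other than whitespace.
    private static final String VISIBLE_TEXT =
        "//body//text()[not(ancestor::script or ancestor::style)][normalize-space()]";

    public static List<String> extract(String xhtml) throws Exception {
        // Assumption: the input parses as XML (well-formed, no DOCTYPE to resolve).
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xhtml)));
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
            .evaluate(VISIBLE_TEXT, doc, XPathConstants.NODESET);
        List<String> pieces = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            pieces.add(nodes.item(i).getNodeValue().trim());
        }
        return pieces;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><head><title>HTML</title></head><body>"
            + "<p>Paragraph 1.</p><script>var x = 1;</script>"
            + "<p>Paragraph 2.</p></body></html>";
        // Join the pieces with whatever separator is needed; a newline here.
        System.out.println(String.join("\n", extract(xhtml)));
    }
}
```

Returning a list rather than a joined string leaves the choice of separator to the caller, which seems closer to what the question asks for.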

眉目亦如画i 2024-12-27 23:46:24

In JavaScript with the DOM you can get the text of any HTML element with the textContent or innerText properties of the DOM element. If you do this for the BODY element, you have a "text" version of the page.

var body = document.getElementsByTagName('body')[0];
var bodyText = body.textContent || body.innerText;
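
Since the question is about Java, it may help to see the closest equivalent there: the JDK's DOM API exposes the same idea as Node.getTextContent(). The sketch below assumes well-formed XHTML input, and the class name BodyTextContent is made up for illustration. It also shows two things to watch for with this approach: the text of a script element is included, and nothing separates the pieces.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: the Java DOM counterpart of body.textContent.
public class BodyTextContent {
    public static String bodyText(String xhtml) throws Exception {
        // Assumption: the input parses as XML (well-formed, no DOCTYPE to resolve).
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xhtml)));
        Node body = doc.getElementsByTagName("body").item(0);
        // Concatenates the text of every descendant, script content included.
        return body.getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><p>Paragraph 1.</p>"
            + "<script>var x = 1;</script><p>Paragraph 2.</p></body></html>";
        System.out.println(bodyText(xhtml));
        // prints: Paragraph 1.var x = 1;Paragraph 2.
    }
}
```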

About the author

梦幻之岛
