当前位置：文江博客话题详情

如何使用 Jsoup 搜索评论（“”）？

发布于 2024-12-06 03:08:43 字数 29 浏览 1 评论 0原文

我想从源 HTML 中删除这些标签及其内容。

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

金橙橙 2024-12-13 03:08:43

搜索时，您基本上使用 Elements.select(selector) ，其中 selector 由此 API。然而，从技术上讲，注释并不是元素，因此您可能会感到困惑，它们仍然是由节点名称 #comment 标识的节点。

让我们看看它是如何工作的：

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;

public class RemoveComments {
    public static void main(String... args) {
        String h = "<html><head></head><body>" +
          "<div><!-- foo --><p>bar<!-- baz --></div><!--qux--></body></html>";
        Document doc = Jsoup.parse(h);
        removeComments(doc);
        doc.html(System.out);
    }

    private static void removeComments(Node node) {
        for (int i = 0; i < node.childNodeSize();) {
            Node child = node.childNode(i);
            if (child.nodeName().equals("#comment"))
                child.remove();
            else {
                removeComments(child);
                i++;
            }
        }
    }        
}

When searching you basically use Elements.select(selector) where selector is defined by this API. However comments are not elements technically, so you may be confused here, still they are nodes identified by the node name #comment.

Let's see how that might work:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;

public class RemoveComments {
    public static void main(String... args) {
        String h = "<html><head></head><body>" +
          "<div><!-- foo --><p>bar<!-- baz --></div><!--qux--></body></html>";
        Document doc = Jsoup.parse(h);
        removeComments(doc);
        doc.html(System.out);
    }

    private static void removeComments(Node node) {
        for (int i = 0; i < node.childNodeSize();) {
            Node child = node.childNode(i);
            if (child.nodeName().equals("#comment"))
                child.remove();
            else {
                removeComments(child);
                i++;
            }
        }
    }        
}

回复收藏 0 原文

冰雪之触 2024-12-13 03:08:43

使用 JSoup 1.11+（可能是旧版本），您可以应用过滤器：

private void removeComments(Element article) {
    article.filter(new NodeFilter() {
        @Override
        public FilterResult tail(Node node, int depth) {
            if (node instanceof Comment) {
                return FilterResult.REMOVE;
            }
            return FilterResult.CONTINUE;
        }

        @Override
        public FilterResult head(Node node, int depth) {
            if (node instanceof Comment) {
                return FilterResult.REMOVE;
            }
            return FilterResult.CONTINUE;
        }
    });
}

With JSoup 1.11+ (possibly older version) you can apply a filter:

private void removeComments(Element article) {
    article.filter(new NodeFilter() {
        @Override
        public FilterResult tail(Node node, int depth) {
            if (node instanceof Comment) {
                return FilterResult.REMOVE;
            }
            return FilterResult.CONTINUE;
        }

        @Override
        public FilterResult head(Node node, int depth) {
            if (node instanceof Comment) {
                return FilterResult.REMOVE;
            }
            return FilterResult.CONTINUE;
        }
    });
}

回复收藏 0 原文

飞烟轻若梦 2024-12-13 03:08:43

参考@dlamblin https://stackoverflow.com/a/7541875/4712855此代码获取评论html

public static void getHtmlComments(Node node) {
    for (int i = 0; i < node.childNodeSize();i++) {
        Node child = node.childNode(i);
        if (child.nodeName().equals("#comment")) {
            Comment comment = (Comment) child;
            child.after(comment.getData());
            child.remove();
        }
        else {
            getHtmlComments(child);
        }
    }
}

reference @dlamblin https://stackoverflow.com/a/7541875/4712855 this code get comment html

public static void getHtmlComments(Node node) {
    for (int i = 0; i < node.childNodeSize();i++) {
        Node child = node.childNode(i);
        if (child.nodeName().equals("#comment")) {
            Comment comment = (Comment) child;
            child.after(comment.getData());
            child.remove();
        }
        else {
            getHtmlComments(child);
        }
    }
}

回复收藏 0 原文

萝莉病 2024-12-13 03:08:43

基于 @dlamblin 的答案，Java 8 函数式方法（抱歉，但这似乎比 @Feuerrabe 的答案更简单、更清晰）

private void removeComments(Node node) {
    node.childNodes().stream().filter(n -> "#comment".equals(n.nodeName())).forEach(Node::remove);
    node.childNodes().forEach(this::removeComments);
}

Document doc = Jsoup.parse(html);
removeComments(doc);
// ...

Based on the answer from @dlamblin a Java 8 functional approach (sorry but this seems to be a littler simpler and cleaner than the aswer from @Feuerrabe)

private void removeComments(Node node) {
    node.childNodes().stream().filter(n -> "#comment".equals(n.nodeName())).forEach(Node::remove);
    node.childNodes().forEach(this::removeComments);
}

Document doc = Jsoup.parse(html);
removeComments(doc);
// ...

回复收藏 0 原文

呆 2024-12-13 03:08:43

这是使用函数式编程方法的第一个示例的变体。查找所有注释（当前节点的直接子节点）的最简单方法是在 .childNodes() 流上使用 .filter()

public void removeComments(Element e) {
    e.childNodes().stream()
        .filter(n -> n.nodeName().equals("#comment")).collect(Collectors.toList())
        .forEach(n -> n.remove());
    e.children().forEach(elem -> removeComments(elem));
}

完整示例：

package demo;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintStream;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.stream.Collectors;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
public class Demo {

public static void removeComments(Element e) {
    e.childNodes().stream()
        .filter(n -> n.nodeName().equals("#comment")).collect(Collectors.toList())
        .forEach(n -> n.remove());
    e.children().forEach(elem -> removeComments(elem));
}

public static void main(String[] args) throws MalformedURLException, IOException {
    Document doc = Jsoup.parse(new URL("https://en.wikipedia.org/"), 500);

    // do not try this with JDK < 8
    String userHome = System.getProperty("user.home");
    PrintStream out = new PrintStream(new FileOutputStream(userHome + File.separator + "before.html"));
    out.print(doc.outerHtml());
    out.close();

    removeComments(doc);
    out = new PrintStream(new FileOutputStream(userHome + File.separator + "after.html"));
    out.print(doc.outerHtml());
    out.close();
}

}

This is a variation of the first example using a functional programming approach. The easiest way to find all comments, which are immediate children of the current node is to use .filter() on a stream of .childNodes()

public void removeComments(Element e) {
    e.childNodes().stream()
        .filter(n -> n.nodeName().equals("#comment")).collect(Collectors.toList())
        .forEach(n -> n.remove());
    e.children().forEach(elem -> removeComments(elem));
}

Full example:

package demo;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintStream;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.stream.Collectors;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
public class Demo {

public static void removeComments(Element e) {
    e.childNodes().stream()
        .filter(n -> n.nodeName().equals("#comment")).collect(Collectors.toList())
        .forEach(n -> n.remove());
    e.children().forEach(elem -> removeComments(elem));
}

public static void main(String[] args) throws MalformedURLException, IOException {
    Document doc = Jsoup.parse(new URL("https://en.wikipedia.org/"), 500);

    // do not try this with JDK < 8
    String userHome = System.getProperty("user.home");
    PrintStream out = new PrintStream(new FileOutputStream(userHome + File.separator + "before.html"));
    out.print(doc.outerHtml());
    out.close();

    removeComments(doc);
    out = new PrintStream(new FileOutputStream(userHome + File.separator + "after.html"));
    out.print(doc.outerHtml());
    out.close();
}

}

回复收藏 0 原文

~没有更多了~