How to replace words with a span tag using jsoup?

Posted on 2024-11-17 22:35:35

Assume I have the following html:

<html>
<head>
</head>
<body>
    <div id="wrapper" >
         <div class="s2">I am going <a title="some title" href="">by flying</a>
           <p>mr tt</p>
         </div> 
    </div>
</body>    
</html>

Any word in a text node that is 4 or more characters long (for example, the word 'going') should be replaced with HTML content (not text), <span>going</span>, in the original HTML, without changing anything else.

If I try something like element.html(replacement), the problem is that if the current element is <div class="s2">, it will also wipe out <a title="some title"

甜｀诱少女 2024-11-24 22:35:35

In this case you must traverse your document as suggested by this answer. Here's a way of doing it using Jsoup APIs:

NodeTraversor and NodeVisitor allow you to traverse the DOM
Node.replaceWith(...) allows for replacing a node in the DOM

Here's the code:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.DataNode;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeTraversor;
import org.jsoup.select.NodeVisitor;

public class JsoupReplacer {

  // A "word" is a run of 4 or more non-whitespace characters.
  private static final Pattern LONG_WORD = Pattern.compile("\\S{4,}");

  public static void main(String[] args) {
    so6527876();
  }

  public static void so6527876() {
    String html = 
    "<html>" +
    "<head>" +
    "</head>" +
    "<body>" +
    "    <div id=\"wrapper\" >" +
    "         <div class=\"s2\">I am going <a title=\"some title\" href=\"\">by flying</a>" +
    "           <p>mr tt</p>" +
    "         </div> " +
    "    </div>" +
    "</body>    " +
    "</html>";
    Document doc = Jsoup.parse(html);

    // Collect first and replace afterwards, so the tree is not modified during traversal.
    final List<TextNode> nodesToChange = new ArrayList<TextNode>();

    // Static form used by recent jsoup releases; older releases used
    // new NodeTraversor(visitor).traverse(root) and new DataNode(data, baseUri).
    NodeTraversor.traverse(new NodeVisitor() {

      @Override
      public void head(Node node, int depth) {
      }

      @Override
      public void tail(Node node, int depth) {
        if (node instanceof TextNode) {
          TextNode textNode = (TextNode) node;
          if (LONG_WORD.matcher(textNode.getWholeText()).find()) {
            nodesToChange.add(textNode);
          }
        }
      }
    }, doc.body());

    for (TextNode textNode : nodesToChange) {
      Node newNode = buildElementForText(textNode);
      textNode.replaceWith(newNode);
    }

    System.out.println("result: ");
    System.out.println();
    System.out.println(doc);
  }

  // Wraps every long word in <span>...</span> in a single pass, so a word that
  // happens to be "span" or a substring of another word cannot corrupt the markup.
  // The markup is emitted through a DataNode, which jsoup prints unescaped in HTML
  // output: the spans appear in the serialized HTML but are not Element nodes in
  // the DOM, and characters such as < or & in the original text are not re-escaped.
  private static Node buildElementForText(TextNode textNode) {
    String newText = LONG_WORD.matcher(textNode.getWholeText())
        .replaceAll("<span>$0</span>");
    return new DataNode(newText);
  }

}

城歌 2024-11-24 22:35:35

I think you need to traverse the tree. The result of text() on an Element will be all of the Element's text including text within child elements. Hopefully something like the following code will be helpful to you:

import java.io.File;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.commons.io.FileUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class ScreenScrape {

    public static void main(String[] args) throws IOException {
        // The file is assumed to be UTF-8 encoded.
        String content = FileUtils.readFileToString(new File("test.html"), "UTF-8");
        Document doc = Jsoup.parse(content);
        Element body = doc.body();

        // Rebuild the markup by hand while walking the tree.
        StringBuilder sb = new StringBuilder();
        traverse(body, sb);

        System.out.println(sb.toString());
    }

    // Depth-first walk: emit the opening tag, then the children, then the closing tag.
    private static void traverse(Node n, StringBuilder sb) {
        if (n instanceof Element) {
            sb.append('<');
            sb.append(n.nodeName());            
            if (n.attributes().size() > 0) {
                sb.append(n.attributes().toString());
            }
            sb.append('>');
        }
        if (n instanceof TextNode) {
            TextNode tn = (TextNode) n;
            // Whitespace-only text nodes are skipped, so the original indentation is lost.
            if (!tn.isBlank()) {
                sb.append(spanifyText(tn.text()));
            }
        }
        for (Node c : n.childNodes()) {
            traverse(c, sb);
        }
        if (n instanceof Element) {
            sb.append("</");
            sb.append(n.nodeName());
            sb.append('>');
        }        
    }

    // Wraps every whitespace-separated token of 4 or more characters in a span.
    private static String spanifyText(String text){
        StringBuilder sb = new StringBuilder();
        StringTokenizer st = new StringTokenizer(text);
        String token;
        while (st.hasMoreTokens()) {
             token = st.nextToken();
             if(token.length() > 3){
                 sb.append("<span>");
                 sb.append(token);
                 sb.append("</span>");
             } else {
                 sb.append(token);
             }             
             sb.append(' ');
        }
        // Drop the trailing space added after the last token.
        return sb.toString().trim();
    }

}
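As a side note on spanifyText above: it collapses runs of whitespace and drops leading and trailing spaces, and it does not escape characters such as < or & in the text. The following is only a sketch of one way to keep the whitespace exactly as it was and escape the text, using nothing but java.util.regex. The names SpanWrapper, wrapLongWords and escape are made up for this illustration, and a "word" is assumed to be any run of non-whitespace characters.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SpanWrapper {

    // Wraps each run of at least minLength non-whitespace characters in <span>...</span>.
    // Everything between matches (short words and whitespace) is copied through unchanged.
    static String wrapLongWords(String text, int minLength) {
        Pattern longWord = Pattern.compile("\\S{" + minLength + ",}");
        Matcher m = longWord.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(escape(text.substring(last, m.start())));
            out.append("<span>").append(escape(m.group())).append("</span>");
            last = m.end();
        }
        out.append(escape(text.substring(last)));
        return out.toString();
    }

    // Minimal HTML escaping for text content; & must be handled first.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        // prints: I am <span>going</span> (followed by the original trailing space)
        System.out.println(wrapLongWords("I am going ", 4));
    }
}
```

Calling wrapLongWords(tn.getWholeText(), 4) instead of spanifyText(tn.text()) would be the corresponding change in traverse, since getWholeText() returns the text without whitespace normalization.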

UPDATE

Using Jonathan's new jsoup method, List<TextNode> Element.textNodes(), and combining it with MarcoS's suggested NodeTraversor/NodeVisitor technique, I came up with the following (although I am modifying the tree whilst traversing it, which is probably a bad idea):

// Additional imports needed: java.util.ArrayList, org.jsoup.parser.Tag,
// org.jsoup.select.NodeTraversor, org.jsoup.select.NodeVisitor
Document doc = Jsoup.parse(content);
Element body = doc.body();
// Static form used by recent jsoup releases; older releases used
// new NodeTraversor(visitor).traverse(root) and new TextNode(text, baseUri).
NodeTraversor.traverse(new NodeVisitor() {

    @Override
    public void head(Node node, int depth) {
    }

    @Override
    public void tail(Node node, int depth) {
        if (node instanceof Element) {
            Element elem = (Element) node;
            // textNodes() returns a copy of the list, so replacing nodes inside the loop is safe.
            for (TextNode tn : elem.textNodes()) {
                boolean foundLongWord = false;
                ArrayList<Node> changedNodes = new ArrayList<Node>();
                StringTokenizer st = new StringTokenizer(tn.text());
                while (st.hasMoreTokens()) {
                    String token = st.nextToken();
                    if (token.length() > 3) {
                        foundLongWord = true;
                        Element span = new Element(Tag.valueOf("span"), elem.baseUri());
                        span.appendText(token);
                        changedNodes.add(span);
                        // Keep a space after the span so adjacent words do not run together.
                        changedNodes.add(new TextNode(" "));
                    } else {
                        changedNodes.add(new TextNode(token + " "));
                    }
                }
                if (foundLongWord) {
                    // Swap the text node for the first new node, then chain the rest after it.
                    Node currentNode = changedNodes.remove(0);
                    tn.replaceWith(currentNode);
                    for (Node n : changedNodes) {
                        currentNode.after(n);
                        currentNode = n;
                    }
                }
            }
        }
    }
}, body);
System.out.println(body.toString());

彻夜缠绵 2024-11-24 22:35:35

I am replacing the word hello with hello wrapped in a span tag:

Document doc = Jsoup.parse(content);
Element test = doc.body();
Elements elements = test.getAllElements();
for (int i = 0; i < elements.size(); i++) {
    Element element = elements.get(i);
    String elementText = element.text();
    if (elementText.contains("hello")) {
        // Caution: html(...) replaces all of the element's children with the new markup,
        // so child elements such as <a> are flattened to plain text.
        element.html(elementText.replace("hello", "<span style=\"color:blue\">hello</span>"));
    }
}
