java中的html截断器

发布于 2024-08-26 07:52:33 字数 165 浏览 3 评论 0原文

是否有任何实用程序(或示例源代码)可以在 Java 中截断 HTML(用于预览)?我想在服务器上而不是在客户端上进行截断。

我正在使用 HTMLUnit 来解析 HTML。

更新:
我希望能够预览 HTML,因此截断器将保持 HTML 结构,同时在所需的输出长度之后删除元素。

Is there any utility (or sample source code) that truncates HTML (for preview) in Java? I want to do the truncation on the server and not on the client.

I'm using HTMLUnit to parse HTML.

UPDATE:
I want to be able to preview the HTML, so the truncator would maintain the HTML structure while stripping out the elements after the desired output length.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(6

十年不长 2024-09-02 07:52:33

我编写了另一个 java 版本的 truncateHTML。此函数将字符串截断至多个字符,同时保留整个单词和 HTML 标签。

public static String truncateHTML(String text, int length, String suffix) {
    // if the plain text is shorter than the maximum length, return the whole text
    if (text.replaceAll("<.*?>", "").length() <= length) {
        return text;
    }
    String result = "";
    boolean trimmed = false;
    if (suffix == null) {
        suffix = "...";
    }

    /*
     * This pattern creates tokens, where each line starts with the tag.
     * For example, "One, <b>Two</b>, Three" produces the following:
     *     One,
     *     <b>Two
     *     </b>, Three
     */
    Pattern tagPattern = Pattern.compile("(<.+?>)?([^<>]*)");

    /*
     * Checks for an empty tag, for example img, br, etc.
     */
    Pattern emptyTagPattern = Pattern.compile("^<\\s*(img|br|input|hr|area|base|basefont|col|frame|isindex|link|meta|param).*>$");

    /*
     * Modified the pattern to also include H1-H6 tags
     * Checks for closing tags, allowing leading and ending space inside the brackets
     */
    Pattern closingTagPattern = Pattern.compile("^<\\s*/\\s*([a-zA-Z]+[1-6]?)\\s*>$");

    /*
     * Modified the pattern to also include H1-H6 tags
     * Checks for opening tags, allowing leading and ending space inside the brackets
     */
    Pattern openingTagPattern = Pattern.compile("^<\\s*([a-zA-Z]+[1-6]?).*?>$");

    /*
     * Find   > ...
     */
    Pattern entityPattern = Pattern.compile("(&[0-9a-z]{2,8};|&#[0-9]{1,7};|[0-9a-f]{1,6};)");

    // splits all html-tags to scanable lines
    Matcher tagMatcher =  tagPattern.matcher(text);
    int numTags = tagMatcher.groupCount();

    int totalLength = suffix.length();
    List<String> openTags = new ArrayList<String>();

    boolean proposingChop = false;
    while (tagMatcher.find()) {
        String tagText = tagMatcher.group(1);
        String plainText = tagMatcher.group(2);

        if (proposingChop &&
                tagText != null && tagText.length() != 0 &&
                plainText != null && plainText.length() != 0) {
            trimmed = true;
            break;
        }

        // if there is any html-tag in this line, handle it and add it (uncounted) to the output
        if (tagText != null && tagText.length() > 0) {
            boolean foundMatch = false;

            // if it's an "empty element" with or without xhtml-conform closing slash
            Matcher matcher = emptyTagPattern.matcher(tagText);
            if (matcher.find()) {
                foundMatch = true;
                // do nothing
            }

            // closing tag?
            if (!foundMatch) {
                matcher = closingTagPattern.matcher(tagText);
                if (matcher.find()) {
                    foundMatch = true;
                    // delete tag from openTags list
                    String tagName = matcher.group(1);
                    openTags.remove(tagName.toLowerCase());
                }
            }

            // opening tag?
            if (!foundMatch) {
                matcher = openingTagPattern.matcher(tagText);
                if (matcher.find()) {
                    // add tag to the beginning of openTags list
                    String tagName = matcher.group(1);
                    openTags.add(0, tagName.toLowerCase());
                }
            }

            // add html-tag to result
            result += tagText;
        }

        // calculate the length of the plain text part of the line; handle entities (e.g.  ) as one character
        int contentLength = plainText.replaceAll("&[0-9a-z]{2,8};|&#[0-9]{1,7};|[0-9a-f]{1,6};", " ").length();
        if (totalLength + contentLength > length) {
            // the number of characters which are left
            int numCharsRemaining = length - totalLength;
            int entitiesLength = 0;
            Matcher entityMatcher = entityPattern.matcher(plainText);
            while (entityMatcher.find()) {
                String entity = entityMatcher.group(1);
                if (numCharsRemaining > 0) {
                    numCharsRemaining--;
                    entitiesLength += entity.length();
                } else {
                    // no more characters left
                    break;
                }
            }

            // keep us from chopping words in half
            int proposedChopPosition = numCharsRemaining + entitiesLength;
            int endOfWordPosition = plainText.indexOf(" ", proposedChopPosition-1);
            if (endOfWordPosition == -1) {
                endOfWordPosition = plainText.length();
            }
            int endOfWordOffset = endOfWordPosition - proposedChopPosition;
            if (endOfWordOffset > 6) { // chop the word if it's extra long
                endOfWordOffset = 0;
            }

            proposedChopPosition = numCharsRemaining + entitiesLength + endOfWordOffset;
            if (plainText.length() >= proposedChopPosition) {
                result += plainText.substring(0, proposedChopPosition);
                proposingChop = true;
                if (proposedChopPosition < plainText.length()) {
                    trimmed = true;
                    break; // maximum length is reached, so get off the loop
                }
            } else {
                result += plainText;
            }
        } else {
            result += plainText;
            totalLength += contentLength;
        }
        // if the maximum length is reached, get off the loop
        if(totalLength >= length) {
            trimmed = true;
            break;
        }
    }

    for (String openTag : openTags) {
        result += "</" + openTag + ">";
    }
    if (trimmed) {
        result += suffix;
    }
    return result;
}

I've written another java version of truncateHTML. This function truncates a string up to a number of characters while preserving whole words and HTML tags.

public static String truncateHTML(String text, int length, String suffix) {
    // if the plain text is shorter than the maximum length, return the whole text
    if (text.replaceAll("<.*?>", "").length() <= length) {
        return text;
    }
    String result = "";
    boolean trimmed = false;
    if (suffix == null) {
        suffix = "...";
    }

    /*
     * This pattern creates tokens, where each line starts with the tag.
     * For example, "One, <b>Two</b>, Three" produces the following:
     *     One,
     *     <b>Two
     *     </b>, Three
     */
    Pattern tagPattern = Pattern.compile("(<.+?>)?([^<>]*)");

    /*
     * Checks for an empty tag, for example img, br, etc.
     */
    Pattern emptyTagPattern = Pattern.compile("^<\\s*(img|br|input|hr|area|base|basefont|col|frame|isindex|link|meta|param).*>$");

    /*
     * Modified the pattern to also include H1-H6 tags
     * Checks for closing tags, allowing leading and ending space inside the brackets
     */
    Pattern closingTagPattern = Pattern.compile("^<\\s*/\\s*([a-zA-Z]+[1-6]?)\\s*>$");

    /*
     * Modified the pattern to also include H1-H6 tags
     * Checks for opening tags, allowing leading and ending space inside the brackets
     */
    Pattern openingTagPattern = Pattern.compile("^<\\s*([a-zA-Z]+[1-6]?).*?>$");

    /*
     * Find   > ...
     */
    Pattern entityPattern = Pattern.compile("(&[0-9a-z]{2,8};|&#[0-9]{1,7};|[0-9a-f]{1,6};)");

    // splits all html-tags to scanable lines
    Matcher tagMatcher =  tagPattern.matcher(text);
    int numTags = tagMatcher.groupCount();

    int totalLength = suffix.length();
    List<String> openTags = new ArrayList<String>();

    boolean proposingChop = false;
    while (tagMatcher.find()) {
        String tagText = tagMatcher.group(1);
        String plainText = tagMatcher.group(2);

        if (proposingChop &&
                tagText != null && tagText.length() != 0 &&
                plainText != null && plainText.length() != 0) {
            trimmed = true;
            break;
        }

        // if there is any html-tag in this line, handle it and add it (uncounted) to the output
        if (tagText != null && tagText.length() > 0) {
            boolean foundMatch = false;

            // if it's an "empty element" with or without xhtml-conform closing slash
            Matcher matcher = emptyTagPattern.matcher(tagText);
            if (matcher.find()) {
                foundMatch = true;
                // do nothing
            }

            // closing tag?
            if (!foundMatch) {
                matcher = closingTagPattern.matcher(tagText);
                if (matcher.find()) {
                    foundMatch = true;
                    // delete tag from openTags list
                    String tagName = matcher.group(1);
                    openTags.remove(tagName.toLowerCase());
                }
            }

            // opening tag?
            if (!foundMatch) {
                matcher = openingTagPattern.matcher(tagText);
                if (matcher.find()) {
                    // add tag to the beginning of openTags list
                    String tagName = matcher.group(1);
                    openTags.add(0, tagName.toLowerCase());
                }
            }

            // add html-tag to result
            result += tagText;
        }

        // calculate the length of the plain text part of the line; handle entities (e.g.  ) as one character
        int contentLength = plainText.replaceAll("&[0-9a-z]{2,8};|&#[0-9]{1,7};|[0-9a-f]{1,6};", " ").length();
        if (totalLength + contentLength > length) {
            // the number of characters which are left
            int numCharsRemaining = length - totalLength;
            int entitiesLength = 0;
            Matcher entityMatcher = entityPattern.matcher(plainText);
            while (entityMatcher.find()) {
                String entity = entityMatcher.group(1);
                if (numCharsRemaining > 0) {
                    numCharsRemaining--;
                    entitiesLength += entity.length();
                } else {
                    // no more characters left
                    break;
                }
            }

            // keep us from chopping words in half
            int proposedChopPosition = numCharsRemaining + entitiesLength;
            int endOfWordPosition = plainText.indexOf(" ", proposedChopPosition-1);
            if (endOfWordPosition == -1) {
                endOfWordPosition = plainText.length();
            }
            int endOfWordOffset = endOfWordPosition - proposedChopPosition;
            if (endOfWordOffset > 6) { // chop the word if it's extra long
                endOfWordOffset = 0;
            }

            proposedChopPosition = numCharsRemaining + entitiesLength + endOfWordOffset;
            if (plainText.length() >= proposedChopPosition) {
                result += plainText.substring(0, proposedChopPosition);
                proposingChop = true;
                if (proposedChopPosition < plainText.length()) {
                    trimmed = true;
                    break; // maximum length is reached, so get off the loop
                }
            } else {
                result += plainText;
            }
        } else {
            result += plainText;
            totalLength += contentLength;
        }
        // if the maximum length is reached, get off the loop
        if(totalLength >= length) {
            trimmed = true;
            break;
        }
    }

    for (String openTag : openTags) {
        result += "</" + openTag + ">";
    }
    if (trimmed) {
        result += suffix;
    }
    return result;
}
要走就滚别墨迹 2024-09-02 07:52:33

我认为您需要编写自己的 XML 解析器来完成此任务。拉出body节点,添加节点直到二进制长度<一些固定的大小,然后重建文档。如果 HTMLUnit 不创建语义 XHTML,我建议使用 tagsoup

如果您需要 XML 解析器/处理程序,我推荐 XOM

I think you're going to need to write your own XML parser to accomplish this. Pull out the body node, add nodes until binary length < some fixed size, and then rebuild the document. If HTMLUnit doesn't create semantic XHTML, I'd recommend tagsoup.

If you need an XML parser/handler, I'd recommend XOM.

記柔刀 2024-09-02 07:52:33

有一个 PHP 函数可以在这里执行此操作: http://snippets.dzone.com/posts/show /7125

我已经对初始版本做了一个快速而肮脏的 Java 移植,但是注释中有后续的改进版本可能值得考虑(尤其是处理整个单词的版本):

public static String truncateHtml(String s, int l) {
  Pattern p = Pattern.compile("<[^>]+>([^<]*)");

  int i = 0;
  List<String> tags = new ArrayList<String>();

  Matcher m = p.matcher(s);
  while(m.find()) {
      if (m.start(0) - i >= l) {
          break;
      }

      String t = StringUtils.split(m.group(0), " \t\n\r\0\u000B>")[0].substring(1);
      if (t.charAt(0) != '/') {
          tags.add(t);
      } else if ( tags.get(tags.size()-1).equals(t.substring(1))) {
          tags.remove(tags.size()-1);
      }
      i += m.start(1) - m.start(0);
  }

  Collections.reverse(tags);
  return s.substring(0, Math.min(s.length(), l+i))
      + ((tags.size() > 0) ? "</"+StringUtils.join(tags, "></")+">" : "")
      + ((s.length() > l) ? "\u2026" : "");

}

注意: 您需要 Apache Commons Lang 来执行 StringUtils.join()

There is a PHP function that does it here: http://snippets.dzone.com/posts/show/7125

I've made a quick and dirty Java port of the initial version, but there are subsequent improved versions in the comments that could be worth considering (especially one that deals with whole words):

public static String truncateHtml(String s, int l) {
  Pattern p = Pattern.compile("<[^>]+>([^<]*)");

  int i = 0;
  List<String> tags = new ArrayList<String>();

  Matcher m = p.matcher(s);
  while(m.find()) {
      if (m.start(0) - i >= l) {
          break;
      }

      String t = StringUtils.split(m.group(0), " \t\n\r\0\u000B>")[0].substring(1);
      if (t.charAt(0) != '/') {
          tags.add(t);
      } else if ( tags.get(tags.size()-1).equals(t.substring(1))) {
          tags.remove(tags.size()-1);
      }
      i += m.start(1) - m.start(0);
  }

  Collections.reverse(tags);
  return s.substring(0, Math.min(s.length(), l+i))
      + ((tags.size() > 0) ? "</"+StringUtils.join(tags, "></")+">" : "")
      + ((s.length() > l) ? "\u2026" : "");

}

Note: You'll need Apache Commons Lang for the StringUtils.join().

还在原地等你 2024-09-02 07:52:33

我可以为您提供一个我为此编写的 Python 脚本: http://www.ellipsix .net/ext-tmp/summarize.txt。不幸的是,我没有 Java 版本,但如果您愿意,可以自行翻译并修改它以满足您的需要。它并不是很复杂,只是我为我的网站拼凑而成的东西,但我已经使用它一年多了,它通常看起来运行得很好。

如果您想要健壮的东西,XML(或 SGML)解析器几乎肯定比我所做的更好。

I can offer you a Python script I wrote to do this: http://www.ellipsix.net/ext-tmp/summarize.txt. Unfortunately I don't have a Java version, but feel free to translate it yourself and modify it to suit your needs if you want. It's not very complicated, just something I hacked together for my website, but I've been using it for a little more than a year and it generally seems to work pretty well.

If you want something robust, an XML (or SGML) parser is almost certainly a better idea than what I did.

你是年少的欢喜 2024-09-02 07:52:33

我找到了这个博客: dencat:在 Java 中截断 HTML

它包含Python的java端口,Django模板函数truncate_html_words

I found this blog: dencat: Truncating HTML in Java

It contains a java port of Pythons, Django template function truncate_html_words

深海少女心 2024-09-02 07:52:33
public class SimpleHtmlTruncator {

    public static String truncateHtmlWords(String text, int max_length) {
        String input = text.trim();
        if (max_length > input.length()) {
            return input;
        }
        if (max_length < 0) {
            return new String();
        }
        StringBuilder output = new StringBuilder();
        /**
         * Pattern pattern_opentag = Pattern.compile("(<[^/].*?[^/]>).*");
         * Pattern pattern_closetag = Pattern.compile("(</.*?[^/]>).*"); Pattern
         * pattern_selfclosetag = Pattern.compile("(<.*?/>).*");*
         */
        String HTML_TAG_PATTERN = "<(\"[^\"]*\"|'[^']*'|[^'\">])*>";
        Pattern pattern_overall = Pattern.compile(HTML_TAG_PATTERN + "|" + "\\s*\\w*\\s*");
        Pattern pattern_html = Pattern.compile("(" + HTML_TAG_PATTERN + ")" + ".*");
        Pattern pattern_words = Pattern.compile("(\\s*\\w*\\s*).*");
        int characters = 0;
        Matcher all = pattern_overall.matcher(input);
        while (all.find()) {
            String matched = all.group();
            Matcher html_matcher = pattern_html.matcher(matched);
            Matcher word_matcher = pattern_words.matcher(matched);
            if (html_matcher.matches()) {
                output.append(html_matcher.group());
            } else if (word_matcher.matches()) {
                if (characters < max_length) {
                    String word = word_matcher.group();
                    if (characters + word.length() < max_length) {
                        output.append(word);
                    } else {
                        output.append(word.substring(0,
                                (max_length - characters) > word.length()
                                ? word.length() : (max_length - characters)));
                    }
                    characters += word.length();
                }
            }
        }
        return output.toString();
    }

    public static void main(String[] args) {
        String text = SimpleHtmlTruncator.truncateHtmlWords("<html><body><br/><p>abc</p><p>defghij</p><p>ghi</p></body></html>", 4);
        System.out.println(text);
    }
}
public class SimpleHtmlTruncator {

    public static String truncateHtmlWords(String text, int max_length) {
        String input = text.trim();
        if (max_length > input.length()) {
            return input;
        }
        if (max_length < 0) {
            return new String();
        }
        StringBuilder output = new StringBuilder();
        /**
         * Pattern pattern_opentag = Pattern.compile("(<[^/].*?[^/]>).*");
         * Pattern pattern_closetag = Pattern.compile("(</.*?[^/]>).*"); Pattern
         * pattern_selfclosetag = Pattern.compile("(<.*?/>).*");*
         */
        String HTML_TAG_PATTERN = "<(\"[^\"]*\"|'[^']*'|[^'\">])*>";
        Pattern pattern_overall = Pattern.compile(HTML_TAG_PATTERN + "|" + "\\s*\\w*\\s*");
        Pattern pattern_html = Pattern.compile("(" + HTML_TAG_PATTERN + ")" + ".*");
        Pattern pattern_words = Pattern.compile("(\\s*\\w*\\s*).*");
        int characters = 0;
        Matcher all = pattern_overall.matcher(input);
        while (all.find()) {
            String matched = all.group();
            Matcher html_matcher = pattern_html.matcher(matched);
            Matcher word_matcher = pattern_words.matcher(matched);
            if (html_matcher.matches()) {
                output.append(html_matcher.group());
            } else if (word_matcher.matches()) {
                if (characters < max_length) {
                    String word = word_matcher.group();
                    if (characters + word.length() < max_length) {
                        output.append(word);
                    } else {
                        output.append(word.substring(0,
                                (max_length - characters) > word.length()
                                ? word.length() : (max_length - characters)));
                    }
                    characters += word.length();
                }
            }
        }
        return output.toString();
    }

    public static void main(String[] args) {
        String text = SimpleHtmlTruncator.truncateHtmlWords("<html><body><br/><p>abc</p><p>defghij</p><p>ghi</p></body></html>", 4);
        System.out.println(text);
    }
}
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文