当前位置：文江博客话题详情

HTML Java html-encode

JDK 是否有一个类可以进行 HTML 编码（但不能进行 URL 编码）？

发布于 2024-07-15 03:55:40 字数 493 浏览 5 评论 0 原文

我当然熟悉 java.net.URLEncoder 和 java.net.URLDecoder 类。然而，我只需要 HTML 风格的编码。（我不希望 ' ' 替换为 '+' 等）。我不知道有任何 JDK 内置类只能执行 HTML 编码。有吗？我知道其他选择（例如， Jakarta Commons Lang 'StringEscapeUtils'，但我不想在需要它的项目中添加另一个外部依赖项。

我希望最近的 JDK 中已添加一些内容（又名 5 或 6）这会做我不知道的事情，否则我必须自己动手。

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

毁梦 2024-07-22 03:55:41

没有内置的 JDK 类可以执行此操作，但它是 Jakarta commons-lang 库的一部分。

String escaped = StringEscapeUtils.escapeHtml3(stringToEscape);
String escaped = StringEscapeUtils.escapeHtml4(stringToEscape);

查看 JavaDoc

添加依赖项通常就像将 jar 放在某处一样简单，并且 commons-lang 有如此多有用的实用程序，因此通常值得将其加入其中。

There isn't a JDK built in class to do this, but it is part of the Jakarta commons-lang library.

String escaped = StringEscapeUtils.escapeHtml3(stringToEscape);
String escaped = StringEscapeUtils.escapeHtml4(stringToEscape);

Check out the JavaDoc

Adding the dependency is usually as simple as dropping the jar somewhere, and commons-lang has so many useful utilities that it is often worthwhile having it on board.

回复收藏 0 原文

┈┾☆殇 2024-07-22 03:55:41

一种简单的方法似乎是这样的：

/**
 * HTML encode of UTF8 string i.e. symbols with code more than 127 aren't encoded
 * Use Apache Commons Text StringEscapeUtils if it is possible
 *
 * <pre>
 * escapeHtml("\tIt's timeto hack & fun\r<script>alert(\"PWNED\")</script>")
 *    .equals("	It's time to hack & fun
<script>alert("PWNED")</script>")
 * </pre>
 */
public static String escapeHtml(String rawHtml) {
    int rawHtmlLength = rawHtml.length();
    // add 30% for additional encodings
    int capacity = (int) (rawHtmlLength * 1.3);
    StringBuilder sb = new StringBuilder(capacity);
    for (int i = 0; i < rawHtmlLength; i++) {
        char ch = rawHtml.charAt(i);
        if (ch == '<') {
            sb.append("<");
        } else if (ch == '>') {
            sb.append(">");
        } else if (ch == '"') {
            sb.append(""");
        } else if (ch == '&') {
            sb.append("&");
        } else if (ch < ' ' || ch == '\'') {
            // non printable ascii symbols escaped as numeric entity
            // single quote ' in html doesn't have ' so show it as numeric entity '
            sb.append("&#").append((int)ch).append(';');
        } else {
            // any non ASCII char i.e. upper than 127 is still UTF
            sb.append(ch);
        }
    }
    return sb.toString();
}

但如果您确实需要转义所有非 ASCII 符号，即您将在 7 位编码上传输编码文本，则将最后一个 else 替换为：

        } else {
            // encode non ASCII characters if needed
            int c = (ch & 0xFFFF);
            if (c > 127) {
                sb.append("&#").append(c).append(';');
            } else {
                sb.append(ch);
            }
        }

A simple way seem to be this one:

/**
 * HTML encode of UTF8 string i.e. symbols with code more than 127 aren't encoded
 * Use Apache Commons Text StringEscapeUtils if it is possible
 *
 * <pre>
 * escapeHtml("\tIt's timeto hack & fun\r<script>alert(\"PWNED\")</script>")
 *    .equals("	It's time to hack & fun
<script>alert("PWNED")</script>")
 * </pre>
 */
public static String escapeHtml(String rawHtml) {
    int rawHtmlLength = rawHtml.length();
    // add 30% for additional encodings
    int capacity = (int) (rawHtmlLength * 1.3);
    StringBuilder sb = new StringBuilder(capacity);
    for (int i = 0; i < rawHtmlLength; i++) {
        char ch = rawHtml.charAt(i);
        if (ch == '<') {
            sb.append("<");
        } else if (ch == '>') {
            sb.append(">");
        } else if (ch == '"') {
            sb.append(""");
        } else if (ch == '&') {
            sb.append("&");
        } else if (ch < ' ' || ch == '\'') {
            // non printable ascii symbols escaped as numeric entity
            // single quote ' in html doesn't have ' so show it as numeric entity '
            sb.append("&#").append((int)ch).append(';');
        } else {
            // any non ASCII char i.e. upper than 127 is still UTF
            sb.append(ch);
        }
    }
    return sb.toString();
}

But if you do need to escape all non ASCII symbols i.e. you'll transmit encoded text on 7bit encoding then replace the last else with:

        } else {
            // encode non ASCII characters if needed
            int c = (ch & 0xFFFF);
            if (c > 127) {
                sb.append("&#").append(c).append(';');
            } else {
                sb.append(ch);
            }
        }

回复收藏 0 原文

〗斷ホ乔殘χμё〖 2024-07-22 03:55:41

显然，答案是“不”。不幸的是，在这种情况下，我必须做一些事情，并且在短期内无法为其添加新的外部依赖项。我同意大家的看法，使用 Commons Lang 是最好的长期解决方案。一旦我可以向项目添加新库，这就是我将要采用的方法。

遗憾的是 Java API 中没有如此常用的东西。

回复收藏 0 原文

心是晴朗的。 2024-07-22 03:55:41

我发现我审查过的所有现有解决方案（库）都存在以下一个或多个问题：

它们没有在 Javadoc 中准确地告诉您它们所替换的内容。
它们逃逸太多......这使得 HTML 更难阅读。
它们不会记录返回值何时可以安全使用（对于 HTML 实体可以安全使用吗？对于 HTML 属性？等等）
它们没有针对速度进行优化。
它们没有避免双重转义的功能（不转义已经转义的内容）
它们用 ' 替换单引号（错误！）

除此之外，我还遇到了不的问题能够引入外部库，至少没有一定数量的繁文缛节。

所以，我自己推出了。有罪的。

下面是它的样子，但最新版本始终可以在此要点中找到。

/**
 * HTML string utilities
 */
public class SafeHtml {

    /**
     * Escapes a string for use in an HTML entity or HTML attribute.
     * 
     * <p>
     * The returned value is always suitable for an HTML <i>entity</i> but only
     * suitable for an HTML <i>attribute</i> if the attribute value is inside
     * double quotes. In other words the method is not safe for use with HTML
     * attributes unless you put the value in double quotes like this:
     * <pre>
     *    <div title="value-from-this-method" > ....
     * </pre>
     * Putting attribute values in double quotes is always a good idea anyway.
     * 
     * <p>The following characters will be escaped:
     * <ul>
     *   <li>{@code &} (ampersand) -- replaced with {@code &}</li>
     *   <li>{@code <} (less than) -- replaced with {@code <}</li>
     *   <li>{@code >} (greater than) -- replaced with {@code >}</li>
     *   <li>{@code "} (double quote) -- replaced with {@code "}</li>
     *   <li>{@code '} (single quote) -- replaced with {@code '}</li>
     *   <li>{@code /} (forward slash) -- replaced with {@code /}</li>
     * </ul>
     * It is not necessary to escape more than this as long as the HTML page
     * <a href="https://en.wikipedia.org/wiki/Character_encodings_in_HTML">uses
     * a Unicode encoding</a>. (Most web pages uses UTF-8 which is also the HTML5
     * recommendation.). Escaping more than this makes the HTML much less readable.
     * 
     * @param s the string to make HTML safe
     * @param avoidDoubleEscape avoid double escaping, which means for example not 
     *     escaping {@code <} one more time. Any sequence {@code &....;}, as explained in
     *     {@link #isHtmlCharEntityRef(java.lang.String, int) isHtmlCharEntityRef()}, will not be escaped.
     * 
     * @return a HTML safe string 
     */
    public static String htmlEscape(String s, boolean avoidDoubleEscape) {
        if (s == null || s.length() == 0) {
            return s;
        }
        StringBuilder sb = new StringBuilder(s.length()+16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':
                    // Avoid double escaping if already escaped
                    if (avoidDoubleEscape && (isHtmlCharEntityRef(s, i))) {
                        sb.append('&');
                    } else {
                        sb.append("&");
                    }
                    break;
                case '<':
                    sb.append("<");
                    break;
                case '>':
                    sb.append(">");
                    break;
                case '"':
                    sb.append("""); 
                    break;
                case '\'':
                    sb.append("'"); 
                    break;
                case '/':
                    sb.append("/"); 
                    break;
                default:
                    sb.append(c);
            }
        }
        return sb.toString();
  }

  /**
   * Checks if the value at {@code index} is a HTML entity reference. This
   * means any of :
   * <ul>
   *   <li>{@code &} or {@code <} or {@code >} or {@code "} </li>
   *   <li>A value of the form {@code &#dddd;} where {@code dddd} is a decimal value</li>
   *   <li>A value of the form {@code &#xhhhh;} where {@code hhhh} is a hexadecimal value</li>
   * </ul>
   * @param str the string to test for HTML entity reference.
   * @param index position of the {@code '&'} in {@code str}
   * @return 
   */
  public static boolean isHtmlCharEntityRef(String str, int index)  {
      if (str.charAt(index) != '&') {
          return false;
      }
      int indexOfSemicolon = str.indexOf(';', index + 1);
      if (indexOfSemicolon == -1) { // is there a semicolon sometime later ?
          return false;
      }
      if (!(indexOfSemicolon > (index + 2))) {   // is the string actually long enough
          return false;
      }
      if (followingCharsAre(str, index, "amp;")
              || followingCharsAre(str, index, "lt;")
              || followingCharsAre(str, index, "gt;")
              || followingCharsAre(str, index, "quot;")) {
          return true;
      }
      if (str.charAt(index+1) == '#') {
          if (str.charAt(index+2) == 'x' || str.charAt(index+2) == 'X') {
              // It's presumably a hex value
              if (str.charAt(index+3) == ';') {
                  return false;
              }
              for (int i = index+3; i < indexOfSemicolon; i++) {
                  char c = str.charAt(i);
                  if (c >= 48 && c <=57) {  // 0 -- 9
                      continue;
                  }
                  if (c >= 65 && c <=70) {   // A -- F
                      continue;
                  }
                  if (c >= 97 && c <=102) {   // a -- f
                      continue;
                  }
                  return false;  
              }
              return true;   // yes, the value is a hex string
          } else {
              // It's presumably a decimal value
              for (int i = index+2; i < indexOfSemicolon; i++) {
                  char c = str.charAt(i);
                  if (c >= 48 && c <=57) {  // 0 -- 9
                      continue;
                  }
                  return false;
              }
              return true; // yes, the value is decimal
          }
      }
      return false;
  } 


  /**
   * Tests if the chars following position <code>startIndex</code> in string
   * <code>str</code> are that of <code>nextChars</code>.
   * 
   * <p>Optimized for speed. Otherwise this method would be exactly equal to
   * {@code (str.indexOf(nextChars, startIndex+1) == (startIndex+1))}.
   *
   * @param str
   * @param startIndex
   * @param nextChars
   * @return 
   */  
  private static boolean followingCharsAre(String str, int startIndex, String nextChars)  {
      if ((startIndex + nextChars.length()) < str.length()) {
          for(int i = 0; i < nextChars.length(); i++) {
              if ( nextChars.charAt(i) != str.charAt(startIndex+i+1)) {
                  return false;
              }
          }
          return true;
      } else {
          return false;
      }
  }
}

TODO：保留连续的空格。

I've found that all existing solutions (libraries) I've reviewed suffered from one or several of the below issues:

They don't tell you in the Javadoc exactly what they replace.
They escape too much ... which makes the HTML much harder to read.
They do not document when the returned value is safe to use (safe to use for an HTML entity?, for an HTML attributute?, etc)
They are not optimized for speed.
They do not have a feature for avoiding double escaping (do not escape what is already escaped)
They replace single quote with ' (wrong!)

On top of this I also had the problem of not being able to bring in an external library, at least not without a certain amount of red tape.

So, I rolled my own. Guilty.

Below is what it looks like but the latest version can always be found in this gist.

/**
 * HTML string utilities
 */
public class SafeHtml {

    /**
     * Escapes a string for use in an HTML entity or HTML attribute.
     * 
     * <p>
     * The returned value is always suitable for an HTML <i>entity</i> but only
     * suitable for an HTML <i>attribute</i> if the attribute value is inside
     * double quotes. In other words the method is not safe for use with HTML
     * attributes unless you put the value in double quotes like this:
     * <pre>
     *    <div title="value-from-this-method" > ....
     * </pre>
     * Putting attribute values in double quotes is always a good idea anyway.
     * 
     * <p>The following characters will be escaped:
     * <ul>
     *   <li>{@code &} (ampersand) -- replaced with {@code &}</li>
     *   <li>{@code <} (less than) -- replaced with {@code <}</li>
     *   <li>{@code >} (greater than) -- replaced with {@code >}</li>
     *   <li>{@code "} (double quote) -- replaced with {@code "}</li>
     *   <li>{@code '} (single quote) -- replaced with {@code '}</li>
     *   <li>{@code /} (forward slash) -- replaced with {@code /}</li>
     * </ul>
     * It is not necessary to escape more than this as long as the HTML page
     * <a href="https://en.wikipedia.org/wiki/Character_encodings_in_HTML">uses
     * a Unicode encoding</a>. (Most web pages uses UTF-8 which is also the HTML5
     * recommendation.). Escaping more than this makes the HTML much less readable.
     * 
     * @param s the string to make HTML safe
     * @param avoidDoubleEscape avoid double escaping, which means for example not 
     *     escaping {@code <} one more time. Any sequence {@code &....;}, as explained in
     *     {@link #isHtmlCharEntityRef(java.lang.String, int) isHtmlCharEntityRef()}, will not be escaped.
     * 
     * @return a HTML safe string 
     */
    public static String htmlEscape(String s, boolean avoidDoubleEscape) {
        if (s == null || s.length() == 0) {
            return s;
        }
        StringBuilder sb = new StringBuilder(s.length()+16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':
                    // Avoid double escaping if already escaped
                    if (avoidDoubleEscape && (isHtmlCharEntityRef(s, i))) {
                        sb.append('&');
                    } else {
                        sb.append("&");
                    }
                    break;
                case '<':
                    sb.append("<");
                    break;
                case '>':
                    sb.append(">");
                    break;
                case '"':
                    sb.append("""); 
                    break;
                case '\'':
                    sb.append("'"); 
                    break;
                case '/':
                    sb.append("/"); 
                    break;
                default:
                    sb.append(c);
            }
        }
        return sb.toString();
  }

  /**
   * Checks if the value at {@code index} is a HTML entity reference. This
   * means any of :
   * <ul>
   *   <li>{@code &} or {@code <} or {@code >} or {@code "} </li>
   *   <li>A value of the form {@code &#dddd;} where {@code dddd} is a decimal value</li>
   *   <li>A value of the form {@code &#xhhhh;} where {@code hhhh} is a hexadecimal value</li>
   * </ul>
   * @param str the string to test for HTML entity reference.
   * @param index position of the {@code '&'} in {@code str}
   * @return 
   */
  public static boolean isHtmlCharEntityRef(String str, int index)  {
      if (str.charAt(index) != '&') {
          return false;
      }
      int indexOfSemicolon = str.indexOf(';', index + 1);
      if (indexOfSemicolon == -1) { // is there a semicolon sometime later ?
          return false;
      }
      if (!(indexOfSemicolon > (index + 2))) {   // is the string actually long enough
          return false;
      }
      if (followingCharsAre(str, index, "amp;")
              || followingCharsAre(str, index, "lt;")
              || followingCharsAre(str, index, "gt;")
              || followingCharsAre(str, index, "quot;")) {
          return true;
      }
      if (str.charAt(index+1) == '#') {
          if (str.charAt(index+2) == 'x' || str.charAt(index+2) == 'X') {
              // It's presumably a hex value
              if (str.charAt(index+3) == ';') {
                  return false;
              }
              for (int i = index+3; i < indexOfSemicolon; i++) {
                  char c = str.charAt(i);
                  if (c >= 48 && c <=57) {  // 0 -- 9
                      continue;
                  }
                  if (c >= 65 && c <=70) {   // A -- F
                      continue;
                  }
                  if (c >= 97 && c <=102) {   // a -- f
                      continue;
                  }
                  return false;  
              }
              return true;   // yes, the value is a hex string
          } else {
              // It's presumably a decimal value
              for (int i = index+2; i < indexOfSemicolon; i++) {
                  char c = str.charAt(i);
                  if (c >= 48 && c <=57) {  // 0 -- 9
                      continue;
                  }
                  return false;
              }
              return true; // yes, the value is decimal
          }
      }
      return false;
  } 


  /**
   * Tests if the chars following position <code>startIndex</code> in string
   * <code>str</code> are that of <code>nextChars</code>.
   * 
   * <p>Optimized for speed. Otherwise this method would be exactly equal to
   * {@code (str.indexOf(nextChars, startIndex+1) == (startIndex+1))}.
   *
   * @param str
   * @param startIndex
   * @param nextChars
   * @return 
   */  
  private static boolean followingCharsAre(String str, int startIndex, String nextChars)  {
      if ((startIndex + nextChars.length()) < str.length()) {
          for(int i = 0; i < nextChars.length(); i++) {
              if ( nextChars.charAt(i) != str.charAt(startIndex+i+1)) {
                  return false;
              }
          }
          return true;
      } else {
          return false;
      }
  }
}

TODO: Preserve consecutive whitespace.

回复收藏 0 原文