How do I parse a UTF-8 representation into a String in Java?

Posted on 2025-01-06 05:30:49

Given the following code:

String tmp = new String("\\u0068\\u0065\\u006c\\u006c\\u006f\\u000a");

String result = convertToEffectiveString(tmp); // result now contains "hello\n"

Does the JDK already provide classes for doing this?
Is there a library that does this? (preferably available through Maven)

I have tried with ByteArrayOutputStream with no success.

Comments (3)

平安喜乐 2025-01-13 05:30:49

This works, but only with ASCII. If you use Unicode characters outside the ASCII range you will have problems, because each code point is stuffed into a single byte, whereas UTF-8 encodes every non-ASCII character as two or more bytes. The typecast below is safe only because the input is guaranteed to be ASCII (as you mention in your comments), so every value fits in one byte.

package sample;

import java.nio.charset.StandardCharsets;

public class UnicodeSample {
    public static final int HEXADECIMAL = 16;

    public static void main(String[] args) {
        String str = "\\u0068\\u0065\\u006c\\u006c\\u006f\\u000a";

        // Turn each backslash-u marker into a space, then split into hex codes
        String[] arr = str.replaceAll("\\\\u", " ").trim().split(" ");
        byte[] utf8 = new byte[arr.length];

        int index = 0;
        for (String ch : arr) {
            // The cast is only safe for ASCII values (0x00 to 0x7F)
            utf8[index++] = (byte) Integer.parseInt(ch, HEXADECIMAL);
        }

        String newStr = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(newStr);
    }
}
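
To make the ASCII limitation concrete, here is a small sketch (the class and method names are made up for illustration) that compares how many bytes UTF-8 needs for a character with what happens when a non-ASCII code point is squeezed into a single byte, as the cast above does.

```java
import java.nio.charset.StandardCharsets;

class ByteCastDemo {
    // Number of bytes UTF-8 needs to encode the given text.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Mimics the cast in the answer: one code point forced into one byte,
    // then decoded as UTF-8.
    static String castAndDecode(int codePoint) {
        byte[] single = { (byte) codePoint };
        return new String(single, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("h"));                        // 1
        System.out.println(utf8Length("\u00e9"));                   // 2 (e with acute accent)
        System.out.println(castAndDecode(0x68).equals("h"));        // true
        System.out.println(castAndDecode(0xE9).equals("\u00e9"));   // false
    }
}
```

The last line is false because a lone 0xE9 byte is not a complete UTF-8 sequence, so the decoder substitutes a replacement character instead of the intended letter.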

Here is another solution that fixes the issue of only working with ASCII characters. Each \uXXXX escape is one UTF-16 code unit, which is exactly what a Java char holds, so the parsed value can be appended to a StringBuilder as a char; no byte array or UTF-8 decoding is needed. (Packing the raw bytes of the code value into an array and decoding them as UTF-8 does not work, because a code point value is not its UTF-8 byte sequence.) Thanks to deceze for the questions. You made me think more about the problem and solution.

package sample;

public class UnicodeSample {
    public static final int HEXADECIMAL = 16;

    public static void main(String[] args) {
        String str = "\\u0068\\u0065\\u006c\\u006c\\u006f\\u000a\\u3fff\\uf34c";

        // Turn each backslash-u marker into a space, then split into hex codes
        String[] codes = str.replaceAll("\\\\u", " ").trim().split(" ");

        StringBuilder sb = new StringBuilder();
        for (String c : codes) {
            // Each four-digit code is one UTF-16 code unit, i.e. one char
            sb.append((char) Integer.parseInt(c, HEXADECIMAL));
        }

        str = sb.toString();
        System.out.println(str);
    }
}

凉墨 2025-01-13 05:30:49

Firstly, are you just trying to parse a string literal, or is tmp going to be some user-entered data?

If this is going to be a string literal (i.e. hard-coded string), it can be encoded using Unicode escapes. In your case, this just means using single backslashes instead of double backslashes:

String result = "\u0068\u0065\u006c\u006c\u006f\u000a";

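If, however, you need to use Java's string parsing rules to parse user input, a good starting point might be Apache Commons Lang's StringEscapeUtils.unescapeJava() method (in recent Commons Lang releases this class is deprecated in favor of the one in Apache Commons Text).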
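
If adding a dependency is not an option, the Unicode-escape part of that job can be approximated with the JDK's regex classes. The following is a minimal sketch, not a replacement for unescapeJava(): it only handles backslash-u escapes with exactly four hex digits, it leaves other escapes such as \n or \t untouched, and it does not treat a doubled backslash specially. The class name is hypothetical; the method name is borrowed from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeUnescape {
    // A literal backslash, the letter u, then exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    static String convertToEffectiveString(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the text between the previous escape and this one unchanged.
            out.append(input, last, m.start());
            // Each escape is one UTF-16 code unit, so it maps to one char.
            out.append((char) Integer.parseInt(m.group(1), 16));
            last = m.end();
        }
        // Copy whatever follows the final escape.
        out.append(input, last, input.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String tmp = "\\u0068\\u0065\\u006c\\u006c\\u006f\\u000a";
        System.out.print(convertToEffectiveString(tmp)); // prints "hello" followed by a newline
    }
}
```

Because every escape becomes one char, a surrogate pair written as two consecutive escapes ends up as two adjacent chars and therefore as one supplementary character in the result.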

七分※倦醒 2025-01-13 05:30:49

I'm sure there must be a better way, but using just the JDK:

public static String handleEscapes(final String s)
{
    final java.util.Properties props = new java.util.Properties();
    props.setProperty("foo", s);
    final java.io.ByteArrayOutputStream baos = new java.io.ByteArrayOutputStream();
    try
    {
        // store() writes the value with every backslash doubled
        props.store(baos, null);
        // undo the doubling so load() sees the original escapes
        final String tmp = baos.toString().replace("\\\\", "\\");
        // load() decodes the backslash escapes and overwrites "foo"
        props.load(new java.io.StringReader(tmp));
    }
    catch(final java.io.IOException ioe) // shouldn't happen
        { throw new RuntimeException(ioe); }
    return props.getProperty("foo");
}

This uses java.util.Properties.load(java.io.Reader) to process the backslash escapes. Before that, it uses java.util.Properties.store(java.io.OutputStream, java.lang.String) to backslash-escape anything that would cause problems in a properties file, and then replace("\\\\", "\\") to reverse the escaping of the original backslashes.

(Disclaimer: even though I tested all the cases I could think of, there are still probably some that I didn't think of.)
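
The two Properties behaviours this trick relies on can be observed separately. The sketch below uses made-up helper names and a Writer instead of the OutputStream used above, purely to keep the demo short; it shows that load() decodes a backslash-u escape in a value and that store() doubles each backslash.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.util.Properties;

class PropertiesEscapeDemo {
    // Returns what load() makes of the value in a "foo=..." line.
    static String loadValue(String line) {
        try {
            Properties props = new Properties();
            props.load(new StringReader(line));
            return props.getProperty("foo");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the "foo=..." line that store() writes for the given value,
    // skipping the date comment that store() emits first.
    static String storedLine(String value) {
        try {
            Properties props = new Properties();
            props.setProperty("foo", value);
            StringWriter out = new StringWriter();
            props.store(out, null);
            for (String line : out.toString().split("\\R")) {
                if (line.startsWith("foo=")) {
                    return line;
                }
            }
            throw new IllegalStateException("property line not found");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // load() turns the two escapes into the letters h and i
        System.out.println(loadValue("foo=\\u0068\\u0069"));
        // store() writes the same value with each backslash doubled
        System.out.println(storedLine("\\u0068\\u0069"));
    }
}
```

Seeing the doubled backslashes in the stored line explains why the replace() step is needed before the text is handed back to load().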
