How do I parse a UTF-8 representation into a String in Java?
Given the following code:
String tmp = new String("\\u0068\\u0065\\u006c\\u006c\\u006f\\u000a");
String result = convertToEffectiveString(tmp); // result now contains "hello\n"
Does the JDK already provide classes for doing this?
Is there a library that does this (preferably one available through Maven)?
I have tried with ByteArrayOutputStream, with no success.
Comments (3)
This works, but only with ASCII. If you use Unicode characters outside the ASCII range, you will have problems, because each character is stuffed into a single byte, whereas UTF-8 needs more than one byte for anything beyond ASCII. Casting each value to a byte is safe here because you know the UTF-8 encoding will not overflow one byte if you guarantee that the input is basically ASCII (as you mention in your comments).
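The code this answer refers to is not reproduced here, so the following is only a minimal sketch of the ASCII-only idea it describes. The class and method names (`AsciiEscapeDecoder`, `decodeAscii`) are made up for illustration, and the sketch assumes the input consists of nothing but four-digit escapes whose values fit in 7 bits.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not the answer's original code.
final class AsciiEscapeDecoder {

    // Assumes the input is a run of backslash + 'u' + four hex digits,
    // with every value in the ASCII range (0x00 to 0x7F).
    static String decodeAscii(String escaped) {
        // Split on the two-character marker (backslash followed by 'u').
        // The text before the first marker is empty and is skipped.
        String[] parts = escaped.split("\\\\u");
        byte[] bytes = new byte[Math.max(0, parts.length - 1)];
        for (int i = 1; i < parts.length; i++) {
            // The cast silently truncates anything above one byte,
            // which is why this only works for ASCII input.
            bytes[i - 1] = (byte) Integer.parseInt(parts[i], 16);
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String tmp = "\\u0068\\u0065\\u006c\\u006c\\u006f\\u000a";
        System.out.println(decodeAscii(tmp).equals("hello\n")); // prints true
    }
}
```

A value such as 00e9 would be truncated to a lone byte that is not valid UTF-8 on its own, which is the limitation described above.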
Here is another solution that fixes the issue of only working with ASCII characters. This will work with any Unicode character that an escape can express, not just the ASCII characters at the bottom of the range. Thanks to deceze for the questions. You made me think more about the problem and solution.
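Again, the original code is missing from this page, so this is just one way the idea could look, under my own assumptions. It treats each escape as a 16-bit `char` (a UTF-16 code unit) instead of a byte, and copies any other text through unchanged. `UnicodeEscapeDecoder` and `decode` are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not the answer's original code.
final class UnicodeEscapeDecoder {

    // Matches a backslash, the letter 'u', and exactly four hex digits.
    private static final Pattern ESCAPE =
            Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    static String decode(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuilder out = new StringBuilder(input.length());
        int last = 0;
        while (m.find()) {
            // Copy the literal text between the previous escape and this one.
            out.append(input, last, m.start());
            // Four hex digits always fit in a char, so nothing is truncated.
            out.append((char) Integer.parseInt(m.group(1), 16));
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String tmp = "\\u0068\\u0065\\u006c\\u006c\\u006f\\u000a";
        System.out.println(decode(tmp).equals("hello\n")); // prints true
    }
}
```

Characters outside the Basic Multilingual Plane arrive as two escapes (a surrogate pair) and come out as two `char` values, which is how a Java String stores them anyway. The sketch does not give a doubled backslash any special meaning, so it is not a full implementation of Java's escaping rules.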
Firstly, are you just trying to parse a string literal, or is tmp going to be some user-entered data?
If this is going to be a string literal (i.e. a hard-coded string), it can be encoded using Unicode escapes. In your case, this just means using single backslashes instead of double backslashes.
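For the hard-coded case, a small sketch of what that looks like (the class name `LiteralEscapes` is made up). One caveat I am adding here: the compiler translates Unicode escapes before it reads the string literal, so the escape for the line feed would become a real line break inside the literal and the file would not compile. The ordinary `\n` escape is the usual replacement.

```java
// Hypothetical example class, for illustration only.
final class LiteralEscapes {

    static String greeting() {
        // The five Unicode escapes are replaced by their characters at
        // compile time. The line feed is written with the ordinary
        // backslash-n escape, since the Unicode form of U+000A is not
        // allowed inside a string literal.
        return "\u0068\u0065\u006c\u006c\u006f\n";
    }

    public static void main(String[] args) {
        System.out.println(greeting().equals("hello\n")); // prints true
    }
}
```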
If, however, you need to use Java's string parsing rules to parse user input, a good starting point might be Apache Commons Lang's StringEscapeUtils.unescapeJava() method.
I'm sure there must be a better way, but it can be done using just the JDK.
The approach uses java.util.Properties.load(java.io.Reader) to process the backslash escapes, after first using java.util.Properties.store(java.io.OutputStream, java.lang.String) to backslash-escape anything that would cause problems in a properties file, and then using replace("\\\\", "\\") to reverse the backslash-escaping of the original backslashes.
(Disclaimer: even though I tested all the cases I could think of, there are still probably some that I didn't think of.)
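The code itself did not survive on this page, so here is a rough reconstruction of the three steps described above. Treat it as a sketch: the class name `PropertiesUnescaper`, the method name `unescape`, and the property key `"value"` are my own choices.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical helper sketching the store / replace / load round trip.
final class PropertiesUnescaper {

    static String unescape(String input) {
        try {
            // Step 1: store() escapes everything that is special in a
            // properties file, including doubling each backslash.
            Properties props = new Properties();
            props.setProperty("value", input);
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            props.store(buffer, null);
            // store(OutputStream, ...) always writes ISO-8859-1.
            String stored = new String(buffer.toByteArray(),
                    StandardCharsets.ISO_8859_1);

            // Step 2: undo the doubling, so the backslashes from the
            // original input act as escape characters again.
            String restored = stored.replace("\\\\", "\\");

            // Step 3: load() interprets the escapes while parsing.
            Properties parsed = new Properties();
            parsed.load(new StringReader(restored));
            return parsed.getProperty("value");
        } catch (IOException e) {
            // Not expected for in-memory streams.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String tmp = "\\u0068\\u0065\\u006c\\u006c\\u006f\\u000a";
        System.out.println(unescape(tmp).equals("hello\n")); // prints true
    }
}
```

Note that this applies the properties-file escaping rules, which are close to Java's string rules but not identical, so the disclaimer above applies to this sketch too.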