Remove non-UTF-8 characters from XML with declared encoding=utf-8 - Java

Posted 2024-09-01 20:19:08

I have to handle this scenario in Java:

I'm getting a request in XML form from a client with declared encoding=utf-8. Unfortunately it may contain non-UTF-8 characters, and there is a requirement to remove these characters from the XML on my side (legacy).

Let's consider an example where this invalid XML contains £ (pound).

1) I get the XML as a Java String with £ in it (I don't have access to the interface right now, but I probably get the XML as a Java String). Can I use replaceAll("£", "") to get rid of this character? Any potential issues?

2) I get the XML as an array of bytes - how do I handle this operation safely in that case?


Comments (7)

小兔几 2024-09-08 20:19:08

1) I get the XML as a Java String with £ in it (I don't have access to the interface right now, but I probably get the XML as a Java String). Can I use replaceAll("£", "") to get rid of this character?

I assume you actually mean that you want to get rid of non-ASCII characters, because you're talking about a "legacy" side. You can get rid of anything outside the printable ASCII range using the following regex:

string = string.replaceAll("[^\\x20-\\x7e]", "");
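
To make the effect of that regex concrete, here is a minimal sketch. The class name AsciiFilterDemo and the sample XML are made up for illustration. Note that the pattern also strips tabs and line breaks, since those fall below \x20.

```java
class AsciiFilterDemo {
    // Removes every character outside the printable ASCII range 0x20-0x7E.
    static String stripNonPrintableAscii(String input) {
        return input.replaceAll("[^\\x20-\\x7e]", "");
    }

    public static void main(String[] args) {
        // \u00A3 is the pound sign; it is outside the kept range, so it is dropped.
        String xml = "<price>\u00A3100</price>";
        System.out.println(stripNonPrintableAscii(xml)); // prints <price>100</price>
    }
}
```

If whitespace such as tabs or newlines matters, the character class would need to be widened accordingly.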

2) I get xml as an array of bytes - how to handle this operation safely in that case?

You need to wrap the byte[] in a ByteArrayInputStream, so that you can read it as a UTF-8 encoded character stream using an InputStreamReader, in which you specify the encoding, and then use a BufferedReader to read it line by line.

E.g.

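import java.io.*;
import java.nio.charset.StandardCharsets;

// try-with-resources closes the reader even if reading fails
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8))) {
    for (String line; (line = reader.readLine()) != null;) {
        // Malformed bytes were decoded to U+FFFD, which this pattern removes as well
        line = line.replaceAll("[^\\x20-\\x7e]", "");
        // ...
    }
    // ...
} catch (IOException e) {
    throw new UncheckedIOException(e);
}
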
年华零落成诗 2024-09-08 20:19:08

UTF-8 is an encoding; Unicode is a character set. But the GBP symbol is most definitely in the Unicode character set and therefore most certainly representable in UTF-8.

If you do in fact mean UTF-8, and you are actually trying to remove byte sequences that are not the valid encoding of a character in UTF-8, then...

CharsetDecoder utf8Decoder = Charset.forName("UTF-8").newDecoder();
utf8Decoder.onMalformedInput(CodingErrorAction.IGNORE);
utf8Decoder.onUnmappableCharacter(CodingErrorAction.IGNORE);
ByteBuffer bytes = ...;
CharBuffer parsed = utf8Decoder.decode(bytes);
...
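
For completeness, a self-contained sketch of the decoder approach above. The class and method names (Utf8Sanitizer, decodeDroppingInvalid) and the sample bytes are assumptions for illustration, not part of the original answer.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Sanitizer {
    // Decodes the bytes as UTF-8 and silently drops byte sequences that are not valid UTF-8.
    static String decodeDroppingInvalid(byte[] raw) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.IGNORE)
                .onUnmappableCharacter(CodingErrorAction.IGNORE);
        try {
            return decoder.decode(ByteBuffer.wrap(raw)).toString();
        } catch (CharacterCodingException e) {
            // Not expected: with IGNORE the decoder skips bad input instead of reporting it.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // A lone 0xA3 byte is the pound sign in ISO-8859-1, but it is malformed as UTF-8.
        byte[] raw = {'<', 'a', '>', (byte) 0xA3, '1', '<', '/', 'a', '>'};
        System.out.println(decodeDroppingInvalid(raw)); // prints <a>1</a>
    }
}
```

A pound sign that is correctly encoded as the two bytes 0xC2 0xA3 is left alone, because it is valid UTF-8.
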
薆情海 2024-09-08 20:19:08
"test text".replaceAll("[^\\u0000-\\uFFFF]", "");

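This code removes all characters that take 4 bytes in UTF-8 (code points above U+FFFF) from the string. This can be needed, for example, when inserting into a MySQL InnoDB varchar column that uses MySQL's 3-byte utf8 character set.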
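
A small sketch of that replacement applied to a string containing an emoji. The class name BmpFilterDemo and the sample text are hypothetical. It relies on Java regexes matching by code point, so a surrogate pair is treated as one character outside the \u0000-\uFFFF range.

```java
class BmpFilterDemo {
    // Drops every code point above U+FFFF, i.e. everything that needs 4 bytes in UTF-8.
    static String stripSupplementary(String input) {
        return input.replaceAll("[^\\u0000-\\uFFFF]", "");
    }

    public static void main(String[] args) {
        // \uD83D\uDE15 is the surrogate pair for U+1F615 (an emoji); \u00A3 is the pound sign.
        String text = "ok \uD83D\uDE15 \u00A3";
        String cleaned = stripSupplementary(text);
        // The emoji is gone, the pound sign (2 bytes in UTF-8) is kept.
        System.out.println(cleaned.equals("ok  \u00A3")); // prints true
    }
}
```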

┼── 2024-09-08 20:19:08

I faced the same problem while reading files from a local directory and tried this:

BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(filePath), "UTF-8"));
DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document xmlDom = db.parse(new InputSource(in));

You might have to use your network input stream instead of FileInputStream.
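
The same idea applied to a byte array instead of a file, as a rough sketch. The class name XmlFromBytes is made up. One thing to be aware of: InputStreamReader does not drop malformed bytes, it replaces each one with U+FFFD, so the document parses but the replacement character ends up in the text.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XmlFromBytes {
    // Decodes the bytes as UTF-8 ourselves and hands the parser a character stream.
    static Document parse(byte[] xmlBytes) {
        try (Reader in = new InputStreamReader(
                new ByteArrayInputStream(xmlBytes), StandardCharsets.UTF_8)) {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return db.parse(new InputSource(in));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Encoding with ISO-8859-1 produces a lone 0xA3 byte, which is not valid UTF-8.
        byte[] raw = "<a>\u00A31</a>".getBytes(StandardCharsets.ISO_8859_1);
        String text = parse(raw).getDocumentElement().getTextContent();
        System.out.println(text.equals("\uFFFD1")); // prints true
    }
}
```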

--
Kapil

月亮是我掰弯的 2024-09-08 20:19:08

Note that the first step should be that you ask the creator of the XML (which is most likely a home grown "just print data" XML generator) to ensure that their XML is correct before sending to you. The simplest possible test if they use Windows is to ask them to view it in Internet Explorer and see the parsing error at the first offending character.

While they fix that, you can simply write a small program that changes the header to declare that the encoding is ISO-8859-1 instead:

<?xml version="1.0" encoding="iso-8859-1" ?>

and leave the rest untouched.
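
Such a small program could look roughly like this. It is only a sketch under two assumptions: the stray bytes really are ISO-8859-1, and the XML declaration sits at the very start of the data with no byte order mark. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class DeclarationRewriter {
    // Rewrites encoding="utf-8" in the XML declaration to iso-8859-1, leaving all other bytes as they are.
    static byte[] declareLatin1(byte[] xmlBytes) {
        // ISO-8859-1 maps each byte to exactly one char, so this round trip is lossless.
        String text = new String(xmlBytes, StandardCharsets.ISO_8859_1);
        String patched = text.replaceFirst(
                "(?i)^(<\\?xml[^>]*?encoding\\s*=\\s*[\"'])utf-8([\"'])",
                "$1iso-8859-1$2");
        return patched.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] in = "<?xml version=\"1.0\" encoding=\"utf-8\"?><a>\u00A3</a>"
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] out = declareLatin1(in);
        // Only the declaration changed; the 0xA3 byte is still there and now matches the declared encoding.
        System.out.println(new String(out, StandardCharsets.ISO_8859_1).startsWith(
                "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>")); // prints true
    }
}
```

If the document also contains genuine multi-byte UTF-8 characters, a parser will misread those as ISO-8859-1 after this change, so this is a stopgap rather than a fix.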

白云不回头 2024-09-08 20:19:08

Once you convert the byte array to a String in Java, you get a UTF-16 encoded string, because Java strings are always UTF-16 internally. One way to get rid of non-UTF-8 characters is the following code:

String[] values = {"\\xF0\\x9F\\x98\\x95", "\\xF0\\x9F\\x91\\x8C", "/*", "look into my eyes 〠.〠", "fkdjsf ksdjfslk", "\\xF0\\x80\\x80\\x80", "aa \\xF0\\x9F\\x98\\x95 aa"};
for (int i = 0; i < values.length; i++) {
    System.out.println(values[i].replaceAll(
                    "[\\\\x00-\\\\x7F]|" + //single-byte sequences   0xxxxxxx
                    "[\\\\xC0-\\\\xDF][\\\\x80-\\\\xBF]|" + //double-byte sequences   110xxxxx 10xxxxxx
                    "[\\\\xE0-\\\\xEF][\\\\x80-\\\\xBF]{2}|" + //triple-byte sequences   1110xxxx 10xxxxxx * 2
                    "[\\\\xF0-\\\\xF7][\\\\x80-\\\\xBF]{3}" //quadruple-byte sequence 11110xxx 10xxxxxx * 3
            , ""));
}

Or, if you want to check whether a string contains non-UTF-8 characters, you could use Pattern.matches like this:

String[] values = {"\\xF0\\x9F\\x98\\x95", "\\xF0\\x9F\\x91\\x8C", "/*", "look into my eyes 〠.〠", "fkdjsf ksdjfslk", "\\xF0\\x80\\x80\\x80", "aa \\xF0\\x9F\\x98\\x95 aa"};
for (int i = 0; i < values.length; i++) {
    System.out.println(Pattern.matches(
                    ".*(" +
                    "[\\\\x00-\\\\x7F]|" + //single-byte sequences   0xxxxxxx
                    "[\\\\xC0-\\\\xDF][\\\\x80-\\\\xBF]|" + //double-byte sequences   110xxxxx 10xxxxxx
                    "[\\\\xE0-\\\\xEF][\\\\x80-\\\\xBF]{2}|" + //triple-byte sequences   1110xxxx 10xxxxxx * 2
                    "[\\\\xF0-\\\\xF7][\\\\x80-\\\\xBF]{3}" //quadruple-byte sequence 11110xxx 10xxxxxx * 3
                    + ").*"
            , values[i]));
}
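
As an alternative to the regex check, and not what the answer above uses, validation can also be done directly on the raw bytes with a strict CharsetDecoder. This is a sketch; the class name Utf8Validator is an assumption.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Validator {
    // Returns true only if the whole array is well-formed UTF-8.
    static boolean isValidUtf8(byte[] raw) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw));
            return true;
        } catch (CharacterCodingException e) {
            // REPORT makes the decoder throw on the first invalid sequence.
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] emoji = {(byte) 0xF0, (byte) 0x9F, (byte) 0x98, (byte) 0x95};
        byte[] overlong = {(byte) 0xF0, (byte) 0x80, (byte) 0x80, (byte) 0x80};
        System.out.println(isValidUtf8(emoji));    // prints true
        System.out.println(isValidUtf8(overlong)); // prints false
    }
}
```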

If you have the byte array available, then you could filter it even more properly with:

try (BufferedReader bufferedReader = new BufferedReader(
        new InputStreamReader(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8))) {
    for (String currentLine; (currentLine = bufferedReader.readLine()) != null;) {
        // Note: the reader has already decoded the bytes, so these classes match chars, not raw bytes
        currentLine = currentLine.replaceAll(
                        "[\\x00-\\x7F]|" + //single-byte sequences   0xxxxxxx
                        "[\\xC0-\\xDF][\\x80-\\xBF]|" + //double-byte sequences   110xxxxx 10xxxxxx
                        "[\\xE0-\\xEF][\\x80-\\xBF]{2}|" + //triple-byte sequences   1110xxxx 10xxxxxx * 2
                        "[\\xF0-\\xF7][\\x80-\\xBF]{3}" //quadruple-byte sequence 11110xxx 10xxxxxx * 3
                , "");
        // ...
    }
} catch (IOException e) {
    throw new UncheckedIOException(e);
}

For making a whole web app UTF-8 compatible, see "How to get UTF-8 working in Java webapps" and "Byte Encodings and Strings".

看轻我的陪伴 2024-09-08 20:19:08

Here is what Copilot suggests:

public String removeNonUtf8CompliantCharacters(final String inString) {
        if (null == inString) return null;
       
        byte[] byteArr = inString.getBytes(StandardCharsets.UTF_8);
        String cleanedString = new String(byteArr, StandardCharsets.UTF_8);
        
        return cleanedString;
    }

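This code converts the string to bytes using UTF-8 encoding and then converts the bytes back to a string. Since every Java String can already be encoded as UTF-8, the round trip leaves normal text, including £, unchanged; the only thing it alters is unpaired surrogates, which getBytes replaces with '?'. It does not help when the invalid bytes were already decoded incorrectly before the String was created.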
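
A quick way to see what this round trip actually does to a string is to try it on a couple of inputs. This is a minimal sketch; RoundTripDemo and its sample strings are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    // Same operation as the suggested method: encode to UTF-8 bytes, then decode again.
    static String roundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The pound sign is a perfectly valid character, so it survives unchanged.
        System.out.println(roundTrip("\u00A3100").equals("\u00A3100")); // prints true
        // A lone surrogate cannot be encoded; getBytes substitutes '?' for it.
        System.out.println(roundTrip("a\uD800b")); // prints a?b
    }
}
```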
