Filter/remove invalid XML characters from a stream

Posted on 2024-09-09 07:39:44

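First things first: I cannot change the XML output. It is produced by a third party, and they are inserting invalid characters into the XML. I am given an InputStream containing the byte-stream representation of the XML. Is there a cleaner way to filter out the offending characters than consuming the stream into a String and processing it? I found a suggestion to use a FilterReader, but that doesn't work for me as-is, because I have a byte stream and not a character stream.

For what it's worth, this is all part of a JAXB unmarshalling procedure, in case that offers other options.

We aren't willing to toss the whole stream if it has bad characters. We have decided to remove them and carry on.

Here is a FilterReader I tried to build.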

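import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

// Assumption: Log and LogFactory come from Apache Commons Logging.
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class InvalidXMLCharacterFilterReader extends FilterReader {

    private static final Log LOG =
            LogFactory.getLog(InvalidXMLCharacterFilterReader.class);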

    public InvalidXMLCharacterFilterReader(Reader in) {
        super(in);
    }

    public int read() throws IOException {
        char[] buf = new char[1];
        int result = read(buf, 0, 1);
        if (result == -1)
            return -1;
        else
            return (int) buf[0];
    }

    public int read(char[] buf, int from, int len) throws IOException {
        int count = 0;
        while (count == 0) {
            count = in.read(buf, from, len);
            if (count == -1)
                return -1;

            int last = from;
            for (int i = from; i < from + count; i++) {
                LOG.debug("" + (char)buf[i]);
                if(!isBadXMLChar(buf[i])) {
                    buf[last++] = buf[i];
                }
            }

            count = last - from;
        }
        return count;
    }

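    private boolean isBadXMLChar(char c) {
        // A Java char is a 16-bit UTF-16 code unit, so it can never be >= 0x10000.
        // Code points U+10000..U+10FFFF arrive as surrogate pairs; let the
        // surrogate halves through instead of stripping them.
        if ((c == 0x9) ||
            (c == 0xA) ||
            (c == 0xD) ||
            ((c >= 0x20) && (c <= 0xD7FF)) ||
            ((c >= 0xE000) && (c <= 0xFFFD)) ||
            Character.isSurrogate(c)) {
            return false;
        }
        return true;
    }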

}

And here is how I am unmarshalling it:

JAXBContext jaxbContext = JAXBContext.newInstance(MyObj.class);
Unmarshaller unMarshaller = jaxbContext.createUnmarshaller();
// 'is' is the InputStream received from the third party
Reader r = new InvalidXMLCharacterFilterReader(new BufferedReader(new InputStreamReader(is, "UTF-8")));
MyObj obj = (MyObj) unMarshaller.unmarshal(r);

And some example bad XML:

<?xml version="1.0" encoding="UTF-8" ?>
<foo>
    bar&#x01;
</foo>

Comments (1)

你怎么这么可爱啊 2024-09-16 07:39:44

In order to do this with a filter, the filter needs to be XML entity aware, because (at least in your example, and likely sometimes in actual use) the bad characters appear in the XML as entities, that is, as numeric character references such as &#x01;.

The filter sees your entity as a sequence of six perfectly acceptable characters (&, #, x, 0, 1, ;) and therefore does not strip them.

The conversion that breaks JAXB happens later in the process, when the XML parser resolves the reference to the character U+0001, which XML 1.0 does not allow.
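
To make that concrete, here is a minimal sketch of what an entity-aware filter could look like. It is not the asker's code and not a tested production implementation; the class name XmlCharRefFilterReader, the clean helper, and the MAX_DIGITS cap are all hypothetical names chosen for illustration. It drops raw invalid characters as before, and additionally recognises numeric references of the form &#N; and &#xN; and drops those whose code point falls outside the XML 1.0 character ranges. Named entities such as &amp; pass through untouched.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical sketch: drops raw invalid XML 1.0 chars and numeric references to them.
class XmlCharRefFilterReader extends FilterReader {

    private static final int MAX_DIGITS = 10; // arbitrary cap chosen for this sketch
    private final PushbackReader src;          // lets us un-read one character
    private final StringBuilder pending = new StringBuilder(); // text waiting to be emitted
    private int pendingPos = 0;

    XmlCharRefFilterReader(Reader in) {
        super(new PushbackReader(in, 1));
        this.src = (PushbackReader) this.in;
    }

    // The Char production of XML 1.0, checked on a whole code point.
    static boolean isValidXmlCodePoint(long cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD) || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    private static boolean isDigit(int c, boolean hex) {
        return (c >= '0' && c <= '9') || (hex && ((c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F')));
    }

    // Called right after '&' was consumed; fills 'pending' or leaves it empty to drop the text.
    private void readReference() throws IOException {
        pending.append('&');
        int c = src.read();
        if (c != '#') {                    // named entity such as &amp;, or a stray '&'
            if (c != -1) src.unread(c);
            return;
        }
        pending.append('#');
        c = src.read();
        boolean hex = (c == 'x');
        if (hex) {
            pending.append('x');
            c = src.read();
        }
        StringBuilder digits = new StringBuilder();
        while (isDigit(c, hex) && digits.length() < MAX_DIGITS) {
            digits.append((char) c);
            c = src.read();
        }
        pending.append(digits);
        if (c == ';' && digits.length() > 0) {
            long cp = Long.parseLong(digits.toString(), hex ? 16 : 10);
            if (isValidXmlCodePoint(cp)) pending.append(';'); // valid reference: keep verbatim
            else pending.setLength(0);                        // invalid reference: drop it
        } else if (c != -1) {
            src.unread(c);                 // not a reference after all: re-scan this char
        }
    }

    @Override
    public int read() throws IOException {
        while (true) {
            if (pendingPos < pending.length()) return pending.charAt(pendingPos++);
            pending.setLength(0);
            pendingPos = 0;
            int c = src.read();
            if (c == -1) return -1;
            if (c == '&') {
                readReference();
            } else if (Character.isSurrogate((char) c) || isValidXmlCodePoint(c)) {
                return c;                  // surrogate halves pass through untouched
            }
            // anything else is a raw invalid character and is skipped
        }
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = 0;
        while (n < len) {
            int c = read();
            if (c == -1) break;
            buf[off + n++] = (char) c;
        }
        return (n == 0 && len > 0) ? -1 : n;
    }

    static String clean(String xml) {
        StringBuilder out = new StringBuilder();
        try (Reader r = new XmlCharRefFilterReader(new StringReader(xml))) {
            for (int c = r.read(); c != -1; c = r.read()) out.append((char) c);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

With this sketch, clean("<foo>bar&#x01;</foo>") yields "<foo>bar</foo>". In the unmarshalling code from the question it would take the place of InvalidXMLCharacterFilterReader, still wrapped around a BufferedReader since it reads one character at a time. Two limitations to keep in mind: it does not know about CDATA sections or comments, where &#x01; is literal text rather than a reference, and references with more than MAX_DIGITS digits are passed through unchanged.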