从流中过滤/删除无效的 xml 字符

发布于 2024-09-09 07:39:44 字数 2256 浏览 7 评论 0原文

首先，我无法更改 xml 的输出，它是由第三方生成的。他们在 xml 中插入无效字符。我得到了 xml 字节流表示形式的 InputStream。除了将流消耗到字符串中并对其进行处理之外，是否有一种更干净的方法来过滤掉有问题的字符？我发现了这个：使用 FilterReader 但是这对我不起作用，因为我有字节流而不是字符流。

无论如何，这都是 jaxb 解组过程的一部分，以防万一提供选项。

如果它有不好的字符，我们不愿意扔掉整个流。我们决定删除它们并继续。

这是我尝试构建的 FilterReader。

public class InvalidXMLCharacterFilterReader extends FilterReader {

    private static final Log LOG = LogFactory
    .getLog(InvalidXMLCharacterFilterReader.class);

    public InvalidXMLCharacterFilterReader(Reader in) {
        super(in);
    }

    public int read() throws IOException {
        char[] buf = new char[1];
        int result = read(buf, 0, 1);
        if (result == -1)
        return -1;
        else
        return (int) buf[0];
    }

    public int read(char[] buf, int from, int len) throws IOException {
        int count = 0;
        while (count == 0) {
            count = in.read(buf, from, len);
            if (count == -1)
                return -1;

            int last = from;
            for (int i = from; i < from + count; i++) {
                LOG.debug("" + (char)buf[i]);
                if(!isBadXMLChar(buf[i])) {
                    buf[last++] = buf[i];
                }
            }

            count = last - from;
        }
        return count;
    }

    private boolean isBadXMLChar(char c) {
        if ((c == 0x9) ||
            (c == 0xA) ||
            (c == 0xD) ||
            ((c >= 0x20) && (c <= 0xD7FF)) ||
            ((c >= 0xE000) && (c <= 0xFFFD)) ||
            ((c >= 0x10000) && (c <= 0x10FFFF))) {
            return false;
        }
        return true;
    }

}

下面是我如何解组它：

jaxbContext = JAXBContext.newInstance(MyObj.class);
Unmarshaller unMarshaller = jaxbContext.createUnmarshaller();
Reader r = new InvalidXMLCharacterFilterReader(new BufferedReader(new InputStreamReader(is, "UTF-8")));
MyObj obj = (MyObj) unMarshaller.unmarshal(r);

以及一些错误的 xml 示例

<?xml version="1.0" encoding="UTF-8" ?>
<foo>
    bar&#x01;
</foo>

原文

First things first, I can not change the output of the xml, it is being produced by a third party. They are inserting invalid characters in the the xml. I am given a InputStream of the byte stream representation of the xml. Is their a cleaner way to filter out the offending characters besides consuming the stream into a String and processing it? I found this: using a FilterReader but that doesn't work for me as I have a byte stream and not a character stream.

For what it's worth this is all part of a jaxb unmarshalling procedure, just in case that offers options.

We aren't willing to toss the whole stream if it has bad characters. We have decided to remove them and carry on.

Here is a FilterReader I tried to build.

public class InvalidXMLCharacterFilterReader extends FilterReader {

    private static final Log LOG = LogFactory
    .getLog(InvalidXMLCharacterFilterReader.class);

    public InvalidXMLCharacterFilterReader(Reader in) {
        super(in);
    }

    public int read() throws IOException {
        char[] buf = new char[1];
        int result = read(buf, 0, 1);
        if (result == -1)
        return -1;
        else
        return (int) buf[0];
    }

    public int read(char[] buf, int from, int len) throws IOException {
        int count = 0;
        while (count == 0) {
            count = in.read(buf, from, len);
            if (count == -1)
                return -1;

            int last = from;
            for (int i = from; i < from + count; i++) {
                LOG.debug("" + (char)buf[i]);
                if(!isBadXMLChar(buf[i])) {
                    buf[last++] = buf[i];
                }
            }

            count = last - from;
        }
        return count;
    }

    private boolean isBadXMLChar(char c) {
        if ((c == 0x9) ||
            (c == 0xA) ||
            (c == 0xD) ||
            ((c >= 0x20) && (c <= 0xD7FF)) ||
            ((c >= 0xE000) && (c <= 0xFFFD)) ||
            ((c >= 0x10000) && (c <= 0x10FFFF))) {
            return false;
        }
        return true;
    }

}

And here is how I am unmarshalling it:

jaxbContext = JAXBContext.newInstance(MyObj.class);
Unmarshaller unMarshaller = jaxbContext.createUnmarshaller();
Reader r = new InvalidXMLCharacterFilterReader(new BufferedReader(new InputStreamReader(is, "UTF-8")));
MyObj obj = (MyObj) unMarshaller.unmarshal(r);

and some example bad xml

<?xml version="1.0" encoding="UTF-8" ?>
<foo>
    bar
</foo>

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

你怎么这么可爱啊 2024-09-16 07:39:44

为了使用过滤器执行此操作，过滤器需要能够识别 XML 实体，因为（至少在您的示例中并且有时可能在实际使用中）坏字符作为实体存在于 xml 中。

过滤器将您的实体视为由 6 个完全可接受的字符组成的序列，因此不会剥离它们。

破坏 JAXB 的转换将在此过程的后期发生。

回复收藏 0 原文

~没有更多了~

关于作者

二手情话

暂无简介

文章

26 人气

关注发私信

友情链接

文江博客

从流中过滤/删除无效的 xml 字符

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（1）

关于作者

相关话题

热门标签

推荐作者

Promise

qq_lbRlsh

待＂谢繁草

yy2010hell

漫无边际

傲娇萝莉攻

友情链接

从流中过滤/删除无效的 xml 字符

如果你对这篇内容有疑问，欢迎到本站社区发帖提问 参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（1）

关于作者

相关话题

热门标签

推荐作者

Promise

qq_lbRlsh

待＂谢繁草

yy2010hell

漫无边际

傲娇萝莉攻

友情链接

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。