Filter/remove invalid XML characters from a stream
First things first: I cannot change the XML output, because it is produced by a third party. They are inserting invalid characters into the XML. I am given an InputStream containing the byte representation of the XML. Is there a cleaner way to filter out the offending characters than consuming the stream into a String and processing it? I found this: using a FilterReader, but that doesn't work for me as I have a byte stream and not a character stream.
For what it's worth, this is all part of a JAXB unmarshalling procedure, in case that offers any options.
We aren't willing to discard the whole stream if it contains bad characters. We have decided to remove them and carry on.
Here is a FilterReader I tried to build:
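import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
// Log and LogFactory are assumed to come from Apache Commons Logging.
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class InvalidXMLCharacterFilterReader extends FilterReader {
    private static final Log LOG = LogFactory
            .getLog(InvalidXMLCharacterFilterReader.class);

    public InvalidXMLCharacterFilterReader(Reader in) {
        super(in);
    }

    // Single-character read, delegating to the filtering bulk read below.
    @Override
    public int read() throws IOException {
        char[] buf = new char[1];
        int result = read(buf, 0, 1);
        if (result == -1)
            return -1;
        else
            return (int) buf[0];
    }

    // Reads a chunk, compacts it in place by dropping bad characters, and
    // reads again if the whole chunk was dropped, so 0 is never returned.
    @Override
    public int read(char[] buf, int from, int len) throws IOException {
        int count = 0;
        while (count == 0) {
            count = in.read(buf, from, len);
            if (count == -1)
                return -1;
            int last = from;
            for (int i = from; i < from + count; i++) {
                LOG.debug("" + (char) buf[i]);
                if (!isBadXMLChar(buf[i])) {
                    buf[last++] = buf[i];
                }
            }
            count = last - from;
        }
        return count;
    }

    // Ranges follow the XML 1.0 Char production. Note: a Java char is at most
    // 0xFFFF, so the last range can never match here, and surrogate halves
    // (0xD800-0xDFFF) are therefore treated as bad.
    private boolean isBadXMLChar(char c) {
        if ((c == 0x9) ||
            (c == 0xA) ||
            (c == 0xD) ||
            ((c >= 0x20) && (c <= 0xD7FF)) ||
            ((c >= 0xE000) && (c <= 0xFFFD)) ||
            ((c >= 0x10000) && (c <= 0x10FFFF))) {
            return false;
        }
        return true;
    }
}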
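As a side note on the range check in isBadXMLChar: because it tests one 16-bit char at a time, the 0x10000-0x10FFFF branch is unreachable and both halves of a surrogate pair get stripped. A minimal sketch of a code-point-based check is shown below. The class and method names (XmlChars, isValidXmlCodePoint, stripInvalid) are made up for illustration, and it works on a String for brevity; a streaming Reader would additionally have to cope with a surrogate pair split across two reads.

```java
final class XmlChars {
    private XmlChars() {}

    /** True if cp falls inside the XML 1.0 Char ranges used by the filter above. */
    static boolean isValidXmlCodePoint(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns s with every code point outside those ranges removed. */
    static String stripInvalid(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            // codePointAt joins a valid surrogate pair into one code point;
            // an unpaired surrogate comes back as itself and is rejected.
            int cp = s.codePointAt(i);
            if (isValidXmlCodePoint(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

With this check, a supplementary character such as U+1F600 survives, while control characters and unpaired surrogates are dropped.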
And here is how I am unmarshalling it:
jaxbContext = JAXBContext.newInstance(MyObj.class);
Unmarshaller unMarshaller = jaxbContext.createUnmarshaller();
Reader r = new InvalidXMLCharacterFilterReader(new BufferedReader(new InputStreamReader(is, "UTF-8")));
MyObj obj = (MyObj) unMarshaller.unmarshal(r);
And some example bad XML:
<?xml version="1.0" encoding="UTF-8" ?>
<foo>
bar
</foo>
Comments (1)
In order to do this with a filter, the filter needs to be XML-entity-aware, because (at least in your example, and likely sometimes in actual use) the bad characters are in the XML as entities.
The filter is seeing your entity as a sequence of 6 perfectly acceptable characters and thus not stripping them.
The conversion that breaks JAXB is happening later in the process.
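One way to make the filtering entity-aware, sketched here under some assumptions, is to drop numeric character references whose value is not a legal XML 1.0 character before the parser expands them. The class and method names (XmlCharRefs, stripInvalidCharRefs) are hypothetical, and the reference `&#x01;` used in the usage note is only an illustrative guess, since the entity from the question's sample is not visible. The sketch works on a String rather than a stream, and it does not distinguish CDATA sections or comments, where such a sequence is literal text rather than a reference.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class XmlCharRefs {
    private XmlCharRefs() {}

    // Matches hexadecimal (&#x01;) and decimal (&#1;) numeric character references.
    private static final Pattern CHAR_REF =
            Pattern.compile("&#(?:x([0-9a-fA-F]+)|([0-9]+));");

    /** True if cp falls inside the XML 1.0 Char ranges. */
    static boolean isValidXmlCodePoint(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Removes character references that resolve to an invalid XML character. */
    static String stripInvalidCharRefs(String xml) {
        Matcher m = CHAR_REF.matcher(xml);
        StringBuilder out = new StringBuilder(xml.length());
        int last = 0;
        while (m.find()) {
            out.append(xml, last, m.start());   // copy text before the reference
            if (refersToValidChar(m)) {
                out.append(m.group());          // keep legal references untouched
            }
            last = m.end();
        }
        out.append(xml, last, xml.length());    // copy the remaining tail
        return out.toString();
    }

    private static boolean refersToValidChar(Matcher m) {
        try {
            int cp = (m.group(1) != null)
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
            return isValidXmlCodePoint(cp);
        } catch (NumberFormatException e) {
            return false; // value too large to be any code point
        }
    }
}
```

For example, `XmlCharRefs.stripInvalidCharRefs("<foo>bar&#x01;</foo>")` yields `<foo>bar</foo>`, while a legal reference such as `&#65;` is left in place. Raw control characters would still need a character-level filter like the one in the question.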