Proper way to format an input stream
I have the following issue: my program is passed an InputStream whose contents I cannot control. I unmarshal the input stream using the javax library (JAXB), which rightfully throws an exception if the InputStream contains an & character that is not followed by "amp;".
The workaround I came up with was to create the following class:
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.InputStream;

/**
 * Provide an input stream where all & characters are properly encoded as &amp;
 */
public class FormattedStream extends FilterInputStream {
    public FormattedStream(InputStream src) {
        super(new ByteArrayInputStream(StringUtil.toString(src)
                .replace("&", "&amp;").replace("amp;amp;", "amp;").getBytes()));
    }
}
Note: StringUtil is a simple utility I have to turn an input stream into a String.
With that class in place, I now invoke the JAXB unmarshaller with:
unmarshal(new FormattedStream(inputStream));
instead of
unmarshal(inputStream);
This approach works but does seem odd for a few reasons:
1 - Because of the restriction that super must be the first statement in the constructor (a restriction I fail to understand despite what I have read about it), I am forced to do all my processing in one line, making the code far from readable.
2 - Converting the entire stream into a String and back into a stream seems overkill.
3 - The code above is slightly incorrect in that a stream containing the literal text amp;amp; will be modified to contain amp;
I could address 1 by providing a FormatInputStream class with one method:
InputStream preProcess(InputStream inputStream)
where I would do the same operations I currently do in the constructor of my FormattedStream class, but it seems odd to have to choose a different interface because of a coding limitation.
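On point 1, Java does allow a static method call inside the super(...) argument, since such a call does not touch the instance under construction. The processing can therefore move into a private static helper without changing the class's interface. The following is only a sketch: the class name HelperFormattedStream, the helper encodeAmpersands, and the UTF-8 encoding are assumptions for illustration, and the in-memory read loop stands in for StringUtil.toString.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Sketch: same buffering approach as FormattedStream, with the work in a static helper. */
class HelperFormattedStream extends FilterInputStream {

    HelperFormattedStream(InputStream src) throws IOException {
        // Still the first statement of the constructor; the helper is static,
        // so it may run before the superclass constructor.
        super(encodeAmpersands(src));
    }

    private static InputStream encodeAmpersands(InputStream src) throws IOException {
        // Read the whole source into memory (same trade-off as the one-liner).
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = src.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        // Assumption: the document is UTF-8; substitute the real encoding.
        String text = new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        // Same two replacements as the original constructor.
        String fixed = text.replace("&", "&amp;").replace("amp;amp;", "amp;");
        return new ByteArrayInputStream(fixed.getBytes(StandardCharsets.UTF_8));
    }
}
```

The call site stays the same shape as before, for example unmarshal(new HelperFormattedStream(inputStream)), except that the constructor now declares IOException.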
I could address 2 by keeping my FormattedStream constructor simple:
super(src)
and overriding the three read methods, but that would involve much more coding: replacing the & on the fly inside the overridden read methods is not trivial compared to the one line of code I currently have, where I can leverage the String replace method.
As for 3, it seems enough of a corner case that I don't worry about it but maybe I should...
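The corner case in point 3 comes from the second replace call, which also rewrites a literal amp;amp; that never had a leading &. One possible way to avoid it, sketched here under assumptions, is a regular expression with a negative lookahead that only touches an & not already followed by amp;. The class name LonelyAmpersands and its encode method are hypothetical. Note that this sketch would still rewrite other entity references such as &lt; into &amp;lt;, exactly as the original code does.

```java
import java.util.regex.Pattern;

/** Sketch: encode only the '&' characters that are not already followed by "amp;". */
class LonelyAmpersands {
    // '&' with a negative lookahead: no match when "amp;" comes right after it.
    private static final Pattern LONELY_AMP = Pattern.compile("&(?!amp;)");

    static String encode(String text) {
        // '&' has no special meaning in a Java replacement string.
        return LONELY_AMP.matcher(text).replaceAll("&amp;");
    }
}
```

With this approach a text such as "x amp;amp; y" passes through unchanged, because it contains no & at all and there is no second cleanup step.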
Any suggestions on how to solve my issue in a more elegant way?
Comments (2)
I agree with McDowell's answer that the most important thing is to fix the invalid data source in the first place.
Anyway, here is an InputStream which looks for lonely & characters and marries them with an additional amp; in case it's missing. Again, fixing broken data this way does not pay off most of the time.

This solution fixes the three flaws mentioned in the OP and shows only one way to implement transforming InputStreams.
The stream reads ahead only the four bytes necessary to find out whether & is followed by amp; or not.
It only handles a lonely &, and does not try to clean up amp;amp; in any way, because they don't happen with this solution.
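The answer's own listing is not shown here. As a rough sketch of the idea it describes, and not the original implementation, the following class wraps the source in a java.io.PushbackInputStream, peeks at the four bytes after each & and injects amp; when they are missing. The class name AmpersandFixingInputStream and its internals are assumptions for illustration. The sketch works on raw bytes, so it assumes an ASCII-compatible encoding such as UTF-8 or ISO-8859-1.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.util.Arrays;

/** Sketch: appends "amp;" after every '&' that is not already followed by it. */
class AmpersandFixingInputStream extends InputStream {
    private static final byte[] SUFFIX = {'a', 'm', 'p', ';'};

    private final PushbackInputStream source;
    // Index of the next suffix byte to inject; SUFFIX.length means nothing pending.
    private int pending = SUFFIX.length;

    AmpersandFixingInputStream(InputStream src) {
        // No processing here: only the reference is kept, with room to
        // push back the four bytes that are read ahead.
        this.source = new PushbackInputStream(src, SUFFIX.length);
    }

    @Override
    public int read() throws IOException {
        if (pending < SUFFIX.length) {
            return SUFFIX[pending++]; // hand out the injected "amp;" first
        }
        int b = source.read();
        if (b == '&' && !suffixFollows()) {
            pending = 0; // lonely '&': schedule "amp;" for the next reads
        }
        return b;
    }

    /** Peeks at up to four bytes, then pushes them back unchanged. */
    private boolean suffixFollows() throws IOException {
        byte[] ahead = new byte[SUFFIX.length];
        int count = 0;
        while (count < ahead.length) {
            int n = source.read(ahead, count, ahead.length - count);
            if (n < 0) {
                break; // end of stream before four bytes were available
            }
            count += n;
        }
        source.unread(ahead, 0, count);
        return count == SUFFIX.length && Arrays.equals(ahead, SUFFIX);
    }

    @Override
    public void close() throws IOException {
        source.close();
    }
}
```

Extending InputStream rather than FilterInputStream keeps the sketch short: the inherited read(byte[], int, int) and skip are built on read(), so they go through the same transformation. A production version would likely override the bulk read for speed.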
The code has been tested with the following input data (first parameter is expected output, second parameter is raw input):
To avoid reading all the data into RAM, you could implement a FilterInputStream (you would have to override both read() and read(byte[],int,int) and look at buffering those extra bytes somehow). This will not result in shorter code.

The real solution is to fix the invalid data source (and if you're going to automate that, you need to look at writing your own XML parser).

Your approach has a few flaws.

The result of String.getBytes() is system dependent; it is also a transcoding operation that may not be symmetrical with whatever StringUtil.toString does - default encodings on many systems are lossy. You should perform the transcoding using the XML document encoding.
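To illustrate the transcoding point, here is a small sketch that decodes and re-encodes with the same explicitly supplied charset instead of relying on the platform default. The class name CharsetAwareEncoder and the documentEncoding parameter are hypothetical; how the document encoding is discovered (for example from the XML declaration) is outside this sketch.

```java
import java.nio.charset.Charset;

/** Sketch: keep decoding and encoding symmetrical by naming the charset explicitly. */
class CharsetAwareEncoder {

    static byte[] encodeAmpersands(byte[] rawXml, Charset documentEncoding) {
        // Decode with the document's encoding, not the platform default.
        String text = new String(rawXml, documentEncoding);
        // Same replacements as in the question's FormattedStream.
        String fixed = text.replace("&", "&amp;").replace("amp;amp;", "amp;");
        // Encode with the same charset so the round trip is symmetrical.
        return fixed.getBytes(documentEncoding);
    }
}
```

If the charset passed in cannot represent the document's characters (for instance US-ASCII for UTF-8 data containing accented letters), those characters are replaced during the round trip, which is the kind of loss the answer warns about.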