The proper way to format an input stream

Posted on 2024-10-10 10:54:19

I have the following issue: my program is passed an InputStream whose contents I cannot control. I unmarshal the stream using the javax library (JAXB), which rightfully throws an exception if the stream contains an & character that is not followed by "amp;".

The workaround I came up with was to create the following class:

import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.InputStream;

/**
 * Provide an input stream where all & characters are properly encoded as &amp;
 */
public class FormattedStream extends FilterInputStream {
  public FormattedStream(InputStream src) {
    super(new ByteArrayInputStream(StringUtil.toString(src)
      .replace("&", "&amp;").replace("amp;amp;", "amp;").getBytes()));
  }
}

Note: StringUtil is a simple utility I have to turn an input stream into a String.

With that class in place, I now invoke the JAXB unmarshaller with:

unmarshal(new FormattedStream(inputStream));

instead of

unmarshal(inputStream);

This approach works but does seem odd for a few reasons:

1 - Because of the restriction that super must be the first statement in the constructor (a restriction I fail to understand despite what I have read about it), I am forced to do all my processing in one line, making the code far from readable.

2 - Converting the entire stream into a String and back into a stream seems overkill.

3 - The code above is slightly incorrect in that a stream containing amp;amp; will be modified to contain amp;

I could address 1 by providing a FormatInputStream class with one method:

InputStream preProcess(InputStream inputStream)

where I would do the same operations I am currently doing in the constructor of my FormattedStream class but it seems odd to have to choose a different interface because of a coding limitation.

I could address 2 by keeping my FormattedStream constructor simple:

super(src)

and overriding the three read methods, but that would involve much more code: replacing the & on the fly in three read methods is not trivial compared to the one line of code I currently have, where I can leverage the String replace method.

As for 3, it seems enough of a corner case that I don't worry about it but maybe I should...

Any suggestions on how to solve my issue in a more elegant way?

Comments (2)

我不是你的备胎 2024-10-17 10:54:19

I agree with McDowell's answer that the most important thing is to fix the invalid data source in the first place.

Anyway, here is an InputStream which looks for lonely & characters and marries them with an additional amp; in case it's missing. Again, fixing broken data this way does not pay off most of the time.

This solution fixes the three flaws mentioned in the OP and shows only one way to implement transforming InputStreams.

  • Within the constructor, only the reference to the original InputStream is held. No processing takes place in the constructor, until the stream is really asked for data (by calls to read()).
  • The content is not converted into one large String for transformation. Instead, the stream works as a stream and only performs minimal read-ahead (e.g. the four bytes necessary to find out whether & is followed by amp; or not).
  • The stream only replaces lonely &, and does not try to clean up amp;amp; in any way, because they don't happen with this solution.

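import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Passes bytes through unchanged, except that every '&' which is not already
 * followed by "amp;" gets "amp;" appended.
 */
public class ReplacerInputStream extends InputStream {

    // ASCII bytes, independent of the platform default charset
    private static final byte[] REPLACEMENT = "amp;".getBytes(StandardCharsets.US_ASCII);
    private final byte[] readBuf = new byte[REPLACEMENT.length];
    // Bytes that were peeked or generated and still have to be returned
    private final Deque<Byte> backBuf = new ArrayDeque<Byte>();
    private final InputStream in;

    public ReplacerInputStream(InputStream in) {
        this.in = in;
    }

    @Override
    public int read() throws IOException {
        if (!backBuf.isEmpty()) {
            // Mask to 0..255: an unboxed Byte is signed, and read() must
            // only return a negative value (-1) at end of stream
            return backBuf.pop() & 0xFF;
        }
        int first = in.read();
        if (first == '&') {
            peekAndReplace();
        }
        return first;
    }

    private void peekAndReplace() throws IOException {
        // InputStream.read(byte[], int, int) calls read() of this class, so any
        // '&' among the peeked bytes has already been handled recursively
        int read = super.read(readBuf, 0, REPLACEMENT.length);
        // Push the peeked bytes back in reverse order so they pop in order
        for (int i = read - 1; i >= 0; i--) {
            backBuf.push(readBuf[i]);
        }
        for (int i = 0; i < REPLACEMENT.length; i++) {
            if (read != REPLACEMENT.length || readBuf[i] != REPLACEMENT[i]) {
                // Not followed by "amp;": queue it ahead of the peeked bytes
                for (int j = REPLACEMENT.length - 1; j >= 0; j--) {
                    backBuf.push(REPLACEMENT[j]);
                }
                return;
            }
        }
    }

}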

The code has been tested with the following input data (first parameter is expected output, second parameter is raw input):

    test("Foo &amp; Bar", "Foo & Bar");
    test("&amp;&amp;&amp;", "&&&");
    test("&amp;&amp;&amp; ", "&&& ");
    test(" &amp;&amp;&amp;", " &&&");
    test("&amp;", "&");
    test("&amp;", "&amp;");
    test("&amp;&amp;", "&amp;&amp;");
    test("&amp;&amp;&amp;", "&amp;&&amp;");
    test("test", "test");
    test("", "");
    test("testtesttest&amp;", "testtesttest&");

泪之魂 2024-10-17 10:54:19

To avoid reading all the data into RAM, you could implement a FilterInputStream: you would have to override both read() and read(byte[],int,int) and buffer those extra bytes somehow. This will not result in shorter code.
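As a rough illustration of that idea, here is a minimal sketch that lets java.io.PushbackInputStream do the buffering of the extra bytes. The class name AmpersandFilterStream is made up for this example, and the sketch assumes an ASCII-compatible encoding such as UTF-8 or ISO-8859-1, where '&' is always the single byte 0x26. It shares the limitation of the original workaround: other entity references (such as &lt;) and CDATA sections are not recognized.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical name; a sketch, not a complete XML repair tool.
public class AmpersandFilterStream extends FilterInputStream {

    private static final byte[] SUFFIX = "amp;".getBytes(StandardCharsets.US_ASCII);
    private final PushbackInputStream pushback;

    public AmpersandFilterStream(InputStream src) {
        // Room for the peeked bytes plus an inserted "amp;"
        this(new PushbackInputStream(src, 2 * SUFFIX.length));
    }

    private AmpersandFilterStream(PushbackInputStream pushback) {
        super(pushback);
        this.pushback = pushback;
    }

    @Override
    public int read() throws IOException {
        int b = pushback.read();
        if (b == '&') {
            // Peek at up to four following bytes
            byte[] peek = new byte[SUFFIX.length];
            int n = 0;
            while (n < peek.length) {
                int c = pushback.read();
                if (c < 0) {
                    break;
                }
                peek[n++] = (byte) c;
            }
            boolean escaped = n == SUFFIX.length && Arrays.equals(peek, SUFFIX);
            // Unread in reverse order of delivery: peeked bytes first,
            // then the suffix that has to come out before them
            if (n > 0) {
                pushback.unread(peek, 0, n);
            }
            if (!escaped) {
                pushback.unread(SUFFIX);
            }
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        // Simple but correct: route every byte through read()
        int count = 0;
        while (count < len) {
            int b = read();
            if (b < 0) {
                break;
            }
            buf[off + count++] = (byte) b;
        }
        return count == 0 ? -1 : count;
    }
}
```

It would be used the same way as the class in the question, e.g. unmarshal(new AmpersandFilterStream(inputStream)). The bulk read simply loops over read(), which keeps the logic in one place at the cost of some throughput.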


The real solution is to fix the invalid data source (and if you're going to automate that, you need to look at writing your own XML parser).

Your approach has a few flaws.

  • The result of String.getBytes() is system dependent; it is also a transcoding operation that may not be symmetrical with whatever StringUtil.toString does - default encodings on many systems are lossy. You should perform the transcoding using the XML document encoding.
  • A global search-and-replace like this may corrupt your document - ampersands can exist in CDATA, entities and entity declarations.
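
To make the first point concrete, here is a minimal sketch of the whole-document approach with an explicit charset on both the decoding and the encoding side. The class name AmpersandEscaper is hypothetical, preProcess borrows the method name proposed in the question, and the caller is assumed to already know the document encoding (for example from the XML declaration). The negative lookahead also avoids the amp;amp; clean-up step. The second point still applies: this is a blind text replacement that will mangle other entity references and CDATA content.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.util.regex.Pattern;

// Hypothetical helper; a sketch under the assumptions stated above.
public final class AmpersandEscaper {

    // An '&' that is not already followed by "amp;"
    private static final Pattern BARE_AMP = Pattern.compile("&(?!amp;)");

    private AmpersandEscaper() {
    }

    public static InputStream preProcess(InputStream src, Charset charset) throws IOException {
        // Read the whole stream into memory
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = src.read(chunk)) != -1) {
            raw.write(chunk, 0, n);
        }
        // Decode and re-encode with the same, explicitly chosen charset
        String text = new String(raw.toByteArray(), charset);
        String fixed = BARE_AMP.matcher(text).replaceAll("&amp;");
        return new ByteArrayInputStream(fixed.getBytes(charset));
    }
}
```

A call would look like unmarshal(AmpersandEscaper.preProcess(inputStream, StandardCharsets.UTF_8)). Because a static method may also be called inside the argument of super(...), the same helper would let the FormattedStream constructor stay readable without changing its interface.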