Handling malformed XML in .NET 3.5

Posted on 2024-11-16 20:37:06

A third-party system streams XML to me via TCP. The total transmitted XML content (not a single message in the stream, but all messages concatenated) looks like this:

   <root>
      <insert ....><remark>...</remark></insert>
      <delete ....><remark>...</remark></delete>
      <insert ....><remark>...</remark></insert>
      ....
      <insert ....><remark>...</remark></insert>
   </root>

Every line of the above sample can be processed individually. Since this is a streaming process, I cannot wait until everything has arrived; I have to process the content as it comes in. The problem is that the content can be sliced into chunks at any point, with no regard for tag boundaries.
Do you have any good advice on how to process the content if it arrives in fragments like this?

Chunk 1:

  <root>
      <insert ....><rem

Chunk 2:

                      ark>...</remark></insert>
      <delete ....><remark>...</remark></delete>
      <insert ....><remark>...</rema

Chunk N:

                                    rk></insert>
      ....
      <insert ....><remark>...</remark></insert>
   </root>

EDIT:

While processing speed is not a concern (there are no real-time constraints), I cannot wait for the entire message. In practice, the last chunk never arrives: the third-party system sends a message whenever it encounters a change, so the stream never ends.

Comments (2)

情释 2024-11-23 20:37:06

My first thought for this problem is to create a simple class derived from TextReader that is responsible for buffering input from the stream. This class would then be used to feed an XmlReader. The derived TextReader could fairly easily scan the incoming content for complete "blocks" of XML (a complete element with starting and ending brackets, a text fragment, a full attribute, etc.). It could also provide a flag to the calling code to indicate when one or more "blocks" are available, so the caller can ask the XmlReader for the next XML node, which would in turn pull that block from the derived TextReader and remove it from the buffer.

Edit: Here's a quick and dirty example. I have no idea if it works perfectly (I haven't tested it), but it gets across the idea I was trying to convey.

using System;
using System.Collections.Generic;
using System.IO;

public class StreamingXmlTextReader : TextReader
{
    private readonly Queue<string> _blocks = new Queue<string>();
    private string _buffer = String.Empty;
    private string _currentBlock = null;
    private int _currentPosition = 0;

    //Returns true if there are blocks available and the XmlReader can go to the next XML node
    public bool AddFromStream(string content)
    {
        //Here is where we would scan for simple blocks of XML
        //This simple chunking algorithm just uses a closing angle bracket
        //Not sure if/how well this will work in practice, but you get the idea
        _buffer = _buffer + content;
        int start = 0;
        int end = _buffer.IndexOf('>');
        while (end != -1)
        {
            //Include the closing bracket itself in the block
            _blocks.Enqueue(_buffer.Substring(start, end - start + 1));
            start = end + 1;
            end = _buffer.IndexOf('>', start);
        }

        //Store the leftover (possibly empty) until the next chunk arrives
        _buffer = _buffer.Substring(start);

        return BlocksAvailable;
    }

    //Lets the caller know if any blocks are currently available, signaling the XmlReader can ask for another node
    public bool BlocksAvailable { get { return _blocks.Count > 0; } }

    public override int Read()
    {
        //Move on to the next block once the current one is used up
        if (_currentBlock == null || _currentPosition >= _currentBlock.Length)
        {
            if (!BlocksAvailable)
            {
                //Nothing buffered; note that a reader treats -1 as end of input
                return -1;
            }
            _currentBlock = _blocks.Dequeue();
            _currentPosition = 0;
        }

        //Get the next character in this block
        return _currentBlock[_currentPosition++];
    }
}
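
A coarser variation on the same buffering idea, as a rough sketch only: rather than cutting at every closing angle bracket, buffer the text and hand off each complete top-level element as a string. The class name `ElementSplitter` and its `Add` method are hypothetical and not part of the answer above. The sketch assumes that only `insert` and `delete` elements appear under the root, that they never nest or self-close, and that their closing tags never occur inside comments or CDATA sections.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: buffers text chunks and returns every complete
// <insert>...</insert> or <delete>...</delete> element seen so far.
public class ElementSplitter
{
    // Assumption: these are the only element names directly under <root>
    private static readonly string[] Names = { "insert", "delete" };
    private readonly StringBuilder _buffer = new StringBuilder();

    public List<string> Add(string chunk)
    {
        _buffer.Append(chunk);
        string text = _buffer.ToString();
        List<string> elements = new List<string>();
        int start = 0;

        while (true)
        {
            // Locate the earliest opening tag of a known element
            int open = -1;
            string name = null;
            foreach (string candidate in Names)
            {
                int pos = text.IndexOf("<" + candidate, start, StringComparison.Ordinal);
                if (pos != -1 && (open == -1 || pos < open))
                {
                    open = pos;
                    name = candidate;
                }
            }
            if (open == -1) break;      // no element has started yet

            string closing = "</" + name + ">";
            int close = text.IndexOf(closing, open, StringComparison.Ordinal);
            if (close == -1) break;     // element is still incomplete

            int end = close + closing.Length;
            elements.Add(text.Substring(open, end - open));
            start = end;
        }

        // Keep only the unprocessed tail for the next chunk
        _buffer.Remove(0, start);
        return elements;
    }
}
```

Each returned string is a well-formed element on its own, so it can be parsed separately, for example with `XElement.Parse` from `System.Xml.Linq`. The never-closed `<root>` start tag is simply skipped, which sidesteps the fact that the overall document is never complete.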

跨年 2024-11-23 20:37:06

After further investigation, we figured out that the XML stream was being sliced up whenever the TCP buffer filled up. The cuts therefore happened at effectively random positions in the byte stream, even in the middle of Unicode characters.
We therefore had to assemble the parts at the byte level and convert them back to text. If the conversion failed, we waited for the next byte chunk and tried again.
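
One way to do the byte-level reassembly, sketched here with a different mechanism from the retry approach described above: `System.Text.Decoder` keeps an incomplete trailing byte sequence between calls, so a character cut in half by a chunk boundary is emitted once its remaining bytes arrive. The class name `ChunkDecoder` is hypothetical, and UTF-8 is an assumption; the post only says "Unicode", so the actual encoding of the stream would need to be confirmed.

```csharp
using System;
using System.Text;

// Hypothetical helper: converts arbitrarily sliced byte chunks to text.
public class ChunkDecoder
{
    // Assumption: the stream is UTF-8. The Decoder instance remembers
    // the bytes of a partially received character between calls.
    private readonly Decoder _decoder = Encoding.UTF8.GetDecoder();

    // Decodes bytes[offset .. offset + count) and returns the characters
    // that are complete so far; a dangling partial character is held back.
    public string Decode(byte[] bytes, int offset, int count)
    {
        char[] chars = new char[_decoder.GetCharCount(bytes, offset, count)];
        int written = _decoder.GetChars(bytes, offset, count, chars, 0);
        return new string(chars, 0, written);
    }
}
```

The returned text can then be appended to whatever text buffer the XML processing uses. One decoder instance has to be kept per connection, because the carried-over bytes belong to that particular stream.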
