How best to read and UTF-8 decode byte buffers?
I have a Stream that produces UTF-8 encoded strings. The strings represent XML documents that I need to parse. The stream is obtained from a TcpClient.
Suppose I read the stream into buffers of size 64 (a little small, I know). Passing these 64 byte buffers directly to the string decoding step could fail because some UTF-8 encoded characters may be split along the 64 byte boundary. The buffer may end with the first two bytes of a character and the next buffer has the last byte for this character.
What I do now is concatenate buffers until I perform a read that returns fewer than 64 bytes, which I take to mean that I have read to the end of something (in my case, an XML document). However, once in a while, an XML document I read ends exactly at the 64-byte boundary. In that case, I do not know that I can already pass the byte array to the decoding step, and I end up waiting for the next document.
I realize I can lower the chances by increasing the buffer size. However, a small chance always remains that it happens. I could also increase the buffer size such that any XML document I encounter will fit, but I just wonder whether there is another solution, somehow detecting from the byte stream where the character boundaries are.
Comments (2)
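You are right about the problems and pitfalls.
The solution already exists: wrap a StreamReader around your stream and use Read() and ReadLine().
If you do want a DIY solution, you'll have to look at the state kept by the Encoder/Decoder classes (for decoding, a Decoder obtained from Encoding.GetDecoder() carries an incomplete byte sequence over to the next call). That is beyond my capabilities.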
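For the DIY route, a minimal sketch of what a stateful decoder does with a character split across two buffers might look like the following. The class and method names (`ChunkedUtf8`, `DecodeChunks`) are made up for illustration; only `Encoding.UTF8.GetDecoder()`, `Decoder.GetCharCount` and `Decoder.GetChars` are actual .NET APIs.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not part of .NET: decodes byte chunks that may
// split a multi-byte UTF-8 character across chunk boundaries.
public static class ChunkedUtf8
{
    public static string DecodeChunks(IEnumerable<byte[]> chunks)
    {
        // One Decoder instance for the whole stream: it remembers the
        // leading bytes of an incomplete character between calls.
        Decoder decoder = Encoding.UTF8.GetDecoder();
        var result = new StringBuilder();

        foreach (byte[] chunk in chunks)
        {
            // flush: false means "more bytes may follow", so trailing
            // partial characters are held back instead of being replaced.
            int charCount = decoder.GetCharCount(chunk, 0, chunk.Length, false);
            char[] chars = new char[charCount];
            int written = decoder.GetChars(chunk, 0, chunk.Length, chars, 0, false);
            result.Append(chars, 0, written);
        }

        return result.ToString();
    }
}
```

With a real stream, each call would pass the read buffer together with the number of bytes actually returned by `Read`, rather than whole arrays. This only solves the character-boundary problem; it does not tell you where one XML document ends and the next begins.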
I believe that your approach is theoretically flawed, even if it should always work correctly in practice: there is no guarantee that a successful read of less than (buffer size) indicates that an XML document has been received in its entirety. The TCP stack is fully within its rights to give you back the document one byte at a time. Increasing the buffer size to several KB should cause this problem to manifest itself.
Addressing the above flaw properly will also solve your current issue: prepend some kind of fixed-length header (e.g. 8 bytes) that contains the following document's length before each XML document in your TCP stream. You will always know when you have read a full header (because it's fixed size), and given the header you will know when you have received the whole document.
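As a rough sketch of that framing idea, assuming you control both ends of the connection: the 8-byte header below is taken to be the document's length in bytes as a big-endian integer, which is one possible layout and not something the answer prescribes. `FramedXml`, `WriteDocument`, `TryReadDocument` and `Fill` are hypothetical names.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical framing helper: [8-byte big-endian length][UTF-8 XML bytes].
public static class FramedXml
{
    private const int HeaderSize = 8; // assumed header layout

    public static void WriteDocument(Stream stream, string xml)
    {
        byte[] body = Encoding.UTF8.GetBytes(xml);
        byte[] header = new byte[HeaderSize];
        long length = body.Length;
        for (int i = HeaderSize - 1; i >= 0; i--)
        {
            header[i] = (byte)(length & 0xFF);
            length >>= 8;
        }
        stream.Write(header, 0, header.Length);
        stream.Write(body, 0, body.Length);
    }

    // Returns false if the stream ends cleanly before a new header starts.
    public static bool TryReadDocument(Stream stream, out string xml)
    {
        xml = string.Empty;
        byte[] header = new byte[HeaderSize];
        if (!Fill(stream, header, true)) return false;

        long length = 0;
        foreach (byte b in header) length = (length << 8) | b;
        if (length < 0 || length > int.MaxValue)
            throw new InvalidDataException("Unsupported document length.");

        byte[] body = new byte[(int)length];
        Fill(stream, body, false);
        // The whole document is in memory, so no character can be split.
        xml = Encoding.UTF8.GetString(body);
        return true;
    }

    // Loops until the buffer is full, because Read may return fewer bytes
    // than requested (down to a single byte) on a network stream.
    private static bool Fill(Stream stream, byte[] buffer, bool endAllowedAtStart)
    {
        int offset = 0;
        while (offset < buffer.Length)
        {
            int read = stream.Read(buffer, offset, buffer.Length - offset);
            if (read == 0)
            {
                if (offset == 0 && endAllowedAtStart) return false;
                throw new EndOfStreamException("Stream ended in the middle of a frame.");
            }
            offset += read;
        }
        return true;
    }
}
```

Because the reader always knows how many bytes it is waiting for, a short read no longer has to carry any meaning, and a document that happens to end on a buffer boundary is handled like any other.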