Parsing variable-length descriptors from a byte stream and acting on their type
I'm reading from a byte stream that contains a series of variable-length descriptors, which I'm representing as various structs/classes in my code. Each descriptor has a fixed-length header in common with all the other descriptors, which is used to identify its type.
Is there an appropriate model or pattern I can use to best parse and represent each descriptor, and then perform an appropriate action depending on its type?
Comments (6)
I've written lots of parsers of this type.
I recommend that you read the fixed-length header, and then dispatch to the correct constructor for your structures using a simple switch-case, passing the fixed header and the stream to that constructor so that it can consume the variable part of the stream.
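A minimal sketch of that dispatch in Python, using an if/elif chain as the switch. The wire format (a 1-byte type tag followed by a 2-byte big-endian payload length), the tag values, and the names `Header`, `NameDescriptor`, `SizeDescriptor` and `read_descriptor` are all assumptions made up for illustration; the question does not specify any of them.

```python
import io
import struct
from dataclasses import dataclass

# Assumed header layout (not from the question): 1-byte tag, 2-byte big-endian length.
HEADER = struct.Struct(">BH")


@dataclass
class Header:
    tag: int
    length: int


@dataclass
class NameDescriptor:  # hypothetical descriptor type, tag 0x01
    name: str

    @classmethod
    def from_stream(cls, header, stream):
        # Consume the variable part: `length` bytes of UTF-8 text.
        return cls(stream.read(header.length).decode("utf-8"))


@dataclass
class SizeDescriptor:  # hypothetical descriptor type, tag 0x02
    width: int
    height: int

    @classmethod
    def from_stream(cls, header, stream):
        # Consume the variable part: two big-endian 16-bit integers.
        width, height = struct.unpack(">HH", stream.read(header.length))
        return cls(width, height)


def read_descriptor(stream):
    """Read one descriptor, or return None at a clean end of stream."""
    raw = stream.read(HEADER.size)
    if not raw:
        return None
    if len(raw) < HEADER.size:
        raise EOFError("truncated header")
    header = Header(*HEADER.unpack(raw))
    # The "switch": pick the constructor from the tag in the fixed header.
    if header.tag == 0x01:
        return NameDescriptor.from_stream(header, stream)
    elif header.tag == 0x02:
        return SizeDescriptor.from_stream(header, stream)
    raise ValueError(f"unknown descriptor tag {header.tag:#04x}")
```

Calling `read_descriptor` in a loop until it returns `None` walks the whole stream; each constructor sees the already-parsed header and reads only its own payload.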
This is a common problem in file parsing. Commonly, you read the known part of the descriptor (which luckily is fixed-length in this case, but isn't always), and branch on it there. Generally I use a strategy pattern here, since I usually expect the system to be broadly flexible, but a straight switch or factory may work as well.
The other question is: do you control and trust the downstream code? Meaning: the factory / strategy implementation? If you do, then you can just give them the stream and the number of bytes you expect them to consume (perhaps putting some debug assertions in place, to verify that they do read exactly the right amount).
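For the trusted case, the check can be as small as comparing stream positions before and after the call. This is only a sketch: `read_trusted` and the `parse(stream, length)` callback shape are hypothetical names, and it assumes a seekable stream that supports `tell()`. Python's `assert` is skipped when running with `-O`, which makes it roughly comparable to a debug-only assertion.

```python
import io


def read_trusted(stream, length, parse):
    """Hand the stream to trusted code and check how much it consumed."""
    start = stream.tell()
    result = parse(stream, length)
    consumed = stream.tell() - start
    # Debug-style check: removed entirely when Python runs with -O.
    assert consumed == length, (
        f"parser consumed {consumed} bytes, expected {length}"
    )
    return result
```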
If you can't trust the factory/strategy implementation (perhaps you allow user code to supply custom deserializers), then I would construct a wrapper on top of the stream (example: SubStream from protobuf-net) that only allows the expected number of bytes to be consumed (reporting EOF afterwards) and doesn't allow seek/etc. operations outside of this block. I would also have runtime checks (even in release builds) that enough data has been consumed, but in this case I would probably just read past any unread data: if we expected the downstream code to consume 20 bytes but it only read 12, then skip the next 8 and read our next descriptor.
To expand on that, one strategy design here might have something like:
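One possible shape for this, as a Python sketch rather than the code originally posted here. `BoundedReader` is a hypothetical stand-in for the kind of wrapper described above (it is not protobuf-net's SubStream), and `DescriptorSerializer`, `ShortReadSerializer` and `read_payload` are likewise invented names.

```python
import io
from typing import BinaryIO, Protocol


class BoundedReader:
    """Read-only view exposing at most `limit` bytes of an underlying stream."""

    def __init__(self, stream: BinaryIO, limit: int) -> None:
        self._stream = stream
        self._remaining = limit

    @property
    def remaining(self) -> int:
        return self._remaining

    def read(self, size: int = -1) -> bytes:
        # Clamp every request to what is left in this block; once the
        # block is used up this returns b"", i.e. it reports EOF.
        if size < 0 or size > self._remaining:
            size = self._remaining
        data = self._stream.read(size)
        self._remaining -= len(data)
        return data

    def skip_rest(self) -> int:
        """Discard whatever the consumer left unread; return the byte count."""
        return len(self.read())


class DescriptorSerializer(Protocol):
    """The strategy interface: one implementation per descriptor type."""

    def deserialize(self, source: BoundedReader) -> object: ...


class ShortReadSerializer:
    """Hypothetical downstream code that reads only 12 of its 20 bytes."""

    def deserialize(self, source: BoundedReader) -> bytes:
        return source.read(12)


def read_payload(stream: BinaryIO, length: int, serializer: DescriptorSerializer):
    view = BoundedReader(stream, length)
    value = serializer.deserialize(view)
    view.skip_rest()  # leave the stream positioned at the next descriptor
    return value
```

With a 20-byte payload and a serializer that stops after 12 bytes, `read_payload` skips the remaining 8, so the next header read starts in the right place. The wrapper deliberately has no `seek`, so downstream code cannot wander outside its block.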
You might build a dictionary (or just a list if the number is small) of such serializers per expected marker, resolve your serializer, and then invoke the Deserialize method. If you don't recognise the marker, then (one of):
As a side-note to the above: this approach (strategy) is useful if the system is determined at runtime, either via reflection or via a runtime DSL (etc). If the system is entirely predictable at compile-time (because it doesn't change, or because you are using code-generation), then a straight switch approach may be more appropriate, and you probably don't need any extra interfaces, since you can inject the appropriate code directly.
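A rough sketch of the dictionary lookup in Python. The header layout (1-byte marker, 2-byte big-endian length), the marker values, and the two serializer classes are assumptions for illustration. Unknown markers are simply skipped here; that is just one possible policy chosen for this sketch. Each payload is read up front and handed over as its own `io.BytesIO`, which is a simple way to keep a serializer inside its block.

```python
import io
import struct

# Assumed header layout: 1-byte marker, 2-byte big-endian payload length.
HEADER = struct.Struct(">BH")


class Utf8Serializer:  # hypothetical strategy for marker 0x01
    def deserialize(self, payload):
        return payload.read().decode("utf-8")


class UInt32Serializer:  # hypothetical strategy for marker 0x02
    def deserialize(self, payload):
        return struct.unpack(">I", payload.read(4))[0]


# The registry: one serializer per expected marker. It could equally be
# filled at runtime (plugins, reflection, configuration).
SERIALIZERS = {
    0x01: Utf8Serializer(),
    0x02: UInt32Serializer(),
}


def read_all(stream):
    """Parse every descriptor in the stream and return the decoded values."""
    results = []
    while True:
        raw = stream.read(HEADER.size)
        if not raw:
            break  # clean end of stream
        if len(raw) < HEADER.size:
            raise EOFError("truncated header")
        marker, length = HEADER.unpack(raw)
        payload = stream.read(length)
        if len(payload) < length:
            raise EOFError("truncated payload")
        serializer = SERIALIZERS.get(marker)
        if serializer is None:
            continue  # assumed policy for this sketch: skip unknown markers
        results.append(serializer.deserialize(io.BytesIO(payload)))
    return results
```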
One key thing to remember: if you're reading from the stream and do not detect a valid header/message, throw away only the first byte before trying again. Many times I've seen a whole packet or message get thrown away instead, which can result in valid data being lost.
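To make that concrete, here is a small Python sketch over an in-memory buffer. The framing is invented for the example (sync byte 0xA5, 1-byte length, payload, 1-byte additive checksum), and `encode_frame`, `try_parse` and `extract_frames` are hypothetical names. A real streaming reader would also need to wait for more data when a frame is merely incomplete, which this sketch does not attempt.

```python
SYNC = 0xA5  # assumed sync byte for this made-up framing


def encode_frame(payload):
    """Build a frame: sync, length, payload, checksum (sum of payload mod 256)."""
    return bytes([SYNC, len(payload)]) + payload + bytes([sum(payload) & 0xFF])


def try_parse(buf, pos):
    """Return (payload, next_pos) if a valid frame starts at pos, else None."""
    if pos + 3 > len(buf) or buf[pos] != SYNC:
        return None
    length = buf[pos + 1]
    end = pos + 2 + length  # index of the checksum byte
    if end + 1 > len(buf):
        return None
    payload = bytes(buf[pos + 2:end])
    if sum(payload) & 0xFF != buf[end]:
        return None
    return payload, end + 1


def extract_frames(buf):
    """Collect every valid frame, resynchronising one byte at a time."""
    frames = []
    pos = 0
    while pos < len(buf):
        parsed = try_parse(buf, pos)
        if parsed is None:
            pos += 1  # drop only the first byte, then try again
        else:
            payload, pos = parsed
            frames.append(payload)
    return frames
```

If the buffer starts with a stray `0xA5 0x09` followed by two good frames, the stray header claims the next 9 bytes as its payload. Throwing away that whole "message" would discard both good frames; advancing by a single byte recovers them.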
This sounds like it might be a job for the Factory Method or perhaps Abstract Factory. Based on the header you choose which factory method to call, and that returns an object of the relevant type.
Whether this is better than simply adding constructors to a switch statement depends on the complexity and the uniformity of the objects you're creating.
I would suggest:
With this method:
If you'd like it to be nice OO, you can use the visitor pattern in an object hierarchy. Here is how I did it (for identifying packets captured off the network, which is pretty much the same thing you might need):
A huge object hierarchy, with one parent class.
Each class has a static constructor that registers with its parent, so the parent knows about its direct children (this was C++; I think this step is not needed in languages with good reflection support).
Each class has a static constructor method that gets the remaining part of the byte stream and, based on that, decides whether it is its responsibility to handle that data.
When a packet came in, I simply passed it to the static constructor method of the main parent class (called Packet), which in turn asked each of its children whether it was their responsibility to handle that packet. This went on recursively until one class at the bottom of the hierarchy returned the instantiated class.
Each of the static "constructor" methods cut its own header from the byte stream and passed only the payload down to its children.
The upside of this approach is that you can add new types anywhere in the object hierarchy WITHOUT needing to see/change ANY other class. It worked remarkably well for packets; it went like this:
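The structure described above could look roughly like the following Python sketch; this is not the C++ code from the original project. It assumes a toy format in which each layer starts with a one-byte magic value, and the class names other than Packet (`Control`, `Ping`, `Data`) and the methods `claims` and `parse` are invented. Python's `__init_subclass__` hook plays the role of the registering static constructor.

```python
class Packet:
    MAGIC = b""      # header bytes identifying this layer (assumed toy format)
    _children = []   # direct subclasses of the root

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        cls._children = []                      # each class tracks its own children
        cls.__bases__[0]._children.append(cls)  # register with the direct parent

    def __init__(self, payload):
        self.payload = payload

    @classmethod
    def claims(cls, data):
        """Decide whether this class is responsible for the remaining bytes."""
        return data.startswith(cls.MAGIC)

    @classmethod
    def parse(cls, data):
        """The static 'constructor': cut own header, then ask the children."""
        payload = data[len(cls.MAGIC):]
        for child in cls._children:
            if child.claims(payload):
                return child.parse(payload)
        return cls(payload)  # no child claimed it, so this class handles it


class Control(Packet):
    MAGIC = b"C"


class Ping(Control):
    MAGIC = b"P"


class Data(Packet):
    MAGIC = b"D"
```

`Packet.parse(b"CPhello")` walks Packet, then Control, then Ping, and returns a `Ping` whose payload is `b"hello"`. Adding a new type is just one more subclass declaration; no existing class has to change.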
I hope you can see the idea.