是否有好的数据翻译/转换模式/习惯用法?
我对这个问题的通用标题感到抱歉,但我希望我能够更笼统地表达它。 :-}
我想编写一个软件(在本例中,使用 C++)将输入令牌流转换为输出令牌流。 只有五个输入标记(我们称它们为 0、1、2、3、4),每个标记都可以有一些不同的属性(例如,可能有 4.x 属性或 0.foo)。 还有一些输出标记,大约十个,我们称它们为(Out0..Out9),每个标记也有一些属性。
现在,我们一直致力于从输入标记到可能的输出标记的序列映射,如下所示:
01 -> Out0
34 -> Out1
0101 -> Out3
...因此不同的输入标记序列映射到单个输出标记。
在我的场景中,输入标记集是固定的,但输出标记集不是固定的 - 我们可能决定引入新的“产品”。 我的问题是:
有人知道是否有好的模式和/或习语可以在这种情况下提供帮助?
现在我有一组“压缩机”对象,每个对象都可以消耗输入令牌并最终生成输出令牌。 问题是某些输入标记发生冲突,请考虑上述情况中的“Out0”和“Out3”。 输入“0101”应该产生 Out3,但不是 Out0。 但是,输入“0104”应该产生 Out0,然后将 0 和 4 留在队列中。
我想知道数据压缩或其他领域是否存在可能有益的模式。
这种将低级标记的输入“减少”为高级标记并处理可能的冲突的工作在解析器编写者中很常见,不是吗? 那里有有用的模式吗?
更新:
更多信息:
在我的具体情况中,输入标记是 C 结构,输出标记是 C++ 对象。 我无法控制令牌的输入流,但我可以对它们进行排队,然后在有利的情况下修改队列。
我通过尝试先匹配 Out3,然后匹配 Out0 来解决冲突(例如 Out3 (0101) 与 Out0 (01)),但这有点难看。 可能的产生式在一个列表中,我只是尝试将它们一个接一个地应用到输入流
可能的产生式列表可以由用户扩展,所以我无法生成一个巨大的 DAG 然后拥有一个状态实现它来处理每个可能的转换的机器。 当然,这意味着用户可以添加冲突,但事实就是如此。
用户可以
I'm sorry for the generic title of this question but I wish I was able to articulate it less generically. :-}
I'd like to write a piece of software (in this case, using C++) which translates a stream of input tokens into a stream of output tokens. There are just five input tokens (lets call them 0, 1, 2, 3, 4) and each of them can have a few different attributes (like, There might be an 4.x property or 0.foo). There are a few more output tokens, about ten, let's call them (Out0..Out9) each of them also has a few properties.
Now, we've been working on a mapping of sequences from input tokens to possible output tokens, like this:
01 -> Out0
34 -> Out1
0101 -> Out3
...so different sequences of input tokens map to a single output token.
In my scenario, the set of input tokens is fixed, but the set of output tokens is not - we might decide to introduce new 'productions'. My question is:
Does anybody know whether there are good patterns and/or idioms which help in such a situation?
Right now I have a set of 'Compressor' object, each of which can eat the input tokens and eventually produces the output tokens. The problem is that some input tokens clash, consider 'Out0' and 'Out3' in the above case. The input '0101' should yield Out3 but not Out0. However, the input '0104 should yield Out0 and then leave 0 and 4 in the queue.
I'm wondering whether there are maybe patterns from data compression or other areas which might be beneficial.
This work of 'reducing' an input of lowlevel tokens to highlevel tokens and dealing with possible conflicts is common among parser writers, no? Are there are useful patterns there?
UPDATE:
A bit more information:
in my concrete case, the input tokens are C structs, and the output tokens are C++ objects. I have no control whatsoever over the input stream of tokens, but I can queue them and then modify the queue in case that is beneficial.
I solved clashes (like Out3 (0101) vs. Out0 (01)) by trying to match Out3 first and then Out0, but it's a bit ugly. The possible productions are in a list and I simply try to apply them to the input stream, one after the other
The list of possible productions can be extended by the user, so I cannot generate one huge DAG and then have a state machine which implements that to handle every possible transition. Of course, this means that the user can add clashes, but that's just the way it is.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
您可以定义一个图,其中每个节点都包含一个输入标记和一个关联的输出。 每个节点的链接描述了可能的下一个令牌。 因此,图中的路径描述了可能的变换规则。
要转换数据,请从与第一个输入标记对应的节点开始,并尝试在可能的最长路径上导航图形,将下一个输入标记与链接到当前节点的节点相匹配。 当没有链接节点与下一个输入节点匹配时,将与当前节点关联的输出作为结果。
You could define a graph, where each node contains an input token and an associated output. The links of each node describe the possible next tokens. Thus, a path in the graph describe a possible transformation rule.
To transform the data, start from the node corresponding to the first input token, and try to navigate the graph on the longest path possible, matching the next input token to the nodes linked to the current node. When no linked node matches the next input node, take the output associated with the current node as the result.
在设计数据压缩算法时,您需要注意一个代码的开头不能被误认为是另一个较短的代码。 这是汉明码的基础。 另一种选择是使用分隔符来分隔标记,就像摩尔斯电码(在字符之间使用短暂的停顿)一样。
When designing a data compression algorithm, you need to take care that the beginning of one code can't be mistaken for another shorter code. This is the basis for Hamming code. The other alternative is to have a delimiter character separating your tokens, like in Morse code (which uses a short pause between characters).
几年前,我会立即说看看 Bison http://www.gnu.org/software/bison / 或 Yacc 但我已经有一段时间没有做过这样的事情了,所以不知道是否有更好的。
对于您正在做的事情来说,使用它们可能有点过头了,但所使用的习语可能很有用。
Years ago I would have immediately said look at Bison http://www.gnu.org/software/bison/ or Yacc but I haven't done anything like this for some time so don't know if there is anything better.
Using them might be a bit over the top for what you are doing but the idioms used might be useful.