是否有好的数据翻译/转换模式/习惯用法？

发布于 2024-07-30 09:09:00 字数 1096 浏览 4 评论 0原文

我对这个问题的通用标题感到抱歉，但我希望我能够更笼统地表达它。 :-}

我想编写一个软件（在本例中，使用 C++）将输入令牌流转换为输出令牌流。只有五个输入标记（我们称它们为 0、1、2、3、4），每个标记都可以有一些不同的属性（例如，可能有 4.x 属性或 0.foo）。还有一些输出标记，大约十个，我们称它们为（Out0..Out9），每个标记也有一些属性。

现在，我们一直致力于从输入标记到可能的输出标记的序列映射，如下所示：

01 -> Out0
34 -> Out1
0101 -> Out3

...因此不同的输入标记序列映射到单个输出标记。

在我的场景中，输入标记集是固定的，但输出标记集不是固定的 - 我们可能决定引入新的“产品”。我的问题是：

有人知道是否有好的模式和/或习语可以在这种情况下提供帮助？

现在我有一组“压缩机”对象，每个对象都可以消耗输入令牌并最终生成输出令牌。问题是某些输入标记发生冲突，请考虑上述情况中的“Out0”和“Out3”。输入“0101”应该产生 Out3，但不是 Out0。但是，输入“0104”应该产生 Out0，然后将 0 和 4 留在队列中。

我想知道数据压缩或其他领域是否存在可能有益的模式。

这种将低级标记的输入“减少”为高级标记并处理可能的冲突的工作在解析器编写者中很常见，不是吗？那里有有用的模式吗？

更新：

更多信息：

在我的具体情况中，输入标记是 C 结构，输出标记是 C++ 对象。我无法控制令牌的输入流，但我可以对它们进行排队，然后在有利的情况下修改队列。
我通过尝试先匹配 Out3，然后匹配 Out0 来解决冲突（例如 Out3 (0101) 与 Out0 (01)），但这有点难看。可能的产生式在一个列表中，我只是尝试将它们一个接一个地应用到输入流
可能的产生式列表可以由用户扩展，所以我无法生成一个巨大的 DAG 然后拥有一个状态实现它来处理每个可能的转换的机器。当然，这意味着用户可以添加冲突，但事实就是如此。
用户可以

原文

I'm sorry for the generic title of this question but I wish I was able to articulate it less generically. :-}

I'd like to write a piece of software (in this case, using C++) which translates a stream of input tokens into a stream of output tokens. There are just five input tokens (lets call them 0, 1, 2, 3, 4) and each of them can have a few different attributes (like, There might be an 4.x property or 0.foo). There are a few more output tokens, about ten, let's call them (Out0..Out9) each of them also has a few properties.

Now, we've been working on a mapping of sequences from input tokens to possible output tokens, like this:

01 -> Out0
34 -> Out1
0101 -> Out3

...so different sequences of input tokens map to a single output token.

In my scenario, the set of input tokens is fixed, but the set of output tokens is not - we might decide to introduce new 'productions'. My question is:

Does anybody know whether there are good patterns and/or idioms which help in such a situation?

Right now I have a set of 'Compressor' object, each of which can eat the input tokens and eventually produces the output tokens. The problem is that some input tokens clash, consider 'Out0' and 'Out3' in the above case. The input '0101' should yield Out3 but not Out0. However, the input '0104 should yield Out0 and then leave 0 and 4 in the queue.

I'm wondering whether there are maybe patterns from data compression or other areas which might be beneficial.

This work of 'reducing' an input of lowlevel tokens to highlevel tokens and dealing with possible conflicts is common among parser writers, no? Are there are useful patterns there?

UPDATE:

A bit more information:

in my concrete case, the input tokens are C structs, and the output tokens are C++ objects. I have no control whatsoever over the input stream of tokens, but I can queue them and then modify the queue in case that is beneficial.
I solved clashes (like Out3 (0101) vs. Out0 (01)) by trying to match Out3 first and then Out0, but it's a bit ugly. The possible productions are in a list and I simply try to apply them to the input stream, one after the other
The list of possible productions can be extended by the user, so I cannot generate one huge DAG and then have a state machine which implements that to handle every possible transition. Of course, this means that the user can add clashes, but that's just the way it is.

分享到QQ

分享到微博