How to write an Antlr4 grammar that matches X number of characters
I want to use Antlr4 to parse a format that stores the length of segments in the serialised form
For example, to parse:
"6,Hello 5,World"
I tried to create a grammar like this:
grammar myGrammar;
sequence:
(LEN ',' TEXT)*;
LEN: [0-9]+;
TEXT: // I don't know what to put in here, but it should match LEN number of chars
Is this even possible with Antlr?
A real-world example of this would be parsing the MessagePack binary format, which has several types that store the length of the data in the serialised form.
For example there is the str8:
str 8 stores a byte array whose length is upto (2^8)-1 bytes:
+--------+--------+========+
| 0xd9 |YYYYYYYY| data |
+--------+--------+========+
And str16 type
str16 stores a byte array whose length is upto (2^16)-1 bytes:
+--------+--------+--------+========+
| 0xda |ZZZZZZZZ|ZZZZZZZZ| data |
+--------+--------+--------+========+
In these examples the first byte identifies the type, then we have 1 byte for str8 and 2 bytes for str16 which contain the length of the data. Then finally there is the data.
I think a rule might look something like this, but I don't know how to match the right amount of data:
str8 : '\u00d9' BYTE DATA ;
str16: '\u00da' BYTE BYTE DATA ;
BYTE : '\u0000'..'\u00FF' ;
DATA : ???
Comments (1)
The data format you describe is usually called TLV (tag/type–length–value). TLV cannot be recognised with a regular expression (or even with a context-free grammar) so it's not usually supported by standard tokenisers.
Fortunately, it's easy to tokenise. Standard libraries may exist for particular formats, and some formats even have automated code generators for more efficient parsing. But you should be able to write a simple tokeniser for a particular format in a few lines of code.
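As a rough illustration of how small such a tokeniser can be, here is a minimal Python sketch for the "LEN,TEXT" format from the question. The function name `tokenize` and the error handling are my own assumptions, not part of any library, and the sketch assumes the length counts characters rather than bytes.

```python
def tokenize(data: str) -> list[str]:
    """Split a 'LEN,TEXT' stream into its TEXT segments (hypothetical helper)."""
    segments = []
    pos = 0
    while pos < len(data):
        # The length field runs up to the next comma.
        comma = data.find(",", pos)
        if comma == -1:
            raise ValueError(f"missing ',' after length at offset {pos}")
        length_field = data[pos:comma]
        if not (length_field.isascii() and length_field.isdigit()):
            raise ValueError(f"bad length {length_field!r} at offset {pos}")
        length = int(length_field)
        # The value is exactly `length` characters following the comma.
        start = comma + 1
        end = start + length
        if end > len(data):
            raise ValueError(f"segment at offset {start} is truncated")
        segments.append(data[start:end])
        pos = end
    return segments


print(tokenize("6,Hello 5,World"))  # ['Hello ', 'World']
```

Note that the first segment is six characters long, so it includes the trailing space after "Hello"; the length field, not a delimiter, decides where each segment ends.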
Once you have written the datastream tokeniser, you could use a parser generator like Antlr to build a data structure from the parse, but it's rarely necessary. Most TLV-encoded streams are simple sequences of components, although you occasionally run into formats (like Google protobufs or ASN.1) which include nested subsequences. Even with those, the parse is straightforward (although for both of those examples, standard tools exist).
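For the MessagePack example in the question, a flat sequence of str 8 and str 16 values can be read with a plain loop and no parser generator. The following is only a sketch under stated assumptions: `read_strings` is a hypothetical name, it handles just the two type bytes shown in the question (0xd9 and 0xda), and it relies on MessagePack storing lengths in big-endian order.

```python
import struct

STR8 = 0xD9   # type byte followed by a 1-byte length
STR16 = 0xDA  # type byte followed by a 2-byte big-endian length


def read_strings(buf: bytes) -> list[bytes]:
    """Read consecutive str 8 / str 16 values from buf (hypothetical helper)."""
    out = []
    pos = 0
    while pos < len(buf):
        tag = buf[pos]
        if tag == STR8:
            header = ">B"
        elif tag == STR16:
            header = ">H"
        else:
            raise ValueError(f"unsupported type byte 0x{tag:02x} at offset {pos}")
        size = struct.calcsize(header)
        if pos + 1 + size > len(buf):
            raise ValueError(f"truncated length field at offset {pos + 1}")
        # Decode the length, then slice exactly that many bytes of data.
        (length,) = struct.unpack_from(header, buf, pos + 1)
        start = pos + 1 + size
        end = start + length
        if end > len(buf):
            raise ValueError(f"truncated data at offset {start}")
        out.append(buf[start:end])
        pos = end
    return out


sample = b"\xd9\x05Hello" + b"\xda\x00\x05World"
print(read_strings(sample))  # [b'Hello', b'World']
```

For real MessagePack data an existing library is likely the better choice; the point here is only that the length-prefixed part needs a few lines of ordinary code.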
In any event, using context-free grammar tools like Antlr is rarely the simplest solution, because TLV formats are mostly order-independent. (If the order were fixed, the tags wouldn't be necessary.) Context-free grammars have no way of handling a language such as "at most one each of A, B, C, D, and E, in any order" other than enumerating the alternatives, and the number of alternatives grows exponentially with the number of tags.
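To put a number on that growth, the sketch below counts the sequences a grammar would have to enumerate for "at most one each of n tags, in any order", and contrasts it with the few lines of ordinary code that check the same constraint. Both function names are hypothetical, chosen only for this illustration.

```python
from math import perm


def count_alternatives(n: int) -> int:
    """Number of ordered sequences using each of n tags at most once."""
    # For each k, choose and order k of the n tags: n! / (n - k)!
    return sum(perm(n, k) for k in range(n + 1))


def is_valid(tags: list[str], allowed: set[str]) -> bool:
    """Check 'at most one each, any order' without enumerating anything."""
    seen = set()
    for tag in tags:
        if tag not in allowed or tag in seen:
            return False
        seen.add(tag)
    return True


print(count_alternatives(5))   # 326
print(count_alternatives(10))  # 9864101
print(is_valid(["C", "A", "E"], set("ABCDE")))  # True
print(is_valid(["C", "A", "C"], set("ABCDE")))  # False
```

With five tags the grammar already needs 326 alternatives (1 + 5 + 20 + 60 + 120 + 120), while the hand-written check stays the same size however many tags there are.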