How to write an Antlr4 grammar that matches X characters

Posted 2025-01-16 03:23:49

I want to use Antlr4 to parse a format that stores the length of each segment in its serialised form.

For example, to parse:
"6,Hello 5,World"

I tried to create a grammar like this:

grammar myGrammar;

sequence:
 (LEN ',' TEXT)*;

LEN: [0-9]+;
TEXT: // I don't know what to put in here, but it should match LEN number of chars

Is this even possible with Antlr?

A real-world example of this would be parsing the MessagePack binary format, which has several types that serialise the length of the data into the serialised form.

For example, there is the str8 type:

str 8 stores a byte array whose length is upto (2^8)-1 bytes:
+--------+--------+========+
|  0xd9  |YYYYYYYY|  data  |
+--------+--------+========+

And the str16 type:

str16 stores a byte array whose length is upto (2^16)-1 bytes:
+--------+--------+--------+========+
|  0xda  |ZZZZZZZZ|ZZZZZZZZ|  data  |
+--------+--------+--------+========+

In these examples, the first byte identifies the type; then come the bytes that contain the length of the data (1 byte for str8, 2 bytes for str16), and finally the data itself.

I think a rule might look something like this, but I don't know how to match the right amount of data:

str8 : '\u00d9' BYTE DATA ;
str16: '\u00da' BYTE BYTE DATA ;

BYTE : '\u0000'..'\u00FF' ;
DATA : ???


不甘平庸 2025-01-23 03:23:49

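The data format you describe is usually called TLV (tag/type-length-value). TLVs cannot be recognised with regular expressions (or even with context-free grammars), because the number of characters to consume depends on a value read earlier in the input, so standard tokenisers do not usually support them.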

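Fortunately, TLV is easy to tokenise. There may be a standard library for a particular format, and some formats even have automated code generators for more efficient parsing. But you should be able to write a simple tokeniser for a particular format in a few lines of code.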
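As a rough illustration of such a tokeniser, here is a minimal Python sketch for the `LEN,TEXT` format from the question. The function name `tokenize` and its error messages are made up for this example, and it assumes the length is an ASCII decimal count of characters that follow the comma.

```python
def tokenize(data: str) -> list[str]:
    """Split a stream of '<decimal length>,<payload>' records into payloads."""
    segments = []
    pos = 0
    while pos < len(data):
        # The length field runs up to the next comma.
        comma = data.find(",", pos)
        if comma == -1:
            raise ValueError(f"missing ',' after length at offset {pos}")
        length_text = data[pos:comma]
        if not (length_text.isascii() and length_text.isdigit()):
            raise ValueError(f"invalid length {length_text!r} at offset {pos}")
        length = int(length_text)

        # The payload is exactly `length` characters, whatever they are.
        start = comma + 1
        end = start + length
        if end > len(data):
            raise ValueError(f"payload at offset {start} is truncated")
        segments.append(data[start:end])
        pos = end
    return segments


print(tokenize("6,Hello 5,World"))  # ['Hello ', 'World']
```

Note that the payload is sliced by count rather than matched by pattern, which is why the space after `Hello` ends up inside the first segment.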

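Once you have written the data-stream tokeniser, you could use a parser generator like Antlr to build data structures from the parse, but that is rarely necessary. Most TLV-encoded streams are simple sequences of components, although you occasionally run into formats (such as Google protobufs or ASN.1) that include nested subsequences. Even with those, the parse is straightforward (although standard tools exist for both of those examples).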
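For the MessagePack case in the question, a hand-written reader might look like the sketch below. It handles only the str8 (`0xd9`) and str16 (`0xda`) tags shown there, relies on MessagePack storing the length as a big-endian unsigned integer, and uses a hypothetical helper name `read_str`. For real work, an existing MessagePack library is the safer choice.

```python
def read_str(buf: bytes, pos: int = 0) -> tuple[bytes, int]:
    """Read one str8/str16 value at `pos`; return (data, next position)."""
    # Width in bytes of the length field for each supported tag.
    length_width = {0xD9: 1, 0xDA: 2}

    if pos >= len(buf):
        raise ValueError("unexpected end of input")
    tag = buf[pos]
    if tag not in length_width:
        raise ValueError(f"unsupported type byte 0x{tag:02x} at offset {pos}")

    len_start = pos + 1
    data_start = len_start + length_width[tag]
    if data_start > len(buf):
        raise ValueError("length field is truncated")
    # MessagePack encodes lengths as big-endian unsigned integers.
    length = int.from_bytes(buf[len_start:data_start], "big")

    data_end = data_start + length
    if data_end > len(buf):
        raise ValueError("data is truncated")
    return buf[data_start:data_end], data_end


data, next_pos = read_str(b"\xd9\x05Hello\xda\x00\x05World")
print(data, next_pos)  # b'Hello' 7
```

Calling `read_str` again with `pos=next_pos` reads the following value, so a stream of components becomes a simple loop.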

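In any case, a context-free grammar tool like Antlr is rarely the simplest solution, because TLV formats are mostly order-independent. (If the order were fixed, the tags would not be needed.) Context-free grammars have no way of handling a language like "at most one of each of A, B, C, D and E, in any order" other than by enumerating the alternatives, and the number of alternatives grows exponentially.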
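To make the contrast concrete, the sketch below checks the "at most one of each, in any order" constraint in ordinary code, and counts how many distinct tag sequences satisfy it, which is what a grammar that lists every ordering would have to spell out. The tag set and the names `collect_fields` and `count_valid_orderings` are hypothetical, chosen only for this illustration.

```python
import math

# Hypothetical tag set, taken from the "A, B, C, D and E" example.
ALLOWED_TAGS = frozenset("ABCDE")


def collect_fields(records):
    """Accept (tag, value) pairs in any order, each tag at most once."""
    fields = {}
    for tag, value in records:
        if tag not in ALLOWED_TAGS:
            raise ValueError(f"unknown tag {tag!r}")
        if tag in fields:
            raise ValueError(f"duplicate tag {tag!r}")
        fields[tag] = value
    return fields


def count_valid_orderings(n_tags: int) -> int:
    """Number of sequences that use each of n_tags tags at most once."""
    # Choose k of the tags and arrange them: P(n, k), summed over all k.
    return sum(math.perm(n_tags, k) for k in range(n_tags + 1))


print(collect_fields([("C", 3), ("A", 1)]))  # {'C': 3, 'A': 1}
print(count_valid_orderings(5))              # 326
```

The check in code is a dictionary lookup per record, while the count of sequences to enumerate already reaches 326 for five tags.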
