Parsing the FIX protocol with regex?

Posted 2024-12-17 13:00:12

I need to parse log files that contain FIX protocol messages.

Each line contains header information (timestamp, logging level, endpoint), followed by a FIX payload.

I've used a regex to parse the header information into named groups, e.g.:

 (?P<datetime>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) (?P<process_id>\d{4}/\d{1,2})\s*(?P<logging_level>\w*)\s*(?P<endpoint>\w*)\s*
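
For reference, here is a minimal sketch of how a header pattern like this could be applied in Python. It writes every group in Python's `(?P<name>...)` syntax and escapes the dot before the microseconds. The log line, the `gateway1` endpoint name and the `HEADER` constant are made up for illustration, and the sketch assumes the separator in the file is the raw `\x01` byte rather than the two printable characters `^A`.

```python
import re

# Header pattern from the question, using Python's (?P<name>...) named groups.
HEADER = re.compile(
    r"(?P<datetime>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) "
    r"(?P<process_id>\d{4}/\d{1,2})\s*"
    r"(?P<logging_level>\w*)\s*"
    r"(?P<endpoint>\w*)\s*"
)

# Hypothetical log line; real header fields will differ.
line = (
    "24/12/17 13:00:12.123456 1234/5 INFO gateway1 "
    "8=FIX.4.2\x019=61\x0135=A\x0111=blahblah\x01"
)

m = HEADER.match(line)
header = m.groupdict()    # named header fields as a dict
payload = line[m.end():]  # everything after the header is the FIX payload
```

Slicing at `m.end()` leaves the FIX payload untouched, so it can be handled separately from the header.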

I then come to the FIX payload itself (^A is the separator between each tag), e.g.:

8=FIX.4.2^A9=61^A35=A...^A11=blahblah...

I need to extract specific tags from this (e.g. "A" from 35=, or "blahblah" from 11=) and ignore all the other stuff. Basically, I need to ignore anything before "35=A", then anything between that and "11=blahblah", then anything after that, and so on.

I do know there are libraries that might be able to parse each and every tag (http://source.kentyde.com/fixlib/overview). However, I was hoping for a simple approach using regex here if possible, since I really only need a couple of tags.

Is there a good way in regex to extract the tags I require?

Cheers,
Victor


白鸥掠海 2024-12-24 13:00:12

There is no need to split on "\x01", then regex, then filter. If you want just tags 34, 49 and 56 (MsgSeqNum, SenderCompId and TargetCompId), a single regex will do:

dict(re.findall("(?:^|\x01)(34|49|56)=(.*?)\x01", raw_msg))

Simple regexes like this work as long as your sender does not embed data that would break any simple regex. Specifically:

No raw data fields (actually a combination of a data length and the raw data, like RawDataLength, RawData (95/96) or XmlDataLen, XmlData (212/213))
No encoded fields for Unicode strings, like EncodedTextLen, EncodedText (354/355)

Handling those cases takes a lot of additional parsing. I use a custom Python parser, but even the fixlib code you referenced above gets these cases wrong. If your data is clear of these exceptions, though, the regex above should return a nice dict of your desired fields.
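
To make the problem concrete, here is a rough sketch of what length-aware parsing could look like. It is not the custom parser mentioned above, just an illustration under some assumptions: the message is a `str` in which one character equals one byte (FIX lengths are byte counts), every field ends with `\x01`, and each length field is immediately followed by its data field. The names `LENGTH_TAGS` and `parse_fix` and the sample message are made up.

```python
import re

SOH = "\x01"
# Length tag -> data tag, for the pairs listed above.
LENGTH_TAGS = {"95": "96", "212": "213", "354": "355"}

def parse_fix(raw_msg):
    """Walk the message field by field, honoring length-prefixed data fields."""
    fields = []
    pos = 0
    pending = None  # (data_tag, length) announced by the preceding length field
    while pos < len(raw_msg):
        eq = raw_msg.index("=", pos)
        tag = raw_msg[pos:eq]
        if pending and tag == pending[0]:
            end = eq + 1 + pending[1]         # take exactly `length` characters
        else:
            end = raw_msg.index(SOH, eq + 1)  # ordinary field: read up to SOH
        value = raw_msg[eq + 1:end]
        fields.append((tag, value))
        pending = (LENGTH_TAGS[tag], int(value)) if tag in LENGTH_TAGS else None
        pos = end + 1                         # skip the SOH delimiter
    return fields

# Hypothetical message whose RawData (96) contains an SOH and a fake "49=" field.
msg = ("8=FIX.4.2\x0149=SENDER\x0195=10\x0196=ab\x0149=EVIL"
       "\x0156=TARGET\x0110=000\x01")

parsed = dict(parse_fix(msg))
naive = dict(re.findall("(?:^|\x01)(34|49|56)=(.*?)\x01", msg))
```

On this message `parsed["49"]` is `"SENDER"`, while the plain regex also matches the `49=EVIL` text sitting inside the raw data, so `naive["49"]` ends up as `"EVIL"`.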

Edit: I've left the above regex as-is, but it should be revised so that the final element of the pattern is the lookahead (?=\x01) rather than a consumed \x01. The explanation can be found in @tropleee's answer.
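
A short sketch of that revision, for anyone landing here. The reason for the lookahead is that `re.findall` returns non-overlapping matches: if the trailing `\x01` is consumed, it is no longer available as the leading delimiter of the next field, so a wanted tag that directly follows another wanted tag gets skipped. The sample message and the names `CONSUMING`, `LOOKAHEAD` and `extract_tags` are made up for illustration.

```python
import re

# Pattern as originally posted: the trailing SOH is consumed by each match.
CONSUMING = re.compile("(?:^|\x01)(34|49|56)=(.*?)\x01")
# Revised pattern: the trailing SOH is only asserted, not consumed.
LOOKAHEAD = re.compile("(?:^|\x01)(34|49|56)=(.*?)(?=\x01)")

def extract_tags(raw_msg, pattern=LOOKAHEAD):
    """Return {tag: value} for the wanted tags found in raw_msg."""
    return dict(pattern.findall(raw_msg))

# Hypothetical message with the three wanted tags next to each other.
raw_msg = ("8=FIX.4.2\x019=61\x0135=A\x0134=7\x0149=SENDER"
           "\x0156=TARGET\x0110=000\x01")

fixed = extract_tags(raw_msg)              # finds 34, 49 and 56
broken = extract_tags(raw_msg, CONSUMING)  # misses 49
```

With the consuming pattern, the match for `34=7` swallows the delimiter in front of `49=SENDER`, so tag 49 is lost. The lookahead version returns all three tags.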
