Parsing FIX protocol in regex?
I need to parse log files that contain FIX protocol messages.
Each line contains header information (timestamp, logging level, endpoint), followed by a FIX payload.
I've used regex to parse the header information into named groups. E.g.:
(?P<datetime>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) (?P<process_id>\d{4}/\d{1,2})\s*(?P<logging_level>\w*)\s*(?P<endpoint>\w*)\s*
I then come to the FIX payload itself (^A is the separator between each tag), e.g.:
8=FIX.4.2^A9=61^A35=A...^A11=blahblah...
I need to extract specific tags from this (e.g. "A" from 35=, or "blahblah" from 11=) and ignore everything else. Basically, I need to skip anything before "35=A", then anything between that and "11=blahblah", then anything after that, and so on.
I do know there are libraries that might be able to parse each and every tag (http://source.kentyde.com/fixlib/overview); however, I was hoping for a simple approach using regex here if possible, since I really only need a couple of tags.
Is there a good way in regex to extract the tags I require?
Cheers,
Victor
Comments (3)
No need to split on "\x01", then regex, then filter. If you wanted just tags 34, 49 and 56 (MsgSeqNum, SenderCompID and TargetCompID), you could use a regex:
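A minimal sketch of what such a one-pass extraction might look like in Python. The pattern, the helper name `extract_tags`, and the sample message are illustrative assumptions, not necessarily the regex from this answer; the sketch assumes every field is terminated by SOH (`\x01`) and that no field value contains an embedded SOH.

```python
import re

SOH = "\x01"  # FIX field delimiter, displayed as ^A in many editors


def extract_tags(payload, wanted):
    """Return {tag: value} for the wanted tag numbers found in a FIX payload."""
    alternation = "|".join(re.escape(tag) for tag in wanted)
    # A tag must start the payload or follow SOH. The value runs up to the
    # next SOH, which is asserted by a lookahead rather than consumed, so
    # two wanted fields that sit next to each other are both matched.
    pattern = re.compile(r"(?:^|\x01)(" + alternation + r")=(.*?)(?=\x01)")
    return dict(pattern.findall(payload))


# Hypothetical sample message, for illustration only.
msg = SOH.join(
    ["8=FIX.4.2", "9=61", "35=A", "34=1", "49=SENDER", "56=TARGET", "10=000"]
) + SOH

print(extract_tags(msg, ["34", "49", "56"]))
```

Anchoring each tag on the start of the payload or a preceding SOH keeps a wanted number such as 34 from matching inside a longer tag such as 134.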
Simple regexes like this will work if you know your sender does not have embedded data that could cause a bug in any simple regex. Specifically:
Handling those cases takes a lot of additional parsing. I use a custom Python parser, but even the fixlib code you referenced above gets these cases wrong. If your data is clear of these exceptions, though, the regex above should return a nice dict of your desired fields.
Edit: I've left the above regex as-is, but it should be revised so that the final match element is (?=\x01). The explanation can be found in @tropleee's answer.
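To see why the lookahead matters, here is a small comparison sketch (both patterns and the sample payload are assumptions made for illustration). When the trailing delimiter is consumed as part of a match, the next wanted field no longer has a leading `\x01` to anchor on, so adjacent wanted tags can be skipped.

```python
import re

# Hypothetical payload in which the wanted tags 34, 49 and 56 are adjacent.
payload = "8=FIX.4.2\x0134=7\x0149=SENDER\x0156=TARGET\x0110=000\x01"

# Consumes the trailing SOH as part of each match.
consuming = re.compile(r"(?:^|\x01)(34|49|56)=(.*?)\x01")
# Only asserts that SOH follows, leaving it available for the next match.
lookahead = re.compile(r"(?:^|\x01)(34|49|56)=(.*?)(?=\x01)")

consumed_result = dict(consuming.findall(payload))
lookahead_result = dict(lookahead.findall(payload))

print(consumed_result)   # tag 49 is missed
print(lookahead_result)  # all three tags are found
```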
^A is actually \x{01}; that's just how it shows up in vim. In Perl, I did this with a split on hex 1 and then a split on "=". After the second split, element [0] of the array is the tag and element [1] is the value.
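The same two-step split carries over to other languages. A rough Python equivalent might look like the sketch below; the function name `parse_fix` and the sample payload are assumptions for illustration.

```python
SOH = "\x01"  # the character vim displays as ^A


def parse_fix(payload):
    """Split a FIX payload into a {tag: value} dict."""
    fields = {}
    for field in payload.split(SOH):
        if not field:
            continue  # skip the empty piece after the trailing SOH
        # Split on the first "=" only, so values containing "=" stay intact.
        tag, sep, value = field.partition("=")
        if sep:
            fields[tag] = value
    return fields


# Hypothetical sample payload, for illustration only.
msg = "8=FIX.4.2\x019=61\x0135=A\x0111=blahblah\x01"
parsed = parse_fix(msg)
print(parsed["35"], parsed["11"])
```

Because the result is a plain dict, a tag that appears more than once in the message keeps only its last value.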
Use a regex tool like Expresso or RegexBuddy.
Why don't you split on ^A and then match ([^=]+)=(.*) against each piece, putting the results into a hash? You could also filter with a switch that by default skips the tags you're not interested in and falls through for all the tags you are interested in.
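A minimal Python sketch of this split-then-match idea. Python has no fall-through switch, so a set membership test stands in for the filter here; the names `FIELD`, `WANTED` and `wanted_fields` are hypothetical, and the sample payload is made up for illustration.

```python
import re

# Tag is everything before the first "=", value is everything after it.
FIELD = re.compile(r"([^=]+)=(.*)", re.DOTALL)
WANTED = {"35", "11"}  # tags to keep; all others are skipped


def wanted_fields(payload, wanted=WANTED):
    """Split on SOH, match each piece, and keep only the wanted tags."""
    result = {}
    for piece in payload.split("\x01"):
        match = FIELD.fullmatch(piece)
        if match and match.group(1) in wanted:
            result[match.group(1)] = match.group(2)
    return result


# Hypothetical sample payload, for illustration only.
msg = "8=FIX.4.2\x019=61\x0135=A\x0111=blahblah\x01"
print(wanted_fields(msg))
```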