String parsing in PHP
For a small project of my own, I'm writing a parser that parses event logs from a certain application. Normally I'd have little issue with handling such a thing, but the problem is that strings from these logs do not always have the same parameters. For example, one such string could be:
DD/MM HH:MM:SS.MSEC TYPE_OF_EVENT SOURCE, SOURCE_FLAGS, TARGET, TARGET_FLAGS, PARAM1
Another string could have a whole series of parameters, up to 27 of them, and yet another has 16. According to the documentation, there is some logic to the parameters; for example, the 17th parameter will always hold an integer. While that is good, unfortunately the 17th parameter might be the 7th item in the string. The only things that are really constant in every string are the time stamp and the first 6 parameters.
How would I go about parsing strings like these? I'm sorry if my question is a tad unclear; I find it difficult to word my problem.
OK, a follow-up to my comment at the top.
If the log's format is "constant" for a given value of the TYPE_OF_EVENT field, you'll just have to do some simple pre-parsing, after which the rest should follow easily.
Based on type_of_event, do further analysis:
switch (event type) {
case 'a': parse out 'a' event parameters
case 'b': parse out 'b' event parameters
default: log unknown event type for future analysis
}
and so on.
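A minimal PHP sketch of that pre-parse-then-dispatch idea might look like the following. It assumes each line has the shape shown in the question (date, time, event type, then comma-separated parameters); the event names EVENT_A and EVENT_B, their field lists, and the function names are hypothetical placeholders, not taken from the real application.

```php
<?php
// Pre-parse: split off date, time and event type; keep the rest untouched.
function preParse(string $line): ?array
{
    $parts = preg_split('/\s+/', trim($line), 4);
    if ($parts === false || count($parts) < 3) {
        return null; // not a recognisable log line
    }
    return [
        'date' => $parts[0],
        'time' => $parts[1],
        'type' => $parts[2],
        'rest' => $parts[3] ?? '',
    ];
}

// Dispatch on the event type to decide what each position means.
function parseEvent(string $line): ?array
{
    $head = preParse($line);
    if ($head === null) {
        return null;
    }
    $values = $head['rest'] === ''
        ? []
        : array_map('trim', explode(',', $head['rest']));

    switch ($head['type']) {
        case 'EVENT_A': // hypothetical event type
            $fields = ['source', 'source_flags', 'target', 'target_flags', 'param1'];
            break;
        case 'EVENT_B': // hypothetical event type
            $fields = ['source', 'source_flags', 'amount'];
            break;
        default:
            // Unknown type: record it for future analysis, keep raw values.
            error_log('Unknown event type: ' . $head['type']);
            return ['type' => $head['type'], 'known' => false, 'params' => $values];
    }

    // Map positions to names; missing trailing parameters become null.
    $named = [];
    foreach ($fields as $i => $name) {
        $named[$name] = $values[$i] ?? null;
    }
    return ['type' => $head['type'], 'known' => true, 'params' => $named];
}
```

Keeping the per-type field lists in one place means a new event type only needs a new case, and unknown types still come back with their raw values instead of being dropped.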
I would use a different logging solution, or find a way to modify this one so that you get empty placeholders: item,,item3,,,item6 etc.
Just my opinion, without knowing too much about this app: it doesn't sound too good. I usually judge apps by factors like this. If there is no good reason for the log file to be non-standardized, then what do you think the rest of the code looks like? :)
That's not an input that can be "parsed" as such, because there are no fixed keywords to look out for. But regular expressions seem sufficient to extract and split up the contents.
http://regular-expressions.info/ has a good introduction, and https://stackoverflow.com/questions/89718/is-there-anything-like-regexbuddy-in-the-open-source-world lists a few cool tools that help in designing regular expressions.
In your case you would need \d+ for matching the numbers, use the delimiters literally, and you can probably get away with .*? separated by the , comma delimiters to find the individual parts.
If there is a variable number of attributes, then you should prefer two regexps (though it can be done in one). First capture the remainder of each line with .* and then split it afterwards.
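As a rough PHP illustration of the two-regexp approach: the first pattern picks the timestamp and event type apart with \d+ and literal delimiters and captures the remainder with .*, and the second splits that remainder on commas. The exact line layout is assumed from the sample in the question, and the function name splitLogLine is made up for this sketch.

```php
<?php
// Assumed layout: "DD/MM HH:MM:SS.MSEC TYPE_OF_EVENT P1, P2, ..."
function splitLogLine(string $line): ?array
{
    // Regexp 1: fixed head with literal delimiters, then the remainder.
    $head = '/^(\d+)\/(\d+)\s+(\d+):(\d+):(\d+)\.(\d+)\s+(\S+)\s*(.*)$/';
    if (preg_match($head, trim($line), $m) !== 1) {
        return null; // line does not match the assumed layout
    }

    // Regexp 2: split the remainder on commas with optional whitespace.
    $rest = trim($m[8] ?? '');
    $params = $rest === '' ? [] : preg_split('/\s*,\s*/', $rest);

    return [
        'day'    => (int) $m[1],
        'month'  => (int) $m[2],
        'hour'   => (int) $m[3],
        'minute' => (int) $m[4],
        'second' => (int) $m[5],
        'msec'   => (int) $m[6],
        'type'   => $m[7],
        'params' => $params,
    ];
}
```

Splitting in a second step keeps the first pattern short and lets the parameter count vary freely, which a single pattern with 27 optional groups would make hard to read.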
How about splitting the string by the ", " separator and putting everything in an array? That way you'll have a numeric index to check whether a parameter exists or not.
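A small PHP sketch of that idea, assuming the parameter part of the line has already been isolated. The helper names splitParams, paramAt and hasIntegerAt are invented for illustration, and array indexes are zero-based, so the 17th parameter lives at index 16.

```php
<?php
// Split the parameter part of a line on ", " and trim stray whitespace.
function splitParams(string $paramString): array
{
    if (trim($paramString) === '') {
        return [];
    }
    return array_map('trim', explode(', ', $paramString));
}

// Return the parameter at a zero-based index, or a default if it is absent.
function paramAt(array $params, int $index, ?string $default = null): ?string
{
    return $params[$index] ?? $default;
}

// True only if the parameter exists and looks like an integer.
function hasIntegerAt(array $params, int $index): bool
{
    return isset($params[$index])
        && preg_match('/^-?\d+$/', $params[$index]) === 1;
}

$params = splitParams('Alice, 0x1, Bob, 0x2, 42');
$fifth  = paramAt($params, 4);       // fifth parameter, if present
$absent = paramAt($params, 16);      // 17th parameter is missing here
$isInt  = hasIntegerAt($params, 4);  // integer check on the fifth one
```

Checking for existence before reading a position avoids undefined-index notices on the shorter lines.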