Parsing badly formatted log files?

Posted on 2024-09-27 23:27:39


I'm working with some very poorly formatted log files: the column delimiter is a character that often appears inside the fields, and it isn't escaped. For example:

sam,male,september,brown,blue,i like cats, and i like dogs

Where:

name,gender,month,hair,eyes,about

So as you can see, the about field contains the column delimiter, which means a naive split on the delimiter won't work, because it will break the about value into two separate columns. Now imagine this with a chat system... you can visualize the issues, I'm sure.

So, theoretically, what's the best approach to solving this? I'm not looking for a language-specific implementation, but more of a general pointer in the right direction, or some ideas on how others have solved it... without doing it manually.

Edit:

I should clarify: my actual logs are in a much worse state. The fields have delimiter characters all over the place, and there is no pattern I can locate.


Comments (7)

深居我梦 2024-10-04 23:27:39


If only the last column has unescaped commas, then most languages' string-split implementations can limit the number of splits made, e.g. in Python s.split(',', 5).

If you want to parse the file with a CSV (comma-separated values) parser, then I think the best approach would be to run a fixer that does proper escaping before passing it to the CSV parser.
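
A minimal sketch of both suggestions in Python, assuming exactly six columns with the unescaped commas confined to the last one (the field names are taken from the question):

import csv
import io

EXPECTED_FIELDS = 6   # name, gender, month, hair, eyes, about

def split_fields(line):
    """Split on the first EXPECTED_FIELDS - 1 commas only; anything after
    them is treated as part of the final 'about' field."""
    return line.rstrip("\n").split(",", EXPECTED_FIELDS - 1)

def reescape(fields):
    """Re-emit the row with proper quoting so a standard CSV parser can
    read it downstream."""
    buf = io.StringIO()
    csv.writer(buf).writerow(fields)
    return buf.getvalue().rstrip("\r\n")

line = "sam,male,september,brown,blue,i like cats, and i like dogs"
print(split_fields(line))
# ['sam', 'male', 'september', 'brown', 'blue', 'i like cats, and i like dogs']
print(reescape(split_fields(line)))
# sam,male,september,brown,blue,"i like cats, and i like dogs"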

向地狱狂奔 2024-10-04 23:27:39


I suppose you can make certain assumptions about the kind of data. Gender, month, hair, and eyes each have a limited domain of values, so you can verify them against it.

It could also make sense that all of the fields except about (and maybe name) wouldn't contain a comma, so perhaps you can parse greedily, making the first 5 commas behave as delimiters and treating everything else as part of about. Verify again if necessary.
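
A rough sketch of that idea in Python; the value sets below are assumptions for illustration, not something taken from the actual logs:

GENDERS = {"male", "female"}
MONTHS = {"january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"}

def greedy_parse(line):
    """Treat the first 5 commas as delimiters; everything after them is 'about'."""
    name, gender, month, hair, eyes, about = line.rstrip("\n").split(",", 5)
    record = {"name": name, "gender": gender, "month": month,
              "hair": hair, "eyes": eyes, "about": about}
    # Verify the constrained columns; flag the record for review if they look wrong.
    record["suspect"] = (gender.lower() not in GENDERS
                         or month.lower() not in MONTHS)
    return record

print(greedy_parse("sam,male,september,brown,blue,i like cats, and i like dogs"))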

绮筵 2024-10-04 23:27:39


It might be impossible to perfectly parse them, if no escaping is used.

Lie Ryan noted that if only the last column could have those values, you have an option there.

If that's not the case, are there any columns that are guaranteed never to contain unescaped, reserved characters? Also, are there any columns that are guaranteed to hold only a certain set of values?

If either of those is true, you may be able to identify those fields first, and then split everything else out from there.

I'd have to know more specifics about your info to go further.
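
One way to act on that, sketched in Python: if (say) the month column is guaranteed to come from a fixed vocabulary, a regex can anchor on it and let the free-text column absorb the leftover commas. The vocabulary and column layout here are assumptions for illustration:

import re

MONTHS = ("january|february|march|april|may|june|july|august|"
          "september|october|november|december")

LINE_RE = re.compile(
    r"^(?P<name>[^,]+),(?P<gender>[^,]+),(?P<month>" + MONTHS + r"),"
    r"(?P<hair>[^,]+),(?P<eyes>[^,]+),(?P<about>.*)$",
    re.IGNORECASE,
)

m = LINE_RE.match("sam,male,september,brown,blue,i like cats, and i like dogs")
if m:
    print(m.group("about"))   # -> i like cats, and i like dogs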

尝蛊 2024-10-04 23:27:39


Here are two ideas that you could try out:

  • Length/format patterns - I think you should be able to identify some patterns in the individual columns of the file. For example, values in some columns may be shorter than values in others, and values in some columns are typically numbers or come from a limited set of values (e.g. months), or at least often contain certain sub-strings.

    When you can identify these patterns (based on statistics calculated from correctly delimited rows), you could create an algorithm that uses them to guess which of the delimiters should be ignored (e.g. when a column would otherwise be shorter than expected).

  • Grammatical rules - another idea inspired by your example - are the commas that are not escaped usually followed by particular strings (e.g. the words "and" or "about")? If yes, you could use this information to guess which delimiters should be escaped (see the sketch at the end of this answer).

Finally, if none of these ad hoc techniques solves your problem, you can use some heavy statistics to do the estimation. There are machine learning frameworks that can do the heavy statistics for you, but it is still quite a complicated problem. For example, on .NET you could use Infer.NET from Microsoft Research.
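
A quick sketch of the grammatical-rules heuristic in Python; the connective words are an assumed list for illustration:

import re

# Treat a comma as text rather than a delimiter when it is followed by a
# connective word; "and"/"but"/"or" are assumed examples, not a fixed rule.
DELIM = re.compile(r",(?!\s+(?:and|but|or)\b)", re.IGNORECASE)

line = "sam,male,september,brown,blue,i like cats, and i like dogs"
print(DELIM.split(line))
# ['sam', 'male', 'september', 'brown', 'blue', 'i like cats, and i like dogs']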

捎一片雪花 2024-10-04 23:27:39


One thing I would suggest doing, if possible, is to keep something in each data record that indicates the assumptions that were made (possibly keeping the original string), so that if something is found to be wrong with a record, the proper data can hopefully be reconstructed (by hand-examining it, if nothing else).
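
For example (a Python sketch; the note text is illustrative only):

from dataclasses import dataclass, field

@dataclass
class Record:
    fields: list           # the parsed columns
    raw: str                # the untouched source line, kept for later review
    notes: list = field(default_factory=list)   # assumptions made while parsing

rec = Record(
    fields=["sam", "male", "september", "brown", "blue",
            "i like cats, and i like dogs"],
    raw="sam,male,september,brown,blue,i like cats, and i like dogs",
    notes=["assumed commas after the 5th are part of 'about'"],
)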

爱的那么颓废 2024-10-04 23:27:39


If the 6th column is always the last, and always unescaped, this bit of Perl should do the trick:

use strict;
use warnings;

my $file = '/path/to/my/log.txt';
open(my $log, '<', $file) or die "Cannot open $file: $!";

while (my $line = <$log>) {
    chomp($line);
    # The first five fields may not contain commas; the last (free-text) field may.
    if ($line =~ /^([A-Za-z0-9_]+),([A-Za-z0-9_]+),([A-Za-z0-9_]+),([A-Za-z0-9_]+),([A-Za-z0-9_]+),([A-Za-z0-9_, ]+)$/)
    {
        print "Name:         $1\n";
        print "Gender:       $2\n";
        print "Month:        $3\n";
        print "Color #1:     $4\n";
        print "Color #2:     $5\n";
        print "Random Text:  $6\n";
    }
}

close($log);

若相惜即相离 2024-10-04 23:27:39


Your logs are ambiguous: you can't be sure which of many possible interpretations to make. Dealing with uncertainty is a job for probability theory. A natural tool then is a probabilistic context-free grammar -- there are algorithms for finding the most-probable parse. (I haven't had occasion to use one myself, though I've done simpler jobs with this kind of statistical approach. Peter Norvig's spelling-corrector article goes into practical detail on one such example.)

For this particular simplified problem: you might enumerate all the possible ways to split a line into N parts (where you already know what N to expect), calculate the probability of each according to some model, and pick the best answer.

(Another example of dealing with data with distinctions erased: I had a dataset of tags from a half-million Flickr photos. The tags came out of their API with all the wordsruntogether with the spaces squished out. I computed the most likely word boundaries using word frequencies tabulated from Internet photography sites, plus code like this SO answer.)
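
A toy Python illustration of the enumerate-and-score idea; the scoring function here is a stand-in assumption, where a real system would plug in a statistical model:

from itertools import combinations

GENDERS = {"male", "female"}
MONTHS = {"january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"}

def candidate_splits(line, n_fields=6):
    """Yield every way of treating n_fields - 1 of the commas as delimiters."""
    positions = [i for i, ch in enumerate(line) if ch == ","]
    for delims in combinations(positions, n_fields - 1):
        fields, start = [], 0
        for p in delims:
            fields.append(line[start:p])
            start = p + 1
        fields.append(line[start:])
        yield fields

def score(fields):
    """Toy model: reward plausible constrained columns, penalise commas
    leaking into columns that should not contain them."""
    s = 0.0
    s += fields[1].lower() in GENDERS
    s += fields[2].lower() in MONTHS
    s -= 0.5 * sum("," in f for f in fields[:5])
    return s

line = "sam,male,september,brown,blue,i like cats, and i like dogs"
print(max(candidate_splits(line), key=score))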
