Looking for an algorithm to reverse sprintf() output

Posted on 2024-07-04 19:00:18


I am working on a project that requires the parsing of log files. I am looking for a fast algorithm that would take groups of messages like this:

The temperature at P1 is 35F.

The temperature at P1 is 40F.

The temperature at P3 is 35F.

Logger stopped.

Logger started.

The temperature at P1 is 40F.

and puts out something in the form of a printf():

"The temperature at P%d is %dF.", Int1, Int2" 
{(1,35), (1, 40), (3, 35), (1,40)}

The algorithm needs to be generic enough to recognize almost any data load in message groups.

I tried searching for this kind of technology, but I don't even know the correct terms to search for.


Comments (10)

夏日浅笑〃 2024-07-11 19:00:19


You're not going to find a tool that can simply take arbitrary input, guess what data you want from it, and produce the output you want. That sounds like strong AI to me.

Producing something like this, even just to recognize numbers, gets really hairy. For example, is "123.456" one number or two? How about "123,456"? Is "35F" a decimal number and an 'F', or is it the hex value 0x35F? You're going to have to build something that will parse in the way you need. You can do this with regular expressions, or you can do it with sscanf, or you can do it some other way, but you're going to have to write something custom.

However, with basic regular expressions, you can do this yourself. It won't be magic, but it's not that much work. Something like this will parse the lines you're interested in and consolidate them (Perl):

my @vals = ();
while (defined(my $line = <>))
{
    # Capture the two numbers; escape the literal dot and require
    # at least one digit in each field.
    if ($line =~ /The temperature at P(\d+) is (\d+)F\./)
    {
        push(@vals, "($1,$2)");
    }
}
print "The temperature at P%d is %dF. {";
print join(",", @vals);
print "}\n";

The output from this is:

The temperature at P%d is %dF. {(1,35),(1,40),(3,35),(1,40)}

You could do something similar for each type of line you need to parse. You could even read these regular expressions from a file, instead of custom coding each one.
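As a rough sketch of that last idea (the patterns.txt file name and its tab-separated template/regex layout are just assumptions for illustration):

my %vals;      # printf-style template => captured tuples
my @patterns;  # [template, compiled regex] pairs
open(my $pf, '<', 'patterns.txt') or die "can't open patterns.txt: $!";
while (my $p = <$pf>)
{
    chomp $p;
    # Each line: a printf-style template, a tab, then the matching regex.
    my ($template, $regex) = split(/\t/, $p, 2);
    next unless defined $regex;    # skip malformed lines
    push(@patterns, [$template, qr/$regex/]);
}
close($pf);

while (defined(my $line = <>))
{
    for my $pat (@patterns)
    {
        my ($template, $regex) = @$pat;
        if (my @caps = ($line =~ $regex))
        {
            push(@{ $vals{$template} }, '(' . join(',', @caps) . ')');
            last;
        }
    }
}
print "$_ {", join(',', @{ $vals{$_} }), "}\n" for sort keys %vals;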

三生池水覆流年 2024-07-11 19:00:19


Thanks for all the great suggestions.
Chris is right. I am looking for a generic solution for normalizing any kind of text. The problem boils down to dynamically finding patterns in two or more similar strings.
Almost like predicting the next element in a set, based on the previous two:

1: Everest is 30000 feet high

2: K2 is 28000 feet high

=> What is the pattern?
=> Answer:

[name] is [number] feet high

Now the text file can have millions of lines and thousands of patterns. I would like to parse the files very, very fast, find the patterns and collect the data sets that are associated with each pattern.

I thought about creating some high-level semantic hashes to represent the patterns in the message strings.
I would use a tokenizer and give each of the token types a specific "weight".
Then I would group the hashes and rate their similarity. Once the grouping is done I would collect the data sets.
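Something along these lines is what I have in mind (the token types below are only examples, and the weighting step is omitted):

my %groups;    # signature => lines that share it
while (defined(my $line = <>))
{
    chomp $line;
    # Tag every whitespace-separated token with a coarse type; the tag
    # sequence serves as the line's "semantic hash".
    my @tags = map {
          /^\d+(\.\d+)?$/ ? 'NUM'
        : /^[A-Za-z]+$/   ? 'WORD'
        :                   'MIXED'   # e.g. "P1" or "35F."
    } split(/\s+/, $line);
    my $sig = join(' ', @tags);
    push(@{ $groups{$sig} }, $line);
}
for my $sig (sort keys %groups)
{
    print "$sig\n";
    print "    $_\n" for @{ $groups{$sig} };
}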

I was hoping that I wouldn't have to reinvent the wheel and could reuse something that is already out there.

Klaus

第七度阳光i 2024-07-11 19:00:19


It depends on what you are trying to do. If your goal is to quickly generate sprintf() input, this works. If you are trying to parse data, maybe regular expressions would do too.

不必你懂 2024-07-11 19:00:19


@John: I think that the question relates to an algorithm that actually recognises patterns in log files and automatically "guesses" appropriate format strings and data for it. The *scanf family can't do that on its own; it can only be of help once the patterns have been recognised in the first place.

中性美 2024-07-11 19:00:19


@Derek Park: Well, even a strong AI couldn't be sure it had the right answer.

Perhaps some compression-like mechanism could be used:

  1. Find large, frequent substrings
  2. Find large, frequent substring patterns. (i.e. [pattern:1] [junk] [pattern:2])

Another item to consider might be to group lines by edit-distance. Grouping similar lines should split the problem into one-pattern-per-group chunks.
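For the edit-distance idea, a rough sketch (greedy grouping; the threshold of 10 is an arbitrary guess):

# Plain dynamic-programming Levenshtein distance between two strings.
sub min { my $m = shift; for (@_) { $m = $_ if $_ < $m } return $m }
sub levenshtein
{
    my ($s, $t) = @_;
    my @prev = (0 .. length($t));
    for my $i (1 .. length($s))
    {
        my @cur = ($i);
        for my $j (1 .. length($t))
        {
            my $cost = substr($s, $i-1, 1) eq substr($t, $j-1, 1) ? 0 : 1;
            push(@cur, min($cur[$j-1] + 1, $prev[$j] + 1, $prev[$j-1] + $cost));
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Greedy grouping: a line joins the first group whose representative
# is within the (arbitrary) threshold; otherwise it starts a new group.
my @groups;
while (defined(my $line = <>))
{
    chomp $line;
    my $placed = 0;
    for my $g (@groups)
    {
        if (levenshtein($g->[0], $line) <= 10)
        {
            push(@$g, $line);
            $placed = 1;
            last;
        }
    }
    push(@groups, [$line]) unless $placed;
}
print scalar(@$_), " line(s) like: $_->[0]\n" for @groups;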

Actually, if you manage to write this, let the whole world know, I think a lot of us would like this tool!

老娘不死你永远是小三 2024-07-11 19:00:19


@Anders

Well, even a strong AI couldn't be sure it had the right answer.

I was thinking that sufficiently strong AI could usually figure out the right answer from the context. e.g. Strong AI could recognize that "35F" in this context is a temperature and not a hex number. There are definitely cases where even strong AI would be unable to answer. Those are the same cases where a human would be unable to answer, though (assuming very strong AI).

Of course, it doesn't really matter, since we don't have strong AI. :)

围归者 2024-07-11 19:00:19


http://www.logparser.com forwards to an IIS forum which seems fairly active. This is the official site for Gabriele Giuseppini's "Log Parser Toolkit". While I have never actually used this tool, I did pick up a cheap copy of the book from Amazon Marketplace - today a copy is as low as $16. Nothing beats a dead-tree-interface for just flipping through pages.

Glancing at this forum, I had not previously heard about the "New GUI tool for MS Log Parser, Log Parser Lizard" at http://www.lizardl.com/.

The key issue of course is the complexity of your GRAMMAR. To use any kind of log-parser, as the term is commonly used, you need to know exactly what you're scanning for, so that you can write a BNF for it. Many years ago I took a course based on Aho and Ullman's "Dragon Book", and well-understood LALR technology can give you optimal speed, provided of course that you have that CFG.

On the other hand it does seem you're possibly reaching for something AI-like, which is a different order of complexity entirely.

娇纵 2024-07-11 19:00:19


I don't know of any specific tool to do that. What I did when I had a similar problem to solve was trying to guess regular expressions to match lines.

I then processed the files and displayed only the unmatched lines. If a line is unmatched, it means that the pattern is wrong and should be tweaked or another pattern should be added.

After around an hour of work, I succeeded in finding ~20 patterns that matched 10,000+ lines.

In your case, you can first "guess" that one pattern is "The temperature at P[1-3] is [0-9]{2}F.". If you reprocess the file removing any matched line, it leaves "only":

Logger stopped.

Logger started.

which you can then match with "Logger (.+).".

You can then refine the patterns and find new ones to match your whole log.
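A minimal sketch of that loop (patterns.txt is just an example name, holding one regex per line):

# Print every line that none of the guessed patterns match, so the
# leftovers show which patterns are still missing or wrong.
open(my $pf, '<', 'patterns.txt') or die "can't open patterns.txt: $!";
my @patterns = map { chomp; qr/$_/ } <$pf>;
close($pf);

while (defined(my $line = <>))
{
    my $matched = 0;
    for my $re (@patterns)
    {
        if ($line =~ $re) { $matched = 1; last; }
    }
    print $line unless $matched;
}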

皓月长歌 2024-07-11 19:00:18


I think you might be overlooking fscanf() and sscanf(), which are the opposite of fprintf() and sprintf().

牵强ㄟ 2024-07-11 19:00:18


Overview:

A naïve algorithm keeps track of the frequency of words in a per-column manner, where one can assume that each line can be separated into columns with a delimiter.

Example input:

The dog jumped over the moon
The cat jumped over the moon
The moon jumped over the moon
The car jumped over the moon

Frequencies:

Column 1: {The: 4}
Column 2: {car: 1, cat: 1, dog: 1, moon: 1}
Column 3: {jumped: 4}
Column 4: {over: 4}
Column 5: {the: 4}
Column 6: {moon: 4}

We could partition these frequency lists further by grouping based on the total number of fields, but in this simple and convenient example, we are only working with a fixed number of fields (6).

The next step is to iterate through lines which generated these frequency lists, so let's take the first example.

  1. The: meets some hand-wavy criteria and the algorithm decides it must be static.
  2. dog: doesn't appear to be static based on the rest of the frequency list, and thus it must be dynamic as opposed to static text. We loop through a few pre-defined regular expressions and come up with /[a-z]+/i.
  3. over: same deal as #1; it's static, so leave as is.
  4. the: same deal as #1; it's static, so leave as is.
  5. moon: same deal as #1; it's static, so leave as is.

Thus, just from going over the first line we can put together the following regular expression:

/The ([a-z]+?) jumped over the moon/
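A rough sketch of that pass (the 0.9 dominance threshold stands in for the hand-wavy criteria, and all lines are assumed to have the same number of fields):

my (@lines, @freq);   # @freq: $freq[$col]{$word} = count
while (defined(my $line = <>))
{
    chomp $line;
    my @cols = split(/\s+/, $line);
    push(@lines, \@cols);
    $freq[$_]{ $cols[$_] }++ for 0 .. $#cols;
}

# Walk the first line: keep a word verbatim if it dominates its column,
# otherwise substitute a guessed character class.
my @parts;
my $first = $lines[0];
for my $col (0 .. $#$first)
{
    my $word  = $first->[$col];
    my $share = $freq[$col]{$word} / scalar(@lines);
    push(@parts, $share >= 0.9 ? quotemeta($word) : '([a-z]+)');
}
print 'Pattern: /', join(' ', @parts), "/i\n";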

Considerations:

  • Obviously one can choose to scan part of the document or the whole of it for the first pass, as long as one is confident the frequency lists will be a sufficient sampling of the entire data.

  • False positives may creep into the results, and it will be up to the filtering algorithm (hand-waving) to provide the best threshold between static and dynamic fields, or some human post-processing.

  • The overall idea is probably a good one, but the actual implementation will definitely weigh in on the speed and efficiency of this algorithm.
