How do you write a profiler?

Published 2024-07-10 13:08:58

I would like to know how to write a profiler. What books and/or articles would you recommend? Can anyone help me, please?

Has anyone already done something like this?

Comments (5)

小嗷兮 2024-07-17 13:08:58

Encouraging lot, aren't we :)

Profilers aren't too hard if you're just trying to get a reasonable idea of where the program's spending most of its time. If you're bothered about high accuracy and minimum disruption, things get difficult.

So if you just want the answers a profiler would give you, go for one someone else has written. If you're looking for the intellectual challenge, why not have a go at writing one?

I've written a couple, for run time environments that the years have rendered irrelevant.

There are two approaches:

  • adding something to each function or other significant point that logs the time and where it is.

  • having a timer going off regularly and taking a peek where the program currently is.
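The first approach can be sketched in a few lines. Everything here (the class, the flat label map, the `enter`/`exit` calls) is illustrative, not any real profiler's API:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the instrumentation approach: each function or other
// significant point calls enter()/exit(), and the profiler accumulates
// elapsed time per label.
public class InstrumentingProfiler {
    private final Map<String, Long> totalNanos = new HashMap<>();
    private final Map<String, Long> startNanos = new HashMap<>();

    public void enter(String label) {
        startNanos.put(label, System.nanoTime());
    }

    public void exit(String label) {
        Long start = startNanos.remove(label);
        if (start != null) {
            totalNanos.merge(label, System.nanoTime() - start, Long::sum);
        }
    }

    public long nanosFor(String label) {
        return totalNanos.getOrDefault(label, 0L);
    }

    public static void main(String[] args) throws InterruptedException {
        InstrumentingProfiler p = new InstrumentingProfiler();
        p.enter("work");
        Thread.sleep(20);                  // stand-in for the code being measured
        p.exit("work");
        System.out.println("work: " + p.nanosFor("work") / 1_000_000 + " ms");
    }
}
```

A real instrumenting profiler would keep a per-thread stack of start times (this flat map mishandles recursion and nested uses of the same label) and would insert the enter/exit calls automatically, e.g. by rewriting bytecode, rather than by hand.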

The JVMPI version seems to be the first kind - the link provided by uzhin shows that it can report on quite a number of things (see section 1.3). What gets executed changes to do this, so the profiling can affect the performance (and if you're profiling what was otherwise a very lightweight but often called function, it can mislead).

If you can get a timer/interrupt telling you where the program counter was at the time of the interrupt, you can use the symbol table/debugging information to work out which function it was in at the time. This provides less information but can be less disruptive. A bit more information can be obtained from walking the call stack to identify callers etc. I've no idea if this is even possible in Java...
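As it happens, the timer-driven flavour is possible in pure Java: `Thread.getStackTrace()` lets one thread snapshot another thread's stack. A minimal sketch, where the sampling interval and bookkeeping are my own illustrative choices:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the timer-driven approach in pure Java: the main thread
// periodically snapshots a worker thread's stack and counts how often
// each method appears at the top of it.
public class SamplingProfiler {
    private final Map<String, Integer> hits = new HashMap<>();

    public void sample(Thread target) {
        StackTraceElement[] stack = target.getStackTrace();
        if (stack.length > 0) {
            String top = stack[0].getClassName() + "." + stack[0].getMethodName();
            hits.merge(top, 1, Integer::sum);
        }
    }

    public Map<String, Integer> hits() { return hits; }

    public static void main(String[] args) throws InterruptedException {
        SamplingProfiler profiler = new SamplingProfiler();
        Thread worker = new Thread(() -> {
            double x = 0;
            for (long i = 0; i < 200_000_000L; i++) x += Math.sqrt(i);
        });
        worker.start();
        while (worker.isAlive()) {         // take a sample every 10 ms
            profiler.sample(worker);
            Thread.sleep(10);
        }
        profiler.hits().forEach((m, n) -> System.out.println(n + "  " + m));
    }
}
```

A serious version would record the whole stack rather than just the top frame, but even this toy shows the principle: the work shows up in proportion to the time spent in it.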

Paul.

沫雨熙 2024-07-17 13:08:58

I wrote one once, mainly as an attempt to make "deep sampling" more user-friendly. (The manual version of the method is explained here.) It is based on sampling, but rather than taking a large number of small samples, you take a small number of large samples.

It can tell you, for example, that instruction I (usually a function call) is costing you some percent X of total execution time, more or less, since it appears on the stack on X% of samples.

Think about it, because this is a key point. The call stack exists as long as the program is running. If a particular call instruction I is on the stack X% of the time, then if that instruction could disappear, that X% of time would disappear. This does not depend on how many times I is executed, or how long the function call takes. So timers and counters are missing the point. And in a sense all instructions are call instructions, even if they only call microcode.

The sampler is based on the premise that it is better to know the address of instruction I with precision (because that is what you are looking for) than to know the number X% with precision. If you know that you could save roughly 30% of time by recoding something, do you really care that you might be off by 5%? You're still going to want to fix it. The amount of time it actually saves won't be made any less or greater by your knowing X precisely.

So it is possible to drive samples off of a timer, but frankly I found it just as useful to trigger an interrupt by the user pressing both shift keys at the same time. Since 20 samples are generally plenty, and this way you can be sure to take samples at a relevant time (i.e. not while waiting for user input), it was quite adequate. Another way would be to only do the timer-driven samples while the user holds down both shift keys (or something like that).

It did not concern me that the taking of samples might slow down the program, because the goal was not to measure speed, but to locate the most costly instructions. After fixing something, the overall speedup is easy to measure.

The main thing that the profiler provided was a UI so you could examine the results painlessly. What comes out of the sampling phase is a collection of call stack samples, where each sample is a list of addresses of instructions, where every instruction but the last is a call instruction. The UI was mainly what is called a "butterfly view".
It has a current "focus", which is a particular instruction. To the left is displayed the call instructions immediately above that instruction, as culled from the stack samples. If the focus instruction is a call instruction, then the instructions below it appear to the right, as culled from the samples. On the focus instruction is displayed a percent, which is the percent of stacks containing that instruction. Similarly for each instruction on the left or right, the percent is broken down by the frequency of each such instruction. Of course, the instruction was represented by file, line number, and the name of the function it was in. The user could easily explore the data by clicking any of the instructions to make it the new focus.
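The caller/callee bookkeeping behind such a butterfly view is simple once raw samples are kept. A sketch, where the class and field names are mine, not the original tool's:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the bookkeeping behind a "butterfly view": each sample is a
// list of call sites, outermost caller first. For a focus site, count the
// samples containing it and tally what sits immediately above (callers)
// and below (callees) it in those samples.
public class ButterflyView {
    public final Map<String, Integer> callers = new HashMap<>();
    public final Map<String, Integer> callees = new HashMap<>();
    public int containing = 0;

    public static ButterflyView focus(List<List<String>> samples, String site) {
        ButterflyView v = new ButterflyView();
        for (List<String> stack : samples) {
            int i = stack.indexOf(site);   // first hit only: recursion counts once
            if (i < 0) continue;
            v.containing++;
            if (i > 0) v.callers.merge(stack.get(i - 1), 1, Integer::sum);
            if (i < stack.size() - 1) v.callees.merge(stack.get(i + 1), 1, Integer::sum);
        }
        return v;
    }

    public static void main(String[] args) {
        List<List<String>> samples = List.of(
                List.of("main", "parse", "strlen"),
                List.of("main", "parse"),
                List.of("main", "render"));
        ButterflyView v = focus(samples, "parse");
        System.out.println(v.containing + "/" + samples.size()
                + " callers=" + v.callers + " callees=" + v.callees);
    }
}
```

Clicking a caller or callee in the UI would just re-run `focus` with that site, which is how the painless navigation falls out of keeping the raw samples.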

A variation on this UI treated the butterfly as bipartite, consisting of alternating layers of function call instructions and the functions containing them. That can give a little more clarity of time spent in each function.

Maybe it's not obvious, so it's worth mentioning some properties of this technique.

  • Recursion is not an issue, because if an instruction appears more than once on any given stack sample, that still counts as only one sample containing it. It still remains true that the estimated time that would be saved by its removal is the percent of stacks it is on.

  • Notice this is not the same as a call tree. It gives you the cost of an instruction no matter how many different branches of a call tree it is in.

  • Performance of the UI is not an issue, because the number of samples need not be very large. If a particular instruction I is the focus, it is quite simple to find how many samples contain it, and for each adjacent instruction, how many of the samples containing I also contain the adjacent instruction next to it.

  • As mentioned before, speed of sampling is not an issue, because we're not measuring performance, we're diagnosing. The sampling does not bias the results, because the sampling does not affect what the overall program does. An algorithm that takes N instructions to complete still takes N instructions even if it is halted any number of times.

  • I'm often asked how to sample a program that completes in milliseconds. The simple answer is to wrap it in an outer loop to make it take long enough to sample. You can find out what takes X% of time, remove it, get the X% speedup, and then remove the outer loop.
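The outer-loop trick looks like this in practice; the task and the repetition count are hypothetical, the count just needs to be large enough for a sampler to catch the work:

```java
// Sketch of the outer-loop trick: repeat a millisecond-scale task enough
// times that a stack sampler can see where it spends its time, then
// remove the loop once the hot spot is found and fixed.
public class OuterLoop {
    static long fastTask() {               // stand-in for the fast program
        long sum = 0;
        for (int i = 0; i < 1_000; i++) sum += i;
        return sum;
    }

    public static void main(String[] args) {
        long sink = 0;
        for (int rep = 0; rep < 100_000; rep++) {  // added only for profiling
            sink += fastTask();
        }
        System.out.println(sink);  // keep the result live so the JIT can't delete the work
    }
}
```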

This little profiler, that I called YAPA (yet another performance analyzer) was DOS-based and made a nice little demo, but when I had serious work to do, I would fall back on the manual method. The main reason for this is that the call stack alone is often not enough state information to tell you why a particular cycle is being spent. You may also need to know other state information so you have a more complete idea of what the program was doing at that time. Since I found the manual method pretty satisfactory, I shelved the tool.

A point that's often missed when talking about profiling is that you can do it repeatedly to find multiple problems. For example, suppose instruction I1 is on the stack 5% of the time, and I2 is on the stack 50% of the time. Twenty samples will easily find I2, but maybe not I1. So you fix I2. Then you do it all again, but now I1 takes 10% of the time, so 20 samples will probably see it. This magnification effect allows repeated applications of profiling to achieve large compounded speedup factors.
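The arithmetic behind that magnification effect can be written down directly; the 5%/50% figures are the ones from the paragraph above:

```java
// Check of the magnification arithmetic: fixing the 50% problem halves
// total time, which doubles the remaining 5% problem's share to 10%,
// making it visible to the same 20 samples.
public class Magnification {
    public static double shareAfterFix(double share, double fixedShare) {
        return share / (1.0 - fixedShare); // unchanged cost over the shrunken total
    }

    public static void main(String[] args) {
        System.out.println(shareAfterFix(0.05, 0.50)); // I1 grows from 5% to 10%
    }
}
```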

月竹挽风 2024-07-17 13:08:58

I would look at those open-source projects first:

Then I would look at JVMTI (not JVMPI)

放手` 2024-07-17 13:08:58

JVMPI spec: http://java.sun.com/j2se/1.5.0/docs/guide/jvmpi/jvmpi.html

I salute your courage and bravery

EDIT: And as noted by user Boune, JVMTI:
http://java.sun.com/developer/technicalArticles/Programming/jvmti/

萤火眠眠 2024-07-17 13:08:58

As another answer, I just looked at LukeStackwalker on sourceforge. It is a nice, small example of a stack sampler, and a nice place to start if you want to write a profiler.

Here, in my opinion, is what it does right:

  • It samples the entire call stack.

Sigh ... so near yet so far. Here, IMO, is what it (and other stack samplers like xPerf) should do:

  • It should retain the raw stack samples. As it is, it summarizes at the function level as it samples. This loses the key line-number information locating the problematic call sites.

  • It need not take so many samples, if storage to hold them is an issue. Since typical performance problems cost from 10% to 90%, 20-40 samples will show them quite reliably. Hundreds of samples give more measurement precision, but they do not increase the probability of locating the problems.

  • The UI should summarize in terms of statements, not functions. This is easy to do if the raw samples are kept. The key measure to attach to a statement is the fraction of samples containing it. For example:

    5/20 MyFile.cpp:326 for (i = 0; i < strlen(s); ++i)

This says that line 326 in MyFile.cpp showed up on 5 out of 20 samples, in the process of calling strlen. This is very significant, because you can instantly see the problem, and you know how much speedup you can expect from fixing it. If you replace strlen(s) by s[i], it will no longer be spending time in that call, so these samples will not occur, and the speedup will be approximately 1/(1-5/20) = 20/(20-5) = 4/3 = 33% speedup. (Thanks to David Thornley for this sample code.)

  • The UI should have a "butterfly" view showing statements. (If it shows functions too, that's OK, but the statements are what really matter.) For example:

    3/20 MyFile.cpp:502 MyFunction(myArgs)
    2/20 HisFile.cpp:113 MyFunction(hisArgs)

    5/20 MyFile.cpp:326 for (i = 0; i < strlen(s); ++i)

    5/20 strlen.asm:23 ... some assembly code ...

In this example, the line containing the for statement is the "focus of attention". It occurred on 5 samples. The two lines above it say that on 3 of those samples, it was called from MyFile.cpp:502, and on 2 of those samples, it was called from HisFile.cpp:113. The line below it says that on all 5 of those samples, it was in strlen (no surprise there). In general, the focus line will have a tree of "parents" and a tree of "children". If for some reason, the focus line is not something you can fix, you can go up or down. The goal is to find lines that you can fix that are on as many samples as possible.
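The claim above that 20-40 samples reliably show problems costing 10% to 90% follows from a one-line probability: a problem on the stack a fraction f of the time is missed by n independent samples with probability (1-f)^n. A quick check:

```java
// Chance that at least one of n independent stack samples catches a
// problem that is on the stack a fraction f of the time.
public class SampleOdds {
    public static double chanceSeen(double f, int n) {
        return 1.0 - Math.pow(1.0 - f, n);
    }

    public static void main(String[] args) {
        System.out.printf("f=0.10, n=20 -> %.3f%n", chanceSeen(0.10, 20)); // ~0.878
        System.out.printf("f=0.30, n=20 -> %.3f%n", chanceSeen(0.30, 20)); // ~0.999
    }
}
```

So even a 10% problem shows up in 20 samples almost 9 times out of 10, and a 30% problem essentially always does; hundreds of samples only sharpen the percentage, not the odds of finding the line.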

IMPORTANT: Profiling should not be looked at as something you do once. For example, in the sample above, we got a 4/3 speedup by fixing one line of code. When the process is repeated, other problematic lines of code should show up at 4/3 the frequency they did before, and thus be easier to find. I never hear of people talking about iterating the profiling process, but it is crucial to getting overall large compounded speedups.

P.S. If a statement occurs more than once in a single sample, that means there is recursion taking place. It is not a problem. It still only counts as one sample containing the statement. It is still the case that the cost of the statement is approximated by the fraction of samples containing it.
