Which statistical concepts are useful for profiling?

Posted on 2024-08-19 23:35:04

I've been meaning to do a little bit of brushing up on my knowledge of statistics. One area where it seems like statistics would be helpful is in profiling code. I say this because it seems like profiling almost always involves me trying to pull some information from a large amount of data.

Are there any subjects in statistics that I could brush up on to get a better understanding of profiler output? Bonus points if you can point me to a book or other resource that will help me understand these subjects better.

Comments (7)

嘿哥们儿 2024-08-26 23:35:04

I'm not sure books on statistics are that useful when it comes to profiling. Running a profiler should give you a list of functions and the percentage of time spent in each. You then look at the one with the highest percentage and see if you can optimise it in any way. Repeat until your code is fast enough. Not much scope for standard deviation or chi-squared there, I feel.

木森分化 2024-08-26 23:35:04

All I know about profiling is what I just read in Wikipedia :-) but I do know a fair bit about statistics. The profiling article mentioned sampling and statistical analysis of sampled data. Clearly statistical analysis will be able to use those samples to develop some statistical statements on performance. Let's say you have some measure of performance, m, and you sample that measure 1000 times. Let's also say you know something about the underlying processes that created that value of m. For instance, if m is the SUM of a bunch of random variates, the distribution of m is probably normal. If m is the PRODUCT of a bunch of random variates, the distribution is probably lognormal. And so on...
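The sum-versus-product claim can be checked with a small simulation. This is only a sketch under assumptions that are not in the answer: the variates are independent and drawn from uniform(0.5, 1.5), and sample skewness is used as a rough symmetry check (near 0 for a normal-looking sample). The names `skewness` and `simulate` are made up for this illustration.

```python
import math
import random
import statistics


def skewness(values):
    """Sample skewness: close to 0 for a symmetric (e.g. normal) sample."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def simulate(n_samples=5000, n_factors=50, seed=0):
    """Return n_samples sums and products of n_factors uniform(0.5, 1.5) variates."""
    rng = random.Random(seed)
    sums, products = [], []
    for _ in range(n_samples):
        factors = [rng.uniform(0.5, 1.5) for _ in range(n_factors)]
        sums.append(sum(factors))
        products.append(math.prod(factors))
    return sums, products


sums, products = simulate()
log_products = [math.log(p) for p in products]

# Sums should look symmetric, products strongly right-skewed,
# and the logarithm of the products roughly symmetric again.
print(f"skewness of sums:          {skewness(sums):.2f}")
print(f"skewness of products:      {skewness(products):.2f}")
print(f"skewness of log(products): {skewness(log_products):.2f}")
```

If the logarithm of a measure looks symmetric while the raw measure has a long right tail, that is a hint the lognormal description may fit better than the normal one.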

If you don't know the underlying distribution and you want to make some statement about comparing performance, you may need what are called non-parametric statistics.
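As one concrete illustration of a non-parametric comparison, here is a minimal sketch of a permutation test on the difference of medians. The answer does not name a specific method, so this choice is an assumption, and the timing values and the name `permutation_test` are hypothetical.

```python
import random
import statistics


def permutation_test(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference of medians.

    Returns the fraction of random relabelings whose absolute median
    difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(statistics.median(a) - statistics.median(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        left, right = pooled[:len(a)], pooled[len(a):]
        if abs(statistics.median(left) - statistics.median(right)) >= observed:
            extreme += 1
    return extreme / n_permutations


# Hypothetical wall-clock timings in milliseconds for two builds.
old_build = [10.2, 10.5, 10.1, 10.4, 10.3, 14.9, 10.2, 10.6]
new_build = [9.1, 9.3, 9.0, 9.4, 9.2, 9.1, 13.8, 9.3]

p_value = permutation_test(old_build, new_build)
print(f"approximate p-value: {p_value:.3f}")
```

The test makes no assumption about the shape of the timing distribution; it only assumes that, if the two builds performed the same, the labels "old" and "new" would be interchangeable.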

Overall, I'd suggest any standard text on statistical inference (DeGroot), a text that covers different probability distributions and where they're applicable (Hastings & Peacock), and a book on non-parametric statistics (Conover). Hope this helps.

⊕婉儿 2024-08-26 23:35:04

Statistics is fun and interesting, but for performance tuning, you don't need it. Here's an explanation why, but a simple analogy might give the idea.

A performance problem is like an object (which may actually be multiple connected objects) buried under an acre of snow, and you are trying to find it by probing randomly with a stick. If your stick hits it a couple of times, you've found it; its exact size is not so important. (If you really want a better estimate of how big it is, take more probes, but that won't change its size.) The number of times you have to probe the snow before you find it depends on how much of the snow's area it lies under.

Once you find it, you can pull it out. Now there is less snow, but there might be more objects under the snow that remains. So with more probing, you can find and remove those as well. In this way, you can keep going until you can't find anything more that you can remove.

In software, the snow is time, and probing is taking random-time samples of the call stack. In this way, it is possible to find and remove multiple problems, resulting in large speedup factors.
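To put rough numbers on how many probes it takes, here is a small sketch. It assumes each stack sample independently lands in the problem code with probability equal to the fraction of run time that code accounts for; that independence assumption and the function name `prob_at_least_two_hits` are not part of the answer.

```python
def prob_at_least_two_hits(fraction, n_samples):
    """Probability that at least 2 of n_samples random stack samples land in
    code that accounts for `fraction` of the run time (independent samples)."""
    miss = 1.0 - fraction
    p_zero = miss ** n_samples
    p_one = n_samples * fraction * miss ** (n_samples - 1)
    return 1.0 - p_zero - p_one


# Bigger problems are found with fewer samples.
for fraction in (0.5, 0.3, 0.1):
    for n_samples in (5, 10, 20):
        p = prob_at_least_two_hits(fraction, n_samples)
        print(f"fraction={fraction:.1f} samples={n_samples:2d} -> {p:.2f}")
```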

And statistics has nothing to do with it.

白日梦 2024-08-26 23:35:04

Zed Shaw, as usual, has some thoughts on the subject of statistics and programming, but he puts them much more eloquently than I could.

花之痕靓丽 2024-08-26 23:35:04

I think that the most important statistical concept to understand in this context is Amdahl's law. Although commonly referred to in contexts of parallelization, Amdahl's law has a more general interpretation. Here's an excerpt from the Wikipedia page:

More technically, the law is concerned with the speedup achievable from an improvement to a computation that affects a proportion P of that computation where the improvement has a speedup of S. (For example, if an improvement can speed up 30% of the computation, P will be 0.3; if the improvement makes the portion affected twice as fast, S will be 2.) Amdahl's law states that the overall speedup of applying the improvement will be

$$\frac{1}{(1 - P) + \frac{P}{S}}$$
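As a quick worked example of the law with the P and S values from the excerpt, here is a minimal sketch; the function name `amdahl_speedup` is chosen for this illustration only.

```python
def amdahl_speedup(p, s):
    """Overall speedup 1 / ((1 - p) + p / s) when a proportion p of the
    computation is made s times faster."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if s <= 0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - p) + p / s)


# Values from the excerpt: 30% of the computation made twice as fast.
print(f"P=0.3, S=2   -> {amdahl_speedup(0.3, 2):.3f}x overall")

# Even a huge speedup of that 30% is capped near 1 / (1 - P).
print(f"P=0.3, S=1e9 -> {amdahl_speedup(0.3, 1e9):.3f}x overall")
```

For profiling, the practical reading is that the fraction of time a function accounts for bounds what optimising it can buy you.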

一袭水袖舞倾城 2024-08-26 23:35:04

I think one concept related to both statistics and profiling (your original question) that is very useful, and that you see advised from time to time, applies when doing "micro profiling". A lot of programmers will rally and yell "you can't micro profile, micro profiling simply doesn't work, too many things can influence your computation".

Yet you can simply run your profiling n times and keep only x% of your observations, the ones around the median, because the median is a "robust statistic" (unlike the mean) that is not influenced by outliers (outliers being precisely the values you do not want to take into account when doing such profiling).

This is definitely a very useful statistical technique for programmers who want to micro-profile their code.
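A minimal sketch of that idea, assuming Python's standard `timeit` module as the timer; the helper `keep_around_median`, the 50% keep fraction, and the `work` function are hypothetical choices for illustration.

```python
import statistics
import timeit


def keep_around_median(observations, keep_fraction=0.5):
    """Return the keep_fraction of observations closest in rank to the median."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    ordered = sorted(observations)
    n_keep = max(1, round(len(ordered) * keep_fraction))
    start = (len(ordered) - n_keep) // 2
    return ordered[start:start + n_keep]


def work():
    """Stand-in for the code being micro-profiled."""
    return sum(i * i for i in range(1000))


# Run the measurement n = 30 times, then discard the tails.
timings = timeit.repeat(work, repeat=30, number=200)
kept = keep_around_median(timings, keep_fraction=0.5)

print(f"mean of all runs:    {statistics.fmean(timings):.6f} s")
print(f"median of all runs:  {statistics.median(timings):.6f} s")
print(f"mean of kept runs:   {statistics.fmean(kept):.6f} s")
```

A run disturbed by the scheduler or a cache miss shifts the plain mean but is dropped from the kept slice.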

迷离° 2024-08-26 23:35:04

If you use the MVC pattern with PHP, this is what you need to profile:

Application:
   Controller Setup time
   Model Setup time
   View Setup time
Database
   Query - Time
Cookies
   Name - Value
Sessions
   Name - Value