Looking for an estimation method (data analysis)
Since I have no idea about what I am doing right now, my wording may sound funny. But seriously, I need to learn.
The problem I'm facing is to come up with a method (model) to estimate how a software program behaves: namely its running time and maximal memory usage. What I already have is a large amount of data. This data set gives an overview of how the program behaves under different conditions, e.g.
<code>
RUN    Criterion_A  Criterion_B  Criterion_C  Criterion_D  Criterion_E
-----------------------------------------------------------------------
R0001           12            2         3556           27            9
R0002            2            5         2154           22            8
R0003           19           12         5556           37            9
R0004           10            3         1556            7            9
R0005            5            1          556           17            8
</code>
I have thousands of rows of such data. Now I need to know how I can estimate (forecast) the running time and maximal memory usage if I know all criteria in advance. What I need is an approximation that gives hints (upper limits, or ranges).
I have a feeling that this is a typical ??? problem whose name I just don't know. Could you show me some hints or give me some ideas (theories, explanations, webpages), or anything else that may help? Thanks!
You want a new program that takes as input one or more criteria, then outputs an estimate of the running time or memory usage. This is a machine learning problem.
Your inputs can be listed as a vector of numbers, like this:
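For instance, the criteria of run R0001 from the table above form the vector:

<code>
(12, 2, 3556, 27, 9)
</code>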
One of the simplest algorithms for this would be a K-nearest neighbor algorithm. The idea behind this is that you'll take your input vector of numbers, and find in your database the vector of numbers that is most similar to your input vector. For example, given this vector of inputs:
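Suppose, purely for illustration, the new run you want an estimate for has these criteria (this particular vector is made up):

<code>
(11, 3, 3600, 25, 9)
</code>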
You can assume that the running time and memory should be very similar to the values from this run (originally in your table listed above):
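<code>
R0001           12            2         3556           27            9
</code>

(With the made-up input above, R0001 happens to be the closest run in the table.)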
There are several ways to calculate the similarity between these two vectors; one simple and intuitive choice is the Euclidean distance. As an example, the Euclidean distance between the input vector and the vector from the table is:
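Using the made-up input (11, 3, 3600, 25, 9) and row R0001 (12, 2, 3556, 27, 9):

<code>
sqrt((11-12)^2 + (3-2)^2 + (3600-3556)^2 + (25-27)^2 + (9-9)^2)
  = sqrt(1 + 1 + 1936 + 4 + 0)
  = sqrt(1942)
  ≈ 44.07
</code>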
It should be intuitively clear that points with lower distance should be better estimates for running time and memory usage, as the distance should describe the similarity between two sets of criteria. Assuming your criteria are informative and well-selected, points with similar criteria should have similar running time and memory usage.
Here's some example code of how to do this in R:
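A minimal sketch of that lookup, assuming your measurements sit in a data frame called runs with the five criteria columns plus RunTime and MaxMemory columns (those two column names, and the file name, are assumptions for illustration):

<code>
# `runs` holds your measurements, e.g. loaded from a text file:
# runs <- read.table("runs.txt", header = TRUE)
criteria_cols <- c("Criterion_A", "Criterion_B", "Criterion_C",
                   "Criterion_D", "Criterion_E")

# The new set of criteria we want an estimate for (made-up values).
input <- c(11, 3, 3600, 25, 9)

# Euclidean distance from the input to every row of the table.
dists <- apply(runs[, criteria_cols], 1,
               function(row) sqrt(sum((row - input)^2)))

# K = 1: take the single nearest run and use its measurements as the estimate.
nearest <- which.min(dists)
runs[nearest, c("RunTime", "MaxMemory")]
</code>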
Taking the running time and memory usage of your nearest row is the KNN algorithm for K=1. This approach can be extended to include data from multiple rows by taking a weighted combination of multiple rows from the database, with rows with lower distances to your input vector contributing more to the estimates. Read the Wikipedia page on KNN for more information, especially with regard to data normalization, including contributions from multiple points, and computing distances.
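For example, a distance-weighted version of the sketch above (K = 3 and inverse-distance weights are just one reasonable choice, not the only one):

<code>
# Reuses `runs` and `dists` from the previous sketch.
k <- 3
nearest_k <- order(dists)[1:k]              # indices of the k closest runs
weights   <- 1 / (dists[nearest_k] + 1e-9)  # closer runs get larger weights
weights   <- weights / sum(weights)         # normalize weights to sum to 1

estimated_time   <- sum(weights * runs$RunTime[nearest_k])
estimated_memory <- sum(weights * runs$MaxMemory[nearest_k])
</code>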
When calculating the distance between these input vectors, you should consider normalizing your data. The rationale is that a difference of 1 unit between 3557 and 3556 for criterion C may not be equivalent to a difference of 1 between 11 and 12 for criterion A. If your data are normally distributed, you can convert them all to standard scores (or Z-scores, see http://en.wikipedia.org/wiki/Standard_score) using this formula:
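The standard score of a value x in a column is its distance from that column's mean, measured in standard deviations:

<code>
z = (x - mean(column)) / sd(column)
</code>

In R, scale(runs[, criteria_cols]) applies exactly this transformation column by column (again assuming the runs data frame from the sketch above).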
There is no general solution on the "right" way to normalize data as it depends on the type and range of data that you have, but Z-scores are easy to compute and a good method to try first.
There are many more sophisticated techniques for constructing such estimates, including linear regression, support vector regression, and non-linear modeling. The idea behind some of the more sophisticated methods is that you try to develop an equation that describes the relationship of your variables to running time or memory. For example, a simple application might have just one criterion, and you could try to distinguish between models such as:
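For instance, with a single criterion A (these particular functional forms are only illustrations):

<code>
time = s1 + s2 * A                  # linear
time = s1 + s2 * A + s3 * A^2       # quadratic
time = s1 + s2 * A * log(A)         # linearithmic
time = s1 * exp(s2 * A)             # exponential
</code>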
The idea is that A is your fixed criterion, and the sN are free parameters that you can tweak until you get a model that fits well.
One problem with this approach is that there are many different possible models that have different numbers of parameters. Distinguishing between models that have different numbers of parameters is a difficult problem in statistics, and I don't recommend tackling it during your first foray into machine learning.
Some questions that you should ask yourself are:
If the criteria you are forecasting for lie within the range of your currently known criteria, then you should do some more research on the interpolation process.
If they lie outside your currently known data range, research extrapolation methods instead, which are less accurate.
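As a small sketch of the difference (reusing the assumed runs data frame and RunTime column from the first answer): fit a simple regression on the known runs, then a prediction for criteria inside the observed range is interpolation, while a prediction outside that range is extrapolation and should be trusted less.

<code>
# Simple linear model of running time against one criterion.
fit <- lm(RunTime ~ Criterion_C, data = runs)

# Criterion_C in the table above ranges from 556 to 5556:
predict(fit, data.frame(Criterion_C = 3000))    # interpolation
predict(fit, data.frame(Criterion_C = 20000))   # extrapolation: less reliable
</code>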