Looking for an estimation method (data analysis)
Since I have no idea about what I am doing right now, my wording may sound funny. But seriously, I need to learn.
The problem I'm facing is to come up with a method (model) to estimate how a software program behaves: namely its running time and maximal memory usage. What I already have is a large amount of data. This data set gives an overview of how the program behaves under different conditions, e.g.
<code>
RUN    Criterion_A  Criterion_B  Criterion_C  Criterion_D  Criterion_E
-----------------------------------------------------------------------
R0001           12            2         3556           27            9
R0002            2            5         2154           22            8
R0003           19           12         5556           37            9
R0004           10            3         1556            7            9
R0005            5            1          556           17            8
</code>
I have thousands of rows of such data. Now I need to know how I can estimate (forecast) the running time and maximal memory usage if I know all criteria in advance. What I need is an approximation that gives hints (upper limits, or ranges).
I have a feeling that this is a typical ??? problem whose name I just don't know. Could you show me some hints or give me some ideas (theories, explanations, webpages), or anything else that may help? Thanks!
You want a new program that takes as input one or more criteria, then outputs an estimate of the running time or memory usage. This is a machine learning problem.
Your inputs can be listed as a vector of numbers, like this:
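For instance, the criteria of run R0001 from the table above form the vector:

<code>
(12, 2, 3556, 27, 9)
</code>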
One of the simplest algorithms for this would be a K-nearest neighbor algorithm. The idea behind this is that you'll take your input vector of numbers, and find in your database the vector of numbers that is most similar to your input vector. For example, given this vector of inputs:
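Suppose, purely for illustration, the new run you want an estimate for has these criteria (this particular vector is made up):

<code>
(11, 3, 3600, 25, 9)
</code>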
You can assume that the running time and memory should be very similar to the values from this run (originally in your table listed above):
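<code>
R0001           12            2         3556           27            9
</code>

(With the made-up input above, R0001 happens to be the closest run in the table.)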
There are several ways to calculate the similarity between these two vectors; one simple and intuitive choice is the Euclidean distance. As an example, the Euclidean distance between the input vector and the vector from the table is:
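Using the made-up input (11, 3, 3600, 25, 9) and row R0001 (12, 2, 3556, 27, 9):

<code>
sqrt((11-12)^2 + (3-2)^2 + (3600-3556)^2 + (25-27)^2 + (9-9)^2)
  = sqrt(1 + 1 + 1936 + 4 + 0)
  = sqrt(1942)
  ≈ 44.07
</code>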
It should be intuitively clear that points with lower distance should be better estimates for running time and memory usage, as the distance should describe the similarity between two sets of criteria. Assuming your criteria are informative and well-selected, points with similar criteria should have similar running time and memory usage.
Here's some example code of how to do this in R:
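A minimal sketch of that lookup, assuming your measurements sit in a data frame called runs with the five criteria columns plus RunTime and MaxMemory columns (those two column names, and the file name, are assumptions for illustration):

<code>
# `runs` holds your measurements, e.g. loaded from a text file:
# runs <- read.table("runs.txt", header = TRUE)
criteria_cols <- c("Criterion_A", "Criterion_B", "Criterion_C",
                   "Criterion_D", "Criterion_E")

# The new set of criteria we want an estimate for (made-up values).
input <- c(11, 3, 3600, 25, 9)

# Euclidean distance from the input to every row of the table.
dists <- apply(runs[, criteria_cols], 1,
               function(row) sqrt(sum((row - input)^2)))

# K = 1: take the single nearest run and use its measurements as the estimate.
nearest <- which.min(dists)
runs[nearest, c("RunTime", "MaxMemory")]
</code>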
Taking the running time and memory usage of your nearest row is the KNN algorithm for K=1. This approach can be extended to include data from multiple rows by taking a weighted combination of multiple rows from the database, with rows with lower distances to your input vector contributing more to the estimates. Read the Wikipedia page on KNN for more information, especially with regard to data normalization, including contributions from multiple points, and computing distances.
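For example, a distance-weighted version of the sketch above (K = 3 and inverse-distance weights are just one reasonable choice, not the only one):

<code>
# Reuses `runs` and `dists` from the previous sketch.
k <- 3
nearest_k <- order(dists)[1:k]              # indices of the k closest runs
weights   <- 1 / (dists[nearest_k] + 1e-9)  # closer runs get larger weights
weights   <- weights / sum(weights)         # normalize weights to sum to 1

estimated_time   <- sum(weights * runs$RunTime[nearest_k])
estimated_memory <- sum(weights * runs$MaxMemory[nearest_k])
</code>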
When calculating the distance between these input vectors, you should consider normalizing your data. The rationale is that a difference of 1 unit between 3557 and 3556 for criterion C may not be equivalent to a difference of 1 between 11 and 12 for criterion A. If your data are normally distributed, you can convert them all to standard scores (or Z-scores, see http://en.wikipedia.org/wiki/Standard_score) using this formula:
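The standard score of a value x in a column is its distance from that column's mean, measured in standard deviations:

<code>
z = (x - mean(column)) / sd(column)
</code>

In R, scale(runs[, criteria_cols]) applies exactly this transformation column by column (again assuming the runs data frame from the sketch above).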
There is no general solution on the "right" way to normalize data as it depends on the type and range of data that you have, but Z-scores are easy to compute and a good method to try first.
There are many more sophisticated techniques for constructing such estimates, including linear regression, support vector regression, and non-linear modeling. The idea behind some of the more sophisticated methods is that you try to develop an equation that describes the relationship of your variables to running time or memory. For example, a simple application might have just one criterion, and you could try to distinguish between models such as:
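For instance, with a single criterion A (these particular functional forms are only illustrations):

<code>
time = s1 + s2 * A                  # linear
time = s1 + s2 * A + s3 * A^2       # quadratic
time = s1 + s2 * A * log(A)         # linearithmic
time = s1 * exp(s2 * A)             # exponential
</code>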
The idea is that A is your fixed criterion, and the sN are free parameters that you can tweak until you get a model that fits well.
One problem with this approach is that there are many different possible models that have different numbers of parameters. Distinguishing between models that have different numbers of parameters is a difficult problem in statistics, and I don't recommend tackling it during your first foray into machine learning.
Some questions that you should ask yourself are:
If the criteria you are forecasting for lie within the range of your currently known criteria, then you should do some more research on the interpolation process.
If they lie outside your currently known data range, research extrapolation methods instead, which are less accurate.
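As a small sketch of the difference (reusing the assumed runs data frame and RunTime column from the first answer): fit a simple regression on the known runs, then a prediction for criteria inside the observed range is interpolation, while a prediction outside that range is extrapolation and should be trusted less.

<code>
# Simple linear model of running time against one criterion.
fit <- lm(RunTime ~ Criterion_C, data = runs)

# Criterion_C in the table above ranges from 556 to 5556:
predict(fit, data.frame(Criterion_C = 3000))    # interpolation
predict(fit, data.frame(Criterion_C = 20000))   # extrapolation: less reliable
</code>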