What are good methods for predicting the completion time of a long process?

Posted on 2024-12-08 08:05:09

tl;dr: I want to predict file copy completion. What are good methods given the start time and the current progress?

Firstly, I am aware that this is not at all a simple problem, and that predicting the future is difficult to do well. For context, I'm trying to predict the completion of a long file copy.

Current Approach:

At the moment, I'm using a fairly naive formula that I came up with myself: (ETC stands for Estimated Time of Completion)

ETC = currTime + elapsedTime * (totalSize - sizeDone) / sizeDone

This assumes that the remaining data will be copied at the average speed achieved so far, which may or may not be realistic (I'm dealing with tape archives here).

  • PRO: The ETC will change gradually, and becomes more and more accurate as the process nears completion.
  • CON: It doesn't react well to unexpected events, like the file copy becoming stuck or speeding up quickly.
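
As a rough illustration, the formula above could be written as follows. This is only a sketch: the function name and the choice to return `None` before any progress has been made are assumptions, not part of the original question.

```python
def naive_etc(start_time, now, size_done, total_size):
    """Estimated time of completion, assuming the rest of the copy
    runs at the average speed observed since start_time.

    Times are in seconds, sizes in any consistent unit (e.g. bytes).
    """
    if size_done <= 0:
        # No progress yet, so the average speed is undefined.
        return None
    elapsed = now - start_time
    return now + elapsed * (total_size - size_done) / size_done
```

For example, a copy that is 25% done after 100 seconds gets an ETC of 400 seconds after the start.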

Another idea:

The next idea I had was to keep a record of the progress for the last n seconds (or minutes, given that these archives are supposed to take hours), and just do something like:

ETC = currTime + (totalSize - sizeDone) / currAvg

This is kind of the opposite of the first method in that:

  • PRO: If the speed changes quickly, the ETC will update quickly to reflect the current state of affairs.
  • CON: The ETC may jump around a lot if the speed is inconsistent.
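
A minimal sketch of this windowed variant, treating the recent average as a speed so that the remaining size is divided by it. The class name, the `(timestamp, size_done)` sample format, and the decision to return `None` while the copy is stalled are all assumptions made for illustration.

```python
from collections import deque


class WindowedEtc:
    """ETC based on the average speed over roughly the last window_seconds."""

    def __init__(self, total_size, window_seconds):
        self.total_size = total_size
        self.window_seconds = window_seconds
        self.samples = deque()  # (timestamp, size_done) pairs, oldest first

    def update(self, now, size_done):
        self.samples.append((now, size_done))
        # Drop old samples, but keep the newest one that is at least
        # window_seconds old so the window always spans the full period.
        while len(self.samples) > 2 and now - self.samples[1][0] >= self.window_seconds:
            self.samples.popleft()

    def etc(self):
        if len(self.samples) < 2:
            return None
        (t0, s0), (t1, s1) = self.samples[0], self.samples[-1]
        if t1 <= t0 or s1 <= s0:
            return None  # no measurable progress inside the window
        speed = (s1 - s0) / (t1 - t0)
        return t1 + (self.total_size - s1) / speed
```

Calling `update` on every progress tick and `etc` whenever the display refreshes keeps the two concerns separate.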

Finally

I'm reminded of the control engineering subjects I did at uni, where the objective is essentially to try to get a system that reacts quickly to sudden changes, but isn't unstable and crazy.

With that said, the other option I could think of would be to calculate the average of both of the above, perhaps with some kind of weighting:

  • Weight the first method more if the copy has a fairly consistent long-term average speed, even if it jumps around a bit locally.
  • Weight the second method more if the copy speed is unpredictable, and is likely to do things like speed up/slow down for long periods, or stop altogether for long periods.
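
One way to picture the weighting is a plain convex combination of the two ETC values. This is a sketch only; the parameter `recent_weight` is a hypothetical tuning knob, and how to choose it is exactly the open question here.

```python
def blended_etc(etc_overall, etc_recent, recent_weight):
    """Weighted average of the long-term ETC and the recent-window ETC.

    recent_weight = 0.0 trusts only the long-term average,
    recent_weight = 1.0 trusts only the recent window.
    """
    if not 0.0 <= recent_weight <= 1.0:
        raise ValueError("recent_weight must be between 0 and 1")
    return (1.0 - recent_weight) * etc_overall + recent_weight * etc_recent
```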

What I am really asking for is:

  • Any alternative approaches to the two I have given.
  • If and how you would combine several different methods to get a final prediction.

Comments (4)

恬淡成诗 2024-12-15 08:05:09

If you feel that the accuracy of prediction is important, the way to go about building a predictive model is as follows:

  1. collect some real-world measurements;
  2. split them into three disjoint sets: training, validation and test;
  3. come up with some predictive models (you already have two plus a mix) and fit them using the training set;
  4. check predictive performance of the models on the validation set and pick the one that performs best;
  5. use the test set to assess the out-of-sample prediction error of the chosen model.

I'd hazard a guess that a linear combination of your current model and the "average over the last n seconds" would perform pretty well for the problem at hand. The optimal weights for the linear combination can be fitted using linear regression (a one-liner in R).
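
To make the fitting step concrete, here is a sketch in plain Python rather than R. It assumes a simplified form of the regression in which the two weights sum to 1 and there is no intercept, so only one weight `w` has to be fitted; the function name and argument layout are illustrative. The inputs would come from the training set: the two models' predictions and the completion times actually observed.

```python
def fit_blend_weight(pred_a, pred_b, actual):
    """Least-squares weight w such that w * a + (1 - w) * b best matches actual.

    Minimising sum((w * (a - b) + b - y) ** 2) over w gives the
    closed form w = sum((a - b) * (y - b)) / sum((a - b) ** 2).
    """
    num = sum((a - b) * (y - b) for a, b, y in zip(pred_a, pred_b, actual))
    den = sum((a - b) ** 2 for a, b in zip(pred_a, pred_b))
    if den == 0:
        raise ValueError("the two models made identical predictions")
    return num / den
```

A fitted weight outside the range 0 to 1 would suggest that the constrained blend is a poor fit and that a full regression with an intercept is worth trying.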

An excellent resource for studying statistical learning methods is The Elements of Statistical Learning by Hastie, Tibshirani and Friedman. I can't recommend that book highly enough.

Lastly, your second idea (average over the last n seconds) attempts to measure the instantaneous speed. A more robust technique for this might be to use the Kalman filter, whose purpose is exactly this:

Its purpose is to use measurements observed over time, containing
noise (random variations) and other inaccuracies, and produce values
that tend to be closer to the true values of the measurements and
their associated calculated values.

The principal advantage of using the Kalman filter rather than a fixed n-second sliding window is that it's adaptive: it will automatically use a longer averaging window when measurements jump around a lot than when they're stable.
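
For reference, a scalar Kalman filter for the copy speed might look like the sketch below. It assumes a random-walk model for the true speed, and `process_var` and `measurement_var` are hypothetical tuning parameters that would have to be chosen or estimated from real measurements. A larger `measurement_var` relative to `process_var` makes the filter average over a longer effective window.

```python
class SpeedKalman:
    """One-dimensional Kalman filter tracking a slowly drifting speed."""

    def __init__(self, initial_speed, initial_var, process_var, measurement_var):
        self.speed = initial_speed              # current speed estimate
        self.var = initial_var                  # uncertainty of that estimate
        self.process_var = process_var          # how much the true speed may drift per step
        self.measurement_var = measurement_var  # how noisy each measurement is

    def update(self, measured_speed):
        # Predict: the speed is assumed unchanged, but uncertainty grows.
        self.var += self.process_var
        # Correct: move toward the measurement in proportion to the gain.
        gain = self.var / (self.var + self.measurement_var)
        self.speed += gain * (measured_speed - self.speed)
        self.var *= 1.0 - gain
        return self.speed
```

The filtered speed can then replace the windowed average when computing the remaining time.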

森罗 2024-12-15 08:05:09

Imho, bad implementations of ETC are wildly overused, which allows us to have a good laugh. Sometimes, it might be better to display facts instead of estimations, like:

  • 5 of 10 files have been copied
  • 10 of 200 MB have been copied

Or display facts and an estimation, and make clear that it is only an estimation. But I would not display only an estimation.

Every user knows that ETCs are often completely meaningless, and then it is hard to distinguish between meaningful ETCs and meaningless ETCs, especially for inexperienced users.

2024-12-15 08:05:09

I have implemented two different solutions to address this problem:

  1. The ETC for the current transfer at start time is based on a historic speed value. This value is refined after each transfer. During the transfer I compute a weighted average between the historic data and data from the current transfer, so that the closer to the end you are the more weight is given to actual data from the transfer.

  2. Instead of showing a single ETC, show a range of time. The idea is to compute the ETC from the last 'n' seconds or minutes (like your second idea). I keep track of the best and worst case averages and compute a range of possible ETCs. This is kind of confusing to show in a GUI, but okay to show in a command line app.
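
Both ideas can be sketched briefly. The details are assumptions: the answer does not say how the weight shifts toward current data, so the fraction completed is used as the weight here, and the range is taken from the fastest and slowest speeds among recent windows.

```python
def blended_speed(historic_speed, current_speed, fraction_done):
    """Shift weight from the historic speed to the current transfer's
    speed as the transfer progresses (fraction_done goes from 0 to 1)."""
    return (1.0 - fraction_done) * historic_speed + fraction_done * current_speed


def etc_range(now, remaining, window_speeds):
    """Return (earliest, latest) ETC from the best and worst recent speeds."""
    positive = [s for s in window_speeds if s > 0]
    if not positive:
        return None  # the copy was stalled in every recent window
    return now + remaining / max(positive), now + remaining / min(positive)
```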

不乱于心 2024-12-15 08:05:09

There are two things to consider here:

  • the exact estimation
  • how to present it to the user

1. On estimation

Other than a statistical approach, one simple way to get a good estimate of the current speed while smoothing out noise and spikes is to take a weighted approach.

You already experimented with the sliding window. The idea here is to take a fairly large sliding window but, instead of a plain average, to give more weight to the more recent measurements, since they are more indicative of the trend (a bit like a derivative).

Example: Suppose you have 10 previous windows (most recent x0, least recent x9), then you could compute the speed:

Speed = (10 * x0 + 9 * x1 + 8 * x2 + ... + 1 * x9) / (55 * window-time)

Once you have a good assessment of the likely speed, you are close to getting a good estimated time.
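
A sketch of this weighted average, assuming each x is the amount copied during one window of equal duration. The weighted sum is normalised by the sum of the weights (55 for ten windows) times the window duration, which yields a speed. The function name is illustrative.

```python
def weighted_speed(window_amounts, window_time):
    """Linearly weighted average speed.

    window_amounts[0] is the amount copied in the most recent window,
    window_amounts[-1] in the oldest; weights run n, n - 1, ..., 1.
    """
    n = len(window_amounts)
    if n == 0 or window_time <= 0:
        raise ValueError("need at least one window and a positive window_time")
    weights = range(n, 0, -1)
    weighted_total = sum(w * x for w, x in zip(weights, window_amounts))
    return weighted_total / (sum(weights) * window_time)
```

With a constant amount per window the result equals the plain average, which is a quick sanity check on the normalisation.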

2. On presentation

The main thing to remember here is that you want a nice user experience, not scientific precision.

Studies have shown that users react very badly to slow-downs and very positively to speed-ups. Therefore, a good progress bar / estimated time should at first be conservative in the estimates it presents (reserving time for a potential slow-down).

A simple way to get that is to have a factor that depends on the completion fraction and that you use to tweak the estimated remaining time. For example:

real-completion = 0.4
presented-completion = real-completion * factor(real-completion)

Here factor must satisfy factor([0..1]) = [0..1], factor(x) <= x and factor(1) = 1. For example, the cubic function x^3 produces a nice speed-up toward the completion time. Other functions could use an exponential form; note that 1 - e^x as written is negative on (0, 1], so it would need rescaling, for example to (e^x - 1) / (e - 1).
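
A small sketch of the presentation tweak with a cubic factor, following the two pseudocode lines above. The function names are hypothetical.

```python
def cubic_factor(x):
    """Maps [0, 1] onto [0, 1], with cubic_factor(x) <= x and cubic_factor(1) == 1."""
    return x ** 3


def presented_completion(real_completion, factor=cubic_factor):
    """Understate progress early on so the display appears to speed up later."""
    if not 0.0 <= real_completion <= 1.0:
        raise ValueError("real_completion must be between 0 and 1")
    return real_completion * factor(real_completion)
```

The presented value never exceeds the real one and reaches 1 exactly when the copy finishes, so the display stays conservative until the end.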
