Tuning the parameters of a perceptron learning algorithm
I'm having sort of an issue trying to figure out how to tune the parameters for my perceptron algorithm so that it performs relatively well on unseen data.
I've implemented a verified working perceptron algorithm and I'd like to figure out a method by which I can tune the number of iterations and the learning rate of the perceptron. These are the two parameters I'm interested in.
I know that the learning rate of the perceptron doesn't affect whether or not the algorithm converges and completes. What I'm trying to grasp is how to choose n: too high and the weights will swing around a lot, too low and training will take longer.
As for the number of iterations, I'm not entirely sure how to determine an ideal number.
In any case, any help would be appreciated. Thanks.
Comments (3)
Start with a small number of iterations (it's actually more conventional to count 'epochs' rather than iterations--'epochs' refers to the number of iterations through the entire data set used to train the network). By 'small' let's say something like 50 epochs. The reason for this is that you want to see how the total error is changing with each additional training cycle (epoch)--hopefully it's going down (more on 'total error' below).
Obviously you are interested in the point (the number of epochs) where the next additional epoch does not cause a further decrease in total error. So begin with a small number of epochs so you can approach that point by increasing the epochs.
The learning rate you begin with should not be too fine or too coarse (obviously subjective, but hopefully you have a rough sense of what counts as a large versus a small learning rate).
Next, insert a few lines of testing code in your perceptron--really just a few well-placed 'print' statements. For each iteration, calculate and show the delta (the actual value for each data point in the training data minus the predicted value), then sum the individual delta values over all points (data rows) in the training data (I usually take the absolute value of each delta, or you can take the square root of the sum of the squared differences--it doesn't matter too much). Call that summed value "total error"--just to be clear, this is the total error (the sum of the error across all nodes) per epoch.
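For illustration, here is a minimal sketch of what such an instrumented training loop might look like, assuming a plain NumPy implementation with labels in {-1, +1}; the function name train_perceptron and the variable names are just illustrative, not your original code:

```python
import numpy as np

def train_perceptron(X, y, learning_rate=0.1, epochs=50):
    """Train a simple perceptron, printing the total error after each epoch.
    X is an (n_samples, n_features) array, y holds labels in {-1, +1}."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=X.shape[1])   # weight vector
    b = 0.0                                       # bias term
    errors_per_epoch = []
    for epoch in range(epochs):
        total_error = 0.0
        for x_i, target in zip(X, y):
            prediction = 1 if np.dot(w, x_i) + b >= 0 else -1
            delta = target - prediction            # actual minus predicted
            w += learning_rate * delta * x_i       # standard perceptron update
            b += learning_rate * delta
            total_error += abs(delta)              # sum of absolute deltas
        errors_per_epoch.append(total_error)
        print(f"epoch {epoch + 1}: total error = {total_error}")
    return w, b, errors_per_epoch
```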
Then, plot the total error as a function of epoch number (i.e., epoch number on the x axis, total error on the y axis). Initially, of course, you'll see the data points in the upper left-hand corner trending down and to the right, with a decreasing slope.
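A quick way to produce that plot, assuming matplotlib is available (the error values below are made-up placeholders just so the snippet runs on its own):

```python
import matplotlib.pyplot as plt

# Hypothetical per-epoch total-error values, e.g. as returned by the
# train_perceptron sketch above; replace with your own measurements.
errors_per_epoch = [14, 9, 6, 4, 3, 2, 2, 1, 1, 1]

plt.plot(range(1, len(errors_per_epoch) + 1), errors_per_epoch, marker="o")
plt.xlabel("epoch number")
plt.ylabel("total error")
plt.show()
```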
Let the algorithm train the network against the training data. Increase the epochs (by, e.g., 10 per run) until you see the curve (total error versus epoch number) flatten--i.e., additional iterations don't cause a decrease in total error.
So the slope of that curve is important and so is its vertical position--i.e., how much total error you have and whether it continues to trend downward with more training cycles (epochs). If, after increasing epochs, you eventually notice an increase in error, start again with a lower learning rate.
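One way to automate that "increase the epochs until the curve flattens" procedure might look like the rough sketch below; it reuses the hypothetical train_perceptron function from above and assumes X_train / y_train arrays are already loaded:

```python
step, max_epochs = 10, 500
prev_error = float("inf")
for epochs in range(step, max_epochs + 1, step):
    # retrain from the same starting point with a larger epoch budget
    _, _, errors = train_perceptron(X_train, y_train, learning_rate=0.1, epochs=epochs)
    if errors[-1] >= prev_error:          # curve has flattened (or error went up)
        print(f"total error stops improving at roughly {epochs - step} epochs")
        break
    prev_error = errors[-1]
```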
The learning rate (usually a fraction between about 0.01 and 0.2) will certainly affect how quickly the network is trained--i.e., it can move you to the local minimum more quickly. It can also cause you to jump over it. So code a loop that trains a network, let's say five separate times, using a fixed number of epochs (and the same starting point) each time, but varying the learning rate from, e.g., 0.05 to 0.2, increasing it by 0.05 each time.
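That sweep could be sketched along these lines, again assuming the hypothetical train_perceptron function and data arrays from earlier:

```python
for lr in (0.05, 0.10, 0.15, 0.20):
    # fixed epoch budget and the same starting point (fixed seed inside train_perceptron)
    _, _, errors = train_perceptron(X_train, y_train, learning_rate=lr, epochs=50)
    print(f"learning rate {lr:.2f}: final total error = {errors[-1]}")
```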
One more parameter is important here (though not strictly necessary), 'momentum'. As the name suggests, using a momentum term will help you get an adequately trained network more quickly. In essence, momentum is a multiplier on the learning rate--as long as the error rate is decreasing, the momentum term accelerates the progress. The intuition behind the momentum term is 'as long as you are traveling toward the destination, increase your velocity'. Typical values for the momentum term are 0.1 or 0.2. In the training scheme above, you should probably hold momentum constant while varying the learning rate.
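In the common formulation, momentum carries over a fraction of the previous weight update rather than literally multiplying the learning rate; a minimal sketch of that version (names and setup are again just illustrative) is:

```python
import numpy as np

def train_with_momentum(X, y, learning_rate=0.1, momentum=0.1, epochs=50):
    """Perceptron-style training where each update also adds `momentum` times
    the previous update, accelerating progress while updates keep agreeing."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=X.shape[1])
    b = 0.0
    v_w = np.zeros_like(w)   # previous weight update ("velocity")
    v_b = 0.0
    for _ in range(epochs):
        for x_i, target in zip(X, y):
            prediction = 1 if np.dot(w, x_i) + b >= 0 else -1
            delta = target - prediction
            v_w = momentum * v_w + learning_rate * delta * x_i
            v_b = momentum * v_b + learning_rate * delta
            w += v_w
            b += v_b
    return w, b
```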
About the learning rate not affecting whether or not the perceptron converges - that's not true. If you choose a learning rate that is too high, you will probably get a divergent network. If you change the learning rate during learning and it drops too fast (i.e., faster than 1/t), you can also get a network that never converges (that's because the sum of n(t) over t from 1 to infinity is then finite, which means the weight vector can only change by a finite amount).
Theoretically, it can be shown for simple cases that changing n (the learning rate) according to 1/t (where t is the number of presented examples) should work well, but I actually found that in practice the best way to do this is to find a good high n value (the highest value that doesn't make your learning diverge) and a good low n value (this one is trickier to figure out; it really depends on the data and the problem), and then let n change linearly over time from the high n to the low n.
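A linear schedule of that kind could be sketched as follows (the n_high and n_low values here are arbitrary placeholders; as noted above, suitable values depend on your data):

```python
def linear_learning_rate(t, t_max, n_high=0.5, n_low=0.01):
    """Interpolate the learning rate linearly from n_high down to n_low as
    t runs from 0 to t_max (t can count epochs or presented examples)."""
    frac = min(t / t_max, 1.0)
    return n_high + frac * (n_low - n_high)

# e.g. over 100 epochs:
for epoch in range(100):
    lr = linear_learning_rate(epoch, t_max=99)
    # ...run one training pass over the data with learning rate `lr`...
```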
The learning rate depends on the typical values of the data; there is no general rule of thumb. Feature scaling is a method used to standardize the range of independent variables or features of the data. In data processing it is also known as data normalization, and it is generally performed during the data preprocessing step.
Normalizing the data to zero mean and unit variance, to the range 0-1, or to any other standard form can help in selecting a value for the learning rate. As doug mentioned, a learning rate between 0.05 and 0.2 generally works well.
This will also help the algorithm converge faster.
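For reference, the two scalings mentioned above might be written like this (a small sketch assuming the features are stored in an (n_samples, n_features) NumPy array; the function names are just illustrative):

```python
import numpy as np

def standardize(X):
    """Zero-mean, unit-variance scaling of each feature (column)."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def min_max_scale(X):
    """Rescale each feature (column) into the [0, 1] range."""
    X_min, X_max = X.min(axis=0), X.max(axis=0)
    return (X - X_min) / (X_max - X_min)
```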
Source: Juszczak, P.; Tax, D. M. J.; Duin, R. P. W. (2002). "Feature scaling in support vector data descriptions". Proc. 8th Annu. Conf. Adv. School Comput. Imaging: 95–10.