Why don't we train RNNs with x = data[0:n] and y = data[n+1] (but rather x = data[0:n] and y = data[1:n+1])?

Posted 2025-02-04 16:19:29

This is a general, and I think very basic/elementary, question about how to set up recurrent neural networks. For the sake of concreteness, let's assume we're training an autoregressive language model that tries to predict the next character in some text.

When I look at existing implementations that train an RNN, what I usually find is that the data fed to these RNNs are snippets of a certain length where the input and output are of the same shape, but shifted relative to each other by one sample (such that the predictor x is data[0:n] and the predictee y is data[1:n+1]).
At the end of such an RNN, we'd usually find some sort of mapping (+softmax) from the hidden state h_t to the number of classes (here: characters). Usually, this seems to produce a class label for each of the input time samples, i.e., we get a class label prediction for each output sample. In other words, we predict y[0] from predictor[0] using the hidden state after it has seen x[0] (h_t=0), we predict y[1] from x[1] using (h_t=1), and so on and so forth.
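
For concreteness, here is a minimal sketch of the setup I keep seeing, written in PyTorch with made-up sizes and layer names (everything here is illustrative, not taken from any particular implementation):

import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, hidden_size, seq_len, batch = 65, 128, 32, 16
embed = nn.Embedding(vocab_size, hidden_size)
rnn   = nn.GRU(hidden_size, hidden_size, batch_first=True)
head  = nn.Linear(hidden_size, vocab_size)                   # hidden state -> class logits

data = torch.randint(0, vocab_size, (batch, seq_len + 1))    # stand-in for real character ids
x, y = data[:, :-1], data[:, 1:]                             # x = data[0:n], y = data[1:n+1]

out, h_n = rnn(embed(x))                                     # out[:, t, :] is h_t for every timestep
logits = head(out)                                           # one prediction per input sample
loss = F.cross_entropy(logits.reshape(-1, vocab_size), y.reshape(-1))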

I find this surprising, because I thought the whole point of RNNs was to "build up" / "develop" a hidden state over a certain amount of time. I.e., in a case like the one described above, I would expect that we should only generate a single prediction for data[n+1] (i.e. y[n]), using h_t=n (where h thus integrates, and exploits, the entire history length as defined by the data loader).
If a hidden state is initialized randomly and then only sees a single data sample, I would naively assume it should perform a lot worse than when it has seen n data points. If the mapping from hidden state to output is trained with such suboptimal "premature" states, shouldn't it converge on some compromise between compensating for the premature states and ideally exploiting the information in a "mature" hidden state?
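
To make the alternative I have in mind explicit, the same made-up model could instead be trained to produce a single prediction from the final hidden state only:

import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, hidden_size, seq_len, batch = 65, 128, 32, 16
embed = nn.Embedding(vocab_size, hidden_size)
rnn   = nn.GRU(hidden_size, hidden_size, batch_first=True)
head  = nn.Linear(hidden_size, vocab_size)

data = torch.randint(0, vocab_size, (batch, seq_len + 1))
x, y = data[:, :-1], data[:, 1:]                             # same shifted pair as before

out, h_n = rnn(embed(x))
logits_last = head(out[:, -1, :])                            # use only the "mature" state h_t=n
loss = F.cross_entropy(logits_last, y[:, -1])                # one target per sequence: data[n]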

However, usually, when I try to set an RNN up in the way I thought should make more sense, it doesn't work as well (I reach worse validation losses). This could be for several reasons (given that I am not providing my code), but can someone maybe enlighten me as to whether there is a good theoretical motivation for why RNNs (and transformers) aren't usually fed with pairs of predictor=data[0:n] and predictee=data[n+1]?

Comments (1)

一百个冬季 2025-02-11 16:19:29

This might not be the perfect explanation, and I suggest you do some reading on RNNs and their inner workings, but I'm going to take a shot at explaining it.

The point of the RNN is to be able to predict the next output at every timepoint, which is how it builds up its state. By factoring the joint probability distribution into conditionals on prior outputs, you want to learn to predict y_t via P(y_t | y_(t-1), y_(t-2), ...). If all you do is feed in an entire timeseries to predict a final output value, then you might as well use a fully connected layer.
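
Written out, that factorization is just the chain rule over the whole sequence, and the per-timestep outputs are exactly the individual factors the model is asked to learn:

P(y_1, y_2, ..., y_n) = P(y_1) * P(y_2 | y_1) * P(y_3 | y_1, y_2) * ... * P(y_n | y_1, ..., y_(n-1))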

Another way to think about it is that the model is training in the way you suggest, but several times over, until the length of the sequence is n.

First, the model predicts the first word t_1 based on some input t_0. t_1 is then added to the input, and t_0 and t_1 are used to predict t_2. We repeat this process until we predict the final word t_n. So we get a growing sequence like below:

t_0           -> t_1
t_0, t_1      -> t_2
t_0, t_1, t_2 -> t_3
...
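
A minimal sketch of that naive version, assuming a PyTorch character model with made-up sizes (the point is just that the whole prefix is re-encoded on every step):

import torch
import torch.nn as nn

vocab_size, hidden_size = 65, 128
embed = nn.Embedding(vocab_size, hidden_size)
rnn   = nn.GRU(hidden_size, hidden_size, batch_first=True)
head  = nn.Linear(hidden_size, vocab_size)

seq = torch.tensor([[0]])                                    # t_0 (arbitrary start id)
for _ in range(8):
    out, _ = rnn(embed(seq))                                 # re-computes the entire growing prefix
    next_tok = head(out[:, -1, :]).argmax(dim=-1, keepdim=True)
    seq = torch.cat([seq, next_tok], dim=1)                  # t_0, t_1 -> t_2, and so on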

Obviously this would be quite computationally expensive, since we are re-computing the growing sequence n times, and we also have no record of how we computed the prior outputs. So we carry a state parameter with us. This parameter encodes the computations that have taken place up until timepoint t. So the above chain of computations instead becomes:

s_0 = 0
s_0, t_0  ->  s_1, t_1
s_1, t_1  ->  s_2, t_2
s_2, t_2  ->  s_3, t_3
...
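
The same loop with a carried state only ever has to process the newest token; here is a sketch under the same made-up setup:

import torch
import torch.nn as nn

vocab_size, hidden_size = 65, 128
embed = nn.Embedding(vocab_size, hidden_size)
cell  = nn.GRUCell(hidden_size, hidden_size)
head  = nn.Linear(hidden_size, vocab_size)

s = torch.zeros(1, hidden_size)                              # s_0 = 0
t = torch.tensor([0])                                        # t_0
for _ in range(8):
    s = cell(embed(t), s)                                    # s_t, t_t -> s_(t+1)
    t = head(s).argmax(dim=-1)                               # ... -> t_(t+1)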

If a hidden state is initialized randomly and then only sees a single data sample, I would naively assume it should perform a lot worse than when it has seen n data points

This would indeed be the case if all you provide your model is a randomized hidden state and some input, let's say the letter a. Then your model will just output some random value. Similarly, if you always initialize your hidden state to 0, then your model will act as a fancy lookup table and simply choose whatever value most commonly follows a in your training data.

However, generally for language models you are already conditioning/prompting your model on some input. E.g., in machine translation that might be a sentence; in image caption generation it would be encoded image features.

This brings us back to the idea that we are actually predicting data[n+1] from data[0:n], as you suggest, and then repeatedly appending the output onto the input to predict the next output.

So why doesn't your way of training work? I think it is because you are not allowing the model to learn the entire joint probability space. You are simply saying "if I have this sequence of length n, I want to predict this singular output y", rather than allowing the model to learn the chain of predictions that eventually leads to y.
