I have a number of multivariate time series that are produced by the same kind of process but:
- are of significantly different lengths;
- each time series is an independent instance, and the measurements are taken at different, quite random timestamps;
- each time series is related at every timestamp to two targets.
In other words:
- each time series has a shape of (n_timestamps, n_features);
- each target series has a shape of (n_timestamps, 2).
To give an example, this could be treated as stocks of different companies, each described by a few different features, where the targets at a given timestamp are the probabilities that the final price at the end of the year will be higher than x, except that we learn them directly from magically given ground-truth probabilities (instead of from observed 0/1 responses).
I want to be able to predict the target at each time point and I wanted to give RNNs a try. However, I'm having issues with figuring out how I should arrange the data before passing it to Keras LSTM layers. The main things I'm wondering about are:
- I want my RNN to use data starting from the beginning of the series to make the prediction at time t, not only the last k timestamps. I can't really use the whole history directly without the gradient exploding (it's too long), therefore I need a way to "remember" the previously learned weights even though in reality my RNN will only loop over the last k timestamps.
- Each time series has a different length, so I'm unsure how to make things compatible with each other. I'm aware of padding as an option, but since the difference in example lengths can be as significant as 1000 vs. 3000, this will result in many training examples that consist only of the padding value.
- Since measurements are taken at different timestamps, I believe this may affect my network in the sense that it can't really learn that, e.g., the last 10 timestamps are the most important. Or even if it can, those last 10 timestamps will in reality span a different amount of time for each input time series... How big a problem is this? Should I start by resampling all examples to the same time points (e.g. by interpolating)?
My current thinking is that:
- I can pad each of my example sequences to the same length (max(n_timestamps))
- Create batches of short sequences of length k, where k represents the length of the loop of the RNN layer. In consequence, assuming I have 200 example sequences where the longest one has 3000 timestamps and my selected k is 50, this would result in 3000/50 = 60 batches of shape (200, 50). Or should I make 3000-1 batches where one batch differs from the next only by one timestamp (i.e. while the first batch has timestamps 1 to 50, the next batch has timestamps 2 to 51, etc.)?
- Since padding was used, I would need to use a Masking layer. Some (quite many) of the rows in the prepared batches would consist of inputs that should be ignored completely (as they would contain only the padding value for all 50 elements).
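To make the shapes concrete, here is a NumPy-only sketch of this chunking (the sizes and the zero padding value are just for illustration):

```python
import numpy as np

n_examples, max_len, n_features, k = 200, 3000, 4, 50

# Hypothetical data, already zero-padded to the longest sequence.
padded = np.zeros((n_examples, max_len, n_features), dtype="float32")

# Cut the time axis into non-overlapping chunks of length k...
chunks = padded.reshape(n_examples, max_len // k, k, n_features)
# ...and reorder so the first axis enumerates the 3000/50 = 60 batches.
batches = np.swapaxes(chunks, 0, 1)
print(batches.shape)  # (60, 200, 50, 4)
```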
Is this the correct way to prepare the data for my problem? Can it be done better, so as not to introduce bottlenecks such as learning from examples that contain only the padding value (which should be ignored by the Masking layer)? Or how else can I prepare the data to address points 1, 2 and 3 described above?
Okay, this is pretty standard so far.
Check and make sure you actually need this. An RNN (or a Transformer) could use any or all of the history that you give it. But that's assuming the history is useful for the predictions you're making.
I'd try training on standard-sized random clips of the data (like in this tutorial). I'd retrain it a few times with longer and longer clips and see if the model performance plateaus before I run out of memory.
But in Keras it is relatively simple to do exactly the thing you're asking.
Keras RNNs (LSTM, GRU) have the return_state argument. It allows you to run the model over part of a sequence, pause, execute a training step, and then continue running exactly where you left off. (The stateful argument is another mechanism to provide that effect.) The code ends up looking something like this:
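A minimal sketch of the stateful-chunking pattern (assuming TensorFlow 2.x; the sizes, chunk length k, and padding value 0.0 are illustrative assumptions, not recommendations):

```python
import numpy as np
import tensorflow as tf

# Illustrative sizes: 4 sequences, 100 timestamps, 3 features, chunks of k=25.
n_examples, n_timestamps, n_features, k = 4, 100, 3, 25

lstm = tf.keras.layers.LSTM(8, return_sequences=True, stateful=True)
model = tf.keras.Sequential([
    tf.keras.Input(batch_shape=(n_examples, k, n_features)),
    tf.keras.layers.Masking(mask_value=0.0),  # skip all-padding timesteps
    lstm,
    tf.keras.layers.Dense(2),                 # two targets per timestamp
])
model.compile(optimizer="adam", loss="mse")

x = np.random.rand(n_examples, n_timestamps, n_features).astype("float32")
y = np.random.rand(n_examples, n_timestamps, 2).astype("float32")

lstm.reset_states()  # clear state at the start of each full pass over the data
for start in range(0, n_timestamps, k):
    # Each call backpropagates through only k steps, but the LSTM state
    # carries over from the previous chunk, so the model continues the
    # sequence where the last chunk left off.
    model.train_on_batch(x[:, start:start + k], y[:, start:start + k])
```

Each train_on_batch call truncates backpropagation to k steps while the layer state persists across chunks; reset_states() should be called whenever you start a new set of sequences.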
It may also be possible to use ForwardAccumulator to collect the gradients. In that case you don't need to cut the sequences into chunks, because the memory used by the forward accumulator doesn't grow with sequence length. I've never tried it, so I don't have example code.
That might be okay, just inefficient. You can make batches of similar sequence lengths using Dataset.bucket_by_sequence_length.
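A minimal sketch of that bucketing (assuming TensorFlow 2.6+, where bucket_by_sequence_length is a Dataset method; the toy lengths and the bucket boundary are illustrative):

```python
import tensorflow as tf

# Toy dataset: three sequences of lengths 5, 40 and 100, with 3 features each.
lengths = [5, 40, 100]
ds = tf.data.Dataset.from_generator(
    lambda: (tf.ones((n, 3)) for n in lengths),
    output_signature=tf.TensorSpec(shape=(None, 3), dtype=tf.float32),
)

batched = ds.bucket_by_sequence_length(
    element_length_func=lambda seq: tf.shape(seq)[0],
    bucket_boundaries=[50],        # two buckets: length < 50 and length >= 50
    bucket_batch_sizes=[2, 1],     # one batch size per bucket
)

for batch in batched:
    print(batch.shape)  # e.g. (2, 40, 3) then (1, 100, 3)
```

By default each batch is padded only to the longest element it contains, so similar-length sequences waste far less computation on padding.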
Interpolating to a fixed rate might be a reasonable thing to try if it doesn't make your data too much longer. Just think carefully about making predictions on interpolated values: there's some data leaking back in time from a future measurement.
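For instance, a one-channel sketch of fixed-rate resampling with linear interpolation (NumPy; the timestamps and rate are made up):

```python
import numpy as np

t = np.array([0.0, 0.7, 2.5, 2.9, 6.0])  # irregular measurement times
x = np.array([1.0, 2.0, 0.5, 0.0, 3.0])  # one feature channel

t_fixed = np.arange(0.0, 6.5, 1.0)       # resample every 1 time unit: 0..6
x_fixed = np.interp(t_fixed, t, x)       # apply per feature channel
print(x_fixed.shape)  # (7,)
```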
Another approach would be to make the size of the time-step a feature. If each input is tagged with how long it's been since the last input the model can learn how to handle small or large steps.
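For example, a sketch of appending the time since the previous measurement as an extra input feature (NumPy; the timestamps are made up):

```python
import numpy as np

timestamps = np.array([0.0, 0.7, 2.5, 2.9, 6.0])  # irregular sample times
features = np.random.rand(len(timestamps), 3)     # 3 original features

# Time since the previous measurement (0 for the very first step).
deltas = np.diff(timestamps, prepend=timestamps[0])
augmented = np.concatenate([features, deltas[:, None]], axis=1)
print(augmented.shape)  # (5, 4): original features plus the step size
```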
Yes. Pad, or make clips of a fixed size.
That would line up with the code example I gave.
Either way is fine. But if you want to carry the state over from batch to batch (I'm skeptical that you actually need the carry over) then you need to do them chunk by chunk, not by single-stepping your window.
Yeah, that'll be wasted computation, but it won't hurt anything.