Using Spark with LSTM
I am still lost on combining Spark with deep learning models.
Say I have a (2D) time series that I want to use for, e.g., an LSTM model. I first convert it to a 3D array and then pass it to the model. This is normally done in memory with numpy.
But what happens when I manage my big file with Spark?
The solutions I've seen so far all work with Spark and then convert the data to a 3D numpy array at the end. But that solution pulls everything into memory... or am I thinking about this wrong?
A common Spark LSTM solution looks like this:
# create fake dataset
import random
import numpy as np
from keras import models
from keras import layers

data = []
for node in range(0, 100):
    for day in range(0, 100):
        data.append([str(node),
                     day,
                     random.randrange(15, 25, 1),
                     random.randrange(50, 100, 1),
                     random.randrange(1000, 1045, 1)])

df = spark.createDataFrame(data, ['Node', 'day', 'Temp', 'hum', 'press'])

# transform the data: one row per day, one column per node/measurement
df_trans = df.groupBy('day').pivot('Node').sum()
df_trans = df_trans.orderBy(['day'], ascending=True)

# make train/test data (day 70 goes to the test set rather than being dropped)
trainDF = df_trans[df_trans.day < 70]
testDF = df_trans[df_trans.day >= 70]

################## we lost the SPARK #############################
# create train/test array -- collect() pulls everything onto the driver
trainArray = np.array(trainDF.select(trainDF.columns).collect())
testArray = np.array(testDF.select(trainDF.columns).collect())

# drop the target column
xtrain = trainArray[:, 0:-1]
xtest = testArray[:, 0:-1]

# take the target column
ytrain = trainArray[:, -1:]
ytest = testArray[:, -1:]

# reshape 2D to 3D: (samples, timesteps, features)
xtrain = xtrain.reshape((xtrain.shape[0], 1, xtrain.shape[1]))
xtest = xtest.reshape((xtest.shape[0], 1, xtest.shape[1]))

# build the model
model = models.Sequential()
model.add(layers.LSTM(1, input_shape=(1, xtrain.shape[2])))
model.add(layers.Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')

# train the model (fit() returns a History object, not the loss itself)
history = model.fit(xtrain, ytrain, batch_size=10, epochs=100)
My problem with this is:
If my Spark data has millions of rows and thousands of columns, then when the "create train/test array" step calls collect(), it pulls the entire dataset onto the driver, which causes a memory overflow. Am I right?
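As a back-of-envelope check (the figures below are purely illustrative, not taken from any real dataset in the question), collecting a frame of that scale as float64 would need:

```python
# illustrative sizes only: 5 million rows x 2,000 float64 columns
rows, cols, bytes_per_value = 5_000_000, 2_000, 8
size_gib = rows * cols * bytes_per_value / 2**30
print(round(size_gib, 1))  # ~74.5 GiB on the driver, before any numpy copies
```

That is well beyond a typical driver's memory, which is the overflow the question describes.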
My question is:
Can Spark be used to train an LSTM model on big data, or is it not possible?
Is there some generator function that can solve this, like the Keras generator functions?
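On the generator question: the batching and 2D-to-3D reshape can be done one batch at a time in plain Python, so only a single batch is ever materialized on the driver. A minimal sketch follows; the factory argument and column names are hypothetical, and with Spark you would pass something like lambda: trainDF.toLocalIterator(), which streams rows to the driver instead of collecting them all:

```python
import numpy as np

def batch_generator(row_iter_factory, feature_cols, target_col, batch_size):
    """Yield (x, y) batches for Keras from a restartable row iterator.

    row_iter_factory: zero-argument callable returning an iterator of
    dict-like rows (e.g. lambda: trainDF.toLocalIterator() for Spark).
    Only one batch of rows is held in memory at a time.
    """
    while True:  # Keras consumes the generator indefinitely across epochs
        batch = []
        for row in row_iter_factory():
            batch.append([row[c] for c in feature_cols + [target_col]])
            if len(batch) == batch_size:
                arr = np.array(batch, dtype="float32")
                # split features/target and reshape to (samples, timesteps, features)
                x = arr[:, :-1].reshape((batch_size, 1, len(feature_cols)))
                y = arr[:, -1:]
                yield x, y
                batch = []
```

A model could then consume it with something like model.fit(gen, steps_per_epoch=n_rows // batch_size, epochs=100). This keeps the Spark-side work lazy, but the rows still stream through the single driver, so it trades memory for throughput rather than distributing the training itself.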
Comments (1)
Perhaps you have too many columns in your dataframe - why would you have hundreds of columns? Are you collecting that many data points for each timestamp? If so, then I would argue that you need to subset your data. In my experience, a time-series is driven largely by the timestamp - even a small number of data points stretched across a long collection of time provides enormous information. In other words, you have a dataset that is wide and tall, but it should perhaps be thin and tall instead.
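Following that advice, the pruning can happen before the data ever leaves Spark, so collect() only sees the thin frame. A small sketch of the column-selection step; the "0_sum(Temp)"-style names assume the naming Spark's pivot uses when several columns are aggregated, so treat them as an assumption to verify against df_trans.columns:

```python
def thin_columns(columns, wanted_nodes, keep_always=("day",)):
    """Return the subset of pivoted column names worth keeping.

    Keeps the always-needed columns (e.g. the timestamp) plus every
    column whose name starts with one of the wanted node prefixes.
    """
    return [c for c in columns
            if c in keep_always
            or any(c.startswith(str(n) + "_") for n in wanted_nodes)]

# usage with Spark (sketch, hypothetical node subset):
# df_thin = df_trans.select(thin_columns(df_trans.columns, ['0', '1', '2']))
```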