Using Spark with LSTM
I am still lost on combining Spark with deep learning models.
Say I have a (2D) time series that I want to use for, e.g., an LSTM model. I first convert it to a 3D array and then pass it to the model. This is normally done in memory with numpy.
But what happens when I manage my big file with Spark?
The solutions I've seen so far all work with Spark and then convert the data to a 3D numpy array at the end. But that solution pulls everything into memory... or am I thinking about this wrong?
A common Spark LSTM solution looks like this:
# create fake dataset
import random
import numpy as np
from keras import models
from keras import layers

data = []
for node in range(0, 100):
    for day in range(0, 100):
        data.append([str(node),
                     day,
                     random.randrange(15, 25, 1),
                     random.randrange(50, 100, 1),
                     random.randrange(1000, 1045, 1)])

df = spark.createDataFrame(data, ['Node', 'day', 'Temp', 'hum', 'press'])

# transform the data: one row per day, one column per node/measurement
df_trans = df.groupBy('day').pivot('Node').sum()
df_trans = df_trans.orderBy(['day'], ascending=True)

# make train/test data (day 70 goes to the test set rather than being dropped)
trainDF = df_trans[df_trans.day < 70]
testDF = df_trans[df_trans.day >= 70]

################## we lost the SPARK #############################
# create train/test array -- collect() pulls everything onto the driver
trainArray = np.array(trainDF.select(trainDF.columns).collect())
testArray = np.array(testDF.select(trainDF.columns).collect())

# drop the target column
xtrain = trainArray[:, 0:-1]
xtest = testArray[:, 0:-1]

# take the target column
ytrain = trainArray[:, -1:]
ytest = testArray[:, -1:]

# reshape 2D to 3D: (samples, timesteps, features)
xtrain = xtrain.reshape((xtrain.shape[0], 1, xtrain.shape[1]))
xtest = xtest.reshape((xtest.shape[0], 1, xtest.shape[1]))

# build the model
model = models.Sequential()
model.add(layers.LSTM(1, input_shape=(1, xtrain.shape[2])))
model.add(layers.Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')

# train the model (fit() returns a History object, not the loss itself)
history = model.fit(xtrain, ytrain, batch_size=10, epochs=100)
My problem with this is:
If my Spark data has millions of rows and thousands of columns, then when the "create train/test array" step calls collect(), it pulls the entire dataset onto the driver, which causes a memory overflow. Am I right?
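As a back-of-envelope check (the figures below are purely illustrative, not taken from any real dataset in the question), collecting a frame of that scale as float64 would need:

```python
# illustrative sizes only: 5 million rows x 2,000 float64 columns
rows, cols, bytes_per_value = 5_000_000, 2_000, 8
size_gib = rows * cols * bytes_per_value / 2**30
print(round(size_gib, 1))  # ~74.5 GiB on the driver, before any numpy copies
```

That is well beyond a typical driver's memory, which is the overflow the question describes.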
My question is:
Can Spark be used to train an LSTM model on big data, or is it not possible?
Is there some generator function that can solve this, like the Keras generator functions?
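On the generator question: the batching and 2D-to-3D reshape can be done one batch at a time in plain Python, so only a single batch is ever materialized on the driver. A minimal sketch follows; the factory argument and column names are hypothetical, and with Spark you would pass something like lambda: trainDF.toLocalIterator(), which streams rows to the driver instead of collecting them all:

```python
import numpy as np

def batch_generator(row_iter_factory, feature_cols, target_col, batch_size):
    """Yield (x, y) batches for Keras from a restartable row iterator.

    row_iter_factory: zero-argument callable returning an iterator of
    dict-like rows (e.g. lambda: trainDF.toLocalIterator() for Spark).
    Only one batch of rows is held in memory at a time.
    """
    while True:  # Keras consumes the generator indefinitely across epochs
        batch = []
        for row in row_iter_factory():
            batch.append([row[c] for c in feature_cols + [target_col]])
            if len(batch) == batch_size:
                arr = np.array(batch, dtype="float32")
                # split features/target and reshape to (samples, timesteps, features)
                x = arr[:, :-1].reshape((batch_size, 1, len(feature_cols)))
                y = arr[:, -1:]
                yield x, y
                batch = []
```

A model could then consume it with something like model.fit(gen, steps_per_epoch=n_rows // batch_size, epochs=100). This keeps the Spark-side work lazy, but the rows still stream through the single driver, so it trades memory for throughput rather than distributing the training itself.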
Comments (1)
Perhaps you have too many columns in your dataframe - why would you have hundreds of columns? Are you collecting that many data points for each timestamp? If so, then I would argue that you need to subset your data. In my experience, a time-series is driven largely by the timestamp - even a small number of data points stretched across a long collection of time provides enormous information. In other words, you have a dataset that is wide and tall, but it should perhaps be thin and tall instead.
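Following that advice, the pruning can happen before the data ever leaves Spark, so collect() only sees the thin frame. A small sketch of the column-selection step; the "0_sum(Temp)"-style names assume the naming Spark's pivot uses when several columns are aggregated, so treat them as an assumption to verify against df_trans.columns:

```python
def thin_columns(columns, wanted_nodes, keep_always=("day",)):
    """Return the subset of pivoted column names worth keeping.

    Keeps the always-needed columns (e.g. the timestamp) plus every
    column whose name starts with one of the wanted node prefixes.
    """
    return [c for c in columns
            if c in keep_always
            or any(c.startswith(str(n) + "_") for n in wanted_nodes)]

# usage with Spark (sketch, hypothetical node subset):
# df_thin = df_trans.select(thin_columns(df_trans.columns, ['0', '1', '2']))
```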