Incrementally loading a large arbitrary dataset
I'm training my Keras dense models on very large datasets.
For practical reasons, I am saving them on disk in separate .txt files. I have 1e4 text files, each containing 1e4 examples.
I would like to find a way to fit my Keras model on this dataset as a whole. For now, I am only able to use model.fit on individual text files, i.e.:
for k in range(10000):
    X = np.loadtxt('/path/X_'+str(k)+'.txt')
    Y = np.loadtxt('/path/Y_'+str(k)+'.txt')
    mod = model.fit(x=X, y=Y, batch_size=batch_size, epochs=epochs)
This is problematic if, for instance, I want to perform several epochs over the whole dataset.
Ideally, I would like to have a dataloader function that could be used in the following way to feed all the sub-datasets as a single one:
mod = model.fit(dataloader('/path/'), batch_size=batch_size, epochs=epochs)
I think I found what I want, but only for datasets composed of images: tf.keras.preprocessing.image.ImageDataGenerator.flow_from_directory
Is there any tf/keras function doing something similar, but for datasets which are not composed of images?
Thanks!
Comments (1)
You can create a generator function and then build a tf.data.Dataset from it with the from_generator method; see below for a dummy example:
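The original example was not included in the page, so the following is a minimal sketch of that idea. It assumes each X_k.txt file holds one example per row with n_features columns, each Y_k.txt file holds one scalar target per row, and that model, batch_size and epochs are defined as in the question; n_files and n_features are placeholder values you would adjust to your data, along with the shapes and dtypes in output_signature.

import numpy as np
import tensorflow as tf

# Assumed dimensions -- replace with the real shape of your data.
n_files = 10000      # number of X_k.txt / Y_k.txt file pairs
n_features = 32      # columns per row in the X files (assumption)

def dataloader(path):
    """Yield (x, y) examples one at a time, streaming file after file."""
    for k in range(n_files):
        X = np.loadtxt(path + 'X_' + str(k) + '.txt')   # assumed shape (n_examples, n_features)
        Y = np.loadtxt(path + 'Y_' + str(k) + '.txt')   # assumed shape (n_examples,), scalar targets
        for x, y in zip(X, Y):
            # Cast to float32 so the yielded values match output_signature below.
            yield x.astype(np.float32), np.float32(y)

# Wrap the generator in a tf.data.Dataset so Keras can iterate over it.
dataset = tf.data.Dataset.from_generator(
    lambda: dataloader('/path/'),
    output_signature=(
        tf.TensorSpec(shape=(n_features,), dtype=tf.float32),
        tf.TensorSpec(shape=(), dtype=tf.float32),
    ),
)

# Batch and prefetch on the dataset itself; do not pass batch_size to model.fit
# when the input is an already-batched tf.data.Dataset.
dataset = dataset.batch(batch_size).prefetch(tf.data.AUTOTUNE)

mod = model.fit(dataset, epochs=epochs)

With this setup each epoch streams through all the files again, so running several epochs over the whole dataset works the same way as with in-memory arrays, while only one file pair is loaded at a time.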