Loading large arbitrary datasets incrementally

Posted on 2025-01-24 08:34:24


I'm training my Keras dense models on very large datasets.

For practical reasons, I save the data on disk in separate .txt files. I have 1e4 text files, each containing 1e4 examples.

I would like to find a way to fit my Keras model on this dataset as a whole. For now, I can only call model.fit on individual text files, i.e.:

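import numpy as np

# Each file pair is fitted separately, so `epochs` repeats within one file only.
for k in range(10000):
    X = np.loadtxt('/path/X_' + str(k) + '.txt')
    Y = np.loadtxt('/path/Y_' + str(k) + '.txt')
    mod = model.fit(x=X, y=Y, batch_size=batch_size, epochs=epochs)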

This is a problem if, for instance, I want to run several epochs over the whole dataset.

Ideally, I would like a data-loader function that feeds all the sub-datasets to the model as a single dataset, used like this:

mod = model.fit(dataloader('/path/'), batch_size=batch_size, epochs=epochs)

I think I found what I want, but only for image datasets: tf.keras.preprocessing.image.ImageDataGenerator.flow_from_directory

Is there a tf/keras function that does something similar for datasets that are not composed of images?

Thanks!


如何视而不见 2025-01-31 08:34:24

You can write a generator function and pass it to the from_generator method of TensorFlow's Dataset class to build a dataset. A dummy example:

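import numpy as np
import tensorflow as tf

def mygenerator():
    # Dummy data: each yielded (x, y) pair is one dataset element.
    for k in range(1000):
        x = np.random.normal(size=1000)
        y = np.random.randint(low=0, high=5, size=1000)
        yield x, y

# output_signature declares the shape and dtype of each yielded element.
signature = (tf.TensorSpec(shape=(1000,), dtype=tf.float32),
             tf.TensorSpec(shape=(1000,), dtype=tf.int32))
mydataset = tf.data.Dataset.from_generator(mygenerator, output_signature=signature)
mytraindata = mydataset.batch(batch_size)  # batch_size as defined in the question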
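
To apply the same idea to the files from the question, the generator can load one X_k.txt / Y_k.txt pair at a time and yield its rows one by one, so only a single file is in memory at any moment. The following is a minimal sketch, not a tested pipeline. It assumes each row of an X file is one example with num_features columns and each Y file holds one scalar label per line. The names file_example_generator, make_dataset, data_dir, num_files and num_features are hypothetical and do not come from the question.

```python
import os

import numpy as np


def file_example_generator(data_dir, num_files):
    """Yield (x, y) examples one at a time, loading one file pair at a time."""
    for k in range(num_files):
        # File naming follows the question: X_<k>.txt and Y_<k>.txt.
        # Assumption: one example per row in X, one scalar label per line in Y.
        X = np.loadtxt(os.path.join(data_dir, f"X_{k}.txt"), ndmin=2)
        Y = np.loadtxt(os.path.join(data_dir, f"Y_{k}.txt"), ndmin=1)
        for x, y in zip(X, Y):
            yield x, y


def make_dataset(data_dir, num_files, num_features, batch_size):
    """Wrap the generator in a batched tf.data.Dataset."""
    # Imported here so the generator above can be used without TensorFlow.
    import tensorflow as tf

    dataset = tf.data.Dataset.from_generator(
        lambda: file_example_generator(data_dir, num_files),
        output_signature=(
            tf.TensorSpec(shape=(num_features,), dtype=tf.float32),
            tf.TensorSpec(shape=(), dtype=tf.float32),  # assumed scalar label
        ),
    )
    # Batch here; prefetch lets file loading overlap with training.
    return dataset.batch(batch_size).prefetch(tf.data.AUTOTUNE)
```

With this in place, the call from the question would look roughly like model.fit(make_dataset('/path/', 10000, num_features, batch_size), epochs=epochs). Batching is done on the dataset, so batch_size should not also be passed to model.fit. Keras iterates the dataset again at each epoch, which restarts the generator and gives several passes over the whole set of files.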
