TensorFlow text generator
I am trying to train a TF model on a large amount of text, and unfortunately if I just call model.fit() directly I run out of RAM very quickly. Could someone help me make it use less memory (e.g. by using a generator instead)? Below is my code.
from numpy import array
from pickle import dump
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text
# load
in_filename = 'char_sequences.txt'
raw_text = load_doc(in_filename)
lines = raw_text.split('\n')
# integer encode sequences of characters
chars = sorted(list(set(raw_text)))
mapping = dict((c, i) for i, c in enumerate(chars))
sequences = list()
for line in lines:
    # integer encode line
    encoded_seq = [mapping[char] for char in line]
    # store
    sequences.append(encoded_seq)
# vocabulary size
vocab_size = len(mapping)
print('Vocabulary Size: %d' % vocab_size)
# separate into input and output
sequences = array(sequences)
print(1)
X, y = sequences[:,:-1], sequences[:,-1]
print(2)
sequences = [to_categorical(x, num_classes=vocab_size) for x in X]
print(3)
X = array(sequences)
print(4)
y = to_categorical(y, num_classes=vocab_size)
print(5)
# define model
model = Sequential()
model.add(LSTM(75, input_shape=(X.shape[1], X.shape[2])))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
# compile model
print(6)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
epochs = int(input('Num of epochs:'))
model.fit(X, y, epochs=epochs, verbose=2)
# save the model to file
model.save('model.h5')
# save the mapping
dump(mapping, open('mapping.pkl', 'wb'))
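For what it's worth, this is roughly the kind of generator I had in mind in place of building the full one-hot X up front, though I'm not sure it is the right approach: a keras.utils.Sequence that one-hot encodes one batch at a time. The class name and batch size below are just placeholders, and I assume the model's input_shape would have to become (seq_length, vocab_size) instead of being read from X:

import numpy as np
from tensorflow.keras.utils import Sequence, to_categorical

class OneHotBatchGenerator(Sequence):
    # yields one-hot encoded batches so the full dataset is never expanded in RAM
    def __init__(self, encoded_sequences, vocab_size, batch_size=128):
        # encoded_sequences: list of equal-length lists of integer character codes
        self.data = np.array(encoded_sequences, dtype=np.int32)
        self.vocab_size = vocab_size
        self.batch_size = batch_size

    def __len__(self):
        # number of batches per epoch
        return int(np.ceil(len(self.data) / self.batch_size))

    def __getitem__(self, idx):
        batch = self.data[idx * self.batch_size:(idx + 1) * self.batch_size]
        X = to_categorical(batch[:, :-1], num_classes=self.vocab_size)
        y = to_categorical(batch[:, -1], num_classes=self.vocab_size)
        return X, y

# intended usage, replacing the to_categorical block and model.fit(X, y, ...) above
# (here the integer-encoded list is what my script calls `sequences` before it is overwritten):
# generator = OneHotBatchGenerator(sequences, vocab_size)
# model.fit(generator, epochs=epochs, verbose=2)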
char_sequences.txt is a text file that looks like this:
that could
hat could b
at could be
t could be
could be a
could be an
ould be any
uld be anyo
ld be anyon
d be anyone
be anyone
be anyone W
e anyone Wh
anyone Wha
anyone What
nyone What
yone What r
one What r
ne What r u
e What r u
What r u d
What r u do
hat r u doi
at r u doin
t r u doing
r u doing
r u doing N
u doing Na
u doing Nah
doing Nahh
It is at debug point 3 that it eats RAM (>16GB). I cannot get it past this point, as my computer runs out of RAM and the kernel kills the process. char_sequences.txt is generated from a text file containing one text message per line:
NONONO
He's gonna wake up with this
Cuz that could be anyone
What r u doing
Nahh we need to see the name of the person
That's it
and is approximately 20K lines long.
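The generation step is essentially a sliding window over each message, something along these lines (a simplified sketch, not my exact script; 'messages.txt' is a placeholder name and the window length of 11 is my reading of the sample above):

# simplified sketch of how char_sequences.txt is produced: slide a window of
# `length` characters over each message and write one window per line
length = 11  # 10 input characters + 1 character to predict
with open('messages.txt') as src, open('char_sequences.txt', 'w') as dst:
    for line in src:
        line = line.rstrip('\n')
        for i in range(length, len(line) + 1):
            dst.write(line[i - length:i] + '\n')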
I am happy to provide more information as needed!
OS: Raspberry Pi OS Buster
PC: Raspberry Pi 4B 4G
RAM/Swap: 4GB/13GB of zram
TensorFlow version: 2.4.0
Python version: 3.7