TensorFlow text generator

I am trying to train a TF model on a large amount of text, and unfortunately if I use model.fit() directly I run out of RAM very quickly. Could someone help with making it use less RAM (e.g. by using a generator instead)? Below is my code.

from numpy import array
from pickle import dump
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM

# load doc into memory
def load_doc(filename):
    # open the file as read only and close it automatically when done
    with open(filename, 'r') as file:
        text = file.read()
    return text

# load
in_filename = 'char_sequences.txt'
raw_text = load_doc(in_filename)
lines = raw_text.split('\n')

# integer encode sequences of characters
chars = sorted(list(set(raw_text)))
mapping = dict((c, i) for i, c in enumerate(chars))
sequences = list()
for line in lines:
    # integer encode line
    encoded_seq = [mapping[char] for char in line]
    # store
    sequences.append(encoded_seq)

# vocabulary size
vocab_size = len(mapping)
print('Vocabulary Size: %d' % vocab_size)

# separate into input and output
sequences = array(sequences)
print(1)
X, y = sequences[:,:-1], sequences[:,-1]
print(2)
sequences = [to_categorical(x, num_classes=vocab_size) for x in X]
print(3)
X = array(sequences)
print(4)
y = to_categorical(y, num_classes=vocab_size)
print(5)
# define model
model = Sequential()
model.add(LSTM(75, input_shape=(X.shape[1], X.shape[2])))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
# compile model
print(6)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model

epochs = int(input('Num of epochs:'))

model.fit(X, y, epochs=epochs, verbose=2)

# save the model to file
model.save('model.h5')
# save the mapping
dump(mapping, open('mapping.pkl', 'wb'))

char_sequences.txt is a text file that looks like this:

that could 
hat could b
at could be
t could be 
 could be a
could be an
ould be any
uld be anyo
ld be anyon
d be anyone
 be anyone 
be anyone W
e anyone Wh
 anyone Wha
anyone What
nyone What 
yone What r
one What r 
ne What r u
e What r u 
 What r u d
What r u do
hat r u doi
at r u doin
t r u doing
 r u doing 
r u doing N
 u doing Na
u doing Nah
 doing Nahh

It is at debug point 3 that it eats RAM (>16GB). I cannot get it past this point as my computer runs out of RAM and the kernel kills the process. char_sequences.txt is generated from a text file containing one text message per line:

NONONO
He's gonna wake up with this
Cuz that could be anyone
What r u doing
Nahh we need to see the name of the person
That's it

and is approximately 20K lines long.
I am happy to provide more information as needed!

OS: Raspberry Pi OS Buster
PC: Raspberry Pi 4B 4G
RAM/Swap: 4GB/13GB of zram
Tensorflow version: 2.4.0
Python version: 3.7
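
To make the question more concrete: is something like the sketch below what is meant by "use a generator"? It is only a rough, minimal sketch of my own (the CharSequence name and batch_size=128 are placeholders I picked), and it assumes the integer-encoded sequences list from before the to_categorical step; the idea is to one-hot encode one batch at a time inside a keras.utils.Sequence rather than building the whole dataset up front.

import numpy as np
from tensorflow.keras.utils import Sequence, to_categorical

class CharSequence(Sequence):
    # yields one-hot encoded batches on the fly instead of building them all up front
    def __init__(self, encoded_seqs, vocab_size, batch_size=128):
        # encoded_seqs: the list of integer-encoded lines (all the same length)
        self.data = np.array(encoded_seqs)
        self.vocab_size = vocab_size
        self.batch_size = batch_size

    def __len__(self):
        # number of batches per epoch
        return int(np.ceil(len(self.data) / self.batch_size))

    def __getitem__(self, idx):
        batch = self.data[idx * self.batch_size:(idx + 1) * self.batch_size]
        # one-hot encode just this batch
        X = to_categorical(batch[:, :-1], num_classes=self.vocab_size)
        y = to_categorical(batch[:, -1], num_classes=self.vocab_size)
        return X, y

# usage in place of the big to_categorical/array step:
# train_gen = CharSequence(sequences, vocab_size)  # sequences = integer-encoded lines
# model.add(LSTM(75, input_shape=(len(sequences[0]) - 1, vocab_size)))
# model.fit(train_gen, epochs=epochs, verbose=2)

With batches of 128, only about 128 x 10 x vocab_size floats would be held in memory at once, so peak usage should stay small no matter how many sequences there are. Is this the right direction, or is there a better way?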
