TensorFlow text generator

I am trying to train a TF model on a large amount of text, and unfortunately if I use model.fit() directly I run out of RAM very quickly. Could someone help with making it use less RAM (e.g. by using a generator instead)? Below is my code.

from numpy import array
from pickle import dump
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM

# load doc into memory
def load_doc(filename):
    # open the file as read only and close it automatically when done
    with open(filename, 'r') as file:
        text = file.read()
    return text

# load
in_filename = 'char_sequences.txt'
raw_text = load_doc(in_filename)
lines = raw_text.split('\n')

# integer encode sequences of characters
chars = sorted(list(set(raw_text)))
mapping = dict((c, i) for i, c in enumerate(chars))
sequences = list()
for line in lines:
    # integer encode line
    encoded_seq = [mapping[char] for char in line]
    # store
    sequences.append(encoded_seq)

# vocabulary size
vocab_size = len(mapping)
print('Vocabulary Size: %d' % vocab_size)

# separate into input and output
sequences = array(sequences)
print(1)
X, y = sequences[:,:-1], sequences[:,-1]
print(2)
sequences = [to_categorical(x, num_classes=vocab_size) for x in X]
print(3)
X = array(sequences)
print(4)
y = to_categorical(y, num_classes=vocab_size)
print(5)
# define model
model = Sequential()
model.add(LSTM(75, input_shape=(X.shape[1], X.shape[2])))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
# compile model
print(6)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model

epochs = int(input('Num of epochs:'))

model.fit(X, y, epochs=epochs, verbose=2)

# save the model to file
model.save('model.h5')
# save the mapping
dump(mapping, open('mapping.pkl', 'wb'))

char_sequences.txt is a text file that looks like this:

that could 
hat could b
at could be
t could be 
 could be a
could be an
ould be any
uld be anyo
ld be anyon
d be anyone
 be anyone 
be anyone W
e anyone Wh
 anyone Wha
anyone What
nyone What 
yone What r
one What r 
ne What r u
e What r u 
 What r u d
What r u do
hat r u doi
at r u doin
t r u doing
 r u doing 
r u doing N
 u doing Na
u doing Nah
 doing Nahh

It is at debug point 3 that it eats RAM (>16GB). I cannot get it past this point as my computer runs out of RAM and the kernel kills the process. char_sequences.txt is generated from a text file containing one text message per line:

NONONO
He's gonna wake up with this
Cuz that could be anyone
What r u doing
Nahh we need to see the name of the person
That's it

and is approximately 20K lines long.
I am happy to provide more information as needed!

OS: Raspberry Pi OS Buster
PC: Raspberry Pi 4B 4G
RAM/Swap: 4GB/13GB of zram
Tensorflow version: 2.4.0
Python version: 3.7
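
To make the question more concrete: is something like the sketch below what is meant by "use a generator"? It is only a rough, minimal sketch of my own (the CharSequence name and batch_size=128 are placeholders I picked), and it assumes the integer-encoded sequences list from before the to_categorical step; the idea is to one-hot encode one batch at a time inside a keras.utils.Sequence rather than building the whole dataset up front.

import numpy as np
from tensorflow.keras.utils import Sequence, to_categorical

class CharSequence(Sequence):
    # yields one-hot encoded batches on the fly instead of building them all up front
    def __init__(self, encoded_seqs, vocab_size, batch_size=128):
        # encoded_seqs: the list of integer-encoded lines (all the same length)
        self.data = np.array(encoded_seqs)
        self.vocab_size = vocab_size
        self.batch_size = batch_size

    def __len__(self):
        # number of batches per epoch
        return int(np.ceil(len(self.data) / self.batch_size))

    def __getitem__(self, idx):
        batch = self.data[idx * self.batch_size:(idx + 1) * self.batch_size]
        # one-hot encode just this batch
        X = to_categorical(batch[:, :-1], num_classes=self.vocab_size)
        y = to_categorical(batch[:, -1], num_classes=self.vocab_size)
        return X, y

# usage in place of the big to_categorical/array step:
# train_gen = CharSequence(sequences, vocab_size)  # sequences = integer-encoded lines
# model.add(LSTM(75, input_shape=(len(sequences[0]) - 1, vocab_size)))
# model.fit(train_gen, epochs=epochs, verbose=2)

With batches of 128, only about 128 x 10 x vocab_size floats would be held in memory at once, so peak usage should stay small no matter how many sequences there are. Is this the right direction, or is there a better way?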
