How do I train a PyTorch model on a large data file when using a DataLoader?
I am using the PyTorch Dataset class and DataLoader to load data. The class and loader look like the following.
import json
import torch
from torch.utils.data import Dataset

class Dataset(Dataset):
    def __init__(self):
        self.input_and_label = json.load(open(path_to_large_json_file))  # Large file
        self.dataset_size = len(self.input_and_label)

    def __getitem__(self, index):
        # Convert to tensor
        input_data = torch.LongTensor(self.input_and_label[index][0])  # Input X
        label_data = torch.LongTensor(self.input_and_label[index][1])  # Label y
        return input_data, label_data

    def __len__(self):
        return self.dataset_size
And the iterator is generated like,
train_loader = torch.utils.data.DataLoader(
    Dataset(),
    # Batch size
    batch_size = 8,  # This is expected to be large, 8 is for trial -- didn't work
    shuffle = True,
    pin_memory = False  # True
)
The data file is a large (JSON) file. But I am getting a memory error:
<RuntimeError: CUDA out of memory. Tried to allocate... ... ... >
Note:
The content of the large JSON file is a list of numbers like,
[0, 1 , 0, 0,..... 4000 numbers] <-- this is the input_data
[0, 2, 2, ... 50 numbers ] <-- this is the label
So, probably batch size 8 (that means 8 such pairs), or 800 ... should not matter much
Can someone please help me: how can I get the iterator without loading the large file all at once? Any other solution is also welcome. Thank you very much for your support.
2 Answers
You get a CUDA OOM error; it is not related to the file itself being large, but to a single example being large. The JSON file loads correctly into RAM, but 8 examples cannot fit on your GPU (which is often the case for images/videos, especially at high resolution).

Solution: lower the batch size (even down to 1) and use gradient accumulation. It will run slowly, but the results will be as good as those of larger batches.
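For illustration, here is a minimal sketch of gradient accumulation (not code from the answer above). It assumes model, criterion, optimizer, device, and the train_loader from the question are already defined; accumulation_steps is a hypothetical hyperparameter:

accumulation_steps = 4  # effective batch = batch_size * accumulation_steps (assumed value)

optimizer.zero_grad()
for step, (input_data, label_data) in enumerate(train_loader):
    input_data = input_data.to(device)
    label_data = label_data.to(device)

    loss = criterion(model(input_data), label_data)

    # Scale the loss so the accumulated gradient matches one larger batch
    (loss / accumulation_steps).backward()

    # Update the weights only every accumulation_steps mini-batches
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

Dividing the loss by accumulation_steps keeps the accumulated gradient on the same scale as a single large-batch update.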
The primary approach I had used was the map-style Dataset(Dataset) class shown above. For large datasets, PyTorch provides an iterable alternative, IterableDataset. Use the following,

class CustomIterableDataset(IterableDataset):

For more information, read here.
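As a rough sketch of what such a class can look like (not the code from the linked reference), assuming the large JSON file has first been converted to a JSON-lines file with one [input, label] pair per line so it can be streamed lazily:

import json

import torch
from torch.utils.data import DataLoader, IterableDataset


class CustomIterableDataset(IterableDataset):
    def __init__(self, file_path):
        self.file_path = file_path  # JSON-lines file: one [input, label] pair per line

    def __iter__(self):
        # Stream the file line by line instead of loading it all into RAM
        with open(self.file_path) as f:
            for line in f:
                pair = json.loads(line)
                input_data = torch.LongTensor(pair[0])  # Input X
                label_data = torch.LongTensor(pair[1])  # Label y
                yield input_data, label_data


train_loader = DataLoader(
    CustomIterableDataset("data.jsonl"),  # hypothetical path to the converted file
    batch_size=8,
    # shuffle=True is not supported for an IterableDataset
)

Note that a DataLoader cannot shuffle an IterableDataset, and with num_workers > 0 each worker would read the whole file unless you shard the lines per worker using torch.utils.data.get_worker_info().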