How can I train a PyTorch model on a large data file when using a DataLoader?


I am using the PyTorch Dataset class and DataLoader to load data. The class and the loader look like the following.

import json

import torch
from torch.utils.data import Dataset


class Dataset(Dataset):  # note: this name shadows torch.utils.data.Dataset
    def __init__(self):
        with open(path_to_large_json_file) as f:  # Large file
            self.input_and_label = json.load(f)
        self.dataset_size = len(self.input_and_label)

    def __getitem__(self, index):
        # Convert to tensors
        input_data = torch.LongTensor(self.input_and_label[index][0])  # Input X
        label_data = torch.LongTensor(self.input_and_label[index][1])  # Label y

        return input_data, label_data

    def __len__(self):
        return self.dataset_size

And the DataLoader (the iterator) is created like this,

train_loader = torch.utils.data.DataLoader(
    Dataset(),
    # Batch size
    batch_size = 8, # This is expected to be large, 8 is for trial -- didn't work
    shuffle = True,
    pin_memory = False #True 
)

The data file is a large (JSON) file, but I am getting a memory error:

<RuntimeError: CUDA out of memory. Tried to allocate... ... ... >

Note:

The content of the large JSON file is a list of pairs of number lists, like:

    [0, 1 , 0, 0,..... 4000 numbers]  <-- this is the input_data
    [0, 2, 2, ... 50 numbers ]        <-- this is the label
So batch size 8 (that is, 8 such pairs), or even 800, should not matter much.

Can someone please help me: how can I get the iterator without loading the whole large file at once? Any other solution is welcome too. Thank you very much for your support.


2 Answers

心奴独伤 2025-02-08 19:21:56


You are getting a CUDA OOM error; it is not related to the file itself being large, but to a single example being large.

The JSON file loads into RAM correctly, but 8 examples cannot fit on your GPU (which is often the case for images/videos, especially at high resolution).

Solutions

  1. Use a larger GPU (e.g. a cloud-provided one).
  2. Use a smaller batch size (even of size 1) together with gradient accumulation, as sketched below. It will run more slowly, but the results will be as good as those of larger batches.
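
A minimal sketch of gradient accumulation, assuming a model, an optimizer, a loss function criterion, a device, and the train_loader from the question (these names are placeholders, not from the original post):

# Gradient accumulation: run several small batches before each optimizer step,
# so the effective batch size is batch_size * accumulation_steps.
accumulation_steps = 4  # e.g. effective batch size = 8 * 4 = 32

optimizer.zero_grad()
for step, (input_data, label_data) in enumerate(train_loader):
    input_data = input_data.to(device)
    label_data = label_data.to(device)

    loss = criterion(model(input_data), label_data)
    # Scale the loss so the accumulated gradients average out like one large batch
    (loss / accumulation_steps).backward()

    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()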
抱着落日 2025-02-08 19:21:56


The primary approach I had used was the Dataset(Dataset) class shown in the question. For large datasets, PyTorch provides an iterable-style dataset, IterableDataset.

Use the following,

import torch
from torch.utils.data import IterableDataset


class CustomIterableDataset(IterableDataset):

    def __init__(self, filename):
        # Store the filename in the object's memory
        self.filename = filename
        self.dataset_size = ...  # <Provide the number of examples (lines) here>

    def preprocess(self, text):
        ### Do something with the input here
        text_pp = some_processing_function(text)

        return text_pp

    def line_mapper(self, line):
        # Splits the line into input and label
        #         input, label = line.split(',')
        #         text = self.preprocess(text)

        in_label, out_label = eval(line)  # each line holds "input_list, label_list"

        input_s = torch.LongTensor(in_label)
        output_s = torch.LongTensor(out_label)

        return input_s, output_s  # Input and output stream

    def __len__(self):
        return self.dataset_size

    def __iter__(self):
        # Create an iterator over the lines of the file
        file_itr = open(self.filename)

        # Map each line using line_mapper
        mapped_itr = map(self.line_mapper, file_itr)

        return mapped_itr

For more information, see the PyTorch documentation on IterableDataset.
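
A minimal usage sketch, under the assumption that the large JSON file is first converted once into a text file with one "input_list, label_list" pair per line (the format line_mapper expects); the file name train_lines.txt is a placeholder:

import json

import torch

# One-time conversion: write each [input, label] pair of the large JSON file
# onto its own line, so CustomIterableDataset can stream it afterwards.
with open(path_to_large_json_file) as f_in, open("train_lines.txt", "w") as f_out:
    for input_list, label_list in json.load(f_in):
        f_out.write(f"{input_list}, {label_list}\n")

train_loader = torch.utils.data.DataLoader(
    CustomIterableDataset("train_lines.txt"),
    batch_size = 8,
    shuffle = False,  # DataLoader does not allow shuffle=True with an IterableDataset
    pin_memory = False
)

Note that the one-time conversion above still loads the JSON once; after that, training streams the text file line by line, so the whole dataset never has to be held in memory during training.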
