TypeError with DataLoader
I am testing my model on a very large dataset. To sample and test quickly, I would like to build a data loader, but I keep getting an error that I haven't been able to solve for two days. Here is my code:
import torch
import pandas as pd
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer

PRE_TRAINED_MODEL_NAME = 'bert-base-cased'
tokenizer = BertTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME)

class GPReviewDataset(Dataset):
    def __init__(self, Paragraph, target, tokenizer, max_len):
        self.Paragraph = Paragraph
        self.target = target
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.Paragraph)

    def __getitem__(self, item):
        Paragraph = str(self.Paragraph[item])
        target = self.target[item]
        encoding = self.tokenizer.encode_plus(
            Paragraph,
            add_special_tokens=True,
            max_length=self.max_len,
            return_token_type_ids=False,
            pad_to_max_length=True,
            return_attention_mask=True,
            return_tensors='pt',
        )
        return {
            'review_text': Paragraph,
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'targets': torch.tensor(target, dtype=torch.long)
        }

def create_data_loader(df, tokenizer, max_len, batch_size):
    ds = GPReviewDataset(
        Paragraph=df.Paragraph.to_numpy(),
        target=df.target.to_numpy(),
        tokenizer=tokenizer,
        max_len=max_len
    )
    return DataLoader(
        ds,
        batch_size=batch_size,
        num_workers=4
    )

# Main function
paragraph = ['Image to PDF Converter. ', 'Test Test']
target = ['0', '1']
df = pd.DataFrame({'Paragraph': paragraph, 'target': target})
MAX_LEN = '512'
BATCH_SIZE = 1
train_data_loader1 = create_data_loader(df, tokenizer, MAX_LEN, BATCH_SIZE)
for d in train_data_loader1:
    print(d)
When I iterate over the DataLoader, I get this error:
TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "<ipython-input-3-c4f87a4dbb48>", line 20, in __getitem__
    return_tensors='pt',
  File "/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils.py", line 1069, in encode_plus
    return_special_tokens_mask=return_special_tokens_mask,
  File "/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils.py", line 1365, in prepare_for_model
    if max_length and total_len > max_length:
TypeError: '>' not supported between instances of 'int' and 'str'
Can anyone help me? Also, can you give me tips on how to test my model on a large dataset? I mean, what is the fastest way to test the model on 3M data samples?
Comments (1)
The error is as it states. You should change your MAX_LEN from a string to an int:
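A minimal sketch of the fix, reusing the names from the question; the note about the target labels is an extra observation, not part of the original answer:

# MAX_LEN must be an integer so that encode_plus can evaluate
# total_len > max_length inside prepare_for_model.
MAX_LEN = 512        # was '512' (a string), which caused the TypeError
BATCH_SIZE = 1

# Note (assumption beyond the original answer): the labels should also be
# integers, otherwise torch.tensor(target, dtype=torch.long) will fail on
# the string values '0' and '1'.
target = [0, 1]

df = pd.DataFrame({'Paragraph': paragraph, 'target': target})
train_data_loader1 = create_data_loader(df, tokenizer, MAX_LEN, BATCH_SIZE)
for d in train_data_loader1:
    print(d)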