IndexError: index out of range in self while training a BERT variant


While training XLMRobertaForSequenceClassification:

xlm_r_model(input_ids = X_train_batch_input_ids
            , attention_mask = X_train_batch_attention_mask
            , return_dict = False
           )

I faced the following error:

Traceback (most recent call last):
  File "<string>", line 3, in <module>
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/roberta/modeling_roberta.py", line 1218, in forward
    return_dict=return_dict,
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/roberta/modeling_roberta.py", line 849, in forward
    past_key_values_length=past_key_values_length,
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/roberta/modeling_roberta.py", line 132, in forward
    inputs_embeds = self.word_embeddings(input_ids)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/sparse.py", line 160, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 2044, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self

Below are the details:

  1. Creating model

    config = XLMRobertaConfig() 
    config.output_hidden_states = False
    xlm_r_model = XLMRobertaForSequenceClassification(config=config)
    xlm_r_model.to(device) # device is device(type='cpu')
    
  2. Tokenizer

    xlmr_tokenizer = XLMRobertaTokenizer.from_pretrained('xlm-roberta-large')
    
    MAX_TWEET_LEN = 402
    
    >>> df_1000.info() # describing a data frame I have pre-populated
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 1000 entries, 29639 to 44633
    Data columns (total 2 columns):
    #    Column  Non-Null Count  Dtype 
    ---  ------  --------------  ----- 
    0    text    1000 non-null   object
    1    class   1000 non-null   int64 
    dtypes: int64(1), object(1)
    memory usage: 55.7+ KB
    
    X_train = xlmr_tokenizer(list(df_1000[:800].text), padding=True, max_length=MAX_TWEET_LEN+5, truncation=True) # +5: headroom for special tokens / separators
    >>> list(map(len, X_train['input_ids']))  # why is it 105? shouldn't it be MAX_TWEET_LEN+5 = 407?
    [105, 105, 105, 105, 105, 105, 105, 105, 105, 105, 105, 105, 105, 105, ...]
    
    >>> type(train_index) # describing (for clarity) training fold indices I pre-populated
    <class 'numpy.ndarray'>
    
    >>> train_index.size 
    640
    
    X_train_fold_input_ids = np.array(X_train['input_ids'])[train_index]
    X_train_fold_attention_mask = np.array(X_train['attention_mask'])[train_index]
    
    >>> i # batch id
    0
    >>> batch_size
    16
    
    X_train_batch_input_ids = X_train_fold_input_ids[i:i+batch_size]
    X_train_batch_input_ids = torch.tensor(X_train_batch_input_ids,dtype=torch.long).to(device)
    
    X_train_batch_attention_mask = X_train_fold_attention_mask[i:i+batch_size]
    X_train_batch_attention_mask = torch.tensor(X_train_batch_attention_mask,dtype=torch.long).to(device)
    
    >>> X_train_batch_input_ids.size()
    torch.Size([16, 105]) # why 105? Shouldn't this be MAX_TWEET_LEN+5 = 407?
    
    >>> X_train_batch_attention_mask.size()
    torch.Size([16, 105]) # why 105? Shouldn't this be MAX_TWEET_LEN+5 = 407?
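
    (Note on the padded length: with padding=True the tokenizer pads only to the longest sequence in the batch, which here is 105 tokens, rather than to max_length. To pad every sequence to MAX_TWEET_LEN+5 = 407, padding='max_length' can be used instead; a minimal sketch, assuming the same xlmr_tokenizer and df_1000 as above:)
    
    X_train = xlmr_tokenizer(
        list(df_1000[:800].text),
        padding='max_length',          # pad every sequence up to max_length ...
        max_length=MAX_TWEET_LEN + 5,  # ... i.e. to 407 tokens
        truncation=True,
    )
    # len(X_train['input_ids'][0]) would now be 407 instead of 105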
    

After this I make the call to xlm_r_model(...) as stated at the beginning of this question, and I end up with the specified error.

Even after checking all these details, I am still not able to see why I am getting this error. Where am I going wrong?

Comments (2)

独留℉清风醉 2025-01-30 22:58:42


As per this post on GitHub, there can be many possible reasons for this error. Below is a list of those reasons, summarised from that post (as of April 24, 2022; note that the 2nd and 3rd reasons are not tested):

  1. A mismatch between the vocabulary size of the tokenizer and that of the BERT model. This makes the tokenizer generate token IDs that the model's embedding layer cannot look up (see the sketch after this list). ref
  2. The model and the data residing on different devices (CPU, GPU, TPU). ref
  3. Sequences longer than 512 tokens (the maximum for BERT-like models). ref
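
A minimal way to check reason 1 in this setup (a sketch assuming the same transformers classes as in the question; the printed values are indicative only):

from transformers import XLMRobertaConfig, XLMRobertaTokenizer

# A bare XLMRobertaConfig() carries a small default vocab_size, while the
# xlm-roberta-large tokenizer has a vocabulary of roughly 250k entries.
config = XLMRobertaConfig()
xlmr_tokenizer = XLMRobertaTokenizer.from_pretrained('xlm-roberta-large')

print(config.vocab_size, xlmr_tokenizer.vocab_size)
# Any input_id >= config.vocab_size makes nn.Embedding raise
# "IndexError: index out of range in self".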

In my case it was the first reason, a mismatched vocab size, and here is how I fixed it:

from transformers import XLMRobertaConfig, XLMRobertaTokenizer

xlmr_tokenizer = XLMRobertaTokenizer.from_pretrained('xlm-roberta-large')
config = XLMRobertaConfig()
config.vocab_size = xlmr_tokenizer.vocab_size  # set both to the same vocab size
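
Note that because the model in the question is built from a bare config, its weights are randomly initialised. If the goal is to fine-tune the pretrained checkpoint, an alternative (a sketch, not the answerer's code) is to load the model from the same checkpoint as the tokenizer, so the vocab sizes match automatically:

from transformers import XLMRobertaForSequenceClassification

xlm_r_model = XLMRobertaForSequenceClassification.from_pretrained(
    'xlm-roberta-large',  # same checkpoint as the tokenizer
    num_labels=2,         # hypothetical number of classes
)
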
独留℉清风醉 2025-01-30 22:58:42


I had the same issue and I solved it by replacing the local model path with the Hugging Face model name (from "/path/to/local/model" to "bert-base-chinese").
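
A minimal sketch of that change (the model class and the local path are illustrative, not the answerer's actual code):

from transformers import BertTokenizer, BertForSequenceClassification

# Before (hypothetical): a local directory whose tokenizer and model files
# may not match each other, e.g. a tokenizer with a different vocabulary.
# model = BertForSequenceClassification.from_pretrained("/path/to/local/model")

# After: load tokenizer and model from the same Hub checkpoint.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained("bert-base-chinese")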
