RoBERTa
I have a dataset for fake news, which has 4 different classes: true, false, partially true, and other.
Currently my code uses LabelEncoding for these labels, but I would like to switch to one-hot encoding. So now I am trying to turn these labels into one-hot vectors, in a way that still allows me to pass the labels to the RoBERTa model afterwards. Here I will share my current code:
First, I convert the labels to numerical values (0-3):
from sklearn.preprocessing import LabelEncoder

# map the four class names to integer ids 0-3
le = LabelEncoder()
df['label'] = le.fit_transform(df['label'])
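(For reference, the 0-3 assignment follows the sorted order of le.classes_; the exact class strings below are my assumption about what the label column contains:)

# LabelEncoder sorts the class names, so roughly:
# 'false' -> 0, 'other' -> 1, 'partially true' -> 2, 'true' -> 3
print(list(le.classes_))
print(le.transform(['true']))  # e.g. array([3])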
After that I split the data into train and validation sets:
from sklearn.model_selection import train_test_split

texts = []
labels = []
for i in range(len(df)):
    # prepend the title to the article body
    text = df["text"].iloc[i]
    label = df["label"].iloc[i]
    text = df["title"].iloc[i] + " - " + text
    texts.append(text)
    labels.append(label)

# keep the four split results for the steps below
train_texts, valid_texts, train_labels, valid_labels = train_test_split(
    texts, labels, test_size=test_size
)
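(A quick sanity check that the four lists line up; nothing model-specific:)

assert len(train_texts) == len(train_labels)
assert len(valid_texts) == len(valid_labels)
print(len(train_texts), len(valid_texts))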
Finally, I make the data compatible with the BERT model:
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained(model_name)
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=max_length)
valid_encodings = tokenizer(valid_texts, truncation=True, padding=True, max_length=max_length)
train_dataset = NewsGroupsDataset(train_encodings, train_labels)
valid_dataset = NewsGroupsDataset(valid_encodings, valid_labels)
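(For context, each encoding is a dict-like object whose values are per-example lists; that is what the dataset class below indexes into:)

# a fast RoBERTa tokenizer returns input_ids and attention_mask
print(train_encodings.keys())  # dict_keys(['input_ids', 'attention_mask'])
print(len(train_encodings['input_ids']) == len(train_labels))  # True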
NewsGroupsDataset looks like this:
import torch

class NewsGroupsDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # tokenizer output for one example, converted to tensors
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        # scalar class index (0-3)
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)
How can I switch to one-hot encoding? I do not want the model to assume that there is a natural order between the labels.
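For reference, this is the direction I was considering: converting each integer label to a float one-hot vector inside __getitem__. This is only a sketch; it assumes num_labels is 4 and uses torch.nn.functional.one_hot, and I am not sure it is what the model's loss expects:

import torch
import torch.nn.functional as F

NUM_LABELS = 4  # true, false, partially true, other

class OneHotNewsGroupsDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        # e.g. class index 2 -> tensor([0., 0., 1., 0.])
        item["labels"] = F.one_hot(
            torch.tensor(self.labels[idx]), num_classes=NUM_LABELS
        ).float()
        return item

    def __len__(self):
        return len(self.labels)

Would something like this work with RobertaForSequenceClassification, or does the model still expect plain class indices?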