RoBERTa

I have a dataset for fake news with 4 different classes: true, false, partially true, and other.
Currently my code uses label encoding for these labels, but I would like to switch to one-hot encoding.
So now I am trying to turn these labels into one-hot vectors. How can I achieve that in a way that still lets me pass the labels to a RoBERTa model afterwards?
Here is my current code:

First, I convert the labels to numerical values (0-3):

from sklearn.preprocessing import LabelEncoder

# Map the four string labels to integer ids 0-3
le = LabelEncoder()
df['label'] = le.fit_transform(df['label'])
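
From what I have read so far, the conversion I am aiming for might look roughly like this (an untested sketch using torch.nn.functional.one_hot; num_classes=4 is my assumption for the four classes):

import torch
import torch.nn.functional as F

# Sketch: turn the integer ids produced by LabelEncoder into one-hot rows.
# .float() because BCE-style losses expect float targets, while one_hot
# returns int64.
int_labels = torch.tensor(df['label'].values)
one_hot = F.one_hot(int_labels, num_classes=4).float()  # shape (len(df), 4)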

After that, I split the data into training and validation sets:

from sklearn.model_selection import train_test_split

# Prepend the title to the article text and collect the examples
texts = []
labels = []
for i in range(len(df)):
    text = df["text"].iloc[i]
    label = df["label"].iloc[i]
    text = df["title"].iloc[i] + " - " + text
    texts.append(text)
    labels.append(label)

# Split once, outside the loop, and keep all four returned lists
train_texts, valid_texts, train_labels, valid_labels = train_test_split(
    texts, labels, test_size=test_size)
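
A side thought, not part of my original code: as far as I know train_test_split accepts a stratify argument, which would keep the proportions of the four classes similar in both splits:

train_texts, valid_texts, train_labels, valid_labels = train_test_split(
    texts, labels, test_size=test_size, stratify=labels)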

Finally, I make the data compatible with the RoBERTa model:

from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained(model_name)

# Tokenize with truncation and padding up to max_length
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=max_length)
valid_encodings = tokenizer(valid_texts, truncation=True, padding=True, max_length=max_length)

train_dataset = NewsGroupsDataset(train_encodings, train_labels)
valid_dataset = NewsGroupsDataset(valid_encodings, valid_labels)
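
If I do end up passing one-hot float vectors as labels, my understanding is that the Hugging Face classification head has to be switched to a multi-label style loss (BCEWithLogitsLoss), e.g. via problem_type. A sketch of what I think that would look like:

from transformers import RobertaForSequenceClassification

# Sketch: with float one-hot labels, problem_type="multi_label_classification"
# makes the model apply BCEWithLogitsLoss instead of CrossEntropyLoss.
model = RobertaForSequenceClassification.from_pretrained(
    model_name,
    num_labels=4,
    problem_type="multi_label_classification",
)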

Where NewsGroupsDataset looks like this:

import torch

class NewsGroupsDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # Turn each tokenizer field (input_ids, attention_mask, ...) into a tensor
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        # A scalar class id (no extra list wrapper) is what the loss expects
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)
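
My current idea for the one-hot version is to build the vector inside __getitem__. This is an untested sketch; OneHotNewsGroupsDataset and num_classes are names I made up:

import torch
import torch.nn.functional as F

class OneHotNewsGroupsDataset(torch.utils.data.Dataset):
    # Sketch: same as NewsGroupsDataset, but labels come out as float one-hot rows
    def __init__(self, encodings, labels, num_classes=4):
        self.encodings = encodings
        self.labels = labels
        self.num_classes = num_classes

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        # Float targets, as BCEWithLogitsLoss expects
        item["labels"] = F.one_hot(
            torch.tensor(self.labels[idx]), num_classes=self.num_classes).float()
        return item

    def __len__(self):
        return len(self.labels)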

How can I switch to one-hot encoding, given that I do not want the model to assume a natural order between the labels?
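
To convince myself the shapes line up, here is a small self-contained check (all values made up) of what BCEWithLogitsLoss expects from one-hot targets:

import torch
import torch.nn.functional as F

ids = torch.tensor([0, 2, 3, 1])                 # a fake batch of class ids
targets = F.one_hot(ids, num_classes=4).float()  # shape (4, 4), float32
logits = torch.randn(4, 4)                       # stand-in for model outputs
loss = torch.nn.BCEWithLogitsLoss()(logits, targets)
print(loss.item())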
