How to adapt a function made for Hugging Face datasets to a custom dataset

Posted 2025-02-09 19:45:54

So, the function 'preprocess_function' below was written for Hugging Face datasets.

from datasets import load_dataset, load_metric
from transformers import AutoTokenizer

raw_datasets = load_dataset("xsum")
model_checkpoint = "t5-small"  # assumed checkpoint; the original post defines this elsewhere
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

max_input_length = 1024
max_target_length = 128

if model_checkpoint in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
    prefix = "summarize: "
else:
    prefix = ""

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["summary"], max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)

I'm not using a Hugging Face dataset but my own, so I can't use the dataset.map() function in the last line of the code. Since my train dataset is just a plain pandas DataFrame, I changed the last line to use the apply function instead:

tokenized_datasets = train.apply(preprocess_function)

But it produces this error message:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-18-ad0e3caaca6d> in <module>()
----> 1 tokenized_datasets = train.apply(preprocess_function)

7 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/range.py in get_loc(self, key, method, tolerance)
    386                 except ValueError as err:
    387                     raise KeyError(key) from err
--> 388             raise KeyError(key)
    389         return super().get_loc(key, method=method, tolerance=tolerance)
    390 

KeyError: 'input'

Can someone tell me how to change this code so that it turns my raw train DataFrame into a tokenized_train dataset?

1 Comment

尾戒 2025-02-16 19:45:54

You can create a Hugging Face dataset from a pandas DataFrame, as shown in the Dataset.from_pandas documentation: ds = Dataset.from_pandas(df) should work. This will let you use the dataset's map feature.
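
For completeness, here is a minimal sketch of the full round trip, assuming your DataFrame is named train and has the same "document" and "summary" columns that preprocess_function expects:

from datasets import Dataset

# wrap the pandas DataFrame in a Hugging Face Dataset
ds = Dataset.from_pandas(train)

# map() now works exactly as in the tutorial code
tokenized_train = ds.map(preprocess_function, batched=True)

# optionally drop the raw text columns so only model inputs remain
tokenized_train = tokenized_train.remove_columns(["document", "summary"])

Note that if your DataFrame has a non-default index, from_pandas preserves it as an extra __index_level_0__ column, which you can drop the same way.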
