How to adapt a function made for Hugging Face datasets to a custom dataset

Posted 2025-02-09 19:45:54

So, the function 'preprocess_function' below was written for Hugging Face datasets.

from datasets import load_dataset, load_metric
from transformers import AutoTokenizer

raw_datasets = load_dataset("xsum")
model_checkpoint = "t5-small"  # assumed checkpoint; the original post defines this elsewhere
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

max_input_length = 1024
max_target_length = 128

if model_checkpoint in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
    prefix = "summarize: "
else:
    prefix = ""

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["summary"], max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)

I'm not using a Hugging Face dataset but my own, so I can't use the dataset.map() function in the last line of the code. Since my train dataset is just a plain pandas DataFrame, I changed the last line to use the apply function instead:

tokenized_datasets = train.apply(preprocess_function)

But it produces this error message:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-18-ad0e3caaca6d> in <module>()
----> 1 tokenized_datasets = train.apply(preprocess_function)

7 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/range.py in get_loc(self, key, method, tolerance)
    386                 except ValueError as err:
    387                     raise KeyError(key) from err
--> 388             raise KeyError(key)
    389         return super().get_loc(key, method=method, tolerance=tolerance)
    390 

KeyError: 'input'

Can someone tell me how to change this code so that it turns my raw train DataFrame into a tokenized_train dataset?

1 Comment

尾戒 2025-02-16 19:45:54

You can create a Hugging Face dataset from a pandas DataFrame, as shown in the Dataset.from_pandas documentation: ds = Dataset.from_pandas(df) should work. This will let you use the dataset's map feature.
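
For completeness, here is a minimal sketch of the full round trip, assuming your DataFrame is named train and has the same "document" and "summary" columns that preprocess_function expects:

from datasets import Dataset

# wrap the pandas DataFrame in a Hugging Face Dataset
ds = Dataset.from_pandas(train)

# map() now works exactly as in the tutorial code
tokenized_train = ds.map(preprocess_function, batched=True)

# optionally drop the raw text columns so only model inputs remain
tokenized_train = tokenized_train.remove_columns(["document", "summary"])

Note that if your DataFrame has a non-default index, from_pandas preserves it as an extra __index_level_0__ column, which you can drop the same way.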
