How to adapt a Hugging Face datasets preprocessing function to a custom dataset
So, the function preprocess_function below is written for Hugging Face datasets:
from datasets import load_dataset, load_metric
from transformers import AutoTokenizer

raw_datasets = load_dataset("xsum")

model_checkpoint = "t5-small"  # example value; defined earlier in the original notebook
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

max_input_length = 1024
max_target_length = 128

# T5 checkpoints expect a task prefix on the input text
if model_checkpoint in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
    prefix = "summarize: "
else:
    prefix = ""

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Set up the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["summary"], max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)
I'm not using a Hugging Face dataset but my own data, so I can't call dataset.map() in the last line of the code. Since my train dataset is just a plain pandas DataFrame, I changed the last line to use apply instead:

tokenized_datasets = train.apply(preprocess_function)

But it raises this error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-18-ad0e3caaca6d> in <module>()
----> 1 tokenized_datasets = train.apply(preprocess_function)
7 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/range.py in get_loc(self, key, method, tolerance)
386 except ValueError as err:
387 raise KeyError(key) from err
--> 388 raise KeyError(key)
389 return super().get_loc(key, method=method, tolerance=tolerance)
390
KeyError: 'input'
Can someone tell me how to change this code so that it turns my raw train DataFrame into a tokenized_train dataset?
1 Answer
You can create a Hugging Face dataset from a pandas DataFrame with Dataset.from_pandas:

from datasets import Dataset

ds = Dataset.from_pandas(df)

This should work, and it will let you use the dataset's map method.
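For context, here is a minimal end-to-end sketch, assuming your DataFrame has the same "document" and "summary" columns that preprocess_function reads (the DataFrame contents below are illustrative placeholders, not real data):

from datasets import Dataset
import pandas as pd

# Hypothetical stand-in for the asker's `train` DataFrame;
# the column names must match what preprocess_function expects.
train = pd.DataFrame({
    "document": ["A long news article ...", "Another long article ..."],
    "summary": ["A short summary.", "Another short summary."],
})

# Convert the DataFrame to a Hugging Face Dataset
train_ds = Dataset.from_pandas(train)

# map() now works exactly as it does on load_dataset() output
tokenized_train = train_ds.map(preprocess_function, batched=True)

As an aside, the KeyError from train.apply(preprocess_function) is unsurprising: DataFrame.apply calls the function once per column (or per row with axis=1), so examples inside the function is a single pandas Series rather than a batch of columns, and the string lookups fail. Converting with Dataset.from_pandas and then calling map avoids that mismatch.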