Implementing a custom Huggingface dataset with data downloaded from S3
In order to implement a custom Huggingface dataset I need to implement three methods:

    from datasets import DownloadManager, GeneratorBasedBuilder

    # GeneratorBasedBuilder is the DatasetBuilder subclass that relies on _generate_examples
    class MyDataset(GeneratorBasedBuilder):
        def _info(self):
            ...

        def _split_generators(self, dl_manager: DownloadManager):
            '''
            Method in charge of downloading (or retrieving locally
            the data files), organizing them according to the splits
            and defining specific arguments for the generation process
            if needed.
            '''
            ...

        def _generate_examples(self):
            ...

Now, in the _split_generators method I need to download a CSV file from S3 (a private bucket, so keys are needed to access it). The file will then be processed further once it has been downloaded.
Do you know if there is a way to use the dl_manager parameter to download it? I guess I could download the file with some other method or an external library, but I'm wondering whether I can do it with Huggingface's datasets objects and functionality.
In this repo you can see many examples of custom datasets. For instance, the data used to build the Amazon US reviews dataset is downloaded from "https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_" + name + ".tsv.gz" (as you can see here). That is a public link, though, and anybody can access it. Instead, I would like to use a DownloadManager object to download my private data from S3.
Comments (2)
datasets provides some classes for downloading things from S3 (and other cloud providers): https://huggingface.co/docs/datasets/v2.4.0/en/filesystems so you can use those to fetch the file.
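A minimal sketch of that approach, assuming the `s3fs` package (the fsspec implementation for S3) is installed. The helper name `fetch_from_s3`, the bucket path, and the credential arguments are hypothetical and only serve the illustration; they are not part of `datasets`.

```python
import os


def fetch_from_s3(remote_path, local_path, fs=None, key=None, secret=None):
    """Copy one object from S3 to local disk via an fsspec-style filesystem.

    `fs` can be any object with a `get(remote, local)` method; when it is
    omitted, an s3fs.S3FileSystem is created from the given credentials.
    """
    if fs is None:
        # Imported lazily so this module still loads where s3fs is absent.
        import s3fs

        fs = s3fs.S3FileSystem(key=key, secret=secret)
    # Make sure the destination directory exists before copying.
    os.makedirs(os.path.dirname(os.path.abspath(local_path)), exist_ok=True)
    fs.get(remote_path, local_path)
    return local_path
```

Inside `_split_generators` this could be called as, for example, `fetch_from_s3("my-bucket/data.csv", "/tmp/data.csv", key=..., secret=...)`, with the keys read from environment variables rather than hard-coded.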
I was having the same issue and found that DownloadManager has a download_custom method just for this: https://huggingface.co/docs/datasets/package_reference/builder_classes#datasets.DownloadManager.download_custom
From their example:

    downloaded_files = dl_manager.download_custom('s3://my-bucket/data.zip', custom_download_for_my_private_bucket)

Note that you can combine this with extract to get behavior analogous to download_and_extract with your custom download function.
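A rough sketch of what such a custom download function might look like, assuming `boto3` is installed and that credentials are resolved through boto3's usual configuration (environment variables, profile, and so on). The names `split_s3_url`, `make_s3_download`, and `download_private_file` are hypothetical helpers, not part of `datasets`; the only `DownloadManager` calls used are `download_custom(url, fn)`, where `fn` takes a source URL and a destination path, and `extract(path)`.

```python
from urllib.parse import urlparse


def split_s3_url(url):
    """Split 's3://bucket/path/to/key' into ('bucket', 'path/to/key')."""
    parsed = urlparse(url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// URL: {url!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def make_s3_download(client=None):
    """Build a (src_url, dst_path) callable to hand to download_custom."""

    def download(src_url, dst_path):
        nonlocal client
        if client is None:
            # Lazy import; boto3 picks up credentials from its normal chain.
            import boto3

            client = boto3.client("s3")
        bucket, key = split_s3_url(src_url)
        client.download_file(bucket, key, dst_path)

    return download


def download_private_file(dl_manager, url, client=None, extract=False):
    """Download `url` through dl_manager, optionally extracting the result."""
    path = dl_manager.download_custom(url, make_s3_download(client))
    # Chaining extract mirrors download_and_extract for the custom download.
    return dl_manager.extract(path) if extract else path
```

In `_split_generators`, a call such as `download_private_file(dl_manager, "s3://my-bucket/data.csv")` would then return the local path to pass on to `_generate_examples`.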