Implementing a custom Huggingface dataset with data downloaded from S3

Posted on 2025-01-12 10:00:48

In order to implement a custom Huggingface dataset I need to implement three methods:

from datasets import DownloadManager, GeneratorBasedBuilder

# _generate_examples is the hook of GeneratorBasedBuilder, so subclass that.
class MyDataset(GeneratorBasedBuilder):
    def _info(self):
        ...

    def _split_generators(self, dl_manager: DownloadManager):
        '''
        Method in charge of downloading (or retrieving locally
        the data files), organizing them according to the splits
        and defining specific arguments for the generation process
        if needed.
        '''
        ...

    def _generate_examples(self, **gen_kwargs):
        ...

Now, in the _split_generators method I need to download a CSV file from S3 (a private bucket, so keys are needed to access it). The file will then be processed further once it has been downloaded.

Do you know if there is a way to use the dl_manager parameter to download it? I guess I can download the file with some other method or an external library, but I'm wondering whether I can do it with Huggingface's datasets objects and functionality.

In this repo you can see many examples of custom datasets. For instance, the data used to build the Amazon US reviews dataset is downloaded from "https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_" + name + ".tsv.gz" (as you can see here). This is a public link, though, and anybody can access it. Instead, I would like to use a DownloadManager object to download my private data from S3.

Comments (2)

瑾兮 2025-01-19 10:00:48

datasets provides classes for downloading things from S3 (and other cloud providers): https://huggingface.co/docs/datasets/v2.4.0/en/filesystems

So you can do something like this:

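import os
import datasets

# MY_S3_URI and CACHE_DIR are placeholders that you define yourself.
def _split_generators(self, dl_manager):
    # For a private bucket, credentials can be passed as key=... and secret=...
    s3 = datasets.filesystems.S3FileSystem()

    # Copy the object into the local cache directory under its own file name.
    _, f = os.path.split(MY_S3_URI)
    local_path = os.path.join(CACHE_DIR, f)
    s3.get(MY_S3_URI, local_path)

    return [
        datasets.SplitGenerator(name=datasets.Split.ALL, gen_kwargs={"filepath": local_path}),
    ]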
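
The "filepath" entry in gen_kwargs is forwarded to _generate_examples as a keyword argument, which is where the downloaded CSV gets processed. A minimal sketch of that step follows, written as a standalone function so it can be tried without datasets installed. The name generate_examples and the assumption that the CSV has a header row are illustrative and not from the question.

```python
import csv
from typing import Any, Dict, Iterator, Tuple


def generate_examples(filepath: str) -> Iterator[Tuple[int, Dict[str, Any]]]:
    """Yield (key, example) pairs, one per CSV row.

    Assumes the first row of the file is a header; every value is kept
    as a string, exactly as csv.DictReader returns it.
    """
    with open(filepath, newline="", encoding="utf-8") as csv_file:
        reader = csv.DictReader(csv_file)
        for row_index, row in enumerate(reader):
            # The row index serves as the unique key for each example.
            yield row_index, dict(row)
```

Inside the builder, _generate_examples(self, filepath) could then simply do `yield from generate_examples(filepath)`. The column names have to match the features declared in _info.
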

妳是的陽光 2025-01-19 10:00:48

I was having the same issue and found DownloadManager has a download_custom method just for this.

https://huggingface.co/docs/datasets/package_reference/builder_classes#datasets.DownloadManager.download_custom

From their example:

downloaded_files = dl_manager.download_custom(
    's3://my-bucket/data.zip',
    custom_download_for_my_private_bucket
)
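
The documentation example leaves custom_download_for_my_private_bucket undefined. download_custom calls it with a source URL and a destination path, so one possible shape is sketched below. This is an assumption-laden sketch rather than anything from the datasets docs: it uses boto3 (installed separately) and relies on boto3 finding credentials through its default chain, and split_s3_uri is a hypothetical helper introduced here.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/path/to/key' into ('bucket', 'path/to/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def custom_download_for_my_private_bucket(src_url: str, dst_path: str) -> None:
    """Download one S3 object to dst_path (callback for download_custom)."""
    import boto3  # imported lazily; assumed to be installed separately

    bucket, key = split_s3_uri(src_url)
    # boto3 resolves credentials from its default chain (environment
    # variables, ~/.aws/credentials, or an attached IAM role).
    boto3.client("s3").download_file(bucket, key, dst_path)
```

Whether download_custom is available depends on the installed datasets version, so it is worth checking the reference page for your release before relying on it.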

Note that you can combine this with extract to get behavior analogous to download_and_extract while still using your custom download function.

extracted_path = dl_manager.extract(
    dl_manager.download_custom(
        's3://my-bucket/data.zip', 
        custom_download_for_my_private_bucket
    )
)
