Using the distributed cache with Pig on Elastic Map Reduce

Posted on 2024-12-17 16:13:51

I am trying to run my Pig script (which uses UDFs) on Amazon's Elastic Map Reduce.
I need to use some static files from within my UDFs.

I do something like this in my UDF:

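import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class MyUDF extends EvalFunc<DataBag> {
    @Override
    public DataBag exec(Tuple input) throws IOException {
        ...
        // Read the cached file by its local name in the task's working directory.
        FileReader fr = new FileReader("./myfile.txt");
        ...
    }
    // Files to ship to the distributed cache; the name after '#' is the local name.
    public List<String> getCacheFiles() {
        List<String> list = new ArrayList<String>(1);
        list.add("s3://path/to/myfile.txt#myfile.txt");
        return list;
    }
}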

I have stored the file in my S3 bucket at /path/to/myfile.txt

However, on running my Pig job, I see an exception:

Got an exception java.io.FileNotFoundException: ./myfile.txt (No such file or directory)

So, my question is: how do I use distributed cache files when running a Pig script on Amazon's EMR?

EDIT: I figured out that pig-0.6, unlike pig-0.9, does not have a function called getCacheFiles(). Amazon does not support pig-0.9, so I am limited to 0.6 and need to figure out a different way to get the distributed cache working there.
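Whichever mechanism ships the file, the UDF ends up reading it by its local name from the task's working directory. Below is a minimal sketch of a loader that reads such a file once and fails with a more descriptive message when the file was never shipped. CachedFileLoader is a hypothetical helper, not part of Pig or Hadoop, and it assumes the cache mechanism places a file (or symlink) with the given name in the working directory.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not a Pig or Hadoop class): lazily loads the lines of a
// file that is expected to sit in the task's working directory.
final class CachedFileLoader {
    private final String localName;
    private List<String> lines; // stays null until the first successful load

    CachedFileLoader(String localName) {
        this.localName = localName;
    }

    // Reads the file on the first call and returns the same list afterwards.
    synchronized List<String> lines() throws IOException {
        if (lines == null) {
            File f = new File(localName);
            if (!f.isFile()) {
                // Include the working directory to make a missing cache file easier to diagnose.
                throw new FileNotFoundException(localName + " not found in "
                        + new File(".").getAbsolutePath()
                        + "; was it shipped to the distributed cache?");
            }
            List<String> loaded = new ArrayList<String>();
            try (BufferedReader reader = new BufferedReader(new FileReader(f))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    loaded.add(line);
                }
            }
            lines = loaded;
        }
        return lines;
    }
}
```

Inside exec(), a UDF could hold one such loader for "./myfile.txt" in a field and call lines() on each invocation, so the file is parsed once per task rather than once per tuple.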

凡尘雨 2024-12-24 16:13:51

I think adding this extra arg to the Pig command line call should work (with s3 or s3n, depending on where your file is stored):

-cacheFile s3n://bucket_name/file_name#cache_file_name

You should be able to add that in the "Extra Args" box when creating the Job flow.
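The argument above packs two names into one string: the remote location before the '#' and, after it, the name under which the file is expected to appear locally. A small sketch of that split follows; CacheFileSpec is a hypothetical helper for illustration only, not a Pig, Hadoop, or EMR class, and the remote-plus-local-name reading of the argument is an assumption based on the usual distributed cache convention.

```java
// Hypothetical helper: splits "<remote-uri>#<local-name>" into its two parts.
final class CacheFileSpec {
    final String remoteUri; // where the file is stored, e.g. an s3:// or s3n:// location
    final String localName; // name the UDF would open in its working directory

    private CacheFileSpec(String remoteUri, String localName) {
        this.remoteUri = remoteUri;
        this.localName = localName;
    }

    static CacheFileSpec parse(String spec) {
        int hash = spec.indexOf('#');
        // Reject specs with no '#', nothing before it, or nothing after it.
        if (hash <= 0 || hash == spec.length() - 1) {
            throw new IllegalArgumentException("expected <remote-uri>#<local-name>: " + spec);
        }
        return new CacheFileSpec(spec.substring(0, hash), spec.substring(hash + 1));
    }
}
```

With the argument from the answer, CacheFileSpec.parse("s3n://bucket_name/file_name#cache_file_name") yields the remote part s3n://bucket_name/file_name and the local name cache_file_name, so the UDF would open "./cache_file_name".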
