Can Apache Pig load data from STDIN instead of a file?
I want to use Apache Pig to transform and join data from two files, but I'd like to build the script step by step, testing it against a small sample of the real data (10 lines, for example). Can Pig read from STDIN and write to STDOUT?
Comments (2)
Hadoop supports streaming in various ways, but Pig originally lacked support for loading data through streaming. There are some solutions, however.
You can check out HStreaming:
The answer is no. The data needs to be on the cluster's data nodes before any MR job can run over it.
However, if you are working with a small sample of data and just want to do something simple, you could run Pig in local mode: write stdin to a local file and run your script against it.
But the bigger question is why you want to use MR/Pig on a stream of data. It was not, and is not, intended for that type of use.
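The local-mode workaround mentioned above (write stdin to a local file, then run the script over it) could be sketched roughly as follows. This is only an illustration under assumptions: the wrapper name `run_pig_local.py`, the script name `transform.pig`, and the parameter name `input` are hypothetical, and `pig` is assumed to be on the `PATH`.

```python
import shutil
import subprocess
import sys
import tempfile


def spool_to_file(stream) -> str:
    """Copy a text stream (e.g. sys.stdin) to a temp file and return its path."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        shutil.copyfileobj(stream, tmp)
        return tmp.name


def pig_local_command(script_path: str, input_path: str) -> list:
    """Build the command line for a local-mode Pig run."""
    # -x local runs Pig against the local filesystem, so no cluster is needed.
    # -param passes the sample file's path into the script as $input
    # (the parameter name "input" is an assumption for this sketch).
    return ["pig", "-x", "local", "-param", f"input={input_path}", "-f", script_path]


if __name__ == "__main__":
    # Hypothetical usage:
    #   head -n 10 data.txt | python run_pig_local.py transform.pig
    sample_path = spool_to_file(sys.stdin)
    subprocess.run(pig_local_command(sys.argv[1], sample_path), check=True)
```

For this to behave like a STDIN-to-STDOUT filter, the Pig script would need to read its data with `LOAD '$input'` and end with `DUMP` rather than `STORE`, since `DUMP` prints the relation to the console.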