Calling a compiled binary on Amazon MapReduce
I'm trying to do some data analysis on Amazon Elastic MapReduce. The mapper step is a Python script that calls a compiled C++ binary named "./formatData". For example:
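# myMapper.py
import sys
from subprocess import Popen, PIPE

inputData = sys.stdin.readline()
# ...
# universal_newlines=True makes the pipes carry text rather than bytes
p1 = Popen(['./formatData'], stdin=PIPE, stdout=PIPE, universal_newlines=True)
# communicate() returns a (stdout, stderr) tuple; stderr is None here
p1Output, _ = p1.communicate(input=inputData)
result = ... # manipulate the formatted data
print("%s\t%s" % (result, 1))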
Can I call a binary executable like this on Amazon EMR? If so, where should I store the binary (in S3?), which platform should I compile it for, and how do I ensure my mapper script has access to it (ideally it would be in the current working directory)?
Thanks!
Comments (1)
You can call the binary that way, as long as you make sure it gets copied to the worker nodes correctly.
See https://forums.aws.amazon.com/thread.jspa?threadID=35158 for an explanation of how to use the distributed cache to make binary files accessible on the worker nodes.
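As a rough illustration of the mapper side, here is a minimal sketch that assumes the binary has already been shipped to each worker's working directory through the distributed cache (with Hadoop streaming this is typically done with the `-cacheFile` option, for example a hypothetical `s3://my-bucket/formatData#formatData`). The helper names `run_binary` and `map_lines` are made up for this example, and the check simply fails early with a clear message if the file did not arrive or is not executable.

```python
import os
import subprocess
import sys


def run_binary(cmd, input_bytes):
    """Pipe input_bytes through the external command and return its stdout."""
    binary = cmd[0]
    # Fail early with a clear message if the cached file never reached this node.
    if not os.path.isfile(binary):
        raise FileNotFoundError("binary not found in working directory: %s" % binary)
    if not os.access(binary, os.X_OK):
        raise PermissionError("binary is not executable: %s" % binary)
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    output, _ = proc.communicate(input=input_bytes)
    if proc.returncode != 0:
        raise RuntimeError("%s exited with status %d" % (binary, proc.returncode))
    return output


def map_lines(lines, cmd):
    """Yield one tab-separated (formatted value, 1) record per input line."""
    for line in lines:
        formatted = run_binary(cmd, line.encode("utf-8")).decode("utf-8").strip()
        yield "%s\t%s" % (formatted, 1)
```

In the mapper script itself this could be driven with `for record in map_lines(sys.stdin, ["./formatData"]): print(record)`. Spawning one process per line is simple but slow; a long-lived process fed the whole input may be a better fit if the binary supports it.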