How is input data distributed over the nodes with EMR [using MRJob]?
I'm looking into using Yelp's MRJob to run computations on Amazon's Elastic MapReduce. I will need to read and write a large amount of data during the computationally intensive job. Each node should only get a part of the data, and I'm confused about how this is done. Currently, my data is in MongoDB, stored on a persistent EBS drive.
When using EMR, how is the data factored over the nodes? How should one tell MRJob which key to partition the data over? The MRJob EMR documentation leaves the factoring step implicit: if you open a file or a connection to an S3 key-value store, how does it divide the keys? Does it assume that the input is a sequence and partition it automatically on that basis?
Perhaps someone can explain how input data is propagated to nodes using the MRJob wordcount example. In that example, the input is a text file -- is it copied to all nodes, or read serially by one node and distributed in pieces?
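For reference, the word-count job from the MRJob documentation looks roughly like the sketch below. The key point is that with a text-file input, Hadoop divides the file into input splits and hands each mapper lines from its split, so the whole file is neither copied to every node nor read serially by a single node:

    from mrjob.job import MRJob

    class MRWordCount(MRJob):

        def mapper(self, _, line):
            # Each mapper receives individual lines from its input split,
            # not the entire file.
            for word in line.split():
                yield word, 1

        def reducer(self, word, counts):
            # Hadoop's shuffle phase groups values by key, so each reducer
            # sees all counts for a given word.
            yield word, sum(counts)

    if __name__ == '__main__':
        MRWordCount.run()

The partitioning key, in other words, is chosen by what the mapper yields: Hadoop routes each emitted key to a reducer, so the mapper output key is the key the data is partitioned over.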
Comments (1)
That example assumes you are working with text files. I'm not sure you can pass in a parameter to use the MongoDB Hadoop driver.
What are you trying to do here? I'm working on the MongoDB Hadoop driver and I'm looking for examples and test cases.
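If the MongoDB Hadoop driver turns out not to be usable from MRJob, one workaround consistent with the text-file assumption above is to export the collection to newline-delimited JSON that Hadoop can split by line. A minimal sketch, assuming pymongo is installed; the connection settings, database name, collection name, and output path are all hypothetical:

    import json
    from pymongo import MongoClient

    # Hypothetical connection and names; replace with your own.
    client = MongoClient("localhost", 27017)
    collection = client["mydb"]["mycollection"]

    # Write one JSON document per line, so Hadoop/MRJob can split
    # the file into per-mapper chunks along line boundaries.
    with open("input.json", "w") as out:
        for doc in collection.find():
            # default=str handles ObjectId, datetime, and other
            # BSON types that json.dumps can't serialize directly.
            out.write(json.dumps(doc, default=str) + "\n")

The resulting file can be uploaded to S3 and passed to an MRJob EMR run like any other text input, with the mapper calling json.loads on each line.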