Getting the input file name in a Hadoop streaming program
When writing the program in Java, I can find the name of the input file in a mapper class using FileSplit.
Is there a corresponding way to do this when I write the program in Python (using streaming)?
I found the following in the hadoop streaming document on apache:
See Configured Parameters. During the execution of a streaming job,
the names of the "mapred" parameters are transformed. The dots ( . )
become underscores ( _ ). For example, mapred.job.id becomes
mapred_job_id and mapred.jar becomes mapred_jar. In your code, use the
parameter names with the underscores.
But I still can't understand how to make use of this inside my mapper.
Any help is highly appreciated.
Thanks
According to "Hadoop: The Definitive Guide":
Hadoop sets job configuration parameters as environment variables for Streaming programs. However, it replaces non-alphanumeric characters with underscores to make sure they are valid names. The following Python expression illustrates how you can retrieve the value of the mapred.job.id property from within a Python Streaming script:
os.environ["mapred_job_id"]
You can also set environment variables for the Streaming process launched by MapReduce by applying the -cmdenv option to the Streaming launcher program (once for each variable you wish to set). For example, the following sets the MAGIC_PARAMETER environment variable:
-cmdenv MAGIC_PARAMETER=abracadabra
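As a rough illustration of the renaming rule quoted above, here is a minimal sketch of a lookup helper. The names streaming_env_name and get_job_property are hypothetical, made up for this example, and the optional environ argument exists only so the helper can be tried outside a real job.

```python
import os
import re


def streaming_env_name(prop):
    """Turn a property name like 'mapred.job.id' into 'mapred_job_id'.

    Follows the rule quoted above: every non-alphanumeric character
    is replaced with an underscore.
    """
    return re.sub(r"[^A-Za-z0-9]", "_", prop)


def get_job_property(prop, default=None, environ=None):
    """Look up a job property in the environment (hypothetical helper).

    Returns `default` when the variable is not set, which is what
    happens when the script runs outside a streaming job.
    """
    environ = os.environ if environ is None else environ
    return environ.get(streaming_env_name(prop), default)
```

Inside a mapper you would call something like get_job_property("mapred.job.id"). A variable passed with -cmdenv, such as MAGIC_PARAMETER, keeps its name and can be read directly with os.environ.get("MAGIC_PARAMETER").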
By reading the mapreduce_map_input_file (new) or map_input_file (deprecated) environment variable, you will get the map input file name.
Note:
Both environment variable names are case-sensitive, and all of their letters are lower-case.
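To tie this back to the original question, here is a minimal sketch of a Python streaming mapper that tags each input line with its file name. It assumes the two variable names given in this answer; input_file_name and run_mapper are hypothetical names, and the tab-separated output format is only an illustration.

```python
import os


def input_file_name(environ=None):
    """Return the map input file name, or None if it is not set.

    Tries the newer variable name first, then the deprecated one.
    """
    environ = os.environ if environ is None else environ
    for key in ("mapreduce_map_input_file", "map_input_file"):
        value = environ.get(key)
        if value:
            return value
    return None


def run_mapper(lines, out, environ=None):
    """Write '<file name>TAB<line>' for every non-empty input line."""
    name = input_file_name(environ) or "unknown"
    for line in lines:
        line = line.rstrip("\n")
        if line:
            out.write(f"{name}\t{line}\n")
```

In an actual mapper script you would import sys and call run_mapper(sys.stdin, sys.stdout) at the bottom. The "unknown" fallback only matters when the script is run by hand outside a job, where neither variable is set.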
The new environment variable for Hadoop 2.x is MAPREDUCE_MAP_INPUT_FILE