How do I use Python UDFs with Pig in Elastic MapReduce?
I would really like to take advantage of Python UDFs in Pig on our AWS Elastic MapReduce cluster, but I can't quite get everything to work. No matter what I try, my Pig job fails with the following exception being logged:
ERROR 2998: Unhandled internal error. org/python/core/PyException
java.lang.NoClassDefFoundError: org/python/core/PyException
at org.apache.pig.scripting.jython.JythonScriptEngine.registerFunctions(JythonScriptEngine.java:127)
at org.apache.pig.PigServer.registerCode(PigServer.java:568)
at org.apache.pig.tools.grunt.GruntParser.processRegister(GruntParser.java:421)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:419)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:188)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:164)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
at org.apache.pig.Main.run(Main.java:437)
at org.apache.pig.Main.main(Main.java:111)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: java.lang.ClassNotFoundException: org.python.core.PyException
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
... 14 more
What do you need to do to use Python UDFs for Pig in Elastic MapReduce?
Answers (4)
Hmm... to clarify some of what I just read here: at this point, using a Python UDF stored on S3 from Pig running on EMR is as simple as this line in your Pig script:
REGISTER 's3://path/to/bucket/udfs.py' using jython as mynamespace;
That is, no classpath modifications are necessary. I'm using this in production right now, though with the caveat that I'm not pulling in any additional Python modules in my UDF. I think that may affect what you need to do to make it work.
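For concreteness, a minimal sketch of what the call site might look like, assuming udfs.py defines a to_upper function (Pig's Jython engine picks up functions decorated with @outputSchema; the file name, function, and paths here are all illustrative):

-- Illustrative Pig script: register the UDF from S3, then call it
REGISTER 's3://path/to/bucket/udfs.py' using jython as mynamespace;
lines = LOAD 's3://path/to/bucket/input.txt' AS (line:chararray);
upper = FOREACH lines GENERATE mynamespace.to_upper(line);
STORE upper INTO 's3://path/to/bucket/output';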
After quite a few wrong turns, I found that, at least on the Elastic MapReduce implementation of Hadoop, Pig seems to ignore the CLASSPATH environment variable. I found instead that I could control the class path using the HADOOP_CLASSPATH variable.
Once I made that realization, it was fairly easy to get things set up to use Python UDFs:
# Install Jython; jython.jar provides the missing org.python.core.PyException
sudo apt-get install jython -y -qq
# Pig on EMR ignores CLASSPATH, so point HADOOP_CLASSPATH at the jars instead;
# antlr-runtime supplies org.antlr.runtime.CharStream, which Pig also needs
export HADOOP_CLASSPATH=/usr/share/java/jython.jar:/usr/share/maven-repo/org/antlr/antlr-runtime/3.2/antlr-runtime-3.2.jar
# Give Jython a world-writable package cache directory
sudo mkdir /usr/share/java/cachedir/
sudo chmod a+rw /usr/share/java/cachedir
I should point out that this seems to directly contradict other advice I found while searching for solutions to this problem. Also, the path to the .py file used in the register statement may be relative or absolute; it doesn't seem to matter.
I faced the same problem recently. Your answer can be simplified: you don't need to install Jython at all or create the cache directory. You do need to include the jython jar in an EMR bootstrap script (or do something similar). I wrote an EMR bootstrap script with the following lines. One can simplify this even further by not using s3cmd at all and instead using your job flow to place the files in a certain directory. Getting the UDF via s3cmd is definitely inconvenient; however, I was unable to register a UDF file directly from S3 when using the EMR version of Pig.
If you are using CharStream, you have to include that jar in the Pig lib path as well. Depending on the framework you use, you can pass these bootstrap scripts as options to your job; EMR supports this via its elastic-mapreduce ruby client. A simple option is to place the bootstrap scripts on S3, as in the sketch below.
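For example, a sketch of attaching both scripts at cluster creation with the elastic-mapreduce ruby client (the bucket and script names are assumptions):

# Bootstrap actions run in the order given, so the s3cmd seeding
# script must come before the one that uses s3cmd
elastic-mapreduce --create --alive \
  --bootstrap-action s3://my-bucket/bootstrap/seed-s3cmd.sh \
  --bootstrap-action s3://my-bucket/bootstrap/seed-jython.sh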
If you are using s3cmd in the bootstrap script, you need another bootstrap script that does something like this, and it should be placed before the other one in bootstrap order. I am moving away from using s3cmd, but for my successful attempt, s3cmd did the trick. Also, the s3cmd executable is already installed in Amazon's Pig image (e.g. AMI version 2.0 with Hadoop version 0.20.205).
Script #1 (Seeding s3cmd)
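A minimal sketch of such a seeding script, assuming all it needs to do is drop a non-interactive s3cmd config into place (the config path and placeholder keys are assumptions):

#!/bin/bash
# Seed an s3cmd config so later bootstrap steps can call s3cmd
# without prompting; replace the placeholder credentials
cat > /home/hadoop/.s3cfg <<'EOF'
[default]
access_key = YOUR_ACCESS_KEY
secret_key = YOUR_SECRET_KEY
EOF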
Script #2 (Seeding jython jars)
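A minimal sketch of the jar-seeding script, assuming the jar lives in your own bucket (the bucket path and destination are assumptions):

#!/bin/bash
# Pull jython.jar out of S3 so Pig can find org.python.core.PyException
s3cmd -c /home/hadoop/.s3cfg get s3://my-bucket/jars/jython.jar /home/hadoop/lib/jython.jar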
As of today, using Pig 0.9.1 on EMR, I found the following is sufficient:
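A sketch of the invocation, assuming the EMR AMI bundles a jython.jar under /home/hadoop/lib/pig/ (that path is an assumption and may differ by AMI version):

# Put jython.jar on the Hadoop classpath for this run only
env HADOOP_CLASSPATH=/home/hadoop/lib/pig/jython.jar pig -f script.pig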
where script.pig does a register of the Python script, but not jython.jar:
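For example (the file name and namespace are illustrative):

-- Inside script.pig: register only the Python UDF file, not jython.jar
REGISTER 'udfs.py' using jython as mynamespace;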