How do you use Python UDFs with Pig in Elastic MapReduce?

Published 2025-01-06 03:25:07

I really want to take advantage of Python UDFs in Pig on our AWS Elastic MapReduce cluster, but I can't quite get things to work properly. No matter what I try, my Pig job fails with the following exception logged:

ERROR 2998: Unhandled internal error. org/python/core/PyException

java.lang.NoClassDefFoundError: org/python/core/PyException
        at org.apache.pig.scripting.jython.JythonScriptEngine.registerFunctions(JythonScriptEngine.java:127)
        at org.apache.pig.PigServer.registerCode(PigServer.java:568)
        at org.apache.pig.tools.grunt.GruntParser.processRegister(GruntParser.java:421)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:419)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:188)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:164)
        at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
        at org.apache.pig.Main.run(Main.java:437)
        at org.apache.pig.Main.main(Main.java:111)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: java.lang.ClassNotFoundException: org.python.core.PyException
        at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
        ... 14 more

What do you need to do to use Python UDFs for Pig in Elastic MapReduce?

Comments (4)

迷乱花海 2025-01-13 03:25:07

Hmm... to clarify some of what I just read here: at this point, using a Python UDF in Pig running on EMR, with the UDF stored on S3, is as simple as this line in your Pig script:

REGISTER 's3://path/to/bucket/udfs.py' using jython as mynamespace

That is, no classpath modifications are necessary. I'm using this in production right now, though with the caveat that I'm not pulling in any additional Python modules in my UDF. I think that may affect what you need to do to make it work.
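
For reference, a minimal sketch of what such a udfs.py might look like (the function, field, and relation names below are made up for illustration, not taken from the original post). Pig's Jython engine makes the outputSchema decorator available to the script when it is registered, so the file should not need any imports:

# udfs.py -- hypothetical example of a Jython UDF file registered from S3.
# Pig's Jython script engine supplies the outputSchema decorator at registration time.

@outputSchema("normalized:chararray")
def normalize(value):
    # Trim whitespace and lowercase a single chararray field; pass nulls through.
    if value is None:
        return None
    return value.strip().lower()

In the Pig script it would then be called through the registered namespace, along the lines of cleaned = FOREACH raw GENERATE mynamespace.normalize(name); where raw and name are, again, hypothetical names.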

强者自强 2025-01-13 03:25:07

After quite a few wrong turns, I found that, at least on the Elastic MapReduce implementation of Hadoop, Pig seems to ignore the CLASSPATH environment variable. I found that I could instead control the classpath using the HADOOP_CLASSPATH variable.

Once I made that realization, it was fairly easy to get things set up to use Python UDFs:

  • Install Jython
    • sudo apt-get install jython -y -qq
  • Set the HADOOP_CLASSPATH environment variable.
    • export HADOOP_CLASSPATH=/usr/share/java/jython.jar:/usr/share/maven-repo/org/antlr/antlr-runtime/3.2/antlr-runtime-3.2.jar
      • jython.jar ensures that Hadoop can find the PyException class
      • antlr-runtime-3.2.jar ensures that Hadoop can find the CharStream class
  • Create the cache directory for Jython (this is documented in the Jython FAQ)
    • sudo mkdir /usr/share/java/cachedir/
    • sudo chmod a+rw /usr/share/java/cachedir

I should point out that this seems to directly contradict other advice I found while searching for solutions to this problem:

  • Setting the CLASSPATH and PIG_CLASSPATH environment variables doesn't seem to do anything.
  • The .py file containing the UDF does not need to be included in the HADOOP_CLASSPATH environment variable.
  • The path to the .py file used in the Pig register statement may be relative or absolute; it doesn't seem to matter.

清晰传感 2025-01-13 03:25:07

I faced the same problem recently. Your answer can be simplified: you don't need to install Jython at all or create the cache directory. You do need to include the Jython jar in the EMR bootstrap script (or do something similar). I wrote an EMR bootstrap script with the following lines. One can simplify this even further by not using s3cmd at all and instead using your job flow to place the files in a certain directory. Getting the UDF via s3cmd is definitely inconvenient; however, I was unable to register a UDF file on S3 when using the EMR version of Pig.

If you are using CharStream, you also have to include that jar in the piglib path. Depending on the framework you use, you can pass these bootstrap scripts as options to your job; EMR supports this via its elastic-mapreduce Ruby client. A simple option is to place the bootstrap scripts on S3.

If you are using s3cmd in the bootstrap script, you need another bootstrap script that does something like this. That script should be placed before the other one in bootstrap order. I am moving away from using s3cmd, but for my successful attempt, s3cmd did the trick. Also, the s3cmd executable is already installed in Amazon's Pig image (e.g. AMI version 2.0 with Hadoop version 0.20.205).

Script #1 (Seeding s3cmd)

#!/bin/bash
cat <<-OUTPUT > /home/hadoop/.s3cfg
[default]
access_key = YOUR KEY
bucket_location = US
cloudfront_host = cloudfront.amazonaws.com
cloudfront_resource = /2010-07-15/distribution
default_mime_type = binary/octet-stream
delete_removed = False
dry_run = False
encoding = UTF-8
encrypt = False
follow_symlinks = False
force = False
get_continue = False
gpg_command = /usr/local/bin/gpg
gpg_decrypt = %(gpg_command)s -d --verbose --no-use-agent --batch --yes --passphrase-fd %(passphrase_fd)s -o %(output_file)s %(input_file)s
gpg_encrypt = %(gpg_command)s -c --verbose --no-use-agent --batch --yes --passphrase-fd %(passphrase_fd)s -o %(output_file)s %(input_file)s
gpg_passphrase = YOUR PASSPHRASE
guess_mime_type = True
host_base = s3.amazonaws.com
host_bucket = %(bucket)s.s3.amazonaws.com
human_readable_sizes = False
list_md5 = False
log_target_prefix =
preserve_attrs = True
progress_meter = True
proxy_host =
proxy_port = 0
recursive = False
recv_chunk = 4096
reduced_redundancy = False
secret_key = YOUR SECRET
send_chunk = 4096
simpledb_host = sdb.amazonaws.com
skip_existing = False
socket_timeout = 10
urlencoding_mode = normal
use_https = False
verbosity = WARNING
OUTPUT

Script #2 (Seeding Jython jars)

#!/bin/bash
set -e

s3cmd get <jython.jar>
# Very useful for extra libraries not available in the jython jar. I got these libraries from the 
# jython site and created a jar archive.
s3cmd get <jython_extra_libs.jar>
s3cmd get <UDF>

PIG_LIB_PATH=/home/hadoop/piglibs

mkdir -p $PIG_LIB_PATH

mv <jython.jar> $PIG_LIB_PATH
mv <jython_extra_libs.jar> $PIG_LIB_PATH
mv <UDF> $PIG_LIB_PATH

# Change hadoop classpath as well.
echo "HADOOP_CLASSPATH=$PIG_LIB_PATH/<jython.jar>:$PIG_LIB_PATH/<jython_extra_libs.jar>" >>    /home/hadoop/conf/hadoop-user-env.sh

避讳 2025-01-13 03:25:07

As of today, using Pig 0.9.1 on EMR, I have found that the following is sufficient:

env HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/path/to/jython.jar pig -f script.pig

where script.pig registers the Python script, but not jython.jar:

register Pig-UDFs/udfs.py using jython as mynamespace;
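
Nothing about the UDF file itself changes with this approach; only the classpath handling differs. As a hedged illustration of what a file like Pig-UDFs/udfs.py could contain beyond simple scalar functions (the names and schema below are hypothetical), a Jython UDF can also declare a structured output schema, such as a bag of tuples:

# udfs.py -- hypothetical example; Pig's Jython engine supplies the outputSchema decorator.

@outputSchema("words:bag{t:tuple(word:chararray)}")
def tokenize(line):
    # Split a line of text into a bag of single-field tuples; return an empty bag for nulls.
    if line is None:
        return []
    return [(word,) for word in line.split()]

A corresponding call in script.pig would look something like tokens = FOREACH lines GENERATE FLATTEN(mynamespace.tokenize(text)); with lines and text again being hypothetical names.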