Efficient way to transfer files between HDFS and S3

Posted 2025-02-04 11:19:21


I'm looking for an efficient way to transfer files between S3 and HDFS. In my project, an Oozie job is kicked off; it processes files, creates tmp files, and then enters a critical section in which it has to obtain a ZooKeeper lock and perform some operations. One of the operations it performs after acquiring the lock is moving files from HDFS to S3. Because of the ZooKeeper lock, we have a few jobs that fail to obtain the lock due to timeout. To make sure no jobs fail because of this timeout issue, I'm trying to improve the efficiency of the file transfer. I can't eliminate the ZooKeeper lock either. I'm using an InterProcessMutex lock.
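
For context, the lock is taken roughly like this (a minimal sketch assuming Apache Curator as the InterProcessMutex implementation; the connection string, lock path, and timeout below are placeholders, not my actual values):

    import java.util.concurrent.TimeUnit;
    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.recipes.locks.InterProcessMutex;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public final class LockedTransferSketch {
        public static void main(String[] args) throws Exception {
            CuratorFramework client = CuratorFrameworkFactory.newClient(
                    "zk-host:2181", new ExponentialBackoffRetry(1000, 3)); // placeholder ensemble
            client.start();

            InterProcessMutex lock = new InterProcessMutex(client, "/locks/hdfs-to-s3"); // placeholder path
            // Bounded wait: if another job holds the lock too long, this call times out.
            if (!lock.acquire(5, TimeUnit.MINUTES)) {                                    // placeholder timeout
                throw new IllegalStateException("Timed out waiting for ZooKeeper lock");
            }
            try {
                // Critical section: this is where the HDFS-to-S3 copy happens.
            } finally {
                lock.release();
                client.close();
            }
        }
    }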

I’ve couple of approaches that I tried.

Approach 1: I tried to use the Apache DistCp API with the changes below, but the project does not build; Maven reports the error shown after the code.

    final String[] args = new String[4];
    args[0] = "-overwrite";
    args[1] = "-pb";
    args[2] = source.toString();
    args[3] = destination.toString();
    LOGGER.info("Copying contents");
    DistCp distCp = null;
    try {
        DistCpOptions distCpOptions = new DistCpOptions.Builder(source, destination)
                .withSyncFolder(true)
                .withCRC(true)
                .withOverwrite(true)
                .build();
        distCp = new DistCp(configuration, distCpOptions);
    } catch (final Exception e) {
        throw new IOException("An exception occurred while creating a DistCp object", e);
    }
    LOGGER.info("Copying contents of source path {} to destination path {} ", source, destination);
    final int distCopyExitCode = distCp.run(args);

Error: To rectify the error below, I followed a suggestion to add the guava-11.0.2 Maven dependency, which didn't fix the issue. Any idea how to fix this?

java.lang.NoClassDefFoundError: org/apache/hadoop/thirdparty/com/google/common/base/Preconditions
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.hadoop.tools.DistCpOptions$Builder.<init>(DistCpOptions.java:530)
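
As a side note, if I read the Hadoop 3.x DistCp code correctly, run(String[]) re-parses the command-line arguments and ignores the DistCpOptions passed to the constructor, so an alternative I'm considering is driving the copy through execute() instead (a sketch under that assumption; configuration, source, and destination are the same variables as in the snippet above, and I left out withSyncFolder since -update and -overwrite appear to be mutually exclusive in DistCp):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.tools.DistCp;
    import org.apache.hadoop.tools.DistCpOptions;

    public final class DistCpExecuteSketch {
        // Runs the copy via DistCp.execute() so the Builder options are actually used.
        public static void copy(Configuration configuration, Path source, Path destination)
                throws IOException {
            DistCpOptions options = new DistCpOptions.Builder(source, destination)
                    .withCRC(true)
                    .withOverwrite(true)   // -overwrite only; not combined with -update
                    .build();
            try {
                DistCp distCp = new DistCp(configuration, options);
                Job job = distCp.execute();   // submits the copy job (blocking by default)
                if (!job.isSuccessful()) {
                    throw new IOException("DistCp job did not complete successfully");
                }
            } catch (IOException e) {
                throw e;
            } catch (Exception e) {
                throw new IOException("DistCp copy from " + source + " to " + destination + " failed", e);
            }
        }
    }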

Approach 2:
I came across the AWS S3DistCp tool, which transfers files between HDFS and S3; however, I didn't find an S3DistCp Java API. The one way to use S3DistCp is from EMR, by creating a step in the EMR cluster (https://docs.aws.amazon.com/code-samples/latest/catalog/java-emr-emr-add-steps.java.html).
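
From what I understand of that example, adding an S3DistCp step from Java would look roughly like the sketch below (assuming the AWS SDK for Java v1 and an already-running cluster; the cluster ID, bucket, and paths are placeholders):

    import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
    import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
    import com.amazonaws.services.elasticmapreduce.model.AddJobFlowStepsRequest;
    import com.amazonaws.services.elasticmapreduce.model.AddJobFlowStepsResult;
    import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
    import com.amazonaws.services.elasticmapreduce.model.StepConfig;

    public final class S3DistCpStepSketch {
        public static void main(String[] args) {
            AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();

            // command-runner.jar lets an EMR step invoke s3-dist-cp with plain CLI arguments.
            HadoopJarStepConfig jarStep = new HadoopJarStepConfig()
                    .withJar("command-runner.jar")
                    .withArgs("s3-dist-cp",
                            "--src=hdfs:///tmp/output/",        // placeholder HDFS path
                            "--dest=s3://my-bucket/output/");   // placeholder bucket/prefix

            StepConfig step = new StepConfig()
                    .withName("hdfs-to-s3-copy")
                    .withActionOnFailure("CONTINUE")
                    .withHadoopJarStep(jarStep);

            AddJobFlowStepsResult result = emr.addJobFlowSteps(
                    new AddJobFlowStepsRequest()
                            .withJobFlowId("j-XXXXXXXXXXXXX")   // placeholder cluster ID
                            .withSteps(step));

            System.out.println("Submitted step ids: " + result.getStepIds());
        }
    }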


In my scenario, I have an EMR step that actually kicks off the jobs; a job processes input files, then moves files from HDFS to S3, and then terminates.

If I use the solution provided in the link above, the steps below would be the sequence of actions.

  1. Oozie job kicks off
  2. Job processes input files and creates tmp files in HDFS
  3. Obtain ZooKeeper lock
  4. Create EMR step to trigger file transfer using S3DistCp
  5. Transfer files from HDFS to S3
  6. EMR step completes
  7. Job completes

If I use this approach, since there are multiple parallel jobs being kicked off, a new EMR step will be created for each job. Correct me if I got that wrong. Can anyone provide suggestions on how to approach this?


Comments (1)

潇烟暮雨 2025-02-11 11:19:21

java.lang.NoClassDefFoundError: org/apache/hadoop/thirdparty/com/google/common/base/Preconditions

This indicates a missing hadoop-thirdparty dependency. Adding the following should resolve the error:

    <dependency>
        <groupId>org.apache.hadoop.thirdparty</groupId>
        <artifactId>hadoop-shaded-guava</artifactId>
        <version>${hadoop-thirdparty-guava.version}</version>
    </dependency>

You can choose the version of this dependency that corresponds to your Hadoop version; the latest is 1.1.1 for Hadoop 3.3.3.
