About Hadoop FileSystem transferFromLocalFile

Posted on 2024-12-29 08:09:13

I am writing code to transfer files to Hadoop HDFS in parallel, so I have many threads calling filesystem.copyFromLocalFile.

I think the cost of opening a filesystem is not small, so my project opens only one. I thought there might be a problem with so many threads calling it at the same time, but so far it has worked fine.

Could anyone please give me some information about this copy method?
Thank you very much, and have a great weekend.

Comments (2)

猫九 2025-01-05 08:09:13

I see the following design points to consider:
a) Where will the bottleneck be? I think that with 2-3 parallel copy operations the local disk or 1 Gb Ethernet will become the bottleneck. You can do this as a multithreaded application, or you can run a few processes. Either way, I do not think you need a high level of parallelism.
b) Error handling. The failure of one thread should not stop the whole process, and at the same time no file should be lost. What I usually do in such cases is accept that, in the worst case, a file may be copied twice. If that is OK, the system can work in a simple "copy then delete" scenario.
c) If you copy from one of the cluster nodes, HDFS will become unbalanced, since one replica will be stored on the host you copy from. You will need to rebalance constantly.
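Points a) and b) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the way Hadoop itself does it: `Uploader`, `ParallelCopySketch` and `copyAll` are hypothetical names, and `Uploader` stands in for a call to `copyFromLocalFile` on the one shared `FileSystem`. The pool is kept small, a failure in one task does not stop the others, and the local file is deleted only after its copy has succeeded.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical stand-in for a call such as fs.copyFromLocalFile(src, dst). */
interface Uploader {
    void upload(Path localFile) throws IOException;
}

class ParallelCopySketch {

    /**
     * Uploads every file using at most `threads` concurrent copies
     * and returns the files whose copy (or local delete) failed.
     */
    static List<Path> copyAll(List<Path> files, Uploader uploader, int threads)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Boolean>> results = new ArrayList<>();
            for (Path file : files) {
                results.add(pool.submit(() -> {
                    try {
                        uploader.upload(file);
                        // "Copy then delete": remove the source only after a successful copy.
                        // If the delete fails, the file is reported as failed and a retry
                        // would copy it a second time (the accepted worst case).
                        Files.deleteIfExists(file);
                        return true;
                    } catch (IOException e) {
                        // One failed file must not stop the remaining copies.
                        return false;
                    }
                }));
            }
            List<Path> failed = new ArrayList<>();
            for (int i = 0; i < files.size(); i++) {
                try {
                    if (!results.get(i).get()) {
                        failed.add(files.get(i));
                    }
                } catch (ExecutionException e) {
                    // Unchecked exception thrown inside the task.
                    failed.add(files.get(i));
                }
            }
            return failed;
        } finally {
            pool.shutdown();
        }
    }
}
```

A caller might pass 2 or 3 for `threads`, following the estimate in a), and retry whatever `copyAll` returns. Since a retried file may already exist on the destination, the real upload call would need to tolerate overwriting.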

夜司空 2025-01-05 08:09:13

Can you tell me what more information you want about copyFromLocalFile()?

I'm not sure, but I guess that in your case the threads share the same resource. Since you have only one instance of FileSystem, each thread will probably share this object on a time-sharing basis.
