
Best way to process data from an HDFS file one line at a time in CPython (without using stdin)?

Posted on 2024-12-21 23:16:59

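I'd like to use CPython in a Hadoop Streaming job that needs access to supplementary information from a line-oriented file kept in the Hadoop file system. By "supplementary" I mean that this file is in addition to the information delivered via stdin. The supplementary file is large enough that I can't just read it into memory and split on the end-of-line characters. Is there a particularly elegant way (or library) to process this file one line at a time?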

Thanks,

SetJmp


天赋异禀 2024-12-28 23:16:59

Check out the Hadoop Streaming documentation on using the Hadoop Distributed Cache in Streaming jobs. You first upload the file to HDFS, then tell Hadoop to replicate it to the nodes before the job runs, and Hadoop conveniently places a symlink to it in the job's working directory. You can then just use Python's open() and read the file with for line in f or similar.

The distributed cache is the most efficient out-of-the-box way to push files around for a job to use as a resource. You don't want to simply open the HDFS file from your process, because each task would then stream the file over the network. With the distributed cache, only one copy is downloaded per node, even if several tasks run on that node.

First, add -files hdfs://NN:9000/user/sup.txt#sup.txt to your command-line arguments when you run the job.

Then:

with open('sup.txt') as f:
    for line in f:
        # do stuff
        pass
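As a slightly fuller illustration of the same idea, here is a minimal sketch of how a streaming mapper might scan the cached file lazily. It assumes Python 3, a tab-separated key/value layout for sup.txt, and one key per line on stdin; the layout and the helper names iter_records, find_matches and main are hypothetical and do not come from the question.

```python
import sys


def iter_records(path, sep="\t"):
    """Yield (key, value) pairs from a line-oriented file, one line at a time."""
    # Iterating over the file object reads lazily, so the whole file is
    # never held in memory at once.
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            key, _, value = line.partition(sep)
            yield key, value


def find_matches(wanted_keys, path="sup.txt"):
    """Scan the supplementary file once, keeping only the wanted keys."""
    wanted = set(wanted_keys)
    return {k: v for k, v in iter_records(path) if k in wanted}


def main():
    # Assumption: the keys for this task arrive on stdin, one per line, and
    # are few enough to hold in a set. The supplementary data is read from
    # the local symlink created by the "#sup.txt" part of the -files option.
    keys = [line.strip() for line in sys.stdin if line.strip()]
    for key, value in find_matches(keys).items():
        print(f"{key}\t{value}")
```

Calling main() at the bottom of the mapper script would wire this into the job. Only the keys from stdin are kept in memory; the large file itself is read once, line by line.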


眼波传意 2024-12-28 23:16:59

Are you looking for this?

http://pydoop.sourceforge.net/docs/api_docs/hdfs_api.html#module-pydoop.hdfs

import pydoop.hdfs

with pydoop.hdfs.open("supplementary", "r") as supplementary:
    for line in supplementary:
        # process line
        pass
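If installing Pydoop on the task nodes is not an option, a different technique (not the one from this answer) is to pipe the file through the standard hdfs dfs -cat command and read the pipe line by line. The following is only a sketch: it assumes the hdfs executable is on the PATH of the task, and the helper names hdfs_cat_argv and iter_command_lines are made up for illustration.

```python
import subprocess


def hdfs_cat_argv(hdfs_path):
    """Build the argv for `hdfs dfs -cat` (assumes `hdfs` is on PATH)."""
    return ["hdfs", "dfs", "-cat", hdfs_path]


def iter_command_lines(argv, encoding="utf-8"):
    """Run a command and yield its stdout one line at a time."""
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE)
    try:
        # The pipe is consumed incrementally, so only the current line
        # (plus the pipe buffer) is held in memory.
        for raw in proc.stdout:
            yield raw.decode(encoding).rstrip("\r\n")
    finally:
        proc.stdout.close()
        returncode = proc.wait()
    # Only reached when the output was read to the end.
    if returncode != 0:
        raise subprocess.CalledProcessError(returncode, argv)
```

A caller would then write something like for line in iter_command_lines(hdfs_cat_argv("supplementary")). Note that, as with any direct read from HDFS, every task streams the file over the network.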
