What is the best way in CPython to process data from an hdfs file one line at a time (not using stdin)?

I would like to use CPython in a hadoop streaming job that needs access to supplementary information from a line-oriented file kept in a hadoop file system. By "supplementary" I mean that this file is in addition to the information delivered via stdin. The supplementary file is large enough that I can't just slurp it into memory and parse out the end-of-line characters. Is there a particularly elegant way (or library) to process this file one line at a time?

Thanks,

SetJmp

2 Answers

天赋异禀 2024-12-28 23:16:59

Check out the Hadoop Streaming documentation on using the Hadoop Distributed Cache in streaming jobs. You first upload the file to hdfs, then you tell Hadoop to replicate it everywhere before running the job, and it conveniently places a symlink in the working directory of the job. You can then just use Python's open() to read the file with for line in f or whatever.

The distributed cache is the most efficient way to push files around (out of the box) for a job to use as a resource. You do not want to simply open the hdfs file from your process, because every task would then try to stream the file over the network... With the distributed cache, only one copy is downloaded per node, even if several tasks are running on that node.


First, add -files hdfs://NN:9000/user/sup.txt#sup.txt to your command-line arguments when you run the job.

Then:

# 'sup.txt' is the symlink the distributed cache creates in the working directory
with open('sup.txt') as f:
    for line in f:
        pass  # do stuff with each line
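
To make the flow concrete, below is a minimal sketch of a complete streaming mapper built around that snippet. It assumes the job was launched with the -files argument shown above (so sup.txt appears as a symlink in the task's working directory), that both inputs are tab-delimited, and that the set of keys seen on stdin is small enough to hold in memory; the supplementary file itself is still read one line at a time.

#!/usr/bin/env python
# Sketch of a streaming mapper (field layout and join logic are illustrative).
import sys

def main():
    # Collect just the keys from the stdin records (assumed small enough to hold),
    # then make a single line-by-line pass over the distributed-cache copy.
    stdin_keys = set()
    for record in sys.stdin:
        stdin_keys.add(record.rstrip('\n').split('\t', 1)[0])

    # 'sup.txt' is the symlink created by -files hdfs://NN:9000/user/sup.txt#sup.txt
    with open('sup.txt') as f:
        for line in f:  # one line at a time; the file is never slurped into memory
            key, _, value = line.rstrip('\n').partition('\t')
            if key in stdin_keys:
                print('%s\t%s' % (key, value))

if __name__ == '__main__':
    main()

Whether buffering the stdin keys like this is appropriate depends on the job; the point is only that, once the distributed cache has localized the file, a plain Python file iterator consumes it line by line.
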
眼波传意 2024-12-28 23:16:59

Are you looking for this?

http://pydoop.sourceforge.net/docs/api_docs/hdfs_api.html#module-pydoop.hdfs

import pydoop.hdfs

with pydoop.hdfs.open("supplementary", "r") as supplementary:
    for line in supplementary:
        pass  # process line
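
Building on that, a mapper could stream the supplementary file straight from HDFS without ever loading it whole. The sketch below uses a hypothetical HDFS path and assumes tab-delimited records; note that, unlike the distributed-cache approach in the previous answer, every task re-reads the file over the network.

#!/usr/bin/env python
# Sketch of a streaming mapper that reads the supplementary file via pydoop.
# The HDFS path and the per-line processing are illustrative assumptions.
import sys

import pydoop.hdfs as hdfs

def main():
    # Keys seen on stdin (assumed small enough to keep in memory).
    keys = set(line.rstrip('\n').split('\t', 1)[0] for line in sys.stdin)

    # Iterate the HDFS file one line at a time, as in the snippet above,
    # so it never has to fit in memory at once.
    with hdfs.open('/user/someone/supplementary.txt', 'r') as supplementary:
        for line in supplementary:
            key, _, value = line.rstrip('\n').partition('\t')
            if key in keys:
                print('%s\t%s' % (key, value))

if __name__ == '__main__':
    main()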