Code Workbooks - file not found using hadoop_path

Posted 2025-02-05 03:28:07


I have a Python transform in Code Workbooks that is running this code:

import pandas as pd

def contents(dataset_with_files):
    fs = dataset_with_files.filesystem()
    filenames = [f.path for f in fs.ls()]      # relative paths of the files in the dataset
    fp = fs.hadoop_path + "/" + filenames[0]   # build an absolute HDFS path
    with open(fp, 'r') as f:                   # this line raises the FileNotFoundError
        t = f.read()
    rows = {"text": [t]}
    return pd.DataFrame(rows)

But I am getting the error FileNotFoundError: [Errno 2] No such file or directory:

My understanding is that this is the correct way to access a file in HDFS. Is this a Code Repositories versus Code Workbooks limitation?


Comments (1)

你没皮卡萌 2025-02-12 03:28:07


This documentation helped me figure it out:
https://www.palantir.com/docs/foundry/code-workbook/transforms-unstructured/

It was actually a pretty small change. If you are using filesystem(), you only need the relative path.

import pandas as pd

def contents_old(pycel_test):
    fs = pycel_test.filesystem()
    filenames = [f.path for f in fs.ls()]   # relative paths within the dataset
    with fs.open(filenames[0], 'r') as f:   # open by relative path; no hadoop_path needed
        value = f.read()                    # body was elided ("...") in the original; reading the whole file mirrors the question
    rows = {"values": [value]}
    return pd.DataFrame(rows)
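
Applied to the transform from the question, the only change is opening the file through fs.open() with the relative path instead of building an absolute path from hadoop_path (a minimal sketch combining the question's code with this fix):

import pandas as pd

def contents(dataset_with_files):
    fs = dataset_with_files.filesystem()
    filenames = [f.path for f in fs.ls()]
    # fs.open() takes the relative path; no hadoop_path prefix is needed
    with fs.open(filenames[0], 'r') as f:
        t = f.read()
    rows = {"text": [t]}
    return pd.DataFrame(rows)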

There is also this option, but I found it 10x slower.

from pyspark.sql import Row

def contents(dataset_with_files):
    fs = dataset_with_files.filesystem()  # This is the FileSystem object.
    MyRow = Row("column")

    def process_file(file_status):
        # The original left this body elided ("..."); yielding one MyRow per
        # line is one plausible completion that works with flatMap below.
        with fs.open(file_status.path, 'r') as f:
            for line in f:
                yield MyRow(line)

    rdd = fs.files().rdd             # one row of file metadata per file
    rdd = rdd.flatMap(process_file)  # read each file and emit rows
    df = rdd.toDF()
    return df