Code Workbooks - file not found using hadoop_path

Posted 2025-02-05 03:28:07


I have a Python transform in Code Workbooks that is running this code:

import pandas as pd

def contents(dataset_with_files):
    fs = dataset_with_files.filesystem()
    filenames = [f.path for f in fs.ls()]      # relative paths of the files in the dataset
    fp = fs.hadoop_path + "/" + filenames[0]   # build an absolute HDFS path
    with open(fp, 'r') as f:                   # this line raises the FileNotFoundError
        t = f.read()
    rows = {"text": [t]}
    return pd.DataFrame(rows)

But I am getting the error FileNotFoundError: [Errno 2] No such file or directory:

My understanding is that this is the correct way to access a file in HDFS. Is this a Code Repositories versus Code Workbooks limitation?


Comments (1)

你没皮卡萌 2025-02-12 03:28:07


This documentation helped me figure it out:
https://www.palantir.com/docs/foundry/code-workbook/transforms-unstructured/

It was actually a pretty small change. If you are using filesystem(), you only need the relative path.

import pandas as pd

def contents_old(pycel_test):
    fs = pycel_test.filesystem()
    filenames = [f.path for f in fs.ls()]   # relative paths within the dataset
    with fs.open(filenames[0], 'r') as f:   # open by relative path; no hadoop_path needed
        value = f.read()                    # body was elided ("...") in the original; reading the whole file mirrors the question
    rows = {"values": [value]}
    return pd.DataFrame(rows)
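
Applied to the transform from the question, the only change is opening the file through fs.open() with the relative path instead of building an absolute path from hadoop_path (a minimal sketch combining the question's code with this fix):

import pandas as pd

def contents(dataset_with_files):
    fs = dataset_with_files.filesystem()
    filenames = [f.path for f in fs.ls()]
    # fs.open() takes the relative path; no hadoop_path prefix is needed
    with fs.open(filenames[0], 'r') as f:
        t = f.read()
    rows = {"text": [t]}
    return pd.DataFrame(rows)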

There is also this option, but I found it 10x slower.

from pyspark.sql import Row

def contents(dataset_with_files):
    fs = dataset_with_files.filesystem()  # This is the FileSystem object.
    MyRow = Row("column")

    def process_file(file_status):
        # The original left this body elided ("..."); yielding one MyRow per
        # line is one plausible completion that works with flatMap below.
        with fs.open(file_status.path, 'r') as f:
            for line in f:
                yield MyRow(line)

    rdd = fs.files().rdd             # one row of file metadata per file
    rdd = rdd.flatMap(process_file)  # read each file and emit rows
    df = rdd.toDF()
    return df