How do I upload .parquet files from my local machine to Azure Data Lake Storage Gen2?

I have a set of .parquet files in my local machine that I am trying to upload to a container in Data Lake Gen2.

I cannot do the following:

def upload_file_to_directory():
    try:

        file_system_client = service_client.get_file_system_client(file_system="my-file-system")

        directory_client = file_system_client.get_directory_client("my-directory")
        
        file_client = directory_client.create_file("uploaded-file.parquet")
        local_file = open("C:\\file-to-upload.parquet",'r')

        file_contents = local_file.read()

        file_client.append_data(data=file_contents, offset=0, length=len(file_contents))

        file_client.flush_data(len(file_contents))

    except Exception as e:
        print(e)

because the .parquet file cannot be read by the .read() function.

When I try to do this:

def upload_file_to_directory():

    file_system_client = service_client.get_file_system_client(file_system="my-file-system")

    directory_client = file_system_client.get_directory_client("my-directory")

    file_client = directory_client.create_file("uploaded-file.parquet")
    file_client.upload_file("C:\\file-to-upload.txt",'r')


I get the following error:

AttributeError: 'DataLakeFileClient' object has no attribute 'upload_file'

Any suggestions?

You are getting this error because DataLakeFileClient has no upload_file method; its upload method is called upload_data. Make sure the Data Lake client library is installed (it is the package that provides both DataLakeServiceClient and DataLakeFileClient):

pip install azure-storage-file-datalake

However, to read the .parquet file, one of the workarounds is to use pandas. Below is the code that worked for me.

storage_account_name = '<ACCOUNT_NAME>'
storage_account_key = '<ACCOUNT_KEY>'

# Authenticate against the account's Data Lake (dfs) endpoint
service_client = DataLakeServiceClient(
    account_url="https://{}.dfs.core.windows.net".format(storage_account_name),
    credential=storage_account_key)

file_system_client = service_client.get_file_system_client(file_system="container")
directory_client = file_system_client.get_directory_client(directory="directory")
file_client = directory_client.create_file("uploaded-file.parquet")

# pd.read_parquet already returns a DataFrame; calling to_parquet() with no
# path serializes it back into parquet bytes that can be uploaded
df = pd.read_parquet("<YOUR_FILE_NAME>.parquet")
data = df.to_parquet()

file_client.upload_data(data=data, overwrite=True)  # either of these two lines works
# file_client.append_data(data=data, offset=0, length=len(data))
file_client.flush_data(len(data))
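
Note that upload_data(data, overwrite=True) creates the file, appends the bytes, and flushes them in a single call, so the trailing flush_data is only needed if you take the commented-out append_data route instead.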

You will also need these imports for the code above to work:

from azure.storage.filedatalake import DataLakeServiceClient
import pandas as pd
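
Incidentally, if the goal is just to copy the file byte-for-byte, the pandas round trip can be skipped entirely: .parquet is a binary format, which is why opening it in text mode ('r') failed in the first attempt. A minimal sketch, assuming the same service_client and directory_client setup as above:

# Upload the raw parquet bytes directly; no pandas needed.
# Reuses directory_client from the snippet above.
file_client = directory_client.get_file_client("uploaded-file.parquet")

with open("C:\\file-to-upload.parquet", "rb") as local_file:  # "rb" = binary mode
    file_client.upload_data(local_file.read(), overwrite=True)

Opening the file in binary mode returns bytes, which upload_data accepts directly.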

RESULTS:

(Screenshot showing the uploaded file in the container.)
