How to use an ODBC connection for pyspark.pandas

Posted 2025-01-23 04:52:00

In the following Python code I can successfully connect to an MS Azure SQL DB using an ODBC connection, and can load data into an Azure SQL table using pandas' DataFrame method to_sql(...). But when I use pyspark.pandas instead, the to_sql(...) method fails, stating that no such method is supported. I know the pandas API on Spark has reached about 97% coverage, but I was wondering whether there is an alternate way of achieving the same while still using ODBC.

Question: In the following code sample, how can we use an ODBC connection with pyspark.pandas to connect to an Azure SQL DB and load a DataFrame into a SQL table?

import sqlalchemy as sq
#import pandas as pd
import pyspark.pandas as ps
import datetime

data_df = ps.read_csv('/dbfs/FileStore/tables/myDataFile.csv', low_memory=False, quotechar='"', header='infer')

.......

data_df.to_sql(name='CustomerOrderTable', con=engine, if_exists='append', index=False,
               dtype={'OrderID': sq.VARCHAR(10),
                      'Name': sq.VARCHAR(50),
                      'OrderDate': sq.DATETIME()})
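
The elided part above presumably constructs the SQLAlchemy engine. For reference, a minimal sketch of how such an ODBC-backed engine is typically created via pyodbc; the server, database, credentials, and driver name below are placeholders, not values from the original post:

import sqlalchemy as sq

# Placeholder ODBC connection string -- substitute your own server, database,
# credentials, and the name of an ODBC driver actually installed on the host.
conn_str = (
    "mssql+pyodbc://<user>:<password>@<server>.database.windows.net:1433/<database>"
    "?driver=ODBC+Driver+17+for+SQL+Server"
)
# fast_executemany batches the INSERTs through pyodbc, which matters at this row count.
engine = sq.create_engine(conn_str, fast_executemany=True)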

Ref: Pandas API on Spark and this

UPDATE: The data file is about 6.5 GB with 150 columns and 15 million records. Plain pandas therefore cannot handle it and, as expected, raises an OOM (out-of-memory) error.
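
For context on the Spark-native route: Spark's DataFrame writers speak JDBC rather than ODBC, so a common workaround is to convert the pandas-on-Spark DataFrame to a plain Spark DataFrame and append through the JDBC writer. A minimal sketch, assuming the Microsoft SQL Server JDBC driver is available on the cluster; the URL and credentials are placeholders:

# pandas-on-Spark -> native Spark DataFrame (no data is collected to the driver).
sdf = data_df.to_spark()

# Placeholder JDBC URL and credentials for an Azure SQL database.
jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;databaseName=<database>"

(sdf.write
    .format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "CustomerOrderTable")
    .option("user", "<user>")
    .option("password", "<password>")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .mode("append")
    .save())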

Comments (2)

千里故人稀 2025-01-30 04:52:00

I noticed you were appending the data to the table, so this workaround came to mind.

Break the pyspark.pandas DataFrame into chunks, export each chunk to pandas, and append each chunk from there.

total = len(data_df)   # triggers a Spark count job over the pandas-on-Spark DataFrame
n = total // 20        # rows per chunk, i.e. split into ~20 chunks

for start in range(0, total, n):
    # Convert one slice at a time, so at most one chunk is ever collected to the driver.
    chunk = data_df.iloc[start:start + n].to_pandas()
    chunk.to_sql(name='CustomerOrderTable', con=engine, if_exists='append', index=False)
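
One design note on this workaround: each chunk is collected to the driver before to_sql runs, so driver memory is bounded by the chunk size rather than the full file. A rough, illustrative check using the sizes from the question:

# ~6.5 GB / 15 M rows ≈ 433 bytes per row on average, so a 750k-row chunk
# (total // 20) occupies roughly 325 MB on the driver; raise the chunk count
# if the driver node is small.
print(6.5e9 / 15_000_000)              # ≈ 433 bytes/row
print(15_000_000 // 20 * 433 / 1e6)    # ≈ 325 MB per chunk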
甜妞爱困 2025-01-30 04:52:00

As per the official pyspark.pandas documentation from Apache Spark, there is no method available in this module that can load a pandas-on-Spark DataFrame into a SQL table.

Please see all provided methods here.

As an alternative approach, similar questions are discussed in these SO threads, which might be helpful:

How to write to a Spark SQL table from a Panda data frame using PySpark?

How can I convert a pyspark.sql.dataframe.DataFrame back to a sql table in databricks notebook
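
For completeness, the linked threads write into a Spark SQL (Databricks) table rather than directly into Azure SQL. A minimal sketch of that approach, assuming an active SparkSession in a Databricks notebook:

# pandas-on-Spark -> native Spark DataFrame, then append to a managed Spark SQL table.
sdf = data_df.to_spark()
sdf.write.mode("append").saveAsTable("CustomerOrderTable")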
