In my following Python code I can successfully connect to MS Azure SQL DB using an ODBC connection, and can load data into an Azure SQL table using pandas' DataFrame method to_sql(...). But when I use pyspark.pandas instead, the to_sql(...) method fails, stating that no such method is supported. I know the pandas API on Spark has reached about 97% coverage, but I was wondering whether there is an alternate way of achieving the same result while still using ODBC.
Question: In the following code sample, how can we use an ODBC connection with pyspark.pandas to connect to Azure SQL DB and load a DataFrame into a SQL table?
import sqlalchemy as sq
#import pandas as pd
import pyspark.pandas as ps
import datetime

data_df = ps.read_csv('/dbfs/FileStore/tables/myDataFile.csv', low_memory=False, quotechar='"', header='infer')
.......  # (SQLAlchemy engine creation elided)
data_df.to_sql(name='CustomerOrderTable', con=engine, if_exists='append', index=False,
               dtype={'OrderID': sq.VARCHAR(10),
                      'Name': sq.VARCHAR(50),
                      'OrderDate': sq.DATETIME()})
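(For reference, the engine elided above would typically be a SQLAlchemy engine over pyodbc; a minimal sketch, where the server, database, and credential values are placeholders and the driver name assumes ODBC Driver 17 for SQL Server is installed:)

from urllib.parse import quote_plus
import sqlalchemy as sq

# Placeholder connection details -- replace with your own Azure SQL values.
params = quote_plus(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;"
    "DATABASE=mydb;"
    "UID=myuser;"
    "PWD=mypassword"
)
# SQLAlchemy engine over pyodbc, as consumed by pandas' to_sql(...).
engine = sq.create_engine(f"mssql+pyodbc:///?odbc_connect={params}")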
Ref: Pandas API on Spark and this
UPDATE: The data file is about 6.5 GB, with 150 columns and 15 million records. Therefore pandas cannot handle it and, as expected, gives an OOM (out of memory) error.
Comments (2)
I noticed you were appending the data to the table, so this workaround came to mind: break the pyspark.pandas DataFrame into chunks, export each chunk to pandas, and append each chunk to the table from there.
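A minimal sketch of that idea, assuming the SQLAlchemy engine from the question already exists and using a hypothetical chunk size; iloc slicing keeps each slice lazy in Spark, and to_pandas() materializes only one chunk at a time on the driver:

import pyspark.pandas as ps

data_df = ps.read_csv('/dbfs/FileStore/tables/myDataFile.csv', quotechar='"', header='infer')

chunk_size = 500_000  # hypothetical; tune so one chunk fits in driver memory
n_rows = len(data_df)

for start in range(0, n_rows, chunk_size):
    # Slice the pandas-on-Spark DataFrame, then materialize only that slice.
    chunk = data_df.iloc[start:start + chunk_size].to_pandas()
    # Append the small pandas chunk through the existing ODBC/SQLAlchemy engine.
    chunk.to_sql(name='CustomerOrderTable', con=engine, if_exists='append', index=False)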
As per the official pyspark.pandas documentation by Apache Spark, there is no method available in this module that can load a pandas-on-Spark DataFrame into a SQL table. Please see all provided methods here.
As an alternative approach, there are some similar asks mentioned in these SO threads, which might be helpful:
How to write to a Spark SQL table from a Panda data frame using PySpark?
How can I convert a pyspark.sql.dataframe.DataFrame back to a sql table in databricks notebook
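The approach those threads converge on is to convert the pandas-on-Spark DataFrame to a native Spark DataFrame and write it through Spark's built-in JDBC writer (Spark has no ODBC data source, so this sidesteps ODBC rather than using it). A minimal sketch, with placeholder server, database, and credential values, assuming the Microsoft SQL Server JDBC driver is available on the cluster:

import pyspark.pandas as ps

data_df = ps.read_csv('/dbfs/FileStore/tables/myDataFile.csv', quotechar='"', header='infer')

# Convert to a native Spark DataFrame; no data is collected to the driver.
spark_df = data_df.to_spark()

# Placeholder connection values -- replace with your own Azure SQL settings.
jdbc_url = 'jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb'

(spark_df.write
    .format('jdbc')
    .option('url', jdbc_url)
    .option('dbtable', 'CustomerOrderTable')
    .option('user', 'myuser')
    .option('password', 'mypassword')
    .option('driver', 'com.microsoft.sqlserver.jdbc.SQLServerDriver')
    .mode('append')
    .save())

Because the write happens inside Spark executors rather than on the driver, this path also avoids the OOM issue mentioned in the question's update.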