In my following Python code I can successfully connect to MS Azure SQL DB using an ODBC connection, and can load data into an Azure SQL table using pandas' DataFrame method to_sql(...). But when I use pyspark.pandas instead, the to_sql(...) method fails, stating that no such method is supported. I know the pandas API on Spark has reached about 97% coverage, but I was wondering whether there is an alternate way of achieving the same result while still using ODBC.
Question: In the following code sample, how can we use an ODBC connection with pyspark.pandas to connect to Azure SQL DB and load a DataFrame into a SQL table?
import sqlalchemy as sq
#import pandas as pd
import pyspark.pandas as ps
import datetime

data_df = ps.read_csv('/dbfs/FileStore/tables/myDataFile.csv', low_memory=False, quotechar='"', header='infer')
.......  # (SQLAlchemy engine creation elided)
data_df.to_sql(name='CustomerOrderTable', con=engine, if_exists='append', index=False,
               dtype={'OrderID': sq.VARCHAR(10),
                      'Name': sq.VARCHAR(50),
                      'OrderDate': sq.DATETIME()})
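(For reference, the engine elided above would typically be a SQLAlchemy engine over pyodbc; a minimal sketch, where the server, database, and credential values are placeholders and the driver name assumes ODBC Driver 17 for SQL Server is installed:)

from urllib.parse import quote_plus
import sqlalchemy as sq

# Placeholder connection details -- replace with your own Azure SQL values.
params = quote_plus(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;"
    "DATABASE=mydb;"
    "UID=myuser;"
    "PWD=mypassword"
)
# SQLAlchemy engine over pyodbc, as consumed by pandas' to_sql(...).
engine = sq.create_engine(f"mssql+pyodbc:///?odbc_connect={params}")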
Ref: Pandas API on Spark and this
UPDATE: The data file is about 6.5 GB, with 150 columns and 15 million records. Therefore pandas cannot handle it and, as expected, gives an OOM (out of memory) error.
Comments (2)
I noticed you were appending the data to the table, so this workaround came to mind: break the pyspark.pandas DataFrame into chunks, export each chunk to pandas, and append each chunk to the table from there.
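A minimal sketch of that idea, assuming the SQLAlchemy engine from the question already exists and using a hypothetical chunk size; iloc slicing keeps each slice lazy in Spark, and to_pandas() materializes only one chunk at a time on the driver:

import pyspark.pandas as ps

data_df = ps.read_csv('/dbfs/FileStore/tables/myDataFile.csv', quotechar='"', header='infer')

chunk_size = 500_000  # hypothetical; tune so one chunk fits in driver memory
n_rows = len(data_df)

for start in range(0, n_rows, chunk_size):
    # Slice the pandas-on-Spark DataFrame, then materialize only that slice.
    chunk = data_df.iloc[start:start + chunk_size].to_pandas()
    # Append the small pandas chunk through the existing ODBC/SQLAlchemy engine.
    chunk.to_sql(name='CustomerOrderTable', con=engine, if_exists='append', index=False)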
As per the official pyspark.pandas documentation by Apache Spark, there is no method available in this module that can load a pandas-on-Spark DataFrame into a SQL table. Please see all provided methods here.
As an alternative approach, there are some similar asks mentioned in these SO threads, which might be helpful:
How to write to a Spark SQL table from a Panda data frame using PySpark?
How can I convert a pyspark.sql.dataframe.DataFrame back to a sql table in databricks notebook
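The approach those threads converge on is to convert the pandas-on-Spark DataFrame to a native Spark DataFrame and write it through Spark's built-in JDBC writer (Spark has no ODBC data source, so this sidesteps ODBC rather than using it). A minimal sketch, with placeholder server, database, and credential values, assuming the Microsoft SQL Server JDBC driver is available on the cluster:

import pyspark.pandas as ps

data_df = ps.read_csv('/dbfs/FileStore/tables/myDataFile.csv', quotechar='"', header='infer')

# Convert to a native Spark DataFrame; no data is collected to the driver.
spark_df = data_df.to_spark()

# Placeholder connection values -- replace with your own Azure SQL settings.
jdbc_url = 'jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb'

(spark_df.write
    .format('jdbc')
    .option('url', jdbc_url)
    .option('dbtable', 'CustomerOrderTable')
    .option('user', 'myuser')
    .option('password', 'mypassword')
    .option('driver', 'com.microsoft.sqlserver.jdbc.SQLServerDriver')
    .mode('append')
    .save())

Because the write happens inside Spark executors rather than on the driver, this path also avoids the OOM issue mentioned in the question's update.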