pandas .to_sql() output exceeds size limit

I'm using pyodbc and sqlalchemy to insert data into a table in SQL Server, and I'm getting the error shown in this screenshot:

https://i.sstatic.net/miSp9.png

Here are snippets of the functions I use.

This is the function I use to connect to SQL Server (using fast_executemany):

import pyodbc
from urllib.parse import quote_plus
from sqlalchemy import create_engine

def connect(server, database):
    global cnxn_str, cnxn, cur, quoted, engine
    # Raw ODBC connection string (credentials redacted)
    cnxn_str = ("Driver={SQL Server Native Client 11.0};"
                f"Server={server};"
                f"Database={database};"
                "UID=<user>;"
                "PWD=<password>;")
    cnxn = pyodbc.connect(cnxn_str)
    cur = cnxn.cursor()
    cur.fast_executemany = True

    # Reuse the same ODBC connection string for the SQLAlchemy engine
    quoted = quote_plus(cnxn_str)
    engine = create_engine('mssql+pyodbc:///?odbc_connect={}'.format(quoted),
                           fast_executemany=True)

And this is the function I'm using to query and insert the data into SQL Server:

import pandas as pd

def insert_to_sql_server():
    global df, np_array

    # DataFrame df is built from a numpy array with dtype=object;
    # the first row of the array holds the column names
    df = pd.DataFrame(np_array[1:,], columns=np_array[0])

    # Add new composite-key columns
    df['comp_key'] = df['col1']+"-"+df['col2'].astype(str)
    df['comp_key2'] = df['col3']+"-"+df['col4'].astype(str)+"-"+df['col5'].astype(str)
    df['comp_statusID'] = df['col6']+"-"+df['col7'].astype(str)

    convert_dict = {'col1': 'string', 'col2': 'string', ..., 'col_n': 'string'}

    # Convert column dtypes from object to string
    df = df.astype(convert_dict)

    connect(<server>, <database>)
    cur.rollback()

    # Delete old records (note: if_exists='replace' below drops and
    # recreates the table anyway, so this DELETE is redundant)
    cur.execute("DELETE FROM <table>")
    cur.commit()

    # Insert dataframe into the table
    df.to_sql(<table name>, engine, index=False,
              if_exists='replace', schema='dbo', chunksize=1000, method='multi')

The insert function runs for about 30 minutes before finally returning the error message.

I encountered no errors when doing this with a smaller df. The current df has 27963 rows and 9 columns. One thing I think contributes to the error is the length of the strings. By default the numpy array is dtype='<U25', but I had to override this to dtype='object' because that fixed-width dtype was truncating the text data.
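
As a quick illustration of that truncation (a minimal, hypothetical example; the 40-character string is made up):

import numpy as np

long_text = "x" * 40
arr_u25 = np.array([long_text], dtype='<U25')  # fixed-width: silently keeps only 25 chars
arr_obj = np.array([long_text], dtype=object)  # object dtype: keeps the full string
print(len(arr_u25[0]), len(arr_obj[0]))        # 25 40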

I'm out of ideas, because the error seems to point at a limitation of either pandas or SQL Server, and I'm not familiar with SQL Server.

Thanks

Comments (3)

£冰雨忧蓝° 2025-02-20 18:27:44

Thanks for all the input (still new here)! I accidentally stumbled upon the solution, which was reducing the chunksize in df.to_sql from

df.to_sql(chunksize=1000)

to

df.to_sql(chunksize=200)

After digging, it turns out this is a SQL Server limitation: a single statement accepts at most 2100 parameters, and with method='multi' each chunk consumes one parameter per cell (rows × columns), so the chunksize has to be small enough to stay under that cap (https://discuss.dizzycoding.com/to_sql-pyodbc-count-field-incorrect-or-syntax-error/)
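
A minimal sketch of turning that limit into a rule of thumb, assuming the 2100-parameter cap is the binding constraint (safe_chunksize is a hypothetical helper, not part of pandas):

def safe_chunksize(df, param_limit=2100):
    # With method='multi', every row in a chunk contributes one parameter
    # per column, so rows_per_chunk * n_columns must stay within the limit.
    return max(1, param_limit // len(df.columns))

# e.g. 233 rows per chunk for a 9-column DataFrame
# df.to_sql(<table name>, engine, index=False, if_exists='replace',
#           schema='dbo', chunksize=safe_chunksize(df), method='multi')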

我纯我任性 2025-02-20 18:27:44

In my case, I had the same "Output exceeds the size limit" error, and I fixed it by adding method='multi' to df.to_sql().
First I tried the chunksize solution and it didn't work.
So check whether you're in the same scenario!

with engine.connect().execution_options(autocommit=True) as conn:
    df.to_sql('mytable', con=conn, method='multi', if_exists='replace', index=True)
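
Note: the autocommit execution option used above was deprecated in SQLAlchemy 1.4 and removed in 2.0. On those versions, an explicit transaction block (assuming df and engine are defined as before) achieves the same effect:

with engine.begin() as conn:
    df.to_sql('mytable', con=conn, method='multi', if_exists='replace', index=True)
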
凉城凉梦凉人心 2025-02-20 18:27:44

To show a progress bar while inserting data into a SQL Server table, you can use the tqdm library. Here's an example:

import pandas as pd
from sqlalchemy import create_engine
from tqdm import tqdm

# Define the SQL Server connection string
server = 'your_server_name'
database = 'your_database_name'
table_name = 'your_table_name'
connection_string = f'mssql+pyodbc://@{server}/{database}?trusted_connection=yes&driver=ODBC+Driver+17+for+SQL+Server'

# Create the SQLAlchemy engine
engine = create_engine(connection_string)

# Set the chunk size
chunk_size = 10000  # Adjust this value based on your requirements

# Get the total number of chunks (df is assumed to already exist)
total_chunks = len(df) // chunk_size + (len(df) % chunk_size > 0)

# Load the current table contents so each chunk can be checked for new data
# (assumes the target table already exists)
existing_data = pd.read_sql_table(table_name, con=engine)

# Create a progress bar
progress_bar = tqdm(total=total_chunks, desc='Inserting data', unit='chunk')

# Function to generate chunks of the DataFrame
def chunker(seq, size):
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

# Insert the DataFrame in chunks
for i, chunk in enumerate(chunker(df, chunk_size)):
    # The first chunk replaces the table; later chunks append to it
    replace = 'replace' if i == 0 else 'append'

    # Check if the chunk contains any new data
    if i == 0 or chunk.isin(existing_data).sum().sum() < len(chunk):
        # Insert the chunk into the SQL Server table
        chunk.to_sql(name=table_name, con=engine, if_exists=replace, index=False)

    # Update the progress bar
    progress_bar.update()

# Close the progress bar
progress_bar.close()

# Print a message when the insertion is complete
print('Data insertion complete.')