在Linux上使用pyodbc在nvarchar mssql字段中插入unicode或utf-8字符

发布于 2024-07-23 12:51:29 字数 2413 浏览 7 评论 0原文

我正在使用 Ubuntu 9.04

我已经安装了以下软件包版本：

unixodbc and unixodbc-dev: 2.2.11-16build3
tdsodbc: 0.82-4
libsybdb5: 0.82-4
freetds-common and freetds-dev: 0.82-4

我已经配置了 /etc/unixodbc.ini 如下：

[FreeTDS]
Description             = TDS driver (Sybase/MS SQL)
Driver          = /usr/lib/odbc/libtdsodbc.so
Setup           = /usr/lib/odbc/libtdsS.so
CPTimeout               = 
CPReuse         = 
UsageCount              = 2

我已经配置了 /etc/freetds/ freetds.conf 像这样：

[global]
    tds version = 8.0
    client charset = UTF-8

我从 http://github.com/mkleehammer/pyodbc 获取了 pyodbc 修订版 31e2fae4adbf1b2af1726e5668a3414cf46b454f 并使用“安装它python setup.py install"

我的本地网络上有一台安装了 Microsoft SQL Server 2000 的 Windows 计算机，正在监听本地 IP 地址 10.32.42.69。我创建了一个名为“Common”的空数据库。我的用户“sa”和密码“secret”具有完全权限。

我正在使用以下 python 代码来设置连接：

import pyodbc
odbcstring = "SERVER=10.32.42.69;UID=sa;PWD=secret;DATABASE=Common;DRIVER=FreeTDS"
con = pyodbc.connect(s)
cur = con.cursor()
cur.execute('''
CREATE TABLE testing (
    id INTEGER NOT NULL IDENTITY(1,1), 
    name NVARCHAR(200) NULL, 
    PRIMARY KEY (id)
)
    ''')
con.commit()

到目前为止，一切都正常。我在服务器上使用了 SQLServer 的企业管理器，新表就在那里。现在我想在表中插入一些数据。

cur = con.cursor()
cur.execute('INSERT INTO testing (name) VALUES (?)', (u'something',))

那失败了！这是我收到的错误：

pyodbc.Error: ('HY004', '[HY004] [FreeTDS][SQL Server]Invalid data type 
(0) (SQLBindParameter)'

由于我的客户端配置为使用 UTF-8，我认为可以通过将数据编码为 UTF-8 来解决。这可行，但后来我得到了奇怪的数据：

cur = con.cursor()
cur.execute('DELETE FROM testing')
cur.execute('INSERT INTO testing (name) VALUES (?)', (u'somé string'.encode('utf-8'),))
con.commit()
# fetching data back
cur = con.cursor()
cur.execute('SELECT name FROM testing')
data = cur.fetchone()
print type(data[0]), data[0]

这没有给出错误，但返回的数据与发送的数据不同！我得到：

<type 'unicode'> somÃ© string

也就是说，pyodbc不会直接接受unicode对象，但它会将unicode对象返回给我！而且编码是混淆的！

现在的问题是：

我想要代码在 NVARCHAR 和/或 NTEXT 字段中插入 unicode 数据。当我查询回来时，我想要插入回来的相同数据。

这可以通过以不同的方式配置系统，或者使用能够在插入或检索时正确地将数据转换为 unicode 或从 unicode 转换来的包装函数来实现。

这并不要求太多，不是吗？

原文

I am using Ubuntu 9.04

I have installed the following package versions:

unixodbc and unixodbc-dev: 2.2.11-16build3
tdsodbc: 0.82-4
libsybdb5: 0.82-4
freetds-common and freetds-dev: 0.82-4

I have configured /etc/unixodbc.ini like this:

[FreeTDS]
Description             = TDS driver (Sybase/MS SQL)
Driver          = /usr/lib/odbc/libtdsodbc.so
Setup           = /usr/lib/odbc/libtdsS.so
CPTimeout               = 
CPReuse         = 
UsageCount              = 2

I have configured /etc/freetds/freetds.conf like this:

[global]
    tds version = 8.0
    client charset = UTF-8

I have grabbed pyodbc revision 31e2fae4adbf1b2af1726e5668a3414cf46b454f from http://github.com/mkleehammer/pyodbc and installed it using "python setup.py install"

I have a windows machine with Microsoft SQL Server 2000 installed on my local network, up and listening on the local ip address 10.32.42.69. I have an empty database created with name "Common". I have the user "sa" with password "secret" with full priviledges.

I am using the following python code to setup the connection:

import pyodbc
odbcstring = "SERVER=10.32.42.69;UID=sa;PWD=secret;DATABASE=Common;DRIVER=FreeTDS"
con = pyodbc.connect(s)
cur = con.cursor()
cur.execute('''
CREATE TABLE testing (
    id INTEGER NOT NULL IDENTITY(1,1), 
    name NVARCHAR(200) NULL, 
    PRIMARY KEY (id)
)
    ''')
con.commit()

Everything WORKS up to this point. I have used SQLServer's Enterprise Manager on the server and the new table is there.
Now I want to insert some data on the table.

cur = con.cursor()
cur.execute('INSERT INTO testing (name) VALUES (?)', (u'something',))

That fails!! Here's the error I get:

pyodbc.Error: ('HY004', '[HY004] [FreeTDS][SQL Server]Invalid data type 
(0) (SQLBindParameter)'

Since my client is configured to use UTF-8 I thought I could solve by encoding data to UTF-8. That works, but then I get back strange data:

cur = con.cursor()
cur.execute('DELETE FROM testing')
cur.execute('INSERT INTO testing (name) VALUES (?)', (u'somé string'.encode('utf-8'),))
con.commit()
# fetching data back
cur = con.cursor()
cur.execute('SELECT name FROM testing')
data = cur.fetchone()
print type(data[0]), data[0]

That gives no error, but the data returned is not the same data sent! I get:

<type 'unicode'> somÃ© string

That is, pyodbc won't accept an unicode object directly, but it returns unicode objects back to me! And the encoding is being mixed up!

Now for the question:

I want code to insert unicode data in a NVARCHAR and/or NTEXT field. When I query back, I want the same data I inserted back.

That can be by configuring the system differently, or by using a wrapper function able to convert the data correctly to/from unicode when inserting or retrieving

That's not asking much, is it?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

小瓶盖 2024-07-30 12:51:29

我记得使用 odbc 驱动程序时遇到过这种愚蠢的问题，即使当时是 java+oracle 组合。

核心是 odbc 驱动程序在将查询字符串发送到数据库时显然对其进行了编码。即使该字段是 Unicode，并且如果您提供 Unicode，在某些情况下似乎也没关系。

您需要确保驱动程序发送的内容与您的数据库（不仅是服务器，还包括数据库）具有相同的编码。否则，您当然会得到奇怪的字符，因为客户端或服务器在编码/或解码时会混淆事物。您是否知道您的服务器用作默认解码数据的字符集（MS 喜欢说的代码点）？

排序规则与此问题无关:)

请参阅该 MS 页面例如。对于 Unicode 字段，排序规则仅用于定义列中的排序顺序，不用于指定数据的存储方式。

如果您将数据存储为 Unicode，则有一种独特的方式来表示它，这就是 Unicode 的目的：无需定义与您将要使用的所有语言兼容的字符集:)

这里的问题是“什么”当我向服务器提供非 Unicode 数据时会发生这种情况吗？”。例如：

当我向服务器发送一个UTF-8字符串时，它如何理解它？
当我向服务器发送 UTF-16 字符串时，它如何理解它？
当我向服务器发送 Latin1 字符串时，它如何理解它？

从服务器的角度来看，这3个字符串都只是一个字节流。服务器无法猜测您对它们进行编码的编码。这意味着如果您的 odbc 客户端最终向服务器发送字节串（编码字符串）而不是发送unicode数据，您将会遇到麻烦：如果这样做，服务器将使用预定义的编码（这就是我的问题：服务器将使用什么编码？因为它不是猜测，所以它必须是一个参数值），并且如果字符串已使用不同的编码进行编码，dzing，数据将被损坏。

它与 Python 中的操作完全相同：

uni = u'Hey my name is André'
in_utf8 = uni.encode('utf-8')
# send the utf-8 data to server
# send(in_utf8)

# on server side
# server receives it. But server is Japanese.
# So the server treats the data with the National charset, shift-jis:
some_string = in_utf8 # some_string = receive()    
decoded = some_string.decode('sjis')

尝试一下。好有趣。解码后的字符串应该是“Hey my name is André”，但实际上是“Hey my name is Andrﾃｩ”。 é 被日语取代ﾃｩ

因此我的建议：您需要确保 pyodbc 能够直接以 Unicode 形式发送数据。如果 pyodbc 未能做到这一点，您将得到意想不到的结果。

我以客户端到服务器的方式描述了这个问题。但是当从服务器向客户端进行通信时，可能会出现同样类型的问题。如果客户端无法理解 Unicode 数据，您可能会遇到麻烦。

FreeTDS 为您处理 Unicode。

实际上，FreeTDS 会为您处理所有事情并将所有数据转换为 UCS2 unicode。（来源）。

服务器<--> FreeTDS ：UCS2 数据
FreeTDS <--> pyodbc ：编码字符串，以 UTF-8 编码（来自 /etc/freetds/freetds.conf）

因此，如果您将 UTF-8 数据传递给 pyodbc，我希望您的应用程序能够正常工作。事实上，正如django-pyodbc 票证所述， django-pyodbc 以 UTF-8 与 pyodbc 进行通信，所以你应该没问题。

FreeTDS 0.82

但是，cramm0 表示 FreeTDS 0.82并非完全没有错误，0.82 和官方修补的 0.82 版本之间存在显着差异，可以在此处找到。您可能应该尝试使用修补后的 FreeTDS

已编辑：删除了旧数据，这些数据与 FreeTDS 无关，仅与 Easysoft 商业 odbc 驱动程序相关。抱歉。

I can remember having this kind of stupid problems using odbc drivers, even if that time it was a java+oracle combination.

The core thing is that odbc driver apparently encodes the query string when sending it to the DB. Even if the field is Unicode, and if you provide Unicode, in some cases it does not seem to matter.

You need to ensure that what is sent by the driver has the same encoding as your Database (not only server, but also database). Otherwise, of course you get funky characters because either the client or the server is mixing things up when encoding/or decoding. Do you have any idea of the charset (codepoint as MS like to say) that your server is using as a default for decoding data?

Collation has nothing to do with this problem :)

See that MS page for example. For Unicode fields, collation is used only to define the sort order in the column, not to specify how the data is stored.

If you store your data as Unicode, there is an Unique way to represent it, that's the purpose of Unicode: no need to define a charset that is compatible with all the languages that you are going to use :)

The question here is "what happens when I give data to the server that is not Unicode?". For example:

When I send an UTF-8 string to the server, how does it understand it?
When I send an UTF-16 string to the server, how does it understand it?
When I send a Latin1 string to the server, how does it understand it?

From the server perspective, all these 3 strings are only a stream of bytes. The server cannot guess the encoding in which you encoded them. Which means that you will get troubles if your odbc client ends up sending bytestrings (an encoded string) to the server instead of sending unicode data: if you do so, the server will use a predefined encoding (that was my question: what encoding the server will use? Since it is not guessing, it must be a parameter value), and if the string had been encoded using a different encoding, dzing, data will get corrupted.

It's exactly similar as doing in Python:

uni = u'Hey my name is André'
in_utf8 = uni.encode('utf-8')
# send the utf-8 data to server
# send(in_utf8)

# on server side
# server receives it. But server is Japanese.
# So the server treats the data with the National charset, shift-jis:
some_string = in_utf8 # some_string = receive()    
decoded = some_string.decode('sjis')

Just try it. It's fun. The decoded string is supposed to be "Hey my name is André", but is "Hey my name is Andrﾃｩ". é gets replaced by Japanese ﾃｩ

Hence my suggestion: you need to ensure that pyodbc is able to send directly the data as Unicode. If pyodbc fails to do this, you will get unexpected results.

And I described the problem in the Client to Server way. But the same sort of issues can arise when communicating back from the Server to the Client. If the Client cannot understand Unicode data, you'll likely get into troubles.

FreeTDS handles Unicode for you.

Actually, FreeTDS takes care of things for you and translates all the data to UCS2 unicode. (Source).

Server <--> FreeTDS : UCS2 data
FreeTDS <--> pyodbc : encoded strings, encoded in UTF-8 (from /etc/freetds/freetds.conf)

So I would expect your application to work correctly if you pass UTF-8 data to pyodbc. In fact, as this django-pyodbc ticket states, django-pyodbc communicates in UTF-8 with pyodbc, so you should be fine.

FreeTDS 0.82

However, cramm0 says that FreeTDS 0.82 is not completely bugfree, and that there are significant differences between 0.82 and the official patched 0.82 version that can be found here. You should probably try using the patched FreeTDS

Edited: removed old data, which had nothing to do with FreeTDS but was only relevant to Easysoft commercial odbc driver. Sorry.

回复收藏 0 原文

谜兔 2024-07-30 12:51:29

我使用 UCS-2 与 SQL Server 交互，而不是 UTF-8。

更正：我更改了 .freetds.conf 条目，以便客户端使用 UTF-8

    tds version = 8.0
    client charset = UTF-8
    text size = 32768

现在，绑定值对于 UTF-8 编码的字符串可以正常工作。
该驱动程序在用于数据服务器端存储的 UCS-2 和提供给客户端/从客户端获取的 UTF-8 编码字符串之间进行透明转换。

这是在 Solaris 10 上运行 Python 2.5 和 FreeTDS freetds-0.82.1.dev.20081111 和 SQL Server 2008 的 pyodbc 2.0

import pyodbc
test_string = u"""Comment ça va ? Très bien ?"""

print type(test_string),repr(test_string)
utf8 = 'utf8:' + test_string.encode('UTF-8')
print type(utf8), repr(utf8)

c = pyodbc.connect('DSN=SA_SQL_SERVER_TEST;UID=XXX;PWD=XXX')

cur = c.cursor()
# This does not work as test_string is not UTF-encoded
try: 
    cur.execute('INSERT unicode_test(t) VALUES(?)', test_string)
    c.commit()
except pyodbc.Error,e:
    print e


# This one does:
try:
    cur.execute('INSERT unicode_test(t) VALUES(?)', utf8)
    c.commit()
except pyodbc.Error,e:
    print e

通过 Management Studio 手动输入了一堆测试数据）

In [41]: for i in cur.execute('SELECT t FROM unicode_test'):
   ....:     print i
   ....:
   ....:
('this is not a banana', )
('\xc3\x85kergatan 24', )
('\xc3\x85kergatan 24', )
('\xe6\xb0\xb4 this is code-point 63CF', )
('Mich\xc3\xa9l', )
('Comment a va ? Trs bien ?', )
('utf8:Comment \xc3\xa7a va ? Tr\xc3\xa8s bien ?', )

这是测试表的输出（我能够通过“编辑前 200 行”对话框将一些 unicode 代码点直接从 Management Studio 放入表中，输入 unicode 代码点的十六进制数字，然后按 Alt-X

I use UCS-2 to interact with SQL Server, not UTF-8.

Correction: I changed the .freetds.conf entry so that the client uses UTF-8

    tds version = 8.0
    client charset = UTF-8
    text size = 32768

Now, bind values work fine for UTF-8 encoded strings.
The driver converts transparently between the UCS-2 used for storage on the dataserver side and the UTF-8 encoded strings given to/taken from the client.

This is with pyodbc 2.0 on Solaris 10 running Python 2.5 and FreeTDS freetds-0.82.1.dev.20081111 and SQL Server 2008

import pyodbc
test_string = u"""Comment ça va ? Très bien ?"""

print type(test_string),repr(test_string)
utf8 = 'utf8:' + test_string.encode('UTF-8')
print type(utf8), repr(utf8)

c = pyodbc.connect('DSN=SA_SQL_SERVER_TEST;UID=XXX;PWD=XXX')

cur = c.cursor()
# This does not work as test_string is not UTF-encoded
try: 
    cur.execute('INSERT unicode_test(t) VALUES(?)', test_string)
    c.commit()
except pyodbc.Error,e:
    print e


# This one does:
try:
    cur.execute('INSERT unicode_test(t) VALUES(?)', utf8)
    c.commit()
except pyodbc.Error,e:
    print e

Here is the output from the test table (I had manually put in a bunch of test data via Management Studio)

In [41]: for i in cur.execute('SELECT t FROM unicode_test'):
   ....:     print i
   ....:
   ....:
('this is not a banana', )
('\xc3\x85kergatan 24', )
('\xc3\x85kergatan 24', )
('\xe6\xb0\xb4 this is code-point 63CF', )
('Mich\xc3\xa9l', )
('Comment a va ? Trs bien ?', )
('utf8:Comment \xc3\xa7a va ? Tr\xc3\xa8s bien ?', )

I was able to put in some in unicode code points directly into the table from Management Studio by the 'Edit Top 200 rows' dialog and entering the hex digits for the unicode code point and then pressing Alt-X

回复收藏 0 原文