How does pyodbc determine the encoding?
I've been fighting Sybase SQL Anywhere 12 together with Python (and Twisted) for several weeks now, and I even got my stuff working.
There's only one annoyance left: If I run my script on CentOS 5 with a custom Python 2.7.1, which is the deployment platform, I get my results as UTF-8.
If I run it on my Ubuntu box (Natty Narwhal) I get them in latin1.
Needless to say, I would prefer to get all my data as Unicode, but that's not the point of this question. :)
Both are 64-bit boxes, and both have a custom Python 2.7.1 built with UCS4 and a custom-built unixODBC 2.3.0.
I'm at a loss here. I can't find any documentation on that. What makes pyodbc or unixODBC behave differently on the two boxes?
Hard facts:
- Python: 2.7.1
- DB: SQL Anywhere 12
- unixODBC: 2.3.0 (2.2.14 behaved the same), self-compiled with identical flags
- ODBC driver: original from Sybase.
- CentOS 5 gives me UTF-8, Ubuntu Natty Narwhal gives me latin1.
My odbc.ini looks like this:
[sybase]
Uid = user
Pwd = password
Driver = /opt/sqlanywhere/lib64/libdbodbc12_r.so
Threading = True
ServerName = dbname
CommLinks = tcpip(host=the-host;DoBroadcast=None)
I connect just by using DSN='sybase'.
TIA!
Comments (2)
I can't tell you why it's different, but if you add "Charset=utf-8" to your DSN, you should get the results you want on both machines.
Disclaimer: I work for Sybase in SQL Anywhere engineering.
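As a rough illustration of that suggestion, the charset can also be appended to the connection string rather than edited into odbc.ini. This is only a sketch: the helper names `build_connection_string` and `connect` are made up for this example, and it assumes the driver accepts `Charset` as a connection-string keyword in the same way it accepts it in the DSN entry.

```python
def build_connection_string(dsn, charset="utf-8"):
    """Return an ODBC connection string that names a DSN and pins the charset."""
    # "Charset" is the parameter name quoted in the answer above; whether the
    # driver honours it in a connection string is an assumption of this sketch.
    return "DSN={0};Charset={1}".format(dsn, charset)


def connect(dsn, charset="utf-8"):
    """Open a pyodbc connection with the charset pinned (requires pyodbc)."""
    import pyodbc  # imported lazily so the string helper works without the driver
    return pyodbc.connect(build_connection_string(dsn, charset))
```

Calling `connect("sybase")` on both machines would then request the same charset regardless of whatever default each box picks up. Adding a `Charset = utf-8` line to the `[sybase]` section of odbc.ini should have the same effect.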
pyodbc uses the ODBC specification, which only supports 2 encodings. All ODBC functions that end with 'W' are the wide character versions that use SQLWCHAR. This is defined by the ODBC headers and is usually UCS2 but is occasionally UCS4. The non-wide versions use SQLCHAR and are always(?) single-byte ANSI/ASCII.
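To make the width difference concrete, here is a small sketch that encodes one non-ASCII character several ways and records the byte counts. It uses Python's `utf-16-le` and `utf-32-le` codecs as stand-ins for UCS2 and UCS4, which is an approximation that holds for characters in the Basic Multilingual Plane; the variable names are just for illustration.

```python
# One character outside ASCII: LATIN SMALL LETTER A WITH DIAERESIS.
sample = u"\u00e4"

# Number of bytes the same character occupies in each encoding.
widths = {
    "latin-1": len(sample.encode("latin-1")),      # single-byte "ANSI" style
    "utf-8": len(sample.encode("utf-8")),          # variable width
    "utf-16-le": len(sample.encode("utf-16-le")),  # stand-in for UCS2
    "utf-32-le": len(sample.encode("utf-32-le")),  # stand-in for UCS4
}

for name in sorted(widths):
    print("{0}: {1} byte(s)".format(name, widths[name]))
```

The latin-1 and UTF-8 byte strings differ (`\xe4` versus `\xc3\xa4`), which is why the same query result can look different on the two boxes when it comes back as raw bytes.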
There is absolutely no support in ODBC for variable width encodings such as UTF8. If ODBC drivers supply that, it is absolutely incorrect. Even if data is stored in UTF8, it must be converted into ANSI or UCS2 by the driver. Unfortunately most ODBC drivers are completely incorrect.
When sending to the driver, pyodbc will use ANSI if the data is a 'str' object and will use UCS2/UCS4 (whatever SQLWCHAR is defined to be on your platform) if the data is a 'unicode' object. The driver determines whether data is SQLCHAR or SQLWCHAR when returning it, and pyodbc does not have any say in the matter. If it is SQLCHAR, it is converted to a 'str' object; if it is SQLWCHAR, it is converted to a 'unicode' object.
This will be slightly different for 3.x versions which will convert both SQLCHAR & SQLWCHAR to Unicode by default.
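If the driver does hand back SQLCHAR data as byte strings, one way to get Unicode everywhere is to decode on the client side once the encoding is known. The following is a minimal sketch, not part of pyodbc: `to_text` and `decode_row` are hypothetical helpers, and the default of UTF-8 is an assumption that only holds if the connection really delivers UTF-8.

```python
def to_text(value, encoding="utf-8"):
    """Decode byte strings to text; pass every other value through unchanged."""
    # On Python 2, `bytes` is an alias for `str`, so this check catches the
    # 'str' objects described above and leaves 'unicode' objects alone.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def decode_row(row, encoding="utf-8"):
    """Apply to_text to every column of a result row and return a tuple."""
    return tuple(to_text(column, encoding) for column in row)
```

For example, `decode_row(cursor.fetchone())` would normalize a single row, where `cursor` is an open pyodbc cursor. Passing `encoding="latin-1"` covers the case where the driver returns latin1 bytes instead.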