PostgreSQL 的 rodbc 字符编码错误
使用 RODBC 从 R 连接到 GreenPlum PostgreSQL 数据库时,我遇到了以前从未遇到过的新错误。我在使用 EMACS/ESS 和 RStudio 时都遇到了错误,并且 RODBC 调用的工作方式与过去一样。
library(RODBC)
gp <- odbcConnect("greenplum", believeNRows = FALSE)
data <- sqlQuery(gp, "select * from mytable")
> data
[1] "22P05 7 ERROR: character 0xc280 of encoding \"UTF8\" has no equivalent in "WIN1252\";\nError while executing the query"
[2] "[RODBC] ERROR: Could not SQLExecDirect 'select * from mytable'"
编辑: 刚刚尝试查询另一个表并得到了结果。所以我猜这不是 RODBC 问题而是 PostgreSQL 表编码问题。
R version 2.13.0 (2011-04-13)
Platform: i386-pc-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] RODBC_1.3-2
>
I'm getting a new error which I've never gotten before when connecting from R to a GreenPlum PostgreSQL database using RODBC. I've gotten the error using both EMACS/ESS and RStudio, and the RODBC call has worked as is in the past.
library(RODBC)
gp <- odbcConnect("greenplum", believeNRows = FALSE)
data <- sqlQuery(gp, "select * from mytable")
> data
[1] "22P05 7 ERROR: character 0xc280 of encoding \"UTF8\" has no equivalent in "WIN1252\";\nError while executing the query"
[2] "[RODBC] ERROR: Could not SQLExecDirect 'select * from mytable'"
EDIT:
Just tried querying another table and did get results. So I guess it's not an RODBC problem but a PostgreSQL table encoding problem.
R version 2.13.0 (2011-04-13)
Platform: i386-pc-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] RODBC_1.3-2
>
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
首先,出现此问题是因为 R 正在尝试转换为支持 UTF8 的 Windows 语言环境。不幸的是,Brian Ripley 多次报告 Windows 没有 UTF8 语言环境。通过花费数小时在网络、StackOverflow、Microsoft 等上搜索,我得出的结论是 Microsoft
讨厌 UTF-8Windows 不会支持 UTF8。因此,如果有任何解决方案的话,我不确定是否有一个简单的解决方案。我能推荐的最好的方法是在服务器端包装某种转换,如果可以的话考虑过滤数据,或者如果合适的话尝试不同的语言(例如中文,日语,韩语)。
如果您决定封装转换器,unicode.org 建议 此 ICU 工具包。
First, the issue arises because R is trying to convert to a Windows locale that supports UTF8. Unfortunately, Brian Ripley has reported numerous times that Windows has no UTF8 locales. From hours spent searching the web, StackOverflow, Microsoft, etc., I have come to the conclusion that Microsoft
hates UTF-8Windows won't support UTF8.As a result, I'm not sure that there's an easy solution to this, if there is any solution at all. The best I can recommend is to wrap some kind of conversion on the server side, look at filtering the data if you can, or try a different language, if appropriate (e.g. Chinese, Japanese, Korean).
If you do decide to wrap a converter, unicode.org recommends this ICU toolkit.
0xc280 是一个控制元素(Unicode 中的 U+0080),在使用 SQL 等时经常引起问题。问题通常在于转换链,当您使用使用不同编码方案的不同应用程序时,总会发生这种情况。 Windows 现已包含 UTF-8,因此严格来说这不是 Windows 问题。我相信问题在R读取数据之前就出现了。
事实上,在链中,UNICODE中的字符序列0x80将被映射到UTF-8中的0xc280。这应该是一个控制序列,无法打印。但很有可能 0x80 实际上不是 UNICODE,而是 Windows Latin-1 或 Latin-2。在这种情况下,0x80 代表欧元符号。这可能可以解释它如何最终出现在您的数据中。检查一下你是否能在数据中找到类似的东西,这已经可以解释一些事情了。
我的猜测是,解决方案不会位于该工作链的 R 端,而是在此之前。它将尝试自动转换,但据报告在某些情况下会失败(顺便说一句,也适用于 SQL 和 Oracle)。检查您在 Postgresql 中使用的编码,并尝试使用任何拉丁类型。可能还涉及其他链接(例如 Putty 或类似终端)。我很确定所有编码都是 ISO8859-1,即 Latin-1。 UTF-8 被夹在中间,当 0x80 字符被错误地映射到 0xc280 时,就会遇到麻烦。
因此,请检查完整工作链中的编码,并确保它们全部匹配。如果不这样做,每个步骤之间完成的自动转换势必会给某些字符带来麻烦。
希望这有帮助。
0xc280 is a control element ( U+0080 in Unicode) that is causing trouble pretty often when using SQL and the likes. The problem often lies in the conversion chain that invariably happens when you use different applications that use different encoding schemes. Windows has UTF-8 included by now, so it's not strictly a Windows problem. I believe the problem arises before R reads the data in.
In fact, in the chain the character sequence 0x80 in UNICODE will be mapped to 0xc280 in UTF-8. This is supposed to be a control sequence, and cannot be printed. But chances are big that the 0x80 is in fact not UNICODE, but Windows Latin-1 or Latin-2. In that case, the 0x80 represents the euro sign. That might explain how it ends up in your data. Check if you can find something like that in the data, that would explain something already.
My guess is that the solution will not lie at the R-end of this workchain, but before that. It will try automatic conversion, but this one is reported to fail in some cases (also for SQL and Oracle btw). Check in which encoding you're working in Postgresql, and try to use any of the latin types. There might be other links involved (a Putty or similar terminal for example). I'm pretty sure all the encodings there are ISO8859-1, which is Latin-1. Somewhere UTF-8 gets thrown in between, and when the 0x80 character gets wrongly mapped to 0xc280, you get trouble.
So check the encodings in your complete workchain, and make sure that they all match. If they don't, the automatic conversion done between each step is bound to give trouble for some characters.
Hope this helps.
我可能已经在其他地方发布了此回复,但这里是。
从 MS SQL 管理客户端连接到 Postgres DB 时,我遇到类似的错误。在我的例子中,试图修复源数据几乎是不可能的。
我的场景:
ODBC 系统 DSN,并看到诸如“ERROR: character 0xc280 of
编码“UTF8”在“WIN1252”中没有等效项;
修复:使用支持 Unicode 的 ODBC 驱动程序。我正在使用 PostgreSQL Global Development Group 的 ODBC 驱动程序。转至配置 DSN/管理 DSN 并选择 Unicode 驱动程序。
祝你好运。
I might have posted this response elswhere but here goes.
I get similar error when connecting to Postgres DB from MS SQL Management client. Tyring to fix the source data is almost impossible in my case.
My Scenario:
ODBC System DSN, and see errors such as "ERROR: character 0xc280 of
encoding "UTF8" has no equivalent in"WIN1252";
Fix: Use an ODBC driver that supports Unicode. I am using an ODBC driver from PostgreSQL Global Development Group. Go to Configure DSN/Manage DSN and select the Unicode driver.
Good luck.
默认情况下,Greenplum 使用 UTF8 进行字符编码。您可以通过登录 Greenplum 服务器并启动 psql - Greenplum 控制台客户端来检查这一点。
在此控制台应用程序中,您可以发出命令:
\l
列出 Greenplum 中配置的所有数据库 - 这也应该描述数据库的字符集。我认为你的问题是 R 不支持 UTF8 字符(你使用不同的区域设置)
但是您可以在 ODBC 驱动程序中使用即时转码。不确定所有 ODBC 驱动程序,但 DataDirect 驱动程序支持 odbc.ini 文件(通常位于用户主目录中)中的额外选项 - IANAAppCodePage。
您可以在此链接上找到此参数的适当代码:
http://www.iana.org/assignments/character-sets
这是示例od ODBC.ini 内容:
By default Greenplum use UTF8 for character encoding. You could check this by logging in to Greenplum server and launching psql - console client for Greenplum.
In this console application you could issue command:
\l
to list all of the databases configured in the Greenplum - this should also describe character set for database.I think your prblem is that R doesnt support UTF8 for chars (You use different locale)
But you could use On-the-fly transcoding in ODBC driver. Not sure about all ODBC drivers but DataDirect drivers support extra option in odbc.ini file (usually located in user home directory) - IANAAppCodePage.
You could find appropriate code for this parameter on this link:
http://www.iana.org/assignments/character-sets
Here is the example od ODBC.ini content: