Weird characters: interaction of R and Windows locale?
WinXP-x32, R-2.13.0
Dear list,
I have a problem that (I think) relates to the interaction between Windows and R.
I am trying to scrape a table with data on the Hawaiian Islands. This is my R code:
library(XML)
u <- "http://en.wikipedia.org/wiki/Hawaii"
tables <- readHTMLTable(u)
Islands <- tables[[5]]
The output is (first set of columns):
  Island          Nickname            Location
1 Hawaiʻi[7]      The Big Island      19°34′N 155°30′W / 19.567°N 155.5°W / 19.567; -155.5
2 Maui[8]         The Valley Isle     20°48′N 156°20′W / 20.8°N 156.333°W / 20.8; -156.333
3 Kahoʻolawe[9]   The Target Isle     20°33′N 156°36′W / 20.55°N 156.6°W / 20.55; -156.6
4 LÄnaÊ»i[10]     The Pineapple Isle  20°50′N 156°56′W / 20.833°N 156.933°W / 20.833; -156.933
5 Molokaʻi[11]    The Friendly Isle   21°08′N 157°02′W / 21.133°N 157.033°W / 21.133; -157.033
6 Oʻahu[12]       The Gathering Place 21°28′N 157°59′W / 21.467°N 157.983°W / 21.467; -157.983
7 Kauaʻi[13]      The Garden Isle     22°05′N 159°30′W / 22.083°N 159.5°W / 22.083; -159.5
8 Niʻihau[14]     The Forbidden Isle  21°54′N 160°10′W / 21.9°N 160.167°W / 21.9; -160.167
As you can see, there are "weird" characters in there. I have also tried readHTMLTable(u, encoding = "UTF-16") and readHTMLTable(u, encoding = "UTF-8"), but that didn't help.
It seems to me that the problem may lie in how the Windows character-set settings interact with R.
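One way to picture the suspected mismatch, as a minimal base-R sketch rather than a diagnosis of readHTMLTable itself: the ʻokina in "Hawaiʻi" takes two bytes in UTF-8, and if those bytes are declared as Latin-1 (close to the Windows-1252 code page in the locale), each byte is shown as its own character. The byte values and variable names below are illustrative assumptions.

```r
# UTF-8 bytes for "Hawaiʻi"; the okina (U+02BB) is the byte pair CA BB.
# Built from raw bytes so the example does not depend on this file's encoding.
utf8_bytes <- as.raw(c(0x48, 0x61, 0x77, 0x61, 0x69, 0xCA, 0xBB, 0x69))
s <- rawToChar(utf8_bytes)

# Wrong declaration: treat every byte as one Latin-1 character.
as_latin1 <- s
Encoding(as_latin1) <- "latin1"
garbled <- enc2utf8(as_latin1)  # the okina is now two separate characters

# Correct declaration: the same bytes marked as UTF-8.
as_utf8 <- s
Encoding(as_utf8) <- "UTF-8"

nchar(garbled)  # one character more than the intended text
nchar(as_utf8)  # length of the intended text
```

The bytes are identical in both cases; only the encoding label differs, which is why the garbling shows up only for non-ASCII characters.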
sessionInfo() gives:
> sessionInfo()
R version 2.13.0 (2011-04-13)
Platform: i386-pc-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=Dutch_Netherlands.1252 LC_CTYPE=Dutch_Netherlands.1252 LC_MONETARY=Dutch_Netherlands.1252
[4] LC_NUMERIC=C LC_TIME=Dutch_Netherlands.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] XML_3.2-0.2
I have also attempted to make R use another setting by entering Sys.setlocale("LC_ALL", "en_US.UTF-8"), but this yields the response:
> Sys.setlocale("LC_ALL", "en_US.UTF-8")
[1] ""
Warning message:
In Sys.setlocale("LC_ALL", "en_US.UTF-8") :
OS reports request to set locale to "en_US.UTF-8" cannot be honored
In addition, I have attempted to make the change directly from the Windows command prompt, using chcp 65001 and variations of that, but that didn't change anything.
I noticed from searching the web that others have the issue as well, but I have not been able to find a solution. It looks like this is an issue of how Windows and R interact. Unfortunately, all three computers at my disposal have this problem. It occurs under both WinXP-x32 and Win7-x86.
Is there a way to make R override the Windows settings, or can the issue be solved otherwise?
I have also tried other websites, and the issue occurs every time there is an é, ü, ä, î, et cetera in the text to be scraped.
Thank you,
Roger
Comments (2)
Not quite an answer:
If you look at the Wikipedia page and change the encoding in your browser (in IE, View -> Encoding; in Firefox, View -> Character Encoding) to Western (ISO-8859-1) or Western (Windows-1252), then you see the silly characters. That ought to mean that you can use iconv to change the encoding and fix your problems.
Unfortunately, it doesn't work. It may be possible to get the correct text by using a different conversion (iconvlist() shows all the possibilities). It is also possible to simply strip out the offending characters, though this isn't ideal.
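For reference, the kind of iconv round trip meant here can be sketched on a toy string. This assumes the damage is exactly "UTF-8 bytes read as Latin-1"; whether that holds for the readHTMLTable output is not confirmed, and the variable names are illustrative.

```r
# Garbled text as it looks after UTF-8 bytes were read as Latin-1
# (written with \u escapes so the example is independent of file encoding).
garbled <- "Kaho\u00ca\u00bbolawe"

# Step 1: convert back to the single-byte encoding, recovering the original bytes.
bytes_back <- iconv(garbled, from = "UTF-8", to = "latin1")

# Step 2: declare those bytes to be what they really are.
repaired <- bytes_back
Encoding(repaired) <- "UTF-8"

# iconv() returns NA for strings it cannot convert, so check before using the result.
if (is.na(bytes_back) || !validUTF8(repaired)) {
  warning("round trip failed; the text was probably damaged in a different way")
}
```

Characters that exist in Windows-1252 but not in Latin-1 would need a different target encoding in step 1; the names accepted on a given machine are the ones listed by iconvlist().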
Unable to replicate the error, but looking at the help files is useful.
On Windows you should use names like "English" or "Dutch_Netherlands.1252" to change these settings.
I tried to replicate your state.
However, I do not get the funny characters in the console. In my own locale the ʻ was marked as , but all functionality still remained.
These funny characters can be read easily and found in the table.
If you still have problems, the cause would lie elsewhere. However, to change the locale under Windows you have to use different names than on Linux or OS X (see your own locale info for an example). In Windows, "Dutch" is probably enough.
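A minimal sketch of the naming difference, assuming only base R: it picks a Windows-style or POSIX-style locale name depending on the platform, tries it for LC_CTYPE, and restores the previous setting afterwards. The variable names are illustrative, and a given name may still be rejected on a particular machine.

```r
# "Dutch_Netherlands.1252" is the Windows-style name shown by sessionInfo();
# "en_US.UTF-8" is the POSIX-style name that Windows refused in the question.
target <- if (.Platform$OS.type == "windows") "Dutch_Netherlands.1252" else "en_US.UTF-8"

old <- Sys.getlocale("LC_CTYPE")

# Sys.setlocale() returns "" (with a warning) when the OS does not honor the request.
res <- suppressWarnings(Sys.setlocale("LC_CTYPE", target))

if (identical(res, "")) {
  message("Locale '", target, "' was not accepted; still using: ", old)
} else {
  message("LC_CTYPE is now: ", res)
  Sys.setlocale("LC_CTYPE", old)  # restore the previous setting
}
```

Checking the return value instead of relying on the warning makes the failure case explicit; l10n_info() can then be used to see whether the active locale is UTF-8 or Latin-1.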