Weird characters: an interaction between R and the Windows locale?

Posted 2024-11-04 20:02:27

WinXP-x32, R-2.13.0

Dear list,

I have a problem that (I think) relates to the interaction between Windows and R.

I am trying to scrape a table with data on the Hawaiian Islands. This is my R code:

library(XML)
u <- "http://en.wikipedia.org/wiki/Hawaii"
tables <- readHTMLTable(u)
Islands <- tables[[5]]

The output is (first set of columns):

> Islands
   Island          Nickname              Location
1  Hawaiʻi[7]      The Big Island        19°34′N 155°30′W / 19.567°N 155.5°W / 19.567; -155.5
2  Maui[8]         The Valley Isle       20°48′N 156°20′W / 20.8°N 156.333°W / 20.8; -156.333
3  Kahoʻolawe[9]   The Target Isle       20°33′N 156°36′W / 20.55°N 156.6°W / 20.55; -156.6
4  LÄnaÊ»i[10]     The Pineapple Isle    20°50′N 156°56′W / 20.833°N 156.933°W / 20.833; -156.933
5  Molokaʻi[11]    The Friendly Isle     21°08′N 157°02′W / 21.133°N 157.033°W / 21.133; -157.033
6  Oʻahu[12]       The Gathering Place   21°28′N 157°59′W / 21.467°N 157.983°W / 21.467; -157.983
7  Kauaʻi[13]      The Garden Isle       22°05′N 159°30′W / 22.083°N 159.5°W / 22.083; -159.5
8  Niʻihau[14]     The Forbidden Isle    21°54′N 160°10′W / 21.9°N 160.167°W / 21.9; -160.167

As you can see, there are "weird" characters in there. I have also tried readHTMLTable(u, encoding = "UTF-16") and readHTMLTable(u, encoding = "UTF-8"), but that didn't help.
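
One way to see where characters like "Ê»" can come from: the ʻokina (U+02BB) is stored in UTF-8 as the two bytes CA BB, and if those bytes are read as windows-1252 they display as Ê and ». The following is a minimal base-R sketch of that mechanism. That this is what happens inside readHTMLTable on the machines described here is an assumption, and the variable names are purely illustrative.

```r
# Assumption: the page is UTF-8, but its bytes get interpreted as windows-1252.
okina_name <- "Hawai\u02bbi"   # "Hawaiʻi", written with a Unicode escape

# The UTF-8 byte sequence of the string; U+02BB becomes the two bytes ca bb.
utf8_bytes <- charToRaw(enc2utf8(okina_name))
print(utf8_bytes)

# Deliberately misread those bytes as windows-1252, then convert to UTF-8.
# The result is the mojibake form "HawaiÊ»i".
garbled <- iconv(enc2utf8(okina_name), from = "windows-1252", to = "UTF-8")
print(garbled)
```

How the two strings are displayed depends on the console and the locale, so the printed form may differ from machine to machine even though the underlying bytes are the same.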

It seems to me that the problem may lie in how the Windows character-set settings interact with R.

sessionInfo() gives

> sessionInfo()
R version 2.13.0 (2011-04-13)
Platform: i386-pc-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=Dutch_Netherlands.1252  LC_CTYPE=Dutch_Netherlands.1252    LC_MONETARY=Dutch_Netherlands.1252
[4] LC_NUMERIC=C                       LC_TIME=Dutch_Netherlands.1252  

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base    

other attached packages:
[1] XML_3.2-0.2

I have also attempted to let R use another setting by entering: Sys.setlocale("LC_ALL", "en_US.UTF-8"), but this yields the response:

> Sys.setlocale("LC_ALL", "en_US.UTF-8")
[1] ""
Warning message:
In Sys.setlocale("LC_ALL", "en_US.UTF-8") :
  OS reports request to set locale to "en_US.UTF-8" cannot be honored

In addition, I have attempted to make the change directly from the Windows command prompt, using chcp 65001 and variations of that, but that didn't change anything.

I noticed from searching the web that others have the issue as well, but I have not been able to find a solution. It looks like this is an issue of how Windows and R interact. Unfortunately, all three computers at my disposal have this problem. It occurs under both WinXP-x32 and Win7-x86.

Is there a way to make R override the Windows settings, or can the issue be solved some other way?
I have also tried other websites, and the issue occurs every time there is an é, ü, ä, î, et cetera in the text to be scraped.

Thank you,
Roger

Comments (2)

九歌凝 2024-11-11 20:02:27

Not quite an answer:

If you look at the Wikipedia page and change the encoding in your browser (in IE, View -> Encoding; in Firefox, View -> Character Encoding) to Western (ISO-8859-1) or Western (Windows-1252), then you see the silly characters. That ought to mean that you can use iconv to change the encoding and fix your problems.

#Convert factors to character
Islands <- as.data.frame(lapply(Islands, as.character), stringsAsFactors = FALSE)

iconv(Islands$Island, "windows-1252", "UTF-8")

Unfortunately, it doesn't work. It may be possible to get the correct text by using a different conversion (iconvlist() shows all the possibilities).
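
If the strings have genuinely been double-encoded (UTF-8 bytes read as windows-1252 and then stored as UTF-8), the conversion that undoes it runs in the opposite direction from the one tried above. Here is a hedged base-R sketch. Whether it matches what readHTMLTable actually returned on the asker's machine is an assumption, and fix_mojibake is a hypothetical helper name.

```r
# Hypothetical helper: undo "UTF-8 bytes misread as windows-1252".
fix_mojibake <- function(x) {
  # Map each character back to the single windows-1252 byte it came from ...
  bytes_back <- iconv(x, from = "UTF-8", to = "windows-1252")
  # ... and declare that the recovered byte sequence is UTF-8.
  Encoding(bytes_back) <- "UTF-8"
  bytes_back
}

garbled <- "Hawai\u00ca\u00bbi"   # the mojibake form "HawaiÊ»i"
fixed <- fix_mojibake(garbled)
identical(fixed, "Hawai\u02bbi")  # compares against "Hawaiʻi"
```

One known limit: a few byte values, such as 0x81 (which occurs in the UTF-8 form of ā), are undefined in windows-1252, so names like Lānaʻi may not survive this round trip.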

It is also possible to simply strip out the offending characters, though this isn't ideal.

iconv(Islands$Island, "windows-1252", "ASCII", "")
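
To make the effect of that last argument concrete, here is a small sketch of how the sub argument of iconv behaves when the target is ASCII. It starts from a correctly encoded UTF-8 string rather than from the scraped table, which is an assumption made only to keep the example self-contained.

```r
island <- "Hawai\u02bbi"   # "Hawaiʻi" as a UTF-8 string

# sub = "" silently drops every byte that has no ASCII equivalent.
stripped <- iconv(island, from = "UTF-8", to = "ASCII", sub = "")

# sub = "byte" keeps a visible trace of what was removed, as hex codes.
shown <- iconv(island, from = "UTF-8", to = "ASCII", sub = "byte")

print(stripped)
print(shown)
```

Using sub = "byte" first can be a cheap way to find out which bytes are actually in the data before deciding to throw them away. The exact behaviour can vary between iconv implementations.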

開玄 2024-11-11 20:02:27

I was unable to replicate the error, but the help files are useful here:

Sys.setlocale("LC_TIME", "de")     # Solaris: details are OS-dependent
Sys.setlocale("LC_TIME", "de_DE.utf8")   # Modern Linux etc.
Sys.setlocale("LC_TIME", "de_DE.UTF-8")  # ditto
Sys.setlocale("LC_TIME", "de_DE")  # OS X, in UTF-8
Sys.setlocale("LC_TIME", "German") # Windows

On Windows you should use names like "English" or "Dutch_Netherlands.1252" to change these settings.
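
Which names a given machine accepts can be checked directly, since Sys.setlocale returns an empty string (with a warning) when the request is refused. Below is a small sketch of such a check. locale_is_available is a hypothetical helper, and the list of candidate names is only an illustration.

```r
# Hypothetical helper: TRUE if the OS accepts `name` for LC_CTYPE.
locale_is_available <- function(name) {
  old <- Sys.getlocale("LC_CTYPE")
  # Restore the previous setting no matter what happens below.
  on.exit(Sys.setlocale("LC_CTYPE", old), add = TRUE)
  # A refused request yields "" plus a warning, which is silenced here.
  result <- suppressWarnings(Sys.setlocale("LC_CTYPE", name))
  nzchar(result)
}

candidates <- c("Dutch_Netherlands.1252", "English", "en_US.UTF-8", "C")
vapply(candidates, locale_is_available, logical(1))
```

The result differs by operating system: the Windows-style names would be expected to fail on Linux, and the POSIX-style ones may fail on Windows, as in the warning quoted in the question.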

I tried to replicate your setup:

> Sys.setlocale("LC_ALL","Dutch_Netherlands.1252")
[1] "LC_COLLATE=Dutch_Netherlands.1252;LC_CTYPE=Dutch_Netherlands.1252;LC_MONETARY=Dutch_Netherlands.1252;LC_NUMERIC=C;LC_TIME=Dutch_Netherlands.1252"
> Sys.getlocale()
[1] "LC_COLLATE=Dutch_Netherlands.1252;LC_CTYPE=Dutch_Netherlands.1252;LC_MONETARY=Dutch_Netherlands.1252;LC_NUMERIC=C;LC_TIME=Dutch_Netherlands.1252"

library(XML)
u <- "http://en.wikipedia.org/wiki/Hawaii"
tables <- readHTMLTable(u)
Islands <- tables[[5]]

However, I do not get the funny characters in the console. In my own locale the ʻ was displayed differently, but everything still worked.

> Islands[1,1]
[1] Hawaiʻi[27]
8 Levels: Hawaiʻi[27] Kahoʻolawe[34] Kauaʻi[30] Lānaʻi[32] Maui[28] ... Oʻahu[29]

These characters can be read without trouble and matched in the table:

> Encoding(as.character("Hawaiʻi"))
[1] "UTF-8"
> Encoding(as.character(Islands[1,1]))
[1] "UTF-8"
> grep("Hawaiʻi", as.character(Islands[1,1]))
[1] 1
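
When the bytes are already valid UTF-8 and only the declared encoding is missing, it may be enough to mark the strings rather than convert them. The sketch below assumes that situation. mark_utf8 is a hypothetical helper, not part of the XML package, and the demo data frame is made up for illustration.

```r
# Hypothetical helper: declare every text column of a data frame to be UTF-8.
mark_utf8 <- function(df) {
  df[] <- lapply(df, function(col) {
    if (is.factor(col)) col <- as.character(col)   # factors become plain strings
    if (is.character(col)) Encoding(col) <- "UTF-8"
    col
  })
  df
}

# Example input: the UTF-8 bytes of "Hawaiʻi" with no declared encoding.
raw_name <- rawToChar(as.raw(c(0x48, 0x61, 0x77, 0x61, 0x69, 0xca, 0xbb, 0x69)))
islands_demo <- data.frame(Island = raw_name, Rank = 1L, stringsAsFactors = FALSE)

marked <- mark_utf8(islands_demo)
Encoding(marked$Island)   # now reports "UTF-8"
```

Marking changes no bytes, so it is harmless to try first. If the output still looks wrong afterwards, the bytes themselves were probably altered somewhere earlier.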

If you still have problems, the cause lies elsewhere. Note, however, that to change the locale under Windows you have to use different names than on Linux or OS X (see your own locale info for an example). On Windows, "Dutch" is probably enough.
