This question is related to this previous one on how to replace accented strings like México with the equivalent LaTeX code M\'{e}xico.
My problem here is slightly different. I am using a third-party database whose string variables contain Spanish accents like the one above. However, the encoding appears odd, since this is the behavior I get:
> grep("México",temp$dest_nom_ent)
integer(0)
> grep("Mexico",temp$dest_nom_ent)
integer(0)
> grep("xico",temp$dest_nom_ent)
[1] 18 19 20
> temp$dest_nom_ent[grep("xico",temp$dest_nom_ent)]
[1] "México" "México" "México"
where temp$dest_nom_ent is a variable holding the state names of México.
My question, then, is how to convert the string variable from the third-party database into an encoding that standard R functions will recognize. Please note:
> Encoding(temp$dest_nom_ent)
[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[8] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[15] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[22] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[29] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[36] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[43] "unknown" "unknown"
For further info, I am using 64-bit Windows 7. Also note:
> charToRaw(temp$dest_nom_ent[18])
[1] 4d e9 78 69 63 6f
According to this source, these bytes coincide with the Windows Spanish (Traditional Sort) locale:
M=4d
é=e9
x=78
i=69
c=63
o=6f
And also note:
> charToRaw("México")
[1] 4d c3 a9 78 69 63 6f
> Encoding("México")
[1] "latin1"
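The two charToRaw() outputs above suggest that the database stores é as the single byte e9 (as in Latin-1 / Windows-1252), while the literal typed at the console carries the two UTF-8 bytes c3 a9. The following is a minimal sketch, not a confirmed diagnosis: it rebuilds both values from the bytes printed in the question, declares the database value as latin1, converts it to UTF-8, and searches with an escaped pattern. The variable names are hypothetical.

```r
# Bytes copied from the charToRaw() output in the question.
db_bytes    <- as.raw(c(0x4d, 0xe9, 0x78, 0x69, 0x63, 0x6f))        # database value
typed_bytes <- as.raw(c(0x4d, 0xc3, 0xa9, 0x78, 0x69, 0x63, 0x6f))  # typed literal

db_str <- rawToChar(db_bytes)
Encoding(db_str) <- "latin1"   # declare what the bytes mean; no bytes change
db_utf8 <- enc2utf8(db_str)    # re-encode: e9 becomes c3 a9

# After conversion the database value has the same bytes as the typed one.
same_bytes <- identical(charToRaw(db_utf8), typed_bytes)

# "\u00e9" is an escape for é, so the pattern does not depend on how the
# keyboard or console encodes the character.
hit <- grepl("\u00e9", db_utf8, fixed = TRUE)
```

Writing the pattern as an escape sidesteps the question of which bytes the console produces when é is typed directly.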
I have tried the following without success (meaning that grep("é",temp$dest_nom_ent) still returns an empty vector):
Encoding(temp$dest_nom_ent)<-"latin1"
temp$dest_nom_ent <- iconv(temp$dest_nom_ent,"","latin1")
temp$dest_nom_ent <- enc2utf8(temp$dest_nom_ent)
...
I checked the supported character sets using iconvlist(), and "WINDOWS-1252" is supported. The following, however, did not work:
> temp1 <- temp$dest_nom_ent[grep("xico",temp$dest_nom_ent)]
> temp1
[1] "México" "México" "México"
> Encoding(temp1)<-"WINDOWS-1252"
> temp1 <- iconv(temp1,"WINDOWS-1252","latin1")
> temp1
[1] "México" "México" "México"
> Encoding(temp1)
[1] "latin1" "latin1" "latin1"
> charToRaw(temp1[1])
[1] 4d e9 78 69 63 6f
> grep("é",temp1)
integer(0)
which compares to:
> temp2 <- c("México","México","México")
> temp2
[1] "México" "México" "México"
> Encoding(temp2)
[1] "latin1" "latin1" "latin1"
> charToRaw(temp2[1])
[1] 4d c3 a9 78 69 63 6f
> grep("é",temp2)
[1] 1 2 3
I tried to find out the encoding by brute force, like this:
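One detail that may explain part of the transcript above: Encoding<- only declares an encoding and, as far as I know, only understands "latin1", "UTF-8", "bytes" and "unknown"; any other value is treated as "unknown". Actual conversion is done by iconv(). A small sketch under that assumption, using the bytes from the question (variable names are hypothetical):

```r
# Database value rebuilt from the bytes shown by charToRaw() in the question.
x <- rawToChar(as.raw(c(0x4d, 0xe9, 0x78, 0x69, 0x63, 0x6f)))

# Declaring an encoding name that Encoding<- does not know leaves the
# string marked as "unknown".
Encoding(x) <- "WINDOWS-1252"
declared <- Encoding(x)

# iconv() is what actually converts bytes from one encoding to another.
y <- iconv(x, from = "WINDOWS-1252", to = "UTF-8")
converted_mark  <- Encoding(y)
converted_bytes <- charToRaw(y)
```

Here the conversion target is UTF-8 rather than latin1, which is a deliberate change from the attempt in the question.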
try(for(i in 1:length(iconvlist())){
  temp1 <- temp$dest_nom_ent[grep("xico",temp$dest_nom_ent)]
  Encoding(temp1)<-iconvlist()[i]
  temp1 <- iconv(temp1,iconvlist()[i],"latin1")
  print(grep("é",temp1))
  print(i)
},silent=FALSE)
I am not familiar with the try function, but it still exits at the first error instead of ignoring it, so I cannot check the whole list:
...
[1] 17
integer(0)
[1] 18
integer(0)
[1] 19
integer(0)
[1] 20
Error in iconv(temp1, iconvlist()[i], "latin1") :
unsupported conversion from 'CP-GR' to 'latin1' in codepage 1252
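The loop stops because try() wraps the whole for loop, so the first failing iconv() call ends it. A hedged sketch of one way around this is to handle the error inside each iteration with tryCatch(). The candidate list below is a short hypothetical stand-in for iconvlist() (including one deliberately invalid name), the test string is rebuilt from the bytes in the question, and the conversion target is UTF-8 rather than latin1.

```r
# Database value rebuilt from the bytes shown by charToRaw() in the question.
x <- rawToChar(as.raw(c(0x4d, 0xe9, 0x78, 0x69, 0x63, 0x6f)))

# Hypothetical short candidate list; in practice this could be iconvlist().
candidates <- c("latin1", "WINDOWS-1252", "not-a-real-encoding")

results <- vapply(candidates, function(enc) {
  tryCatch({
    converted <- iconv(x, from = enc, to = "UTF-8")
    # iconv() returns NA when the input is not valid in `enc`.
    !is.na(converted) && grepl("\u00e9", converted, fixed = TRUE)
  },
  # An unsupported conversion raises an error; record NA and keep going.
  error = function(e) NA)
}, logical(1))

print(results)
```

Each candidate ends up as TRUE (é found after conversion), FALSE (converted but no match, or invalid input), or NA (conversion not supported), so the whole list gets checked.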
Finally:
> Sys.getlocale()
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
> d<-c("México","México")
> for(i in 1:7){d1 <- str_sub(d[1],i,i); print(d1)}
[1] "M"
[1] "Ã"
[1] "©"
[1] "x"
[1] "i"
[1] "c"
[1] "o"
> print(grep("é",d))
[1] 1 2
So it seems I will have to change the computer's locale, as suggested here. Also see here.
PS: In case you wonder how I managed to type d<-c("México","México") with an English_United States.1252 locale: I set up a secondary Spanish (Traditional Sort) keyboard using Control Panel > Clock, Language and Region > Region and Language > Keyboards and Languages > Change Keyboards, and under Installed Services clicked Add and navigated to Spanish (Traditional Sort). Then, under Advanced Key Settings, you can create a shortcut to switch keyboards; in my case, Shift+Alt. So if I want to type ñ in the English default locale, I press Shift+Alt, followed by ;, and then Shift+Alt again to go back to the English keyboard.
Comments (3)
Take a look at what the encodings of temp$dest_nom_ent and "México" are, using Encoding(x). You may need to convert with enc2native or enc2utf8.
Try setting the encoding of the string to one of "ISO_8859-1" or "ISO_8859-15".
Two more suggestions, then I give up: "UTF-16" and "UTF-16LE". The second is UTF little-endian, I believe, and I have read that it is what Windows 7 actually uses. Might as well try "UTF-16BE" too. (Material garnered from another Stack Exchange posting: https://superuser.com/questions/221593/windows-7-utf-8-and-unicode )
Well, I could not determine the encoding of the accents, but the following accomplishes what I wanted. The trick was to convert to UTF-8, set the sub() option useBytes=TRUE, and follow Joran's suggestion to use sanitize.text.function=function(x){x} for xtable(). Here is the sample code. Easy to loop over all accented vowels: As actually implemented in a loop:
Still, the answer is incomplete in that I cannot explain what exactly was going on. I found it through trial and error.
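The sample code from this answer is not preserved on this page. What follows is only a sketch of what such an approach might look like, not the answerer's actual code: it converts a Latin-1 value to UTF-8 and then replaces each accented vowel with its LaTeX form using gsub() with fixed = TRUE and useBytes = TRUE. The vectors accented and latex and the test string are hypothetical, and the xtable() step is left out.

```r
# Database value rebuilt from the bytes in the question, declared as latin1
# and then converted to UTF-8.
x <- rawToChar(as.raw(c(0x4d, 0xe9, 0x78, 0x69, 0x63, 0x6f)))
Encoding(x) <- "latin1"
out <- enc2utf8(x)

# Accented vowels written as escapes, with their LaTeX equivalents.
accented <- c("\u00e1", "\u00e9", "\u00ed", "\u00f3", "\u00fa")
latex    <- c("\\'{a}", "\\'{e}", "\\'{i}", "\\'{o}", "\\'{u}")

for (k in seq_along(accented)) {
  # fixed = TRUE treats pattern and replacement literally;
  # useBytes = TRUE compares byte by byte, as described in the answer.
  out <- gsub(enc2utf8(accented[k]), latex[k], out,
              fixed = TRUE, useBytes = TRUE)
}

cat(out, "\n")
```

Byte-wise matching only works here because both the pattern and the data are in UTF-8 before gsub() is called, which may be why converting first was part of the trick.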