How to determine the encoding of accented characters?

Posted 2024-11-24 02:54:45

This question is related to a previous one on how to replace accented strings like México with the equivalent LaTeX code M\'{e}xico.

My problem here is slightly different. I am using a third party database with string variables with Spanish accents like above. However, the encoding appears odd since this is the behavior I get:

> grep("México",temp$dest_nom_ent)
integer(0)
> grep("Mexico",temp$dest_nom_ent)
integer(0)
> grep("xico",temp$dest_nom_ent)
[1] 18 19 20
> temp$dest_nom_ent[grep("xico",temp$dest_nom_ent)]
[1] "México" "México" "México"

where temp$dest_nom_ent is a variable with state names of México.

My question, then, is how to convert the string variable from the third party database into an encoding that standard R functions will recognize. Please note:

> Encoding(temp$dest_nom_ent)
 [1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
 [8] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[15] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[22] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[29] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[36] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[43] "unknown" "unknown"

For further information, I am using 64-bit Windows 7. Also note:

> charToRaw(temp$dest_nom_ent[18])
[1] 4d e9 78 69 63 6f

According to this source, these bytes match the Windows Spanish (Traditional Sort) locale:

M=4d
é=e9
x=78
i=69
c=63
o=6f
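The byte listing above can be reproduced without the database. The following is a minimal sketch, assuming the single byte e9 is meant as Latin-1 (in which e9 is é); the variable names are illustrative only. It rebuilds the six bytes, converts them explicitly from latin1 to UTF-8 with iconv(), and shows that é becomes the two bytes c3 a9.

```r
# Rebuild the six bytes reported for temp$dest_nom_ent[18]: 4d e9 78 69 63 6f
raw_bytes <- as.raw(c(0x4d, 0xe9, 0x78, 0x69, 0x63, 0x6f))
x <- rawToChar(raw_bytes)

# Assumption: the bytes are Latin-1. Name the source encoding explicitly
# instead of passing "" (which means "the current locale's encoding").
x_utf8 <- iconv(x, from = "latin1", to = "UTF-8")

# Inspect the result: the single byte e9 is now the pair c3 a9
print(charToRaw(x_utf8))
print(Encoding(x_utf8))
```

Naming the source encoding avoids depending on the machine's locale, which is what the "" argument of iconv() refers to.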

And also note:

> charToRaw("México")
[1] 4d c3 a9 78 69 63 6f
> Encoding("México")
[1] "latin1"

I have tried the following without success (that is, grep("é",temp$dest_nom_ent) still returns an empty vector):

Encoding(temp$dest_nom_ent)<-"latin1"
temp$dest_nom_ent <- iconv(temp$dest_nom_ent,"","latin1")
temp$dest_nom_ent  <- enc2utf8(temp$dest_nom_ent)
...

I checked supported character sets using iconvlist() and "WINDOWS-1252" is supported. The following, however, did not work:

> temp1 <- temp$dest_nom_ent[grep("xico",temp$dest_nom_ent)]
> temp1
[1] "México" "México" "México"
> Encoding(temp1)<-"WINDOWS-1252"
> temp1 <- iconv(temp1,"WINDOWS-1252","latin1")
> temp1
[1] "México" "México" "México"
> Encoding(temp1)
[1] "latin1" "latin1" "latin1"
> charToRaw(temp1[1])
[1] 4d e9 78 69 63 6f
> grep("é",temp1)
integer(0)

which compares to:

> temp2 <- c("México","México","México")
> temp2
[1] "México" "México" "México"
> Encoding(temp2)
[1] "latin1" "latin1" "latin1"
> charToRaw(temp2[1])
[1] 4d c3 a9 78 69 63 6f
> grep("é",temp2)
[1] 1 2 3

I tried to find the encoding by brute force, like this:

try(for(i in 1:length(iconvlist())){
    temp1 <- temp$dest_nom_ent[grep("xico",temp$dest_nom_ent)]
    Encoding(temp1)<-iconvlist()[i]
    temp1 <- iconv(temp1,iconvlist()[i],"latin1")
    print(grep("é",temp1))
    print(i)
        },silent=FALSE)

I am not familiar with the try function, but the loop still stops at the first error instead of skipping it, so I cannot check the whole list:

...
[1] 17
integer(0)
[1] 18
integer(0)
[1] 19
integer(0)
[1] 20
Error in iconv(temp1, iconvlist()[i], "latin1") : 
  unsupported conversion from 'CP-GR' to 'latin1' in codepage 1252
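The loop stops because try() wraps the whole for loop, so the first error ends the loop itself. A minimal sketch of one way around this is to catch the error inside each iteration with tryCatch(). The candidate list here is a short hypothetical one (including a deliberately invalid name) rather than the full iconvlist(), and the sample string is rebuilt from the bytes shown earlier instead of being read from the database.

```r
# Sample string rebuilt from the reported bytes 4d e9 78 69 63 6f
x <- rawToChar(as.raw(c(0x4d, 0xe9, 0x78, 0x69, 0x63, 0x6f)))

# Bytes we expect once "México" is correctly converted to UTF-8
target_bytes <- as.raw(c(0x4d, 0xc3, 0xa9, 0x78, 0x69, 0x63, 0x6f))

# Hypothetical short list; in practice this could be iconvlist()
candidates <- c("latin1", "UTF-8", "no-such-encoding")

matches <- character(0)
for (enc in candidates) {
  # Catch the error per iteration so one bad encoding does not end the loop
  converted <- tryCatch(
    iconv(x, from = enc, to = "UTF-8"),
    error = function(e) NA_character_
  )
  # iconv() returns NA when the input is invalid in the source encoding
  if (!is.na(converted) && identical(charToRaw(converted), target_bytes)) {
    matches <- c(matches, enc)
  }
}
print(matches)
```

Comparing raw bytes rather than calling grep("é", ...) keeps the check independent of how the é literal was typed at the console.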

Finally:

> Sys.getlocale()
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
> d<-c("México","México")
> for(i in 1:7){d1 <- str_sub(d[1],i,i); print(d1)}
[1] "M"
[1] "Ã"
[1] "©"
[1] "x"
[1] "i"
[1] "c"
[1] "o"
> print(grep("é",d))
[1] 1 2

So it seems I will have to change the computer's locale as suggested here. Also see here

PS: In case you wonder how I managed to type d<-c("México","México") with an English_United States.1252 locale: I set up a secondary Spanish (Traditional Sort) keyboard via Control Panel > Clock, Language and Region > Region and Language > Keyboards and Languages > Change Keyboards, then under Installed Services clicked Add and navigated to Spanish (Traditional Sort). Under Advanced Key Settings you can create a shortcut to switch keyboards; in my case, Shift+Alt. So to type ñ under the default English locale, I press Shift+Alt, then ;, and then Shift+Alt again to return to the English keyboard.

Comments (3)

洒一地阳光 2024-12-01 02:54:45

Take a look at what the encodings of temp$dest_nom_ent and "México" are, using Encoding(x). You may need to convert with enc2native or enc2utf8.
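As a rough illustration of this suggestion, the sketch below assumes the database bytes are Latin-1 and uses a string rebuilt from the bytes reported in the question. It checks the declared encoding, marks the string as latin1, and then converts it with enc2utf8().

```r
# String rebuilt from the bytes reported in the question (4d e9 78 69 63 6f)
x <- rawToChar(as.raw(c(0x4d, 0xe9, 0x78, 0x69, 0x63, 0x6f)))

# A string built from raw bytes carries no encoding mark
print(Encoding(x))

# Assumption: the bytes are Latin-1, so declare that (this changes no bytes)
Encoding(x) <- "latin1"

# enc2utf8() now knows the source encoding and re-encodes to UTF-8
y <- enc2utf8(x)
print(charToRaw(y))
print(Encoding(y))
```

Note that Encoding<- only accepts the marks "latin1", "UTF-8", "bytes" and "unknown"; other names such as "WINDOWS-1252" need iconv() instead.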

寂寞陪衬 2024-12-01 02:54:45

Try setting encoding of the string to one of "ISO_8859-1" "ISO_8859-15".

Two more suggestions..., then I give up: "UTF-16" "UTF-16LE" . The second is UTF little-endian I believe and have read that it is what Windows 7 actually uses. Might as well try "UTF-16BE" as well. (Material garnered from another stackexchange posting; https://superuser.com/questions/221593/windows-7-utf-8-and-unicode )

残龙傲雪 2024-12-01 02:54:45

Well, I could not determine the encoding of the accents, but the following accomplishes what I wanted. The trick was to convert to UTF-8, set the sub() option useBytes=TRUE, and follow Joran's suggestion to use sanitize.text.function=function(x){x} with xtable(). Here is the sample code; it is easy to loop over all accented vowels:

> temp1 <- unique(temp$dest_nom_ent)
> temp1
 [1] "Aguascalientes"                  "Baja California"                
 [3] "Baja California Sur"             "Campeche"                       
 [5] "Coahuila de Zaragoza"            "Colima"                         
 [7] "Chiapas"                         "Guanajuato"                     
 [9] "Guerrero"                        "Hidalgo"                        
[11] "Jalisco"                         "México"                         
[13] "Michoacán de Ocampo"             "Morelos"                        
[15] "Nayarit"                         "Oaxaca"                         
[17] "Puebla"                          "Querétaro"                      
[19] "Quintana Roo"                    "San Luis Potosí"                
[21] "Sinaloa"                         "Tabasco"                        
[23] "Tlaxcala"                        "Veracruz de Ignacio de la Llave"
[25] "Zacatecas"                      
> temp1 <- iconv(unique(temp1),"","UTF-8")
> temp1
 [1] "Aguascalientes"                  "Baja California"                
 [3] "Baja California Sur"             "Campeche"                       
 [5] "Coahuila de Zaragoza"            "Colima"                         
 [7] "Chiapas"                         "Guanajuato"                     
 [9] "Guerrero"                        "Hidalgo"                        
[11] "Jalisco"                         "México"                         
[13] "Michoacán de Ocampo"             "Morelos"                        
[15] "Nayarit"                         "Oaxaca"                         
[17] "Puebla"                          "Querétaro"                      
[19] "Quintana Roo"                    "San Luis Potosí"                
[21] "Sinaloa"                         "Tabasco"                        
[23] "Tlaxcala"                        "Veracruz de Ignacio de la Llave"
[25] "Zacatecas"                      
> Encoding(temp1)
 [1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
 [8] "unknown" "unknown" "unknown" "unknown" "UTF-8"   "UTF-8"   "unknown"
[15] "unknown" "unknown" "unknown" "UTF-8"   "unknown" "UTF-8"   "unknown"
[22] "unknown" "unknown" "unknown" "unknown"
> temp2 <- sub("é", "\\\\'{e}", temp1, useBytes = TRUE)
> temp2 <- data.frame(temp2)
> print(xtable(temp2),sanitize.text.function=function(x){x})
% latex table generated in R 2.13.1 by xtable 1.5-6 package
% Fri Jul 15 13:52:44 2011
\begin{table}[ht]
\begin{center}
\begin{tabular}{rl}
  \hline
 & temp2 \\ 
  \hline
1 & Aguascalientes \\ 
  2 & Baja California \\ 
  3 & Baja California Sur \\ 
  4 & Campeche \\ 
  5 & Coahuila de Zaragoza \\ 
  6 & Colima \\ 
  7 & Chiapas \\ 
  8 & Guanajuato \\ 
  9 & Guerrero \\ 
  10 & Hidalgo \\ 
  11 & Jalisco \\ 
  12 & M\'{e}xico \\ 
  13 & Michoacán de Ocampo \\ 
  14 & Morelos \\ 
  15 & Nayarit \\ 
  16 & Oaxaca \\ 
  17 & Puebla \\ 
  18 & Quer\'{e}taro \\ 
  19 & Quintana Roo \\ 
  20 & San Luis Potosí \\ 
  21 & Sinaloa \\ 
  22 & Tabasco \\ 
  23 & Tlaxcala \\ 
  24 & Veracruz de Ignacio de la Llave \\ 
  25 & Zacatecas \\ 
   \hline
\end{tabular}
\end{center}
\end{table}

As actually implemented in a loop:

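library(xtable)  # provides xtable() and its print method

# Convert both name columns from the native encoding ("") to UTF-8.
temp$dest_nom_ent <- iconv(temp$dest_nom_ent, "", "UTF-8")
temp$dest_nom_mun <- iconv(temp$dest_nom_mun, "", "UTF-8")

accents <- c("á","é","í","ó","ú")
latex <- c("\\\\'{a}","\\\\'{e}","\\\\'{i}","\\\\'{o}","\\\\'{u}")
for (i in seq_along(accents)) {
    # gsub() replaces every occurrence; sub() would stop at the first one.
    temp$dest_nom_ent <- gsub(accents[i], latex[i],
            temp$dest_nom_ent, useBytes = TRUE)
    # Source and target are both dest_nom_mun here.
    temp$dest_nom_mun <- gsub(accents[i], latex[i],
            temp$dest_nom_mun, useBytes = TRUE)
}
# Write the LaTeX table to a file, leaving the backslashes unescaped.
capture.output(
        print(xtable(temp), sanitize.text.function = function(x) {x}),
        file = "../paper/rTables.tex", append = FALSE)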

Still, the answer is incomplete in that I cannot explain what exactly was going on. Found it through trial and error.
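One possible way to make the replacement less dependent on the console's locale is sketched below. It is not the method used above. It assumes the source bytes are Latin-1, writes the accented vowels as \u escapes so the patterns do not depend on how a literal was typed, and uses fixed = TRUE so the replacement text is taken literally. The helper name to_latex_accents is hypothetical.

```r
# Hypothetical helper: convert Latin-1 text to UTF-8, then swap accented
# vowels for LaTeX accent commands.
to_latex_accents <- function(x, from = "latin1") {
  x <- iconv(x, from = from, to = "UTF-8")
  # \u escapes: á, é, í, ó, ú
  accents <- c("\u00e1", "\u00e9", "\u00ed", "\u00f3", "\u00fa")
  # With fixed = TRUE the replacement is literal, so one escaped
  # backslash in the R string yields one backslash in the output.
  latex <- c("\\'{a}", "\\'{e}", "\\'{i}", "\\'{o}", "\\'{u}")
  for (i in seq_along(accents)) {
    x <- gsub(accents[i], latex[i], x, fixed = TRUE)
  }
  x
}

# Usage with the bytes reported in the question (4d e9 78 69 63 6f)
sample_name <- rawToChar(as.raw(c(0x4d, 0xe9, 0x78, 0x69, 0x63, 0x6f)))
result <- to_latex_accents(sample_name)
cat(result, "\n")
```

The result can be passed to xtable() with sanitize.text.function=function(x){x}, as in the answer above, so that the backslashes are not escaped again.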
