Problems importing a Stata file with French accented characters in R


I have a large Stata file that I think has some French accented characters that have been saved poorly.

When I import the file with the encoding set to blank, it won't read in. When I set it to latin1 it will read in, but French accented characters are not rendered properly in one variable, and I'm certain in others as well. I had a similar problem with another Stata file and tried to apply the fix suggested here (which actually did not work in that case, but seems on point).

To be honest, this seems to be the real problem here somehow. A lot of the garbled characters are the "actual" values and they match up to what is "expected", but I have no idea how to go back.

Reproducible code is here:


library(haven)
library(here)
library(tidyverse)
library(labelled)
#Download file
temp <- tempfile()
temp2 <- tempfile()

download.file("https://github.com/sjkiss/Occupation_Recode/raw/main/Data/CES-E-2019-online_F1.dta.zip", temp)
unzip(zipfile = temp, exdir = temp2)
ces19web <- read_dta(file.path(temp2, "CES-E-2019-online_F1.dta"), encoding="latin1")

#Try with encoding set to blank, it won't work. 
#ces19web <- read_dta(file.path(temp2, "CES-E-2019-online_F1.dta"), encoding="")

unlink(c(temp, temp2))

#### Diagnostic section for accented characters ####
ces19web$cps19_prov_id
#Note value labels are cut-off at accented characters in Quebec. 
#I know this occupation has messed up characters
ces19web %>% 
  filter(str_detect(pes19_occ_text,"assembleur-m")) %>% 
  select(cps19_ResponseId, pes19_occ_text)
#Check the encodings of the occupation titles and store in a variable encoding
ces19web$encoding<-Encoding(ces19web$pes19_occ_text)
#Check encoding of problematic characters
ces19web %>% 
  filter(str_detect(pes19_occ_text,"assembleur-m")) %>% 
  select(cps19_ResponseId, pes19_occ_text, encoding) 
#Write out messy occupation titles
ces19web %>% 
  filter(str_detect(pes19_occ_text,"Ã|©")) %>% 
  select(cps19_ResponseId, pes19_occ_text, encoding) %>% 
  write_csv(file=here("Data/messy.csv"))

#Try to fix

source("https://github.com/sjkiss/Occupation_Recode/raw/main/fix_encodings.R")
#store the messy variables in messy
messy<-ces19web$pes19_occ_text
library(stringi)
#Try to clean with the function fix_encodings
ces19web$pes19_occ_text_cleaned<-stri_replace_all_fixed(messy, names(fixes), fixes, vectorize_all = F)

#Examine
ces19web %>% 
  filter(str_detect(pes19_occ_text_cleaned,"Ã|©")) %>% 
  select(cps19_ResponseId, pes19_occ_text, pes19_occ_text_cleaned, encoding) %>% 
  head()


2 Answers

旧情勿念 2025-02-04 19:53:41


Your data file is a dta version 113 file (the first byte in the file is 113). That is, it's a Stata 8 file, and in particular a pre-Stata 14 file, hence it uses a custom encoding (Stata >= 14 uses UTF-8).

So using the encoding argument of read_dta seems right. But there are a few problems here, as can be seen with a hex editor.
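
If you don't have a hex editor at hand, here is a minimal sketch of the same check from R (an illustration only, assuming the extracted .dta file sits in the working directory; in these old, pre-117 dta formats the first byte is the format version):

con <- file("CES-E-2019-online_F1.dta", "rb")
first_byte <- readBin(con, what = "integer", n = 1, size = 1, signed = FALSE)  # read one unsigned byte
close(con)
first_byte

output: 113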

First, the truncated labels at accented letters (like Québec → Qu) are actually not caused by haven: they are stored truncated in the dta file.
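
You can see this from R as well (a small sketch using the labelled package already loaded in the question's code; val_labels() returns the value labels that haven attached to the column, and the Québec label should already arrive cut down to "Qu"):

labelled::val_labels(ces19web$cps19_prov_id)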

The pes19_occ_text is encoded in UTF-8, as you can check with:

ces19web <- read_dta("CES-E-2019-online_F1.dta", encoding="UTF-8")
grep("^Producteur", unique(ces19web$pes19_occ_text), value = T)

output: "Producteur télé"

This "é" is characteristic of UTF-8 data (here "é") read as latin1.
However, if you try to import with encoding="UTF-8", read_dta will fail: there might be other non-UTF-8 characters in the file, that read_dta can't read as UTF-8. We have to do somthing after the import.
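
To get an idea of how many values are affected after the latin1 import, here is a hedged diagnostic (not part of the fix itself): strings that were really UTF-8 in the file show the "Ã" signature, and strings that cannot be mapped back to latin1 at all come out of iconv() as NA.

occ <- ces19web$pes19_occ_text
sum(grepl("Ã", occ), na.rm = TRUE)                                   # values showing the mojibake signature
sum(is.na(iconv(occ, from = "UTF-8", to = "latin1")) & !is.na(occ))  # values that cannot be converted back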

Here, read_dta is doing something nasty: it imports "Producteur tÃ©lÃ©", treating the data as if it were latin1 and converting it to UTF-8, so the resulting string really contains the UTF-8 characters "Ã" and "©".

To fix this, you first have to convert back to latin1. The string will still be "Producteur tÃ©lÃ©", but encoded in latin1.

Then, instead of converting again, you simply have to force the encoding to UTF-8, without changing the data.

Here is the code:

ces19web <- read_dta("CES-E-2019-online_F1.dta", encoding="")
ces19web$pes19_occ_text <- iconv(ces19web$pes19_occ_text, from = "UTF-8", to = "latin1")
Encoding(ces19web$pes19_occ_text) <- "UTF-8"
grep("^Producteur", unique(ces19web$pes19_occ_text), value = T)

output: "Producteur télé"

You can do the same on other variables with diacritics.


The use of iconv here may be more understandable if we convert to raw with charToRaw, to see the actual bytes. After importing the data, "tÃ©lÃ©" is the representation of "74 c3 83 c2 a9 6c c3 83 c2 a9" in UTF-8. The first byte 0x74 (in hex) is the letter "t", and 0x6c is the letter "l". In between, we have four bytes, instead of only two for the letter "é" in UTF-8 ("c3 a9", i.e. "Ã©" when read as latin1).

Actually, "c3 83" is "Ã" and "c2 a9" is "©".

Therefore, we first have to convert these characters back to latin1, so that they take one byte each. Then "74 c3 a9 6c c3 a9" is the encoding of "tÃ©lÃ©", but this time in latin1. That is, the string has the same bytes as "télé" encoded in UTF-8, and we just need to tell R that the encoding is not latin1 but UTF-8 (and this is not a conversion).

See also the help pages of Encoding and iconv.


Now a good question may be: how did you end up with such a bad dta file in the first place? It's quite surprising for a Stata 8 file to hold UTF-8 data.

The first idea that comes to mind is a bad use of the saveold command, which allows one to save data in a Stata file format for an older version. But according to the reference manual, in Stata 14 saveold can only store files for Stata >= 11.

Maybe a third-party tool did this, as well as the bad truncation of labels? It might be SAS or SPSS, for instance. I don't know where your data come from, but it's not uncommon for public providers to use SAS for internal work and to publish converted datasets. For instance, datasets from the European Social Survey are provided in SAS, SPSS and Stata format, but if I remember correctly, initially it was only SAS and SPSS, and Stata came later: the Stata files are probably just converted using another tool.


Answer to the comment: how to loop over character variables to do the same? There is a smarter way with dplyr, but here is a simple loop with base R.

ces19web <- read_dta("CES-E-2019-online_F1.dta")

for (n in names(ces19web)) {
  v <- ces19web[[n]]
  if (is.character(v)) {
    v <- iconv(v, from = "UTF-8", to = "latin1") 
    Encoding(v) <- "UTF-8"
  }
  ces19web[[n]] <- v
}
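
For reference, the dplyr version mentioned above could look like this (a sketch, not from the original answer: fix_utf8 is just a helper name for illustration, and across()/where() require dplyr >= 1.0):

library(dplyr)

fix_utf8 <- function(x) {
  x <- iconv(x, from = "UTF-8", to = "latin1")  # undo the spurious latin1 -> UTF-8 conversion
  Encoding(x) <- "UTF-8"                        # the bytes are already valid UTF-8, just declare them as such
  x
}

ces19web <- ces19web %>%
  mutate(across(where(is.character), fix_utf8))
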
薄荷港 2025-02-04 19:53:41


Assuming a UTF-8 locale, which can be checked with:

Sys.getlocale()
#> [1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8"

At first we had this somewhere and everything was fine:

utf8 <- "Producteur télé"
Encoding(utf8)
#> [1] "UTF-8"
charToRaw(utf8) # é encoded to c3 a9 as expected for utf-8
#>  [1] 50 72 6f 64 75 63 74 65 75 72 20 74 c3 a9 6c c3 a9
utf8
#> [1] "Producteur télé"

But something bad happened: the string was treated as a latin1 string, in which c3 and a9 are two separate characters "Ã" and "©", and it was wrongly converted from latin1 to UTF-8. So now, instead of having "é" in UTF-8 (with bytes that read as "Ã©" in latin1), we have "Ã©" in UTF-8, coded with the two characters "Ã" (c3 83) and "©" (c2 a9):

oops <- iconv(utf8, from = "latin1", to = "UTF-8")
Encoding(oops)
#> [1] "UTF-8"
charToRaw(oops)
#>  [1] 50 72 6f 64 75 63 74 65 75 72 20 74 c3 83 c2 a9 6c c3 83 c2 a9
oops
#> [1] "Producteur télé"

This string is not a proper (meaningful) UTF-8 or latin1 string anymore: "é" is e9 in latin1, or c3 a9 in UTF-8, but never c3 83 c2 a9!

We can undo the bad translation though:

proper_utf8_encoding_with_latin1_marking <- 
  iconv(oops, from = "UTF-8", to = "latin1")
Encoding(proper_utf8_encoding_with_latin1_marking)
#> [1] "latin1"
# c3 a9 is é in utf-8, not in latin1!
charToRaw(proper_utf8_encoding_with_latin1_marking) 
#>  [1] 50 72 6f 64 75 63 74 65 75 72 20 74 c3 a9 6c c3 a9
proper_utf8_encoding_with_latin1_marking
#> [1] "Producteur télé"

From there we can build either a proper utf-8 string (recommended) or a proper latin1 string

utf8 <- proper_utf8_encoding_with_latin1_marking
Encoding(utf8) <- "UTF-8"
Encoding(utf8)
#> [1] "UTF-8"

charToRaw(utf8)
#>  [1] 50 72 6f 64 75 63 74 65 75 72 20 74 c3 a9 6c c3 a9

utf8
#> [1] "Producteur télé"

latin1 <- 
  iconv(proper_utf8_encoding_with_latin1_marking, from = "UTF-8", to = "latin1")
Encoding(latin1)
#> [1] "latin1"
charToRaw(latin1) # e9 is é in latin1
#>  [1] 50 72 6f 64 75 63 74 65 75 72 20 74 e9 6c e9
latin1
#> [1] "Producteur télé"

Part of encoding hell is that R sees those MOSTLY as the same, because most of the time it doesn't matter:

identical(utf8, latin1)
#> [1] TRUE

But the truth can be seen with the Encoding() and charToRaw() functions, or when serializing, which shows both pieces of information.

waldo::compare(
  serialize(utf8, NULL),
  serialize(latin1, NULL)
)
#> `old[31:42]`: "01" "00" "00" "80" "09" "00" "00" "00" "11" "50" and 2 more...
#> `new[31:42]`: "01" "00" "00" "40" "09" "00" "00" "00" "0f" "50" ...          
#> 
#> `old[49:56]`: "72" "20" "74" "c3" "a9" "6c" "c3" "a9"
#> `new[49:54]`: "72" "20" "74" "e9" "6c" "e9"

The three differences we see above are the encoding mark (80 for UTF-8, 40 for latin1, 00 for unknown), the length in bytes (0x11 = 17 in decimal, 0x0f = 15 in decimal), and the byte values of the "é" characters ("c3" "a9" vs "e9").

Fun fact: if we change the locale to latin1 (here on a Mac), for reasons that I don't understand, oops will actually print "é" (and the others won't print well anymore), proving that we can't always trust print() and identical(), and that charToRaw(), Encoding() and iconv() are your friends for debugging encoding hell.

Sys.setlocale("LC_CTYPE", "en_US.ISO8859-1")
oops
#> [1] "Producteur télé"
