R: importing a Stata file with French accented characters
I have a large Stata file that I think has some French accented characters that have been saved poorly.
When I import the file with the encoding set to blank, it won't read in. When I set it to latin1 it will read in, but in one variable, and I'm certain in others, French accented characters are not rendered properly. I had a similar problem with another Stata file and I tried to apply the fix here (which actually did not work in that case, but seems on point).
To be honest, this seems to be the real problem here somehow. A lot of the garbled characters are "actual" and they match up to what is "expected", but I have no idea how to go back.
Reproducible code is here:
library(haven)
library(here)
library(tidyverse)
library(labelled)
#Download file
temp <- tempfile()
temp2 <- tempfile()
download.file("https://github.com/sjkiss/Occupation_Recode/raw/main/Data/CES-E-2019-online_F1.dta.zip", temp)
unzip(zipfile = temp, exdir = temp2)
ces19web <- read_dta(file.path(temp2, "CES-E-2019-online_F1.dta"), encoding="latin1")
#Try with encoding set to blank, it won't work.
#ces19web <- read_dta(file.path(temp2, "CES-E-2019-online_F1.dta"), encoding="")
unlink(c(temp, temp2))
#### Diagnostic section for accented characters ####
ces19web$cps19_prov_id
#Note value labels are cut-off at accented characters in Quebec.
#I know this occupation has messed up characters
ces19web %>%
filter(str_detect(pes19_occ_text,"assembleur-m")) %>%
select(cps19_ResponseId, pes19_occ_text)
#Check the encodings of the occupation titles and store in a variable encoding
ces19web$encoding<-Encoding(ces19web$pes19_occ_text)
#Check encoding of problematic characters
ces19web %>%
filter(str_detect(pes19_occ_text,"assembleur-m")) %>%
select(cps19_ResponseId, pes19_occ_text, encoding)
#Write out messy occupation titles
ces19web %>%
filter(str_detect(pes19_occ_text,"Ã|©")) %>%
select(cps19_ResponseId, pes19_occ_text, encoding) %>%
write_csv(file=here("Data/messy.csv"))
#Try to fix
source("https://github.com/sjkiss/Occupation_Recode/raw/main/fix_encodings.R")
#store the messy variables in messy
messy<-ces19web$pes19_occ_text
library(stringi)
#Try to clean with the function fix_encodings
ces19web$pes19_occ_text_cleaned<-stri_replace_all_fixed(messy, names(fixes), fixes, vectorize_all = F)
#Examine
ces19web %>%
filter(str_detect(pes19_occ_text_cleaned,"Ã|©")) %>%
select(cps19_ResponseId, pes19_occ_text, pes19_occ_text_cleaned, encoding) %>%
head()
2 Answers
Your data file is a dta version 113 file (the first byte in the file is 113). That is, it's a Stata 8 file, and in particular pre-Stata 14, hence it uses a custom encoding (Stata >= 14 uses UTF-8).
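If you want to confirm that byte yourself, a minimal check could look like this (a sketch, reusing the temp2 path from the question's code, so it must run before unlink() is called):

con <- file(file.path(temp2, "CES-E-2019-online_F1.dta"), "rb")
readBin(con, what = "integer", n = 1, size = 1)  # returns 113 for a dta format 113 file
close(con)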
So using the encoding argument of read_dta seems right. But there are a few problems here, as can be seen with a hex editor.
First, the truncated labels at accented letters (like Québec → Qu) are actually not caused by haven: they are stored truncated in the dta file.
The pes19_occ_text variable is encoded in UTF-8, as you can check with the code below: the "Ã©" it shows is characteristic of UTF-8 data (here "é") read as latin1.
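For instance, reusing the question's own diagnostic filter (a sketch, assuming the latin1 import above):

ces19web %>%
  filter(str_detect(pes19_occ_text, "assembleur-m")) %>%
  pull(pes19_occ_text)
# The accented characters come out as "Ã©" instead of "é"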
However, if you try to import with encoding = "UTF-8", read_dta will fail: there might be other non-UTF-8 characters in the file that read_dta can't read as UTF-8. We have to do something after the import.
Here, read_dta is doing something nasty: it imports "Producteur tÃ©lÃ©" as if it were latin1 data, and converts it to UTF-8, so the encoded string really has the UTF-8 characters "Ã" and "©". To fix this, you first have to convert back to latin1. The string will still be "Producteur tÃ©lÃ©", but encoded in latin1.
Then, instead of converting again, you simply have to force the encoding to UTF-8, without changing the data.
Here is the code:
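# Following the two steps just described:
# 1. convert the doubly-encoded strings back to latin1 bytes
ces19web$pes19_occ_text <- iconv(ces19web$pes19_occ_text, from = "UTF-8", to = "latin1")
# 2. the bytes are now valid UTF-8: relabel the encoding without converting
Encoding(ces19web$pes19_occ_text) <- "UTF-8"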
You can do the same on other variables with diacritics.
The use of iconv here may be more understandable if we convert to raw with charToRaw, to see the actual bytes. After importing the data, "tÃ©lÃ©" is the representation of "74 c3 83 c2 a9 6c c3 83 c2 a9" in UTF-8. The first byte 0x74 (in hex) is the letter "t", and 0x6c is the letter "l". In between we have four bytes, instead of only two for the letter "é" in UTF-8 ("c3 a9", i.e. "Ã©" when read as latin1). Actually, "c3 83" is "Ã" and "c2 a9" is "©".
Therefore, we first have to convert these characters back to latin1, so that they take one byte each. Then "74 c3 a9 6c c3 a9" is the encoding of "tÃ©lÃ©", but this time in latin1. That is, the string has the same bytes as "télé" encoded in UTF-8, and we just need to tell R that the encoding is not latin1 but UTF-8 (and this is not a conversion).
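To see it on a single value (a sketch, assuming a UTF-8 session):

x <- "tÃ©lÃ©"          # what the latin1 import produced for "télé"
charToRaw(x)            # 74 c3 83 c2 a9 6c c3 83 c2 a9
x <- iconv(x, from = "UTF-8", to = "latin1")
charToRaw(x)            # 74 c3 a9 6c c3 a9, i.e. "télé" in UTF-8
Encoding(x) <- "UTF-8"  # relabel only, no conversion
x                       # "télé"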
See also the help pages of Encoding and iconv.
Now a good question may be: how did you end up with such a bad dta file in the first place? It's quite surprising for a Stata 8 file to hold UTF-8 data.
The first idea that comes to mind is a bad use of the saveold command, which allows one to save data in a Stata file for an older version. But according to the reference manual, in Stata 14 saveold can only store files for Stata >= 11.
Maybe a third-party tool did this, as well as the bad truncation of labels? It might be SAS or SPSS, for instance. I don't know where your data come from, but it's not uncommon for public providers to use SAS for internal work and to publish converted datasets. For instance, datasets from the European Social Survey are provided in SAS, SPSS and Stata formats, but if I remember correctly, initially it was only SAS and SPSS, and Stata came later: the Stata files are probably just converted using another tool.
Answer to the comment: how to loop over character variables to do the same? There is a smarter way with dplyr, but here is a simple loop with base R.
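A minimal sketch of such a loop (it assumes every character column needs the same repair, and note that iconv drops attributes such as variable labels):

for (v in names(ces19web)) {
  if (is.character(ces19web[[v]])) {
    x <- iconv(ces19web[[v]], from = "UTF-8", to = "latin1")
    Encoding(x) <- "UTF-8"
    ces19web[[v]] <- x
  }
}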
Assuming a UTF-8 locale, which can be checked with:
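Sys.getlocale("LC_CTYPE")
# e.g. "en_US.UTF-8" on a UTF-8 system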
At first we had this somewhere and everything was fine:
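# A proper string in a UTF-8 session ("good" is an illustrative name):
good <- "Producteur télé"
charToRaw(good)  # 17 bytes; each "é" is the two bytes c3 a9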
But something bad happened: the string was considered as a latin1 string, for which c3 and a9 are 2 separate characters ("Ã" and "©"), and it was wrongly converted from latin1 to utf8. So now, instead of having "é" in utf8 (with bytes that translate to "Ã©" in latin1), we have "Ã©" in utf8, coded with the 2 characters c3 83 and c2 a9.
This string is not a proper (meaningful) utf-8 or latin1 string anymore: "é" is e9 in latin1, or c3 a9 in utf-8, but never c3 83 c2 a9!
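That bad step can be reproduced like this ("oops" is the value the fun fact at the end refers to):

oops <- iconv(good, from = "latin1", to = "UTF-8")
oops             # "Producteur tÃ©lÃ©"
charToRaw(oops)  # each "é" has become the four bytes c3 83 c2 a9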
We can undo the bad translation though:
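# Reinterpret as UTF-8 and convert back to latin1: c3 83 c2 a9 -> c3 a9
undone <- iconv(oops, from = "UTF-8", to = "latin1")
charToRaw(undone)  # same bytes as "good", but marked latin1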
From there we can build either a proper utf-8 string (recommended) or a proper latin1 string:
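# Recommended: same bytes, just relabelled as UTF-8 (no conversion)
proper_utf8 <- undone
Encoding(proper_utf8) <- "UTF-8"
proper_utf8    # "Producteur télé"
# Alternative: an actual conversion to latin1, where "é" is the single byte e9
proper_latin1 <- iconv(undone, from = "UTF-8", to = "latin1")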
Part of encoding hell is that R sees those mostly as the same, because most of the time it doesn't matter.
But the truth can be seen with the Encoding() and charToRaw() functions, or when serializing, which shows both pieces of information. The 3 differences to look at are the encoding marking (80 for UTF-8, 40 for latin1, 00 for unknown), the length in bytes (0x11 = 17 in decimal vs 0x0f = 15 in decimal), and the byte values of the "é" characters ("c3" "a9" vs "e9").
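For instance, with the strings built above:

proper_utf8 == proper_latin1   # TRUE: R compares the translated content
Encoding(proper_utf8)          # "UTF-8"
Encoding(proper_latin1)        # "latin1"
charToRaw(proper_utf8)         # 17 bytes, "é" is c3 a9
charToRaw(proper_latin1)       # 15 bytes, "é" is e9
serialize(proper_utf8, NULL)   # the raw form also carries the encoding flag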
Fun fact: if we change the locale to latin1 (here on a Mac), for reasons that I don't understand, oops will actually print "é" (and the others won't print well anymore), proving that we can't always trust print() and identical(), and that charToRaw(), Encoding() and iconv() are your friends to debug encoding hell.