Some French accented characters are encoded as UTF-8 but still aren't rendering properly

Posted 2025-01-20 18:22:47

Hi there: I'm importing a Stata file that has a lot of French accented characters. On import, I set the encoding to UTF-8. However, some of the accented characters are not rendering properly. See a sample of rows from my dataset below.
How do I handle this?

test <- tibble::tribble(
  ~municipality,
  "Sainte-Anne-de-Beaupré",
  "Sainte-Anne-de-Beaupré",
  "Sainte-Anne-de-Beaupré",
  "Beaupré",
  "Beaupré",
  "Beaupré",
  "Beaupré",
  "Beaupré",
  "Beaupré"
)
Encoding(test$municipality)
Encoding(test$municipality) <- "UTF-8"  # note: R only recognises the exact label "UTF-8"; other labels are treated as "unknown"
test$municipality
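
For context, the import step was presumably something along these lines (a sketch only: the file name is made up, and haven::read_dta's encoding argument mainly matters for files written by Stata 13 or earlier, since newer Stata files are stored as UTF-8):

library(haven)
# hypothetical path; replace with the actual .dta file
test <- read_dta("municipalities.dta", encoding = "UTF-8")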

反差帅 2025-01-27 18:22:47


As Giacomo mentions, this looks like a file where part of the text was UTF-8 (you also show correctly encoded é's) but was read as if it were Latin-1 and then encoded to UTF-8 again. That means your import encoding is correct: the mojibake sequences such as é are themselves valid UTF-8 characters and are displayed as such. What you can do is repair the damage that was done in the past.
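
To make that round trip concrete, here is a small base-R sketch (not part of the original answer) showing how a correctly encoded "é" turns into "Ã©" when its UTF-8 bytes are decoded as Latin-1:

# "é" is U+00E9 and is stored in UTF-8 as the two bytes 0xc3 0xa9
charToRaw(enc2utf8("é"))
# [1] c3 a9

# Read as Latin-1 (where every byte maps to the code point of the same value),
# those two bytes become U+00C3 "Ã" and U+00A9 "©":
intToUtf8(c(0xc3, 0xa9))
# [1] "Ã©"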

Knowing how it happened means we know how to fix it!

So I wrote a function a while ago (it even handles the major screw-up of a triple wrong encoding): it simulates what each character turns into after being wrongly re-encoded and saved as UTF-8 again, once, twice and three times, so that the corrupted forms can be mapped back to the originals.

FixEncoding <- function() {
  # build the Unicode range U+00A0..U+00FF from https://www.i18nqa.com/debug/utf8-debug.html
  range <- sprintf("%x", seq(strtoi("0xa0", 16L), strtoi("0xff", 16L)))
  unicode <- vapply(range, FUN.VALUE = character(1), function(x) { parse(text = paste0("'\\u00", x, "'"))[[1]] })
  # add the ones that are missing from that range (the red ones in https://www.i18nqa.com/debug/utf8-debug.html)
  unicode <- c(c("\u0168", "\u0152", "\u017d", "\u0153", "\u017e", "\u0178",
                 "\u2019", "\u20ac", "\u201a", "\u0192", "\u201e", "\u2026",
                 "\u2020", "\u2021", "\u02c6", "\u2030", "\u0160", "\u2030"),
               unicode)
  # corrupt every character once: setting an encoding label R does not recognise
  # (such as "Windows-1252") just removes the UTF-8 mark, so iconv() then
  # converts from the native locale encoding, reproducing the "UTF-8 bytes read
  # as Windows-1252" mistake. This simulation therefore assumes a Windows-1252
  # native locale; in a UTF-8 locale the conversion changes nothing.
  once <- vapply(unicode, FUN.VALUE = character(1), function(x) {
    Encoding(x) <- "Windows-1252"
    iconv(x, to = "UTF-8")
  })
  fix_once <- unicode
  names(fix_once) <- once
  # corrupt a second time ...
  twice <- vapply(once, FUN.VALUE = character(1), function(x) {
    Encoding(x) <- "Windows-1252"
    iconv(x, to = "UTF-8")
  })
  fix_twice <- unicode
  names(fix_twice) <- twice
  # ... and a third time
  triple <- vapply(twice, FUN.VALUE = character(1), function(x) {
    Encoding(x) <- "Windows-1252"
    iconv(x, to = "UTF-8")
  })
  fix_triple <- unicode
  names(fix_triple) <- triple
  # lookup table: names are the corrupted forms, values the original characters.
  # triple goes first so the longest (most deeply corrupted) patterns are
  # replaced before their shorter twice/once substrings.
  fixes <- c(fix_triple, fix_twice, fix_once)
  return(fixes)
}

fixes <- FixEncoding()

Let's run it on your data:

v <- c("Sainte-Anne-de-Beaupré", "Beaupré")
v
# [1] "Sainte-Anne-de-Beaupré" "Beaupré"

stringi::stri_replace_all_fixed(v, names(fixes), fixes, vectorize_all = FALSE)
# [1] "Sainte-Anne-de-Beaupré" "Beaupré"
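
To apply the same fix to the imported data frame from the question, the call would look something like this (column name taken from the sample above):

test$municipality <- stringi::stri_replace_all_fixed(
  test$municipality, names(fixes), fixes, vectorize_all = FALSE
)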

Another example:

str <- "\u006C\u00E9\u006C\u00E8\u006C\u00F6\u006C\u00E3"
str
# [1] "lélèlölã"

# how to corrupt it once
Encoding(str) <- "Windows-1252"
str <- iconv(str, to = "UTF-8")
str
# [1] "lélèlölã"

# put once-, twice- and triple-wrongly-encoded strings into messy
messy <- c("lélèlölã", "lélèlölã", "lélèlölã")

# All three strings above would be fixed
stringi::stri_replace_all_fixed(messy, names(fixes), fixes, vectorize_all = FALSE)
# [1] "lélèlölã" "lélèlölã" "lélèlölã"