Some French accented characters are encoded as UTF-8 but still aren't rendering properly

Posted 2025-01-20 18:22:47

Hi there: I'm importing a Stata file that has a lot of French accented characters. On import, I set the encoding to UTF-8. However, some of the accented characters are not rendering properly. See a sample of rows from my dataset below.
How do I handle this?

test <- tibble::tribble(
  ~municipality,
  "Sainte-Anne-de-Beaupré",
  "Sainte-Anne-de-Beaupré",
  "Sainte-Anne-de-Beaupré",
  "Beaupré",
  "Beaupré",
  "Beaupré",
  "Beaupré",
  "Beaupré",
  "Beaupré"
)
Encoding(test$municipality)
Encoding(test$municipality) <- "UTF-8"  # note: R only recognises the exact label "UTF-8"; other labels are treated as "unknown"
test$municipality
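
For context, the import step was presumably something along these lines (a sketch only: the file name is made up, and haven::read_dta's encoding argument mainly matters for files written by Stata 13 or earlier, since newer Stata files are stored as UTF-8):

library(haven)
# hypothetical path; replace with the actual .dta file
test <- read_dta("municipalities.dta", encoding = "UTF-8")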

反差帅 2025-01-27 18:22:47


As Giacomo mentions, this looks like a file where part of the text was UTF-8 (you also show correctly encoded é's) but was read as if it were Latin-1 and then encoded to UTF-8 again. That means your import encoding is correct: the mojibake sequences such as é are themselves valid UTF-8 characters and are displayed as such. What you can do is repair the damage that was done in the past.
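
To make that round trip concrete, here is a small base-R sketch (not part of the original answer) showing how a correctly encoded "é" turns into "Ã©" when its UTF-8 bytes are decoded as Latin-1:

# "é" is U+00E9 and is stored in UTF-8 as the two bytes 0xc3 0xa9
charToRaw(enc2utf8("é"))
# [1] c3 a9

# Read as Latin-1 (where every byte maps to the code point of the same value),
# those two bytes become U+00C3 "Ã" and U+00A9 "©":
intToUtf8(c(0xc3, 0xa9))
# [1] "Ã©"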

Knowing how it happened means we know how to fix it!

So I wrote a function a while ago (it even handles the major screw-up of a triple wrong encoding): it simulates what each character turns into after being wrongly re-encoded and saved as UTF-8 again, once, twice and three times, so that the corrupted forms can be mapped back to the originals.

FixEncoding <- function() {
  # build the Unicode range U+00A0..U+00FF from https://www.i18nqa.com/debug/utf8-debug.html
  range <- sprintf("%x", seq(strtoi("0xa0", 16L), strtoi("0xff", 16L)))
  unicode <- vapply(range, FUN.VALUE = character(1), function(x) { parse(text = paste0("'\\u00", x, "'"))[[1]] })
  # add the ones that are missing from that range (the red ones in https://www.i18nqa.com/debug/utf8-debug.html)
  unicode <- c(c("\u0168", "\u0152", "\u017d", "\u0153", "\u017e", "\u0178",
                 "\u2019", "\u20ac", "\u201a", "\u0192", "\u201e", "\u2026",
                 "\u2020", "\u2021", "\u02c6", "\u2030", "\u0160", "\u2030"),
               unicode)
  # corrupt every character once: setting an encoding label R does not recognise
  # (such as "Windows-1252") just removes the UTF-8 mark, so iconv() then
  # converts from the native locale encoding, reproducing the "UTF-8 bytes read
  # as Windows-1252" mistake. This simulation therefore assumes a Windows-1252
  # native locale; in a UTF-8 locale the conversion changes nothing.
  once <- vapply(unicode, FUN.VALUE = character(1), function(x) {
    Encoding(x) <- "Windows-1252"
    iconv(x, to = "UTF-8")
  })
  fix_once <- unicode
  names(fix_once) <- once
  # corrupt a second time ...
  twice <- vapply(once, FUN.VALUE = character(1), function(x) {
    Encoding(x) <- "Windows-1252"
    iconv(x, to = "UTF-8")
  })
  fix_twice <- unicode
  names(fix_twice) <- twice
  # ... and a third time
  triple <- vapply(twice, FUN.VALUE = character(1), function(x) {
    Encoding(x) <- "Windows-1252"
    iconv(x, to = "UTF-8")
  })
  fix_triple <- unicode
  names(fix_triple) <- triple
  # lookup table: names are the corrupted forms, values the original characters.
  # triple goes first so the longest (most deeply corrupted) patterns are
  # replaced before their shorter twice/once substrings.
  fixes <- c(fix_triple, fix_twice, fix_once)
  return(fixes)
}

fixes <- FixEncoding()

Let's run it on your data:

v <- c("Sainte-Anne-de-Beaupré", "Beaupré")
v
# [1] "Sainte-Anne-de-Beaupré" "Beaupré"

stringi::stri_replace_all_fixed(v, names(fixes), fixes, vectorize_all = FALSE)
# [1] "Sainte-Anne-de-Beaupré" "Beaupré"
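
To apply the same fix to the imported data frame from the question, the call would look something like this (column name taken from the sample above):

test$municipality <- stringi::stri_replace_all_fixed(
  test$municipality, names(fixes), fixes, vectorize_all = FALSE
)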

Another example:

str <- "\u006C\u00E9\u006C\u00E8\u006C\u00F6\u006C\u00E3"
str
# [1] "lélèlölã"

# how to corrupt it once
Encoding(str) <- "Windows-1252"
str <- iconv(str, to = "UTF-8")
str
# [1] "lélèlölã"

# put once-, twice- and triple-wrongly-encoded strings into messy
messy <- c("lélèlölã", "lélèlölã", "lélèlölã")

# All three strings above would be fixed
stringi::stri_replace_all_fixed(messy, names(fixes), fixes, vectorize_all = FALSE)
# [1] "lélèlölã" "lélèlölã" "lélèlölã"