R write.csv with UTF-16 encoding

Posted 2024-10-21 05:13:36

I'm having trouble writing a data.frame with write.csv using UTF-16 character encoding.

Background: I am trying to write out a CSV file from a data.frame for use in Excel. Excel Mac 2011 seems to dislike UTF-8 (if I specify UTF-8 during text import, non-ASCII characters show up as underscores). I've been led to believe that Excel will be happy with UTF-16LE encoding.

Here's the example data.frame:

> foo
  a  b
1 á 羽
> Encoding(levels(foo$a))
[1] "UTF-8"
> Encoding(levels(foo$b))
[1] "UTF-8"

So I tried to output the data.frame by doing:

f <- file("foo.csv", encoding="UTF-16LE")
write.csv(foo, f)

This gives me an ASCII file that looks like:

"","

If I use encoding="UTF-16", I get a file that only contains the byte-order mark 0xFE 0xFF.

If I use encoding="UTF-16BE", I get an empty file.

This is on a 64-bit version of R 2.12.2 on Mac OS X 10.6.6. What am I doing wrong?

Comments (2)

月朦胧 2024-10-28 05:13:36

You could simply save the CSV in UTF-8 and later convert it to UTF-16LE with iconv in the terminal.

If you insist on doing it in R, the following might work, although it seems that iconv in R does have some issues, see: http://tolstoy.newcastle.edu.au/R/e10/devel/10/06/0648.html

> x <- c("foo", "bar")
> iconv(x,"UTF-8","UTF-16LE")
Error in iconv(x, "UTF-8", "UTF-16LE") : 
  embedded nul in string: 'f\0o\0o\0'

As you can see, the patch linked above is really needed (I have not tested it). If you want to keep it simple (and nasty), just call the third-party iconv program from inside R with a system call after saving the table to CSV.
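The system-call route could look roughly like the sketch below. It assumes an external `iconv` program is on the PATH (typical on Mac OS X and Linux, not guaranteed elsewhere); `convert_with_iconv` is a hypothetical helper name, not something from the original answer.

```r
# Hypothetical helper: convert a UTF-8 file to UTF-16LE with the external
# iconv program. Assumes `iconv` is installed and on the PATH.
convert_with_iconv <- function(src, dest) {
  if (!nzchar(Sys.which("iconv"))) {
    stop("no external 'iconv' program found on the PATH")
  }
  # system2() does not quote its arguments, so quote the path ourselves.
  # stdout = dest redirects the converted bytes into the target file.
  status <- system2("iconv",
                    args = c("-f", "UTF-8", "-t", "UTF-16LE", shQuote(src)),
                    stdout = dest)
  if (!identical(as.integer(status), 0L)) {
    stop("iconv exited with status ", status)
  }
  invisible(dest)
}
```

A possible use would be to write the table with `write.csv(foo, "foo-utf8.csv", fileEncoding = "UTF-8")` and then call `convert_with_iconv("foo-utf8.csv", "foo.csv")`. The command-line tool usually does not add a byte-order mark when the target is UTF-16LE, so whether Excel accepts the result as-is would need to be checked.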

爱冒险 2024-10-28 05:13:36

Something like this might do (write.csv() simply ignores the encoding, so you have to opt for writeLines() or writeBin()) ...

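#' function to convert character vectors to UTF-8 encoding
#'
#' @param x the vector to be converted
#' @export 

toUTF8 <- 
  function(x){
    worker <- function(x){
      # Encoding() returns "unknown" for ASCII/native strings, which iconv()
      # does not accept as an encoding name; "" means the native encoding
      from <- Encoding(x)
      if (from == "unknown") from <- ""
      iconv(x, from = from, to = "UTF-8")
    }
    unlist(lapply(x, worker))
  }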



#' function to write csv files with UTF-8 characters (even under Windows)
#' @param df data frame to be written to file
#' @param file file name / path where to put the data
#' @export 

write_utf8_csv <- 
function(df, file){
  # header line: quoted column names separated by " , "
  firstline <- paste(  '"', names(df), '"', sep = "", collapse = " , ")
  # convert every character column to UTF-8
  char_columns <- which(sapply(df, is.character))
  for( i in  char_columns){
    df[,i] <- toUTF8(df[,i])
  }
  # one quoted, " , "-separated line per row
  data <- apply(df, 1, function(x){paste('"', x,'"', sep = "",collapse = " , ")})
  # useBytes = TRUE writes the strings byte for byte, without re-encoding
  writeLines( c(firstline, data), file , useBytes = TRUE)
}


#' function to read csv file with UTF-8 characters (even under Windows) that 
#' were created by write_U
#' @param file file name / path to read the data from
#' @export 

read_utf8_csv <- function(file){
  # reading data from file
  content <- readLines(file, encoding = "UTF-8")
  # extracting data
  content <- stringr::str_split(content, " , ")
  content <- lapply(content, stringr::str_replace_all, '"', "")
  content_names <- content[[1]][content[[1]]!=""]
  content <- content[seq_along(content)[-1]]  
  # putting it into data.frame
  df <- data.frame(dummy=seq_along(content), stringsAsFactors = FALSE)
  for(name in content_names){
    # df grows by one column per pass, so ncol(df) is the index of the next field
    tmp <- sapply(content, `[[`, ncol(df))
    Encoding(tmp) <- "UTF-8"
    df[,name] <- tmp 
  }
  # drop the dummy column; drop = FALSE keeps a data.frame with a single column
  df <- df[, -1, drop = FALSE]
  # return
  return(df)
}
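
The functions above write UTF-8. For the UTF-16LE output the question asks about, the writeBin() route mentioned at the top of this answer could be sketched as follows. This is only a sketch under assumptions: `write_utf16le_csv` is a hypothetical helper name, the leading byte-order mark and CRLF line endings are guesses at what Excel expects, and it relies on `iconv(..., toRaw = TRUE)`, which returns raw bytes and so avoids the "embedded nul in string" error.

```r
# Hypothetical helper: write a data frame as a UTF-16LE CSV via writeBin().
# The BOM and the CRLF line endings are assumptions, not tested against Excel.
write_utf16le_csv <- function(df, path) {
  # 1. Let write.csv() handle quoting, but write UTF-8 to a temporary file.
  tmp <- tempfile(fileext = ".csv")
  on.exit(unlink(tmp), add = TRUE)
  write.csv(df, tmp, row.names = FALSE, fileEncoding = "UTF-8")
  lines <- readLines(tmp, encoding = "UTF-8", warn = FALSE)

  # 2. Join the lines and convert to raw UTF-16LE bytes.
  #    toRaw = TRUE returns a list of raw vectors, one per input string.
  txt <- paste0(paste(lines, collapse = "\r\n"), "\r\n")
  bytes <- iconv(txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
  if (is.null(bytes)) stop("conversion to UTF-16LE failed")

  # 3. Write the little-endian BOM (0xFF 0xFE) followed by the data.
  con <- file(path, open = "wb")
  on.exit(close(con), add = TRUE)
  writeBin(as.raw(c(0xFF, 0xFE)), con)
  writeBin(bytes, con)
  invisible(path)
}
```

With the data frame from the question this would be called as `write_utf16le_csv(foo, "foo.csv")`. Going through a temporary UTF-8 file keeps the CSV quoting rules of write.csv() instead of re-implementing them by hand.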