Reliably importing a CSV column as "double"

Posted on 2025-02-06 12:42:35

I am trying to import multiple CSV files in a for loop. After iteratively working through the errors the code produced, I arrived at the code below.

for (E in EDCODES) {
  # Build the path to the current file
  Filename <- paste0("$. Data/2. Liabilities/", E)
  # Drop everything from the first dot onward to get the data frame name
  Framename <- gsub("\\..*", "", E)
  # Read the file and store it under that name
  assign(Framename,
         read.csv(Filename,
                  header = TRUE,
                  sep = ",",
                  stringsAsFactors = FALSE,
                  na.strings = c("\"ND", "ND,5", "5\""),
                  colClasses = c("BAA35" = "double"),
                  encoding = "UTF-8",
                  quote = ""))
}

First I realized that R does not always recognize the most important column, "BAA35", as numeric, so I added the colClasses argument. Then I realized that the data contains multiple representations of missing values, so I added the na.strings argument. The most common one is "ND, 5", which contains the separator ",". When I add the na.strings argument as defined above, I get many "EOF within quoted string" warnings. The other missing-value markers are variants of "ND, [NUMBER]" or "ND, 4, [YYYY-MM]".

If I then try the most common recommendation I could find, adding quote = "", I end up with a "more columns than column names" error instead.
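To illustrate why quote = "" changes the column count, here is a minimal sketch on made-up data. The three-column sample, its column names, and the temporary file are assumptions for illustration only; the real files have 78 columns and may be quoted differently. It uses count.fields to compare the number of fields per line with and without quote handling.

```r
# Hypothetical sample: the field "ND,5" is wrapped in double quotes.
sample_lines <- c(
  'ID,BAA35,NOTE',
  '1,12.5,ok',
  '2,"ND,5",missing',
  '3,7,ok'
)
tmp <- tempfile(fileext = ".csv")
writeLines(sample_lines, tmp)

# Fields per line when double quotes protect embedded commas
n_default <- count.fields(tmp, sep = ",", quote = "\"")

# Fields per line when quoting is disabled, as with quote = ""
n_noquote <- count.fields(tmp, sep = ",", quote = "")

print(n_default)
print(n_noquote)

# Lines whose field count differs from the header once quoting is off
bad_lines <- which(n_noquote != n_noquote[1])
print(bad_lines)
```

Running the same comparison on one of the real files should show which lines gain extra fields once quoting is disabled.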

The data has 78 columns, so I don't believe posting it here will display in a usable way.

Can somebody recommend a way to reliably import this column as numeric and have R recognize the NAs in the data correctly?

I think the issue might be that the na.strings values contain commas: in some cases "ND,5" is read as two columns, one with ND and one with 5, and in other cases it is matched as an na.string. Is there any way to tell R not to split "ND,5" into two columns?
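One possible direction, sketched on made-up data rather than the real files: if the "ND,..." values are wrapped in double quotes in the file (an assumption), the default quoting of read.csv keeps each of them in one field. The column can then be read as character and converted afterwards, so na.strings never has to match a value containing a comma. The sample rows and column names below are hypothetical.

```r
# Hypothetical sample with several quoted "ND,..." variants
csv_text <- c(
  'ID,BAA35,NOTE',
  '1,12.5,ok',
  '2,"ND,5",missing',
  '3,"ND,4,2020-01",missing',
  '4,7,ok'
)

# Read every column as character, leaving the default quote handling on
raw <- read.csv(text = csv_text,
                colClasses = "character",
                stringsAsFactors = FALSE)

# Treat every value starting with "ND" as missing, then convert to numeric
baa35 <- raw$BAA35
baa35[grepl("^ND", baa35)] <- NA
raw$BAA35 <- as.numeric(baa35)

str(raw)
```

If the real files contain unbalanced quotes elsewhere, this sketch would not be enough on its own, since the default quoting would still trip over those lines.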
