Reliably importing a CSV column as "double"

Posted on 2025-02-06 12:42:35

I am trying to import multiple CSV files in a for loop. By iteratively fixing the errors the code produced, I arrived at the code below.

for (E in EDCODES) {
  # Build the path to the current file
  Filename <- paste("$. Data/2. Liabilities/",
                    E,
                    sep = "")
  # Data frame name: the file name with everything from the first dot onward removed
  Framename <- gsub("\\..*",
                    "",
                    E)
  assign(Framename,
         read.csv(Filename,
                  header = TRUE,
                  sep = ",",
                  stringsAsFactors = FALSE,
                  na.strings = c("\"ND",
                                 "ND,5",
                                 "5\""),
                  colClasses = c("BAA35" = "double"),
                  encoding = "UTF-8",
                  quote = ""))
}

First I realized that the code does not always recognize the most important column, "BAA35", as numeric, so I added the colClasses argument. Then I realized that the data contains several different markers for missing values, so I added the na.strings argument. The most common one is "ND,5", which contains the separator ",". If I add the na.strings argument as defined above, I get a lot of "EOF within quoted string" warnings. The other markers are variants of "ND,[NUMBER]" or "ND,4,[YYYY-MM]".

If I then try to fix that with the most common recommendation I could find, adding quote = "", I just end up with a "more columns than column names" error instead.
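
A small sketch of how the effect of quote = "" could be checked with count.fields(). The three-line sample is made up for illustration, and it assumes the real files wrap the marker in double quotes, which the "\"ND" and "5\"" entries in na.strings suggest:

```r
# Hypothetical three-column sample; the real files have 78 columns.
csv_text <- c('ID,BAA35,NOTE',
              '1,12.5,ok',
              '2,"ND,5",ok')
tmp <- tempfile(fileext = ".csv")
writeLines(csv_text, tmp)

# Fields per line when double quotes are honoured
with_quotes <- count.fields(tmp, sep = ",", quote = "\"")

# Fields per line when quoting is switched off, as with quote = ""
without_quotes <- count.fields(tmp, sep = ",", quote = "")

print(with_quotes)
print(without_quotes)

unlink(tmp)
```

With quoting switched off, the last line is counted as four fields instead of three, which is the kind of mismatch that leads to a "more columns than column names" error. Running count.fields() on one of the real files would show which lines are affected.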

The data has 78 columns, so I don't think posting it here would be readable.

Can somebody recommend a way to reliably import this column as numeric and have R recognize the NAs in the data correctly?

I think the issue might be that the na.strings values contain commas: in some cases "ND,5" is read as two columns, one containing ND and one containing 5, and in other cases it is matched as the na.string. Is there any way to tell R not to split "ND,5" into two columns?
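
One possible direction, sketched under assumptions rather than tested against the real files: read every column as character with the default quoting, so that a quoted "ND,5" stays in one field, and convert BAA35 afterwards. The sample data and the helper name to_number are hypothetical, and the sketch assumes every missing-value marker starts with "ND" and is wrapped in double quotes in the file:

```r
# Hypothetical sample standing in for one of the real files.
csv_text <- c('ID,BAA35,NOTE',
              '1,12.5,ok',
              '2,"ND,5",ok',
              '3,"ND,4,2019-03",ok',
              '4,7,ok')

# Read all columns as character with the default quote setting,
# so quoted fields containing commas are kept together.
raw <- read.csv(text = csv_text,
                colClasses = "character",
                stringsAsFactors = FALSE)

# Hypothetical helper: anything starting with "ND" becomes NA,
# the rest is converted to numeric.
to_number <- function(x) {
  x[grepl("^ND", x)] <- NA_character_
  as.numeric(x)
}

raw$BAA35 <- to_number(raw$BAA35)
str(raw)
```

In the loop, the same two steps could replace the na.strings, colClasses and quote arguments, with read.csv(Filename, colClasses = "character") in place of the text = argument used here.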
