导入包含逗号、千位分隔符和尾随减号的 CSV 数据

发布于 2024-11-24 23:25:22 字数 2861 浏览 3 评论 0原文

Mac OS X 上的 R 2.13.1。我正在尝试导入一个数据文件,该文件具有千位分隔符和逗号作为小数点,以及负值的尾随减号。

基本上,我试图从: 转换

"A|324,80|1.324,80|35,80-"

  V1    V2     V3    V4
1  A 324.80 1324.8 -35.80

现在,以交互方式进行以下操作:

gsub("\\.","","1.324,80")
[1] "1324,80"

gsub("(.+)-$","-\\1", "35,80-")
[1] "-35,80"

并将它们组合起来:

gsub("\\.", "", gsub("(.+)-$","-\\1","1.324,80-"))
[1] "-1324,80"

但是,我无法从 read.data 中删除千位分隔符:

setClass("num.with.commas")

setAs("character", "num.with.commas", function(from) as.numeric(gsub("\\.", "", sub("(.+)-$","-\\1",from))) )
mydata <- "A|324,80|1.324,80|35,80-"

mytable <- read.table(textConnection(mydata), header=FALSE, quote="", comment.char="", sep="|", dec=",", skip=0, fill=FALSE,strip.white=TRUE, colClasses=c("character","num.with.commas", "num.with.commas", "num.with.commas"))

Warning messages:
1: In asMethod(object) : NAs introduced by coercion
2: In asMethod(object) : NAs introduced by coercion
3: In asMethod(object) : NAs introduced by coercion

mytable
  V1 V2 V3 V4
1  A NA NA NA

请注意,如果我从“\\”更改”。到函数中的“,”,事情看起来有点不同:

setAs("character", "num.with.commas", function(from) as.numeric(gsub(",", "", sub("(.+)-$","-\\1",from))) )

mytable <- read.table(textConnection(mydata), header=FALSE, quote="", comment.char="", sep="|", dec=",", skip=0, fill=FALSE,strip.white=TRUE, colClasses=c("character","num.with.commas", "num.with.commas", "num.with.commas"))

mytable
  V1    V2     V3    V4
1  A 32480 1.3248 -3580

我认为问题是 read.data 和 dec=“,”将传入的“,”转换为“。”在调用 as(from, "num.with.commas") 之前,以便输入字符串可以是例如“1.324.80”。

我希望 as("1.123,80-","num.with.commas") 返回 -1123.80 和 as("1.100.123,80", "num.with.commas") 返回 1100123.80。

如何让我的 num.with.commas 替换输入字符串中除最后一个小数点之外的所有小数点?

更新:首先,我添加了否定前瞻并让 as() 在控制台中工作:

setAs("character", "num.with.commas", function(from) as.numeric(gsub("(?!\\.\\d\\d$)\\.", "", gsub("(.+)-$","-\\1",from), perl=TRUE)) )
as("1.210.123.80-","num.with.commas")
[1] -1210124
as("10.123.80-","num.with.commas")
[1] -10123.8
as("10.123.80","num.with.commas")
[1] 10123.8

但是,read.table仍然存在相同的问题。在我的函数中添加一些 print() 表明 num.with.commas 实际上得到了逗号而不是要点。

所以我当前的解决方案是将“,”替换为“。”以 num.with.commas 表示。

setAs("character", "num.with.commas", function(from) as.numeric(gsub(",","\\.",gsub("(?!\\.\\d\\d$)\\.", "", gsub("(.+)-$","-\\1",from), perl=TRUE))) )
mytable <- read.table(textConnection(mydata), header=FALSE, quote="", comment.char="", sep="|", dec=",", skip=0, fill=FALSE,strip.white=TRUE, colClasses=c("character","num.with.commas", "num.with.commas", "num.with.commas"))
mytable
  V1    V2      V3    V4
1  A 324.8 1101325 -35.8

R 2.13.1 on Mac OS X. I'm trying to import a data file that has a point for thousand separator and comma as the decimal point, as well as trailing minus for negative values.

Basically, I'm trying to convert from:

"A|324,80|1.324,80|35,80-"

to

  V1    V2     V3    V4
1  A 324.80 1324.8 -35.80

Now, interactively both the following works:

gsub("\\.","","1.324,80")
[1] "1324,80"

gsub("(.+)-$","-\\1", "35,80-")
[1] "-35,80"

and also combining them:

gsub("\\.", "", gsub("(.+)-$","-\\1","1.324,80-"))
[1] "-1324,80"

However, I'm not able to remove the thousand separator from read.data:

setClass("num.with.commas")

setAs("character", "num.with.commas", function(from) as.numeric(gsub("\\.", "", sub("(.+)-$","-\\1",from))) )
mydata <- "A|324,80|1.324,80|35,80-"

mytable <- read.table(textConnection(mydata), header=FALSE, quote="", comment.char="", sep="|", dec=",", skip=0, fill=FALSE,strip.white=TRUE, colClasses=c("character","num.with.commas", "num.with.commas", "num.with.commas"))

Warning messages:
1: In asMethod(object) : NAs introduced by coercion
2: In asMethod(object) : NAs introduced by coercion
3: In asMethod(object) : NAs introduced by coercion

mytable
  V1 V2 V3 V4
1  A NA NA NA

Note that if I change from "\\." to "," in the function, things look a bit different:

setAs("character", "num.with.commas", function(from) as.numeric(gsub(",", "", sub("(.+)-$","-\\1",from))) )

mytable <- read.table(textConnection(mydata), header=FALSE, quote="", comment.char="", sep="|", dec=",", skip=0, fill=FALSE,strip.white=TRUE, colClasses=c("character","num.with.commas", "num.with.commas", "num.with.commas"))

mytable
  V1    V2     V3    V4
1  A 32480 1.3248 -3580

I think the problem is that read.data with dec="," converts the incoming "," to "." BEFORE calling as(from, "num.with.commas"), so that the input string can be e.g. "1.324.80".

I want as("1.123,80-","num.with.commas") to return -1123.80 and as("1.100.123,80", "num.with.commas") to return 1100123.80.

How can I make my num.with.commas replace all except the last decimal point in the input string?

Update: First, I added negative lookahead and got as() working in the console:

setAs("character", "num.with.commas", function(from) as.numeric(gsub("(?!\\.\\d\\d$)\\.", "", gsub("(.+)-$","-\\1",from), perl=TRUE)) )
as("1.210.123.80-","num.with.commas")
[1] -1210124
as("10.123.80-","num.with.commas")
[1] -10123.8
as("10.123.80","num.with.commas")
[1] 10123.8

However, read.table still had the same problem. Adding some print()s to my function showed that num.with.commas in fact got the comma and not the point.

So my current solution is to then replace from "," to "." in num.with.commas.

setAs("character", "num.with.commas", function(from) as.numeric(gsub(",","\\.",gsub("(?!\\.\\d\\d$)\\.", "", gsub("(.+)-$","-\\1",from), perl=TRUE))) )
mytable <- read.table(textConnection(mydata), header=FALSE, quote="", comment.char="", sep="|", dec=",", skip=0, fill=FALSE,strip.white=TRUE, colClasses=c("character","num.with.commas", "num.with.commas", "num.with.commas"))
mytable
  V1    V2      V3    V4
1  A 324.8 1101325 -35.8

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

生生不灭 2024-12-01 23:25:22

您应该先删除所有句点,然后将逗号更改为小数点,然后再使用 as.numeric() 进行强制转换。您稍后可以使用 options(OutDec=",") 控制小数点的打印方式。我不认为 R 在内部使用逗号作为小数点分隔符,即使在传统的语言环境中也是如此。

> tst <- c("A","324,80","1.324,80","35,80-")
> 
> as.numeric( sub("\\,", ".", sub("(.+)-$","-\\1", gsub("\\.", "", tst)) ) )
[1]     NA  324.8 1324.8  -35.8
Warning message:
NAs introduced by coercion 

You should be removing all the periods first and then changing the commas to decimal points before coercing with as.numeric(). You can later control how decimal points are printed with options(OutDec=",") . I do not think R uses commas as decimal separators internally even in locales where they are conventional.

> tst <- c("A","324,80","1.324,80","35,80-")
> 
> as.numeric( sub("\\,", ".", sub("(.+)-$","-\\1", gsub("\\.", "", tst)) ) )
[1]     NA  324.8 1324.8  -35.8
Warning message:
NAs introduced by coercion 
趁年轻赶紧闹 2024-12-01 23:25:22

这是使用正则表达式和替换的解决方案

mydata <- "A|324,80|1.324,80|35,80-"
# Split data
mydata2 <- strsplit(mydata,"|",fixed=TRUE)[[1]]
# Remove commas
mydata3 <- gsub(",","",mydata2,fixed=TRUE)
# Move negatives to front of string
mydata4 <- gsub("^(.+)-$","-\\1",mydata3)
# Convert to numeric
mydata.cleaned <- c(mydata4[1],as.numeric(mydata4[2:4]))

Here's a solution with regular expressions and substitutions

mydata <- "A|324,80|1.324,80|35,80-"
# Split data
mydata2 <- strsplit(mydata,"|",fixed=TRUE)[[1]]
# Remove commas
mydata3 <- gsub(",","",mydata2,fixed=TRUE)
# Move negatives to front of string
mydata4 <- gsub("^(.+)-$","-\\1",mydata3)
# Convert to numeric
mydata.cleaned <- c(mydata4[1],as.numeric(mydata4[2:4]))
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文