How to read data when some numbers contain commas as thousand separator?

Posted on 2024-08-06 16:34:36 | 223 characters | 6 views | 0 comments

I have a csv file where some of the numerical values are expressed as strings with commas as thousand separator, e.g. "1,513" instead of 1513. What is the simplest way to read the data into R?

I can use read.csv(..., colClasses="character"), but then I have to strip out the commas from the relevant elements before converting those columns to numeric, and I can't find a neat way to do that.

Comments (11)

念三年u 2024-08-13 16:34:36

Not sure about how to have read.csv interpret it properly, but you can use gsub to replace "," with "", and then convert the string to numeric using as.numeric:

y <- c("1,200","20,000","100","12,111")
as.numeric(gsub(",", "", y))
# [1]  1200 20000   100 12111

This was also answered previously on R-Help (and in Q2 here).

Alternatively, you can pre-process the file, for instance with sed in unix.
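
A minimal sketch of the gsub approach wrapped in a helper, to show how it behaves with signs, decimals and missing values. The name strip_commas is hypothetical (it is not a base R function), and the input vector is made up for illustration.

```r
# Hypothetical helper (not part of base R): drop thousands separators,
# then convert to numeric. fixed = TRUE treats "," as a literal string.
strip_commas <- function(x) {
  as.numeric(gsub(",", "", x, fixed = TRUE))
}

# Made-up input: a negative value with decimals, a missing value, an empty string
y <- c("1,200", "-3,500.75", NA, "")
result <- strip_commas(y)
print(result)  # the NA and the empty string both come back as NA
```

Note that any comma is removed, so this is only appropriate when the comma is known to be a grouping mark and never a decimal mark.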

难忘№最初的完美 2024-08-13 16:34:36

You can have read.table or read.csv do this conversion for you semi-automatically. First create a new class definition, then create a conversion function and set it as an "as" method using the setAs function like so:

setClass("num.with.commas")
setAs("character", "num.with.commas", 
        function(from) as.numeric(gsub(",", "", from) ) )

Then run read.csv like:

DF <- read.csv('your.file.here', 
   colClasses=c('num.with.commas','factor','character','numeric','num.with.commas'))
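
To try the setAs approach without a file on disk, here is a self-contained sketch. The inline CSV text and its column names are assumptions made up for illustration; read.csv's text argument stands in for 'your.file.here'.

```r
library(methods)  # setClass/setAs live here; loaded explicitly in case it is not attached

# Virtual class that serves only as a label for colClasses
setClass("num.with.commas")
setAs("character", "num.with.commas",
      function(from) as.numeric(gsub(",", "", from, fixed = TRUE)))

# Hypothetical inline data standing in for a real file
csv_text <- 'name,amount,count
a,"1,513",3
b,"20,000",4
c,100,5'

DF <- read.csv(text = csv_text,
               colClasses = c("character", "num.with.commas", "numeric"))
str(DF)  # amount should now be a numeric column
```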
橘味果▽酱 2024-08-13 16:34:36

I want to use R rather than pre-processing the data as it makes it easier when the data are revised. Following Shane's suggestion of using gsub, I think this is about as neat as I can do:

x <- read.csv("file.csv",header=TRUE,colClasses="character")
col2cvt <- 15:41
x[,col2cvt] <- lapply(x[,col2cvt],function(x){as.numeric(gsub(",", "", x))})
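
The same idea as a self-contained sketch, selecting the columns by name instead of by position. The inline data and the column names (sales, units) are hypothetical.

```r
# Hypothetical inline data standing in for file.csv
csv_text <- 'region,sales,units
north,"1,513","12"
south,"20,000","1,204"'

# Read everything as character first, as in the answer above
x <- read.csv(text = csv_text, header = TRUE, colClasses = "character")

# Columns to convert (assumed names, adjust to the real file)
cols <- c("sales", "units")
x[cols] <- lapply(x[cols], function(v) as.numeric(gsub(",", "", v, fixed = TRUE)))
str(x)
```

Selecting by name is a little more robust than 15:41 if columns are added or reordered when the data are revised.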
听你说爱我 2024-08-13 16:34:36

This question is several years old, but I stumbled upon it, which means maybe others will.

The readr library / package has some nice features to it. One of them is a nice way to interpret "messy" columns, like these.

library(readr)
read_csv("numbers\n800\n\"1,800\"\n\"3500\"\n6.5",
          col_types = list(col_numeric())
        )

This yields

Source: local data frame [4 x 1]

  numbers
    (dbl)
1   800.0
2  1800.0
3  3500.0
4     6.5

An important point when reading in files: you either have to pre-process, like the comment above regarding sed, or you have to process while reading. Often, if you try to fix things after the fact, there are some dangerous assumptions made that are hard to find. (Which is why flat files are so evil in the first place.)

For instance, if I had not flagged the col_types, I would have gotten this:

> read_csv("numbers\n800\n\"1,800\"\n\"3500\"\n6.5")
Source: local data frame [4 x 1]

  numbers
    (chr)
1     800
2   1,800
3    3500
4     6.5

(Notice that it is now a chr (character) instead of a numeric.)

Or, more dangerously, if it were long enough and most of the early elements did not contain commas:

> set.seed(1)
> tmp <- as.character(sample(c(1:10), 100, replace=TRUE))
> tmp <- c(tmp, "1,003")
> tmp <- paste(tmp, collapse="\"\n\"")

(such that the last few elements look like:)

\"5\"\n\"9\"\n\"7\"\n\"1,003"

Then you'll find trouble reading that comma at all!

> tail(read_csv(tmp))
Source: local data frame [6 x 1]

     3"
  (dbl)
1 8.000
2 5.000
3 5.000
4 9.000
5 7.000
6 1.003
Warning message:
1 problems parsing literal data. See problems(...) for more details. 
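
The output above comes from an old readr release. A hedged sketch for a more recent readr, where (to my knowledge) the column type that tolerates grouping marks is spelled col_number(); it assumes the readr package is installed and skips itself otherwise.

```r
# Assumes readr is installed; the block does nothing if it is not
if (requireNamespace("readr", quietly = TRUE)) {
  csv_text <- "numbers\n800\n\"1,800\"\n\"3500\"\n6.5"
  parsed <- readr::read_csv(
    I(csv_text),  # I() marks the string as literal data rather than a file path
    col_types = readr::cols(numbers = readr::col_number())
  )
  print(parsed$numbers)
}
```

Declaring the column type up front means the parsing does not depend on what readr guesses from the first rows.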
苍景流年 2024-08-13 16:34:36

We can also use readr::parse_number, although the columns must be character vectors. To apply it to multiple columns, we can loop over them with lapply:

df[2:3] <- lapply(df[2:3], readr::parse_number)
df

#  a        b        c
#1 a    12234       12
#2 b      123  1234123
#3 c     1234     1234
#4 d 13456234    15342
#5 e    12312 12334512

Or use mutate_at from dplyr to apply it to specific variables.

library(dplyr)
df %>% mutate_at(2:3, readr::parse_number)
#Or
df %>% mutate_at(vars(b:c), readr::parse_number)

data

df <- data.frame(a = letters[1:5], 
                 b = c("12,234", "123", "1,234", "13,456,234", "123,12"),
                 c = c("12", "1,234,123","1234", "15,342", "123,345,12"), 
                 stringsAsFactors = FALSE)
物价感观 2024-08-13 16:34:36

A dplyr solution using mutate_at and pipes.

Say you have the following:

> dft
Source: local data frame [11 x 5]

   Bureau.Name Account.Code   X2014   X2015   X2016
1       Senate          110 158,000 211,000 186,000
2       Senate          115       0       0       0
3       Senate          123  15,000  71,000  21,000
4       Senate          126   6,000  14,000   8,000
5       Senate          127 110,000 234,000 134,000
6       Senate          128 120,000 159,000 134,000
7       Senate          129       0       0       0
8       Senate          130 368,000 465,000 441,000
9       Senate          132       0       0       0
10      Senate          140       0       0       0
11      Senate          140       0       0       0

and want to remove the commas from the year variables X2014-X2016 and convert them to numeric. Also, let's say X2014-X2016 were read in as factors (the default for read.csv before R 4.0.0):

dft %>%
    mutate_at(vars(X2014:X2016), funs(as.character(.))) %>%
    mutate_at(vars(X2014:X2016), funs(gsub(",", "", .))) %>%
    mutate_at(vars(X2014:X2016), funs(as.numeric(.)))

mutate_at applies the function(s) inside funs to the columns selected with vars. (mutate_all, by contrast, acts on every column and takes no column selection.)

I did it sequentially, one function at a time (if you use multiple functions inside funs, you create additional, unnecessary columns).
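
Why the as.character step matters when the columns are factors can be seen in a small base R sketch (no dplyr needed). The three values are taken from the X2014 column above; the variable names are made up.

```r
# A factor holding comma-formatted numbers, as read.csv produced before R 4.0.0
f <- factor(c("158,000", "0", "15,000"))

# Calling as.numeric directly returns the internal level codes, not the values
codes <- as.numeric(f)
print(codes)

# Converting to character first, then stripping commas, recovers the numbers
values <- as.numeric(gsub(",", "", as.character(f), fixed = TRUE))
print(values)
```

The codes are small integers indexing the factor levels, which is an easy mistake to miss because no warning is raised.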

始终不够 2024-08-13 16:34:36

"Preprocess" in R:

lines <- "www, rrr, 1,234, ttt \n rrr,zzz, 1,234,567,987, rrr"

Can use readLines on a textConnection. Then remove only the commas that are between digits:

gsub("([0-9]+)\\,([0-9])", "\\1\\2", lines)

## [1] "www, rrr, 1234, ttt \n rrr,zzz, 1234567987, rrr"

It's also useful to know, though not directly relevant to this question, that commas as decimal separators can be handled by read.csv2 (automagically) or read.table (by setting the 'dec' parameter).

Edit: Later I discovered how to use colClasses by designing a new class. See:

How to load df with 1000 separator in R as numeric class?
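
A sketch of the full round trip: read raw lines, strip the commas that sit directly between two digits, then parse the cleaned text. The sample lines are adapted from the example above. It assumes that fields are separated by a comma followed by a space; if two purely numeric fields were separated by a bare comma (123,456), this pattern would merge them.

```r
# Sample lines adapted from the example above (field separator is ", ")
raw <- c("www, rrr, 1,234, ttt",
         "rrr, zzz, 1,234,567,987, rrr")

# readLines on a textConnection, as suggested above
con <- textConnection(raw)
lines <- readLines(con)
close(con)

# Remove only commas that have a digit on both sides
cleaned <- gsub("([0-9]),([0-9])", "\\1\\2", lines)
print(cleaned)

# Parse the cleaned text; strip.white drops the spaces after each separator
df <- read.csv(text = cleaned, header = FALSE, strip.white = TRUE)
str(df)
```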

萌逼全场 2024-08-13 16:34:36

Using the read_delim function, which is part of the readr package, you can specify an additional parameter:

locale = locale(decimal_mark = ",")

read_delim("filetoread.csv", ";", locale = locale(decimal_mark = ","))

*The semicolon in the second line means that read_delim will read semicolon-separated values.

This reads numbers that use a comma as the decimal mark as proper numbers.

Regards

Mateusz Kania
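
For the thousands-separator case in the question, the relevant locale setting is grouping_mark rather than decimal_mark. A hedged sketch using readr::parse_number with two locales; it assumes readr is installed (and skips itself otherwise), and the variable names us and eu are made up.

```r
# Assumes readr is installed; the block does nothing if it is not
if (requireNamespace("readr", quietly = TRUE)) {
  # Comma groups thousands, point marks decimals
  us <- readr::locale(grouping_mark = ",", decimal_mark = ".")
  # Point groups thousands, comma marks decimals
  eu <- readr::locale(grouping_mark = ".", decimal_mark = ",")

  from_us <- readr::parse_number("1,513.5", locale = us)
  from_eu <- readr::parse_number("1.513,5", locale = eu)
  print(c(from_us, from_eu))
}
```

The same locale objects can be passed to read_delim or read_csv through their locale argument.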

沙沙粒小 2024-08-13 16:34:36

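If the thousands are separated by "." and the decimals by "," (e.g. 1.200.000,00), you must set fixed=TRUE when calling gsub, because an unescaped "." is a regular-expression wildcard that matches any character: as.numeric(gsub(".", "", y, fixed=TRUE))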
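
When the values also carry a decimal comma, removing the points is only half the job, since as.numeric expects "." as the decimal mark. A minimal base R sketch with made-up input values:

```r
# Made-up values: "." groups thousands, "," marks decimals
y <- c("1.200.000,00", "3.500,75", "12,5")

# Without fixed = TRUE the "." is a wildcard and every character is deleted
wiped <- gsub(".", "", "1.200")
print(wiped)  # an empty string

# Step 1: drop the grouping points as literal characters
no_grouping <- gsub(".", "", y, fixed = TRUE)

# Step 2: turn the decimal comma into a point, then convert
result <- as.numeric(sub(",", ".", no_grouping, fixed = TRUE))
print(result)
```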

旧时光的容颜 2024-08-13 16:34:36

A very convenient way is the readr::read_delim family. Taking the example from the question "Importing csv with multiple separators into R", you can do it as follows:

txt <- 'OBJECTID,District_N,ZONE_CODE,COUNT,AREA,SUM
1,Bagamoyo,1,"136,227","8,514,187,500.000000000000000","352,678.813105723350000"
2,Bariadi,2,"88,350","5,521,875,000.000000000000000","526,307.288878142830000"
3,Chunya,3,"483,059","30,191,187,500.000000000000000","352,444.699742995200000"'

require(readr)
read_csv(txt) # = read_delim(txt, delim = ",")

This gives the expected result:

# A tibble: 3 × 6
  OBJECTID District_N ZONE_CODE  COUNT        AREA      SUM
     <int>      <chr>     <int>  <dbl>       <dbl>    <dbl>
1        1   Bagamoyo         1 136227  8514187500 352678.8
2        2    Bariadi         2  88350  5521875000 526307.3
3        3     Chunya         3 483059 30191187500 352444.7
2024-08-13 16:34:36

I think preprocessing is the way to go. You could use Notepad++, which has a regular expression replace option.

For example, if your file were like this:

"1,234","123","1,234"
"234","123","1,234"
123,456,789

Then you could use the regular expression "([0-9]+),([0-9]+)" (quotes included, so only quoted values match) and replace it with \1\2, which gives:

1234,"123",1234
"234","123",1234
123,456,789

Then you could use x <- read.csv(file="x.csv",header=FALSE) to read the file.
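
The same quoted-value replacement can be sketched inside R instead of an editor, using the three sample lines above. Note that, as written, the pattern handles a single comma per quoted number; a value such as "1,234,567" would not match.

```r
# The three sample lines from above
lines <- c('"1,234","123","1,234"',
           '"234","123","1,234"',
           '123,456,789')

# Match a quoted number containing one comma; keep the digits, drop quotes and comma
cleaned <- gsub('"([0-9]+),([0-9]+)"', "\\1\\2", lines)
print(cleaned)

# The unquoted third line is untouched, so its commas still separate fields
x <- read.csv(text = cleaned, header = FALSE)
str(x)
```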
