How to read data when some numbers contain commas as thousand separator?
I have a csv file where some of the numerical values are expressed as strings with commas as thousand separators, e.g. "1,513" instead of 1513. What is the simplest way to read the data into R?
I can use read.csv(..., colClasses="character"), but then I have to strip the commas from the relevant elements before converting those columns to numeric, and I can't find a neat way to do that.
Comments (11)
Not sure how to have read.csv interpret it properly, but you can use gsub to replace "," with "", and then convert the string to numeric using as.numeric:
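The code for this answer did not survive on this page, so here is a minimal sketch of the gsub plus as.numeric idea. The vector y and its values are made up for illustration.

```r
# Hypothetical character vector, e.g. a column read with colClasses = "character"
y <- c("1,513", "22", "1,200,000")

# Strip every comma, then convert the cleaned strings to numeric
y_num <- as.numeric(gsub(",", "", y))
print(y_num)
```

Note that this removes every comma, so it is only appropriate when the comma is never used as a decimal mark in the data.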
This was also answered previously on R-Help (and in Q2 here).
Alternatively, you can pre-process the file, for instance with sed in unix.
You can have read.table or read.csv do this conversion for you semi-automatically. First create a new class definition, then create a conversion function and set it as an "as" method using the setAs function like so:
Then run read.csv like:
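The two steps described above might look roughly like the sketch below. The class name num.with.commas, the temporary file, and its contents are assumptions chosen for illustration; the original code is not preserved here.

```r
library(methods)

# Step 1: declare a new (virtual) class; the name is hypothetical
setClass("num.with.commas")

# Step 2: register the conversion from character as an "as" method
setAs("character", "num.with.commas",
      function(from) as.numeric(gsub(",", "", from)))

# Hypothetical two-column file written to a temporary path for the demo
tmp <- tempfile(fileext = ".csv")
writeLines(c('id,amount', 'a,"1,513"', 'b,"22,000"'), tmp)

# read.csv applies the registered conversion to the second column
df <- read.csv(tmp, colClasses = c("character", "num.with.commas"))
str(df)
```

The appeal of this design is that the conversion happens while reading, so no separate clean-up pass over the data frame is needed.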
I want to use R rather than pre-processing the data, as that makes it easier when the data are revised. Following Shane's suggestion of using gsub, I think this is about as neat as I can do:
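The poster's code is missing from this page. A sketch of the same idea, applied to several columns at once, could look like this; the file contents and the column names sales and cost are hypothetical.

```r
# Hypothetical file with two comma-formatted numeric columns
tmp <- tempfile(fileext = ".csv")
writeLines(c('name,sales,cost',
             'x,"1,513","2,000"',
             'y,"12","1,000,000"'), tmp)

# Read everything as character so nothing is converted prematurely
x <- read.csv(tmp, header = TRUE, colClasses = "character")

# Assumed choice of columns that hold comma-formatted numbers
cols_to_convert <- c("sales", "cost")
x[cols_to_convert] <- lapply(x[cols_to_convert],
                             function(col) as.numeric(gsub(",", "", col)))
str(x)
```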
This question is several years old, but I stumbled upon it, which means maybe others will.
The readr package has some nice features. One of them is a nice way to interpret "messy" columns like these: flag the column type while reading, and the example column comes back as a numeric column in a local data frame [4 x 1].
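The example code is not preserved on this page. A minimal sketch with readr, assuming a four-row file with one column named numbers (both hypothetical) and using col_number() as the column type, might be:

```r
library(readr)

# Hypothetical one-column file; some values are quoted and contain commas
tmp <- tempfile(fileext = ".csv")
writeLines(c('numbers', '800', '"1,800"', '"3500"', '6.5'), tmp)

# col_number() parses numbers that may contain grouping marks such as ","
df <- read_csv(tmp, col_types = cols(numbers = col_number()))
print(df)
```

Whether the original answer used col_number() or another collector is not known from this page; col_number() is simply one collector that tolerates grouping marks.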
An important point when reading in files: you either have to pre-process, like the comment above regarding sed, or you have to process while reading. Often, if you try to fix things after the fact, there are some dangerous assumptions made that are hard to find. (Which is why flat files are so evil in the first place.)
For instance, if I had not flagged the col_types, I would have gotten this:
(Notice that it is now a chr (character) instead of a numeric.)
Or, more dangerously, if it were long enough and most of the early elements did not contain commas:
(such that the last few elements look like:)
Then you'll find trouble reading that comma at all!
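To illustrate why fixing things after the fact is risky, here is a small base R sketch (not readr, and not the poster's code). It shows a naive conversion silently turning a comma-formatted value into NA; the vector raw is hypothetical.

```r
# Values as they would arrive if the column were read as character
raw <- c("800", "1,800", "3500", "6.5")

# Naive conversion: the comma-formatted entry becomes NA, with only a warning
naive <- suppressWarnings(as.numeric(raw))
print(naive)

# Comparing missingness before and after makes the silent failure visible
bad <- is.na(naive) & !is.na(raw)
print(raw[bad])
```

A check like the last two lines is cheap insurance whenever a character column is converted to numeric after reading.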
We can also use readr::parse_number, though the columns must be character. To apply it to multiple columns, we can loop through the columns using lapply.
Or use mutate_at from dplyr to apply it to specific variables.
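Both variants can be sketched as follows. The data frame and its column names are hypothetical, since the original data block is missing from this page.

```r
library(readr)
library(dplyr)

# Hypothetical data: two character columns holding formatted numbers
df <- data.frame(id = c("a", "b"),
                 x  = c("1,513", "22"),
                 y  = c("3,000,000", "4.5"),
                 stringsAsFactors = FALSE)

# Base R: loop over the chosen columns with lapply
df1 <- df
df1[c("x", "y")] <- lapply(df1[c("x", "y")], parse_number)

# dplyr: apply parse_number to specific variables with mutate_at
df2 <- mutate_at(df, vars(x, y), parse_number)
str(df2)
```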
A dplyr solution using mutate_all and pipes.
Say you have a data frame with year variables X2014-X2016, and want to remove the commas from them and convert them to numeric. Also, let's say X2014-X2016 are read in as factors (the default).
mutate_all applies the function(s) inside funs to the specified columns.
I did it sequentially, one function at a time (if you use multiple functions inside funs, you create additional, unnecessary columns).
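The original pipeline is not preserved here. Because funs is deprecated in recent dplyr releases, the sketch below swaps in mutate with across instead of mutate_all with funs; this assumes dplyr 1.0 or later, and the data frame is hypothetical.

```r
library(dplyr)

# Hypothetical data with the year columns stored as factors
df <- data.frame(city  = c("A", "B"),
                 X2014 = c("1,200", "950"),
                 X2015 = c("1,350", "1,010"),
                 X2016 = c("2,000", "1,100"),
                 stringsAsFactors = TRUE)

# Convert factor -> character, drop commas, then convert to numeric
clean <- df %>%
  mutate(across(X2014:X2016,
                ~ as.numeric(gsub(",", "", as.character(.x)))))
str(clean)
```

Doing the three conversions inside one function avoids the extra columns the answer warns about.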
"Preprocess" in R:
You can use readLines on a textConnection, then remove only the commas that are between digits:
It's also useful to know, though not directly relevant to this question, that commas as decimal separators can be handled by read.csv2 (automagically) or read.table (by setting the 'dec' parameter).
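The readLines idea above can be sketched as follows. The text and column names are hypothetical, and the regular expression assumes a field separator never sits directly between two digits (here the numeric fields are quoted).

```r
# Hypothetical csv text; amounts are quoted and use thousand separators
txt <- 'id,amount\n1,"1,513"\n2,"1,234,567"'

con <- textConnection(txt)
lines <- readLines(con)
close(con)

# Remove only commas that have a digit on both sides
cleaned <- gsub("([0-9]),([0-9])", "\\1\\2", lines)

# Parse the cleaned lines as ordinary csv
df <- read.csv(text = cleaned)
str(df)
```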
Edit: Later I discovered how to use colClasses by designing a new class. See:
How to load df with 1000 separator in R as numeric class?
Using the read_delim function, which is part of the readr library, you can specify an additional parameter:
*The semicolon in the second line means that read_delim will read semicolon-separated values.
This will help to read all numbers with a comma as proper numbers.
Regards
Mateusz Kania
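The call itself is missing from this page. Assuming the additional parameter was a locale with a comma as the decimal mark, a sketch could look like the following; the file and column names are hypothetical.

```r
library(readr)

# Hypothetical semicolon-separated file using "," as the decimal mark
tmp <- tempfile(fileext = ".csv")
writeLines(c("item;price", "a;1,5", "b;2,25"), tmp)

df <- read_delim(tmp, delim = ";",
                 locale = locale(decimal_mark = ","),
                 col_types = cols(item = col_character(),
                                  price = col_double()))
print(df)
```

This covers the case where the comma is a decimal mark; for commas used as thousand separators, a number collector or the grouping_mark setting is the relevant knob.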
If the number uses "." as the thousands separator and "," for decimals (1.200.000,00), you must set fixed=TRUE when calling gsub, because an unescaped "." is a regular expression that matches any character: gsub(".","",y,fixed=TRUE). The decimal comma still has to be replaced with "." before as.numeric can parse the result.
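A short sketch of the difference fixed=TRUE makes, using the example value from this answer (the variable names are made up):

```r
y <- "1.200.000,00"

# Without fixed = TRUE, "." is a regex matching any character,
# so every character is removed
all_gone <- gsub(".", "", y)

# With fixed = TRUE only literal dots are removed
no_dots <- gsub(".", "", y, fixed = TRUE)

# Swap the decimal comma for a dot so as.numeric can parse it
value <- as.numeric(sub(",", ".", no_dots, fixed = TRUE))
print(value)
```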
A very convenient way is the readr::read_delim family. Taking the example from here: Importing csv with multiple separators into R, you can do it as follows:
Which gives the expected result:
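Neither the linked example nor the code is preserved here, so the sketch below uses hypothetical data. It assumes the approach was read_csv with a locale that declares "," as the grouping mark, together with number collectors for the quoted columns.

```r
library(readr)

# Hypothetical file: quoted numbers with "," as the grouping mark
tmp <- tempfile(fileext = ".csv")
writeLines(c('district,count,area',
             'North,"136,227","8,514,187.5"',
             'South,"9,001","120.25"'), tmp)

df <- read_csv(tmp,
               locale = locale(grouping_mark = ","),
               col_types = cols(district = col_character(),
                                .default = col_number()))
print(df)
```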
I think preprocessing is the way to go. You could use Notepad++, which has a regular expression replace option.
For example, if your file were like this:
Then you could use the regular expression "([0-9]+),([0-9]+)" and replace it with \1\2
Then you could use x <- read.csv(file="x.csv",header=FALSE) to read the file.
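For comparison, the same regular expression can be tried from within R instead of Notepad++. This is a sketch with hypothetical lines, not the poster's file. In R, a single pass of this pattern leaves one comma in numbers that have more than one separator, so the sketch repeats the replacement until nothing changes.

```r
# Hypothetical file contents; numeric fields are quoted
lines <- c('a,"1,513"', 'b,"1,234,567"')
pattern <- "([0-9]+),([0-9]+)"

# One pass: the second line still contains a comma inside the number
once <- gsub(pattern, "\\1\\2", lines)
print(once)

# Repeat until the text stops changing
cleaned <- lines
repeat {
  updated <- gsub(pattern, "\\1\\2", cleaned)
  if (identical(updated, cleaned)) break
  cleaned <- updated
}

x <- read.csv(text = cleaned, header = FALSE)
str(x)
```

As with any digit-comma-digit pattern, this assumes that two unquoted numeric fields never sit next to each other, otherwise they would be merged.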