is.character() does not correctly recognize a data frame
Has the behavior of is.character() changed in R 4.x?
Here I read a simple tab-delimited text file into a data frame, and then confirm all columns are correctly marked as character data:
> raw <- read.table( creditDataPath, header = TRUE, colClasses="character", sep = "\t")
> str(raw)
'data.frame': 407 obs. of 18 variables:
$ NAME : chr "Hope Gorman" "Sarah Coriano" "Ernest Farmer" "John Coleman" ...
$ ADDRESS : chr "179 Del Mar Blvd." "640 Prospect Lane" "474 Green Street" "452 Green Street" ...
$ ZIP : chr "99975" "99904" "99900" "99924" ...
$ SSN : chr "470-17-7670" "355-91-5677" "129-21-0468" "121-57-2753" ...
$ SEX : chr "F" "F" "M" "M" ...
$ MARITALSTATUS : chr "M" "M" "M" "M" ...
$ CHILDREN : chr "2" "1" "0" "0" ...
$ OCCUPATION : chr "Professional" "Unknown" "Unknown" "Unknown" ...
$ HOMEOWNERSHIP : chr "O" "O" "O" "O" ...
$ INCOME : chr "3212" "3145" "3165" "3248" ...
$ EXPENSES : chr "1124" "1100" "1266" "974" ...
$ CHECKING : chr "N" "N" "N" "N" ...
$ SAVINGS : chr "Y" "Y" "Y" "Y" ...
$ MSTRCARD : chr "1" "1" "1" "1" ...
$ VISA : chr "5" "5" "5" "5" ...
$ AMEX : chr "0" "0" "0" "0" ...
$ MERCHANT : chr "9" "9" "9" "9" ...
$ PAYMENTHISTORY: chr "2" "0" "2" "3" ...
However, is.character(raw) for the data frame and is.character(raw[3,1:17]) for a portion of a row in the data frame both return FALSE:
> is.character(raw)
[1] FALSE
> is.character(raw[3,1:17])
[1] FALSE
>
With R version 3.5.2 (the original development environment was 64-bit R 3.5.2 on 64-bit Win 7), reading the file into a data frame (WITHOUT needing to add colClasses = "character") simply worked. The use case is that an R wrapper uses is.character() to determine whether a row in the data frame contains all string values (in effect: is.character(raw[n,1:17])); that then determines which version of a C function in a legacy DLL to call: one that expects ALL strings, or one that expects ALL doubles.
I have been away from R since 2019, so today on a computer running Win10 Pro I installed 64-bit R 4.2.1, loaded the original workspace, and expected everything to work. And, if I manually craft a record (vector) that explicitly has every value in double quotes (e.g., "Hope Gorman", "99975", etc.) everything does work - the R wrapper calls the correct C function.
The problem is, loading the data frame from the simple flat ASCII text file and then accessing it row by row does not work, even though after loading R seems to think the data consists of values that are quoted strings. The error is the dreaded "NAs introduced by coercion" error - in the wrapper, R appears NOT to recognize the character strings.
What am I missing? Is this a bug in 4.x ?
EDIT:
Here are the first 4 lines of the file (first line contains column labels; 18 total tab-delimited fields - some of the string fields contain spaces, e.g. Hope Gorman is the value for the first NAME field/column). This is a toy (ENTIRELY FAKED) data file for consumer credit analysis.
NAME ADDRESS ZIP SSN SEX MARITALSTATUS CHILDREN OCCUPATION HOMEOWNERSHIP INCOME EXPENSES CHECKING SAVINGS MSTRCARD VISA AMEX MERCHANT PAYMENTHISTORY
Hope Gorman 179 Del Mar Blvd. 99975 470-17-7670 F M 2 Professional O 3212 1124 N Y 1 5 0 9 2
Sarah Coriano 640 Prospect Lane 99904 355-91-5677 F M 1 Unknown O 3145 1100 N Y 1 5 0 9 0
Ernest Farmer 474 Green Street 99900 129-21-0468 M M 0 Unknown O 3165 1266 N Y 1 5 0 9 2
Also FWIW, I have checked everything on the original development machine (same file, same R workspace but R 3.5.2 running on Win7), and the R wrapper calls the correct C code as expected.
This leads me to think there is something different in R 4.2 running on Win 10 - I have noted that R now apparently uses UTF-8 characters, but since the file consists solely of US-ASCII characters and no BOM, I am hard-pressed to think character handling on Win10 is the problem, but the fact remains the original code/ R Workspace doesn't work.
Thanks,
Jack
Comments (2)
First, thanks to all the R Gurus who responded.
Second, and somewhat embarrassing, after re-learning how to use debugging tools in R, I discovered that the reason the code "ran" on R 3.5.2 was that there was a bug in the legacy C DLL.
When I looked at the problematic R code statically, it appeared that the only way the R function could possibly call the correct DLL function was if is.character(data) returned TRUE. However, when I stepped through the code in the debugger (in the original Win7/R 3.5.2 environment), I found that is.character(data) was actually returning FALSE - as everyone here expected (and Casper V. further demonstrated), BUT the C function in the Win7 DLL was still treating data as an array of character strings (which it should not have done, given the logic path in the R function).
I then discovered that the legacy DLL used on Win 10, which I thought was the same as that used in the Win 7 environment, was actually a later version, in which the bug was fixed (which of course caused the R error I was seeing in Win 10).
In the end, checking the data type in R as suggested by r2evans ultimately solved the problem.
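A minimal sketch of that kind of per-column check is shown below. It is an illustration only, not the actual wrapper code: row_is_all_character() is a hypothetical helper name, and the small data frame is placeholder data modelled on the sample rows in the question.

```r
# Hypothetical helper: TRUE when every selected column of row n is character.
row_is_all_character <- function(df, n, cols = seq_along(df)) {
  all(vapply(df[n, cols, drop = FALSE], is.character, logical(1)))
}

# Placeholder data, not the real credit file.
df <- data.frame(NAME = c("Hope Gorman", "Sarah Coriano"),
                 ZIP = c("99975", "99904"),
                 stringsAsFactors = FALSE)

is.character(df[1, 1:2])           # FALSE: the row is still a data frame
row_is_all_character(df, 1, 1:2)   # TRUE: each column is tested individually
row_vec <- unlist(df[1, 1:2])      # plain character vector, if one is needed
```

Testing the columns one by one avoids relying on is.character() returning TRUE for a data frame, which it does not do in either R version.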
,但这也产生了相同的结果。The behaviour of
is.character()
that you describe doesn't seem to have changed. What you describe as the behavior for version 3.5.2 doesn't seem to be correct, as can be seen in my attempt to replicate it below.As mentioned by @guasi, the class of your data frame
raw
is justdata.frame
. The class of a column, egraw$NAME
, can becharacter
.It is possible that you had/have a custom handler for
is.character()
to handle data frames on the 3.5.2 setup, and you didn't copy it to the new version:https://www.rdocumentation.org/packages/base/versions/3.5.2/topics/character
You mention that the same code on the old 3.5.2 install, still calls the correct C code. Have you checked yourself what the output of
is.character(raw)
is on that machine?Reproduction attempt
I have installed R 3.5.2 64 bit on Win 7, and couldn't reproduce what you remember from the past:
File: test.csv, created from example data, but tab delimited:
Environment:
Command:
spacing between commands mine
Here I assumed you mean the same command you used before, but without the colClasses parameter.
is.character() is still FALSE with colClasses = "character". It's more common to use read.csv(), but that yielded the same results as well.
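For readers without the file, the same check can be sketched in a self-contained way. This is an assumption-laden stand-in, not the original test: it feeds two made-up tab-delimited rows (a three-column subset based on the question's sample) to read.table() through its text argument instead of reading test.csv.

```r
# Hypothetical stand-in for the tab-delimited file: a header plus two rows.
txt <- "NAME\tZIP\tINCOME\nHope Gorman\t99975\t3212\nSarah Coriano\t99904\t3145"

raw <- read.table(text = txt, header = TRUE, sep = "\t",
                  colClasses = "character")

class(raw)               # "data.frame"
is.character(raw)        # FALSE: a data frame is a list of columns
is.character(raw[1, ])   # FALSE: a row subset is still a data frame
is.character(raw$NAME)   # TRUE: a single column is a character vector
```

So even with every column read as character, only the individual columns pass is.character(); the data frame and its row subsets do not.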