is.character() does not recognize a data frame as character data

Posted 2025-02-13 01:35:02

Has the behavior of is.character() changed in R 4.x?
Here I read a simple tab-delimited text file into a data frame, and then confirm all columns are correctly marked as character data:

> raw <- read.table( creditDataPath, header = TRUE, colClasses="character", sep = "\t")
> str(raw)
'data.frame':   407 obs. of  18 variables:
 $ NAME          : chr  "Hope Gorman" "Sarah Coriano" "Ernest Farmer" "John Coleman" ...
 $ ADDRESS       : chr  "179 Del Mar Blvd." "640 Prospect Lane" "474 Green Street" "452 Green Street" ...
 $ ZIP           : chr  "99975" "99904" "99900" "99924" ...
 $ SSN           : chr  "470-17-7670" "355-91-5677" "129-21-0468" "121-57-2753" ...
 $ SEX           : chr  "F" "F" "M" "M" ...
 $ MARITALSTATUS : chr  "M" "M" "M" "M" ...
 $ CHILDREN      : chr  "2" "1" "0" "0" ...
 $ OCCUPATION    : chr  "Professional" "Unknown" "Unknown" "Unknown" ...
 $ HOMEOWNERSHIP : chr  "O" "O" "O" "O" ...
 $ INCOME        : chr  "3212" "3145" "3165" "3248" ...
 $ EXPENSES      : chr  "1124" "1100" "1266" "974" ...
 $ CHECKING      : chr  "N" "N" "N" "N" ...
 $ SAVINGS       : chr  "Y" "Y" "Y" "Y" ...
 $ MSTRCARD      : chr  "1" "1" "1" "1" ...
 $ VISA          : chr  "5" "5" "5" "5" ...
 $ AMEX          : chr  "0" "0" "0" "0" ...
 $ MERCHANT      : chr  "9" "9" "9" "9" ...
 $ PAYMENTHISTORY: chr  "2" "0" "2" "3" ...

However, is.character(raw) for the data frame and is.character(raw[3,1:17]) for a portion of a row in the data frame both return FALSE:

> is.character(raw)
[1] FALSE
> is.character(raw[3,1:17])
[1] FALSE
> 

With R version 3.5.2 (the original development environment was 64-bit R 3.5.2 on 64-bit Win 7), simply reading the file into a data frame (WITHOUT needing to add colClasses = "character") worked. The use case is that an R wrapper uses is.character() to determine whether a row in the data frame contains all string values (in effect: is.character(raw[n,1:17])); that then determines which version of a C function in a legacy DLL to call: one that expects ALL strings, or one that expects ALL doubles.

I have been away from R since 2019, so today on a computer running Win10 Pro I installed 64-bit R 4.2.1, loaded the original workspace, and expected everything to work. And, if I manually craft a record (vector) that explicitly has every value in double quotes (e.g., "Hope Gorman", "99975", etc.) everything does work - the R wrapper calls the correct C function.

The problem is that loading the data frame from the simple flat ASCII text file and then accessing it row by row does not work, even though after loading R seems to think the data consists of values that are quoted strings. The error is the dreaded "NAs introduced by coercion" error - in the wrapper, R appears NOT to recognize the character strings.

What am I missing? Is this a bug in 4.x ?

EDIT:
Here are the first 4 lines of the file (first line contains column labels; 18 total tab delimited fields - some of the string fields contain spaces, e.g. Hope Gorman is the value for the first Name field/column). This is a toy (ENTIRELY FAKED) data file for consumer credit analysis.

NAME    ADDRESS ZIP SSN SEX MARITALSTATUS   CHILDREN    OCCUPATION  HOMEOWNERSHIP   INCOME  EXPENSES    CHECKING    SAVINGS MSTRCARD    VISA    AMEX    MERCHANT    PAYMENTHISTORY
Hope Gorman 179 Del Mar Blvd.   99975   470-17-7670 F   M   2   Professional    O   3212    1124    N   Y   1   5   0   9   2
Sarah Coriano   640 Prospect Lane   99904   355-91-5677 F   M   1   Unknown O   3145    1100    N   Y   1   5   0   9   0
Ernest Farmer   474 Green Street    99900   129-21-0468 M   M   0   Unknown O   3165    1266    N   Y   1   5   0   9   2

Also FWIW, I have checked everything on the original development machine (same file, same R workspace but R 3.5.2 running on Win7), and the R wrapper calls the correct C code as expected.

This leads me to think there is something different in R 4.2 running on Win 10 - I have noted that R now apparently uses UTF-8 characters, but since the file consists solely of US-ASCII characters and no BOM, I am hard-pressed to think character handling on Win10 is the problem, but the fact remains the original code/ R Workspace doesn't work.

Thanks,
Jack

Comments (2)

梦幻之岛 2025-02-20 01:35:02

First, thanks to all the R Gurus who responded.

Second, and somewhat embarrassing, after re-learning how to use debugging tools in R, I discovered that the reason the code "ran" on R 3.5.2 was that there was a bug in the legacy C DLL.

When I looked at the problematic R code statically, it appeared that the only way the R function could possibly call the correct DLL function was if is.character(data) returned true.

However, when I stepped through the code in the debugger (in the original Win7/R 3.5.2 environment), I found that is.character(data) was actually returning false - as everyone here expected (and Casper V. further demonstrated), BUT the C function in the Win7 DLL was still treating data as an array of character strings (which it should not have done, given the logic path in the R function).

I then discovered that the legacy DLL used on Win 10, which I thought was the same as that used in the Win 7 environment, was actually a later version, in which the bug was fixed (which of course caused the R error I was seeing in Win 10).

In the end, checking the data type in R as suggested by r2evans ultimately solved the problem.
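The thread does not show the exact check that was adopted, so the following is only a minimal sketch of one way to test a single row column by column before deciding which C function to call. The helper names row_is_all_character() and row_as_vector() are hypothetical and do not come from the original wrapper.

```r
# Hypothetical helpers, not part of the original wrapper.

# TRUE only if every selected column of row n is stored as character.
row_is_all_character <- function(df, n, cols = seq_along(df)) {
  row <- df[n, cols, drop = FALSE]          # still a one-row data frame
  all(vapply(row, is.character, logical(1)))
}

# Flatten row n into a plain atomic vector (character if all columns are).
row_as_vector <- function(df, n, cols = seq_along(df)) {
  unlist(df[n, cols, drop = FALSE], use.names = FALSE)
}

# Small stand-in data using values from the question.
raw <- data.frame(
  NAME = c("Hope Gorman", "Sarah Coriano", "Ernest Farmer"),
  ZIP  = c("99975", "99904", "99900"),
  stringsAsFactors = FALSE
)

all_chr <- row_is_all_character(raw, 3)
rec     <- row_as_vector(raw, 3)
is.character(rec)   # the flattened row is an ordinary character vector
```

Flattening with unlist() matters here because is.character() only answers TRUE for an atomic character vector, which is what the hand-crafted record in the question was.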

日久见人心 2025-02-20 01:35:02

The behaviour of is.character() that you describe doesn't seem to have changed. What you describe as the behavior for version 3.5.2 doesn't seem to be correct, as can be seen in my attempt to replicate it below.

As mentioned by @guasi, the class of your data frame raw is just data.frame. The class of a column, e.g. raw$NAME, can be character.
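As a small illustration of that distinction, using a two-column stand-in for the poster's data (the values come from the question, the reduced shape is an assumption made for brevity):

```r
# Stand-in for the poster's data frame; only two of the 18 columns.
raw <- data.frame(
  NAME = c("Hope Gorman", "Sarah Coriano", "Ernest Farmer"),
  ZIP  = c("99975", "99904", "99900"),
  stringsAsFactors = FALSE
)

class(raw)               # "data.frame"
typeof(raw)              # "list": a data frame is a list of columns
is.character(raw)        # FALSE
is.character(raw$NAME)   # TRUE: one column is a character vector
is.character(raw[3, ])   # FALSE: a one-row slice is still a data frame

# Column-wise check: are all columns character?
all_chr <- all(vapply(raw, is.character, logical(1)))
```

vapply() is used rather than sapply() so that the result is guaranteed to be a logical vector even for a data frame with zero columns.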

It is possible that you had/have a custom handler for is.character() to handle data frames on the 3.5.2 setup, and you didn't copy it to the new version:

https://www.rdocumentation.org/packages/base/versions/3.5.2/topics/character

as.character and is.character are generic: you can write methods to handle specific classes of objects, see InternalMethods

You mention that the same code on the old 3.5.2 install still calls the correct C code. Have you checked yourself what the output of is.character(raw) is on that machine?

Reproduction attempt

I have installed R 3.5.2 64 bit on Win 7, and couldn't reproduce what you remember from the past:

File: test.csv, created from example data, but tab delimited:

Here are the first 4 lines of the file (first line contains column labels; 18 total tab delimited fields - some of the string fields contain spaces, e.g. Hope Gorman is the value for the first Name field/column).

Environment:

the original development environment was 64-bit R 3.5.2 on 64-bit Win 7

> sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.5.2

Command:

simply reading the file into a table (WITHOUT needing to add colClasses = "character") simply worked.

> raw <- read.table('C:\\Users\\caspar\\Desktop\\test.csv', header=T, sep='\t')
> str(raw)
'data.frame':   3 obs. of  18 variables:
 $ NAME          : Factor w/ 3 levels "Ernest Farmer",..: 2 3 1
 $ ADDRESS       : Factor w/ 3 levels "179 Del Mar Blvd.",..: 1 3 2
 $ ZIP           : int  99975 99904 99900
 $ SSN           : Factor w/ 3 levels "129-21-0468",..: 3 2 1
 $ SEX           : Factor w/ 2 levels "F","M": 1 1 2
 $ MARITALSTATUS : Factor w/ 1 level "M": 1 1 1
 $ CHILDREN      : int  2 1 0
 $ OCCUPATION    : Factor w/ 2 levels "Professional",..: 1 2 2
 $ HOMEOWNERSHIP : Factor w/ 1 level "O": 1 1 1
 $ INCOME        : int  3212 3145 3165
 $ EXPENSES      : int  1124 1100 1266
 $ CHECKING      : Factor w/ 1 level "N": 1 1 1
 $ SAVINGS       : Factor w/ 1 level "Y": 1 1 1
 $ MSTRCARD      : int  1 1 1
 $ VISA          : int  5 5 5
 $ AMEX          : int  0 0 0
 $ MERCHANT      : int  9 9 9
 $ PAYMENTHISTORY: int  2 0 2
> is.character(raw)
[1] FALSE
> is.character(raw[3,1:17])
[1] FALSE

spacing between commands mine

Here I assumed you mean the same command you used before, but without the colClasses parameter. is.character() is still FALSE with colClasses = "character". It's more common to use read.csv(), but that yielded the same results as well.
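For readers without the file, the same comparison can be sketched with inline text instead of test.csv. The two-column sample is an assumption made for brevity; read.table() and its text, sep, and colClasses arguments are standard.

```r
# Two columns of the sample data, tab delimited, supplied inline.
txt <- "NAME\tZIP\nHope Gorman\t99975\nSarah Coriano\t99904\nErnest Farmer\t99900"

# Let read.table() guess the column types.
auto <- read.table(text = txt, header = TRUE, sep = "\t")

# Force every column to character.
chr <- read.table(text = txt, header = TRUE, sep = "\t",
                  colClasses = "character")

is.integer(auto$ZIP)      # TRUE: digits-only column is guessed as integer
is.character(chr$ZIP)     # TRUE: colClasses keeps it as text
is.character(chr)         # FALSE: the container is still a data frame
vapply(chr, is.character, logical(1))   # TRUE for each column
```

The type guessed for NAME in the first call depends on the R version: string columns became factors by default before R 4.0.0 and stay character from 4.0.0 on, which is why the transcript above shows Factor columns under 3.5.2.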
