Trying to use a user-defined function to populate a new column in a data frame. What is going wrong?

Posted on 2024-12-10 13:54:26

Super short version: I'm trying to use a user-defined function to populate a new column in a dataframe with the command:

TestDF$ELN<-EmployeeLocationNumber(TestDF$Location)

However, when I run the command, it seems to just apply EmployeeLocationNumber to the first row's value of Location rather than using each row's value to determine the new column's value for that row individually.

Please note: I'm trying to understand R, not just perform this particular task. I was actually able to get the output I was looking for using the apply() function, but that's beside the point. My understanding is that the above line should work on a row-by-row basis, but it doesn't.

Here are the specifics for testing:

TestDF<-data.frame(Employee=c(1,1,1,1,2,2,3,3,3), 
                   Month=c(1,5,6,11,4,10,1,5,10), 
                   Location=c(1,5,6,7,10,3,4,2,8))

TestDF keeps track of where each of 3 employees was over the course of the year among several locations.

(You can think of "Location" as unique to each Employee...it is essentially a unique ID for that row.)

The function EmployeeLocationNumber takes a location and outputs a number indicating the order in which the employee visited that location. For example, EmployeeLocationNumber(8) = 2 because it was the second location visited by the employee who visited it.

EmployeeLocationNumber <- function(Site){
  CurrentEmployee <- subset(TestDF,Location==Site,select=Employee, drop = TRUE)[[1]]
  LocationDate<- subset(TestDF,Location==Site,select=Month, drop = TRUE)[[1]]
  LocationNumber <- length(subset(TestDF,Employee==CurrentEmployee & Month<=LocationDate,select=Month)[[1]])
  return(LocationNumber)
}

I realize I probably could have packed all of that into a single subset command, but I didn't know how referencing worked when you used subset commands inside other subset commands.

So, keeping in mind that I'm really trying to understand how to work in R, I have a few questions:

  1. Why won't TestDF$ELN<-EmployeeLocationNumber(TestDF$Location) work row-by-row like other assignment statements do?

  2. Is there an easier way to reference a particular value in a dataframe based on the value of another one? Perhaps one that does not return a dataframe/list that then must be flattened and extracted from?

  3. I'm sure the function I'm using is laughably un-R-like...what should I have done to essentially emulate an INNER Join type query?

Comments (4)

月光色 2024-12-17 13:54:26

Using logical indexing, the condensed one-liner replacement for your function is:

EmployeeLocationNumber <- function(Site){
    with(TestDF[do.call(order, TestDF), ], which(Location[Employee==Employee[which(Location==Site)]] == Site))
}

Of course this isn't the most readable way, but it demonstrates the principles of logical indexing and which() in R. Then, like others have said, just wrap it up with a vectorized *ply function to apply this across your dataset.
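
As a minimal sketch of that last step, the one-liner can be wrapped with vapply() so that it is called once per location. The reformatting of the function body and the intermediate name `sorted` are illustrative choices, not part of the original answer.

```r
# Sample data from the question
TestDF <- data.frame(Employee = c(1, 1, 1, 1, 2, 2, 3, 3, 3),
                     Month    = c(1, 5, 6, 11, 4, 10, 1, 5, 10),
                     Location = c(1, 5, 6, 7, 10, 3, 4, 2, 8))

# The one-liner, split over two steps; it expects a single Site value
EmployeeLocationNumber <- function(Site) {
  # Sort rows by Employee, then Month, then Location
  sorted <- TestDF[do.call(order, TestDF), ]
  # Position of Site among the locations of the employee who visited it
  with(sorted, which(Location[Employee == Employee[which(Location == Site)]] == Site))
}

# vapply() calls the function once per element and checks that
# every call returns exactly one integer
ELN <- vapply(TestDF$Location, EmployeeLocationNumber, integer(1))
print(ELN)
```

vapply() is used here instead of sapply() only because it fails loudly if a call ever returns something other than a single integer, for example when a Site matches no row.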

浅紫色的梦幻 2024-12-17 13:54:26

A) TestDF$Location is a vector. Your function is not set up to return a vector, so giving it a vector will probably fail.

B) In what sense is Location:8 the "second location visited"?

C) If you want within-group ordering, then you need to pass your dataframe, split up by employee, to a function that calculates a result.

D) Conditional access of a data.frame typically involves logical indexing and/or the use of which().
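
To illustrate point D, here is a small sketch of pulling single values out of the data frame with logical indexing, without subset() and without flattening a returned data frame. The variable names `emp`, `mon`, and `visit_no` are made up for this example.

```r
# Sample data from the question
TestDF <- data.frame(Employee = c(1, 1, 1, 1, 2, 2, 3, 3, 3),
                     Month    = c(1, 5, 6, 11, 4, 10, 1, 5, 10),
                     Location = c(1, 5, 6, 7, 10, 3, 4, 2, 8))

# Indexing a column with a logical vector returns a plain vector
emp <- TestDF$Employee[TestDF$Location == 8]      # employee who visited location 8
mon <- TestDF$Month[which(TestDF$Location == 8)]  # same idea, via which()

# Count that employee's visits up to and including that month
visit_no <- sum(TestDF$Employee == emp & TestDF$Month <= mon)
print(c(emp, mon, visit_no))
```

This assumes, as the question states, that each Location value identifies exactly one row; with duplicates, `emp` and `mon` would have more than one element.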

If you just want the sequence of visits by employee, try this:
(The second call uses Month as the first argument, since that is what determines the sequence of locations.)

 # seq_along, unlike seq, is also safe for an employee with a single visit
 with(TestDF, ave(Location, Employee, FUN=seq_along))
[1] 1 2 3 4 1 2 1 2 3
 TestDF$LocOrder <-  with(TestDF, ave(Month, Employee, FUN=seq_along))

If you wanted the second location for EE:3 it would be:

subset(TestDF, LocOrder==2 & Employee==3, select= Location)
#   Location
# 8        2

泪痕残 2024-12-17 13:54:26

The vectorized nature of R (aka row-by-row) works not by repeatedly calling the function with each next value of the arguments, but by passing the entire vector at once and operating on all of it at one time. But in EmployeeLocationNumber, you only return a single value, so that value gets repeated for the entire data set.
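
The difference can be seen with two toy functions (`double_it` and `first_only` are hypothetical names used only for this sketch). Both are called exactly once with the whole column; only the first returns one result per element.

```r
# Sample data from the question
TestDF <- data.frame(Employee = c(1, 1, 1, 1, 2, 2, 3, 3, 3),
                     Month    = c(1, 5, 6, 11, 4, 10, 1, 5, 10),
                     Location = c(1, 5, 6, 7, 10, 3, 4, 2, 8))

# Vectorized: arithmetic operates on every element of x at once
double_it <- function(x) x * 2

# Not vectorized: collapses the whole input to its first element
first_only <- function(x) x[[1]]

TestDF$Doubled <- double_it(TestDF$Location)   # nine values, one per row
TestDF$First   <- first_only(TestDF$Location)  # one value, recycled to all rows
print(TestDF)
```

The original EmployeeLocationNumber behaves like `first_only`: each of its steps ends in `[[1]]` or `length()`, so a nine-element input still produces a single number.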

Also, your example for EmployeeLocationNumber does not match your description.

> EmployeeLocationNumber(8)
[1] 3

Now, one way to vectorize a function in the manner you are thinking (repeated calls for each value) is to pass it through Vectorize()

TestDF$ELN<-Vectorize(EmployeeLocationNumber)(TestDF$Location)

which gives

> TestDF
  Employee Month Location ELN
1        1     1        1   1
2        1     5        5   2
3        1     6        6   3
4        1    11        7   4
5        2     4       10   1
6        2    10        3   2
7        3     1        4   1
8        3     5        2   2
9        3    10        8   3

As to your other questions, I would just write it as

TestDF$ELN<-ave(TestDF$Month, TestDF$Employee, FUN=rank)

The logic is: take the months, group them by employee, and within each group return the rank order of the months (where each one falls in sequence).
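
A short sketch of why rank is a reasonable choice here: it does not depend on the rows already being in month order. The shuffled row order below is arbitrary and only for illustration.

```r
# Sample data from the question
TestDF <- data.frame(Employee = c(1, 1, 1, 1, 2, 2, 3, 3, 3),
                     Month    = c(1, 5, 6, 11, 4, 10, 1, 5, 10),
                     Location = c(1, 5, 6, 7, 10, 3, 4, 2, 8))

# Shuffle the rows so months are no longer sorted within each employee
shuffled <- TestDF[c(9, 3, 5, 1, 8, 2, 6, 4, 7), ]

# Rank the months within each employee; results land on the matching rows
shuffled$ELN <- ave(shuffled$Month, shuffled$Employee, FUN = rank)

# Sort back by employee and month to inspect the result
restored <- shuffled[order(shuffled$Employee, shuffled$Month), ]
print(restored)
```

Note that rank() averages tied values by default, so two visits in the same month would both get a fractional rank such as 1.5; passing `FUN = function(m) rank(m, ties.method = "min")` is one way to keep whole numbers.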

奢望 2024-12-17 13:54:26

Your EmployeeLocationNumber function takes a vector in and returns a single value.
The assignment to create a new data.frame column therefore just gets a single value:

EmployeeLocationNumber(TestDF$Location) # returns 1

TestDF$ELN<-1 # Creates a new column with the single value 1 everywhere

  1. Assignment doesn't do any magic like that. It takes a value and puts it somewhere; in this case, the value 1. If the value were a vector with one element per row, it would work as you wanted.
  2. I'll get back to you on that :)
  3. Ditto.

Update: I finally worked out some code to do it, but by then @DWin had a much better solution :(

TestDF$ELN <- unlist(lapply(split(TestDF, TestDF$Employee), function(x) rank(x$Month)))

...I guess the ave function does pretty much what the code above does. But for the record:

First I split the data.frame into sub-frames, one per employee. Then I rank the months (just in case your months are not in order). You could use order too, but rank can handle ties better. Finally I combine all the results into a vector and put it into the new column ELN.
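
One caveat worth sketching: unlist() concatenates the results in group order, so the approach above relies on the rows already being grouped by employee, as they are in the sample data. Assuming the rows might be interleaved, base R's unsplit() puts each result back on its original row. The name `mixed` and the row order are illustrative.

```r
# Sample data from the question
TestDF <- data.frame(Employee = c(1, 1, 1, 1, 2, 2, 3, 3, 3),
                     Month    = c(1, 5, 6, 11, 4, 10, 1, 5, 10),
                     Location = c(1, 5, 6, 7, 10, 3, 4, 2, 8))

# Interleave the employees so the groups are no longer contiguous
mixed <- TestDF[c(1, 5, 7, 2, 6, 8, 3, 9, 4), ]

groups <- split(mixed$Month, mixed$Employee)  # one vector of months per employee
ranks  <- lapply(groups, rank)                # rank months within each employee
mixed$ELN <- unsplit(ranks, mixed$Employee)   # reverse the split, in row order
print(mixed)
```

This split/lapply/unsplit sequence is essentially what ave() does in a single call.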

Update again: regarding question 2, "What is the best way to reference a value in a dataframe?":

This depends a bit on the specific problem, but if you have a value, say Employee=3, and want to find all rows in the data.frame that match it, then simply:

TestDF$Employee == 3 # Returns logical vector with TRUE for all rows with Employee == 3
which(TestDF$Employee == 3) # Returns a vector of indices instead
TestDF[which(TestDF$Employee == 3), ] # Subsets the data.frame on Employee == 3
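
Question 3 (emulating an INNER JOIN) is not answered above. One possible sketch, not the only approach, uses base R's merge() to join the data frame to itself on Employee and then filters and counts the joined rows. The `.other` suffix and the names `pairs` and `counts` are choices made for this example.

```r
# Sample data from the question
TestDF <- data.frame(Employee = c(1, 1, 1, 1, 2, 2, 3, 3, 3),
                     Month    = c(1, 5, 6, 11, 4, 10, 1, 5, 10),
                     Location = c(1, 5, 6, 7, 10, 3, 4, 2, 8))

# Self-join on Employee: each visit is paired with every visit
# made by the same employee (merge() is an inner join by default)
pairs <- merge(TestDF, TestDF, by = "Employee", suffixes = c("", ".other"))

# Keep only pairs where the other visit happened no later than this one
pairs <- pairs[pairs$Month.other <= pairs$Month, ]

# The number of surviving pairs per Location is that visit's sequence number
counts <- table(pairs$Location)
TestDF$ELN <- as.integer(counts[as.character(TestDF$Location)])
print(TestDF)
```

The join produces one row per pair of visits by the same employee, so for large data the grouped ave() approach is likely to be cheaper.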