处理未知位置的缺失值

发布于 2024-12-14 13:34:06 字数 7435 浏览 0 评论 0原文

我有一些处理这个问题的想法,但我希望大师们能想出更好的办法。我向 Mechanical Turk 提​​交了一堆行输入。我需要表中的一行,并且有一个字段,我要求他们在其中输入以逗号分隔的行值。然后在 RI 中对其进行 strsplit,现在我正在比较多个 Turkers 条目的结果。

一种常见的模式是,一个 Turker 会错过一个条目,从而使其余条目落后一个。因此,挑战在于知道将缺失值放在哪里。假设他们只错过输入一个条目(我有错误检查代码来确认这一点),并且我可能从每个表行中获得了最多 3 个重复项(因此可能有 1-2 个适当的长度,并且1-2 条目的大小大约如下,因此计算效率并不重要,假设最长的条目是适当的长度

。作为列表,每个元素都是不同 Turker 的复制):

tt <- list(structure(c(4, 4, 5, 7, 9, 13, 15, 18, 20, 22, 24, 
27, 30, 32, 35, 37, 41, 43, 46, 48, 51, 54, 57, 60, 63), .Dim = c(25L, 
1L)), structure(c(4, 4, 5, 7, 9, 11, 13, 15, 18, 20, 22, 25, 
27, 30, 32, 35, 37, 40, 43, 46, 48, 51, 54, 57, 60, 63), .Dim = c(26L, 
1L)), structure(c(4, 4, 5, 7, 9, 11, 13, 15, 19, 20, 22, 25, 
27, 30, 32, 35, 37, 42, 43, 46, 48, 51, 54, 57, 61, 63), .Dim = c(26L, 
1L)))

lengths <- sapply(tt,length)
longs <- simplify2array(tt[lengths==max(lengths)],FALSE)
shorts <- simplify2array(tt[lengths==max(lengths)-1],FALSE)

我考虑过的算法是:

  • 在每个可能的位置使用 NA 创建 max(lengths) 排列,并将它们同时与 1- 2 个适当长度的元素,使用对总偏差的一些估计
  • 循环遍历每个元素并与 1-2 个适当长度的元素进行比较,直到找到不完全匹配的元素,然后确定与所有后续元素相比差异有多大。与 NA 的差异或例如,如果它们匹配到第 5 个条目,但将 NA 放入第 5 个条目中,其余部分仍会超出第 5 个条目中的差异,请继续向下移动向量。

很好奇大家会如何实现这一点。我很难避免循环并以优雅的方式编写它。可能像 filter 这样的东西可能会有所帮助。

有问题的输入和所需输出的示例

有问题的输入(缺少一个值;其他值中没有拼写错误)

> tt1 <- list(c(4, 4, 7, 9, 11), c(4, 4, 5, 7, 9, 11), c(4, 4, 5, 7, 9, 
11))
> tt1
[[1]]
[1]  4  4  7  9 11

[[2]]
[1]  4  4  5  7  9 11

[[3]]
[1]  4  4  5  7  9 11

所需的输出

> tt1
  [,1] [,2] [,3]
1    4    4    4
2    4    4    4
3   NA    5    5
4    7    7    7
5    9    9    9
6   11   11   11

有问题的输入(缺少 一个值)值 + 另一个值中的拼写错误)

> tt2 <- list(c(4, 4, 7, 9, 11), c(4, 3, 5, 7, 9, 11), c(4, 4, 5, 7, 9, 
11))
> tt2
[[1]]
[1]  4  4  7  9 11

[[2]]
[1]  4  3  5  7  9 11

[[3]]
[1]  4  4  5  7  9 11

期望的输出

> tt2[[1]][4:6] <- tt2[[1]][3:5]
> tt2[[1]][3] <- NA
> simplify2array(tt2,FALSE)
     [,1] [,2] [,3]
[1,]    4    4    4
[2,]    4    3    4
[3,]   NA    5    5
[4,]    7    7    7
[5,]    9    9    9
[6,]   11   11   11

应适当地容忍拼写错误的其他变化。请注意,向量通常会增加(您可以将它们视为随噪声单调增加)。因此,如果有人将 7 误认为 4,则可能是拼写错误。另请注意,对于大多数情况,我只进行了 2 次重复,因此没有任何方法可以为一个非缺失值提供比任何其他非缺失值更高的可信度。必须审视整个模式,或者至少利用它们普遍增加的事实。

完整数据框

上面的每个 tt 示例都是下面 data.frame 中给定脚图像级别的所有 TotalTime 条目。这是整个数据集。请注意,image 组之间的条目总数可能会发生变化。该值是预先知道的,或者您可以从最大条目中获取它。

dat <- structure(list(feet = c(1, 2, 3, 3, 1, 1, 7, 7, 8, 9, 9, 1, 1, 
2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 
6, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 6, 6, 7, 7, 8, 8, 9, 10, 10
), TotalTime = c("4,3,4,6,6,10,12,14,16,18,20,22,25,28,30,32,34,36,41,44,46,49,51,55,58", 
"4,4,5,7,9,11,13,15,18,20,22,25,27,30,32,35,37,41,43,46,48,51,54,57,60,63", 
"3,4,6,8,11,13,15,17,20,22,25,27,32,34,38,39,41,44,47,49,52,55,58,61,64,67", 
"3,4,6,8,11,13,15,17,20,22,25,27,32,34,36,39,41,44,47,49,52,55,58,61,64,67", 
"4,3,4,6,8,20,22,24,26,28,30,31,34,36,38,40,42,44,46,48,50,52,54,56,58,60", 
"4,3,4,6,8,10,12,14,16,18,20,22,25,28,30,32,34,38,41,44,46,49,51,55,58", 
"4,4,4,7,10,15,18,21,24,29,32,35,38,43,47,52,56,60,63,67,72,76,82,84", 
"4,4,4,7,10,15,18,21,24,29,32,35,38,43.47,52,56,60,63,67,72,76,82,84", 
"4,3,5,8,14,16,20,24,27,31,34,37,42,46,49,55,59,64,68,73,77,83,89,91", 
"4,4,6,9,15,18,22,25,28,32,36,40,44,49,53,59,63,68,74,80,85,93,94", 
"4,4,6,9,15,18,22,25,28,32,36,40,44,49,53,59,63,68,74,80,85,88,93,94", 
"4,3,4,6,8,10,12,14,16,18,20,22,25,28,30,32,34,36,41,44,46,49,51,55,58", 
"4,3,4,6,8,10,12,14,16,18,20,22,25,28,30,32,34,36,38,41,44,46,49,51,55,58", 
"4,4,5,7,9,11,13,15,18,20,22,25,27,31,32,35,37,41,43,46,48,51,54,57,60,63", 
"4,4,5,7,9,11,13,15,18,20,22,25,27,30,32,35,37,41,43,46,48,51,54,57,60,63", 
"3,4,6,8,11,13,15,17,20,22,25,27,32,34,38,39,41,44,47,49,52,55,58,61,64,67", 
"3,4,6,8,11,13,15,17,20,22,25,27,32,34,36,39,41,44,47,49,52,55,58,61,64,67", 
"3,5,7,9,12,14,16,19,22,24,29,31,34,36,38,41,44,47,50,53,58,61,64,67,69,72", 
"3,5,7,9,12,14,16,19,22,24,29,31,34,36,38,41,44,47,50,53,58,61,64,67,69,72", 
"4,6,8,11,13,15,19,21,25,28,30,33,36,38,41,44,49,52,55,58,61,65,68,71,75,79", 
"4,6,8,11,13,15,19,21,25,28,30,33,36,38,41,44,49,52,55,58,61,65,68,71,75,79", 
"4,6,9,11,14,17,21,24,27,30,33,35,38,42,45,49,52,55,58,63,67,70,73,78,82,85", 
"4,6,9,11,14,17,21,24,27,30,33,35,36,42,45,49,52,55,58,63,67,70,73,78,82,85", 
"2,4,6,9,11,13,16,16,20,23,24,26,28,29,31,33,35,37,39,40,42,43,45,47,52", 
"2,4,6,9,11,13,16,18,20,21,23,24,26,28,29,31,33,35,37,39,40,42,43,45,47,52", 
"2,5,7,11,12,14,17,19,21,22,24,26,28,29,31,35,36,39,41,42,44,46,48,50,52,54", 
"2,5,7,11,12,14,17,19,21,22,24,26,28,29,31,35,36,39,41,42,44,46,48,50,52,54", 
"4,6,9,11,13,16,18,20,22,24,27,29,31,32,35,37,39,41,43,45,46,49,51,53,55,57", 
"4,6,9,11,13,16,18,20,22,24,27,29,31,32,35,37,39,41,43,45,46,49,51,53,55,57", 
"6,7,10,13,15,18,20,23,24,28,30,32,34,37,39,41,43,45,47,49,54,57,59,61,63", 
"6,7,10,13,15,18,20,23,24,26,28,30,32,34,37,39,41,43,45,47,49,54,57,59,61,63", 
"6,8,10,14,16,19,21,23,25,28,30,32,36,39,41,43,45,47,49,52,54,57,59,61,63,65", 
"6,8,10,14,16,19,21,23,25,28,30,32,36,39,41,43,45,47,49,52,54,57,59,61,63,65", 
"7,9,12,14,18,20,23,24,27,31,33,35,38,40,43,45,47,49,51,55,58,60,62,65,67,69", 
"7,9,12,14,18,20,23,24,27,31,33,35,38,40,43,45,47,49,51,55,58,60,62,65,67,69", 
"4,3,5,7,10,13,17,20,23,26,29,33,36,40,43,48,51,55,60,64,67,72,75,77", 
"4,3,5,7,10,13,17,20,23,26,29,33,36,40,43,48,51,55,60,64,67,72,75,77", 
"4,4,4,7,10,15,18,21,24,29,32,35,38,43,47,52,56,60,63,67,72,76,82,84", 
"4,4,4,7,10,15,18,21,24,29,32,35,38,43,47,52,56,60,63,67,72,76,82,84", 
"4,3,5,8,14,16,20,24,27,31,34,37,42,46,49,55,59,64,68,73,77,83,89,91", 
"4,3,5,8,14,16,20,24,27,31,34,37,42,46,49,55,59,64,68,73,77,83,89,91", 
"4,4,6,9,15,18,22,25,28,32,36,40,44,49,53,59,63,68,74,80,85,88,93,94", 
"4,4,6,9,15,18,22,25,28,32,36,40,44,49,53,59,63,68,74,80,85,88,93,94", 
"0,0,0,1,1,1,3,3,3,5,5,5,6,6,7,7,8,8,9,10,11,10,11,11", "0,0,0,1,1,1,3,3,3,5,5,6,6,7,7,8,8,9,10,11,10,11,11", 
"6,4,7,10,13,16,20,22,25,27,30,32,35,38,43,45,48,52,54,57,60,62,64,67", 
"6,4,7,10,13,16,20,22,25,27,30,32,35,38,43,45,48,52,54,57,60,62,64,67", 
"6,4,7,10,14,19,21,23,26,28,33,36,39,42,45,47,50,53,56,60,62,65,69,70", 
"6,4,7,10,14,19,21,23,26,28,33,36,39,42,45,47,50,53,56,60,62,65,69,70", 
"2,5,9,12,14,20,21,24,29,32,34,37,41,44,46,50,53,59,62,65,68,72,75,76", 
"2,5,9,12,14,20,21,24,29,32,34,37,41,44,46,50,53,59,62,65,68,72,75,76", 
"2,5,9,13,17,20,24,27,30,33,37,42,45,48,52,55,58,62,65,67,72,75,78,80", 
"3,6,10,15,18,23,25,26,28,32,36,40,43,47,50,53,58,61,65,67,70,75,78,83,86", 
"3,6,10,15,18,23,25,28,32,36,40,43,47,50,53,58,61,65,67,70,75,78,83,86"
), image = c(1, 1, 1, 1, 3, 3, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 
2, 2, 2, 2, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4, 4)), .Names = c("feet", 
"TotalTime", "image"), row.names = c(1L, 2L, 3L, 4L, 5L, 6L, 
7L, 8L, 9L, 10L, 11L, 14L, 15L, 16L, 17L, 19L, 20L, 22L, 23L, 
24L, 26L, 27L, 28L, 29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L, 37L, 
38L, 40L, 41L, 42L, 43L, 44L, 45L, 46L, 47L, 49L, 50L, 51L, 53L, 
54L, 55L, 56L, 57L, 58L, 59L, 61L, 62L, 63L), class = "data.frame")

I have a few ideas for dealing with this, but I expect the guRus can come up with something better yet. I submitted a bunch of row typing-in to Mechanical Turk. I needed a single row from a table, and I had a field into which I asked them to type the row's values separated by commas. In R I have then strsplit this, and I am now comparing the results of multiple Turkers' entries.

A common pattern is that one Turker will have missed a single entry, throwing the rest of the entries off by one. So the challenge is to know where to put the missing value. Assume that they only ever miss entering a single entry (I have error-checking code to confirm this), and that I may have obtained up to 3 replicates from each table row (so there could be 1-2 of the proper length, and 1-2 that are too short. Entries are approximately of the size below, and I only have about 50, so computational efficiency is not paramount. Assume that the longest entry is the proper length.

Here is an example of one such row (stored as a list, with each element being a replication by a different Turker):

tt <- list(structure(c(4, 4, 5, 7, 9, 13, 15, 18, 20, 22, 24, 
27, 30, 32, 35, 37, 41, 43, 46, 48, 51, 54, 57, 60, 63), .Dim = c(25L, 
1L)), structure(c(4, 4, 5, 7, 9, 11, 13, 15, 18, 20, 22, 25, 
27, 30, 32, 35, 37, 40, 43, 46, 48, 51, 54, 57, 60, 63), .Dim = c(26L, 
1L)), structure(c(4, 4, 5, 7, 9, 11, 13, 15, 19, 20, 22, 25, 
27, 30, 32, 35, 37, 42, 43, 46, 48, 51, 54, 57, 61, 63), .Dim = c(26L, 
1L)))

lengths <- sapply(tt,length)
longs <- simplify2array(tt[lengths==max(lengths)],FALSE)
shorts <- simplify2array(tt[lengths==max(lengths)-1],FALSE)

Algorithms I have considered are:

  • Creating max(lengths) permutations with the NA in every single possible place, and comparing them simultaneously to the 1-2 ones of appropriate length using some estimate of the total deviation.
  • Looping through each element and comparing to the 1-2 ones of appropriate length until I find a non-exact match. Then decide how big the difference is as compared to all the subsequent differences with the NA or not. E.g. if they match up to the 5th entry, but putting NA in the 5th entry still leaves the rest off by more than the difference in the 5th entry, keep moving down the vectors.

Curious how everyone would implement this. I'm having a hard time avoiding loops and writing this in an elegant way. Possibly something like filter might help.

Examples of problematic input and desired output

Problematic input (missing one value; no typos in other values)

> tt1 <- list(c(4, 4, 7, 9, 11), c(4, 4, 5, 7, 9, 11), c(4, 4, 5, 7, 9, 
11))
> tt1
[[1]]
[1]  4  4  7  9 11

[[2]]
[1]  4  4  5  7  9 11

[[3]]
[1]  4  4  5  7  9 11

Desired output

> tt1
  [,1] [,2] [,3]
1    4    4    4
2    4    4    4
3   NA    5    5
4    7    7    7
5    9    9    9
6   11   11   11

Problematic input (missing value + a typo in another value)

> tt2 <- list(c(4, 4, 7, 9, 11), c(4, 3, 5, 7, 9, 11), c(4, 4, 5, 7, 9, 
11))
> tt2
[[1]]
[1]  4  4  7  9 11

[[2]]
[1]  4  3  5  7  9 11

[[3]]
[1]  4  4  5  7  9 11

Desired output

> tt2[[1]][4:6] <- tt2[[1]][3:5]
> tt2[[1]][3] <- NA
> simplify2array(tt2,FALSE)
     [,1] [,2] [,3]
[1,]    4    4    4
[2,]    4    3    4
[3,]   NA    5    5
[4,]    7    7    7
[5,]    9    9    9
[6,]   11   11   11

Other variations of typos should be tolerated gracefully. Note that the vectors are generally increasing (you could view them as monotonically increasing with noise). So if someone mistakes a 7 for a 4, that's probably a typo. Note also that for most I only did 2 replicates, so there won't be any way to give one non-missing value any more credence than any other non-missing value. Going to have to look at the whole pattern or at least take advantage of the fact that they're generally increasing.

Full data frame

Each of the tt examples above is all of the TotalTime entries for a given feet-image level in the data.frame below. This is the whole dataset. Note that the total number of entries potentially changes between image groups. This value is known in advance, or you could just get it from the max of the entries.

dat <- structure(list(feet = c(1, 2, 3, 3, 1, 1, 7, 7, 8, 9, 9, 1, 1, 
2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 
6, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 6, 6, 7, 7, 8, 8, 9, 10, 10
), TotalTime = c("4,3,4,6,6,10,12,14,16,18,20,22,25,28,30,32,34,36,41,44,46,49,51,55,58", 
"4,4,5,7,9,11,13,15,18,20,22,25,27,30,32,35,37,41,43,46,48,51,54,57,60,63", 
"3,4,6,8,11,13,15,17,20,22,25,27,32,34,38,39,41,44,47,49,52,55,58,61,64,67", 
"3,4,6,8,11,13,15,17,20,22,25,27,32,34,36,39,41,44,47,49,52,55,58,61,64,67", 
"4,3,4,6,8,20,22,24,26,28,30,31,34,36,38,40,42,44,46,48,50,52,54,56,58,60", 
"4,3,4,6,8,10,12,14,16,18,20,22,25,28,30,32,34,38,41,44,46,49,51,55,58", 
"4,4,4,7,10,15,18,21,24,29,32,35,38,43,47,52,56,60,63,67,72,76,82,84", 
"4,4,4,7,10,15,18,21,24,29,32,35,38,43.47,52,56,60,63,67,72,76,82,84", 
"4,3,5,8,14,16,20,24,27,31,34,37,42,46,49,55,59,64,68,73,77,83,89,91", 
"4,4,6,9,15,18,22,25,28,32,36,40,44,49,53,59,63,68,74,80,85,93,94", 
"4,4,6,9,15,18,22,25,28,32,36,40,44,49,53,59,63,68,74,80,85,88,93,94", 
"4,3,4,6,8,10,12,14,16,18,20,22,25,28,30,32,34,36,41,44,46,49,51,55,58", 
"4,3,4,6,8,10,12,14,16,18,20,22,25,28,30,32,34,36,38,41,44,46,49,51,55,58", 
"4,4,5,7,9,11,13,15,18,20,22,25,27,31,32,35,37,41,43,46,48,51,54,57,60,63", 
"4,4,5,7,9,11,13,15,18,20,22,25,27,30,32,35,37,41,43,46,48,51,54,57,60,63", 
"3,4,6,8,11,13,15,17,20,22,25,27,32,34,38,39,41,44,47,49,52,55,58,61,64,67", 
"3,4,6,8,11,13,15,17,20,22,25,27,32,34,36,39,41,44,47,49,52,55,58,61,64,67", 
"3,5,7,9,12,14,16,19,22,24,29,31,34,36,38,41,44,47,50,53,58,61,64,67,69,72", 
"3,5,7,9,12,14,16,19,22,24,29,31,34,36,38,41,44,47,50,53,58,61,64,67,69,72", 
"4,6,8,11,13,15,19,21,25,28,30,33,36,38,41,44,49,52,55,58,61,65,68,71,75,79", 
"4,6,8,11,13,15,19,21,25,28,30,33,36,38,41,44,49,52,55,58,61,65,68,71,75,79", 
"4,6,9,11,14,17,21,24,27,30,33,35,38,42,45,49,52,55,58,63,67,70,73,78,82,85", 
"4,6,9,11,14,17,21,24,27,30,33,35,36,42,45,49,52,55,58,63,67,70,73,78,82,85", 
"2,4,6,9,11,13,16,16,20,23,24,26,28,29,31,33,35,37,39,40,42,43,45,47,52", 
"2,4,6,9,11,13,16,18,20,21,23,24,26,28,29,31,33,35,37,39,40,42,43,45,47,52", 
"2,5,7,11,12,14,17,19,21,22,24,26,28,29,31,35,36,39,41,42,44,46,48,50,52,54", 
"2,5,7,11,12,14,17,19,21,22,24,26,28,29,31,35,36,39,41,42,44,46,48,50,52,54", 
"4,6,9,11,13,16,18,20,22,24,27,29,31,32,35,37,39,41,43,45,46,49,51,53,55,57", 
"4,6,9,11,13,16,18,20,22,24,27,29,31,32,35,37,39,41,43,45,46,49,51,53,55,57", 
"6,7,10,13,15,18,20,23,24,28,30,32,34,37,39,41,43,45,47,49,54,57,59,61,63", 
"6,7,10,13,15,18,20,23,24,26,28,30,32,34,37,39,41,43,45,47,49,54,57,59,61,63", 
"6,8,10,14,16,19,21,23,25,28,30,32,36,39,41,43,45,47,49,52,54,57,59,61,63,65", 
"6,8,10,14,16,19,21,23,25,28,30,32,36,39,41,43,45,47,49,52,54,57,59,61,63,65", 
"7,9,12,14,18,20,23,24,27,31,33,35,38,40,43,45,47,49,51,55,58,60,62,65,67,69", 
"7,9,12,14,18,20,23,24,27,31,33,35,38,40,43,45,47,49,51,55,58,60,62,65,67,69", 
"4,3,5,7,10,13,17,20,23,26,29,33,36,40,43,48,51,55,60,64,67,72,75,77", 
"4,3,5,7,10,13,17,20,23,26,29,33,36,40,43,48,51,55,60,64,67,72,75,77", 
"4,4,4,7,10,15,18,21,24,29,32,35,38,43,47,52,56,60,63,67,72,76,82,84", 
"4,4,4,7,10,15,18,21,24,29,32,35,38,43,47,52,56,60,63,67,72,76,82,84", 
"4,3,5,8,14,16,20,24,27,31,34,37,42,46,49,55,59,64,68,73,77,83,89,91", 
"4,3,5,8,14,16,20,24,27,31,34,37,42,46,49,55,59,64,68,73,77,83,89,91", 
"4,4,6,9,15,18,22,25,28,32,36,40,44,49,53,59,63,68,74,80,85,88,93,94", 
"4,4,6,9,15,18,22,25,28,32,36,40,44,49,53,59,63,68,74,80,85,88,93,94", 
"0,0,0,1,1,1,3,3,3,5,5,5,6,6,7,7,8,8,9,10,11,10,11,11", "0,0,0,1,1,1,3,3,3,5,5,6,6,7,7,8,8,9,10,11,10,11,11", 
"6,4,7,10,13,16,20,22,25,27,30,32,35,38,43,45,48,52,54,57,60,62,64,67", 
"6,4,7,10,13,16,20,22,25,27,30,32,35,38,43,45,48,52,54,57,60,62,64,67", 
"6,4,7,10,14,19,21,23,26,28,33,36,39,42,45,47,50,53,56,60,62,65,69,70", 
"6,4,7,10,14,19,21,23,26,28,33,36,39,42,45,47,50,53,56,60,62,65,69,70", 
"2,5,9,12,14,20,21,24,29,32,34,37,41,44,46,50,53,59,62,65,68,72,75,76", 
"2,5,9,12,14,20,21,24,29,32,34,37,41,44,46,50,53,59,62,65,68,72,75,76", 
"2,5,9,13,17,20,24,27,30,33,37,42,45,48,52,55,58,62,65,67,72,75,78,80", 
"3,6,10,15,18,23,25,26,28,32,36,40,43,47,50,53,58,61,65,67,70,75,78,83,86", 
"3,6,10,15,18,23,25,28,32,36,40,43,47,50,53,58,61,65,67,70,75,78,83,86"
), image = c(1, 1, 1, 1, 3, 3, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 
2, 2, 2, 2, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4, 4)), .Names = c("feet", 
"TotalTime", "image"), row.names = c(1L, 2L, 3L, 4L, 5L, 6L, 
7L, 8L, 9L, 10L, 11L, 14L, 15L, 16L, 17L, 19L, 20L, 22L, 23L, 
24L, 26L, 27L, 28L, 29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L, 37L, 
38L, 40L, 41L, 42L, 43L, 44L, 45L, 46L, 47L, 49L, 50L, 51L, 53L, 
54L, 55L, 56L, 57L, 58L, 59L, 61L, 62L, 63L), class = "data.frame")

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

神爱温柔 2024-12-21 13:34:06

这是一个旨在提高可读性的解决方案。毫无疑问,它可以折叠成更少的代码行:

desiredLength <- function(x){
  len <- sapply(x, length)
  max(len)
}

insertNA <- function(x, position=1){
  c(x[seq_along(x) < position], NA, x[seq_along(x) >= position]) 
}

fixLength <- function(x, position=1){
  dlen <- desiredLength(x)
  sapply(x, function(zz) if(length(zz) < dlen) insertNA(zz, position) else zz)
}

objectiveFunction <- function(x){
  sum(apply(x, 1, function(z)length(unique(z))))
}

findMinObjective <- function(x){
  pos <- NA
  obj <- Inf
  for(i in 1:desiredLength(x)){
    z <- objectiveFunction(fixLength(x, position=i))
    if(z < obj){
      obj <- z
      pos <- i
    }
  }
  fixLength(x, pos)
}

结果:

> findMinObjective(tt1)
     [,1] [,2] [,3]
[1,]    4    4    4
[2,]    4    4    4
[3,]   NA    5    5
[4,]    7    7    7
[5,]    9    9    9
[6,]   11   11   11

> findMinObjective(tt2)
     [,1] [,2] [,3]
[1,]    4    4    4
[2,]    4    3    4
[3,]   NA    5    5
[4,]    7    7    7
[5,]    9    9    9
[6,]   11   11   11

Here is a solution that aims to be readable. It can be collapsed into a smaller number of of lines of code, no doubt:

desiredLength <- function(x){
  len <- sapply(x, length)
  max(len)
}

insertNA <- function(x, position=1){
  c(x[seq_along(x) < position], NA, x[seq_along(x) >= position]) 
}

fixLength <- function(x, position=1){
  dlen <- desiredLength(x)
  sapply(x, function(zz) if(length(zz) < dlen) insertNA(zz, position) else zz)
}

objectiveFunction <- function(x){
  sum(apply(x, 1, function(z)length(unique(z))))
}

findMinObjective <- function(x){
  pos <- NA
  obj <- Inf
  for(i in 1:desiredLength(x)){
    z <- objectiveFunction(fixLength(x, position=i))
    if(z < obj){
      obj <- z
      pos <- i
    }
  }
  fixLength(x, pos)
}

The results:

> findMinObjective(tt1)
     [,1] [,2] [,3]
[1,]    4    4    4
[2,]    4    4    4
[3,]   NA    5    5
[4,]    7    7    7
[5,]    9    9    9
[6,]   11   11   11

> findMinObjective(tt2)
     [,1] [,2] [,3]
[1,]    4    4    4
[2,]    4    3    4
[3,]   NA    5    5
[4,]    7    7    7
[5,]    9    9    9
[6,]   11   11   11
川水往事 2024-12-21 13:34:06

我希望这会有所帮助:

f <- function(tt) {
  len <- (sapply(tt, length))
  tar <- rowMeans(do.call("cbind", tt[len == max(len)]))
  tt[len < max(len)] <- 
    lapply(tt[len < max(len)],
      function(x) {
        r <- lapply(combn(max(len), max(len)-length(x)),
          function(i) {z <- numeric(max(len)); z[i] <- NA; z[!is.na(z)] <- x; z})
        r[[which.min(sapply(r, function(x) sum((x - tar)^2, na.rm = T)))]]
    })
  simplify2array(tt,FALSE)
}

那么,

> f(tt)
      [,1] [,2] [,3]
 [1,]    4    4    4
 [2,]    3    4    4
 [3,]    4    5    5
... snip ...
[24,]   55   57   57
[25,]   58   60   61
[26,]   NA   63   63

> f(tt1)
     [,1] [,2] [,3]
[1,]    4    4    4
[2,]    4    4    4
[3,]   NA    5    5
[4,]    7    7    7
[5,]    9    9    9
[6,]   11   11   11

> f(tt2)
     [,1] [,2] [,3]
[1,]    4    4    4
[2,]    4    3    4
[3,]   NA    5    5
[4,]    7    7    7
[5,]    9    9    9
[6,]   11   11   11

这里是您的完整数据的示例:

dlply(dat, .(feet, image), function(x) f(lapply(strsplit(x$TotalTime, ","), as.numeric)))

看起来运行良好。

I hope this will help:

f <- function(tt) {
  len <- (sapply(tt, length))
  tar <- rowMeans(do.call("cbind", tt[len == max(len)]))
  tt[len < max(len)] <- 
    lapply(tt[len < max(len)],
      function(x) {
        r <- lapply(combn(max(len), max(len)-length(x)),
          function(i) {z <- numeric(max(len)); z[i] <- NA; z[!is.na(z)] <- x; z})
        r[[which.min(sapply(r, function(x) sum((x - tar)^2, na.rm = T)))]]
    })
  simplify2array(tt,FALSE)
}

then,

> f(tt)
      [,1] [,2] [,3]
 [1,]    4    4    4
 [2,]    3    4    4
 [3,]    4    5    5
... snip ...
[24,]   55   57   57
[25,]   58   60   61
[26,]   NA   63   63

> f(tt1)
     [,1] [,2] [,3]
[1,]    4    4    4
[2,]    4    4    4
[3,]   NA    5    5
[4,]    7    7    7
[5,]    9    9    9
[6,]   11   11   11

> f(tt2)
     [,1] [,2] [,3]
[1,]    4    4    4
[2,]    4    3    4
[3,]   NA    5    5
[4,]    7    7    7
[5,]    9    9    9
[6,]   11   11   11

and here is an example for your full data:

dlply(dat, .(feet, image), function(x) f(lapply(strsplit(x$TotalTime, ","), as.numeric)))

looks like working well.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文