which.min not working inside sapply on a data frame?

Posted on 2024-12-15 12:10:01

Can anyone explain this strange behavior found when trying to use sapply and which.min to find the first lines inside a dataframe satisfying a condition?

The data frame is trApr; it is sorted by customer_id (increasing) and then by visit_date (increasing). For each customer_id, we want to find the row index of that customer's first transaction in trApr. (The number of transactions per customer_id varies, but that should not matter.)

trApr is 'data.frame':  2195716 obs. of  3 variables:
 $ customer_id: int  2 2 2 2 2 2 2 2 2 2 ...
 $ visit_date : Date, format: "2011-04-02" "2011-04-06" "2011-04-07" "2011-04-08" ...
 $ visit_spend: num  37.12 32.51 4.55 31.35 42.49 ...

Other notes on the code:

  • all_tr_cids is simply the sorted vector of unique customer_ids: unique(trApr$customer_id)
  • n:m are just indices I used to take a tiny slice of the data frame while debugging; I want to run sapply over the entire data frame.

Here's the code in question:

**GOOD:** I <- sapply(all_tr_cids[n:m], function(cid){ head(which(trApr$customer_id==cid),n=1) }, USE.NAMES=FALSE)
 [1] 1909 1928 1964 1970 1988 2037 2092 2113 2140 2182

**BAD:** I <- sapply(all_tr_cids[n:m], function(cid){ which.min(trApr$customer_id==cid) }, USE.NAMES=FALSE)
 [1] 1 1 1 1 1 1 1 1 1 1

The intermediate ragged object returned by sapply with plain which() is shown below (it is a list of 10 integer vectors).

If which.min can't handle that sort of structure, it really should raise a warning, not merrily return a vector of 1s.

sapply(all_tr_cids[n:m], function(cid){ which(trApr$customer_id==cid) }, USE.NAMES=FALSE)
[[1]]
 [1] 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927

[[2]]
 [1] 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957
[31] 1958 1959 1960 1961 1962 1963

[[3]]
[1] 1964 1965 1966 1967 1968 1969

[[4]]
 [1] 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987

[[5]]
 [1] 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017
[31] 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030 2031 2032 2033 2034 2035 2036

[[6]]
 [1] 2037 2038 2039 2040 2041 2042 2043 2044 2045 2046 2047 2048 2049 2050 2051 2052 2053 2054 2055 2056 2057 2058 2059 2060 2061 2062 2063 2064 2065 2066
[31] 2067 2068 2069 2070 2071 2072 2073 2074 2075 2076 2077 2078 2079 2080 2081 2082 2083 2084 2085 2086 2087 2088 2089 2090 2091

[[7]]
 [1] 2092 2093 2094 2095 2096 2097 2098 2099 2100 2101 2102 2103 2104 2105 2106 2107 2108 2109 2110 2111 2112

[[8]]
 [1] 2113 2114 2115 2116 2117 2118 2119 2120 2121 2122 2123 2124 2125 2126 2127 2128 2129 2130 2131 2132 2133 2134 2135 2136 2137 2138 2139

[[9]]
 [1] 2140 2141 2142 2143 2144 2145 2146 2147 2148 2149 2150 2151 2152 2153 2154 2155 2156 2157 2158 2159 2160 2161 2162 2163 2164 2165 2166 2167 2168 2169
[31] 2170 2171 2172 2173 2174 2175 2176 2177 2178 2179 2180 2181

[[10]]
 [1] 2182 2183 2184 2185 2186 2187 2188 2189 2190 2191 2192 2193 2194 2195 2196 2197 2198 2199 2200 2201 2202 2203 2204 2205 2206 2207 2208 2209 2210

Comments (2)

榆西 2024-12-22 12:10:01

I think you are misusing the which.min function. Given a vector of numeric values, it returns the index of the first minimum encountered, but here you are giving it a logical vector, trApr$customer_id==cid, which is coerced to numeric 0/1. The minimum is therefore 0, and the index returned is that of the first FALSE value.

See the doc page for further details on which.min : http://stat.ethz.ch/R-manual/R-devel/library/base/html/which.min.html
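To make the coercion concrete, here is a minimal sketch on a made-up logical vector (`hits` is a hypothetical name, not something from the question). It only illustrates how which.min and which.max treat TRUE/FALSE; it is not a drop-in fix.

```r
# Hypothetical logical vector: TRUE marks rows that match some customer id.
hits <- c(FALSE, FALSE, TRUE, TRUE, FALSE)

# Logical values are treated as 0 (FALSE) and 1 (TRUE).
as.integer(hits)

# which.min() looks for the first minimum, i.e. the first FALSE.
first_false <- which.min(hits)

# which.max() looks for the first maximum, i.e. the first TRUE.
first_true <- which.max(hits)
```

One caveat with which.max on a logical vector: if nothing matches (all FALSE), the maximum is 0 and it still returns 1, so it cannot tell "first row matches" apart from "no match".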

一腔孤↑勇 2024-12-22 12:10:01

It is your use of which.min() which is at fault. You are supplying a logical vector (one containing TRUE and FALSE). This is patently not numeric data, so R coerces the logical to numerics with FALSE equal to 0 and TRUE equal to 1. So you are in effect doing:

R> which.min(c(TRUE,FALSE,FALSE,FALSE,FALSE))
[1] 2
R> which.min(c(FALSE,TRUE,FALSE,FALSE,FALSE,FALSE))
[1] 1

As such, which.min() returns the first of the tied minimum values, in this case the first FALSE encountered. Hence all the 1s being returned in your example. For the elements shown, the customer ID that matches was not in the first element of the object compared.
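The same effect can be reproduced on a small scale. The sketch below assumes a made-up data frame `toy` with a sorted customer_id column (hypothetical data, standing in for trApr) and runs both calls from the question over every id.

```r
# Hypothetical stand-in for trApr: three customers, sorted by customer_id.
toy <- data.frame(customer_id = c(2L, 2L, 2L, 5L, 5L, 7L, 7L, 7L, 7L))
ids <- unique(toy$customer_id)

# The question's GOOD version: first index where the comparison is TRUE.
good <- sapply(ids, function(cid) head(which(toy$customer_id == cid), n = 1),
               USE.NAMES = FALSE)

# The question's BAD version: index of the first FALSE in the comparison.
bad <- sapply(ids, function(cid) which.min(toy$customer_id == cid),
              USE.NAMES = FALSE)

good  # 1 4 6
bad   # 4 1 1
```

Only the very first customer gives something other than 1 in `bad`, because that is the only id for which row 1 is TRUE. Any slice of ids that skips the first customer therefore returns all 1s.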

You want something like:

min(which(trApr$customer_id == cid))

where which() first returns the row indices of trApr whose customer_id matches the current cid, and we then ask for the minimum of the indices returned from which(). This would be easier with with():

with(trApr, min(which(customer_id == cid)))

Both of which assume that cid is available/accessible; i.e. you've created it first.
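Since the goal is the first row for every customer, it may be possible to skip the per-id loop entirely. The sketch below is one way to do that with base R's match() and duplicated(), assuming a made-up data frame `toy` (hypothetical data, standing in for trApr); it has not been timed on a 2-million-row data frame.

```r
# Hypothetical stand-in for trApr: three customers, sorted by customer_id.
toy <- data.frame(customer_id = c(2L, 2L, 2L, 5L, 5L, 7L, 7L, 7L, 7L))
ids <- unique(toy$customer_id)

# match() returns the position of the first occurrence of each id.
first_by_match <- match(ids, toy$customer_id)

# duplicated() is FALSE exactly at the first occurrence of each value.
first_by_dup <- which(!duplicated(toy$customer_id))

first_by_match  # 1 4 6
first_by_dup    # 1 4 6
```

Both forms make a single pass over the column rather than one full comparison per customer id, which is likely to matter when there are many distinct ids.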
