In R formulas, why do I have to use the I() function on power terms, like y ~ I(x^3)?
I'm trying to get my head around the use of the tilde operator and associated functions. My first question is why I() needs to be used to specify arithmetic operators. For example, these two plots generate different results (the former a straight line, the latter the expected curve):
x <- c(1:100)
y <- seq(0.1,10,0.1)
plot(y~x^3)
plot(y~I(x^3))
Further, both of the following plots also generate the expected result:
plot(x^3, y)
plot(I(x^3), y)
My second question: perhaps the examples I've been using are too simple, but I don't understand where ~ should actually be used.
The tilde operator is actually a function that returns an unevaluated expression, a type of language object. The expression then gets interpreted by modeling functions in a manner that is different than the interpretation of operators operating on numeric objects.
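As a rough illustration of the point that a formula is stored unevaluated, the sketch below builds a formula from two names that are never defined. The names resp and pred are hypothetical placeholders chosen for this example, not anything from the question.

```r
# resp and pred are hypothetical names; neither variable exists,
# yet the formula can be created because nothing is evaluated.
f <- resp ~ pred^3

class(f)     # "formula"
typeof(f)    # "language"

# A formula is a call with three parts: the function `~` and its two arguments.
f[[1]]       # the `~` function name
f[[2]]       # left-hand side:  resp
f[[3]]       # right-hand side: pred^3, kept as an expression

# The infix form can also be written as an ordinary prefix call.
g <- `~`(resp, pred^3)
class(g)     # "formula"
deparse(f) == deparse(g)
```

Because the right-hand side is kept as an expression, it is up to whichever function receives the formula to decide what pred^3 means.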
The issue here is how formulas, and specifically the "+", ":", and "^" operators in them, are interpreted. (A side note: the correct statistical procedure would be to use the function poly when attempting to make higher-order terms in a regression formula.) Within R formulas the infix operators "+", "*", ":" and "^" have entirely different meanings than when used in calculations with numeric vectors. In a formula the tilde (~) separates the left-hand side from the right-hand side. The ^ and : operators are used to construct interactions, so x = x^2 = x^3 rather than becoming the perhaps expected mathematical powers. (A variable interacting with itself is just the same variable.) If you had typed (x+y)^2, the formula machinery would have produced (for its own good internal use) not the mathematical x^2 + 2xy + y^2 but rather the symbolic x + y + x:y, where x:y is an interaction term without its main effects. (The ^ gives you both main effects and interactions.)
The I() function acts to convert the argument to "as.is", i.e. what you expect. So I(x^2) would return a vector of values raised to the second power.
The ~ should be thought of as saying "is distributed as" or "is dependent on" when seen in regression functions. The ~ is an infix function in its own right: LHS ~ RHS is the infix form of the call `~`(LHS, RHS), which returns an object of class "formula". You can confirm this at the console.
In regression functions, the error term in a model description will take whatever form that regression function presumes or is specifically called for by the family argument. The mean for the base level will generally be labelled (Intercept). The function context and arguments may also further determine a link function such as log() or logit() from the family value, and it is also possible to have a non-canonical family/link combination.
The "+" symbol in a formula is not really adding two variables but is usually an implicit request to calculate regression coefficient(s) for that variable in the context of the rest of the variables on the RHS of the formula. The regression functions use model.matrix, and that function will recognize the presence of factors or character vectors in the formula and build a matrix that expands the levels of the discrete components of the formula.
In plot()-ting functions it basically reverses the usual (x, y) order of arguments that the plot function takes. There was a plot.formula method written so that formulas could be used as a more "mathematical" mode of communicating with R. In graphics::plot.formula, curve, and the 'lattice' and 'ggplot' functions, it governs how multiple factors or numeric vectors are displayed and "facetted".
The overloading of the "+" operator is discussed in the comments below and is also done in the plotting packages ggplot2 and gridExtra, where it separates functions that deliver object results. There it acts as a pass-through and layering operator. Some aggregation functions have a formula method which uses "+" as an "arrangement" and grouping operator.
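Applied to the question's data, the sketch below shows why the first plot came out as a straight line. It assumes the formula method of plot() builds its data with model.frame(), which is my understanding of graphics::plot.formula rather than something stated above. The names mf_plain and mf_asis are introduced only for this example.

```r
x <- 1:100
y <- seq(0.1, 10, 0.1)

# Formula interface: response on the left, so y goes on the vertical axis.
# Default interface: x first, then y.
plot(y ~ I(x^3))   # curve
plot(x^3, y)       # same picture, arguments in (x, y) order

# What the formula versions actually hand to the plotting code:
mf_plain <- model.frame(y ~ x^3)
names(mf_plain)    # "y" "x"       (the power was dropped, hence the straight line)

mf_asis <- model.frame(y ~ I(x^3))
names(mf_asis)     # "y" "I(x^3)"  (the cubed values are kept)
head(as.numeric(mf_asis[[2]]))  # 1 8 27 64 125 216
```

In a non-interactive session the two plot() calls write to the default graphics device instead of opening a window.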