Visual Comparison of Regression & PCA

Asked 2024-12-20 19:25:09

I'm trying to perfect a method for comparing regression and PCA, inspired by the blog Cerebral Mastication, which has also been discussed from a different angle on SO. Before I forget, many thanks to JD Long and Josh Ulrich for much of the core of this. I'm going to use this in a course next semester. Sorry this is long!

UPDATE: I found a different approach which almost works (please fix it if you can!). I posted it at the bottom. It's a much smarter and shorter approach than I was able to come up with!

I basically followed the previous schemes up to a point: generate random data, find the line of best fit, draw the residuals. This is shown in the Second Code Chunk below. But I also dug around and wrote some functions to draw lines normal to a line through a given point (the data points in this case). I think these work fine, and they are shown in the First Code Chunk along with proof that they work.

Now, the Second Code Chunk shows the whole thing in action, using the same flow as @JDLong, and I'm adding an image of the resulting plot. The data is in black; red is the regression line with its residuals in pink; blue is the first PC; and the light blue segments should be the normals, but obviously they are not. The functions in the First Code Chunk that draw these normals seem fine, but something is not right in the demonstration: I think I must be misunderstanding something or passing the wrong values. My normals come in horizontal, which seems like a useful clue (but so far, not to me). Can anyone see what's wrong here?

Thanks, this has been vexing me for a while...
[Plot showing the problem]

First Code Chunk (Functions to Draw Normals and Proof They Work):

##### The functions below are based very loosely on the citation at the end

pointOnLineNearPoint <- function(Px, Py, slope, intercept) {
    # Px, Py is the point to test, can be a vector.
    # slope, intercept is the line to check distance.

    Ax <- Px-10*diff(range(Px))
    Bx <- Px+10*diff(range(Px))
    Ay <- Ax * slope + intercept
    By <- Bx * slope + intercept
    pointOnLine(Px, Py, Ax, Ay, Bx, By)
    }

pointOnLine <- function(Px, Py, Ax, Ay, Bx, By) {

    # This approach based upon comingstorm's answer on
    # stackoverflow.com/questions/3120357/get-closest-point-to-a-line
    # Vectorized by Bryan

    PB <- data.frame(x = Px - Bx, y = Py - By)
    AB <- data.frame(x = Ax - Bx, y = Ay - By)
    PB <- as.matrix(PB)
    AB <- as.matrix(AB)
    k_raw <- k <- c()
    for (n in 1:nrow(PB)) {
        k_raw[n] <- (PB[n,] %*% AB[n,])/(AB[n,] %*% AB[n,])
        if (k_raw[n] < 0)  { k[n] <- 0
            } else { if (k_raw[n] > 1) k[n] <- 1
                else k[n] <- k_raw[n] }
        }
    x = (k * Ax + (1 - k)* Bx)
    y = (k * Ay + (1 - k)* By)
    ans <- data.frame(x, y)
    ans
    }

# The following proves that pointOnLineNearPoint
# and pointOnLine work properly and accept vectors

par(mar = c(4, 4, 4, 4)) # otherwise the plot is slightly distorted
# and right angles don't appear as right angles

m <- runif(1, -5, 5)
b <- runif(1, -20, 20)
plot(-20:20, -20:20, type = "n", xlab = "x values", ylab = "y values")
abline(b, m )

Px <- rnorm(10, 0, 4)
Py <- rnorm(10, 0, 4)

res <- pointOnLineNearPoint(Px, Py, m, b)
points(Px, Py, col = "red")
segments(Px, Py, res[,1], res[,2], col = "blue")

##========================================================
##
##  Credits:
##  Theory by Paul Bourke http://local.wasp.uwa.edu.au/~pbourke/geometry/pointline/
##  Based in part on C code by Damian Coventry Tuesday, 16 July 2002
##  Based on VBA code by Brandon Crosby 9-6-05 (2 dimensions)
##  With grateful thanks for answering our needs!
##  This is an R (http://www.r-project.org) implementation by Gregoire Thomas 7/11/08
##
##========================================================
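
As an aside, the clamping loop in pointOnLine can be written without an explicit loop. Here is a minimal vectorized sketch with the same inputs and outputs (pointOnLineVec is a name I'm introducing here, not part of the original code):

pointOnLineVec <- function(Px, Py, Ax, Ay, Bx, By) {
    # Project each (Px, Py) onto the segment (Ax, Ay)-(Bx, By),
    # clamping the projection parameter k to [0, 1].
    PB <- cbind(Px - Bx, Py - By)
    AB <- cbind(Ax - Bx, Ay - By)
    k <- rowSums(PB * AB) / rowSums(AB * AB) # row-wise dot products
    k <- pmin(pmax(k, 0), 1)                 # clamp to the segment
    data.frame(x = k * Ax + (1 - k) * Bx,
               y = k * Ay + (1 - k) * By)
    }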

Second Code Chunk (Plots the Demonstration):

set.seed(55)
np <- 10 # number of data points
x <- 1:np
e <- rnorm(np, 0, 60)
y <- 12 + 5 * x + e

par(mar = c(4, 4, 4, 4)) # otherwise the plot is slightly distorted

plot(x, y, main = "Regression minimizes the y-residuals & PCA the normals")
yx.lm <- lm(y ~ x)
lines(x, predict(yx.lm), col = "red", lwd = 2)
segments(x, y, x, fitted(yx.lm), col = "pink")

# pca "by hand"
xyNorm <- cbind(x = x - mean(x), y = y - mean(y)) # mean centers
xyCov <- cov(xyNorm)
eigenValues <- eigen(xyCov)$values
eigenVectors <- eigen(xyCov)$vectors

# Add the first PC by denormalizing back to original coords:
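# (NOTE: xyNorm[x] below relies on x being 1:np, so linear indexing happens
#  to pick out the first, centered-x column; xyNorm[, 1] would be explicit.)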

new.y <- (eigenVectors[2,1]/eigenVectors[1,1] * xyNorm[x]) + mean(y)
lines(x, new.y, col = "blue", lwd = 2)

# Now add the normals

yx2.lm <- lm(new.y ~ x) # zero residuals: already a line
res <- pointOnLineNearPoint(x, y, yx2.lm$coef[2], yx2.lm$coef[1])
points(res[,1], res[,2], col = "blue", pch = 20) # segments should end here
segments(x, y, res[,1], res[,2], col = "lightblue1") # the normals
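
Incidentally, since the segment endpoints in pointOnLineNearPoint are pushed far outside the data range (so the clamp should stay inactive here), the foot of the perpendicular from a point (x0, y0) to the line y = a + b*x can also be computed in closed form by minimizing the squared distance. A minimal sketch (footOfPerpendicular is a name I'm introducing, not from the original code):

footOfPerpendicular <- function(x0, y0, b, a) {
    # Minimize (X - x0)^2 + (a + b*X - y0)^2 over X; vectorized over points.
    X <- (x0 + b * (y0 - a)) / (1 + b^2)
    data.frame(x = X, y = a + b * X)
    }

# e.g. footOfPerpendicular(x, y, yx2.lm$coef[2], yx2.lm$coef[1])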

############ UPDATE

Over at Vincent Zoonekynd's page I found almost exactly what I wanted. But it doesn't quite work (it apparently used to). Here is a code excerpt from that site; as written, it plots the normals to what looks like the first PC reflected through a vertical axis:

set.seed(1)
x <- rnorm(20)
y <- x + rnorm(20)
plot(y~x, asp = 1)
r <- lm(y~x)
abline(r, col='red')

r <- princomp(cbind(x,y))
b <- r$loadings[2,1] / r$loadings[1,1]
a <- r$center[2] - b * r$center[1]
abline(a, b, col = "blue")
title(main='Appears to use the reflection of PC1')

u <- r$loadings
# Projection onto the first axis
p <- matrix( c(1,0,0,0), nrow=2 )
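# (p keeps the first row of the score matrix -- the PC1 scores --
#  and zeroes the second row)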
X <- rbind(x,y)
X <- r$center + solve(u, p %*% u %*% (X - r$center))
segments( x, y, X[1,], X[2,] , col = "lightblue1")

And here is the result:

[Resulting plot]


Comments (3)

入怼 2024-12-27 19:25:09


Alright, I'll have to answer my own question! After further reading and comparing the methods people have posted online, I have solved the problem. I'm not sure I can clearly state what I "fixed" because I went through quite a few iterations; comparing this version with the original Second Code Chunk, the visible changes are that the plot now uses asp = 1 (so right angles actually display as right angles), the PCA is done with prcomp(), and pointOnLineNearPoint() is called directly with the first PC's slope and intercept. Anyway, here is the plot and the code (MWE). The helper functions are at the end for clarity.

[Working demo plot]

# Comparison of Linear Regression & PCA
# Generate sample data

set.seed(39) # gives a decent-looking example
np <- 10 # number of data points
x <- -np:np
e <- rnorm(length(x), 0, 10)
y <- rnorm(1, 0, 2) * x + 3*rnorm(1, 0, 2) + e

# Plot the main data & residuals

plot(x, y, main = "Regression minimizes the y-residuals & PCA the normals", asp = 1)
yx.lm <- lm(y ~ x)
lines(x, predict(yx.lm), col = "red", lwd = 2)
segments(x, y, x, fitted(yx.lm), col = "pink")

# Now the PCA using built-in functions
# rotation = loadings = eigenvectors

r <- prcomp(cbind(x,y), retx = TRUE)
b <- r$rotation[2,1] / r$rotation[1,1] # gets slope of loading/eigenvector 1
a <- r$center[2] - b * r$center[1]
abline(a, b, col = "blue") # Plot 1st PC

# Plot normals to 1st PC

X <- pointOnLineNearPoint(x, y, b, a)
segments( x, y, X[,1], X[,2], col = "lightblue1")

###### Needed Functions

pointOnLineNearPoint <- function(Px, Py, slope, intercept) {
    # Px, Py is the point to test, can be a vector.
    # slope, intercept is the line to check distance.

    Ax <- Px-10*diff(range(Px))
    Bx <- Px+10*diff(range(Px))
    Ay <- Ax * slope + intercept
    By <- Bx * slope + intercept
    pointOnLine(Px, Py, Ax, Ay, Bx, By)
    }

pointOnLine <- function(Px, Py, Ax, Ay, Bx, By) {

    # This approach based upon comingstorm's answer on
    # stackoverflow.com/questions/3120357/get-closest-point-to-a-line
    # Vectorized by Bryan

    PB <- data.frame(x = Px - Bx, y = Py - By)
    AB <- data.frame(x = Ax - Bx, y = Ay - By)
    PB <- as.matrix(PB)
    AB <- as.matrix(AB)
    k_raw <- k <- c()
    for (n in 1:nrow(PB)) {
        k_raw[n] <- (PB[n,] %*% AB[n,])/(AB[n,] %*% AB[n,])
        if (k_raw[n] < 0)  { k[n] <- 0
            } else { if (k_raw[n] > 1) k[n] <- 1
                else k[n] <- k_raw[n] }
        }
    x = (k * Ax + (1 - k)* Bx)
    y = (k * Ay + (1 - k)* By)
    ans <- data.frame(x, y)
    ans
    }
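
As a quick sanity check (my addition, not part of the original answer), you can verify numerically that the light blue segments really are perpendicular to the PC line: the dot product of each data-to-foot vector with the line's direction vector should be zero up to floating-point error.

dir.vec <- c(1, b)                    # direction vector of the line y = a + b*x
segs <- cbind(X[, 1] - x, X[, 2] - y) # vectors from each data point to its foot
max(abs(segs %*% dir.vec))            # should be ~0
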
心房敞 2024-12-27 19:25:09


Try changing this line of your code:

res <- pointOnLineNearPoint(x, y, yx2.lm$coef[2], yx2.lm$coef[1])

to

res <- pointOnLineNearPoint(x, new.y, yx2.lm$coef[2], yx2.lm$coef[1])

so that you pass the correct y values.

隔纱相望 2024-12-27 19:25:09


In Vincent Zoonekynd's code, change the line u <- r$loadings to u <- solve(r$loadings). In the second instance of solve(), the predicted component scores along the first principal axis (i.e., the matrix of predicted scores with the second predicted component scores set to zero) need to be multiplied by the inverse of the loadings/eigenvectors. Multiplying the data by the loadings gives predicted scores; dividing predicted scores by the loadings gives back the data. Hope that helps.
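
Applied to the excerpt in the question, the corrected projection block would read as follows (a sketch of the fix described above; everything before the u line is unchanged):

u <- solve(r$loadings) # inverse of the loadings (= transpose, since they are orthogonal)
p <- matrix(c(1, 0, 0, 0), nrow = 2) # keep the PC1 score, zero the PC2 score
X <- rbind(x, y)
X <- r$center + solve(u, p %*% u %*% (X - r$center))
segments(x, y, X[1, ], X[2, ], col = "lightblue1")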
