Multivariate Linear Regression - Python - Implementation Problem
I am trying to implement linear regression with multiple variables (actually, just 2). I am using the data from the Stanford ML class. I got it working correctly for the single-variable case. The same code should work for multiple variables, but it does not.
Link to the data:
http://s3.amazonaws.com/mlclass-resources/exercises/mlclass-ex1.zip
Feature Normalization:
''' This is for the regression with multiple variables problem. You have to normalize features before doing anything. Lets get started'''
from __future__ import division
import os,sys
from math import *

def mean(f,col):
    #This is to find the mean of a feature
    sigma = 0
    count = 0
    data = open(f,'r')
    for line in data:
        points = line.split(",")
        sigma = sigma + float(points[col].strip("\n"))
        count += 1
    data.close()
    return sigma/count

def size(f):
    count = 0
    data = open(f,'r')
    for line in data:
        count += 1
    data.close()
    return count

def standard_dev(f,col):
    #Calculate the standard deviation. Formula: sqrt( sum( (x - mean)**2 ) / N )
    data = open(f,'r')
    sigma = 0
    mean = 0
    if col == 0:
        mean = mean_area
    else:
        mean = mean_bedroom
    for line in data:
        points = line.split(",")
        sigma = sigma + (float(points[col].strip("\n")) - mean) ** 2
    data.close()
    return sqrt(sigma/SIZE)

def substitute(f,fnew):
    ''' Take the old file.
    1. Subtract the mean values from each feature
    2. Scale it by dividing with the SD
    '''
    data = open(f,'r')
    data_new = open(fnew,'w')
    for line in data:
        points = line.split(",")
        new_area = (float(points[0]) - mean_area) / sd_area
        new_bedroom = (float(points[1].strip("\n")) - mean_bedroom) / sd_bedroom
        data_new.write("1," + str(new_area) + "," + str(new_bedroom) + "," + str(points[2].strip("\n")) + "\n")
    data.close()
    data_new.close()

global mean_area
global mean_bedroom
mean_bedroom = mean(sys.argv[1],1)
mean_area = mean(sys.argv[1],0)
print 'Mean number of bedrooms',mean_bedroom
print 'Mean area',mean_area
global SIZE
SIZE = size(sys.argv[1])
global sd_area
global sd_bedroom
sd_area = standard_dev(sys.argv[1],0)
sd_bedroom = standard_dev(sys.argv[1],1)
substitute(sys.argv[1],sys.argv[2])
I implemented the mean and standard deviation in the code myself, instead of using NumPy/SciPy. After storing the normalized values in a file, a snapshot of it looks like this:
X1 X2 X3 COST OF HOUSE
1,0.131415422021,-0.226093367578,399900
1,-0.509640697591,-0.226093367578,329900
1,0.507908698618,-0.226093367578,369000
1,-0.743677058719,-1.5543919021,232000
1,1.27107074578,1.10220516694,539900
1,-0.0199450506651,1.10220516694,299900
1,-0.593588522778,-0.226093367578,314900
1,-0.729685754521,-0.226093367578,198999
1,-0.789466781548,-0.226093367578,212000
1,-0.644465992588,-0.226093367578,242500
I run regression on it to find the parameters. The code for that is below:
''' The plan is to rewrite and this time, calculate cost each time to ensure its reducing. Also make it enough to handle multiple variables '''
from __future__ import division
import os,sys

def computecost(X,Y,theta):
    #X is the feature vector, Y is the predicted variable
    h_theta = calculatehTheta(X,theta)
    delta = (h_theta - Y) * (h_theta - Y)
    return (1/194) * delta

def allCost(f,no_features):
    theta = [0,0]
    sigma = 0
    data = open(f,'r')
    for line in data:
        X = []
        Y = 0
        points = line.split(",")
        for i in range(no_features):
            X.append(float(points[i]))
        Y = float(points[no_features].strip("\n"))
        sigma = sigma + computecost(X,Y,theta)
    return sigma

def calculatehTheta(points,theta):
    #This takes a file which has (1,feature1,feature2, and so on)
    #print 'Points are',points
    sigma = 0
    for i in range(len(theta)):
        sigma = sigma + theta[i] * float(points[i])
    return sigma

def gradient_Descent(f,no_iters,no_features,theta):
    ''' Calculate ( h(x) - y ) * xj(i). And then subtract it from thetaj. Continue for 1500 iterations and you will have your answer'''
    X = []
    Y = 0
    sigma = 0
    alpha = 0.01
    for i in range(no_iters):
        for j in range(len(theta)):
            data = open(f,'r')
            for line in data:
                points = line.split(",")
                for i in range(no_features):
                    X.append(float(points[i]))
                Y = float(points[no_features].strip("\n"))
                h_theta = calculatehTheta(points,theta)
                delta = h_theta - Y
                sigma = sigma + delta * float(points[j])
            data.close()
            theta[j] = theta[j] - (alpha/97) * sigma
            sigma = 0
    print theta

print allCost(sys.argv[1],2)
print gradient_Descent(sys.argv[1],1500,2,[0,0,0])
It prints the following as the parameters:
[-3.8697149722857996e-14, 0.02030369056348706, 0.979706406501678]
All three are horribly wrong :( The exact same approach works for the single-variable case.
Thanks!
Comments (1)
The global variables and quadruply nested loops worry me. That and reading and writing the data to files multiple times.
Is your data so big it doesn't easily fit in memory?
Why not use the csv module for the file processing?
Why not use NumPy for the numeric part?
Don't reinvent the wheel
Assuming your data entries are rows, you can normalize your data and do a least squares fit in two lines:
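Something along these lines, for example (a rough sketch, not a definitive implementation; the file name ex1data2.txt and the area/bedrooms/price column order are assumptions based on the exercise data):

import numpy as np

# Load the raw exercise data; ex1data2.txt and the column order
# (area, bedrooms, price) are assumptions here, adjust to your file.
data = np.loadtxt("ex1data2.txt", delimiter=",")
X, y = data[:, :2], data[:, 2]

# Normalize: subtract the column means, divide by the column standard deviations.
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)

# Prepend the column of ones and solve the least-squares problem for theta.
A = np.hstack([np.ones((len(X_norm), 1)), X_norm])
theta, residuals, rank, sv = np.linalg.lstsq(A, y, rcond=None)
print(theta)  # [intercept, theta_area, theta_bedrooms]

np.linalg.lstsq solves the fit directly, so it also doubles as a sanity check for whatever gradient descent eventually returns.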
Reply to comment from Original Poster:
OK, then the only other advice I can give you is to try to break it up into smaller pieces, so that it's easier to see what's going on and easier to sanity-check the small parts.
It's probably not the problem, but you are using i as the index for two of the loops in that quadruple loop. That's the sort of problem you can avoid by cutting it into smaller scopes. I think it's been years since I wrote an explicitly nested loop or declared a global variable.
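For illustration, a rough sketch of what splitting the descent into smaller scopes might look like (the structure is an assumption: it supposes the normalized rows have already been parsed into (features, y) pairs, and the helper name gradient_for is made up for the example):

def gradient_for(j, rows, theta):
    # Sum of (h_theta(x) - y) * x_j over all training rows.
    total = 0.0
    for x_row, y in rows:
        h = sum(t * x for t, x in zip(theta, x_row))
        total += (h - y) * x_row[j]
    return total

def gradient_descent(rows, theta, alpha=0.01, iters=1500):
    m = len(rows)
    for _ in range(iters):
        # Compute every partial sum from the current theta, then update all
        # parameters together (simultaneous update), with no shared indices.
        grads = [gradient_for(j, rows, theta) for j in range(len(theta))]
        theta = [t - (alpha / m) * g for t, g in zip(theta, grads)]
    return theta

Each small function can then be checked by hand on two or three rows before running the full 1500 iterations.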