Python Data Analysis Notes

Theano Example: Softmax Regression

Published 2022-09-03 20:46:15

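Downloading and loading the MNIST dataset

MNIST is a dataset of handwritten digits that is now widely used as a benchmark for evaluating machine learning algorithms.

The following script downloads and decompresses the data: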

In [1]:

%%file download_mnist.py
import gzip
import os
import shutil

try:
    from urllib.request import urlretrieve  # Python 3
except ImportError:
    from urllib import urlretrieve  # Python 2

# Create the target directory on the first run.
if not os.path.exists('mnist'):
    os.mkdir('mnist')

def download_and_gzip(name):
    # Download the archive only if it is not already on disk.
    if not os.path.exists(name + '.gz'):
        urlretrieve('http://yann.lecun.com/exdb/' + name + '.gz', name + '.gz')
    # Decompress the archive into a raw IDX file next to it.
    if not os.path.exists(name):
        with gzip.open(name + '.gz', 'rb') as f_in, open(name, 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)

download_and_gzip('mnist/train-images-idx3-ubyte')
download_and_gzip('mnist/train-labels-idx1-ubyte')
download_and_gzip('mnist/t10k-images-idx3-ubyte')
download_and_gzip('mnist/t10k-labels-idx1-ubyte')

Overwriting download_mnist.py

Run the script to download and decompress the data:

In [2]:

%run download_mnist.py

The following script loads the MNIST data. Its source is available at:

https://github.com/Newmu/Theano-Tutorials/blob/master/load.py

In [3]:

%%file load.py
import numpy as np
import os

datasets_dir = './'

def one_hot(x,n):
    # Row i of the result has a single 1 in column x[i].
    if isinstance(x, list):
        x = np.array(x)
    x = x.flatten()
    o_h = np.zeros((len(x),n))
    o_h[np.arange(len(x)),x] = 1
    return o_h

def mnist(ntrain=60000,ntest=10000,onehot=True):
    data_dir = os.path.join(datasets_dir,'mnist/')

    # Files are opened in binary mode. Image files start with a 16-byte header.
    with open(os.path.join(data_dir,'train-images-idx3-ubyte'), 'rb') as fd:
        loaded = np.fromfile(file=fd,dtype=np.uint8)
    trX = loaded[16:].reshape((60000,28*28)).astype(float)

    # Label files start with an 8-byte header.
    with open(os.path.join(data_dir,'train-labels-idx1-ubyte'), 'rb') as fd:
        loaded = np.fromfile(file=fd,dtype=np.uint8)
    trY = loaded[8:].reshape((60000))

    with open(os.path.join(data_dir,'t10k-images-idx3-ubyte'), 'rb') as fd:
        loaded = np.fromfile(file=fd,dtype=np.uint8)
    teX = loaded[16:].reshape((10000,28*28)).astype(float)

    with open(os.path.join(data_dir,'t10k-labels-idx1-ubyte'), 'rb') as fd:
        loaded = np.fromfile(file=fd,dtype=np.uint8)
    teY = loaded[8:].reshape((10000))

    # Scale pixel values from [0, 255] to [0, 1].
    trX = trX/255.
    teX = teX/255.

    # Optionally keep only the first ntrain / ntest samples.
    trX = trX[:ntrain]
    trY = trY[:ntrain]

    teX = teX[:ntest]
    teY = teY[:ntest]

    if onehot:
        trY = one_hot(trY, 10)
        teY = one_hot(teY, 10)
    else:
        trY = np.asarray(trY)
        teY = np.asarray(teY)

    return trX,teX,trY,teY

Overwriting load.py
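The offsets `16` and `8` in load.py skip the IDX file headers. As a rough illustration of where those numbers come from, the sketch below parses an IDX header from bytes built in memory instead of reading the real MNIST files. The helper name `parse_idx` and the fake data are made up for this example and are not part of load.py.

```python
import struct

import numpy as np

def parse_idx(raw):
    """Parse a uint8 IDX byte string into a NumPy array (illustrative helper)."""
    # Bytes 0-1 are zero, byte 2 is the data type code (0x08 = unsigned byte),
    # byte 3 is the number of dimensions.
    zero, dtype_code, ndim = struct.unpack('>HBB', raw[:4])
    if zero != 0 or dtype_code != 0x08:
        raise ValueError('not a uint8 IDX file')
    # Each dimension size is stored as a big-endian 32-bit integer.
    header_size = 4 + 4 * ndim
    shape = struct.unpack('>' + 'I' * ndim, raw[4:header_size])
    data = np.frombuffer(raw, dtype=np.uint8, offset=header_size)
    return data.reshape(shape)

# Fake "image file": 2 images of 3x3 pixels, so the header is 4 + 3*4 = 16 bytes.
pixels = np.arange(18, dtype=np.uint8)
fake_images = struct.pack('>HBBIII', 0, 0x08, 3, 2, 3, 3) + pixels.tobytes()
images = parse_idx(fake_images)

# Fake "label file": 2 labels, so the header is 4 + 1*4 = 8 bytes.
fake_labels = struct.pack('>HBBI', 0, 0x08, 1, 2) + bytes([7, 3])
labels = parse_idx(fake_labels)

print(images.shape, labels)
```

With three dimensions (count, rows, columns) the image header is 16 bytes long, and with one dimension (count) the label header is 8 bytes long, which matches the slices `loaded[16:]` and `loaded[8:]`.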

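Softmax regression

Softmax regression generalizes logistic regression: logistic regression handles two-class problems, while softmax regression handles N-class problems.

Logistic regression outputs the probability that the label is 1 (which also determines the probability that the label is 0). Correspondingly, for an N-class problem softmax regression outputs one probability per class.

For details, see the UFLDL tutorial: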

http://ufldl.stanford.edu/wiki/index.php/Softmax%E5%9B%9E%E5%BD%92
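The relationship between the two models can be written out explicitly. The notation below ($w_k$ for the weight vector of class $k$, and no bias term, matching the `T.dot(X, w)` model used in this example) is chosen here for illustration and is not copied from the UFLDL tutorial.

```latex
P(y = k \mid x) = \frac{\exp(w_k^\top x)}{\sum_{j=1}^{N} \exp(w_j^\top x)},
\qquad k = 1, \dots, N

% Two-class case (N = 2): divide numerator and denominator by \exp(w_1^\top x)
P(y = 1 \mid x)
  = \frac{\exp(w_1^\top x)}{\exp(w_1^\top x) + \exp(w_2^\top x)}
  = \frac{1}{1 + \exp\bigl(-(w_1 - w_2)^\top x\bigr)}
```

The last expression is the logistic sigmoid applied to $(w_1 - w_2)^\top x$, so with two classes softmax regression reduces to logistic regression with weight vector $w_1 - w_2$.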

In [4]:

import theano
from theano import tensor as T
import numpy as np
from load import mnist

Using gpu device 1: Tesla C2075 (CNMeM is disabled)

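Let's look at the concrete implementation.

Of the following two functions, one converts data to the floating-point type used for GPU computation (theano.config.floatX), and the other initializes the weights as a shared variable filled with small random values: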

In [5]:

def floatX(X):
    return np.asarray(X, dtype=theano.config.floatX)

def init_weights(shape):
    return theano.shared(floatX(np.random.randn(*shape) * 0.01))

Theano already provides an implementation of softmax:

In [6]:

A = T.matrix()

B = T.nnet.softmax(A)

test_softmax = theano.function([A], B)

a = floatX(np.random.rand(3, 4))

b = test_softmax(a)

print(b.shape)

# row sums
print(b.sum(1))

(3, 4)
[ 1.00000012  1.          1.        ]

The softmax function normalizes the matrix row by row, so every row of the output sums to 1.

Our model is therefore:
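For readers who want to see the row-wise normalization without Theano, here is a minimal NumPy sketch. The function name `softmax_rows` is made up for this example, and subtracting the row maximum is a common stabilization trick added here, not something the notebook shows. A row sum such as 1.00000012 in the output above is presumably float32 rounding.

```python
import numpy as np

def softmax_rows(a):
    # Subtracting the row maximum does not change the result,
    # but it keeps exp() from overflowing on large inputs.
    shifted = a - a.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

a = np.random.rand(3, 4).astype(np.float32)
b = softmax_rows(a)

print(b.shape)         # (3, 4)
print(b.sum(axis=1))   # each entry equals 1 up to floating-point rounding
```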

In [7]:

def model(X, w):
    return T.nnet.softmax(T.dot(X, w))

Load the data:

In [8]:

trX, teX, trY, teY = mnist(onehot=True)
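With `onehot=True`, the labels come back as one-hot rows, so `trY` has shape (60000, 10) while `trX` has shape (60000, 784). The short sketch below repeats the indexing trick from load.py on three made-up labels to show what that encoding looks like and how `argmax` recovers the integer labels.

```python
import numpy as np

def one_hot(x, n):
    # Same indexing trick as in load.py: row i gets a 1 in column x[i].
    x = np.asarray(x).flatten()
    o_h = np.zeros((len(x), n))
    o_h[np.arange(len(x)), x] = 1
    return o_h

labels = np.array([3, 0, 9])       # made-up labels for illustration
encoded = one_hot(labels, 10)

print(encoded.shape)               # (3, 10)
print(np.argmax(encoded, axis=1))  # recovers the integer labels
```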

Define the symbolic variables and initialize the weights:

In [9]:

X = T.fmatrix()
Y = T.fmatrix()

w = init_weights((784, 10))

Define the model output and the prediction:

In [10]:

py_x = model(X, w)
y_pred = T.argmax(py_x, axis=1)

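The loss is the multi-class cross-entropy, which Theano also provides. The same cell takes the gradient of the cost with respect to w and defines a gradient-descent update with a learning rate of 0.05: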

In [11]:

cost = T.mean(T.nnet.categorical_crossentropy(py_x, Y))
gradient = T.grad(cost=cost, wrt=w)
update = [[w, w - gradient * 0.05]]
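To make the gradient step less of a black box, the following NumPy sketch computes the same kind of cost and its gradient by hand, then checks one gradient entry against a finite difference. It assumes one-hot targets and the closed-form gradient X^T (P - Y) / n for mean cross-entropy over softmax outputs. The data, shapes, and helper names are made up for illustration, and this is not what `T.grad` does internally.

```python
import numpy as np

def softmax_rows(a):
    e = np.exp(a - a.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cost_and_grad(X, Y, w):
    # Mean cross-entropy between softmax(X.w) and one-hot targets Y.
    P = softmax_rows(X.dot(w))
    n = X.shape[0]
    cost = -np.mean(np.sum(Y * np.log(P), axis=1))
    grad = X.T.dot(P - Y) / n
    return cost, grad

rng = np.random.RandomState(0)
X = rng.rand(5, 4)                          # 5 made-up samples, 4 features
Y = np.eye(3)[rng.randint(0, 3, size=5)]    # one-hot labels for 3 classes
w = rng.randn(4, 3) * 0.01

cost, grad = cost_and_grad(X, Y, w)

# Central finite difference for a single weight entry.
eps = 1e-6
w_plus = w.copy()
w_plus[1, 2] += eps
w_minus = w.copy()
w_minus[1, 2] -= eps
numeric = (cost_and_grad(X, Y, w_plus)[0] - cost_and_grad(X, Y, w_minus)[0]) / (2 * eps)
print(abs(numeric - grad[1, 2]))            # should be very small

# One gradient-descent step with the same learning rate as above.
w_new = w - 0.05 * grad
print(cost, cost_and_grad(X, Y, w_new)[0])
```

The second print shows the cost before and after a single update, which is the quantity the `update` list asks Theano to reduce on every call to the training function.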

Compile the train and predict functions:

In [12]:

train = theano.function(inputs=[X, Y], outputs=cost, updates=update, allow_input_downcast=True)
predict = theano.function(inputs=[X], outputs=y_pred, allow_input_downcast=True)

Train for 100 epochs; the test-set accuracy reaches 0.925:

In [13]:

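for i in range(100):
    # Mini-batches of 128 samples; zip stops at the shorter range,
    # so the trailing partial batch is skipped.
    for start, end in zip(range(0, len(trX), 128), range(128, len(trX), 128)):
        batch_cost = train(trX[start:end], trY[start:end])
    # Print the epoch index and the test-set accuracy.
    print("{0:03d} {1}".format(i, np.mean(np.argmax(teY, axis=1) == predict(teX))))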
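The `zip` of two ranges in the loop above pairs batch starts with batch ends. A small standalone sketch shows which index pairs it produces. The helper names are hypothetical, and the second variant, which also covers the final partial batch, is an alternative offered here rather than part of the original notebook.

```python
def batch_bounds_zip(n, batch_size):
    # The pairing used in the training loop: zip stops when the shorter range ends.
    return list(zip(range(0, n, batch_size), range(batch_size, n, batch_size)))

def batch_bounds_full(n, batch_size):
    # Variant that also yields the final, possibly shorter, batch.
    return [(start, min(start + batch_size, n)) for start in range(0, n, batch_size)]

print(batch_bounds_zip(10, 4))    # [(0, 4), (4, 8)]
print(batch_bounds_full(10, 4))   # [(0, 4), (4, 8), (8, 10)]

# Number of training samples visited per epoch with 60000 samples and batch size 128.
covered = sum(end - start for start, end in batch_bounds_zip(60000, 128))
print(covered)                    # 59904
```

With 60000 samples and a batch size of 128, the zip-based pairing visits 59904 samples per epoch and leaves out the last 96. That has little effect on this example, but the second variant is the safer choice when every sample should be used.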
