
Understanding Neural Style in Depth

Published 2025-02-16 13:04:44

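Introduction
TensorFlow is Google's second-generation machine learning system, developed as the successor to DistBelief, and is widely used in deep learning areas such as speech recognition and image recognition. Its name comes from how it works: a Tensor is an N-dimensional array, and Flow refers to computation based on a dataflow graph, so TensorFlow describes tensors flowing from one end of the graph to the other as the computation proceeds. In other words, it is the process of passing complex data structures into a neural network for analysis and processing.
TensorFlow is fully open source and free for anyone to use. It runs on devices ranging from a single smartphone to thousands of servers in a data center.
The "Advanced Machine Learning Notes" series digs into TensorFlow in practice, starting from scratch and moving from simple to advanced topics, so that we can work our way up in machine learning together.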

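The previous post, Advanced Machine Learning Notes Part 1 | TensorFlow Installation and Getting Started, briefly covered how to install the GPU version of TensorFlow on Ubuntu and ran a basic LR algorithm on MNIST. TensorFlow can do far more than that, though, and plenty of it is fun. This post uses TensorFlow and a CNN to apply Neural Style to artistic images. First I will explain in detail how the paper A Neural Algorithm of Artistic Style works, then walk through an open-source [TensorFlow implementation of Neural Style][3] to see how the experts do it.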

A Neural Algorithm of Artistic Style

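In art, and in painting especially, artists create distinctive visual experiences by composing different content and styles and letting them influence each other. Given two images, current techniques are fully capable of letting a computer recognize what each image depicts. Style, however, is much more abstract. To a computer an image is just pixels, yet the human eye can readily tell the styles of different painters apart. Is style built from more complex features? When I first read deep learning papers, the takeaway was that a multi-layer network essentially discovers more complex, more intrinsic features, so in theory a multi-layer network should be able to extract whatever is interesting about an image's style.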

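The paper does exactly this with a convolutional neural network (a pre-trained VGG network model), reconstructing content and style separately. The synthesized image is obtained by minimizing the content loss and the style loss together (the implementation discussed below also adds a denoising loss), so that it matches both the content and the style reconstructions as closely as possible.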

Paper overview

This figure shows the paper's overall neural style workflow. Understanding it is key to following the logic of the whole paper. It has two main parts:

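Content Reconstruction: The lower part of the figure shows Content Reconstruction, corresponding to layers a, b, c, d and e of the CNN. Note that what is labeled Content Representations at the start is not the original image but the image data after it has passed through the pre-trained VGG network model. You can think of it as the picture as a machine such as a classifier sees it, so if you visualized it directly you might not recognize the content at all. The model was built mainly for object recognition; here it is used to generate the image's Content Representations. Once this is clear the rest is easier: content is reconstructed from each of the five convolutional layers. The authors found experimentally that reconstruction from the first three layers is good, while layers d and e lose some detail and retain the more high-level information.
Style Reconstruction: Reconstructing style is more involved, because style is hard to model. The Style Representation is generated by the VGG network model in much the same way as the Content Representation. The difference lies in how a, b, c, d and e are handled: the style reconstructions are computed on growing subsets of CNN layers, namely conv1_1 (a), [conv1_1, conv2_1] (b), [conv1_1, conv2_1, conv3_1] (c), [conv1_1, conv2_1, conv3_1, conv4_1] (d) and [conv1_1, conv2_1, conv3_1, conv4_1, conv5_1] (e). The reconstructed style then matches the image's own style at increasingly many scales while discarding the global arrangement of the scene.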
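The growing layer subsets used for the style reconstructions can be enumerated mechanically. The snippet below is a minimal sketch in plain Python. The name cumulative_layer_subsets is hypothetical and only illustrates the a to e grouping described above; it is not taken from the paper or from the project.

```python
# Illustrative only: enumerate the growing layer subsets (a) to (e)
# used for the style reconstructions described above.
STYLE_RECONSTRUCTION_LAYERS = ['conv1_1', 'conv2_1', 'conv3_1', 'conv4_1', 'conv5_1']


def cumulative_layer_subsets(layers):
    """Return [layers[:1], layers[:2], ..., layers[:n]]."""
    return [list(layers[:k]) for k in range(1, len(layers) + 1)]


subsets = cumulative_layer_subsets(STYLE_RECONSTRUCTION_LAYERS)
for label, subset in zip('abcde', subsets):
    print(label, subset)
```

Each later subset contains all the earlier layers, which is why the deeper reconstructions capture style at more scales at once.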

Methods

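With these two points understood, what remains is the modeling itself. The loss is computed separately for content and for style. The content loss is the simpler of the two: F^l is the representation of the generated image at layer l, P^l is the representation of the original image at layer l, and the loss is defined as the squared error between these two feature representations.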
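For reference, the content loss can be written out as follows. This is the formulation from the paper as I understand it, where p is the original image, x is the generated image, and the indices i and j run over the filters and the positions of layer l. Read the indexing as notation, not as code.

```latex
\mathcal{L}_{content}(\vec{p}, \vec{x}, l) = \frac{1}{2} \sum_{i,j} \left( F^{l}_{ij} - P^{l}_{ij} \right)^{2}
```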

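The style loss is built in much the same way as the content loss, except that it sums the errors over every layer used. Here A^l is the style representation of the original style image at layer l and G^l is the style representation of the generated image at layer l. In both cases the style representation is the Gram matrix of that layer's feature maps, which is what the source code below computes.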
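Written out, the style loss takes roughly the following form. This again follows the paper as I understand it: N_l is the number of filters in layer l, M_l is the size (height times width) of each feature map, and w_l are per-layer weights. Note that the source code discussed later normalizes the Gram matrices differently, so the constants will not match exactly.

```latex
\begin{aligned}
G^{l}_{ij} &= \sum_{k} F^{l}_{ik} F^{l}_{jk} \\
E_{l} &= \frac{1}{4 N_{l}^{2} M_{l}^{2}} \sum_{i,j} \left( G^{l}_{ij} - A^{l}_{ij} \right)^{2} \\
\mathcal{L}_{style}(\vec{a}, \vec{x}) &= \sum_{l} w_{l} E_{l}
\end{aligned}
```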

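With the losses defined, an optimization method is used to minimize the model's loss. Note that the paper has only the content loss and the style loss; the source code additionally includes a denoising (total variation) loss.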
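Schematically, the objective that the source code minimizes can be written as below. The total variation term is read off from the code shown later in this article, not from the paper, and the symbols w_tv, N_y and N_x are my own notation: N_y and N_x stand for the element counts of the vertically and horizontally shifted image tensors. The weights alpha and beta play the role of the content and style weights.

```latex
\begin{aligned}
\mathcal{L}_{total} &= \alpha \mathcal{L}_{content} + \beta \mathcal{L}_{style} + \mathcal{L}_{tv} \\
\mathcal{L}_{tv} &= w_{tv} \left( \frac{1}{N_{y}} \sum_{i,j,c} \left( x_{i+1,j,c} - x_{i,j,c} \right)^{2}
  + \frac{1}{N_{x}} \sum_{i,j,c} \left( x_{i,j+1,c} - x_{i,j,c} \right)^{2} \right)
\end{aligned}
```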

I won't go into the optimization method here; TensorFlow has built-in optimizers such as Adam to handle it.

Reading the TensorFlow implementation

Project on GitHub: anishathalye/neural-style (Neural style in TensorFlow!)

The code consists mainly of three files: neural_style.py, stylize.py and vgg.py. I will skip the basic interface code and go straight to the core:

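# Excerpt using the TensorFlow 1.x API. np, tf, vgg, network, content, shape
# and content_features are assumed to be set up before this point.
g = tf.Graph()
with g.as_default(), g.device('/cpu:0'), tf.Session() as sess:
  # Placeholder for a batch holding one image of the content image's shape
  image = tf.placeholder('float', shape=shape)
  # Build the VGG network on top of the placeholder
  net, mean_pixel = vgg.net(network, image)
  # Preprocess the content image and add a batch dimension
  content_pre = np.array([vgg.preprocess(content, mean_pixel)])
  # Evaluate the content layer's activations for the content image
  content_features[CONTENT_LAYER] = net[CONTENT_LAYER].eval(
    feed_dict={image: content_pre})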

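This loads the imagenet-vgg-verydeep-19.mat model, and vgg.net builds on it the network containing the five convolutional stages a to e mentioned earlier (conv1_1, conv2_1, conv3_1, conv4_1, conv5_1). Each key of net names the corresponding layer. content_pre is the preprocessed content image, and evaluating net[CONTENT_LAYER] on it gives the content features that are stored in content_features.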
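To make the preprocessing step concrete, here is a minimal NumPy sketch. It assumes that preprocessing amounts to subtracting the model's mean pixel from every pixel, which is my reading and not something the excerpt above shows. The names preprocess_sketch, unprocess_sketch and the MEAN_PIXEL values are hypothetical and stand in for whatever the project actually uses.

```python
import numpy as np

# Hypothetical mean pixel; in the project the value comes from the loaded model.
MEAN_PIXEL = np.array([120.0, 115.0, 100.0])


def preprocess_sketch(image, mean_pixel):
    """Center an H x W x 3 image by subtracting the per-channel mean (assumed behaviour)."""
    return image - mean_pixel


def unprocess_sketch(image, mean_pixel):
    """Invert preprocess_sketch by adding the mean back."""
    return image + mean_pixel


content = np.full((4, 6, 3), 128.0)  # toy 4 x 6 RGB image
# Wrapping the result in a list adds the leading batch axis, as in the excerpt above.
content_pre = np.array([preprocess_sketch(content, MEAN_PIXEL)])
print(content_pre.shape)  # (1, 4, 6, 3)
```

The extra leading axis matters because the placeholder in the excerpt expects a batch of images, even when the batch holds a single image.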

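# One graph per style image. styles, style_shapes, style_features, network
# and mean_pixel are assumed to be set up before this point.
for i in range(len(styles)):
  g = tf.Graph()
  with g.as_default(), g.device('/cpu:0'), tf.Session() as sess:
    image = tf.placeholder('float', shape=style_shapes[i])
    net, _ = vgg.net(network, image)
    style_pre = np.array([vgg.preprocess(styles[i], mean_pixel)])
    for layer in STYLE_LAYERS:
      features = net[layer].eval(feed_dict={image: style_pre})
      # Flatten to (positions, channels)
      features = np.reshape(features, (-1, features.shape[3]))
      # Gram matrix of channel correlations, normalized by the element count
      gram = np.dot(features.T, features) / features.size
      style_features[i][layer] = gram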
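The Gram matrix step can be tried in isolation with NumPy. The sketch below applies the same reshape and dot product as the excerpt above to random toy activations. The function name gram_matrix and the toy shapes are my own choices for illustration.

```python
import numpy as np


def gram_matrix(features):
    """Gram matrix of a (1, H, W, C) activation array, normalized by its element count."""
    flat = np.reshape(features, (-1, features.shape[3]))  # (H*W, C)
    return np.dot(flat.T, flat) / features.size


rng = np.random.default_rng(0)
features = rng.standard_normal((1, 8, 8, 4))  # toy activations with 4 channels
gram = gram_matrix(features)
print(gram.shape)  # (4, 4)
```

The result has one row and one column per channel and no spatial axes, which is why it captures which features occur together but not where they occur in the image.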

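The style features are computed the same way as the content features. The only difference comes from how the losses are defined (the style loss is a total over the loss of every style layer), so content uses a single layer while style uses several: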

CONTENT_LAYER = 'relu4_2'
STYLE_LAYERS = ('relu1_1', 'relu2_1', 'relu3_1', 'relu4_1', 'relu5_1')

Next comes the loss minimization:

# TensorFlow 1.x excerpt. In Python 3, reduce must be imported from functools.
# _tensor_size is a project helper, assumed to return a tensor's element count.
with tf.Graph().as_default():
  if initial is None:
    # Start from random noise when no initial image is given
    # (noise is computed here but not used in this excerpt)
    noise = np.random.normal(size=shape, scale=np.std(content) * 0.1)
    initial = tf.random_normal(shape) * 0.256
  else:
    initial = np.array([vgg.preprocess(initial, mean_pixel)])
    initial = initial.astype('float32')
  # The image itself is the variable being optimized
  image = tf.Variable(initial)
  net, _ = vgg.net(network, image)

  # content loss: squared error at CONTENT_LAYER, normalized by the feature size
  content_loss = content_weight * (2 * tf.nn.l2_loss(
      net[CONTENT_LAYER] - content_features[CONTENT_LAYER]) /
      content_features[CONTENT_LAYER].size)
  # style loss: squared error between Gram matrices, summed over the style layers
  style_loss = 0
  for i in range(len(styles)):
    style_losses = []
    for style_layer in STYLE_LAYERS:
      layer = net[style_layer]
      _, height, width, number = map(lambda d: d.value, layer.get_shape())
      size = height * width * number
      feats = tf.reshape(layer, (-1, number))
      gram = tf.matmul(tf.transpose(feats), feats) / size
      style_gram = style_features[i][style_layer]
      style_losses.append(2 * tf.nn.l2_loss(gram - style_gram) / style_gram.size)
    style_loss += style_weight * style_blend_weights[i] * reduce(tf.add, style_losses)
  # total variation denoising: penalize differences between neighbouring pixels
  tv_y_size = _tensor_size(image[:,1:,:,:])
  tv_x_size = _tensor_size(image[:,:,1:,:])
  tv_loss = tv_weight * 2 * (
      (tf.nn.l2_loss(image[:,1:,:,:] - image[:,:shape[1]-1,:,:]) /
        tv_y_size) +
      (tf.nn.l2_loss(image[:,:,1:,:] - image[:,:,:shape[2]-1,:]) /
        tv_x_size))
  # overall loss
  loss = content_loss + style_loss + tv_loss

  # optimizer setup
  train_step = tf.train.AdamOptimizer(learning_rate).minimize(loss)

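This corresponds term by term to the formulas above, apart from the extra total variation denoising term. Once the total loss is defined, AdamOptimizer is called to minimize it iteratively. Note that the loss is still assembled with explicit Python loops over styles and layers, not as a single batched expression, so it can be a bit of a headache to read. Once I know TensorFlow better I will come back and see whether this part of the computation can be made a little more efficient.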
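To see how the code maps onto the formulas, it helps to note that tf.nn.l2_loss(t) computes sum(t ** 2) / 2, so every `2 * tf.nn.l2_loss(...) / size` above is simply a mean of squared differences. The NumPy sketch below mirrors the content and total variation terms under that reading. The function names and toy inputs are my own and are not part of the project.

```python
import numpy as np


def l2_loss(t):
    """NumPy counterpart of tf.nn.l2_loss: sum(t ** 2) / 2."""
    return np.sum(t ** 2) / 2.0


def content_loss(generated, target, weight=1.0):
    """Mean squared difference between two feature arrays, scaled by weight."""
    return weight * 2.0 * l2_loss(generated - target) / target.size


def tv_loss(image, weight=1.0):
    """Total variation term for a (1, H, W, C) image."""
    dy = image[:, 1:, :, :] - image[:, :-1, :, :]  # vertical neighbours
    dx = image[:, :, 1:, :] - image[:, :, :-1, :]  # horizontal neighbours
    return weight * 2.0 * (l2_loss(dy) / dy.size + l2_loss(dx) / dx.size)


flat_image = np.ones((1, 4, 4, 3))
# Pixel values increase by 1 per step along the width axis only.
ramp = np.arange(4, dtype=float).reshape(1, 1, 4, 1) * np.ones((1, 4, 4, 3))
print(tv_loss(flat_image))  # 0.0
print(tv_loss(ramp))        # 1.0
```

A constant image has no variation at all, while the ramp changes by exactly 1 between horizontal neighbours and therefore scores a mean squared difference of 1 in that direction.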

If you want to understand this code in detail, clone the project and study it carefully; it makes a good exercise for learning TensorFlow.