前言
译者序
第1章引言
- 1.1 本书面向的读者
- 1.2 深度学习的历史趋势
第2章线性代数
- 2.1 标量、向量、矩阵和张量
- 2.2 矩阵和向量相乘
- 2.3 单位矩阵和逆矩阵
- 2.4 线性相关和生成子空间
- 2.5 范数
- 2.6 特殊类型的矩阵和向量
- 2.7 特征分解
- 2.8 奇异值分解
- 2.9 Moore-Penrose 伪逆
- 2.10 迹运算
- 2.11 行列式
- 2.12 实例：主成分分析
第3章概率与信息论
- 3.1 为什么要使用概率
- 3.2 随机变量
- 3.3 概率分布
- 3.4 边缘概率
- 3.5 条件概率
- 3.6 条件概率的链式法则
- 3.7 独立性和条件独立性
- 3.8 期望、方差和协方差
- 3.9 常用概率分布
- 3.10 常用函数的有用性质
- 3.11 贝叶斯规则
- 3.12 连续型变量的技术细节
- 3.13 信息论
- 3.14 结构化概率模型
第4章数值计算
- 4.1 上溢和下溢
- 4.2 病态条件
- 4.3 基于梯度的优化方法
- 4.4 约束优化
- 4.5 实例：线性最小二乘
第5章机器学习基础
- 5.1 学习算法
- 5.2 容量、过拟合和欠拟合
- 5.3 超参数和验证集
- 5.4 估计、偏差和方差
- 5.5 最大似然估计
- 5.6 贝叶斯统计
- 5.7 监督学习算法
- 5.8 无监督学习算法
- 5.9 随机梯度下降
- 5.10 构建机器学习算法
- 5.11 促使深度学习发展的挑战
第6章深度前馈网络
- 6.1 实例：学习 XOR
- 6.2 基于梯度的学习
- 6.3 隐藏单元
- 6.4 架构设计
- 6.5 反向传播和其他的微分算法
- 6.6 历史小记
第7章深度学习中的正则化
- 7.1 参数范数惩罚
- 7.2 作为约束的范数惩罚
- 7.3 正则化和欠约束问题
- 7.4 数据集增强
- 7.5 噪声鲁棒性
- 7.6 半监督学习
- 7.7 多任务学习
- 7.8 提前终止
- 7.9 参数绑定和参数共享
- 7.10 稀疏表示
- 7.11 Bagging 和其他集成方法
- 7.12 Dropout
- 7.13 对抗训练
- 7.14 切面距离、正切传播和流形正切分类器
第8章深度模型中的优化
- 8.1 学习和纯优化有什么不同
- 8.2 神经网络优化中的挑战
- 8.3 基本算法
- 8.4 参数初始化策略
- 8.5 自适应学习率算法
- 8.6 二阶近似方法
- 8.7 优化策略和元算法
第9章卷积网络
- 9.1 卷积运算
- 9.2 动机
- 9.3 池化
- 9.4 卷积与池化作为一种无限强的先验
- 9.5 基本卷积函数的变体
- 9.6 结构化输出
- 9.7 数据类型
- 9.8 高效的卷积算法
- 9.9 随机或无监督的特征
- 9.10 卷积网络的神经科学基础
- 9.11 卷积网络与深度学习的历史
第10章序列建模：循环和递归网络
- 10.1 展开计算图
- 10.2 循环神经网络
- 10.3 双向 RNN
- 10.4 基于编码-解码的序列到序列架构
- 10.5 深度循环网络
- 10.6 递归神经网络
- 10.7 长期依赖的挑战
- 10.8 回声状态网络
- 10.9 渗漏单元和其他多时间尺度的策略
- 10.10 长短期记忆和其他门控 RNN
- 10.11 优化长期依赖
- 10.12 外显记忆
第11章实践方法论
- 11.1 性能度量
- 11.2 默认的基准模型
- 11.3 决定是否收集更多数据
- 11.4 选择超参数
- 11.5 调试策略
- 11.6 示例：多位数字识别
第12章应用
- 12.1 大规模深度学习
- 12.2 计算机视觉
- 12.3 语音识别
- 12.4 自然语言处理
- 12.5 其他应用
第13章线性因子模型
- 13.1 概率 PCA 和因子分析
- 13.2 独立成分分析
- 13.3 慢特征分析
- 13.4 稀疏编码
- 13.5 PCA 的流形解释
第14章自编码器
- 14.1 欠完备自编码器
- 14.2 正则自编码器
- 14.3 表示能力、层的大小和深度
- 14.4 随机编码器和解码器
- 14.5 去噪自编码器详解
- 14.6 使用自编码器学习流形
- 14.7 收缩自编码器
- 14.8 预测稀疏分解
- 14.9 自编码器的应用
第15章表示学习
- 15.1 贪心逐层无监督预训练
- 15.2 迁移学习和领域自适应
- 15.3 半监督解释因果关系
- 15.4 分布式表示
- 15.5 得益于深度的指数增益
- 15.6 提供发现潜在原因的线索
第16章深度学习中的结构化概率模型
- 16.1 非结构化建模的挑战
- 16.2 使用图描述模型结构
- 16.3 从图模型中采样
- 16.4 结构化建模的优势
- 16.5 学习依赖关系
- 16.6 推断和近似推断
- 16.7 结构化概率模型的深度学习方法
第17章蒙特卡罗方法
- 17.1 采样和蒙特卡罗方法
- 17.2 重要采样
- 17.3 马尔可夫链蒙特卡罗方法
- 17.4 Gibbs 采样
- 17.5 不同的峰值之间的混合挑战
第18章直面配分函数
- 18.1 对数似然梯度
- 18.2 随机最大似然和对比散度
- 18.3 伪似然
- 18.4 得分匹配和比率匹配
- 18.5 去噪得分匹配
- 18.6 噪声对比估计
- 18.7 估计配分函数
第19章近似推断
- 19.1 把推断视作优化问题
- 19.2 期望最大化
- 19.3 最大后验推断和稀疏编码
- 19.4 变分推断和变分学习
- 19.5 学成近似推断
第20章深度生成模型
- 20.1 玻尔兹曼机
- 20.2 受限玻尔兹曼机
- 20.3 深度信念网络
- 20.4 深度玻尔兹曼机
- 20.5 实值数据上的玻尔兹曼机
- 20.6 卷积玻尔兹曼机
- 20.7 用于结构化或序列输出的玻尔兹曼机
- 20.8 其他玻尔兹曼机
- 20.9 通过随机操作的反向传播
- 20.10 有向生成网络
- 20.11 从自编码器采样
- 20.12 生成随机网络
- 20.13 其他生成方案
- 20.14 评估生成模型
- 20.15 结论

文章来源于网络收集而来，版权归原创者所有，如有侵权请及时联系！

20.15 结论

发布于 2024-01-20 12:27:17 字数 162417 浏览 0 评论 0 收藏 0

为了让模型理解基于给定训练数据表示的大千世界，训练具有隐藏单元的生成模型是一种有力方法。通过学习模型和表示，生成模型可以解答x输入变量之间关系的许多推断问题，并且可以在不同层对h求期望来提供表示x的许多不同方式。生成模型可以为AI系统提供它们所要理解的、各种不同概念的框架，让它们有能力在面对不确定性的情况下推理这些概念。我们希望读者能够找到增强这些方法的新途径，并继续探究智能和学习背后原理的旅程。

本书由“行行”整理，如果你不知道读什么书或者想获得更多免费电子书请加小编微信或QQ：2338856113 小编也和结交一些喜欢读书的朋友或者关注小编个人微信公众号名称：幸福的味道 id：d716-716 为了方便书友朋友找书和看书，小编自己做了一个电子书下载网站，网站的名称为：周读网址：http://www.ireadweek.com

————————————————————

(1) 术语“mcRBM”根据字母M-C-R-B-M发音；“mc”不是“McDonald's”中的“Mc”的发音。

(2) 这个版本的Gaussian-Bernoulli RBM能量函数假定图像数据的每个像素具有零均值。考虑非零像素均值时，可以简单地将像素偏移添加到模型中。

(3) 该论文将模型描述为“深度信念网络”，但因为它可以被描述为纯无向模型（具有易处理逐层均匀场不动点更新），所以它最适合深度玻尔兹曼机的定义。

参考文献

Abadi，M.，Agarwal，A.，Barham，P.，Brevdo，E.，Chen，Z.，Citro，C.，Corrado，G. S.，Davis，A.，Dean，J.，Devin，M.，Ghemawat，S.，Goodfellow，I.，Harp，A.，Irving，G.，Isard，M.，Jia，Y.，Jozefowicz，R.，Kaiser，L.，Kudlur，M.，Levenberg，J.，Mané，D.，Monga，R.，Moore，S.，Murray，D.，Olah，C.，Schuster，M.，Shlens，J.，Steiner，B.，Sutskever，I.，Talwar，K.，Tucker，P.，Vanhoucke，V.，Vasudevan，V.，Viégas，F.，Vinyals，O.，Warden，P.，Wattenberg，M.，Wicke，M.，Yu，Y.，and Zheng，X.（2015）. TensorFlow:Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.

Ackley，D. H.，Hinton，G. E.，and Sejnowski，T. J.（1985）. A learning algorithm for Boltzmann machines. Cognitive Science，9，147–169.

Alain，G. and Bengio，Y.（2013）. What regularized auto-encoders learn from the data generating distribution. In ICLR'2013，arXiv:1211.4246.

Alain，G.，Bengio，Y.，Yao，L.，Éric Thibodeau-Laufer，Yosinski，J.，and Vincent，P.（2015）. GSNs: Generative stochastic networks. arXiv:1503.05571.

Anderson，E.（1935）. The Irises of the Gaspé Peninsula. Bulletin of the American Iris Society，59，2–5.

Ba，J.，Mnih，V.，and Kavukcuoglu，K.（2014）. Multiple object recognition with visual attention. arXiv:1412.7755.

Bachman，P. and Precup，D.（2015）. Variational generative stochastic networks with collaborative shaping. In Proceedings of the 32nd International Conference on Machine Learning，ICML 2015，Lille，France，6-11 July 2015，pages 1964–1972.

Bacon，P.-L.，Bengio，E.，Pineau，J.，and Precup，D.（2015）. Conditional computation in neural networks using a decision-theoretic approach. In 2nd Multidisciplinary Conference on Rein-forcement Learning and Decision Making（RLDM 2015）.

Bagnell，J. A. and Bradley，D. M.（2009）. Differentiable sparse coding. In NIPS'2009，pages 113–120.

Bahdanau，D.，Cho，K.，and Bengio，Y.（2015）. Neural machine translation by jointly learning to align and translate. In ICLR'2015，arXiv:1409.0473.

Bahl，L. R.，Brown，P.，de Souza，P. V.，and Mercer，R. L.（1987）. Speech recognition with continuous-parameter hidden Markov models. Computer，Speech and Language，2，219–234.

Baldi，P. and Hornik，K.（1989）. Neural networks and principal component analysis:Learning from examples without local minima. Neural Networks，2，53–58.

Baldi，P.，Brunak，S.，Frasconi，P.，Soda，G.，and Pollastri，G.（1999）. Exploiting the past and the future in protein secondary structure prediction. Bioinformatics，15（11），937–946.

Baldi，P.，Sadowski，P.，and Whiteson，D.（2014）. Searching for exotic particles in high-energy physics with deep learning. Nature communications，5.

Ballard，D. H.，Hinton，G. E.，and Sejnowski，T. J.（1983）. Parallel vision computation. Nature.

Barlow，H. B.（1989）. Unsupervised learning. Neural Computation，1，295–311.

Barron，A. E.（1993）. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. on Information Theory，39，930–945.

Bartholomew，D. J.（1987）. Latent variable models and factor analysis. Oxford University Press.

Basilevsky，A.（1994）. Statistical Factor Analysis and Related Methods:Theory and Applications. Wiley.

Bastien，F.，Lamblin，P.，Pascanu，R.，Bergstra，J.，Goodfellow，I.，Bergeron，A.，Bouchard，N.，Warde-Farley，D.，and Bengio，Y.（2012a）. Theano:new features and speed improvements. Submited to the Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop，http://www.iro.umontreal.ca/lisa/publications2/index.php/publications/show/551.

Bastien，F.，Lamblin，P.，Pascanu，R.，Bergstra，J.，Goodfellow，I. J.，Bergeron，A.，Bouchard，N.，and Bengio，Y.（2012b）. Theano:new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop.

Basu，S. and Christensen，J.（2013）. Teaching classification boundaries to humans. In AAAI'2013.

Baxter，J.（1995）. Learning internal representations. In Proceedings of the 8th International Conference on Computational Learning Theory（COLT'95），pages 311–320，Santa Cruz，California. ACM Press.

Bayer，J. and Osendorfer，C.（2014）. Learning stochastic recurrent networks. ArXiv e-prints.

Becker，S. and Hinton，G.（1992）. A self-organizing neural network that discovers surfaces in random-dot stereograms. Nature，355，161–163.

Behnke，S.（2001）. Learning iterative image reconstruction in the neural abstraction pyramid. Int. J. Computational Intelligence and Applications，1（4），427–438.

Beiu，V.，Quintana，J. M.，and Avedillo，M. J.（2003）. VLSI implementations of threshold logic-a comprehensive survey. Neural Networks，IEEE Transactions on，14（5），1217–1243.

Belkin，M. and Niyogi，P.（2002）. Laplacian eigenmaps and spectral techniques for embedding and clustering. In T. Dietterich，S. Becker，and Z. Ghahramani，editors，Advances in Neural Information Processing Systems 14（NIPS'01），Cambridge，MA. MIT Press.

Belkin，M. and Niyogi，P.（2003a）. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation，15（6），1373–1396.

Belkin，M. and Niyogi，P.（2003b）. Using manifold structure for partially labeled classification. In S. Becker，S. Thrun，and K. Obermayer，editors，Advances in Neural Information Processing Systems 15（NIPS'02），Cambridge，MA. MIT Press.

Bengio，E.，Bacon，P.-L.，Pineau，J.，and Precup，D.（2015a）. Conditional computation in neural networks for faster models. arXiv:1511.06297.

Bengio，S. and Bengio，Y.（2000a）. Taking on the curse of dimensionality in joint distributions using neural networks. IEEE Transactions on Neural Networks，special issue on Data Mining and Knowledge Discovery，11（3），550–557.

Bengio，S.，Vinyals，O.，Jaitly，N.，and Shazeer，N.（2015b）. Scheduled sampling for sequence prediction with recurrent neural networks. Technical report，arXiv:1506.03099.

Bengio，Y.（1991）. Artificial Neural Networks and their Application to Sequence Recognition. Ph.D. thesis，McGill University，（Computer Science），Montreal，Canada.

Bengio，Y.（2000）. Gradient-based optimization of hyperparameters. Neural Computation，12（8），1889–1900.

Bengio，Y.（2002）. New distributed probabilistic language models. Technical Report 1215，Dept. IRO，Université de Montréal.

Bengio，Y.（2009）. Learning deep architectures for AI. Now Publishers.

Bengio，Y.（2013）. Deep learning of representations: looking forward. In Statistical Language and Speech Processing，volume 7978 of Lecture Notes in Computer Science，pages 1–37. Springer，also in arXiv at http://arxiv.org/abs/1305.0445.

Bengio，Y.（2015）. Early inference in energy-based models approximates back-propagation. Technical Report arXiv:1510.02777，Universite de Montreal.

Bengio，Y. and Bengio，S.（2000b）. Modeling high-dimensional discrete data with multi-layer neural networks. In NIPS 12，pages 400–406. MIT Press.

Bengio，Y. and Delalleau，O.（2009）. Justifying and generalizing contrastive divergence. Neural Computation，21（6），1601–1621.

Bengio，Y. and Grandvalet，Y.（2004）. No unbiased estimator of the variance of k-fold cross-validation. In JML（1），pages 1089–1105.

Bengio，Y. and LeCun，Y.（2007a）. Scaling learning algorithms towards AI. In Large Scale Kernel Machines.

Bengio，Y. and LeCun，Y.（2007b）. Scaling learning algorithms towards AI. In L. Bottou，O. Chapelle，D. DeCoste，and J. Weston，editors，Large Scale Kernel Machines. MIT Press.

Bengio，Y. and Monperrus，M.（2005）. Non-local manifold tangent learning. In L. Saul，Y. Weiss，and L. Bottou，editors，Advances in Neural Information Processing Systems 17（NIPS'04），pages 129–136. MIT Press.

Bengio，Y. and Sénécal，J.-S.（2003）. Quick training of probabilistic neural nets by importance sampling. In Proceedings of AISTATS 2003.

Bengio，Y. and Sénécal，J.-S.（2008）. Adaptive importance sampling to accelerate training of a neural probabilistic language model. IEEE Trans. Neural Networks，19（4），713–722.

Bengio，Y.，De Mori，R.，Flammia，G.，and Kompe，R.（1991）. Phonetically motivated acoustic parameters for continuous speech recognition using artificial neural networks. In Proceedings of EuroSpeech'91.

Bengio，Y.，De Mori，R.，Flammia，G.，and Kompe，R.（1992）. Neural network-Gaussian mix-ture hybrid for speech recognition or density estimation. In NIPS 4，pages 175–182. Morgan Kaufmann.

Bengio，Y.，Frasconi，P.，and Simard，P.（1993）. The problem of learning long-term dependencies in recurrent networks. In IEEE International Conference on Neural Networks，pages 1183–1195，San Francisco. IEEE Press.（invited paper）.

Bengio，Y.，Simard，P.，and Frasconi，P.（1994a）. Learning long-term dependencies with gradient descent is difficult. IEEE Tr. Neural Nets.

Bengio，Y.，Simard，P.，and Frasconi，P.（1994b）. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks，5（2），157–166.

Bengio，Y.，Simard，P.，and Frasconi，P.（1994c）. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks，5（2），157–166.

Bengio，Y.，Latendresse，S.，and Dugas，C.（1999）. Gradient-based learning of hyper-parameters. In Learning Conference.

Bengio，Y.，Ducharme，R.，and Vincent，P.（2001a）. A neural probabilistic language model. In T. Leen，T. Dietterich，and V. Tresp，editors，Advances in Neural Information Processing Systems 13（NIPS'00），pages 933–938. MIT Press.

Bengio，Y.，Ducharme，R.，and Vincent，P.（2001b）. A neural probabilistic language model. In T. K. Leen，T. G. Dietterich，and V. Tresp，editors，NIPS'2000，pages 932–938. MIT Press.

Bengio，Y.，Ducharme，R.，Vincent，P.，and Jauvin，C.（2003）. A neural probabilistic language model. JMLR，3，1137–1155.

Bengio，Y.，Delalleau，O.，and Le Roux，N.（2006a）. The curse of highly variable functions for local kernel machines. In NIPS'2005.

Bengio，Y.，Larochelle，H.，and Vincent，P.（2006b）. Non-local manifold Parzen windows. In NIPS'2005. MIT Press.

Bengio，Y.，Lamblin，P.，Popovici，D.，and Larochelle，H.（2007a）. Greedy layer-wise training of deep networks. In NIPS'2006.

Bengio，Y.，Lamblin，P.，Popovici，D.，and Larochelle，H.（2007b）. Greedy layer-wise training of deep networks. In B. Schölkopf，J. Platt，and T. Hoffman，editors，Advances in Neural Information Processing Systems 19（NIPS'06），pages 153–160. MIT Press.

Bengio，Y.，Lamblin，P.，Popovici，D.，and Larochelle，H.（2007c）. Greedy layer-wise training of deep networks. In Adv. Neural Inf. Proc. Sys. 19，pages 153–160.

Bengio，Y.，Lamblin，P.，Popovici，D.，and Larochelle，H.（2007d）. Greedy layer-wise training of deep networks. In NIPS 19，pages 153–160. MIT Press.

Bengio，Y.，Louradour，J.，Collobert，R.，and Weston，J.（2009）. Curriculum learning. In ICML'09. ACM.

Bengio，Y.，Mesnil，G.，Dauphin，Y.，and Rifai，S.（2013a）. Better mixing via deep representa-tions. In ICML'2013.

Bengio，Y.，Léonard，N.，and Courville，A.（2013b）. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv:1308.3432.

Bengio，Y.，Yao，L.，Alain，G.，and Vincent，P.（2013c）. Generalized denoising auto-encoders as generative models. In NIPS'2013.

Bengio，Y.，Courville，A.，and Vincent，P.（2013d）. Representation learning: A review and new perspectives. Pattern Analysis and Machine Intelligence，IEEE Transactions on，35（8），1798–1828.

Bengio，Y.，Thibodeau-Laufer，E.，Alain，G.，and Yosinski，J.（2014）. Deep generative stochastic networks trainable by backprop. In ICML'2014.

Bennett，C.（1976）. Efficient estimation of free energy differences from Monte Carlo data. Journal of Computational Physics，22（2），245–268.

Bennett，J. and Lanning，S.（2007）. The Netflix prize.

Berger，A. L.，Della Pietra，V. J.，and Della Pietra，S. A.（1996）. A maximum entropy approach to natural language processing. Computational Linguistics，22，39–71.

Berglund，M. and Raiko，T.（2013）. Stochastic gradient estimate variance in contrastive diver-gence and persistent contrastive divergence. CoRR，abs/1312.6002.

Bergstra，J.（2011）. Incorporating Complex Cells into Neural Networks for Pattern Classification. Ph.D. thesis，Université de Montréal.

Bergstra，J. and Bengio，Y.（2009）. Slow，decorrelated features for pretraining complex cell-like networks. In NIPS 22，pages 99–107. MIT Press.

Bergstra，J. and Bengio，Y.（2011）. Random search for hyper-parameter optimization. The Learning Workshop，Fort Lauderdale，Florida.

Bergstra，J. and Bengio，Y.（2012）. Random search for hyper-parameter optimization. J. Machine Learning Res.，13，281–305.

Bergstra，J.，Breuleux，O.，Bastien，F.，Lamblin，P.，Pascanu，R.，Desjardins，G.，Turian，J.，Warde-Farley，D.，and Bengio，Y.（2010a）. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference（SciPy）. Oral Presentation.

Bergstra，J.，Breuleux，O.，Bastien，F.，Lamblin，P.，Pascanu，R.，Desjardins，G.，Turian，J.，Warde-Farley，D.，and Bengio，Y.（2010b）. Theano: a CPU and GPU math expression com-piler. In Proc. SciPy.

Bergstra，J.，Breuleux，O.，Bastien，F.，Lamblin，P.，Pascanu，R.，Desjardins，G.，Turian，J.，Warde-Farley，D.，and Bengio，Y.（2010c）. Theano:a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference（SciPy）.

Bergstra，J.，Bardenet，R.，Bengio，Y.，and Kégl，B.（2011）. Algorithms for hyper-parameter optimization. In NIPS'2011.

Berkes，P. and Wiskott，L.（2005）. Slow feature analysis yields a rich repertoire of complex cell properties. Journal of Vision，5（6），579–602.

Bertsekas，D. P. and Tsitsiklis，J.（1996）. Neuro-Dynamic Programming. Athena Scientific.

Besag，J.（1975）. Statistical analysis of non-lattice data. The Statistician，24（3），179–195.

Bishop，C. M.（1994）. Mixture density networks.

Bishop，C. M.（1995a）. Regularization and complexity control in feed-forward networks. In Proceedings International Conference on Artificial Neural Networks ICANN'95，volume 1，page 141–148.

Bishop，C. M.（1995b）. Training with noise is equivalent to Tikhonov regularization. Neural Computation，7（1），108–116.

Bishop，C. M.（2006）. Pattern Recognition and Machine Learning. Springer.

Blum，A. L. and Rivest，R. L.（1992）. Training a 3-node neural network is NP-complete.

Blumer，A.，Ehrenfeucht，A.，Haussler，D.，and Warmuth，M. K.（1989）. Learnability and the Vapnik–Chervonenkis dimension. Journal of the ACM，36（4），865–929.

Bonnet，G.（1964）. Transformations des signaux aléatoires à travers les systèmes non linéaires sans mémoire. Annales des Télécommunications，19（9–10），203–220.

Bordes，A.，Weston，J.，Collobert，R.，and Bengio，Y.（2011）. Learning structured embeddings of knowledge bases. In AAAI 2011.

Bordes，A.，Glorot，X.，Weston，J.，and Bengio，Y.（2012）. Joint learning of words and meaning representations for open-text semantic parsing. AISTATS'2012.

Bordes，A.，Glorot，X.，Weston，J.，and Bengio，Y.（2013a）. A semantic matching energy func-tion for learning with multi-relational data. Machine Learning: Special Issue on Learning Semantics.

Bordes，A.，Usunier，N.，Garcia-Duran，A.，Weston，J.，and Yakhnenko，O.（2013b）. Translating embeddings for modeling multi-relational data. In C. Burges，L. Bottou，M. Welling，Z. Ghahramani，and K. Weinberger，editors，Advances in Neural Information Processing Systems 26，pages 2787–2795. Curran Associates，Inc.

Bornschein，J. and Bengio，Y.（2015）. Reweighted wake-sleep. In ICLR'2015，arXiv:1406.2751.

Bornschein，J.，Shabanian，S.，Fischer，A.，and Bengio，Y.（2015）. Training bidirectional Helmholtz machines. Technical report，arXiv:1506.03877.

Boser，B. E.，Guyon，I. M.，and Vapnik，V. N.（1992）. A training algorithm for optimal margin classifiers. In COLT '92: Proceedings of thefifth annual workshop on Computational learning theory，pages 144–152，New York，NY，USA. ACM.

Bottou，L.（1998）. Online algorithms and stochastic approximations. In D. Saad，editor，Online Learning in Neural Networks. Cambridge University Press，Cambridge，UK.

Bottou，L.（2011）. From machine learning to machine reasoning. Technical report，arXiv.1102.1808.

Bottou，L.（2015）. Multilayer neural networks. Deep Learning Summer School.

Bottou，L. and Bousquet，O.（2008a）. The tradeoffs of large scale learning. In J. Platt，D. Koller，Y. Singer，and S. Roweis，editors，Advances in Neural Information Processing Systems 20（NIPS'07），volume 20. MIT Press，Cambridge，MA.

Bottou，L. and Bousquet，O.（2008b）. The tradeoffs of large scale learning. In NIPS'2008.

Boulanger-Lewandowski，N.，Bengio，Y.，and Vincent，P.（2012）. Modeling temporal dependen-cies in high-dimensional sequences: Application to polyphonic music generation and transcrip-tion. In ICML'12.

Boureau，Y.，Ponce，J.，and LeCun，Y.（2010）. A theoretical analysis of feature pooling in vision algorithms. In Proc. International Conference on Machine learning（ICML'10）.

Boureau，Y.，Le Roux，N.，Bach，F.，Ponce，J.，and LeCun，Y.（2011）. Ask the locals: multi-way local pooling for image recognition. In Proc. International Conference on Computer Vision（ICCV'11）. IEEE.

Bourlard，H. and Kamp，Y.（1988）. Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics，59，291–294.

Bourlard，H. and Wellekens，C.（1989）. Speech pattern discrimination and multi-layered percep-trons. Computer Speech and Language，3，1–19.

Boyd，S. and Vandenberghe，L.（2004）. Convex Optimization. Cambridge University Press，New York，NY，USA.

Brady，M. L.，Raghavan，R.，and Slawny，J.（1989）. Back-propagation fails to separate where perceptrons succeed. IEEE Transactions on Circuits and Systems，36（5），665–674.

Brakel，P.，Stroobandt，D.，and Schrauwen，B.（2013）. Training energy-based models for time-series imputation. Journal of Machine Learning Research，14，2771–2797.

Brand，M.（2003a）. Charting a manifold. In S. Becker，S. Thrun，and K. Obermayer，editors，Advances in Neural Information Processing Systems 15（NIPS'02），pages 961–968. MIT Press.

Brand，M.（2003b）. Charting a manifold. In NIPS'2002，pages 961–968. MIT Press.

Breiman，L.（1994）. Bagging predictors. Machine Learning，24（2），123–140.

Breiman，L.，Friedman，J. H.，Olshen，R. A.，and Stone，C. J.（1984）. Classification and Regression Trees. Wadsworth International Group，Belmont，CA.

Bridle，J. S.（1990）. Alphanets: a recurrent ‘neural’ network architecture with a hidden Markov model interpretation. Speech Communication，9（1），83–92.

Briggman，K.，Denk，W.，Seung，S.，Helmstaedter，M. N.，and Turaga，S. C.（2009）. Maximin affinity learning of image segmentation. In NIPS'2009，pages 1865–1873.

Brown，P. F.，Cocke，J.，Pietra，S. A. D.，Pietra，V. J. D.，Jelinek，F.，Lafferty，J. D.，Mercer，R. L.，and Roossin，P. S.（1990）. A statistical approach to machine translation. Computational linguistics，16（2），79–85.

Brown，P. F.，Pietra，V. J. D.，DeSouza，P. V.，Lai，J. C.，and Mercer，R. L.（1992）. Class-based n-gram models of natural language. Computational Linguistics，18，467–479.

Bryson，A. and Ho，Y.（1969）. Applied optimal control: optimization，estimation，and control. Blaisdell Pub. Co.

Bryson，Jr.，A. E. and Denham，W. F.（1961）. A steepest-ascent method for solving optimum programming problems. Technical Report BR-1303，Raytheon Company，Missle and Space Division.

Buciluǎ，C.，Caruana，R.，and Niculescu-Mizil，A.（2006）. Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining，pages 535–541. ACM.

Burda，Y.，Grosse，R.，and Salakhutdinov，R.（2015）. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519.

Cai，M.，Shi，Y.，and Liu，J.（2013）. Deep maxout neural networks for speech recognition. In Automatic Speech Recognition and Understanding（ASRU），2013 IEEE Workshop on，pages 291–296. IEEE.

Carreira-Perpiñan，M. A. and Hinton，G. E.（2005）. On contrastive divergence learning. In AISTATS'2005，pages 33–40.

Caruana，R.（1993）. Multitask connectionist learning. In Proceedings of the 1993 Connectionist Models Summer School，pages 372–379.

Cauchy，A.（1847）. Méthode générale pour la résolution de systèmes d'équations simultanées. In Compte rendu des séances de l'académie des sciences，pages 536–538.

Cayton，L.（2005）. Algorithms for manifold learning. Technical Report CS2008-0923，UCSD.

Chandola，V.，Banerjee，A.，and Kumar，V.（2009）. Anomaly detection: A survey. ACM computing surveys（CSUR），41（3），15.

Chapelle，O.，Weston，J.，and Schölkopf，B.（2003）. Cluster kernels for semi-supervised learning. In S. Becker，S. Thrun，and K. Obermayer，editors，Advances in Neural Information Processing Systems 15（NIPS'02），pages 585–592，Cambridge，MA. MIT Press.

Chapelle，O.，Schölkopf，B.，and Zien，A.，editors（2006）. Semi-Supervised Learning. MIT Press，Cambridge，MA.

Chellapilla，K.，Puri，S.，and Simard，P.（2006）. High Performance Convolutional Neural Net-works for Document Processing. In Guy Lorette，editor，Tenth International Workshop on Frontiers in Handwriting Recognition，La Baule（France）. Université de Rennes 1，Suvisoft. http://www.suvisoft.com.

Chen，B.，Ting，J.-A.，Marlin，B. M.，and de Freitas，N.（2010）. Deep learning of invariant spatio-temporal features from video. NIPS*2010 Deep Learning and Unsupervised Feature Learning Workshop.

Chen，S. F. and Goodman，J. T.（1999）. An empirical study of smoothing techniques for language modeling. Computer，Speech and Language，13（4），359–393.

Chen，T.，Du，Z.，Sun，N.，Wang，J.，Wu，C.，Chen，Y.，and Temam，O.（2014a）. DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. In Proceedings of the 19th international conference on Architectural support for programming languages and operating systems，pages 269–284. ACM.

Chen，T.，Li，M.，Li，Y.，Lin，M.，Wang，N.，Wang，M.，Xiao，T.，Xu，B.，Zhang，C.，and Zhang，Z.（2015）. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274.

Chen，Y.，Luo，T.，Liu，S.，Zhang，S.，He，L.，Wang，J.，Li，L.，Chen，T.，Xu，Z.，Sun，N.，et al.（2014b）. DaDianNao: A machine-learning supercomputer. In Microarchitecture（MICRO），2014 47th Annual IEEE/ACM International Symposium on，pages 609–622. IEEE.

Chilimbi，T.，Suzue，Y.，Apacible，J.，and Kalyanaraman，K.（2014）. Project Adam: Building an efficient and scalable deep learning training system. In 11th USENIX Symposium on Operating Systems Design and Implementation（OSDI'14）.

Cho，K.，Raiko，T.，and Ilin，A.（2010a）. Parallel tempering is efficient for learning restricted Boltzmann machines. In Proceedings of the International Joint Conference on Neural Networks（IJCNN 2010），Barcelona，Spain.

Cho，K.，Raiko，T.，and Ilin，A.（2010b）. Parallel tempering is efficient for learning restricted Boltzmann machines. In IJCNN'2010.

Cho，K.，Raiko，T.，and Ilin，A.（2011）. Enhanced gradient and adaptive learning rate for training restricted Boltzmann machines. In ICML'2011，pages 105–112.

Cho，K.，Van Merriënboer，B.，Gülçehre，Ç.，Bahdanau，D.，Bougares，F.，Schwenk，H.，and Bengio，Y.（2014a）. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing（EMNLP），pages 1724–1734. Association for Computational Linguistics.

Cho，K.，van Merriënboer，B.，Gulcehre，C.，Bougares，F.，Schwenk，H.，and Bengio，Y.（2014b）. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the Empiricial Methods in Natural Language Processing（EMNLP 2014）.

Cho，K.，Van Merriënboer，B.，Bahdanau，D.，and Bengio，Y.（2014c）. On the properties of neural machine translation: Encoder-decoder approaches. ArXiv e-prints，abs/1409.1259.

Choromanska，A.，Henaff，M.，Mathieu，M.，Arous，G. B.，and LeCun，Y.（2014）. The loss surface of multilayer networks.

Chorowski，J.，Bahdanau，D.，Cho，K.，and Bengio，Y.（2014）. End-to-end continuous speech recognition using attention-based recurrent NN: First results. arXiv:1412.1602.

Christianson，B.（1992）. Automatic Hessians by reverse accumulation. IMA Journal of Numerical Analysis，12（2），135–150.

Chrupala，G.，Kadar，A.，and Alishahi，A.（2015）. Learning language through pictures. arXiv 1506.03694.

Chung，J.，Gulcehre，C.，Cho，K.，and Bengio，Y.（2014）. Empirical evaluation of gated recurrent neural networks on sequence modeling. NIPS'2014 Deep Learning workshop，arXiv 1412.3555.

Chung，J.，Gülçehre，Ç.，Cho，K.，and Bengio，Y.（2015a）. Gated feedback recurrent neural networks. In ICML'15.

Chung，J.，Kastner，K.，Dinh，L.，Goel，K.，Courville，A.，and Bengio，Y.（2015b）. A recurrent latent variable model for sequential data. In NIPS'2015.

Ciresan，D.，Meier，U.，Masci，J.，and Schmidhuber，J.（2012）. Multi-column deep neural network for traffic sign classification. Neural Networks，32，333–338.

Ciresan，D. C.，Meier，U.，Gambardella，L. M.，and Schmidhuber，J.（2010）. Deep big simple neural nets for handwritten digit recognition. Neural Computation，22，1–14.

Coates，A. and Ng，A. Y.（2011）. The importance of encoding versus training with sparse coding and vector quantization. In ICML'2011.

Coates，A.，Lee，H.，and Ng，A. Y.（2011）. An analysis of single-layer networks in unsuper-vised feature learning. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics（AISTATS 2011）.

Coates，A.，Huval，B.，Wang，T.，Wu，D.，Catanzaro，B.，and Andrew，N.（2013）. Deep learning with COTS HPC systems. In S. Dasgupta and D. McAllester，editors，Proceedings of the 30th International Conference on Machine Learning（ICML-13），volume 28（3），pages 1337–1345. JMLR Workshop and Conference Proceedings.

Cohen，N.，Sharir，O.，and Shashua，A.（2015）. On the expressive power of deep learning: A tensor analysis. arXiv:1509.05009.

Collobert，R.（2004）. Large Scale Machine Learning. Ph.D. thesis，Université de Paris VI，LIP6.

Collobert，R.（2011）. Deep learning for efficient discriminative parsing. In AISTATS'2011.

Collobert，R. and Weston，J.（2008a）. A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML'2008.

Collobert，R. and Weston，J.（2008b）. A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML'2008.

Collobert，R.，Bengio，S.，and Bengio，Y.（2001）. A parallel mixture of SVMs for very large scale problems. Technical Report 12，IDIAP.

Collobert，R.，Bengio，S.，and Bengio，Y.（2002）. Parallel mixture of SVMs for very large scale problem. Neural Computation.

Collobert，R.，Weston，J.，Bottou，L.，Karlen，M.，Kavukcuoglu，K.，and Kuksa，P.（2011a）. Natural language processing（almost） from scratch. The Journal of Machine Learning Research，12，2493–2537.

Collobert，R.，Kavukcuoglu，K.，and Farabet，C.（2011b）. Torch7: A Matlab-like environment for machine learning. In BigLearn，NIPS Workshop.

Comon，P.（1994）. Independent component analysis-a new concept？Signal Processing，36，287–314.

Cortes，C. and Vapnik，V.（1995）. Support vector networks. Machine Learning，20，273–297.

Couprie，C.，Farabet，C.，Najman，L.，and LeCun，Y.（2013）. Indoor semantic segmentation using depth information. In International Conference on Learning Representations（ICLR2013）.

Courbariaux，M.，Bengio，Y.，and David，J.-P.（2015）. Low precision arithmetic for deep learning. In Arxiv:1412.7024，ICLR'2015 Workshop.

Courville，A.，Bergstra，J.，and Bengio，Y.（2011a）. Unsupervised models of images by spike-and-slab RBMs. In ICML'2011.

Courville，A.，Bergstra，J.，and Bengio，Y.（2011b）. Unsupervised models of images by spike-and-slab RBMs. In ICM（1b）.

Courville，A.，Desjardins，G.，Bergstra，J.，and Bengio，Y.（2014）. The spike-and-slab RBM and extensions to discrete and sparse data distributions. Pattern Analysis and Machine Intelligence，IEEE Transactions on，36（9），1874–1887.

Cover，T. M. and Thomas，J. A.（2006）. Elements of Information Theory，2nd Edition. Wiley-Interscience.

Cox，D. and Pinto，N.（2011）. Beyond simple features: A large-scale feature search approach to unconstrained face recognition. In Automatic Face & Gesture Recognition and Workshops（FG 2011），2011 IEEE International Conference on，pages 8–15. IEEE.

Cramér，H.（1946）. Mathematical methods of statistics. Princeton University Press.

Crick，F. H. C. and Mitchison，G.（1983）. The function of dream sleep. Nature，304，111–114.

Cybenko，G.（1989）. Approximation by superpositions of a sigmoidal function. Mathematics of Control，Signals，and Systems，2，303–314.

Dahl，G. E.，Ranzato，M.，Mohamed，A.，and Hinton，G. E.（2010）. Phone recognition with the mean-covariance restricted Boltzmann machine. In Advances in Neural Information Processing Systems（NIPS）.

Dahl，G. E.，Yu，D.，Deng，L.，and Acero，A.（2012）. Context-dependent pre-trained deep neural networks for large vocabulary speech recognition. IEEE Transactions on Audio，Speech，and Language Processing，20（1），33–42.

Dahl，G. E.，Sainath，T. N.，and Hinton，G. E.（2013）. Improving deep neural networks for LVCSR using rectified linear units and dropout. In ICASSP'2013.

Dahl，G. E.，Jaitly，N.，and Salakhutdinov，R.（2014）. Multi-task neural networks for QSAR predictions. arXiv:1406.1231.

Dauphin，Y. and Bengio，Y.（2013）. Stochastic ratio matching of RBMs for sparse high-dimensional inputs. In NIP（1）.

Dauphin，Y.，Glorot，X.，and Bengio，Y.（2011）. Large-scale learning of embeddings with recon-struction sampling. In ICML'2011.

Dauphin，Y.，Pascanu，R.，Gulcehre，C.，Cho，K.，Ganguli，S.，and Bengio，Y.（2014）. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In NIPS'2014.

Davis，A.，Rubinstein，M.，Wadhwa，N.，Mysore，G.，Durand，F.，and Freeman，W. T.（2014）. The visual microphone: Passive recovery of sound from video. ACM Transactions on Graphics（Proc. SIGGRAPH），33（4），79:1–79:10.

Dayan，P.（1990）. Reinforcement comparison. In Connectionist Models: Proceedings of the 1990 Connectionist Summer School，San Mateo，CA.

Dayan，P. and Hinton，G. E.（1996）. Varieties of Helmholtz machine. Neural Networks，9（8），1385–1403.

Dayan，P.，Hinton，G. E.，Neal，R. M.，and Zemel，R. S.（1995）. The Helmholtz machine. Neural computation，7（5），889–904.

Dean，J.，Corrado，G.，Monga，R.，Chen，K.，Devin，M.，Le，Q.，Mao，M.，Ranzato，M.，Senior，A.，Tucker，P.，Yang，K.，and Ng，A. Y.（2012）. Large scale distributed deep networks. In NIPS'2012.

Dean，T. and Kanazawa，K.（1989）. A model for reasoning about persistence and causation. Computational Intelligence，5（3），142–150.

Deerwester，S.，Dumais，S. T.，Furnas，G. W.，Landauer，T. K.，and Harshman，R.（1990）. Indexing by latent semantic analysis. Journal of the American Society for Information Science，41（6），391–407.

Delalleau，O. and Bengio，Y.（2011）. Shallow vs. deep sum-product networks. In NIPS.

Deng，J.，Dong，W.，Socher，R.，Li，L.-J.，Li，K.，and Fei-Fei，L.（2009）. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09.

Deng，J.，Berg，A. C.，Li，K.，and Fei-Fei，L.（2010a）. What does classifying more than 10，000 image categories tell us? In Proceedings of the 11th European Conference on Computer Vision: Part V，ECCV'10，pages 71–84，Berlin，Heidelberg. Springer-Verlag.

Deng，L. and Yu，D.（2014）. Deep learning–methods and applications. Foundations and Trends in Signal Processing.

Deng，L.，Seltzer，M.，Yu，D.，Acero，A.，Mohamed，A.，and Hinton，G.（2010b）. Binary coding of speech spectrograms using a deep auto-encoder. In Interspeech 2010，Makuhari，Chiba，Japan.

Denil，M.，Bazzani，L.，Larochelle，H.，and de Freitas，N.（2012）. Learning where to attend with deep architectures for image tracking. Neural Computation，24（8），2151–2184.

Denton，E.，Chintala，S.，Szlam，A.，and Fergus，R.（2015）. Deep generative image models using a Laplacian pyramid of adversarial networks. NIPS.

Desjardins，G. and Bengio，Y.（2008）. Empirical evaluation of convolutional RBMs for vision. Technical Report 1327，Département d'Informatique et de Recherche Opérationnelle，Université de Montréal.

Desjardins，G.，Courville，A. C.，Bengio，Y.，Vincent，P.，and Delalleau，O.（2010）. Tempered Markov chain Monte Carlo for training of restricted Boltzmann machines. In International Conference on Artificial Intelligence and Statistics，pages 145–152.

Desjardins，G.，Courville，A.，and Bengio，Y.（2011）. On tracking the partition function. In NIPS'2011.

Devlin，J.，Zbib，R.，Huang，Z.，Lamar，T.，Schwartz，R.，and Makhoul，J.（2014）. Fast and robust neural network joint models for statistical machine translation. In Proc. ACL'2014.

Devroye，L.（2013）. Non-Uniform Random Variate Generation. SpringerLink: Bücher. Springer New York.

DiCarlo，J. J.（2013）. Mechanisms underlying visual object recognition:Humans vs. neurons vs. machines. NIPS Tutorial.

Dinh，L.，Krueger，D.，and Bengio，Y.（2014）. NICE: Non-linear independent components esti-mation. arXiv:1410.8516.

Donahue，J.，Hendricks，L. A.，Guadarrama，S.，Rohrbach，M.，Venugopalan，S.，Saenko，K.，and Darrell，T.（2014）. Long-term recurrent convolutional networks for visual recognition and description. arXiv:1411.4389.

Donoho，D. L. and Grimes，C.（2003）. Hessian eigenmaps: new locally linear embedding tech-niques for high-dimensional data. Technical Report 2003-08，Dept. Statistics，Stanford University.

Dosovitskiy，A.，Springenberg，J. T.，and Brox，T.（2015）. Learning to generate chairs with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition，pages 1538–1546.

Doya，K.（1993）. Bifurcations of recurrent neural networks in gradient descent learning. IEEE Transactions on Neural Networks，1，75–80.

Dreyfus，S. E.（1962）. The numerical solution of variational problems. Journal of Mathematical Analysis and Applications，5（1），30–45.

Dreyfus，S. E.（1973）. The computational solution of optimal control problems with time lag. IEEE Transactions on Automatic Control，18（4），383–385.

Drucker，H. and LeCun，Y.（1992）. Improving generalisation performance using double back-propagation. IEEE Transactions on Neural Networks，3（6），991–997.

Duchi，J.，Hazan，E.，and Singer，Y.（2011）. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research.

Dudik，M.，Langford，J.，and Li，L.（2011）. Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on Machine learning，ICML '11.

Dugas，C.，Bengio，Y.，Bélisle，F.，and Nadeau，C.（2001）. Incorporating second-order functional knowledge for better option pricing. In T. Leen，T. Dietterich，and V. Tresp，editors，Advances in Neural Information Processing Systems 13（NIPS'00），pages 472–478. MIT Press.

Dziugaite，G. K.，Roy，D. M.，and Ghahramani，Z.（2015）. Training generative neural networks via maximum mean discrepancy optimization. arXiv preprint arXiv:1505.03906.

El Hihi，S. and Bengio，Y.（1996）. Hierarchical recurrent neural networks for long-term depen-dencies. In NIPS 8. MIT Press.

Elkahky，A. M.，Song，Y.，and He，X.（2015）. A multi-view deep learning approach for cross domain user modeling in recommendation systems. In Proceedings of the 24th International Conference on World Wide Web，pages 278–288.

Elman，J. L.（1993）. Learning and development in neural networks: The importance of starting small. Cognition，48，781–799.

Erhan，D.，Manzagol，P.-A.，Bengio，Y.，Bengio，S.，and Vincent，P.（2009）. The difficulty of training deep architectures and the effect of unsupervised pre-training. In AISTATS'2009，pages 153–160.

Erhan，D.，Bengio，Y.，Courville，A.，Manzagol，P.，Vincent，P.，and Bengio，S.（2010）. Why does unsupervised pre-training help deep learning? J. Machine Learning Res.

Fahlman，S. E.，Hinton，G. E.，and Sejnowski，T. J.（1983）. Massively parallel architectures for AI: NETL，thistle，and Boltzmann machines. In Proceedings of the National Conference on Artificial Intelligence AAAI-83.

Fang，H.，Gupta，S.，Iandola，F.，Srivastava，R.，Deng，L.，Dollár，P.，Gao，J.，He，X.，Mitchell，M.，Platt，J. C.，Zitnick，C. L.，and Zweig，G.（2015）. From captions to visual concepts and back. arXiv:1411.4952.

Farabet，C.，LeCun，Y.，Kavukcuoglu，K.，Culurciello，E.，Martini，B.，Akselrod，P.，and Talay，S.（2011）. Large-scale FPGA-based convolutional networks. In R. Bekkerman，M. Bilenko，and J. Langford，editors，Scaling up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press.

Farabet，C.，Couprie，C.，Najman，L.，and LeCun，Y.（2013）. Learning hierarchical features for scene labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence，35（8），1915–1929.

Fei-Fei，L.，Fergus，R.，and Perona，P.（2006）. One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence，28（4），594–611.

Finn，C.，Tan，X. Y.，Duan，Y.，Darrell，T.，Levine，S.，and Abbeel，P.（2015）. Learning visual feature spaces for robotic manipulation with deep spatial autoencoders. arXiv preprint arXiv:1509.06113.

Fisher，R. A.（1936）. The use of multiple measurements in taxonomic problems. Annals of Eugenics，7，179–188.

Földiák，P.（1989）. Adaptive network for optimal linear feature extraction. In International Joint Conference on Neural Networks（IJCNN），volume 1，pages 401–405，Washington 1989. IEEE，New York.

Franzius，M.，Sprekeler，H.，and Wiskott，L.（2007）. Slowness and sparseness lead to place，head-direction，and spatial-view cells.

Franzius，M.，Wilbert，N.，and Wiskott，L.（2008）. Invariant object recognition with slow feature analysis. In Proceedings of the 18th international conference on Artificial Neural Networks，Part I，ICANN '08，pages 961–970，Berlin，Heidelberg. Springer-Verlag.

Frasconi，P.，Gori，M.，and Sperduti，A.（1997）. On the efficient classification of data structures by neural networks. In Proc. Int. Joint Conf. on Artificial Intelligence.

Frasconi，P.，Gori，M.，and Sperduti，A.（1998）. A general framework for adaptive processing of data structures. IEEE Transactions on Neural Networks，9（5），768–786.

Freund，Y. and Schapire，R. E.（1996a）. Experiments with a new boosting algorithm. In Machine Learning: Proceedings of Thirteenth International Conference，pages 148–156，USA. ACM.

Freund，Y. and Schapire，R. E.（1996b）. Game theory，on-line prediction and boosting. In Proceedings of the Ninth Annual Conference on Computational Learning Theory，pages 325–332.

Frey，B. J.（1998）. Graphical models for machine learning and digital communication. MIT Press.

Frey，B. J.，Hinton，G. E.，and Dayan，P.（1996）. Does the wake-sleep algorithm learn good density estimators? In D. Touretzky，M. Mozer，and M. Hasselmo，editors，Advances in Neural Information Processing Systems 8（NIPS'95），pages 661–670. MIT Press，Cambridge，MA.

Frobenius，G.（1908）. Über matrizen aus positiven elementen，s. B. Preuss. Akad. Wiss. Berlin，Germany.

Fukushima，K.（1975）. Cognitron: A self-organizing multilayered neural network. Biological Cybernetics，20，121–136.

Fukushima，K.（1980）. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics，36，193–202.

Gal，Y. and Ghahramani，Z.（2015）. Bayesian convolutional neural networks with Bernoulli approximate variational inference. arXiv preprint arXiv:1506.02158.

Gallinari，P.，LeCun，Y.，Thiria，S.，and Fogelman-Soulie，F.（1987）. Memoires associatives distribuees. In Proceedings of COGNITIVA 87，Paris，La Villette.

Garcia-Duran，A.，Bordes，A.，Usunier，N.，and Grandvalet，Y.（2015）. Combining two and three-way embeddings models for link prediction in knowledge bases. arXiv preprint arXiv:1506.00999.

Garofolo，J. S.，Lamel，L. F.，Fisher，W. M.，Fiscus，J. G.，and Pallett，D. S.（1993）. Darpa timit acoustic-phonetic continous speech corpus cd-rom. nist speech disc 1-1.1. NASA STI/Recon Technical Report N，93，27403.

Garson，J.（1900）. The metric system of identification of criminals，as used in Great Britain and Ireland. The Journal of the Anthropological Institute of Great Britain and Ireland，（2），177–227.

Gers，F. A.，Schmidhuber，J.，and Cummins，F.（2000）. Learning to forget: Continual prediction with LSTM. Neural computation，12（10），2451–2471.

Ghahramani，Z. and Hinton，G. E.（1996）. The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1，Dpt. of Comp. Sci.，Univ. of Toronto.

Gillick，D.，Brunk，C.，Vinyals，O.，and Subramanya，A.（2015）. Multilingual language processing from bytes. arXiv preprint arXiv:1512.00103.

Girshick，R.，Donahue，J.，Darrell，T.，and Malik，J.（2015）. Region-based convolutional networks for accurate object detection and segmentation.

Giudice，M. D.，Manera，V.，and Keysers，C.（2009）. Programmed to learn? The ontogeny of mirror neurons. Dev. Sci.，12（2），350–363.

Glorot，X. and Bengio，Y.（2010）. Understanding the difficulty of training deep feedforward neural networks. In AISTATS'2010.

Glorot，X.，Bordes，A.，and Bengio，Y.（2011a）. Deep sparse rectifier neural networks. In AISTATS'2011.

Glorot，X.，Bordes，A.，and Bengio，Y.（2011b）. Domain adaptation for large-scale sentiment classification: A deep learning approach. In ICML'2011.

Glorot，X.，Bordes，A.，and Bengio，Y.（2011c）. Domain adaptation for large-scale sentiment classification: A deep learning approach. In ICM（1b），pages 97–110.

Goldberger，J.，Roweis，S.，Hinton，G. E.，and Salakhutdinov，R.（2005）. Neighbourhood components analysis. In L. Saul，Y. Weiss，and L. Bottou，editors，Advances in Neural Information Processing Systems 17（NIPS'04）. MIT Press.

Gong，S.，McKenna，S.，and Psarrou，A.（2000）. Dynamic Vision: From Images to Face Recognition. Imperial College Press.

Goodfellow，I.，Le，Q.，Saxe，A.，and Ng，A.（2009）. Measuring invariances in deep networks. In Y. Bengio，D. Schuurmans，C. Williams，J. Lafferty，and A. Culotta，editors，Advances in Neural Information Processing Systems 22（NIPS'09），pages 646–654.

Goodfellow，I.，Koenig，N.，Muja，M.，Pantofaru，C.，Sorokin，A.，and Takayama，L.（2010）. Help me help you: Interfaces for personal robots. In Proc. of Human Robot Interaction（HRI），Osaka，Japan. ACM Press，ACM Press.

Goodfellow，I.，Mirza，M.，Xiao，D.，Courville，A.，and Bengio，Y.（2014a）. An empirical inves-tigation of catastrophic forgetting in gradient-based neural networks. In ICLR'14.

Goodfellow，I. J.（2010）. Technical report:Multidimensional，downsampled convolution for autoencoders. Technical report，Université de Montréal.

Goodfellow，I. J.（2014）. On distinguishability criteria for estimating generative models. In International Conference on Learning Representations，Workshops Track.

Goodfellow，I. J.，Courville，A.，and Bengio，Y.（2011）. Spike-and-slab sparse coding for unsu-pervised feature discovery. In NIPS Workshop on Challenges in Learning Hierarchical Models.

Goodfellow，I. J.，Warde-Farley，D.，Mirza，M.，Courville，A.，and Bengio，Y.（2013a）. Maxout networks. In ICML'2013.

Goodfellow，I. J.，Warde-Farley，D.，Mirza，M.，Courville，A.，and Bengio，Y.（2013b）. Maxout networks. In ICM（1c），pages 1319–1327.

Goodfellow，I. J.，Warde-Farley，D.，Mirza，M.，Courville，A.，and Bengio，Y.（2013c）. Maxout networks. Technical Report arXiv:1302.4389，Université de Montréal.

Goodfellow，I. J.，Mirza，M.，Courville，A.，and Bengio，Y.（2013d）. Multi-prediction deep Boltzmann machines. In NIP（1）.

Goodfellow，I. J.，Warde-Farley，D.，Lamblin，P.，Dumoulin，V.，Mirza，M.，Pascanu，R.，Bergstra，J.，Bastien，F.，and Bengio，Y.（2013e）. Pylearn2: a machine learning research library. arXiv preprint arXiv:1308.4214.

Goodfellow，I. J.，Courville，A.，and Bengio，Y.（2013f）. Scaling up spike-and-slab models for unsupervised feature learning. IEEE T. PAMI，pages 1902–1914.

Goodfellow，I. J.，Courville，A.，and Bengio，Y.（2013g）. Scaling up spike-and-slab models for un-supervised feature learning. IEEE Transactions on Pattern Analysis and Machine Intelligence，35（8），1902–1914.

Goodfellow，I. J.，Shlens，J.，and Szegedy，C.（2014b）. Explaining and harnessing adversarial examples. CoRR，abs/1412.6572.

Goodfellow，I. J.，Pouget-Abadie，J.，Mirza，M.，Xu，B.，Warde-Farley，D.，Ozair，S.，Courville，A.，and Bengio，Y.（2014c）. Generative adversarial networks. In NIPS'2014.

Goodfellow，I. J.，Bulatov，Y.，Ibarz，J.，Arnoud，S.，and Shet，V.（2014d）. Multi-digit number recognition from Street View imagery using deep convolutional neural networks. In International Conference on Learning Representations.

Goodfellow，I. J.，Vinyals，O.，and Saxe，A. M.（2015）. Qualitatively characterizing neural network optimization problems. In International Conference on Learning Representations.

Goodman，J.（2001）. Classes for fast maximum entropy training. In International Conference on Acoustics，Speech and Signal Processing（ICASSP），Utah.

Gori，M. and Tesi，A.（1992）. On the problem of local minima in backpropagation. IEEE Transactions on Pattern Analysis and Machine Intelligence，PAMI-14（1），76–86.

Gosset，W. S.（1908）. The probable error of a mean. Biometrika，6（1），1–25. Originally published under the pseudonym“Student”.

Gouws，S.，Bengio，Y.，and Corrado，G.（2014）. BilBOWA: Fast bilingual distributed representations without word alignments. Technical report，arXiv:1410.2455.

Graf，H. P. and Jackel，L. D.（1989）. Analog electronic neural network circuits. Circuits and Devices Magazine，IEEE，5（4），44–49.

Graves，A.（2011）. Practical variational inference for neural networks. In NIPS'2011.

Graves，A.（2012）. Supervised Sequence Labelling with Recurrent Neural Networks. Studies in Computational Intelligence. Springer.

Graves，A.（2013）. Generating sequences with recurrent neural networks. Technical report，arXiv:1308.0850.

Graves，A. and Jaitly，N.（2014）. Towards end-to-end speech recognition with recurrent neural networks. In ICML'2014.

Graves，A. and Schmidhuber，J.（2005）. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks，18（5），602–610.

Graves，A. and Schmidhuber，J.（2009）. Offine handwriting recognition with multidimensional recurrent neural networks. In D. Koller，D. Schuurmans，Y. Bengio，and L. Bottou，editors，NIPS'2008，pages 545–552.

Graves，A.，Fernández，S.，Gomez，F.，and Schmidhuber，J.（2006）. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In ICML'2006，pages 369–376，Pittsburgh，USA.

Graves，A.，Liwicki，M.，Bunke，H.，Schmidhuber，J.，and Fernández，S.（2008）. Unconstrained on-line handwriting recognition with recurrent neural networks. In J. Platt，D. Koller，Y. Singer，and S. Roweis，editors，NIPS'2007，pages 577–584.

Graves，A.，Liwicki，M.，Fernández，S.，Bertolami，R.，Bunke，H.，and Schmidhuber，J.（2009）. A novel connectionist system for unconstrained handwriting recognition. Pattern Analysis and Machine Intelligence，IEEE Transactions on，31（5），855–868.

Graves，A.，Mohamed，A.，and Hinton，G.（2013）. Speech recognition with deep recurrent neural networks. In ICASSP'2013，pages 6645–6649.

Graves，A.，Wayne，G.，and Danihelka，I.（2014）. Neural Turing machines. arXiv:1410.5401.

Grefenstette，E.，Hermann，K. M.，Suleyman，M.，and Blunsom，P.（2015）. Learning to transduce with unbounded memory. In NIPS'2015.

Greff，K.，Srivastava，R. K.，Koutník，J.，Steunebrink，B. R.，and Schmidhuber，J.（2015）. LSTM: a search space odyssey. arXiv preprint arXiv:1503.04069.

Gregor，K. and LeCun，Y.（2010a）. Emergence of complex-like cells in a temporal product network with local receptivefields. Technical report，arXiv:1006.0448.

Gregor，K. and LeCun，Y.（2010b）. Learning fast approximations of sparse coding. In L. Bottou and M. Littman，editors，Proceedings of the Twenty-seventh International Conference on Machine Learning（ICML-10）. ACM.

Gregor，K.，Danihelka，I.，Mnih，A.，Blundell，C.，and Wierstra，D.（2014）. Deep autoregressive networks. In International Conference on Machine Learning（ICML'2014）.

Gregor，K.，Danihelka，I.，Graves，A.，and Wierstra，D.（2015）. DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623.

Gretton，A.，Borgwardt，K. M.，Rasch，M. J.，Schölkopf，B.，and Smola，A.（2012）. A kernel two-sample test. The Journal of Machine Learning Research，13（1），723–773.

Guillaume Desjardins，Karen Simonyan，R. P. K. K.（2015）. Natural neural networks. Technical report，arXiv:1507.00210.

Gulcehre，C. and Bengio，Y.（2013）. Knowledge matters: Importance of prior information for optimization. Technical Report arXiv:1301.4083，Universite de Montreal.

Guo，H. and Gelfand，S. B.（1992）. Classification trees with neural network feature extraction. Neural Networks，IEEE Transactions on，3（6），923–933.

Gupta，S.，Agrawal，A.，Gopalakrishnan，K.，and Narayanan，P.（2015）. Deep learning with limited numerical precision. CoRR，abs/1502.02551.

Gutmann，M. and Hyvarinen，A.（2010）. Noise-contrastive estimation: A new estimation princi-ple for unnormalized statistical models. In Proceedings of The Thirteenth International Conference on Artificial Intelligence and Statistics（AISTATS'10）.

Hadsell，R.，Sermanet，P.，Ben，J.，Erkan，A.，Han，J.，Muller，U.，and LeCun，Y.（2007）. Online learning for offroad robots: Spatial label propagation to learn long-range traversability. In Proceedings of Robotics: Science and Systems，Atlanta，GA，USA.

Hajnal，A.，Maass，W.，Pudlak，P.，Szegedy，M.，and Turan，G.（1993）. Threshold circuits of bounded depth. J. Comput. System. Sci.，46，129–154.

Håstad，J.（1986）. Almost optimal lower bounds for small depth circuits. In Proceedings of the 18th annual ACM Symposium on Theory of Computing，pages 6–20，Berkeley，California. ACM Press.

Håstad，J. and Goldmann，M.（1991）. On the power of small-depth threshold circuits. Computational Complexity，1，113–129.

Hastie，T.，Tibshirani，R.，and Friedman，J.（2001）. The elements of statistical learning: data mining，inference and prediction. Springer Series in Statistics. Springer Verlag.

He，K.，Zhang，X.，Ren，S.，and Sun，J.（2015）. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. arXiv preprint arXiv:1502.01852.

Hebb，D. O.（1949）. The Organization of Behavior. Wiley，New York.

Henaff，M.，Jarrett，K.，Kavukcuoglu，K.，and LeCun，Y.（2011）. Unsupervised learning of sparse features for scalable audio classification. In ISMIR'11.

Henderson，J.（2003）. Inducing history representations for broad coverage statistical parsing. In HLT-NAACL，pages 103–110.

Henderson，J.（2004）. Discriminative training of a neural network statistical parser. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics，page 95.

Henniges，M.，Puertas，G.，Bornschein，J.，Eggert，J.，and Lücke，J.（2010）. Binary sparse coding. In Latent Variable Analysis and Signal Separation，pages 450–457. Springer.

Herault，J. and Ans，B.（1984）. Circuits neuronaux à synapses modifiables: Décodage de messages composites par apprentissage non supervisé. Comptes Rendus de l'Académie des Sciences，299（III-13），525–528.

Hinton，G.，Deng，L.，Dahl，G. E.，Mohamed，A.，Jaitly，N.，Senior，A.，Vanhoucke，V.，Nguyen，P.，Sainath，T.，and Kingsbury，B.（2012a）. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine，29（6），82–97.

Hinton，G.，Vinyals，O.，and Dean，J.（2015）. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

Hinton，G. E.（1989）. Connectionist learning procedures. Artificial Intelligence，40，185–234.

Hinton，G. E.（1990）. Mapping part-whole hierarchies into connectionist networks. Artificial Intelligence，46（1），47–75.

Hinton，G. E.（1999）. Products of experts. In Proceedings of the Ninth International Conference on Artificial Neural Networks（ICANN），volume 1，pages 1–6，Edinburgh，Scotland. IEE.

Hinton，G. E.（2000）. Training products of experts by minimizing contrastive divergence. Technical Report GCNU TR 2000-004，Gatsby Unit，University College London.

Hinton，G. E.（2006）. To recognize shapes，first learn to generate images. Technical Report UTML TR 2006-003，University of Toronto.

Hinton，G. E.（2007a）. How to do backpropagation in a brain. Invited talk at the NIPS'2007 Deep Learning Workshop.

Hinton，G. E.（2007b）. Learning multiple layers of representation. Trends in cognitive sciences，11（10），428–434.

Hinton，G. E.（2010）. A practical guide to training restricted Boltzmann machines. Technical Report UTML TR 2010-003，Comp. Sc.，University of Toronto.

Hinton，G. E.（2012）. Tutorial on deep learning. IPAM Graduate Summer School: Deep Learning，Feature Learning.

Hinton，G. E. and Ghahramani，Z.（1997）. Generative models for discovering sparse distributed representations. Philosophical Transactions of the Royal Society of London.

Hinton，G. E. and McClelland，J. L.（1988）. Learning representations by recirculation. In NIPS'1987，pages 358–366.

Hinton，G. E. and Roweis，S.（2003）. Stochastic neighbor embedding. In NIPS'2002.

Hinton，G. E. and Salakhutdinov，R.（2006）. Reducing the dimensionality of data with neural networks. Science，313（5786），504–507.

Hinton，G. E. and Sejnowski，T. J.（1986）. Learning and relearning in Boltzmann machines. In D. E. Rumelhart and J. L. McClelland，editors，Parallel Distributed Processing，volume 1，chapter 7，pages 282–317. MIT Press，Cambridge.

Hinton，G. E. and Sejnowski，T. J.（1999）. Unsupervised learning: foundations of neural computation. MIT press.

Hinton，G. E. and Shallice，T.（1991）. Lesioning an attractor network: investigations of acquired dyslexia. Psychological review，98（1），74.

Hinton，G. E. and Zemel，R. S.（1994）. Autoencoders，minimum description length，and Helmholtz free energy. In NIPS'1993.

Hinton，G. E.，Sejnowski，T. J.，and Ackley，D. H.（1984a）. Boltzmann machines: Constraint satisfaction networks that learn. Technical Report TR-CMU-CS-84-119，Carnegie-Mellon Uni-versity，Dept. of Computer Science.

Hinton，G. E.，Sejnowski，T. J.，and Ackley，D. H.（1984b）. Boltzmann machines: Constraint satisfaction networks that learn. Technical Report TR-CMU-CS-84-119，Carnegie-Mellon Uni-versity，Dept. of Computer Science.

Hinton，G. E.，McClelland，J.，and Rumelhart，D.（1986）. Distributed representations. In D. E. Rumelhart and J. L. McClelland，editors，Parallel Distributed Processing: Explorations in the Microstructure of Cognition，volume 1，pages 77–109. MIT Press，Cambridge.

Hinton，G. E.，Revow，M.，and Dayan，P.（1995a）. Recognizing handwritten digits using mixtures of linear models. In G. Tesauro，D. Touretzky，and T. Leen，editors，Advances in Neural Information Processing Systems 7（NIPS'94），pages 1015–1022. MIT Press，Cambridge，MA.

Hinton，G. E.，Dayan，P.，Frey，B. J.，and Neal，R. M.（1995b）. The wake-sleep algorithm for unsupervised neural networks. Science，268，1558–1161.

Hinton，G. E.，Dayan，P.，and Revow，M.（1997）. Modelling the manifolds of images of hand-written digits. IEEE Transactions on Neural Networks，8，65–74.

Hinton，G. E.，Welling，M.，Teh，Y. W.，and Osindero，S.（2001）. A new view of ICA. In Proceedings of 3rd International Conference on Independent Component Analysis and Blind Signal Separation（ICA'01），pages 746–751，San Diego，CA.

Hinton，G. E.，Osindero，S.，and Teh，Y.（2006a）. A fast learning algorithm for deep belief nets. Neural Computation，18，1527–1554.

Hinton，G. E.，Osindero，S.，and Teh，Y.-W.（2006b）. A fast learning algorithm for deep belief nets. Neural Computation，18，1527–1554.

Hinton，G. E.，Deng，L.，Yu，D.，Dahl，G. E.，Mohamed，A.，Jaitly，N.，Senior，A.，Vanhoucke，V.，Nguyen，P.，Sainath，T. N.，and Kingsbury，B.（2012b）. Deep neural networks for acoustic modeling in speech recognition:The shared views of four research groups. IEEE Signal Process. Mag.，29（6），82–97.

Hinton，G. E.，Srivastava，N.，Krizhevsky，A.，Sutskever，I.，and Salakhutdinov，R.（2012c）. Improving neural networks by preventing co-adaptation of feature detectors. Technical report，arXiv:1207.0580.

Hinton，G. E.，Srivastava，N.，Krizhevsky，A.，Sutskever，I.，and Salakhutdinov，R.（2012d）. Improving neural networks by preventing co-adaptation of feature detectors. Technical report，arXiv:1207.0580.

Hinton，G. E.，Vinyals，O.，and Dean，J.（2014）. Dark knowledge. Invited talk at the BayLearn Bay Area Machine Learning Symposium.

Hochreiter，S.（1991a）. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis，T.U. München.

Hochreiter，S.（1991b）. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis，Institut für Informatik，Lehrstuhl Prof. Brauer，Technische Universität München.

Hochreiter，S. and Schmidhuber，J.（1995）. Simplifying neural nets by discoveringflat minima. In Advances in Neural Information Processing Systems 7，pages 529–536. MIT Press.

Hochreiter，S. and Schmidhuber，J.（1997）. Long short-term memory. Neural Computation，9（8），1735–1780.

Hochreiter，S.，Bengio，Y.，and Frasconi，P.（2001）. Gradientflow in recurrent nets: the difficulty of learning long-term dependencies. In J. Kolen and S. Kremer，editors，Field Guide to Dynamical Recurrent Networks. IEEE Press.

Holi，J. L. and Hwang，J.-N.（1993）. Finite precision error analysis of neural network hardware implementations. Computers，IEEE Transactions on，42（3），281–290.

Holt，J. L. and Baker，T. E.（1991）. Back propagation simulations using limited precision calculations. In Neural Networks，1991.，IJCNN-91-Seattle International Joint Conference on，volume 2，pages 121–126. IEEE.

Hornik，K.，Stinchcombe，M.，and White，H.（1989）. Multilayer feedforward networks are universal approximators. Neural Networks，2，359–366.

Hornik，K.，Stinchcombe，M.，and White，H.（1990）. Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural networks，3（5），551–560.

Hsu，F.-H.（2002）. Behind Deep Blue: Building the Computer That Defeated the World Chess Champion. Princeton University Press，Princeton，NJ，USA.

Huang，F. and Ogata，Y.（2002）. Generalized pseudo-likelihood estimates for Markov random fields on lattice. Annals of the Institute of Statistical Mathematics，54（1），1–18.

Huang，P.-S.，He，X.，Gao，J.，Deng，L.，Acero，A.，and Heck，L.（2013）. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM international conference on Conference on information & knowledge management，pages 2333–2338. ACM.

Hubel，D. and Wiesel，T.（1968）. Receptivefields and functional architecture of monkey striate cortex. Journal of Physiology（London），195，215–243.

Hubel，D. H. and Wiesel，T. N.（1959）. Receptivefields of single neurons in the cat's striate cortex. Journal of Physiology，148，574–591.

Hubel，D. H. and Wiesel，T. N.（1962）. Receptivefields，binocular interaction，and functional architecture in the cat's visual cortex. Journal of Physiology（London），160，106–154.

Huszar，F.（2015）. How（not） to train your generative model: schedule sampling，likelihood，adversary? arXiv:1511.05101.

Hutter，F.，Hoos，H.，and Leyton-Brown，K.（2011）. Sequential model-based optimization for general algorithm configuration. In LION-5. Extended version as UBC Tech report TR-2010-10.

Hyotyniemi，H.（1996）. Turing machines are recurrent neural networks. In STeP'96，pages 13–24.

Hyvärinen，A.（1999）. Survey on independent component analysis. Neural Computing Surveys，2，94–128.

Hyvärinen，A.（2005a）. Estimation of non-normalized statistical models using score matching. Journal of Machine Learning Research，6，695–709.

Hyvärinen，A.（2005b）. Estimation of non-normalized statistical models using score matching. J. Machine Learning Res.，6.

Hyvärinen，A.（2007a）. Connections between score matching，contrastive divergence，and pseu-dolikelihood for continuous-valued variables. IEEE Transactions on Neural Networks，18，1529–1531.

Hyvärinen，A.（2007b）. Some extensions of score matching. Computational Statistics and Data Analysis，51，2499–2512.

Hyvärinen，A. and Hoyer，P. O.（1999）. Emergence of topography and complex cell properties from natural images using extensions of ica. In NIPS，pages 827–833.

Hyvärinen，A. and Pajunen，P.（1999）. Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks，12（3），429–439.

Hyvärinen，A.，Karhunen，J.，and Oja，E.（2001a）. Independent Component Analysis. Wiley-Interscience.

Hyvärinen，A.，Hoyer，P. O.，and Inki，M. O.（2001b）. Topographic independent component analysis. Neural Computation，13（7），1527–1558.

Hyvärinen，A.，Hurri，J.，and Hoyer，P. O.（2009）. Natural Image Statistics: A probabilistic approach to early computational vision. Springer-Verlag.

Iba，Y.（2001）. Extended ensemble Monte Carlo. International Journal of Modern Physics，C12，623–656.

Inayoshi，H. and Kurita，T.（2005）. Improved generalization by adding both auto-association and hidden-layer noise to neural-network-based-classifiers. IEEE Workshop on Machine Learning for Signal Processing，pages 141–146.

Ioffe，S. and Szegedy，C.（2015）. Batch normalization: Accelerating deep network training by reducing internal covariate shift.

Jacobs，R. A.（1988）. Increased rates of convergence through learning rate adaptation. Neural networks，1（4），295–307.

Jacobs，R. A.，Jordan，M. I.，Nowlan，S. J.，and Hinton，G. E.（1991）. Adaptive mixtures of local experts. Neural Computation，3，79–87.

Jaeger，H.（2003）. Adaptive nonlinear system identification with echo state networks. In Advances in Neural Information Processing Systems 15.

Jaeger，H.（2007a）. Discovering multiscale dynamical features with hierarchical echo state networks. Technical report，Jacobs University.

Jaeger，H.（2007b）. Echo state network. Scholarpedia，2（9），2330.

Jaeger，H.（2012）. Long short-term memory in echo state networks: Details of a simulation study. Technical report，Technical report，Jacobs University Bremen.

Jaeger，H. and Haas，H.（2004）. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science，304（5667），78–80.

Jaeger，H.，Lukosevicius，M.，Popovici，D.，and Siewert，U.（2007）. Optimization and applications of echo state networks with leaky-integrator neurons. Neural Networks，20（3），335–352.

Jain，V.，Murray，J. F.，Roth，F.，Turaga，S.，Zhigulin，V.，Briggman，K. L.，Helmstaedter，M. N.，Denk，W.，and Seung，H. S.（2007）. Supervised learning of image restoration with convolutional networks. In Computer Vision，2007. ICCV 2007. IEEE 11th International Conference on，pages 1–8. IEEE.

Jaitly，N. and Hinton，G.（2011）. Learning a better representation of speech soundwaves using restricted Boltzmann machines. In Acoustics，Speech and Signal Processing（ICASSP），2011 IEEE International Conference on，pages 5884–5887. IEEE.

Jaitly，N. and Hinton，G. E.（2013）. Vocal tract length perturbation（VTLP） improves speech recognition. In ICML'2013.

Jarrett，K.，Kavukcuoglu，K.，Ranzato，M.，and LeCun，Y.（2009a）. What is the best multi-stage architecture for object recognition? In Proc. International Conference on Computer Vision（ICCV'09），pages 2146–2153. IEEE.

Jarrett，K.，Kavukcuoglu，K.，Ranzato，M.，and LeCun，Y.（2009b）. What is the best multi-stage architecture for object recognition? In ICCV'09.

Jarzynski，C.（1997）. Nonequilibrium equality for free energy differences. Phys. Rev. Lett.，78，2690–2693.

Jaynes，E. T.（2003）. Probability Theory: The Logic of Science. Cambridge University Press.

Jean，S.，Cho，K.，Memisevic，R.，and Bengio，Y.（2014）. On using very large target vocabulary for neural machine translation. arXiv:1412.2007.

Jelinek，F. and Mercer，R. L.（1980）. Interpolated estimation of Markov source parameters from sparse data. In E. S. Gelsema and L. N. Kanal，editors，Pattern Recognition in Practice. North-Holland，Amsterdam.

Jia，Y.（2013）. Caffe:An open source convolutional architecture for fast feature embedding. http://caffe.berkeleyvision.org/.

Jia，Y.，Huang，C.，and Darrell，T.（2012）. Beyond spatial pyramids: Receptivefield learning for pooled image features. In Computer Vision and Pattern Recognition（CVPR），2012 IEEE Conference on，pages 3370–3377. IEEE.

Jim，K.-C.，Giles，C. L.，and Horne，B. G.（1996）. An analysis of noise in recurrent neural networks: convergence and generalization. IEEE Transactions on Neural Networks，7（6），1424–1438.

Jordan，M. I.（1998）. Learning in Graphical Models. Kluwer，Dordrecht，Netherlands.

Joulin，A. and Mikolov，T.（2015）. Inferring algorithmic patterns with stack-augmented recurrent nets. arXiv preprint arXiv:1503.01007.

Jozefowicz，R.，Zaremba，W.，and Sutskever，I.（2015）. An empirical evaluation of recurrent network architectures. In ICML'2015.

Judd，J. S.（1989）. Neural Network Design and the Complexity of Learning. MIT press.

Jutten，C. and Herault，J.（1991）. Blind separation of sources，part I: an adaptive algorithm based on neuromimetic architecture. Signal Processing，24，1–10.

Kahou，S. E.，Pal，C.，Bouthillier，X.，Froumenty，P.，Gülçehre，c.，Memisevic，R.，Vincent，P.，Courville，A.，Bengio，Y.，Ferrari，R. C.，Mirza，M.，Jean，S.，Carrier，P. L.，Dauphin，Y.，Boulanger-Lewandowski，N.，Aggarwal，A.，Zumer，J.，Lamblin，P.，Raymond，J.-P.，Des-jardins，G.，Pascanu，R.，Warde-Farley，D.，Torabi，A.，Sharma，A.，Bengio，E.，Côté，M.，Konda，K. R.，and Wu，Z.（2013）. Combining modality specific deep neural networks for emotion recognition in video. In Proceedings of the 15th ACM on International Conference on Multimodal Interaction.

Kalchbrenner，N. and Blunsom，P.（2013）. Recurrent continuous translation models. In EMNLP'2013.

Kalchbrenner，N.，Danihelka，I.，and Graves，A.（2015）. Grid long short-term memory. arXiv preprint arXiv:1507.01526.

Kamyshanska，H. and Memisevic，R.（2015）. The potential energy of an autoencoder. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Karpathy，A. and Li，F.-F.（2015）. Deep visual-semantic alignments for generating image de-scriptions. In CVPR'2015. arXiv:1412.2306.

Karpathy，A.，Toderici，G.，Shetty，S.，Leung，T.，Sukthankar，R.，and Fei-Fei，L.（2014）. Large-scale video classification with convolutional neural networks. In CVPR.

Karush，W.（1939）. Minima of Functions of Several Variables with Inequalities as Side Constraints. Master's thesis，Dept. of Mathematics，Univ. of Chicago.

Katz，S. M.（1987）. Estimation of probabilities from sparse data for the language model compo-nent of a speech recognizer. IEEE Transactions on Acoustics，Speech，and Signal Processing，ASSP-35（3），400–401.

Kavukcuoglu，K.，Ranzato，M.，and LeCun，Y.（2008）. Fast inference in sparse coding algorithms with applications to object recognition. Technical report，Computational and Biological Learn-ing Lab，Courant Institute，NYU. Tech Report CBLL-TR-2008-12-01.

Kavukcuoglu，K.，Ranzato，M.-A.，Fergus，R.，and LeCun，Y.（2009）. Learning invariant features through topographicfilter maps. In CVPR'2009.

Kavukcuoglu，K.，Sermanet，P.，Boureau，Y.-L.，Gregor，K.，Mathieu，M.，and LeCun，Y.（2010）. Learning convolutional feature hierarchies for visual recognition. In NIPS'2010.

Kelley，H. J.（1960）. Gradient theory of optimalflight paths. ARS Journal，30（10），947–954.

Khan，F.，Zhu，X.，and Mutlu，B.（2011）. How do humans teach: On curriculum learning and teaching dimension. In Advances in Neural Information Processing Systems 24（NIPS'11），pages 1449–1457.

Kim，S. K.，McAfee，L. C.，McMahon，P. L.，and Olukotun，K.（2009）. A highly scalable restricted Boltzmann machine FPGA implementation. In Field Programmable Logic and Applications，2009. FPL 2009. International Conference on，pages 367–372. IEEE.

Kindermann，R.（1980）. Markov Random Fields and Their Applications（Contemporary Mathe-matics；V. 1）. American Mathematical Society.

Kingma，D. and Ba，J.（2014）. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kingma，D. and LeCun，Y.（2010a）. Regularized estimation of image statistics by score matching. In NIPS'2010.

Kingma，D. and LeCun，Y.（2010b）. Regularized estimation of image statistics by score matching. In J. Lafferty，C. K. I. Williams，J. Shawe-Taylor，R. Zemel，and A. Culotta，editors，Advances in Neural Information Processing Systems 23，pages 1126–1134.

Kingma，D.，Rezende，D.，Mohamed，S.，and Welling，M.（2014）. Semi-supervised learning with deep generative models. In NIPS'2014.

Kingma，D. P.（2013）. Fast gradient-based inference with continuous latent variable models in auxiliary form. Technical report，arxiv:1306.0733.

Kingma，D. P. and Welling，M.（2014a）. Auto-encoding variational bayes. In Proceedings of the International Conference on Learning Representations（ICLR）.

Kingma，D. P. and Welling，M.（2014b）. Efficient gradient-based inference through transforma-tions between bayes nets and neural nets. Technical report，arxiv:1402.0480.

Kirkpatrick，S.，Jr.，C. D. G.，and Vecchi，M. P.（1983）. Optimization by simulated annealing. Science，220，671–680.

Kiros，R.，Salakhutdinov，R.，and Zemel，R.（2014a）. Multimodal neural language models. In ICML'2014.

Kiros，R.，Salakhutdinov，R.，and Zemel，R.（2014b）. Unifying visual-semantic embeddings with multimodal neural language models. arXiv:1411.2539 ［cs.LG］.

Klementiev，A.，Titov，I.，and Bhattarai，B.（2012）. Inducing crosslingual distributed representations of words. In Proceedings of COLING 2012.

Knowles-Barley，S.，Jones，T. R.，Morgan，J.，Lee，D.，Kasthuri，N.，Lichtman，J. W.，and Pfister，H.（2014）. Deep learning for the connectome. GPU Technology Conference.

Koller，D. and Friedman，N.（2009）. Probabilistic Graphical Models: Principles and Techniques. MIT Press.

Konig，Y.，Bourlard，H.，and Morgan，N.（1996）. REMAP:Recursive estimation and maxi-mization of a posteriori probabilities–application to transition-based connectionist speech recognition. In D. Touretzky，M. Mozer，and M. Hasselmo，editors，Advances in Neural Information Processing Systems 8（NIPS'95）. MIT Press，Cambridge，MA.

Koren，Y.（2009）. The BellKor solution to the Netflix grand prize.

Kotzias，D.，Denil，M.，de Freitas，N.，and Smyth，P.（2015）. From group to individual labels using deep features. In ACM SIGKDD.

Koutnik，J.，Greff，K.，Gomez，F.，and Schmidhuber，J.（2014）. A clockwork RNN. In ICML'2014.

Kočiský，T.，Hermann，K. M.，and Blunsom，P.（2014）. Learning Bilingual Word Representations by Marginalizing Alignments. In Proceedings of ACL.

Krause，O.，Fischer，A.，Glasmachers，T.，and Igel，C.（2013）. Approximation properties of DBNs with binary hidden units and real-valued visible units. In ICML'2013.

Krizhevsky，A.（2010）. Convolutional deep belief networks on CIFAR-10. Technical report，Uni-versity of Toronto. Unpublished Manuscript: http://www.cs.utoronto.ca/kriz/conv-cifar10-aug2010.pdf.

Krizhevsky，A. and Hinton，G.（2009）. Learning multiple layers of features from tiny images. Technical report，University of Toronto.

Krizhevsky，A. and Hinton，G. E.（2011）. Using very deep autoencoders for content-based image retrieval. In ESANN.

Krizhevsky，A.，Sutskever，I.，and Hinton，G.（2012a）. ImageNet classification with deep convo-lutional neural networks. In NIPS'2012.

Krizhevsky，A.，Sutskever，I.，and Hinton，G.（2012b）. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25（NIPS'2012）.

Krueger，K. A. and Dayan，P.（2009）. Flexible shaping: how learning in small steps helps. Cognition，110，380–394.

Kuhn，H. W. and Tucker，A. W.（1951）. Nonlinear programming. In Proceedings of the Sec-ond Berkeley Symposium on Mathematical Statistics and Probability，pages 481–492，Berkeley，Calif. University of California Press.

Kumar，A.，Irsoy，O.，Ondruska，P.，Iyyer，M.，Bradbury，J.，Gulrajani，I.，and Socher，R.（2015a）. Ask me anything: Dynamic memory networks for natural language processing. Technical report，arXiv:1506.07285.

Kumar，A.，Irsoy，O.，Su，J.，Bradbury，J.，English，R.，Pierce，B.，Ondruska，P.，Iyyer，M.，Gulrajani，I.，and Socher，R.（2015b）. Ask me anything: Dynamic memory networks for natural language processing. arXiv:1506.07285.

Kumar，M. P.，Packer，B.，and Koller，D.（2010）. Self-paced learning for latent variable models. In J. Lafferty，C. K. I. Williams，J. Shawe-Taylor，R. Zemel，and A. Culotta，editors，Advances in Neural Information Processing Systems 23，pages 1189–1197.

Lang，K. J. and Hinton，G. E.（1988）. The development of the time-delay neural network archi-tecture for speech recognition. Technical Report CMU-CS-88-152，Carnegie-Mellon University.

Lang，K. J.，Waibel，A. H.，and Hinton，G. E.（1990）. A time-delay neural network architecture for isolated word recognition. Neural networks，3（1），23–43.

Langford，J. and Zhang，T.（2008）. The epoch-greedy algorithm for contextual multi-armed bandits. In NIPS'2008，pages 1096–1103.

Lappalainen，H.，Giannakopoulos，X.，Honkela，A.，and Karhunen，J.（2000）. Nonlinear independent component analysis using ensemble learning: Experiments and discussion. In Proc. ICA. Citeseer.

Larochelle，H. and Bengio，Y.（2008a）. Classification using discriminative restricted Boltzmann machines. In ICML'2008.

Larochelle，H. and Bengio，Y.（2008b）. Classification using discriminative restricted Boltzmann machines. In ICM（1a），pages 536–543.

Larochelle，H. and Hinton，G. E.（2010）. Learning to combine foveal glimpses with a third-order Boltzmann machine. In Advances in Neural Information Processing Systems 23，pages 1243–1251.

Larochelle，H. and Murray，I.（2011）. The Neural Autoregressive Distribution Estimator. In AISTATS'2011.

Larochelle，H.，Erhan，D.，and Bengio，Y.（2008）. Zero-data learning of new tasks. In AAAI Conference on Artificial Intelligence.

Larochelle，H.，Bengio，Y.，Louradour，J.，and Lamblin，P.（2009）. Exploring strategies for training deep neural networks. In JML（1），pages 1–40.

Lasserre，J. A.，Bishop，C. M.，and Minka，T. P.（2006）. Principled hybrids of generative and discriminative models. In Proceedings of the Computer Vision and Pattern Recognition Conference（CVPR'06），pages 87–94，Washington，DC，USA. IEEE Computer Society.

Le，Q.，Ngiam，J.，Chen，Z.，hao Chia，D. J.，Koh，P. W.，and Ng，A.（2010）. Tiled convolutional neural networks. In J. Lafferty，C. K. I. Williams，J. Shawe-Taylor，R. Zemel，and A. Culotta，editors，Advances in Neural Information Processing Systems 23（NIPS'10），pages 1279–1287.

Le，Q.，Ngiam，J.，Coates，A.，Lahiri，A.，Prochnow，B.，and Ng，A.（2011）. On optimization methods for deep learning. In Proc. ICML'2011. ACM.

Le，Q.，Ranzato，M.，Monga，R.，Devin，M.，Corrado，G.，Chen，K.，Dean，J.，and Ng，A.（2012）. Building high-level features using large scale unsupervised learning. In ICML'2012.

Le Roux，N. and Bengio，Y.（2008）. Representational power of restricted Boltzmann machines and deep belief networks. Neural Computation，20（6），1631–1649.

Le Roux，N. and Bengio，Y.（2010）. Deep belief networks are compact universal approximators. Neural Computation，22（8），2192–2207.

LeCun，Y.（1985）. Une procédure d'apprentissage pour Réseau à seuil assymétrique. In Cognitiva 85: A la Frontière de l'Intelligence Artificielle，des Sciences de la Connaissance et des Neurosciences，pages 599–604，Paris 1985. CESTA，Paris.

LeCun，Y.（1986）. Learning processes in an asymmetric threshold network. In E. Bienenstock，F. Fogelman-Soulié，and G. Weisbuch，editors，Disordered Systems and Biological Organization，pages 233–240. Springer-Verlag，Berlin，Les Houches 1985.

LeCun，Y.（1987）. Modèles connexionistes de l'apprentissage. Ph.D. thesis，Université de Paris VI.

LeCun，Y.（1989）. Generalization and network design strategies. Technical Report CRG-TR-89-4，University of Toronto.

LeCun，Y.，Jackel，L. D.，Boser，B.，Denker，J. S.，Graf，H. P.，Guyon，I.，Henderson，D.，Howard，R. E.，and Hubbard，W.（1989）. Handwritten digit recognition: Applications of neural network chips and automatic learning. IEEE Communications Magazine，27（11），41–46.

LeCun，Y.，Bottou，L.，Orr，G. B.，and Müller，K.-R.（1998a）. Efficient backprop. In Neural Networks，Tricks of the Trade，Lecture Notes in Computer Science LNCS 1524. Springer Verlag.

LeCun，Y.，Bottou，L.，Orr，G. B.，and Müller，K.（1998b）. Efficient backprop. In Neural Networks，Tricks of the Trade.

LeCun，Y.，Bottou，L.，Bengio，Y.，and Haffner，P.（1998c）. Gradient based learning applied to document recognition. Proc. IEEE.

LeCun，Y.，Kavukcuoglu，K.，and Farabet，C.（2010）. Convolutional networks and applications in vision. In Circuits and Systems（ISCAS），Proceedings of 2010 IEEE International Symposium on，pages 253–256. IEEE.

L'Ecuyer，P.（1994）. Efficiency improvement and variance reduction. In Proceedings of the 1994 Winter Simulation Conference，pages 122–132.

Lee，C.-Y.，Xie，S.，Gallagher，P.，Zhang，Z.，and Tu，Z.（2014）. Deeply-supervised nets. arXiv preprint arXiv:1409.5185.

Lee，H.，Battle，A.，Raina，R.，and Ng，A.（2007）. Efficient sparse coding algorithms. In B. Schölkopf，J. Platt，and T. Hoffman，editors，Advances in Neural Information Processing Systems 19（NIPS'06），pages 801–808. MIT Press.

Lee，H.，Ekanadham，C.，and Ng，A.（2008）. Sparse deep belief net model for visual area V2. In NIPS'07.

Lee，H.，Grosse，R.，Ranganath，R.，and Ng，A. Y.（2009）. Convolutional deep belief net-works for scalable unsupervised learning of hierarchical representations. In L. Bottou and M. Littman，editors，Proceedings of the Twenty-sixth International Conference on Machine Learning（ICML'09）. ACM，Montreal，Canada.

Lee，Y. J. and Grauman，K.（2011）. Learning the easy thingsfirst: self-paced visual category discovery. In CVPR'2011.

Leibniz，G. W.（1676）. Memoir using the chain rule.（Cited in TMME 7:2&3 p 321-332，2010）.

Lenat，D. B. and Guha，R. V.（1989）. Building large knowledge-based systems；representation and inference in the Cyc project. Addison-Wesley Longman Publishing Co.，Inc.

Leshno，M.，Lin，V. Y.，Pinkus，A.，and Schocken，S.（1993）. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks，6，861–867.

Levenberg，K.（1944）. A method for the solution of certain non-linear problems in least squares. Quarterly Journal of Applied Mathematics，II（2），164–168.

L'Hôpital，G. F. A.（1696）. Analyse des infiniment petits，pour l'intelligence des lignes courbes. Paris: L'Imprimerie Royale.

Li，Y.，Swersky，K.，and Zemel，R. S.（2015）. Generative moment matching networks. CoRR，abs/1502.02761.

Lin，T.，Horne，B. G.，Tino，P.，and Giles，C. L.（1996）. Learning long-term dependencies is not as difficult with NARX recurrent neural networks. IEEE Transactions on Neural Networks，7（6），1329–1338.

Lin，Y.，Liu，Z.，Sun，M.，Liu，Y.，and Zhu，X.（2015）. Learning entity and relation embeddings for knowledge graph completion. In Proc. AAAI'15.

Linde，N.（1992）. The machine that changed the world，episode 3. Documentary miniseries.

Lindsey，C. and Lindblad，T.（1994）. Review of hardware neural networks: a user's perspective. In Proc. Third Workshop on Neural Networks: From Biology to High Energy Physics，pages 195–202，Isola d'Elba，Italy.

Linnainmaa，S.（1976）. Taylor expansion of the accumulated rounding error. BIT Numerical Mathematics，16（2），146–160.

LISA（2008）. Deep learning tutorials:Restricted Boltzmann machines. Technical report，LISA Lab，Université de Montréal.

Long，P. M. and Servedio，R. A.（2010）. Restricted Boltzmann machines are hard to approximately evaluate or simulate. In Proceedings of the 27th International Conference on Machine Learning（ICML'10）.

Lotter，W.，Kreiman，G.，and Cox，D.（2015）. Unsupervised learning of visual structure using predictive generative networks. arXiv preprint arXiv:1511.06380.

Lovelace，A.（1842）. Notes upon L. F. Menabrea's“Sketch of the Analytical Engine invented by Charles Babbage”.

Lu，L.，Zhang，X.，Cho，K.，and Renals，S.（2015）. A study of the recurrent neural network encoder-decoder for large vocabulary speech recognition. In Proc. Interspeech.

Lu，T.，Pál，D.，and Pál，M.（2010）. Contextual multi-armed bandits. In International Conference on Artificial Intelligence and Statistics，pages 485–492.

Luenberger，D. G.（1984）. Linear and Nonlinear Programming. Addison Wesley.

Lukoševičius，M. and Jaeger，H.（2009）. Reservoir computing approaches to recurrent neural network training. Computer Science Review，3（3），127–149.

Luo，H.，Shen，R.，Niu，C.，and Ullrich，C.（2011）. Learning class-relevant features and class-irrelevant features via a hybrid third-order RBM. In International Conference on Artificial Intelligence and Statistics，pages 470–478.

Luo，H.，Carrier，P. L.，Courville，A.，and Bengio，Y.（2013）. Texture modeling with convolutional spike-and-slab RBMs and deep extensions. In AISTATS'2013.

Lyu，S.（2009）. Interpretation and generalization of score matching. In Proceedings of the Twenty-fifth Conference in Uncertainty in Artificial Intelligence（UAI'09）.

Ma，J.，Sheridan，R. P.，Liaw，A.，Dahl，G. E.，and Svetnik，V.（2015）. Deep neural nets as a method for quantitative structure–activity relationships. J. Chemical information and modeling.

Maas，A. L.，Hannun，A. Y.，and Ng，A. Y.（2013）. Rectifier nonlinearities improve neural network acoustic models. In ICML Workshop on Deep Learning for Audio，Speech，and Language Processing.

Maass，W.（1992）. Bounds for the computational power and learning complexity of analog neural nets（extended abstract）. In Proc. of the 25th ACM Symp. Theory of Computing，pages 335–344.

Maass，W.，Schnitger，G.，and Sontag，E. D.（1994）. A comparison of the computational power of sigmoid and Boolean threshold circuits. Theoretical Advances in Neural Computation and Learning，pages 127–151.

Maass，W.，Natschlaeger，T.，and Markram，H.（2002）. Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation，14（11），2531–2560.

MacKay，D.（2003）. Information Theory，Inference and Learning Algorithms. Cambridge University Press.

Maclaurin，D.，Duvenaud，D.，and Adams，R. P.（2015）. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492.

Mao，J.，Xu，W.，Yang，Y.，Wang，J.，and Yuille，A.（2014）. Deep captioning with multimodal recurrent neural networks（m-rnn）. arXiv:1412.6632［cs.CV］.

Marcotte，P. and Savard，G.（1992）. Novel approaches to the discrimination problem. Zeitschrift für Operations Research（Theory），36，517–545.

Marlin，B. and de Freitas，N.（2011）. Asymptotic efficiency of deterministic estimators for discrete energy-based models: Ratio matching and pseudolikelihood. In UAI'2011.

Marlin，B.，Swersky，K.，Chen，B.，and de Freitas，N.（2010）. Inductive principles for restricted Boltzmann machine learning. In AISTATS'2010，pages 509–516.

Marquardt，D. W.（1963）. An algorithm for least-squares estimation of non-linear parameters. Journal of the Society of Industrial and Applied Mathematics，11（2），431–441.

Marr，D. and Poggio，T.（1976）. Cooperative computation of stereo disparity. Science，194.

Martens，J.（2010）. Deep learning via Hessian-free optimization. In ICML'2010，pages 735–742.

Martens，J. and Medabalimi，V.（2014）. On the expressive efficiency of sum product networks. arXiv:1411.7717.

Martens，J. and Sutskever，I.（2011）. Learning recurrent neural networks with Hessian-free optimization. In Proc. ICML'2011. ACM.

Mase，S.（1995）. Consistency of the maximum pseudo-likelihood estimator of continuous state space Gibbsian processes. The Annals of Applied Probability，5（3），pp. 603–612.

McClelland，J.，Rumelhart，D.，and Hinton，G.（1995）. The appeal of parallel distributed processing. In Computation & intelligence，pages 305–341. American Association for Artificial Intelligence.

McCulloch，W. S. and Pitts，W.（1943）. A logical calculus of ideas immanent in nervous activity. Bulletin of Mathematical Biophysics，5，115–133.

Mead，C. and Ismail，M.（2012）. Analog VLSI implementation of neural systems，volume 80. Springer Science & Business Media.

Melchior，J.，Fischer，A.，and Wiskott，L.（2013）. How to center binary deep Boltzmann machines. arXiv preprint arXiv:1311.1354.

Memisevic，R. and Hinton，G. E.（2007）. Unsupervised learning of image transformations. In Proceedings of the Computer Vision and Pattern Recognition Conference（CVPR'07）.

Memisevic，R. and Hinton，G. E.（2010）. Learning to represent spatial transformations with factored higher-order Boltzmann machines. Neural Computation，22（6），1473–1492.

Mesnil，G.，Dauphin，Y.，Glorot，X.，Rifai，S.，Bengio，Y.，Goodfellow，I.，Lavoie，E.，Muller，X.，Desjardins，G.，Warde-Farley，D.，Vincent，P.，Courville，A.，and Bergstra，J.（2011）. Unsupervised and transfer learning challenge: a deep learning approach. In JMLR W&CP: Proc. Unsupervised and Transfer Learning，volume 7.

Mesnil，G.，Rifai，S.，Dauphin，Y.，Bengio，Y.，and Vincent，P.（2012）. Surfing on the manifold. Learning Workshop，Snowbird.

Miikkulainen，R. and Dyer，M. G.（1991）. Natural language processing with modular PDP networks and distributed lexicon. Cognitive Science，15，343–399.

Mikolov，T.（2012）. Statistical Language Models based on Neural Networks. Ph.D. thesis，Brno University of Technology.

Mikolov，T.，Deoras，A.，Kombrink，S.，Burget，L.，and Cernocky，J.（2011a）. Empirical evaluation and combination of advanced language modeling techniques. In Proc. 12th annual conference of the international speech communication association（INTERSPEECH 2011）.

Mikolov，T.，Deoras，A.，Povey，D.，Burget，L.，and Cernocky，J.（2011b）. Strategies for training large scale neural network language models. In Proc. ASRU'2011.

Mikolov，T.，Chen，K.，Corrado，G.，and Dean，J.（2013a）. Efficient estimation of word representations in vector space. In International Conference on Learning Representations: Workshops Track.

Mikolov，T.，Le，Q. V.，and Sutskever，I.（2013b）. Exploiting similarities among languages for machine translation. Technical report，arXiv:1309.4168.

Minka，T.（2005）. Divergence measures and message passing. Microsoft Research Cambridge UK Tech Rep MSRTR2005173，72（TR-2005-173）.

Minsky，M. L. and Papert，S. A.（1969）. Perceptrons. MIT Press，Cambridge.

Mirza，M. and Osindero，S.（2014）. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.

Mishkin，D. and Matas，J.（2015）. All you need is a good init. arXiv preprint arXiv:1511.06422.

Misra，J. and Saha，I.（2010）. Artificial neural networks in hardware: A survey of two decades of progress. Neurocomputing，74（1），239–255.

Mitchell，T. M.（1997）. Machine Learning. McGraw-Hill，New York.

Miyato，T.，Maeda，S.，Koyama，M.，Nakae，K.，and Ishii，S.（2015）. Distributional smoothing with virtual adversarial training. In ICLR. Preprint: arXiv:1507.00677.

Mnih，A. and Gregor，K.（2014）. Neural variational inference and learning in belief networks. In ICML'2014.

Mnih，A. and Hinton，G. E.（2007）. Three new graphical models for statistical language mod-elling. In Z. Ghahramani，editor，Proceedings of the Twenty-fourth International Conference on Machine Learning（ICML'07），pages 641–648. ACM.

Mnih，A. and Hinton，G. E.（2009）. A scalable hierarchical distributed language model. In D. Koller，D. Schuurmans，Y. Bengio，and L. Bottou，editors，Advances in Neural Information Processing Systems 21（NIPS'08），pages 1081–1088.

Mnih，A. and Kavukcuoglu，K.（2013）. Learning word embeddings efficiently with noise-contrastive estimation. In C. Burges，L. Bottou，M. Welling，Z. Ghahramani，and K. Weinberger，editors，Advances in Neural Information Processing Systems 26，pages 2265–2273. Curran Associates，Inc.

Mnih，A. and Teh，Y. W.（2012）. A fast and simple algorithm for training neural probabilistic language models. In ICML'2012，pages 1751–1758.

Mnih，V. and Hinton，G.（2010）. Learning to detect roads in high-resolution aerial images. In Proceedings of the 11th European Conference on Computer Vision（ECCV）.

Mnih，V.，Larochelle，H.，and Hinton，G.（2011）. Conditional restricted Boltzmann machines for structure output prediction. In Proc. Conf. on Uncertainty in Artificial Intelligence（UAI）.

Mnih，V.，Kavukcuoglo，K.，Silver，D.，Graves，A.，Antonoglou，I.，and Wierstra，D.（2013）. Playing Atari with deep reinforcement learning. Technical report，arXiv:1312.5602.

Mnih，V.，Heess，N.，Graves，A.，and Kavukcuoglu，K.（2014）. Recurrent models of visual attention. In Z. Ghahramani，M. Welling，C. Cortes，N. Lawrence，and K. Weinberger，editors，NIPS'2014，pages 2204–2212.

Mnih，V.，Kavukcuoglo，K.，Silver，D.，Rusu，A. A.，Veness，J.，Bellemare，M. G.，Graves，A.，Riedmiller，M.，Fidgeland，A. K.，Ostrovski，G.，Petersen，S.，Beattie，C.，Sadik，A.，Antonoglou，I.，King，H.，Kumaran，D.，Wierstra，D.，Legg，S.，and Hassabis，D.（2015）. Human-level control through deep reinforcement learning. Nature，518，529–533.

Mobahi，H. and Fisher，III，J. W.（2015）. A theoretical analysis of optimization by Gaussian continuation. In AAAI'2015.

Mobahi，H.，Collobert，R.，and Weston，J.（2009）. Deep learning from temporal coherence in video. In L. Bottou and M. Littman，editors，Proceedings of the 26th International Conference on Machine Learning，pages 737–744，Montreal. Omnipress.

Mohamed，A.，Dahl，G.，and Hinton，G.（2009）. Deep belief networks for phone recognition.

Mohamed，A.，Sainath，T. N.，Dahl，G.，Ramabhadran，B.，Hinton，G. E.，and Picheny，M. A.（2011）. Deep belief networks using discriminative features for phone recognition. In Acoustics，Speech and Signal Processing（ICASSP），2011 IEEE International Conference on，pages 5060–5063. IEEE.

Mohamed，A.，Dahl，G.，and Hinton，G.（2012a）. Acoustic modeling using deep belief networks. IEEE Trans. on Audio，Speech and Language Processing，20（1），14–22.

Mohamed，A.，Hinton，G.，and Penn，G.（2012b）. Understanding how deep belief networks perform acoustic modelling. In Acoustics，Speech and Signal Processing（ICASSP），2012 IEEE International Conference on，pages 4273–4276. IEEE.

Moller，M.（1993）. Efficient Training of Feed-Forward Neural Networks. Ph.D. thesis，Aarhus University，Aarhus，Denmark.

Montavon，G. and Muller，K.-R.（2012）. Deep Boltzmann machines and the centering trick. In G. Montavon，G. Orr，and K.-R. Müller，editors，Neural Networks: Tricks of the Trade，volume 7700 of Lecture Notes in Computer Science，pages 621–637. Preprint: http://arxiv.org/abs/1203.3783.

Montúfar，G.（2014）. Universal approximation depth and errors of narrow belief networks with discrete units. Neural Computation，26.

Montúfar，G. and Ay，N.（2011）. Refinements of universal approximation results for deep belief networks and restricted Boltzmann machines. Neural Computation，23（5），1306–1319.

Montufar，G. F.，Pascanu，R.，Cho，K.，and Bengio，Y.（2014）. On the number of linear regions of deep neural networks. In NIPS'2014.

Mor-Yosef，S.，Samueloff，A.，Modan，B.，Navot，D.，and Schenker，J. G.（1990）. Ranking the risk factors for cesarean: logistic regression analysis of a nationwide study. Obstet Gynecol，75（6），944–7.

Morin，F. and Bengio，Y.（2005）. Hierarchical probabilistic neural network language model. In AISTATS'2005.

Mozer，M. C.（1992）. The induction of multiscale temporal structure. In J. M. S. Hanson and R. Lippmann，editors，Advances in Neural Information Processing Systems 4（NIPS'91），pages 275–282，San Mateo，CA. Morgan Kaufmann.

Murphy，K. P.（2012）. Machine Learning: a Probabilistic Perspective. MIT Press，Cambridge，MA，USA.

Murray，B. U. I. and Larochelle，H.（2014）. A deep and tractable density estimator. In ICML'2014.

Nair，V. and Hinton，G.（2010a）. Rectified linear units improve restricted Boltzmann machines. In ICML'2010.

Nair，V. and Hinton，G. E.（2009）. 3d object recognition with deep belief nets. In Y. Bengio，D. Schuurmans，J. D. Lafferty，C. K. I. Williams，and A. Culotta，editors，Advances in Neural Information Processing Systems 22，pages 1339–1347. Curran Associates，Inc.

Nair，V. and Hinton，G. E.（2010b）. Rectified linear units improve restricted Boltzmann machines. In L. Bottou and M. Littman，editors，Proceedings of the Twenty-seventh International Conference on Machine Learning（ICML-10），pages 807–814. ACM.

Narayanan，H. and Mitter，S.（2010）. Sample complexity of testing the manifold hypothesis. In J. Lafferty，C. K. I. Williams，J. Shawe-Taylor，R. Zemel，and A. Culotta，editors，Advances in Neural Information Processing Systems 23，pages 1786–1794.

Naumann，U.（2008）. Optimal Jacobian accumulation is NP-complete. Mathematical Programming，112（2），427–441.

Navigli，R. and Velardi，P.（2005）. Structural semantic interconnections: a knowledge-based approach to word sense disambiguation. IEEE Trans. Pattern Analysis and Machine Intelligence，27（7），1075–1086.

Neal，R. and Hinton，G.（1999）. A view of the EM algorithm that justifies incremental，sparse，and other variants. In M. I. Jordan，editor，Learning in Graphical Models. MIT Press，Cambridge，MA.

Neal，R. M.（1990）. Learning stochastic feedforward networks. Technical report.

Neal，R. M.（1993）. Probabilistic inference using Markov chain Monte-Carlo methods. Technical Report CRG-TR-93-1，Dept. of Computer Science，University of Toronto.

Neal，R. M.（1994）. Sampling from multimodal distributions using tempered transitions. Technical Report 9421，Dept. of Statistics，University of Toronto.

Neal，R. M.（1996）. Bayesian Learning for Neural Networks. Lecture Notes in Statistics. Springer.

Neal，R. M.（2001）. Annealed importance sampling. Statistics and Computing，11（2），125–139.

Neal，R. M.（2005）. Estimating ratios of normalizing constants using linked importance sampling.

Nesterov，Y.（1983）. A method of solving a convex programming problem with convergence rate O（1/k²）. Soviet Mathematics Doklady，27，372–376.

Nesterov，Y.（2004）. Introductory lectures on convex optimization: a basic course. Applied optimization. Kluwer Academic Publ.，Boston，Dordrecht，London.

Netzer，Y.，Wang，T.，Coates，A.，Bissacco，A.，Wu，B.，and Ng，A. Y.（2011）. Reading digits in natural images with unsupervised feature learning. Deep Learning and Unsupervised Feature Learning Workshop，NIPS.

Ney，H. and Kneser，R.（1993）. Improved clustering techniques for class-based statistical language modelling. In European Conference on Speech Communication and Technology（Eurospeech），pages 973–976，Berlin.

Ng，A.（2015）. Advice for applying machine learning. https://see.stanford.edu/materials/aimlcs229/ML-advice.pdf.

Niesler，T. R.，Whittaker，E. W. D.，and Woodland，P. C.（1998）. Comparison of part-of-speech and automatically derived category-based language models for speech recognition. In International Conference on Acoustics，Speech and Signal Processing（ICASSP），pages 177–180.

Ning，F.，Delhomme，D.，LeCun，Y.，Piano，F.，Bottou，L.，and Barbano，P. E.（2005）. To-ward automatic phenotyping of developing embryos from videos. Image Processing，IEEE Transactions on，14（9），1360–1371.

Nocedal，J. and Wright，S.（2006）. Numerical Optimization. Springer.

Norouzi，M. and Fleet，D. J.（2011）. Minimal loss hashing for compact binary codes. In ICML'2011.

Nowlan，S. J.（1990）. Competing experts: An experimental investigation of associative mixture models. Technical Report CRG-TR-90-5，University of Toronto.

Nowlan，S. J. and Hinton，G. E.（1992）. Adaptive soft weight tying using Gaussian mixtures. In J. M. S. Hanson and R. Lippmann，editors，Advances in Neural Information Processing Systems 4（NIPS'91），pages 993–1000，San Mateo，CA. Morgan Kaufmann.

Olshausen，B. and Field，D. J.（2005）. How close are we to understanding V1? Neural Computation，17，1665–1699.

Olshausen，B. A. and Field，D. J.（1996）. Emergence of simple-cell receptivefield properties by learning a sparse code for natural images. Nature，381，607–609.

Olshausen，B. A.，Anderson，C. H.，and Van Essen，D. C.（1993）. A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. J. Neurosci.，13（11），4700–4719.

Opper，M. and Archambeau，C.（2009）. The variational Gaussian approximation revisited. Neural computation，21（3），786–792.

Oquab，M.，Bottou，L.，Laptev，I.，and Sivic，J.（2014）. Learning and transferring mid-level image representations using convolutional neural networks. In Computer Vision and Pattern Recognition（CVPR），2014 IEEE Conference on，pages 1717–1724. IEEE.

Osindero，S. and Hinton，G. E.（2008）. Modeling image patches with a directed hierarchy of Markov randomfields. In J. Platt，D. Koller，Y. Singer，and S. Roweis，editors，Advances in Neural Information Processing Systems 20（NIPS'07），pages 1121–1128，Cambridge，MA. MIT Press.

Ovid and Martin，C.（2004）. Metamorphoses. W.W. Norton.

Paccanaro，A. and Hinton，G. E.（2000）. Extracting distributed representations of concepts and relations from positive and negative propositions. In International Joint Conference on Neural Networks（IJCNN），Como，Italy. IEEE，New York.

Paine，T. L.，Khorrami，P.，Han，W.，and Huang，T. S.（2014）. An analysis of unsupervised pre-training in light of recent advances. arXiv preprint arXiv:1412.6597.

Palatucci，M.，Pomerleau，D.，Hinton，G. E.，and Mitchell，T. M.（2009）. Zero-shot learning with semantic output codes. In Y. Bengio，D. Schuurmans，J. D. Lafferty，C. K. I. Williams，and A. Culotta，editors，Advances in Neural Information Processing Systems 22，pages 1410–1418. Curran Associates，Inc.

Parker，D. B.（1985）. Learning-logic. Technical Report TR-47，Center for Comp. Research in Economics and Management Sci.，MIT.

Pascanu，R.，Mikolov，T.，and Bengio，Y.（2013a）. On the difficulty of training recurrent neural networks. In ICML'2013.

Pascanu，R.，Mikolov，T.，and Bengio，Y.（2013b）. On the difficulty of training recurrent neural networks. In ICM（1c）.

Pascanu，R.，Gulcehre，C.，Cho，K.，and Bengio，Y.（2014a）. How to construct deep recurrent neural networks. In ICLR.

Pascanu，R.，Montufar，G.，and Bengio，Y.（2014b）. On the number of inference regions of deep feed forward networks with piece-wise linear activations. In ICL（1）.

Pati，Y.，Rezaiifar，R.，and Krishnaprasad，P.（1993）. Orthogonal matching pursuit:Recursive function approximation with applications to wavelet decomposition. In Proceedings of the 27 th Annual Asilomar Conference on Signals，Systems，and Computers，pages 40–44.

Pearl，J.（1985）. Bayesian networks: A model of self-activated memory for evidential reasoning. In Proceedings of the 7th Conference of the Cognitive Science Society，University of California，Irvine，pages 329–334.

Pearl，J.（1988）. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann.

Perron，O.（1907）. Zur theorie der matrices. Mathematische Annalen，64（2），248–263.

Petersen，K. B. and Pedersen，M. S.（2006）. The matrix cookbook. Version 20051003.

Peterson，G. B.（2004）. A day of great illumination: B. F. Skinner's discovery of shaping. Journal of the Experimental Analysis of Behavior，82（3），317–328.

Pham，D.-T.，Garat，P.，and Jutten，C.（1992）. Separation of a mixture of independent sources through a maximum likelihood approach. In EUSIPCO，pages 771–774.

Pham，P.-H.，Jelaca，D.，Farabet，C.，Martini，B.，LeCun，Y.，and Culurciello，E.（2012）. Neu-Flow: dataflow vision processing system-on-a-chip. In Circuits and Systems（MWSCAS），2012 IEEE 55th International Midwest Symposium on，pages 1044–1047. IEEE.

Pinheiro，P. H. O. and Collobert，R.（2014）. Recurrent convolutional neural networks for scene labeling. In ICML'2014.

Pinheiro，P. H. O. and Collobert，R.（2015）. From image-level to pixel-level labeling with con-volutional networks. In Conference on Computer Vision and Pattern Recognition（CVPR）.

Pinto，N.，Cox，D. D.，and DiCarlo，J. J.（2008）. Why is real-world visual object recognition hard? PLoS Comput Biol，4.

Pinto，N.，Stone，Z.，Zickler，T.，and Cox，D.（2011）. Scaling up biologically-inspired computer vision: A case study in unconstrained face recognition on facebook. In Computer Vision and Pattern Recognition Workshops（CVPRW），2011 IEEE Computer Society Conference on，pages 35–42. IEEE.

Pollack，J. B.（1990）. Recursive distributed representations. Artificial Intelligence，46（1），77–105.

Polyak，B. and Juditsky，A.（1992）. Acceleration of stochastic approximation by averaging. SIAM J. Control and Optimization，30（4），838–855.

Polyak，B. T.（1964）. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics，4（5），1–17.

Poole，B.，Sohl-Dickstein，J.，and Ganguli，S.（2014）. Analyzing noise in autoencoders and deep networks. CoRR，abs/1406.1831.

Poon，H. and Domingos，P.（2011）. Sum-product networks for deep learning. In Learning Workshop，Fort Lauderdale，FL.

Presley，R. K. and Haggard，R. L.（1994）. Afixed point implementation of the backpropaga-tion learning algorithm. In Southeastcon '94. Creative Technology Transfer-A Global Affair.，Proceedings of the 1994 IEEE，pages 136–138. IEEE.

Price，R.（1958）. A useful theorem for nonlinear devices having Gaussian inputs. IEEE Transactions on Information Theory，4（2），69–72.

Quiroga，R. Q.，Reddy，L.，Kreiman，G.，Koch，C.，and Fried，I.（2005）. Invariant visual representation by single neurons in the human brain. Nature，435（7045），1102–1107.

Radford，A.，Metz，L.，and Chintala，S.（2015）. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.

Raiko，T.，Yao，L.，Cho，K.，and Bengio，Y.（2014）. Iterative neural autoregressive distribution estimator（NADE-k）. Technical report，arXiv:1406.1485.

Raina，R.，Madhavan，A.，and Ng，A. Y.（2009a）. Large-scale deep unsupervised learning using graphics processors. In L. Bottou and M. Littman，editors，Proceedings of the Twenty-sixth International Conference on Machine Learning（ICML'09），pages 873–880，New York，NY，USA. ACM.

Raina，R.，Madhavan，A.，and Ng，A. Y.（2009b）. Large-scale deep unsupervised learning using graphics processors. In ICML'2009.

Ramsey，F. P.（1926）. Truth and probability. In R. B. Braithwaite，editor，The Foundations of Mathematics and other Logical Essays，chapter 7，pages 156–198. McMaster University Archive for the History of Economic Thought.

Ranzato，M. and Hinton，G. H.（2010）. Modeling pixel means and covariances using factorized third-order Boltzmann machines. In CVPR'2010，pages 2551–2558.

Ranzato，M.，Poultney，C.，Chopra，S.，and LeCun，Y.（2007a）. Efficient learning of sparse representations with an energy-based model. In NIPS'2006.

Ranzato，M.，Poultney，C.，Chopra，S.，and LeCun，Y.（2007b）. Efficient learning of sparse representations with an energy-based model. In B. Schölkopf，J. Platt，and T. Hoffman，editors，Advances in Neural Information Processing Systems 19（NIPS'06），pages 1137–1144. MIT Press.

Ranzato，M.，Huang，F.，Boureau，Y.，and LeCun，Y.（2007c）. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In CVPR'07.

Ranzato，M.，Boureau，Y.，and LeCun，Y.（2008）. Sparse feature learning for deep belief networks. In NIPS'2007.

Ranzato，M.，Krizhevsky，A.，and Hinton，G. E.（2010a）. Factored 3-way restricted Boltzmann machines for modeling natural images. In Proceedings of AISTATS 2010.

Ranzato，M.，Mnih，V.，and Hinton，G.（2010b）. Generating more realistic images using gated MRFs. In NIPS'2010.

Rao，C.（1945）. Information and the accuracy attainable in the estimation of statistical param-eters. Bulletin of the Calcutta Mathematical Society，37，81–89.

Rasmus，A.，Valpola，H.，Honkala，M.，Berglund，M.，and Raiko，T.（2015）. Semi-supervised learning with ladder network. arXiv preprint arXiv:1507.02672.

Recht，B.，Re，C.，Wright，S.，and Niu，F.（2011）. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In NIPS'2011.

Reichert，D. P.，Seriès，P.，and Storkey，A. J.（2011）. Neuronal adaptation for sampling-based probabilistic inference in perceptual bistability. In Advances in Neural Information Processing Systems，pages 2357–2365.

Rezende，D. J.，Mohamed，S.，and Wierstra，D.（2014）. Stochastic backpropagation and approx-imate inference in deep generative models. In ICML'2014. Preprint:arXiv:1401.4082.

Rifai，S.，Vincent，P.，Muller，X.，Glorot，X.，and Bengio，Y.（2011a）. Contractive auto-encoders: Explicit invariance during feature extraction. In ICML'2011.

Rifai，S.，Mesnil，G.，Vincent，P.，Muller，X.，Bengio，Y.，Dauphin，Y.，and Glorot，X.（2011b）. Higher order contractive auto-encoder. In ECML PKDD.

Rifai，S.，Dauphin，Y.，Vincent，P.，Bengio，Y.，and Muller，X.（2011c）. The manifold tangent classifier. In NIPS'2011.

Rifai，S.，Dauphin，Y.，Vincent，P.，Bengio，Y.，and Muller，X.（2011d）. The manifold tangent classifier. In NIPS'2011. Student paper award.

Rifai，S.，Bengio，Y.，Dauphin，Y.，and Vincent，P.（2012）. A generative process for sampling contractive auto-encoders. In ICML'2012.

Ringach，D. and Shapley，R.（2004）. Reverse correlation in neurophysiology. Cognitive Science，28（2），147–166.

Roberts，S. and Everson，R.（2001）. Independent component analysis: principles and practice. Cambridge University Press.

Robinson，A. J. and Fallside，F.（1991）. A recurrent error propagation network speech recognition system. Computer Speech and Language，5（3），259–274.

Rockafellar，R. T.（1997）. Convex analysis. princeton landmarks in mathematics.

Romero，A.，Ballas，N.，Ebrahimi Kahou，S.，Chassang，A.，Gatta，C.，and Bengio，Y.（2015）. Fitnets:Hints for thin deep nets. In ICLR'2015，arXiv:1412.6550.

Rosen，J. B.（1960）. The gradient projection method for nonlinear programming. part i. linear constraints. Journal of the Society for Industrial and Applied Mathematics，8（1），pp. 181–217.

Rosenblatt，F.（1958）. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review，65，386–408.

Rosenblatt，F.（1962）. Principles of Neurodynamics. Spartan，New York.

Rosenblatt，M.（1956）. Remarks on some nonparametric estimates of a density function. The Annals of Mathematical Statistics，27（3），832–837.

Roweis，S. and Saul，L. K.（2000）. Nonlinear dimensionality reduction by locally linear embedding. Science，290（5500）.

Roweis，S.，Saul，L.，and Hinton，G.（2002）. Global coordination of local linear models. In T. Dietterich，S. Becker，and Z. Ghahramani，editors，Advances in Neural Information Processing Systems 14（NIPS'01），Cambridge，MA. MIT Press.

Rubin，D. B. et al.（1984）. Bayesianly justifiable and relevant frequency calculations for the applied statistician. The Annals of Statistics，12（4），1151–1172.

Rumelhart，D.，Hinton，G.，and Williams，R.（1986a）. Learning representations by back-propagating errors. Nature，323，533–536.

Rumelhart，D. E.，Hinton，G. E.，and Williams，R. J.（1986b）. Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland，editors，Parallel Distributed Processing，volume 1，chapter 8，pages 318–362. MIT Press，Cambridge.

Rumelhart，D. E.，Hinton，G. E.，and Williams，R. J.（1986c）. Learning representations by back-propagating errors. Nature，323，533–536.

Rumelhart，D. E.，McClelland，J. L.，and the PDP Research Group（1986d）. Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press，Cambridge.

Russakovsky，O.，Deng，J.，Su，H.，Krause，J.，Satheesh，S.，Ma，S.，Huang，Z.，Karpathy，A.，Khosla，A.，Bernstein，M.，Berg，A. C.，and Fei-Fei，L.（2014a）. ImageNet Large Scale Visual Recognition Challenge.

Russakovsky，O.，Deng，J.，Su，H.，Krause，J.，Satheesh，S.，Ma，S.，Huang，Z.，Karpathy，A.，Khosla，A.，Bernstein，M.，et al.（2014b）. Imagenet large scale visual recognition challenge. arXiv preprint arXiv:1409.0575.

Russel，S. J. and Norvig，P.（2003）. Artificial Intelligence:a Modern Approach. Prentice Hall.

Rust，N.，Schwartz，O.，Movshon，J. A.，and Simoncelli，E.（2005）. Spatiotemporal elements of macaque V1 receptivefields. Neuron，46（6），945–956.

Sainath，T.，Mohamed，A.，Kingsbury，B.，and Ramabhadran，B.（2013）. Deep convolutional neural networks for LVCSR. In ICASSP 2013.

Salakhutdinov，R.（2010）. Learning in Markov randomfields using tempered transitions. In Y. Bengio，D. Schuurmans，C. Williams，J. Lafferty，and A. Culotta，editors，Advances in Neural Information Processing Systems 22（NIPS'09）.

Salakhutdinov，R. and Hinton，G.（2009a）. Deep Boltzmann machines. In Proceedings of the International Conference on Artificial Intelligence and Statistics，volume 5，pages 448–455.

Salakhutdinov，R. and Hinton，G.（2009b）. Semantic hashing. In International Journal of Approximate Reasoning.

Salakhutdinov，R. and Hinton，G. E.（2007a）. Learning a nonlinear embedding by preserving class neighbourhood structure. In Proceedings of AISTATS-2007.

Salakhutdinov，R. and Hinton，G. E.（2007b）. Semantic hashing. In SIGIR'2007.

Salakhutdinov，R. and Hinton，G. E.（2008）. Using deep belief nets to learn covariance kernels for Gaussian processes. In J. Platt，D. Koller，Y. Singer，and S. Roweis，editors，Advances in Neural Information Processing Systems 20（NIPS'07），pages 1249–1256，Cambridge，MA. MIT Press.

Salakhutdinov，R. and Larochelle，H.（2010）. Efficient learning of deep Boltzmann machines. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics（AISTATS 2010），JMLR W&CP，volume 9，pages 693–700.

Salakhutdinov，R. and Mnih，A.（2008）. Probabilistic matrix factorization. In NIPS'2008.

Salakhutdinov，R. and Murray，I.（2008）. On the quantitative analysis of deep belief networks. In W. W. Cohen，A. McCallum，and S. T. Roweis，editors，Proceedings of the Twenty-fifth International Conference on Machine Learning（ICML'08），volume 25，pages 872–879. ACM.

Salakhutdinov，R.，Mnih，A.，and Hinton，G.（2007）. Restricted Boltzmann machines for collab-orativefiltering. In ICML.

Sanger，T. D.（1994）. Neural network learning control of robot manipulators using gradually increasing task difficulty. IEEE Transactions on Robotics and Automation，10（3）.

Saul，L. K. and Jordan，M. I.（1996）. Exploiting tractable substructures in intractable networks. In D. Touretzky，M. Mozer，and M. Hasselmo，editors，Advances in Neural Information Processing Systems 8（NIPS'95）. MIT Press，Cambridge，MA.

Saul，L. K.，Jaakkola，T.，and Jordan，M. I.（1996）. Meanfield theory for sigmoid belief networks. Journal of Artificial Intelligence Research，4，61–76.

Savich，A. W.，Moussa，M.，and Areibi，S.（2007）. The impact of arithmetic representation on implementing mlp-bp on fpgas: A study. Neural Networks，IEEE Transactions on，18（1），240–252.

Saxe，A. M.，Koh，P. W.，Chen，Z.，Bhand，M.，Suresh，B.，and Ng，A.（2011）. On random weights and unsupervised feature learning. In Proc. ICML'2011. ACM.

Saxe，A. M.，McClelland，J. L.，and Ganguli，S.（2013）. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In ICLR.

Schaul，T.，Antonoglou，I.，and Silver，D.（2014）. Unit tests for stochastic optimization. In International Conference on Learning Representations.

Schmidhuber，J.（1992）. Learning complex，extended sequences using the principle of history compression. Neural Computation，4（2），234–242.

Schmidhuber，J.（1996）. Sequential neural text compression. IEEE Transactions on Neural Networks，7（1），142–146.

Schmidhuber，J.（2012）. Self-delimiting neural networks. arXiv preprint arXiv:1210.0118.

Schölkopf，B. and Smola，A. J.（2002）. Learning with kernels: Support vector machines，regular-ization，optimization，and beyond. MIT press.

Schölkopf，B.，Burges，C. J. C.，and Smola，A. J.（1998a）. Advances in kernel methods: support vector learning. MIT Press，Cambridge，MA.

Schölkopf，B.，Smola，A.，and Müller，K.-R.（1998b）. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation，10，1299–1319.

Schölkopf，B.，Burges，C. J. C.，and Smola，A. J.（1999）. Advances in Kernel Methods—Support Vector Learning. MIT Press，Cambridge，MA.

Schölkopf，B.，Janzing，D.，Peters，J.，Sgouritsa，E.，Zhang，K.，and Mooij，J.（2012）. On causal and anticausal learning. In ICML'2012，pages 1255–1262.

Schuster，M.（1999）. On supervised learning from sequential data with applications for speech recognition.

Schuster，M. and Paliwal，K.（1997）. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing，45（11），2673–2681.

Schwenk，H.（2007）. Continuous space language models. Computer speech and language，21，492–518.

Schwenk，H.（2010）. Continuous space language models for statistical machine translation. The Prague Bulletin of Mathematical Linguistics，93，137–146.

Schwenk，H.（2014）. Cleaned subset of WMT '14 dataset.

Schwenk，H. and Bengio，Y.（1998）. Training methods for adaptive boosting of neural networks. In M. Jordan，M. Kearns，and S. Solla，editors，Advances in Neural Information Processing Systems 10（NIPS'97），pages 647–653. MIT Press.

Schwenk，H. and Gauvain，J.-L.（2002）. Connectionist language modeling for large vocabulary continuous speech recognition. In International Conference on Acoustics，Speech and Signal Processing（ICASSP），pages 765–768，Orlando，Florida.

Schwenk，H.，Costa-jussà，M. R.，and Fonollosa，J. A. R.（2006）. Continuous space language models for the IWSLT 2006 task. In International Workshop on Spoken Language Translation，pages 166–173.

Seide，F.，Li，G.，and Yu，D.（2011）. Conversational speech transcription using context-dependent deep neural networks. In Interspeech 2011，pages 437–440.

Sejnowski，T.（1987）. Higher-order Boltzmann machines. In AIP Conference Proceedings 151 on Neural Networks for Computing，pages 398–403. American Institute of Physics Inc.

Series，P.，Reichert，D. P.，and Storkey，A. J.（2010）. Hallucinations in Charles Bonnet syndrome induced by homeostasis: a deep Boltzmann machine model. In Advances in Neural Information Processing Systems，pages 2020–2028.

Sermanet，P.，Chintala，S.，and LeCun，Y.（2012）. Convolutional neural networks applied to house numbers digit classification. In International Conference on Pattern Recognition（ICPR 2012）.

Sermanet，P.，Kavukcuoglu，K.，Chintala，S.，and LeCun，Y.（2013）. Pedestrian detection with unsupervised multi-stage feature learning. In Proc. International Conference on Computer Vision and Pattern Recognition（CVPR'13）. IEEE.

Shilov，G.（1977）. Linear Algebra. Dover Books on Mathematics Series. Dover Publications.

Siegelmann，H.（1995）. Computation beyond the Turing limit. Science，268（5210），545–548.

Siegelmann，H. and Sontag，E.（1991）. Turing computability with neural nets. Applied Mathe-matics Letters，4（6），77–80.

Siegelmann，H. T. and Sontag，E. D.（1995）. On the computational power of neural nets. Journal of Computer and Systems Sciences，50（1），132–150.

Sietsma，J. and Dow，R.（1991）. Creating artificial neural networks that generalize. Neural Networks，4（1），67–79.

Simard，D.，Steinkraus，P. Y.，and Platt，J. C.（2003）. Best practices for convolutional neural networks. In ICDAR'2003.

Simard，P. and Graf，H. P.（1994）. Backpropagation without multiplication. In Advances in Neural Information Processing Systems，pages 232–239.

Simard，P.，Victorri，B.，LeCun，Y.，and Denker，J.（1992）. Tangent prop-A formalism for specifying selected invariances in an adaptive network. In NIPS'1991.

Simard，P. Y.，LeCun，Y.，and Denker，J.（1993）. Efficient pattern recognition using a new transformation distance. In NIPS'92.

Simard，P. Y.，LeCun，Y. A.，Denker，J. S.，and Victorri，B.（1998）. Transformation invariance in pattern recognition—tangent distance and tangent propagation. Lecture Notes in Computer Science，1524.

Simons，D. J. and Levin，D. T.（1998）. Failure to detect changes to people during a real-world interaction. Psychonomic Bulletin & Review，5（4），644–649.

Simonyan，K. and Zisserman，A.（2015）. Very deep convolutional networks for large-scale image recognition. In ICLR.

Sjöberg，J. and Ljung，L.（1995）. Overtraining，regularization and searching for a minimum，with application to neural networks. International Journal of Control，62（6），1391–1407.

Skinner，B. F.（1958）. Reinforcement today. American Psychologist，13，94–99.

Smolensky，P.（1986）. Information processing in dynamical systems: Foundations of harmony theory. In D. E. Rumelhart and J. L. McClelland，editors，Parallel Distributed Processing，volume 1，chapter 6，pages 194–281. MIT Press，Cambridge.

Snoek，J.，Larochelle，H.，and Adams，R. P.（2012）. Practical Bayesian optimization of machine learning algorithms. In NIPS'2012.

Socher，R.，Huang，E. H.，Pennington，J.，Ng，A. Y.，and Manning，C. D.（2011a）. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In NIPS'2011.

Socher，R.，Manning，C.，and Ng，A. Y.（2011b）. Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the Twenty-Eighth International Conference on Machine Learning（ICML'2011）.

Socher，R.，Pennington，J.，Huang，E. H.，Ng，A. Y.，and Manning，C. D.（2011c）. Semi-supervised recursive autoencoders for predicting sentiment distributions. In EMNLP'2011.

Socher，R.，Perelygin，A.，Wu，J. Y.，Chuang，J.，Manning，C. D.，Ng，A. Y.，and Potts，C.（2013a）. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP'2013.

Socher，R.，Ganjoo，M.，Manning，C. D.，and Ng，A. Y.（2013b）. Zero-shot learning through cross-modal transfer. In 27th Annual Conference on Neural Information Processing Systems（NIPS 2013）.

Sohl-Dickstein，J.，Weiss，E. A.，Maheswaranathan，N.，and Ganguli，S.（2015）. Deep unsuper-vised learning using nonequilibrium thermodynamics.

Sohn，K.，Zhou，G.，and Lee，H.（2013）. Learning and selecting features jointly with point-wise gated Boltzmann machines. In ICML'2013.

Solomonoff，R. J.（1989）. A system for incremental learning based on algorithmic probability.

Sontag，E. D.（1998）. VC dimension of neural networks. NATO ASI Series F Computer and Systems Sciences，168，69–96.

Sontag，E. D. and Sussman，H. J.（1989）. Backpropagation can give rise to spurious local minima even for networks without hidden layers. Complex Systems，3，91–106.

Sparkes，B.（1996）. The Red and the Black: Studies in Greek Pottery. Routledge.

Spitkovsky，V. I.，Alshawi，H.，and Jurafsky，D.（2010）. From baby steps to leapfrog: how“less is more”in unsupervised dependency parsing. In HLT'10.

Squire，W. and Trapp，G.（1998）. Using complex variables to estimate derivatives of real functions. SIAM Rev.，40（1），110–112.

Srebro，N. and Shraibman，A.（2005）. Rank，trace-norm and max-norm. In Proceedings of the 18th Annual Conference on Learning Theory，pages 545–560. Springer-Verlag.

Srivastava，N.（2013）. Improving Neural Networks With Dropout. Master's thesis，U. Toronto.

Srivastava，N. and Salakhutdinov，R.（2012）. Multimodal learning with deep Boltzmann machines. In NIPS'2012.

Srivastava，N.，Salakhutdinov，R. R.，and Hinton，G. E.（2013）. Modeling documents with deep Boltzmann machines. arXiv preprint arXiv:1309.6865.

Srivastava，N.，Hinton，G.，Krizhevsky，A.，Sutskever，I.，and Salakhutdinov，R.（2014）. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research，15，1929–1958.

Srivastava，R. K.，Greff，K.，and Schmidhuber，J.（2015）. Highway networks. arXiv:1505.00387.

Steinkrau，D.，Simard，P. Y.，and Buck，I.（2005）. Using GPUs for machine learning algorithms. 2013 12th International Conference on Document Analysis and Recognition，0，1115–1119.

Stoyanov，V.，Ropson，A.，and Eisner，J.（2011）. Empirical risk minimization of graphical model parameters given approximate inference，decoding，and model structure. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics（AISTATS），volume 15 of JMLR Workshop and Conference Proceedings，pages 725–733，Fort Lauderdale. Supplemen-tary material（4 pages） also available.

Sukhbaatar，S.，Szlam，A.，Weston，J.，and Fergus，R.（2015）. Weakly supervised memory networks. arXiv preprint arXiv:1503.08895.

Supancic，J. and Ramanan，D.（2013）. Self-paced learning for long-term tracking. In CVPR'2013.

Sussillo，D.（2014）. Random walks:Training very deep nonlinear feed-forward networks with smart initialization. CoRR，abs/1412.6558.

Sutskever，I.（2012）. Training Recurrent Neural Networks. Ph.D. thesis，Department of computer science，University of Toronto.

Sutskever，I. and Hinton，G. E.（2008）. Deep narrow sigmoid belief networks are universal approximators. Neural Computation，20（11），2629–2636.

Sutskever，I. and Tieleman，T.（2010）. On the Convergence Properties of Contrastive Divergence. In AISTATS'2010.

Sutskever，I.，Hinton，G.，and Taylor，G.（2009）. The recurrent temporal restricted Boltzmann machine. In NIPS'2008.

Sutskever，I.，Martens，J.，and Hinton，G. E.（2011）. Generating text with recurrent neural networks. In ICML'2011，pages 1017–1024.

Sutskever，I.，Martens，J.，Dahl，G.，and Hinton，G.（2013）. On the importance of initialization and momentum in deep learning. In ICML.

Sutskever，I.，Vinyals，O.，and Le，Q. V.（2014）. Sequence to sequence learning with neural networks. In NIPS'2014，arXiv:1409.3215.

Sutton，R. and Barto，A.（1998）. Reinforcement Learning: An Introduction. MIT Press.

Sutton，R. S.，Mcallester，D.，Singh，S.，and Mansour，Y.（2000）. Policy gradient methods for reinforcement learning with function approximation. In NIPS'1999，pages 1057–1063. MIT Press.

Swersky，K.，Ranzato，M.，Buchman，D.，Marlin，B.，and de Freitas，N.（2011）. On autoencoders and score matching for energy based models. In ICML'2011. ACM.

Swersky，K.，Snoek，J.，and Adams，R. P.（2014）. Freeze-thaw Bayesian optimization. arXiv preprint arXiv:1406.3896.

Szegedy，C.，Liu，W.，Jia，Y.，Sermanet，P.，Reed，S.，Anguelov，D.，Erhan，D.，Vanhoucke，V.，and Rabinovich，A.（2014a）. Going deeper with convolutions. Technical report，arXiv:1409.4842.

Szegedy，C.，Zaremba，W.，Sutskever，I.，Bruna，J.，Erhan，D.，Goodfellow，I. J.，and Fergus，R.（2014b）. Intriguing properties of neural networks. ICLR，abs/1312.6199.

Szegedy，C.，Vanhoucke，V.，Ioffe，S.，Shlens，J.，and Wojna，Z.（2015）. Rethinking the Inception Architecture for Computer Vision. ArXiv e-prints.

Taigman，Y.，Yang，M.，Ranzato，M.，and Wolf，L.（2014）. DeepFace: Closing the gap to human-level performance in face verification. In CVPR'2014.

Tandy，D. W.（1997）. Works and Days: A Translation and Commentary for the Social Sciences. University of California Press.

Tang，Y. and Eliasmith，C.（2010）. Deep networks for robust visual recognition. In Proceedings of the 27th International Conference on Machine Learning，June 21-24，2010，Haifa，Israel.

Tang，Y.，Salakhutdinov，R.，and Hinton，G.（2012）. Deep mixtures of factor analysers. arXiv preprint arXiv:1206.4635.

Taylor，G. and Hinton，G.（2009）. Factored conditional restricted Boltzmann machines for modeling motion style. In L. Bottou and M. Littman，editors，Proceedings of the Twenty-sixth International Conference on Machine Learning（ICML'09），pages 1025–1032，Montreal，Quebec，Canada. ACM.

Taylor，G.，Hinton，G. E.，and Roweis，S.（2007）. Modeling human motion using binary latent variables. In B. Schölkopf，J. Platt，and T. Hoffman，editors，Advances in Neural Information Processing Systems 19（NIPS'06），pages 1345–1352. MIT Press，Cambridge，MA.

Teh，Y.，Welling，M.，Osindero，S.，and Hinton，G. E.（2003）. Energy-based models for sparse overcomplete representations. Journal of Machine Learning Research，4，1235–1260.

Tenenbaum，J.，de Silva，V.，and Langford，J. C.（2000）. A global geometric framework for nonlinear dimensionality reduction. Science，290（5500），2319–2323.

Theis，L.，van den Oord，A.，and Bethge，M.（2015）. A note on the evaluation of generative models. arXiv:1511.01844.

Thompson，J.，Jain，A.，LeCun，Y.，and Bregler，C.（2014）. Joint training of a convolutional network and a graphical model for human pose estimation. In NIPS'2014.

Thrun，S.（1995）. Learning to play the game of chess. In NIPS'1994.

Tibshirani，R. J.（1995）. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society B，58，267–288.

Tieleman，T.（2008）. Training restricted Boltzmann machines using approximations to the like-lihood gradient. In ICML'2008，pages 1064–1071.

Tieleman，T. and Hinton，G.（2009）. Using fast weights to improve persistent contrastive diver-gence. In ICML'2009.

Tipping，M. E. and Bishop，C. M.（1999）. Probabilistic principal components analysis. Journal of the Royal Statistical Society B，61（3），611–622.

Torralba，A.，Fergus，R.，and Weiss，Y.（2008）. Small codes and large databases for recognition. In Proceedings of the Computer Vision and Pattern Recognition Conference（CVPR'08），pages 1–8.

Touretzky，D. S. and Minton，G. E.（1985）. Symbols among the neurons: Details of a con-nectionist inference architecture. In Proceedings of the 9th International Joint Conference on Artificial Intelligence-Volume 1，IJCAI'85，pages 238–243，San Francisco，CA，USA. Morgan Kaufmann Publishers Inc.

Tu，K. and Honavar，V.（2011）. On the utility of curricula in unsupervised learning of probabilistic grammars. In IJCAI'2011.

Turaga，S. C.，Murray，J. F.，Jain，V.，Roth，F.，Helmstaedter，M.，Briggman，K.，Denk，W.，and Seung，H. S.（2010）. Convolutional networks can learn to generate affinity graphs for image segmentation. Neural Computation，22，511–538.

Turian，J.，Ratinov，L.，and Bengio，Y.（2010）. Word representations: A simple and general method for semi-supervised learning. In Proc. ACL'2010，pages 384–394.

Töscher，A.，Jahrer，M.，and Bell，R. M.（2009）. The BigChaos solution to the Netflix grand prize.

Uria，B.，Murray，I.，and Larochelle，H.（2013）. Rnade: The real-valued neural autoregressive density-estimator. In NIPS'2013.

van den Oörd，A.，Dieleman，S.，and Schrauwen，B.（2013）. Deep content-based music recom-mendation. In NIPS'2013.

van der Maaten，L. and Hinton，G. E.（2008）. Visualizing data using t-SNE. J. Machine Learning Res.，9.

Vanhoucke，V.，Senior，A.，and Mao，M. Z.（2011）. Improving the speed of neural networks on CPUs. In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop.

Vapnik，V. N.（1982）. Estimation of Dependences Based on Empirical Data. Springer-Verlag，Berlin.

Vapnik，V. N.（1995）. The Nature of Statistical Learning Theory. Springer，New York.

Vapnik，V. N. and Chervonenkis，A. Y.（1971）. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications，16，264–280.

Vincent，P.（2011）. A connection between score matching and denoising autoencoders. Neural Computation，23（7）.

Vincent，P. and Bengio，Y.（2003）. Manifold Parzen windows. In NIPS'2002. MIT Press.

Vincent，P.，Larochelle，H.，Bengio，Y.，and Manzagol，P.-A.（2008a）. Extracting and composing robust features with denoising autoencoders. In ICM（1a），pages 1096–1103.

Vincent，P.，Larochelle，H.，Bengio，Y.，and Manzagol，P.-A.（2008b）. Extracting and composing robust features with denoising autoencoders. In ICML 2008.

Vincent，P.，Larochelle，H.，Lajoie，I.，Bengio，Y.，and Manzagol，P.-A.（2010）. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Machine Learning Res.，11.

Vincent，P.，de Brébisson，A.，and Bouthillier，X.（2015）. Efficient exact gradient update for training deep networks with very large sparse targets. In C. Cortes，N. D. Lawrence，D. D. Lee，M. Sugiyama，and R. Garnett，editors，Advances in Neural Information Processing Systems 28，pages 1108–1116. Curran Associates，Inc.

Vinyals，O.，Kaiser，L.，Koo，T.，Petrov，S.，Sutskever，I.，and Hinton，G.（2014a）. Grammar as a foreign language. arXiv preprint arXiv:1412.7449.

Vinyals，O.，Toshev，A.，Bengio，S.，and Erhan，D.（2014b）. Show and tell:a neural image caption generator. arXiv 1411.4555.

Vinyals，O.，Fortunato，M.，and Jaitly，N.（2015a）. Pointer networks. arXiv preprint arXiv:1506.03134.

Vinyals，O.，Toshev，A.，Bengio，S.，and Erhan，D.（2015b）. Show and tell:a neural image caption generator. In CVPR'2015. arXiv:1411.4555.

Viola，P. and Jones，M.（2001）. Robust real-time object detection. In International Journal of Computer Vision.

Visin，F.，Kastner，K.，Cho，K.，Matteucci，M.，Courville，A.，and Bengio，Y.（2015）. ReNet: A recurrent neural network based alternative to convolutional networks. arXiv preprint arXiv:1505.00393.

Von Melchner，L.，Pallas，S. L.，and Sur，M.（2000）. Visual behaviour mediated by retinal projections directed to the auditory pathway. Nature，404（6780），871–876.

Wager，S.，Wang，S.，and Liang，P.（2013）. Dropout training as adaptive regularization. In Advances in Neural Information Processing Systems 26，pages 351–359.

Waibel，A.，Hanazawa，T.，Hinton，G. E.，Shikano，K.，and Lang，K.（1989）. Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics，Speech，and Signal Processing，37，328–339.

Wan，L.，Zeiler，M.，Zhang，S.，LeCun，Y.，and Fergus，R.（2013）. Regularization of neural networks using dropconnect. In ICML'2013.

Wang，S. and Manning，C.（2013）. Fast dropout training. In ICML'2013.

Wang，Z.，Zhang，J.，Feng，J.，and Chen，Z.（2014a）. Knowledge graph and text jointly embedding. In Proc. EMNLP'2014.

Wang，Z.，Zhang，J.，Feng，J.，and Chen，Z.（2014b）. Knowledge graph embedding by translating on hyperplanes. In Proc. AAAI'2014.

Warde-Farley，D.，Goodfellow，I. J.，Courville，A.，and Bengio，Y.（2014）. An empirical analysis of dropout in piecewise linear networks. In ICL（1）.

Wawrzynek，J.，Asanovic，K.，Kingsbury，B.，Johnson，D.，Beck，J.，and Morgan，N.（1996）. Spert-II: A vector microprocessor system. Computer，29（3），79–86.

Weaver，L. and Tao，N.（2001）. The optimal reward baseline for gradient-based reinforcement learning. In Proc. UAI'2001，pages 538–545.

Weinberger，K. Q. and Saul，L. K.（2004a）. Unsupervised learning of image manifolds by semidefi-nite programming. In Proceedings of the Computer Vision and Pattern Recognition Conference（CVPR'04），volume 2，pages 988–995，Washington D.C.

Weinberger，K. Q. and Saul，L. K.（2004b）. Unsupervised learning of image manifolds by semidefinite programming. In CVPR'2004，pages 988–995.

Weiss，Y.，Torralba，A.，and Fergus，R.（2008）. Spectral hashing. In NIPS，pages 1753–1760.

Welling，M.，Zemel，R. S.，and Hinton，G. E.（2002）. Self supervised boosting. In Advances in Neural Information Processing Systems，pages 665–672.

Welling，M.，Hinton，G. E.，and Osindero，S.（2003a）. Learning sparse topographic representa-tions with products of Student-t distributions. In NIPS'2002.

Welling，M.，Zemel，R.，and Hinton，G. E.（2003b）. Self-supervised boosting. In S. Becker，S. Thrun，and K. Obermayer，editors，Advances in Neural Information Processing Systems 15（NIPS'02），pages 665–672. MIT Press.

Welling，M.，Rosen-Zvi，M.，and Hinton，G. E.（2005）. Exponential family harmoniums with an application to information retrieval. In L. Saul，Y. Weiss，and L. Bottou，editors，Advances in Neural Information Processing Systems 17（NIPS'04），volume 17，Cambridge，MA. MIT Press.

Werbos，P. J.（1981）. Applications of advances in nonlinear sensitivity analysis. In Proceedings of the 10th IFIP Conference，31.8-4.9，NYC，pages 762–770.

Weston，J.，Bengio，S.，and Usunier，N.（2010）. Large scale image annotation: learning to rank with joint word-image embeddings. Machine Learning，81（1），21–35.

Weston，J.，Chopra，S.，and Bordes，A.（2014）. Memory networks. arXiv preprint arXiv:1410.3916.

Widrow，B. and Hoff，M. E.（1960）. Adaptive switching circuits. In 1960 IRE WESCON Convention Record，volume 4，pages 96–104. IRE，New York.

Wikipedia（2015）. List of animals by number of neurons—Wikipedia，the free encyclopedia. ［Online；accessed 4-March-2015］.

Williams，C. K. I. and Agakov，F. V.（2002）. Products of Gaussians and Probabilistic Minor Component Analysis. Neural Computation，14（5），1169–1182.

Williams，C. K. I. and Rasmussen，C. E.（1996）. Gaussian processes for regression. In D. Touretzky，M. Mozer，and M. Hasselmo，editors，Advances in Neural Information Processing Systems 8（NIPS'95），pages 514–520. MIT Press，Cambridge，MA.

Williams，R. J.（1992）. Simple statistical gradient-following algorithms connectionist reinforcement learning. Machine Learning，8，229–256.

Williams，R. J. and Zipser，D.（1989）. A learning algorithm for continually running fully recurrent neural networks. Neural Computation，1，270–280.

Wilson，D. R. and Martinez，T. R.（2003）. The general inefficiency of batch training for gradient descent learning. Neural Networks，16（10），1429–1451.

Wilson，J. R.（1984）. Variance reduction techniques for digital simulation. American Journal of Mathematical and Management Sciences，4（3），277–312.

Wiskott，L. and Sejnowski，T. J.（2002）. Slow feature analysis: Unsupervised learning of invari-ances. Neural Computation，14（4），715–770.

Wolpert，D. and MacReady，W.（1997）. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation，1，67–82.

Wolpert，D. H.（1996）. The lack of a priori distinction between learning algorithms. Neural Computation，8（7），1341–1390.

Wu，R.，Yan，S.，Shan，Y.，Dang，Q.，and Sun，G.（2015）. Deep image: Scaling up image recognition. arXiv:1501.02876.

Wu，Z.（1997）. Global continuation for distance geometry problems. SIAM Journal of Optimization，7，814–836.

Xiong，H. Y.，Barash，Y.，and Frey，B. J.（2011）. Bayesian prediction of tissue-regulated splicing using RNA sequence and cellular context. Bioinformatics，27（18），2554–2562.

Xu，K.，Ba，J. L.，Kiros，R.，Cho，K.，Courville，A.，Salakhutdinov，R.，Zemel，R. S.，and Bengio，Y.（2015）. Show，attend and tell: Neural image caption generation with visual attention. In ICML'2015，arXiv:1502.03044.

Yildiz，I. B.，Jaeger，H.，and Kiebel，S. J.（2012）. Re-visiting the echo state property. Neural networks，35，1–9.

Yosinski，J.，Clune，J.，Bengio，Y.，and Lipson，H.（2014）. How transferable are features in deep neural networks? In NIPS 27，pages 3320–3328. Curran Associates，Inc.

Younes，L.（1998）. On the convergence of Markovian stochastic algorithms with rapidly decreasing ergodicity rates. In Stochastics and Stochastics Models，pages 177–228.

Yu，D.，Wang，S.，and Deng，L.（2010）. Sequential labeling using deep-structured conditional randomfields. IEEE Journal of Selected Topics in Signal Processing.

Zaremba，W. and Sutskever，I.（2014）. Learning to execute. arXiv 1410.4615.

Zaremba，W. and Sutskever，I.（2015）. Reinforcement learning neural Turing machines. arXiv:1505.00521.

Zaslavsky，T.（1975）. Facing Up to Arrangements: Face-Count Formulas for Partitions of Space by Hyperplanes. Number no. 154 in Memoirs of the American Mathematical Society. American Mathematical Society.

Zeiler，M. D. and Fergus，R.（2014）. Visualizing and understanding convolutional networks. In ECCV'14.

Zeiler，M. D.，Ranzato，M.，Monga，R.，Mao，M.，Yang，K.，Le，Q.，Nguyen，P.，Senior，A.，Vanhoucke，V.，Dean，J.，and Hinton，G. E.（2013）. On rectified linear units for speech processing. In ICASSP 2013.

Zhou，B.，Khosla，A.，Lapedriza，A.，Oliva，A.，and Torralba，A.（2015）. Object detectors emerge in deep scene CNNs. ICLR'2015，arXiv:1412.6856.

Zhou，J. and Troyanskaya，O. G.（2014）. Deep supervised and convolutional generative stochastic network for protein secondary structure prediction. In ICML'2014.

Zhou，Y. and Chellappa，R.（1988）. Computation of opticalflow using a neural network. In Neural Networks，1988.，IEEE International Conference on，pages 71–78. IEEE.

Zöhrer，M. and Pernkopf，F.（2014）. General stochastic networks for classification. In NIPS'2014.

索引

绝对值整流absolute value rectification

准确率accuracy

声学acoustic

激活函数activation function

AdaGrad AdaGrad

对抗adversarial

对抗样本adversarial example

对抗训练adversarial training

几乎处处almost everywhere

几乎必然almost sure

几乎必然收敛almost sure convergence

选择性剪接数据集alternative splicing dataset

原始采样ancestral sampling

退火重要采样annealed importance sampling

专用集成电路application-specific integrated circuit

近似贝叶斯计算approximate Bayesian computa-tion

近似推断approximate inference

架构architecture

人工智能artificial intelligence

人工神经网络artificial neural network

渐近无偏asymptotically unbiased

异步随机梯度下降Asynchoronous Stochastic Gradient Descent

异步asynchronous

注意力机制attention mechanism

属性attribute

自编码器autoencoder

自动微分automatic differentiation

自动语音识别Automatic Speech Recognition

自回归网络auto-regressive network

反向传播back propagation

回退back-off

反向传播backprop

通过时间反向传播back-propagation through time

词袋bag of words

Bagging bootstrap aggregating

bandit bandit

批量batch

批标准化batch normalization

贝叶斯误差Bayes error

贝叶斯规则Bayes' rule

贝叶斯推断Bayesian inference

贝叶斯网络Bayesian network

贝叶斯概率Bayesian probability

贝叶斯统计Bayesian statistics

基准bechmark

信念网络belief network

Bernoulli分布Bernoulli distribution

基准baseline

BFGS BFGS

偏置bias in affine function

偏差bias in statistics

有偏biased

有偏重要采样biased importance sampling

偏差biass

二元语法bigram

二元关系binary relation

二值稀疏编码binary sparse coding

比特bit

块坐标下降block coordinate descent

块吉布斯采样block Gibbs Sampling

玻尔兹曼分布Boltzmann distribution

玻尔兹曼机Boltzmann Machine

Boosting Boosting

桥式采样bridge sampling

广播broadcasting

磨合Burning-in

变分法calculus of variations

容量capacity

级联cascade

灾难遗忘catastrophic forgetting

范畴分布categorical distribution

因果因子causal factor

因果模型causal modeling

中心差分centered difference

中心极限定理central limit theorem

链式法则chain rule

混沌chaos

弦chord

弦图chordal graph

梯度截断clip gradient

截断梯度clipping the gradient

团clique

团势能clique potential

闭式解closed form solution

级联coalesced

编码code

协同过滤collaborativefiltering

列column

列空间column space

共因common cause

完全图complete graph

复杂细胞complex cell

计算图computational graph

计算机视觉Computer Vision

概念漂移concept drift

条件计算conditional computation

条件概率conditional probability

条件独立的conditionally independent

共轭conjugate

共轭方向conjugate directions

共轭梯度conjugate gradient

联结主义connectionism

一致性consistency

约束优化constrained optimization

特定环境下的独立context-specific independences

contextual bandit contextual bandit

延拓法continuation method

收缩contractive

收缩自编码器contractive autoencoder

对比散度contrastive divergence

凸优化Convex optimization

卷积convolution

卷积玻尔兹曼机Convolutional Boltzmann Machine

卷积网络convolutional net

卷积神经网络convolutional neural network

坐标上升coordinate ascent

坐标下降coordinate descent

共父coparent

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

列表为空，暂无数据

20.15 结论

如果你对这篇内容有疑问，欢迎到本站社区发帖提问 参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。