
20.15 Conclusion


Training generative models with hidden units is a powerful way to make models understand the world represented in the given training data. By learning a model and a representation, a generative model can answer many inference questions about the relationships between the input variables in x, and can offer many different ways of representing x by taking expectations of h at different layers. Generative models hold the promise of providing AI systems with a framework for the many different concepts they need to understand, giving them the ability to reason about these concepts in the face of uncertainty. We hope our readers will find new ways to make these approaches more powerful, and continue the journey toward understanding the principles that underlie learning and intelligence.
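To make the point about representations concrete, the sketch below (a minimal illustration assuming a binary RBM with hypothetical, untrained parameters W1, b1, W2, b2, not any code from the text) computes the conditional expectation E[h | v] of the hidden units and stacks two such expectations, yielding representations of the input x at different layers:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def expected_hidden(v, W, b):
    """E[h | v] for a binary RBM: each hidden unit is Bernoulli with
    mean sigmoid(v . W_j + b_j), so the expectation is that mean."""
    return sigmoid(v @ W + b)

rng = np.random.default_rng(0)
v = rng.random((5, 784))                      # a small batch of inputs x
W1 = rng.normal(scale=0.01, size=(784, 256))  # hypothetical first-layer weights
b1 = np.zeros(256)
W2 = rng.normal(scale=0.01, size=(256, 64))   # hypothetical second-layer weights
b2 = np.zeros(64)

h1 = expected_hidden(v, W1, b1)   # representation of x at the first layer
h2 = expected_hidden(h1, W2, b2)  # a deeper representation, stacking expectations
```

In a trained model the same expectations would be taken under the learned parameters; each layer's E[h | x] is one of the "many different ways of representing x" mentioned above.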


————————————————————

(1) The term "mcRBM" is pronounced by saying the letters M-C-R-B-M; the "mc" is not pronounced like the "Mc" in "McDonald's."

(2) This version of the Gaussian-Bernoulli RBM energy function assumes that each pixel of the image data has zero mean. To account for nonzero pixel means, pixel offsets can simply be added to the model.
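As a sketch of what this means (one common unit-variance form, assuming the notation v for the visible pixels, h for the hidden units, W for the weights and b for the hidden biases; the exact form in the text may differ), the zero-mean energy is

E(\mathbf{v}, \mathbf{h}) = \frac{1}{2}\mathbf{v}^\top \mathbf{v} - \mathbf{v}^\top \mathbf{W} \mathbf{h} - \mathbf{b}^\top \mathbf{h},

giving the conditional p(\mathbf{v} \mid \mathbf{h}) = \mathcal{N}(\mathbf{v};\, \mathbf{W}\mathbf{h},\, \mathbf{I}). Adding an offset vector \mathbf{c} holding the pixel means,

E(\mathbf{v}, \mathbf{h}) = \frac{1}{2}\mathbf{v}^\top \mathbf{v} - \mathbf{c}^\top \mathbf{v} - \mathbf{v}^\top \mathbf{W} \mathbf{h} - \mathbf{b}^\top \mathbf{h},

shifts the conditional mean to \mathbf{c} + \mathbf{W}\mathbf{h}, accommodating nonzero pixel means.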

(3) The paper describes the model as a "deep belief network," but because it can be described as a purely undirected model (with tractable layer-wise mean field fixed point updates), it best fits the definition of a deep Boltzmann machine.
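For a deep Boltzmann machine with two hidden layers, such layer-wise mean field fixed point updates take roughly the following form (a sketch assuming the standard notation: \sigma the logistic sigmoid, W^{(1)}, W^{(2)} the layer weights, biases omitted), iterated to convergence:

\hat{h}^{(1)}_j \leftarrow \sigma\!\left( \sum_i v_i W^{(1)}_{ij} + \sum_k W^{(2)}_{jk} \hat{h}^{(2)}_k \right), \qquad \hat{h}^{(2)}_k \leftarrow \sigma\!\left( \sum_j \hat{h}^{(1)}_j W^{(2)}_{jk} \right).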


Glorot,X. and Bengio,Y.(2010). Understanding the difficulty of training deep feedforward neural networks. In AISTATS'2010.

Glorot,X.,Bordes,A.,and Bengio,Y.(2011a). Deep sparse rectifier neural networks. In AISTATS'2011.

Glorot,X.,Bordes,A.,and Bengio,Y.(2011b). Domain adaptation for large-scale sentiment classification: A deep learning approach. In ICML'2011.

Glorot,X.,Bordes,A.,and Bengio,Y.(2011c). Domain adaptation for large-scale sentiment classification: A deep learning approach. In ICM(1b),pages 97–110.

Goldberger,J.,Roweis,S.,Hinton,G. E.,and Salakhutdinov,R.(2005). Neighbourhood components analysis. In L. Saul,Y. Weiss,and L. Bottou,editors,Advances in Neural Information Processing Systems 17(NIPS'04). MIT Press.

Gong,S.,McKenna,S.,and Psarrou,A.(2000). Dynamic Vision: From Images to Face Recognition. Imperial College Press.

Goodfellow,I.,Le,Q.,Saxe,A.,and Ng,A.(2009). Measuring invariances in deep networks. In Y. Bengio,D. Schuurmans,C. Williams,J. Lafferty,and A. Culotta,editors,Advances in Neural Information Processing Systems 22(NIPS'09),pages 646–654.

Goodfellow,I.,Koenig,N.,Muja,M.,Pantofaru,C.,Sorokin,A.,and Takayama,L.(2010). Help me help you: Interfaces for personal robots. In Proc. of Human Robot Interaction(HRI),Osaka,Japan. ACM Press,ACM Press.

Goodfellow,I.,Mirza,M.,Xiao,D.,Courville,A.,and Bengio,Y.(2014a). An empirical inves-tigation of catastrophic forgetting in gradient-based neural networks. In ICLR'14.

Goodfellow,I. J.(2010). Technical report:Multidimensional,downsampled convolution for autoencoders. Technical report,Université de Montréal.

Goodfellow,I. J.(2014). On distinguishability criteria for estimating generative models. In International Conference on Learning Representations,Workshops Track.

Goodfellow,I. J.,Courville,A.,and Bengio,Y.(2011). Spike-and-slab sparse coding for unsu-pervised feature discovery. In NIPS Workshop on Challenges in Learning Hierarchical Models.

Goodfellow,I. J.,Warde-Farley,D.,Mirza,M.,Courville,A.,and Bengio,Y.(2013a). Maxout networks. In ICML'2013.

Goodfellow,I. J.,Warde-Farley,D.,Mirza,M.,Courville,A.,and Bengio,Y.(2013b). Maxout networks. In ICM(1c),pages 1319–1327.

Goodfellow,I. J.,Warde-Farley,D.,Mirza,M.,Courville,A.,and Bengio,Y.(2013c). Maxout networks. Technical Report arXiv:1302.4389,Université de Montréal.

Goodfellow,I. J.,Mirza,M.,Courville,A.,and Bengio,Y.(2013d). Multi-prediction deep Boltzmann machines. In NIP(1).

Goodfellow,I. J.,Warde-Farley,D.,Lamblin,P.,Dumoulin,V.,Mirza,M.,Pascanu,R.,Bergstra,J.,Bastien,F.,and Bengio,Y.(2013e). Pylearn2: a machine learning research library. arXiv preprint arXiv:1308.4214.

Goodfellow,I. J.,Courville,A.,and Bengio,Y.(2013f). Scaling up spike-and-slab models for unsupervised feature learning. IEEE T. PAMI,pages 1902–1914.

Goodfellow,I. J.,Courville,A.,and Bengio,Y.(2013g). Scaling up spike-and-slab models for un-supervised feature learning. IEEE Transactions on Pattern Analysis and Machine Intelligence,35(8),1902–1914.

Goodfellow,I. J.,Shlens,J.,and Szegedy,C.(2014b). Explaining and harnessing adversarial examples. CoRR,abs/1412.6572.

Goodfellow,I. J.,Pouget-Abadie,J.,Mirza,M.,Xu,B.,Warde-Farley,D.,Ozair,S.,Courville,A.,and Bengio,Y.(2014c). Generative adversarial networks. In NIPS'2014.

Goodfellow,I. J.,Bulatov,Y.,Ibarz,J.,Arnoud,S.,and Shet,V.(2014d). Multi-digit number recognition from Street View imagery using deep convolutional neural networks. In International Conference on Learning Representations.

Goodfellow,I. J.,Vinyals,O.,and Saxe,A. M.(2015). Qualitatively characterizing neural network optimization problems. In International Conference on Learning Representations.

Goodman,J.(2001). Classes for fast maximum entropy training. In International Conference on Acoustics,Speech and Signal Processing(ICASSP),Utah.

Gori,M. and Tesi,A.(1992). On the problem of local minima in backpropagation. IEEE Transactions on Pattern Analysis and Machine Intelligence,PAMI-14(1),76–86.

Gosset,W. S.(1908). The probable error of a mean. Biometrika,6(1),1–25. Originally published under the pseudonym“Student”.

Gouws,S.,Bengio,Y.,and Corrado,G.(2014). BilBOWA: Fast bilingual distributed representations without word alignments. Technical report,arXiv:1410.2455.

Graf,H. P. and Jackel,L. D.(1989). Analog electronic neural network circuits. Circuits and Devices Magazine,IEEE,5(4),44–49.

Graves,A.(2011). Practical variational inference for neural networks. In NIPS'2011.

Graves,A.(2012). Supervised Sequence Labelling with Recurrent Neural Networks. Studies in Computational Intelligence. Springer.

Graves,A.(2013). Generating sequences with recurrent neural networks. Technical report,arXiv:1308.0850.

Graves,A. and Jaitly,N.(2014). Towards end-to-end speech recognition with recurrent neural networks. In ICML'2014.

Graves,A. and Schmidhuber,J.(2005). Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks,18(5),602–610.

Graves,A. and Schmidhuber,J.(2009). Offine handwriting recognition with multidimensional recurrent neural networks. In D. Koller,D. Schuurmans,Y. Bengio,and L. Bottou,editors,NIPS'2008,pages 545–552.

Graves,A.,Fernández,S.,Gomez,F.,and Schmidhuber,J.(2006). Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In ICML'2006,pages 369–376,Pittsburgh,USA.

Graves,A.,Liwicki,M.,Bunke,H.,Schmidhuber,J.,and Fernández,S.(2008). Unconstrained on-line handwriting recognition with recurrent neural networks. In J. Platt,D. Koller,Y. Singer,and S. Roweis,editors,NIPS'2007,pages 577–584.

Graves,A.,Liwicki,M.,Fernández,S.,Bertolami,R.,Bunke,H.,and Schmidhuber,J.(2009). A novel connectionist system for unconstrained handwriting recognition. Pattern Analysis and Machine Intelligence,IEEE Transactions on,31(5),855–868.

Graves,A.,Mohamed,A.,and Hinton,G.(2013). Speech recognition with deep recurrent neural networks. In ICASSP'2013,pages 6645–6649.

Graves,A.,Wayne,G.,and Danihelka,I.(2014). Neural Turing machines. arXiv:1410.5401.

Grefenstette,E.,Hermann,K. M.,Suleyman,M.,and Blunsom,P.(2015). Learning to transduce with unbounded memory. In NIPS'2015.

Greff,K.,Srivastava,R. K.,Koutník,J.,Steunebrink,B. R.,and Schmidhuber,J.(2015). LSTM: a search space odyssey. arXiv preprint arXiv:1503.04069.

Gregor,K. and LeCun,Y.(2010a). Emergence of complex-like cells in a temporal product network with local receptivefields. Technical report,arXiv:1006.0448.

Gregor,K. and LeCun,Y.(2010b). Learning fast approximations of sparse coding. In L. Bottou and M. Littman,editors,Proceedings of the Twenty-seventh International Conference on Machine Learning(ICML-10). ACM.

Gregor,K.,Danihelka,I.,Mnih,A.,Blundell,C.,and Wierstra,D.(2014). Deep autoregressive networks. In International Conference on Machine Learning(ICML'2014).

Gregor,K.,Danihelka,I.,Graves,A.,and Wierstra,D.(2015). DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623.

Gretton,A.,Borgwardt,K. M.,Rasch,M. J.,Schölkopf,B.,and Smola,A.(2012). A kernel two-sample test. The Journal of Machine Learning Research,13(1),723–773.

Guillaume Desjardins,Karen Simonyan,R. P. K. K.(2015). Natural neural networks. Technical report,arXiv:1507.00210.

Gulcehre,C. and Bengio,Y.(2013). Knowledge matters: Importance of prior information for optimization. Technical Report arXiv:1301.4083,Universite de Montreal.

Guo,H. and Gelfand,S. B.(1992). Classification trees with neural network feature extraction. Neural Networks,IEEE Transactions on,3(6),923–933.

Gupta,S.,Agrawal,A.,Gopalakrishnan,K.,and Narayanan,P.(2015). Deep learning with limited numerical precision. CoRR,abs/1502.02551.

Gutmann,M. and Hyvarinen,A.(2010). Noise-contrastive estimation: A new estimation princi-ple for unnormalized statistical models. In Proceedings of The Thirteenth International Conference on Artificial Intelligence and Statistics(AISTATS'10).

Hadsell,R.,Sermanet,P.,Ben,J.,Erkan,A.,Han,J.,Muller,U.,and LeCun,Y.(2007). Online learning for offroad robots: Spatial label propagation to learn long-range traversability. In Proceedings of Robotics: Science and Systems,Atlanta,GA,USA.

Hajnal,A.,Maass,W.,Pudlak,P.,Szegedy,M.,and Turan,G.(1993). Threshold circuits of bounded depth. J. Comput. System. Sci.,46,129–154.

Håstad,J.(1986). Almost optimal lower bounds for small depth circuits. In Proceedings of the 18th annual ACM Symposium on Theory of Computing,pages 6–20,Berkeley,California. ACM Press.

Håstad,J. and Goldmann,M.(1991). On the power of small-depth threshold circuits. Computational Complexity,1,113–129.

Hastie,T.,Tibshirani,R.,and Friedman,J.(2001). The elements of statistical learning: data mining,inference and prediction. Springer Series in Statistics. Springer Verlag.

He,K.,Zhang,X.,Ren,S.,and Sun,J.(2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. arXiv preprint arXiv:1502.01852.

Hebb,D. O.(1949). The Organization of Behavior. Wiley,New York.

Henaff,M.,Jarrett,K.,Kavukcuoglu,K.,and LeCun,Y.(2011). Unsupervised learning of sparse features for scalable audio classification. In ISMIR'11.

Henderson,J.(2003). Inducing history representations for broad coverage statistical parsing. In HLT-NAACL,pages 103–110.

Henderson,J.(2004). Discriminative training of a neural network statistical parser. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics,page 95.

Henniges,M.,Puertas,G.,Bornschein,J.,Eggert,J.,and Lücke,J.(2010). Binary sparse coding. In Latent Variable Analysis and Signal Separation,pages 450–457. Springer.

Herault,J. and Ans,B.(1984). Circuits neuronaux à synapses modifiables: Décodage de messages composites par apprentissage non supervisé. Comptes Rendus de l'Académie des Sciences,299(III-13),525–528.

Hinton,G.,Deng,L.,Dahl,G. E.,Mohamed,A.,Jaitly,N.,Senior,A.,Vanhoucke,V.,Nguyen,P.,Sainath,T.,and Kingsbury,B.(2012a). Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine,29(6),82–97.

Hinton,G.,Vinyals,O.,and Dean,J.(2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

Hinton,G. E.(1989). Connectionist learning procedures. Artificial Intelligence,40,185–234.

Hinton,G. E.(1990). Mapping part-whole hierarchies into connectionist networks. Artificial Intelligence,46(1),47–75.

Hinton,G. E.(1999). Products of experts. In Proceedings of the Ninth International Conference on Artificial Neural Networks(ICANN),volume 1,pages 1–6,Edinburgh,Scotland. IEE.

Hinton,G. E.(2000). Training products of experts by minimizing contrastive divergence. Technical Report GCNU TR 2000-004,Gatsby Unit,University College London.

Hinton,G. E.(2006). To recognize shapes,first learn to generate images. Technical Report UTML TR 2006-003,University of Toronto.

Hinton,G. E.(2007a). How to do backpropagation in a brain. Invited talk at the NIPS'2007 Deep Learning Workshop.

Hinton,G. E.(2007b). Learning multiple layers of representation. Trends in cognitive sciences,11(10),428–434.

Hinton,G. E.(2010). A practical guide to training restricted Boltzmann machines. Technical Report UTML TR 2010-003,Comp. Sc.,University of Toronto.

Hinton,G. E.(2012). Tutorial on deep learning. IPAM Graduate Summer School: Deep Learning,Feature Learning.

Hinton,G. E. and Ghahramani,Z.(1997). Generative models for discovering sparse distributed representations. Philosophical Transactions of the Royal Society of London.

Hinton,G. E. and McClelland,J. L.(1988). Learning representations by recirculation. In NIPS'1987,pages 358–366.

Hinton,G. E. and Roweis,S.(2003). Stochastic neighbor embedding. In NIPS'2002.

Hinton,G. E. and Salakhutdinov,R.(2006). Reducing the dimensionality of data with neural networks. Science,313(5786),504–507.

Hinton,G. E. and Sejnowski,T. J.(1986). Learning and relearning in Boltzmann machines. In D. E. Rumelhart and J. L. McClelland,editors,Parallel Distributed Processing,volume 1,chapter 7,pages 282–317. MIT Press,Cambridge.

Hinton,G. E. and Sejnowski,T. J.(1999). Unsupervised learning: foundations of neural computation. MIT press.

Hinton,G. E. and Shallice,T.(1991). Lesioning an attractor network: investigations of acquired dyslexia. Psychological review,98(1),74.

Hinton,G. E. and Zemel,R. S.(1994). Autoencoders,minimum description length,and Helmholtz free energy. In NIPS'1993.

Hinton,G. E.,Sejnowski,T. J.,and Ackley,D. H.(1984a). Boltzmann machines: Constraint satisfaction networks that learn. Technical Report TR-CMU-CS-84-119,Carnegie-Mellon Uni-versity,Dept. of Computer Science.

Hinton,G. E.,Sejnowski,T. J.,and Ackley,D. H.(1984b). Boltzmann machines: Constraint satisfaction networks that learn. Technical Report TR-CMU-CS-84-119,Carnegie-Mellon Uni-versity,Dept. of Computer Science.

Hinton,G. E.,McClelland,J.,and Rumelhart,D.(1986). Distributed representations. In D. E. Rumelhart and J. L. McClelland,editors,Parallel Distributed Processing: Explorations in the Microstructure of Cognition,volume 1,pages 77–109. MIT Press,Cambridge.

Hinton,G. E.,Revow,M.,and Dayan,P.(1995a). Recognizing handwritten digits using mixtures of linear models. In G. Tesauro,D. Touretzky,and T. Leen,editors,Advances in Neural Information Processing Systems 7(NIPS'94),pages 1015–1022. MIT Press,Cambridge,MA.

Hinton,G. E.,Dayan,P.,Frey,B. J.,and Neal,R. M.(1995b). The wake-sleep algorithm for unsupervised neural networks. Science,268,1558–1161.

Hinton,G. E.,Dayan,P.,and Revow,M.(1997). Modelling the manifolds of images of hand-written digits. IEEE Transactions on Neural Networks,8,65–74.

Hinton,G. E.,Welling,M.,Teh,Y. W.,and Osindero,S.(2001). A new view of ICA. In Proceedings of 3rd International Conference on Independent Component Analysis and Blind Signal Separation(ICA'01),pages 746–751,San Diego,CA.

Hinton,G. E.,Osindero,S.,and Teh,Y.(2006a). A fast learning algorithm for deep belief nets. Neural Computation,18,1527–1554.

Hinton,G. E.,Osindero,S.,and Teh,Y.-W.(2006b). A fast learning algorithm for deep belief nets. Neural Computation,18,1527–1554.

Hinton,G. E.,Deng,L.,Yu,D.,Dahl,G. E.,Mohamed,A.,Jaitly,N.,Senior,A.,Vanhoucke,V.,Nguyen,P.,Sainath,T. N.,and Kingsbury,B.(2012b). Deep neural networks for acoustic modeling in speech recognition:The shared views of four research groups. IEEE Signal Process. Mag.,29(6),82–97.

Hinton,G. E.,Srivastava,N.,Krizhevsky,A.,Sutskever,I.,and Salakhutdinov,R.(2012c). Improving neural networks by preventing co-adaptation of feature detectors. Technical report,arXiv:1207.0580.

Hinton,G. E.,Srivastava,N.,Krizhevsky,A.,Sutskever,I.,and Salakhutdinov,R.(2012d). Improving neural networks by preventing co-adaptation of feature detectors. Technical report,arXiv:1207.0580.

Hinton,G. E.,Vinyals,O.,and Dean,J.(2014). Dark knowledge. Invited talk at the BayLearn Bay Area Machine Learning Symposium.

Hochreiter,S.(1991a). Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis,T.U. München.

Hochreiter,S.(1991b). Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis,Institut für Informatik,Lehrstuhl Prof. Brauer,Technische Universität München.

Hochreiter,S. and Schmidhuber,J.(1995). Simplifying neural nets by discoveringflat minima. In Advances in Neural Information Processing Systems 7,pages 529–536. MIT Press.

Hochreiter,S. and Schmidhuber,J.(1997). Long short-term memory. Neural Computation,9(8),1735–1780.

Hochreiter,S.,Bengio,Y.,and Frasconi,P.(2001). Gradientflow in recurrent nets: the difficulty of learning long-term dependencies. In J. Kolen and S. Kremer,editors,Field Guide to Dynamical Recurrent Networks. IEEE Press.

Holi,J. L. and Hwang,J.-N.(1993). Finite precision error analysis of neural network hardware implementations. Computers,IEEE Transactions on,42(3),281–290.

Holt,J. L. and Baker,T. E.(1991). Back propagation simulations using limited precision calculations. In Neural Networks,1991.,IJCNN-91-Seattle International Joint Conference on,volume 2,pages 121–126. IEEE.

Hornik,K.,Stinchcombe,M.,and White,H.(1989). Multilayer feedforward networks are universal approximators. Neural Networks,2,359–366.

Hornik,K.,Stinchcombe,M.,and White,H.(1990). Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural networks,3(5),551–560.

Hsu,F.-H.(2002). Behind Deep Blue: Building the Computer That Defeated the World Chess Champion. Princeton University Press,Princeton,NJ,USA.

Huang,F. and Ogata,Y.(2002). Generalized pseudo-likelihood estimates for Markov random fields on lattice. Annals of the Institute of Statistical Mathematics,54(1),1–18.

Huang,P.-S.,He,X.,Gao,J.,Deng,L.,Acero,A.,and Heck,L.(2013). Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM international conference on Conference on information & knowledge management,pages 2333–2338. ACM.

Hubel,D. and Wiesel,T.(1968). Receptivefields and functional architecture of monkey striate cortex. Journal of Physiology(London),195,215–243.

Hubel,D. H. and Wiesel,T. N.(1959). Receptive fields of single neurons in the cat's striate cortex. Journal of Physiology,148,574–591.

Hubel,D. H. and Wiesel,T. N.(1962). Receptive fields,binocular interaction,and functional architecture in the cat's visual cortex. Journal of Physiology(London),160,106–154.

Huszar,F.(2015). How (not) to train your generative model: scheduled sampling,likelihood,adversary? arXiv:1511.05101.

Hutter,F.,Hoos,H.,and Leyton-Brown,K.(2011). Sequential model-based optimization for general algorithm configuration. In LION-5. Extended version as UBC Tech report TR-2010-10.

Hyötyniemi,H.(1996). Turing machines are recurrent neural networks. In STeP'96,pages 13–24.

Hyvärinen,A.(1999). Survey on independent component analysis. Neural Computing Surveys,2,94–128.

Hyvärinen,A.(2005a). Estimation of non-normalized statistical models using score matching. Journal of Machine Learning Research,6,695–709.

Hyvärinen,A.(2005b). Estimation of non-normalized statistical models using score matching. Journal of Machine Learning Research,6,695–709.

Hyvärinen,A.(2007a). Connections between score matching,contrastive divergence,and pseudolikelihood for continuous-valued variables. IEEE Transactions on Neural Networks,18,1529–1531.

Hyvärinen,A.(2007b). Some extensions of score matching. Computational Statistics and Data Analysis,51,2499–2512.

Hyvärinen,A. and Hoyer,P. O.(1999). Emergence of topography and complex cell properties from natural images using extensions of ICA. In NIPS,pages 827–833.

Hyvärinen,A. and Pajunen,P.(1999). Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks,12(3),429–439.

Hyvärinen,A.,Karhunen,J.,and Oja,E.(2001a). Independent Component Analysis. Wiley-Interscience.

Hyvärinen,A.,Hoyer,P. O.,and Inki,M. O.(2001b). Topographic independent component analysis. Neural Computation,13(7),1527–1558.

Hyvärinen,A.,Hurri,J.,and Hoyer,P. O.(2009). Natural Image Statistics: A probabilistic approach to early computational vision. Springer-Verlag.

Iba,Y.(2001). Extended ensemble Monte Carlo. International Journal of Modern Physics,C12,623–656.

Inayoshi,H. and Kurita,T.(2005). Improved generalization by adding both auto-association and hidden-layer noise to neural-network-based classifiers. IEEE Workshop on Machine Learning for Signal Processing,pages 141–146.

Ioffe,S. and Szegedy,C.(2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift.

Jacobs,R. A.(1988). Increased rates of convergence through learning rate adaptation. Neural networks,1(4),295–307.

Jacobs,R. A.,Jordan,M. I.,Nowlan,S. J.,and Hinton,G. E.(1991). Adaptive mixtures of local experts. Neural Computation,3,79–87.

Jaeger,H.(2003). Adaptive nonlinear system identification with echo state networks. In Advances in Neural Information Processing Systems 15.

Jaeger,H.(2007a). Discovering multiscale dynamical features with hierarchical echo state networks. Technical report,Jacobs University.

Jaeger,H.(2007b). Echo state network. Scholarpedia,2(9),2330.

Jaeger,H.(2012). Long short-term memory in echo state networks: Details of a simulation study. Technical report,Jacobs University Bremen.

Jaeger,H. and Haas,H.(2004). Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science,304(5667),78–80.

Jaeger,H.,Lukosevicius,M.,Popovici,D.,and Siewert,U.(2007). Optimization and applications of echo state networks with leaky-integrator neurons. Neural Networks,20(3),335–352.

Jain,V.,Murray,J. F.,Roth,F.,Turaga,S.,Zhigulin,V.,Briggman,K. L.,Helmstaedter,M. N.,Denk,W.,and Seung,H. S.(2007). Supervised learning of image restoration with convolutional networks. In Computer Vision,2007. ICCV 2007. IEEE 11th International Conference on,pages 1–8. IEEE.

Jaitly,N. and Hinton,G.(2011). Learning a better representation of speech soundwaves using restricted Boltzmann machines. In Acoustics,Speech and Signal Processing(ICASSP),2011 IEEE International Conference on,pages 5884–5887. IEEE.

Jaitly,N. and Hinton,G. E.(2013). Vocal tract length perturbation(VTLP) improves speech recognition. In ICML'2013.

Jarrett,K.,Kavukcuoglu,K.,Ranzato,M.,and LeCun,Y.(2009a). What is the best multi-stage architecture for object recognition? In Proc. International Conference on Computer Vision(ICCV'09),pages 2146–2153. IEEE.

Jarrett,K.,Kavukcuoglu,K.,Ranzato,M.,and LeCun,Y.(2009b). What is the best multi-stage architecture for object recognition? In ICCV'09.

Jarzynski,C.(1997). Nonequilibrium equality for free energy differences. Phys. Rev. Lett.,78,2690–2693.

Jaynes,E. T.(2003). Probability Theory: The Logic of Science. Cambridge University Press.

Jean,S.,Cho,K.,Memisevic,R.,and Bengio,Y.(2014). On using very large target vocabulary for neural machine translation. arXiv:1412.2007.

Jelinek,F. and Mercer,R. L.(1980). Interpolated estimation of Markov source parameters from sparse data. In E. S. Gelsema and L. N. Kanal,editors,Pattern Recognition in Practice. North-Holland,Amsterdam.

Jia,Y.(2013). Caffe:An open source convolutional architecture for fast feature embedding. http://caffe.berkeleyvision.org/.

Jia,Y.,Huang,C.,and Darrell,T.(2012). Beyond spatial pyramids: Receptive field learning for pooled image features. In Computer Vision and Pattern Recognition(CVPR),2012 IEEE Conference on,pages 3370–3377. IEEE.

Jim,K.-C.,Giles,C. L.,and Horne,B. G.(1996). An analysis of noise in recurrent neural networks: convergence and generalization. IEEE Transactions on Neural Networks,7(6),1424–1438.

Jordan,M. I.(1998). Learning in Graphical Models. Kluwer,Dordrecht,Netherlands.

Joulin,A. and Mikolov,T.(2015). Inferring algorithmic patterns with stack-augmented recurrent nets. arXiv preprint arXiv:1503.01007.

Jozefowicz,R.,Zaremba,W.,and Sutskever,I.(2015). An empirical evaluation of recurrent network architectures. In ICML'2015.

Judd,J. S.(1989). Neural Network Design and the Complexity of Learning. MIT press.

Jutten,C. and Herault,J.(1991). Blind separation of sources,part I: an adaptive algorithm based on neuromimetic architecture. Signal Processing,24,1–10.

Kahou,S. E.,Pal,C.,Bouthillier,X.,Froumenty,P.,Gülçehre,Ç.,Memisevic,R.,Vincent,P.,Courville,A.,Bengio,Y.,Ferrari,R. C.,Mirza,M.,Jean,S.,Carrier,P. L.,Dauphin,Y.,Boulanger-Lewandowski,N.,Aggarwal,A.,Zumer,J.,Lamblin,P.,Raymond,J.-P.,Desjardins,G.,Pascanu,R.,Warde-Farley,D.,Torabi,A.,Sharma,A.,Bengio,E.,Côté,M.,Konda,K. R.,and Wu,Z.(2013). Combining modality specific deep neural networks for emotion recognition in video. In Proceedings of the 15th ACM on International Conference on Multimodal Interaction.

Kalchbrenner,N. and Blunsom,P.(2013). Recurrent continuous translation models. In EMNLP'2013.

Kalchbrenner,N.,Danihelka,I.,and Graves,A.(2015). Grid long short-term memory. arXiv preprint arXiv:1507.01526.

Kamyshanska,H. and Memisevic,R.(2015). The potential energy of an autoencoder. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Karpathy,A. and Li,F.-F.(2015). Deep visual-semantic alignments for generating image descriptions. In CVPR'2015. arXiv:1412.2306.

Karpathy,A.,Toderici,G.,Shetty,S.,Leung,T.,Sukthankar,R.,and Fei-Fei,L.(2014). Large-scale video classification with convolutional neural networks. In CVPR.

Karush,W.(1939). Minima of Functions of Several Variables with Inequalities as Side Constraints. Master's thesis,Dept. of Mathematics,Univ. of Chicago.

Katz,S. M.(1987). Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics,Speech,and Signal Processing,ASSP-35(3),400–401.

Kavukcuoglu,K.,Ranzato,M.,and LeCun,Y.(2008). Fast inference in sparse coding algorithms with applications to object recognition. Technical report,Computational and Biological Learning Lab,Courant Institute,NYU. Tech Report CBLL-TR-2008-12-01.

Kavukcuoglu,K.,Ranzato,M.-A.,Fergus,R.,and LeCun,Y.(2009). Learning invariant features through topographic filter maps. In CVPR'2009.

Kavukcuoglu,K.,Sermanet,P.,Boureau,Y.-L.,Gregor,K.,Mathieu,M.,and LeCun,Y.(2010). Learning convolutional feature hierarchies for visual recognition. In NIPS'2010.

Kelley,H. J.(1960). Gradient theory of optimal flight paths. ARS Journal,30(10),947–954.

Khan,F.,Zhu,X.,and Mutlu,B.(2011). How do humans teach: On curriculum learning and teaching dimension. In Advances in Neural Information Processing Systems 24(NIPS'11),pages 1449–1457.

Kim,S. K.,McAfee,L. C.,McMahon,P. L.,and Olukotun,K.(2009). A highly scalable restricted Boltzmann machine FPGA implementation. In Field Programmable Logic and Applications,2009. FPL 2009. International Conference on,pages 367–372. IEEE.

Kindermann,R.(1980). Markov Random Fields and Their Applications(Contemporary Mathematics;V. 1). American Mathematical Society.

Kingma,D. and Ba,J.(2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kingma,D. and LeCun,Y.(2010a). Regularized estimation of image statistics by score matching. In NIPS'2010.

Kingma,D. and LeCun,Y.(2010b). Regularized estimation of image statistics by score matching. In J. Lafferty,C. K. I. Williams,J. Shawe-Taylor,R. Zemel,and A. Culotta,editors,Advances in Neural Information Processing Systems 23,pages 1126–1134.

Kingma,D.,Rezende,D.,Mohamed,S.,and Welling,M.(2014). Semi-supervised learning with deep generative models. In NIPS'2014.

Kingma,D. P.(2013). Fast gradient-based inference with continuous latent variable models in auxiliary form. Technical report,arxiv:1306.0733.

Kingma,D. P. and Welling,M.(2014a). Auto-encoding variational bayes. In Proceedings of the International Conference on Learning Representations(ICLR).

Kingma,D. P. and Welling,M.(2014b). Efficient gradient-based inference through transformations between Bayes nets and neural nets. Technical report,arxiv:1402.0480.

Kirkpatrick,S.,Gelatt Jr.,C. D.,and Vecchi,M. P.(1983). Optimization by simulated annealing. Science,220,671–680.

Kiros,R.,Salakhutdinov,R.,and Zemel,R.(2014a). Multimodal neural language models. In ICML'2014.

Kiros,R.,Salakhutdinov,R.,and Zemel,R.(2014b). Unifying visual-semantic embeddings with multimodal neural language models. arXiv:1411.2539 [cs.LG].

Klementiev,A.,Titov,I.,and Bhattarai,B.(2012). Inducing crosslingual distributed representations of words. In Proceedings of COLING 2012.

Knowles-Barley,S.,Jones,T. R.,Morgan,J.,Lee,D.,Kasthuri,N.,Lichtman,J. W.,and Pfister,H.(2014). Deep learning for the connectome. GPU Technology Conference.

Koller,D. and Friedman,N.(2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press.

Konig,Y.,Bourlard,H.,and Morgan,N.(1996). REMAP:Recursive estimation and maximization of a posteriori probabilities–application to transition-based connectionist speech recognition. In D. Touretzky,M. Mozer,and M. Hasselmo,editors,Advances in Neural Information Processing Systems 8(NIPS'95). MIT Press,Cambridge,MA.

Koren,Y.(2009). The BellKor solution to the Netflix grand prize.

Kotzias,D.,Denil,M.,de Freitas,N.,and Smyth,P.(2015). From group to individual labels using deep features. In ACM SIGKDD.

Koutnik,J.,Greff,K.,Gomez,F.,and Schmidhuber,J.(2014). A clockwork RNN. In ICML'2014.

Kočiský,T.,Hermann,K. M.,and Blunsom,P.(2014). Learning Bilingual Word Representations by Marginalizing Alignments. In Proceedings of ACL.

Krause,O.,Fischer,A.,Glasmachers,T.,and Igel,C.(2013). Approximation properties of DBNs with binary hidden units and real-valued visible units. In ICML'2013.

Krizhevsky,A.(2010). Convolutional deep belief networks on CIFAR-10. Technical report,University of Toronto. Unpublished Manuscript: http://www.cs.utoronto.ca/kriz/conv-cifar10-aug2010.pdf.

Krizhevsky,A. and Hinton,G.(2009). Learning multiple layers of features from tiny images. Technical report,University of Toronto.

Krizhevsky,A. and Hinton,G. E.(2011). Using very deep autoencoders for content-based image retrieval. In ESANN.

Krizhevsky,A.,Sutskever,I.,and Hinton,G.(2012a). ImageNet classification with deep convolutional neural networks. In NIPS'2012.

Krizhevsky,A.,Sutskever,I.,and Hinton,G.(2012b). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25(NIPS'2012).

Krueger,K. A. and Dayan,P.(2009). Flexible shaping: how learning in small steps helps. Cognition,110,380–394.

Kuhn,H. W. and Tucker,A. W.(1951). Nonlinear programming. In Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability,pages 481–492,Berkeley,Calif. University of California Press.

Kumar,A.,Irsoy,O.,Ondruska,P.,Iyyer,M.,Bradbury,J.,Gulrajani,I.,and Socher,R.(2015a). Ask me anything: Dynamic memory networks for natural language processing. Technical report,arXiv:1506.07285.

Kumar,A.,Irsoy,O.,Su,J.,Bradbury,J.,English,R.,Pierce,B.,Ondruska,P.,Iyyer,M.,Gulrajani,I.,and Socher,R.(2015b). Ask me anything: Dynamic memory networks for natural language processing. arXiv:1506.07285.

Kumar,M. P.,Packer,B.,and Koller,D.(2010). Self-paced learning for latent variable models. In J. Lafferty,C. K. I. Williams,J. Shawe-Taylor,R. Zemel,and A. Culotta,editors,Advances in Neural Information Processing Systems 23,pages 1189–1197.

Lang,K. J. and Hinton,G. E.(1988). The development of the time-delay neural network architecture for speech recognition. Technical Report CMU-CS-88-152,Carnegie-Mellon University.

Lang,K. J.,Waibel,A. H.,and Hinton,G. E.(1990). A time-delay neural network architecture for isolated word recognition. Neural networks,3(1),23–43.

Langford,J. and Zhang,T.(2008). The epoch-greedy algorithm for contextual multi-armed bandits. In NIPS'2008,pages 1096–1103.

Lappalainen,H.,Giannakopoulos,X.,Honkela,A.,and Karhunen,J.(2000). Nonlinear independent component analysis using ensemble learning: Experiments and discussion. In Proc. ICA. Citeseer.

Larochelle,H. and Bengio,Y.(2008a). Classification using discriminative restricted Boltzmann machines. In ICML'2008.

Larochelle,H. and Bengio,Y.(2008b). Classification using discriminative restricted Boltzmann machines. In ICML'2008,pages 536–543.

Larochelle,H. and Hinton,G. E.(2010). Learning to combine foveal glimpses with a third-order Boltzmann machine. In Advances in Neural Information Processing Systems 23,pages 1243–1251.

Larochelle,H. and Murray,I.(2011). The Neural Autoregressive Distribution Estimator. In AISTATS'2011.

Larochelle,H.,Erhan,D.,and Bengio,Y.(2008). Zero-data learning of new tasks. In AAAI Conference on Artificial Intelligence.

Larochelle,H.,Bengio,Y.,Louradour,J.,and Lamblin,P.(2009). Exploring strategies for training deep neural networks. Journal of Machine Learning Research,10,1–40.

Lasserre,J. A.,Bishop,C. M.,and Minka,T. P.(2006). Principled hybrids of generative and discriminative models. In Proceedings of the Computer Vision and Pattern Recognition Conference(CVPR'06),pages 87–94,Washington,DC,USA. IEEE Computer Society.

Le,Q.,Ngiam,J.,Chen,Z.,Chia,D. J.,Koh,P. W.,and Ng,A.(2010). Tiled convolutional neural networks. In J. Lafferty,C. K. I. Williams,J. Shawe-Taylor,R. Zemel,and A. Culotta,editors,Advances in Neural Information Processing Systems 23(NIPS'10),pages 1279–1287.

Le,Q.,Ngiam,J.,Coates,A.,Lahiri,A.,Prochnow,B.,and Ng,A.(2011). On optimization methods for deep learning. In Proc. ICML'2011. ACM.

Le,Q.,Ranzato,M.,Monga,R.,Devin,M.,Corrado,G.,Chen,K.,Dean,J.,and Ng,A.(2012). Building high-level features using large scale unsupervised learning. In ICML'2012.

Le Roux,N. and Bengio,Y.(2008). Representational power of restricted Boltzmann machines and deep belief networks. Neural Computation,20(6),1631–1649.

Le Roux,N. and Bengio,Y.(2010). Deep belief networks are compact universal approximators. Neural Computation,22(8),2192–2207.

LeCun,Y.(1985). Une procédure d'apprentissage pour Réseau à seuil assymétrique. In Cognitiva 85: A la Frontière de l'Intelligence Artificielle,des Sciences de la Connaissance et des Neurosciences,pages 599–604,Paris 1985. CESTA,Paris.

LeCun,Y.(1986). Learning processes in an asymmetric threshold network. In E. Bienenstock,F. Fogelman-Soulié,and G. Weisbuch,editors,Disordered Systems and Biological Organization,pages 233–240. Springer-Verlag,Berlin,Les Houches 1985.

LeCun,Y.(1987). Modèles connexionistes de l'apprentissage. Ph.D. thesis,Université de Paris VI.

LeCun,Y.(1989). Generalization and network design strategies. Technical Report CRG-TR-89-4,University of Toronto.

LeCun,Y.,Jackel,L. D.,Boser,B.,Denker,J. S.,Graf,H. P.,Guyon,I.,Henderson,D.,Howard,R. E.,and Hubbard,W.(1989). Handwritten digit recognition: Applications of neural network chips and automatic learning. IEEE Communications Magazine,27(11),41–46.

LeCun,Y.,Bottou,L.,Orr,G. B.,and Müller,K.-R.(1998a). Efficient backprop. In Neural Networks,Tricks of the Trade,Lecture Notes in Computer Science LNCS 1524. Springer Verlag.

LeCun,Y.,Bottou,L.,Orr,G. B.,and Müller,K.(1998b). Efficient backprop. In Neural Networks,Tricks of the Trade.

LeCun,Y.,Bottou,L.,Bengio,Y.,and Haffner,P.(1998c). Gradient-based learning applied to document recognition. Proc. IEEE.

LeCun,Y.,Kavukcuoglu,K.,and Farabet,C.(2010). Convolutional networks and applications in vision. In Circuits and Systems(ISCAS),Proceedings of 2010 IEEE International Symposium on,pages 253–256. IEEE.

L'Ecuyer,P.(1994). Efficiency improvement and variance reduction. In Proceedings of the 1994 Winter Simulation Conference,pages 122–132.

Lee,C.-Y.,Xie,S.,Gallagher,P.,Zhang,Z.,and Tu,Z.(2014). Deeply-supervised nets. arXiv preprint arXiv:1409.5185.

Lee,H.,Battle,A.,Raina,R.,and Ng,A.(2007). Efficient sparse coding algorithms. In B. Schölkopf,J. Platt,and T. Hoffman,editors,Advances in Neural Information Processing Systems 19(NIPS'06),pages 801–808. MIT Press.

Lee,H.,Ekanadham,C.,and Ng,A.(2008). Sparse deep belief net model for visual area V2. In NIPS'07.

Lee,H.,Grosse,R.,Ranganath,R.,and Ng,A. Y.(2009). Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In L. Bottou and M. Littman,editors,Proceedings of the Twenty-sixth International Conference on Machine Learning(ICML'09). ACM,Montreal,Canada.

Lee,Y. J. and Grauman,K.(2011). Learning the easy things first: self-paced visual category discovery. In CVPR'2011.

Leibniz,G. W.(1676). Memoir using the chain rule.(Cited in TMME 7:2&3 p 321-332,2010).

Lenat,D. B. and Guha,R. V.(1989). Building large knowledge-based systems;representation and inference in the Cyc project. Addison-Wesley Longman Publishing Co.,Inc.

Leshno,M.,Lin,V. Y.,Pinkus,A.,and Schocken,S.(1993). Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks,6,861–867.

Levenberg,K.(1944). A method for the solution of certain non-linear problems in least squares. Quarterly Journal of Applied Mathematics,II(2),164–168.

L'Hôpital,G. F. A.(1696). Analyse des infiniment petits,pour l'intelligence des lignes courbes. Paris: L'Imprimerie Royale.

Li,Y.,Swersky,K.,and Zemel,R. S.(2015). Generative moment matching networks. CoRR,abs/1502.02761.

Lin,T.,Horne,B. G.,Tino,P.,and Giles,C. L.(1996). Learning long-term dependencies is not as difficult with NARX recurrent neural networks. IEEE Transactions on Neural Networks,7(6),1329–1338.

Lin,Y.,Liu,Z.,Sun,M.,Liu,Y.,and Zhu,X.(2015). Learning entity and relation embeddings for knowledge graph completion. In Proc. AAAI'15.

Linde,N.(1992). The machine that changed the world,episode 3. Documentary miniseries.

Lindsey,C. and Lindblad,T.(1994). Review of hardware neural networks: a user's perspective. In Proc. Third Workshop on Neural Networks: From Biology to High Energy Physics,pages 195–202,Isola d'Elba,Italy.

Linnainmaa,S.(1976). Taylor expansion of the accumulated rounding error. BIT Numerical Mathematics,16(2),146–160.

LISA(2008). Deep learning tutorials:Restricted Boltzmann machines. Technical report,LISA Lab,Université de Montréal.

Long,P. M. and Servedio,R. A.(2010). Restricted Boltzmann machines are hard to approximately evaluate or simulate. In Proceedings of the 27th International Conference on Machine Learning(ICML'10).

Lotter,W.,Kreiman,G.,and Cox,D.(2015). Unsupervised learning of visual structure using predictive generative networks. arXiv preprint arXiv:1511.06380.

Lovelace,A.(1842). Notes upon L. F. Menabrea's“Sketch of the Analytical Engine invented by Charles Babbage”.

Lu,L.,Zhang,X.,Cho,K.,and Renals,S.(2015). A study of the recurrent neural network encoder-decoder for large vocabulary speech recognition. In Proc. Interspeech.

Lu,T.,Pál,D.,and Pál,M.(2010). Contextual multi-armed bandits. In International Conference on Artificial Intelligence and Statistics,pages 485–492.

Luenberger,D. G.(1984). Linear and Nonlinear Programming. Addison Wesley.

Lukoševičius,M. and Jaeger,H.(2009). Reservoir computing approaches to recurrent neural network training. Computer Science Review,3(3),127–149.

Luo,H.,Shen,R.,Niu,C.,and Ullrich,C.(2011). Learning class-relevant features and class-irrelevant features via a hybrid third-order RBM. In International Conference on Artificial Intelligence and Statistics,pages 470–478.

Luo,H.,Carrier,P. L.,Courville,A.,and Bengio,Y.(2013). Texture modeling with convolutional spike-and-slab RBMs and deep extensions. In AISTATS'2013.

Lyu,S.(2009). Interpretation and generalization of score matching. In Proceedings of the Twenty-fifth Conference in Uncertainty in Artificial Intelligence(UAI'09).

Ma,J.,Sheridan,R. P.,Liaw,A.,Dahl,G. E.,and Svetnik,V.(2015). Deep neural nets as a method for quantitative structure–activity relationships. Journal of Chemical Information and Modeling.

Maas,A. L.,Hannun,A. Y.,and Ng,A. Y.(2013). Rectifier nonlinearities improve neural network acoustic models. In ICML Workshop on Deep Learning for Audio,Speech,and Language Processing.

Maass,W.(1992). Bounds for the computational power and learning complexity of analog neural nets(extended abstract). In Proc. of the 25th ACM Symp. Theory of Computing,pages 335–344.

Maass,W.,Schnitger,G.,and Sontag,E. D.(1994). A comparison of the computational power of sigmoid and Boolean threshold circuits. Theoretical Advances in Neural Computation and Learning,pages 127–151.

Maass,W.,Natschlaeger,T.,and Markram,H.(2002). Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation,14(11),2531–2560.

MacKay,D.(2003). Information Theory,Inference and Learning Algorithms. Cambridge University Press.

Maclaurin,D.,Duvenaud,D.,and Adams,R. P.(2015). Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492.

Mao,J.,Xu,W.,Yang,Y.,Wang,J.,and Yuille,A.(2014). Deep captioning with multimodal recurrent neural networks(m-rnn). arXiv:1412.6632 [cs.CV].

Marcotte,P. and Savard,G.(1992). Novel approaches to the discrimination problem. Zeitschrift für Operations Research(Theory),36,517–545.

Marlin,B. and de Freitas,N.(2011). Asymptotic efficiency of deterministic estimators for discrete energy-based models: Ratio matching and pseudolikelihood. In UAI'2011.

Marlin,B.,Swersky,K.,Chen,B.,and de Freitas,N.(2010). Inductive principles for restricted Boltzmann machine learning. In AISTATS'2010,pages 509–516.

Marquardt,D. W.(1963). An algorithm for least-squares estimation of non-linear parameters. Journal of the Society of Industrial and Applied Mathematics,11(2),431–441.

Marr,D. and Poggio,T.(1976). Cooperative computation of stereo disparity. Science,194.

Martens,J.(2010). Deep learning via Hessian-free optimization. In ICML'2010,pages 735–742.

Martens,J. and Medabalimi,V.(2014). On the expressive efficiency of sum product networks. arXiv:1411.7717.

Martens,J. and Sutskever,I.(2011). Learning recurrent neural networks with Hessian-free optimization. In Proc. ICML'2011. ACM.

Mase,S.(1995). Consistency of the maximum pseudo-likelihood estimator of continuous state space Gibbsian processes. The Annals of Applied Probability,5(3),603–612.

McClelland,J.,Rumelhart,D.,and Hinton,G.(1995). The appeal of parallel distributed processing. In Computation & intelligence,pages 305–341. American Association for Artificial Intelligence.

McCulloch,W. S. and Pitts,W.(1943). A logical calculus of ideas immanent in nervous activity. Bulletin of Mathematical Biophysics,5,115–133.

Mead,C. and Ismail,M.(2012). Analog VLSI implementation of neural systems,volume 80. Springer Science & Business Media.

Melchior,J.,Fischer,A.,and Wiskott,L.(2013). How to center binary deep Boltzmann machines. arXiv preprint arXiv:1311.1354.

Memisevic,R. and Hinton,G. E.(2007). Unsupervised learning of image transformations. In Proceedings of the Computer Vision and Pattern Recognition Conference(CVPR'07).

Memisevic,R. and Hinton,G. E.(2010). Learning to represent spatial transformations with factored higher-order Boltzmann machines. Neural Computation,22(6),1473–1492.

Mesnil,G.,Dauphin,Y.,Glorot,X.,Rifai,S.,Bengio,Y.,Goodfellow,I.,Lavoie,E.,Muller,X.,Desjardins,G.,Warde-Farley,D.,Vincent,P.,Courville,A.,and Bergstra,J.(2011). Unsupervised and transfer learning challenge: a deep learning approach. In JMLR W&CP: Proc. Unsupervised and Transfer Learning,volume 7.

Mesnil,G.,Rifai,S.,Dauphin,Y.,Bengio,Y.,and Vincent,P.(2012). Surfing on the manifold. Learning Workshop,Snowbird.

Miikkulainen,R. and Dyer,M. G.(1991). Natural language processing with modular PDP networks and distributed lexicon. Cognitive Science,15,343–399.

Mikolov,T.(2012). Statistical Language Models based on Neural Networks. Ph.D. thesis,Brno University of Technology.

Mikolov,T.,Deoras,A.,Kombrink,S.,Burget,L.,and Cernocky,J.(2011a). Empirical evaluation and combination of advanced language modeling techniques. In Proc. 12th annual conference of the international speech communication association(INTERSPEECH 2011).

Mikolov,T.,Deoras,A.,Povey,D.,Burget,L.,and Cernocky,J.(2011b). Strategies for training large scale neural network language models. In Proc. ASRU'2011.

Mikolov,T.,Chen,K.,Corrado,G.,and Dean,J.(2013a). Efficient estimation of word representations in vector space. In International Conference on Learning Representations: Workshops Track.

Mikolov,T.,Le,Q. V.,and Sutskever,I.(2013b). Exploiting similarities among languages for machine translation. Technical report,arXiv:1309.4168.

Minka,T.(2005). Divergence measures and message passing. Technical Report MSR-TR-2005-173,Microsoft Research,Cambridge,UK.

Minsky,M. L. and Papert,S. A.(1969). Perceptrons. MIT Press,Cambridge.

Mirza,M. and Osindero,S.(2014). Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.

Mishkin,D. and Matas,J.(2015). All you need is a good init. arXiv preprint arXiv:1511.06422.

Misra,J. and Saha,I.(2010). Artificial neural networks in hardware: A survey of two decades of progress. Neurocomputing,74(1),239–255.

Mitchell,T. M.(1997). Machine Learning. McGraw-Hill,New York.

Miyato,T.,Maeda,S.,Koyama,M.,Nakae,K.,and Ishii,S.(2015). Distributional smoothing with virtual adversarial training. In ICLR. Preprint: arXiv:1507.00677.

Mnih,A. and Gregor,K.(2014). Neural variational inference and learning in belief networks. In ICML'2014.

Mnih,A. and Hinton,G. E.(2007). Three new graphical models for statistical language modelling. In Z. Ghahramani,editor,Proceedings of the Twenty-fourth International Conference on Machine Learning(ICML'07),pages 641–648. ACM.

Mnih,A. and Hinton,G. E.(2009). A scalable hierarchical distributed language model. In D. Koller,D. Schuurmans,Y. Bengio,and L. Bottou,editors,Advances in Neural Information Processing Systems 21(NIPS'08),pages 1081–1088.

Mnih,A. and Kavukcuoglu,K.(2013). Learning word embeddings efficiently with noise-contrastive estimation. In C. Burges,L. Bottou,M. Welling,Z. Ghahramani,and K. Weinberger,editors,Advances in Neural Information Processing Systems 26,pages 2265–2273. Curran Associates,Inc.

Mnih,A. and Teh,Y. W.(2012). A fast and simple algorithm for training neural probabilistic language models. In ICML'2012,pages 1751–1758.

Mnih,V. and Hinton,G.(2010). Learning to detect roads in high-resolution aerial images. In Proceedings of the 11th European Conference on Computer Vision(ECCV).

Mnih,V.,Larochelle,H.,and Hinton,G.(2011). Conditional restricted Boltzmann machines for structured output prediction. In Proc. Conf. on Uncertainty in Artificial Intelligence(UAI).

Mnih,V.,Kavukcuoglu,K.,Silver,D.,Graves,A.,Antonoglou,I.,and Wierstra,D.(2013). Playing Atari with deep reinforcement learning. Technical report,arXiv:1312.5602.

Mnih,V.,Heess,N.,Graves,A.,and Kavukcuoglu,K.(2014). Recurrent models of visual attention. In Z. Ghahramani,M. Welling,C. Cortes,N. Lawrence,and K. Weinberger,editors,NIPS'2014,pages 2204–2212.

Mnih,V.,Kavukcuoglu,K.,Silver,D.,Rusu,A. A.,Veness,J.,Bellemare,M. G.,Graves,A.,Riedmiller,M.,Fidjeland,A. K.,Ostrovski,G.,Petersen,S.,Beattie,C.,Sadik,A.,Antonoglou,I.,King,H.,Kumaran,D.,Wierstra,D.,Legg,S.,and Hassabis,D.(2015). Human-level control through deep reinforcement learning. Nature,518,529–533.

Mobahi,H. and Fisher,III,J. W.(2015). A theoretical analysis of optimization by Gaussian continuation. In AAAI'2015.

Mobahi,H.,Collobert,R.,and Weston,J.(2009). Deep learning from temporal coherence in video. In L. Bottou and M. Littman,editors,Proceedings of the 26th International Conference on Machine Learning,pages 737–744,Montreal. Omnipress.

Mohamed,A.,Dahl,G.,and Hinton,G.(2009). Deep belief networks for phone recognition.

Mohamed,A.,Sainath,T. N.,Dahl,G.,Ramabhadran,B.,Hinton,G. E.,and Picheny,M. A.(2011). Deep belief networks using discriminative features for phone recognition. In Acoustics,Speech and Signal Processing(ICASSP),2011 IEEE International Conference on,pages 5060–5063. IEEE.

Mohamed,A.,Dahl,G.,and Hinton,G.(2012a). Acoustic modeling using deep belief networks. IEEE Trans. on Audio,Speech and Language Processing,20(1),14–22.

Mohamed,A.,Hinton,G.,and Penn,G.(2012b). Understanding how deep belief networks perform acoustic modelling. In Acoustics,Speech and Signal Processing(ICASSP),2012 IEEE International Conference on,pages 4273–4276. IEEE.

Moller,M.(1993). Efficient Training of Feed-Forward Neural Networks. Ph.D. thesis,Aarhus University,Aarhus,Denmark.

Montavon,G. and Muller,K.-R.(2012). Deep Boltzmann machines and the centering trick. In G. Montavon,G. Orr,and K.-R. Müller,editors,Neural Networks: Tricks of the Trade,volume 7700 of Lecture Notes in Computer Science,pages 621–637. Preprint: http://arxiv.org/abs/1203.3783.

Montúfar,G.(2014). Universal approximation depth and errors of narrow belief networks with discrete units. Neural Computation,26.

Montúfar,G. and Ay,N.(2011). Refinements of universal approximation results for deep belief networks and restricted Boltzmann machines. Neural Computation,23(5),1306–1319.

Montufar,G. F.,Pascanu,R.,Cho,K.,and Bengio,Y.(2014). On the number of linear regions of deep neural networks. In NIPS'2014.

Mor-Yosef,S.,Samueloff,A.,Modan,B.,Navot,D.,and Schenker,J. G.(1990). Ranking the risk factors for cesarean: logistic regression analysis of a nationwide study. Obstet Gynecol,75(6),944–947.

Morin,F. and Bengio,Y.(2005). Hierarchical probabilistic neural network language model. In AISTATS'2005.

Mozer,M. C.(1992). The induction of multiscale temporal structure. In J. Moody,S. Hanson,and R. Lippmann,editors,Advances in Neural Information Processing Systems 4(NIPS'91),pages 275–282,San Mateo,CA. Morgan Kaufmann.

Murphy,K. P.(2012). Machine Learning: a Probabilistic Perspective. MIT Press,Cambridge,MA,USA.

Uria,B.,Murray,I.,and Larochelle,H.(2014). A deep and tractable density estimator. In ICML'2014.

Nair,V. and Hinton,G.(2010a). Rectified linear units improve restricted Boltzmann machines. In ICML'2010.

Nair,V. and Hinton,G. E.(2009). 3D object recognition with deep belief nets. In Y. Bengio,D. Schuurmans,J. D. Lafferty,C. K. I. Williams,and A. Culotta,editors,Advances in Neural Information Processing Systems 22,pages 1339–1347. Curran Associates,Inc.

Nair,V. and Hinton,G. E.(2010b). Rectified linear units improve restricted Boltzmann machines. In L. Bottou and M. Littman,editors,Proceedings of the Twenty-seventh International Conference on Machine Learning(ICML-10),pages 807–814. ACM.

Narayanan,H. and Mitter,S.(2010). Sample complexity of testing the manifold hypothesis. In J. Lafferty,C. K. I. Williams,J. Shawe-Taylor,R. Zemel,and A. Culotta,editors,Advances in Neural Information Processing Systems 23,pages 1786–1794.

Naumann,U.(2008). Optimal Jacobian accumulation is NP-complete. Mathematical Programming,112(2),427–441.

Navigli,R. and Velardi,P.(2005). Structural semantic interconnections: a knowledge-based approach to word sense disambiguation. IEEE Trans. Pattern Analysis and Machine Intelligence,27(7),1075–1086.

Neal,R. and Hinton,G.(1999). A view of the EM algorithm that justifies incremental,sparse,and other variants. In M. I. Jordan,editor,Learning in Graphical Models. MIT Press,Cambridge,MA.

Neal,R. M.(1990). Learning stochastic feedforward networks. Technical report.

Neal,R. M.(1993). Probabilistic inference using Markov chain Monte-Carlo methods. Technical Report CRG-TR-93-1,Dept. of Computer Science,University of Toronto.

Neal,R. M.(1994). Sampling from multimodal distributions using tempered transitions. Technical Report 9421,Dept. of Statistics,University of Toronto.

Neal,R. M.(1996). Bayesian Learning for Neural Networks. Lecture Notes in Statistics. Springer.

Neal,R. M.(2001). Annealed importance sampling. Statistics and Computing,11(2),125–139.

Neal,R. M.(2005). Estimating ratios of normalizing constants using linked importance sampling.

Nesterov,Y.(1983). A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady,27,372–376.

Nesterov,Y.(2004). Introductory lectures on convex optimization: a basic course. Applied optimization. Kluwer Academic Publ.,Boston,Dordrecht,London.

Netzer,Y.,Wang,T.,Coates,A.,Bissacco,A.,Wu,B.,and Ng,A. Y.(2011). Reading digits in natural images with unsupervised feature learning. Deep Learning and Unsupervised Feature Learning Workshop,NIPS.

Ney,H. and Kneser,R.(1993). Improved clustering techniques for class-based statistical language modelling. In European Conference on Speech Communication and Technology(Eurospeech),pages 973–976,Berlin.

Ng,A.(2015). Advice for applying machine learning. https://see.stanford.edu/materials/aimlcs229/ML-advice.pdf.

Niesler,T. R.,Whittaker,E. W. D.,and Woodland,P. C.(1998). Comparison of part-of-speech and automatically derived category-based language models for speech recognition. In International Conference on Acoustics,Speech and Signal Processing(ICASSP),pages 177–180.

Ning,F.,Delhomme,D.,LeCun,Y.,Piano,F.,Bottou,L.,and Barbano,P. E.(2005). Toward automatic phenotyping of developing embryos from videos. Image Processing,IEEE Transactions on,14(9),1360–1371.

Nocedal,J. and Wright,S.(2006). Numerical Optimization. Springer.

Norouzi,M. and Fleet,D. J.(2011). Minimal loss hashing for compact binary codes. In ICML'2011.

Nowlan,S. J.(1990). Competing experts: An experimental investigation of associative mixture models. Technical Report CRG-TR-90-5,University of Toronto.

Nowlan,S. J. and Hinton,G. E.(1992). Adaptive soft weight tying using Gaussian mixtures. In J. Moody,S. Hanson,and R. Lippmann,editors,Advances in Neural Information Processing Systems 4(NIPS'91),pages 993–1000,San Mateo,CA. Morgan Kaufmann.

Olshausen,B. and Field,D. J.(2005). How close are we to understanding V1? Neural Computation,17,1665–1699.

Olshausen,B. A. and Field,D. J.(1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature,381,607–609.

Olshausen,B. A.,Anderson,C. H.,and Van Essen,D. C.(1993). A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. J. Neurosci.,13(11),4700–4719.

Opper,M. and Archambeau,C.(2009). The variational Gaussian approximation revisited. Neural computation,21(3),786–792.

Oquab,M.,Bottou,L.,Laptev,I.,and Sivic,J.(2014). Learning and transferring mid-level image representations using convolutional neural networks. In Computer Vision and Pattern Recognition(CVPR),2014 IEEE Conference on,pages 1717–1724. IEEE.

Osindero,S. and Hinton,G. E.(2008). Modeling image patches with a directed hierarchy of Markov random fields. In J. Platt,D. Koller,Y. Singer,and S. Roweis,editors,Advances in Neural Information Processing Systems 20(NIPS'07),pages 1121–1128,Cambridge,MA. MIT Press.

Ovid and Martin,C.(2004). Metamorphoses. W.W. Norton.

Paccanaro,A. and Hinton,G. E.(2000). Extracting distributed representations of concepts and relations from positive and negative propositions. In International Joint Conference on Neural Networks(IJCNN),Como,Italy. IEEE,New York.

Paine,T. L.,Khorrami,P.,Han,W.,and Huang,T. S.(2014). An analysis of unsupervised pre-training in light of recent advances. arXiv preprint arXiv:1412.6597.

Palatucci,M.,Pomerleau,D.,Hinton,G. E.,and Mitchell,T. M.(2009). Zero-shot learning with semantic output codes. In Y. Bengio,D. Schuurmans,J. D. Lafferty,C. K. I. Williams,and A. Culotta,editors,Advances in Neural Information Processing Systems 22,pages 1410–1418. Curran Associates,Inc.

Parker,D. B.(1985). Learning-logic. Technical Report TR-47,Center for Comp. Research in Economics and Management Sci.,MIT.

Pascanu,R.,Mikolov,T.,and Bengio,Y.(2013a). On the difficulty of training recurrent neural networks. In ICML'2013.

Pascanu,R.,Mikolov,T.,and Bengio,Y.(2013b). On the difficulty of training recurrent neural networks. In ICML'2013.

Pascanu,R.,Gulcehre,C.,Cho,K.,and Bengio,Y.(2014a). How to construct deep recurrent neural networks. In ICLR.

Pascanu,R.,Montufar,G.,and Bengio,Y.(2014b). On the number of inference regions of deep feed forward networks with piece-wise linear activations. In ICLR'2014.

Pati,Y.,Rezaiifar,R.,and Krishnaprasad,P.(1993). Orthogonal matching pursuit:Recursive function approximation with applications to wavelet decomposition. In Proceedings of the 27th Annual Asilomar Conference on Signals,Systems,and Computers,pages 40–44.

Pearl,J.(1985). Bayesian networks: A model of self-activated memory for evidential reasoning. In Proceedings of the 7th Conference of the Cognitive Science Society,University of California,Irvine,pages 329–334.

Pearl,J.(1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann.

Perron,O.(1907). Zur Theorie der Matrices. Mathematische Annalen,64(2),248–263.

Petersen,K. B. and Pedersen,M. S.(2006). The matrix cookbook. Version 20051003.

Peterson,G. B.(2004). A day of great illumination: B. F. Skinner's discovery of shaping. Journal of the Experimental Analysis of Behavior,82(3),317–328.

Pham,D.-T.,Garat,P.,and Jutten,C.(1992). Separation of a mixture of independent sources through a maximum likelihood approach. In EUSIPCO,pages 771–774.

Pham,P.-H.,Jelaca,D.,Farabet,C.,Martini,B.,LeCun,Y.,and Culurciello,E.(2012). NeuFlow: dataflow vision processing system-on-a-chip. In Circuits and Systems(MWSCAS),2012 IEEE 55th International Midwest Symposium on,pages 1044–1047. IEEE.

Pinheiro,P. H. O. and Collobert,R.(2014). Recurrent convolutional neural networks for scene labeling. In ICML'2014.

Pinheiro,P. H. O. and Collobert,R.(2015). From image-level to pixel-level labeling with convolutional networks. In Conference on Computer Vision and Pattern Recognition(CVPR).

Pinto,N.,Cox,D. D.,and DiCarlo,J. J.(2008). Why is real-world visual object recognition hard? PLoS Comput Biol,4.

Pinto,N.,Stone,Z.,Zickler,T.,and Cox,D.(2011). Scaling up biologically-inspired computer vision: A case study in unconstrained face recognition on facebook. In Computer Vision and Pattern Recognition Workshops(CVPRW),2011 IEEE Computer Society Conference on,pages 35–42. IEEE.

Pollack,J. B.(1990). Recursive distributed representations. Artificial Intelligence,46(1),77–105.

Polyak,B. and Juditsky,A.(1992). Acceleration of stochastic approximation by averaging. SIAM J. Control and Optimization,30(4),838–855.

Polyak,B. T.(1964). Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics,4(5),1–17.

Poole,B.,Sohl-Dickstein,J.,and Ganguli,S.(2014). Analyzing noise in autoencoders and deep networks. CoRR,abs/1406.1831.

Poon,H. and Domingos,P.(2011). Sum-product networks for deep learning. In Learning Workshop,Fort Lauderdale,FL.

Presley,R. K. and Haggard,R. L.(1994). A fixed point implementation of the backpropagation learning algorithm. In Southeastcon '94. Creative Technology Transfer-A Global Affair,Proceedings of the 1994 IEEE,pages 136–138. IEEE.

Price,R.(1958). A useful theorem for nonlinear devices having Gaussian inputs. IEEE Transactions on Information Theory,4(2),69–72.

Quiroga,R. Q.,Reddy,L.,Kreiman,G.,Koch,C.,and Fried,I.(2005). Invariant visual representation by single neurons in the human brain. Nature,435(7045),1102–1107.

Radford,A.,Metz,L.,and Chintala,S.(2015). Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.

Raiko,T.,Yao,L.,Cho,K.,and Bengio,Y.(2014). Iterative neural autoregressive distribution estimator(NADE-k). Technical report,arXiv:1406.1485.

Raina,R.,Madhavan,A.,and Ng,A. Y.(2009a). Large-scale deep unsupervised learning using graphics processors. In L. Bottou and M. Littman,editors,Proceedings of the Twenty-sixth International Conference on Machine Learning(ICML'09),pages 873–880,New York,NY,USA. ACM.

Raina,R.,Madhavan,A.,and Ng,A. Y.(2009b). Large-scale deep unsupervised learning using graphics processors. In ICML'2009.

Ramsey,F. P.(1926). Truth and probability. In R. B. Braithwaite,editor,The Foundations of Mathematics and other Logical Essays,chapter 7,pages 156–198. McMaster University Archive for the History of Economic Thought.

Ranzato,M. and Hinton,G. E.(2010). Modeling pixel means and covariances using factorized third-order Boltzmann machines. In CVPR'2010,pages 2551–2558.

Ranzato,M.,Poultney,C.,Chopra,S.,and LeCun,Y.(2007a). Efficient learning of sparse representations with an energy-based model. In NIPS'2006.

Ranzato,M.,Poultney,C.,Chopra,S.,and LeCun,Y.(2007b). Efficient learning of sparse representations with an energy-based model. In B. Schölkopf,J. Platt,and T. Hoffman,editors,Advances in Neural Information Processing Systems 19(NIPS'06),pages 1137–1144. MIT Press.

Ranzato,M.,Huang,F.,Boureau,Y.,and LeCun,Y.(2007c). Unsupervised learning of invariant feature hierarchies with applications to object recognition. In CVPR'07.

Ranzato,M.,Boureau,Y.,and LeCun,Y.(2008). Sparse feature learning for deep belief networks. In NIPS'2007.

Ranzato,M.,Krizhevsky,A.,and Hinton,G. E.(2010a). Factored 3-way restricted Boltzmann machines for modeling natural images. In Proceedings of AISTATS 2010.

Ranzato,M.,Mnih,V.,and Hinton,G.(2010b). Generating more realistic images using gated MRFs. In NIPS'2010.

Rao,C.(1945). Information and the accuracy attainable in the estimation of statistical parameters. Bulletin of the Calcutta Mathematical Society,37,81–89.

Rasmus,A.,Valpola,H.,Honkala,M.,Berglund,M.,and Raiko,T.(2015). Semi-supervised learning with ladder network. arXiv preprint arXiv:1507.02672.

Recht,B.,Re,C.,Wright,S.,and Niu,F.(2011). Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In NIPS'2011.

Reichert,D. P.,Seriès,P.,and Storkey,A. J.(2011). Neuronal adaptation for sampling-based probabilistic inference in perceptual bistability. In Advances in Neural Information Processing Systems,pages 2357–2365.

Rezende,D. J.,Mohamed,S.,and Wierstra,D.(2014). Stochastic backpropagation and approximate inference in deep generative models. In ICML'2014. Preprint:arXiv:1401.4082.

Rifai,S.,Vincent,P.,Muller,X.,Glorot,X.,and Bengio,Y.(2011a). Contractive auto-encoders: Explicit invariance during feature extraction. In ICML'2011.

Rifai,S.,Mesnil,G.,Vincent,P.,Muller,X.,Bengio,Y.,Dauphin,Y.,and Glorot,X.(2011b). Higher order contractive auto-encoder. In ECML PKDD.

Rifai,S.,Dauphin,Y.,Vincent,P.,Bengio,Y.,and Muller,X.(2011c). The manifold tangent classifier. In NIPS'2011.

Rifai,S.,Dauphin,Y.,Vincent,P.,Bengio,Y.,and Muller,X.(2011d). The manifold tangent classifier. In NIPS'2011. Student paper award.

Rifai,S.,Bengio,Y.,Dauphin,Y.,and Vincent,P.(2012). A generative process for sampling contractive auto-encoders. In ICML'2012.

Ringach,D. and Shapley,R.(2004). Reverse correlation in neurophysiology. Cognitive Science,28(2),147–166.

Roberts,S. and Everson,R.(2001). Independent component analysis: principles and practice. Cambridge University Press.

Robinson,A. J. and Fallside,F.(1991). A recurrent error propagation network speech recognition system. Computer Speech and Language,5(3),259–274.

Rockafellar,R. T.(1997). Convex Analysis. Princeton Landmarks in Mathematics. Princeton University Press.

Romero,A.,Ballas,N.,Ebrahimi Kahou,S.,Chassang,A.,Gatta,C.,and Bengio,Y.(2015). FitNets:Hints for thin deep nets. In ICLR'2015,arXiv:1412.6550.

Rosen,J. B.(1960). The gradient projection method for nonlinear programming. Part I: Linear constraints. Journal of the Society for Industrial and Applied Mathematics,8(1),181–217.

Rosenblatt,F.(1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review,65,386–408.

Rosenblatt,F.(1962). Principles of Neurodynamics. Spartan,New York.

Rosenblatt,M.(1956). Remarks on some nonparametric estimates of a density function. The Annals of Mathematical Statistics,27(3),832–837.

Roweis,S. and Saul,L. K.(2000). Nonlinear dimensionality reduction by locally linear embedding. Science,290(5500).

Roweis,S.,Saul,L.,and Hinton,G.(2002). Global coordination of local linear models. In T. Dietterich,S. Becker,and Z. Ghahramani,editors,Advances in Neural Information Processing Systems 14(NIPS'01),Cambridge,MA. MIT Press.

Rubin,D. B. et al.(1984). Bayesianly justifiable and relevant frequency calculations for the applied statistician. The Annals of Statistics,12(4),1151–1172.

Rumelhart,D.,Hinton,G.,and Williams,R.(1986a). Learning representations by back-propagating errors. Nature,323,533–536.

Rumelhart,D. E.,Hinton,G. E.,and Williams,R. J.(1986b). Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland,editors,Parallel Distributed Processing,volume 1,chapter 8,pages 318–362. MIT Press,Cambridge.

Rumelhart,D. E.,Hinton,G. E.,and Williams,R. J.(1986c). Learning representations by back-propagating errors. Nature,323,533–536.

Rumelhart,D. E.,McClelland,J. L.,and the PDP Research Group(1986d). Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press,Cambridge.

Russakovsky,O.,Deng,J.,Su,H.,Krause,J.,Satheesh,S.,Ma,S.,Huang,Z.,Karpathy,A.,Khosla,A.,Bernstein,M.,Berg,A. C.,and Fei-Fei,L.(2014a). ImageNet Large Scale Visual Recognition Challenge.

Russakovsky,O.,Deng,J.,Su,H.,Krause,J.,Satheesh,S.,Ma,S.,Huang,Z.,Karpathy,A.,Khosla,A.,Bernstein,M.,et al.(2014b). ImageNet large scale visual recognition challenge. arXiv preprint arXiv:1409.0575.

Russell,S. J. and Norvig,P.(2003). Artificial Intelligence:A Modern Approach. Prentice Hall.

Rust,N.,Schwartz,O.,Movshon,J. A.,and Simoncelli,E.(2005). Spatiotemporal elements of macaque V1 receptive fields. Neuron,46(6),945–956.

Sainath,T.,Mohamed,A.,Kingsbury,B.,and Ramabhadran,B.(2013). Deep convolutional neural networks for LVCSR. In ICASSP 2013.

Salakhutdinov,R.(2010). Learning in Markov random fields using tempered transitions. In Y. Bengio,D. Schuurmans,C. Williams,J. Lafferty,and A. Culotta,editors,Advances in Neural Information Processing Systems 22(NIPS'09).

Salakhutdinov,R. and Hinton,G.(2009a). Deep Boltzmann machines. In Proceedings of the International Conference on Artificial Intelligence and Statistics,volume 5,pages 448–455.

Salakhutdinov,R. and Hinton,G.(2009b). Semantic hashing. International Journal of Approximate Reasoning.

Salakhutdinov,R. and Hinton,G. E.(2007a). Learning a nonlinear embedding by preserving class neighbourhood structure. In Proceedings of AISTATS-2007.

Salakhutdinov,R. and Hinton,G. E.(2007b). Semantic hashing. In SIGIR'2007.

Salakhutdinov,R. and Hinton,G. E.(2008). Using deep belief nets to learn covariance kernels for Gaussian processes. In J. Platt,D. Koller,Y. Singer,and S. Roweis,editors,Advances in Neural Information Processing Systems 20(NIPS'07),pages 1249–1256,Cambridge,MA. MIT Press.

Salakhutdinov,R. and Larochelle,H.(2010). Efficient learning of deep Boltzmann machines. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics(AISTATS 2010),JMLR W&CP,volume 9,pages 693–700.

Salakhutdinov,R. and Mnih,A.(2008). Probabilistic matrix factorization. In NIPS'2008.

Salakhutdinov,R. and Murray,I.(2008). On the quantitative analysis of deep belief networks. In W. W. Cohen,A. McCallum,and S. T. Roweis,editors,Proceedings of the Twenty-fifth International Conference on Machine Learning(ICML'08),volume 25,pages 872–879. ACM.

Salakhutdinov,R.,Mnih,A.,and Hinton,G.(2007). Restricted Boltzmann machines for collaborative filtering. In ICML'2007.

Sanger,T. D.(1994). Neural network learning control of robot manipulators using gradually increasing task difficulty. IEEE Transactions on Robotics and Automation,10(3).

Saul,L. K. and Jordan,M. I.(1996). Exploiting tractable substructures in intractable networks. In D. Touretzky,M. Mozer,and M. Hasselmo,editors,Advances in Neural Information Processing Systems 8(NIPS'95). MIT Press,Cambridge,MA.

Saul,L. K.,Jaakkola,T.,and Jordan,M. I.(1996). Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research,4,61–76.

Savich,A. W.,Moussa,M.,and Areibi,S.(2007). The impact of arithmetic representation on implementing MLP-BP on FPGAs: A study. Neural Networks,IEEE Transactions on,18(1),240–252.

Saxe,A. M.,Koh,P. W.,Chen,Z.,Bhand,M.,Suresh,B.,and Ng,A.(2011). On random weights and unsupervised feature learning. In Proc. ICML'2011. ACM.

Saxe,A. M.,McClelland,J. L.,and Ganguli,S.(2013). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In ICLR.

Schaul,T.,Antonoglou,I.,and Silver,D.(2014). Unit tests for stochastic optimization. In International Conference on Learning Representations.

Schmidhuber,J.(1992). Learning complex,extended sequences using the principle of history compression. Neural Computation,4(2),234–242.

Schmidhuber,J.(1996). Sequential neural text compression. IEEE Transactions on Neural Networks,7(1),142–146.

Schmidhuber,J.(2012). Self-delimiting neural networks. arXiv preprint arXiv:1210.0118.

Schölkopf,B. and Smola,A. J.(2002). Learning with kernels: Support vector machines,regularization,optimization,and beyond. MIT press.

Schölkopf,B.,Burges,C. J. C.,and Smola,A. J.(1998a). Advances in kernel methods: support vector learning. MIT Press,Cambridge,MA.

Schölkopf,B.,Smola,A.,and Müller,K.-R.(1998b). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation,10,1299–1319.

Schölkopf,B.,Burges,C. J. C.,and Smola,A. J.(1999). Advances in Kernel Methods—Support Vector Learning. MIT Press,Cambridge,MA.

Schölkopf,B.,Janzing,D.,Peters,J.,Sgouritsa,E.,Zhang,K.,and Mooij,J.(2012). On causal and anticausal learning. In ICML'2012,pages 1255–1262.

Schuster,M.(1999). On supervised learning from sequential data with applications for speech recognition.

Schuster,M. and Paliwal,K.(1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing,45(11),2673–2681.

Schwenk,H.(2007). Continuous space language models. Computer speech and language,21,492–518.

Schwenk,H.(2010). Continuous space language models for statistical machine translation. The Prague Bulletin of Mathematical Linguistics,93,137–146.

Schwenk,H.(2014). Cleaned subset of WMT '14 dataset.

Schwenk,H. and Bengio,Y.(1998). Training methods for adaptive boosting of neural networks. In M. Jordan,M. Kearns,and S. Solla,editors,Advances in Neural Information Processing Systems 10(NIPS'97),pages 647–653. MIT Press.

Schwenk,H. and Gauvain,J.-L.(2002). Connectionist language modeling for large vocabulary continuous speech recognition. In International Conference on Acoustics,Speech and Signal Processing(ICASSP),pages 765–768,Orlando,Florida.

Schwenk,H.,Costa-jussà,M. R.,and Fonollosa,J. A. R.(2006). Continuous space language models for the IWSLT 2006 task. In International Workshop on Spoken Language Translation,pages 166–173.

Seide,F.,Li,G.,and Yu,D.(2011). Conversational speech transcription using context-dependent deep neural networks. In Interspeech 2011,pages 437–440.

Sejnowski,T.(1987). Higher-order Boltzmann machines. In AIP Conference Proceedings 151 on Neural Networks for Computing,pages 398–403. American Institute of Physics Inc.

Seriès,P.,Reichert,D. P.,and Storkey,A. J.(2010). Hallucinations in Charles Bonnet syndrome induced by homeostasis: a deep Boltzmann machine model. In Advances in Neural Information Processing Systems,pages 2020–2028.

Sermanet,P.,Chintala,S.,and LeCun,Y.(2012). Convolutional neural networks applied to house numbers digit classification. In International Conference on Pattern Recognition(ICPR 2012).

Sermanet,P.,Kavukcuoglu,K.,Chintala,S.,and LeCun,Y.(2013). Pedestrian detection with unsupervised multi-stage feature learning. In Proc. International Conference on Computer Vision and Pattern Recognition(CVPR'13). IEEE.

Shilov,G.(1977). Linear Algebra. Dover Books on Mathematics Series. Dover Publications.

Siegelmann,H.(1995). Computation beyond the Turing limit. Science,268(5210),545–548.

Siegelmann,H. and Sontag,E.(1991). Turing computability with neural nets. Applied Mathematics Letters,4(6),77–80.

Siegelmann,H. T. and Sontag,E. D.(1995). On the computational power of neural nets. Journal of Computer and Systems Sciences,50(1),132–150.

Sietsma,J. and Dow,R.(1991). Creating artificial neural networks that generalize. Neural Networks,4(1),67–79.

Simard,P. Y.,Steinkraus,D.,and Platt,J. C.(2003). Best practices for convolutional neural networks. In ICDAR'2003.

Simard,P. and Graf,H. P.(1994). Backpropagation without multiplication. In Advances in Neural Information Processing Systems,pages 232–239.

Simard,P.,Victorri,B.,LeCun,Y.,and Denker,J.(1992). Tangent prop - A formalism for specifying selected invariances in an adaptive network. In NIPS'1991.

Simard,P. Y.,LeCun,Y.,and Denker,J.(1993). Efficient pattern recognition using a new transformation distance. In NIPS'92.

Simard,P. Y.,LeCun,Y. A.,Denker,J. S.,and Victorri,B.(1998). Transformation invariance in pattern recognition—tangent distance and tangent propagation. Lecture Notes in Computer Science,1524.

Simons,D. J. and Levin,D. T.(1998). Failure to detect changes to people during a real-world interaction. Psychonomic Bulletin & Review,5(4),644–649.

Simonyan,K. and Zisserman,A.(2015). Very deep convolutional networks for large-scale image recognition. In ICLR.

Sjöberg,J. and Ljung,L.(1995). Overtraining,regularization and searching for a minimum,with application to neural networks. International Journal of Control,62(6),1391–1407.

Skinner,B. F.(1958). Reinforcement today. American Psychologist,13,94–99.

Smolensky,P.(1986). Information processing in dynamical systems: Foundations of harmony theory. In D. E. Rumelhart and J. L. McClelland,editors,Parallel Distributed Processing,volume 1,chapter 6,pages 194–281. MIT Press,Cambridge.

Snoek,J.,Larochelle,H.,and Adams,R. P.(2012). Practical Bayesian optimization of machine learning algorithms. In NIPS'2012.

Socher,R.,Huang,E. H.,Pennington,J.,Ng,A. Y.,and Manning,C. D.(2011a). Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In NIPS'2011.

Socher,R.,Manning,C.,and Ng,A. Y.(2011b). Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the Twenty-Eighth International Conference on Machine Learning(ICML'2011).

Socher,R.,Pennington,J.,Huang,E. H.,Ng,A. Y.,and Manning,C. D.(2011c). Semi-supervised recursive autoencoders for predicting sentiment distributions. In EMNLP'2011.

Socher,R.,Perelygin,A.,Wu,J. Y.,Chuang,J.,Manning,C. D.,Ng,A. Y.,and Potts,C.(2013a). Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP'2013.

Socher,R.,Ganjoo,M.,Manning,C. D.,and Ng,A. Y.(2013b). Zero-shot learning through cross-modal transfer. In 27th Annual Conference on Neural Information Processing Systems(NIPS 2013).

Sohl-Dickstein,J.,Weiss,E. A.,Maheswaranathan,N.,and Ganguli,S.(2015). Deep unsupervised learning using nonequilibrium thermodynamics.

Sohn,K.,Zhou,G.,and Lee,H.(2013). Learning and selecting features jointly with point-wise gated Boltzmann machines. In ICML'2013.

Solomonoff,R. J.(1989). A system for incremental learning based on algorithmic probability.

Sontag,E. D.(1998). VC dimension of neural networks. NATO ASI Series F Computer and Systems Sciences,168,69–96.

Sontag,E. D. and Sussman,H. J.(1989). Backpropagation can give rise to spurious local minima even for networks without hidden layers. Complex Systems,3,91–106.

Sparkes,B.(1996). The Red and the Black: Studies in Greek Pottery. Routledge.

Spitkovsky,V. I.,Alshawi,H.,and Jurafsky,D.(2010). From baby steps to leapfrog: how“less is more”in unsupervised dependency parsing. In HLT'10.

Squire,W. and Trapp,G.(1998). Using complex variables to estimate derivatives of real functions. SIAM Rev.,40(1),110–112.

Srebro,N. and Shraibman,A.(2005). Rank,trace-norm and max-norm. In Proceedings of the 18th Annual Conference on Learning Theory,pages 545–560. Springer-Verlag.

Srivastava,N.(2013). Improving Neural Networks With Dropout. Master's thesis,U. Toronto.

Srivastava,N. and Salakhutdinov,R.(2012). Multimodal learning with deep Boltzmann machines. In NIPS'2012.

Srivastava,N.,Salakhutdinov,R. R.,and Hinton,G. E.(2013). Modeling documents with deep Boltzmann machines. arXiv preprint arXiv:1309.6865.

Srivastava,N.,Hinton,G.,Krizhevsky,A.,Sutskever,I.,and Salakhutdinov,R.(2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research,15,1929–1958.

Srivastava,R. K.,Greff,K.,and Schmidhuber,J.(2015). Highway networks. arXiv:1505.00387.

Steinkraus,D.,Simard,P. Y.,and Buck,I.(2005). Using GPUs for machine learning algorithms. In Proceedings of the Eighth International Conference on Document Analysis and Recognition(ICDAR 2005),pages 1115–1119.

Stoyanov,V.,Ropson,A.,and Eisner,J.(2011). Empirical risk minimization of graphical model parameters given approximate inference,decoding,and model structure. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics(AISTATS),volume 15 of JMLR Workshop and Conference Proceedings,pages 725–733,Fort Lauderdale. Supplementary material(4 pages) also available.

Sukhbaatar,S.,Szlam,A.,Weston,J.,and Fergus,R.(2015). Weakly supervised memory networks. arXiv preprint arXiv:1503.08895.

Supancic,J. and Ramanan,D.(2013). Self-paced learning for long-term tracking. In CVPR'2013.

Sussillo,D.(2014). Random walks:Training very deep nonlinear feed-forward networks with smart initialization. CoRR,abs/1412.6558.

Sutskever,I.(2012). Training Recurrent Neural Networks. Ph.D. thesis,Department of computer science,University of Toronto.

Sutskever,I. and Hinton,G. E.(2008). Deep narrow sigmoid belief networks are universal approximators. Neural Computation,20(11),2629–2636.

Sutskever,I. and Tieleman,T.(2010). On the Convergence Properties of Contrastive Divergence. In AISTATS'2010.

Sutskever,I.,Hinton,G.,and Taylor,G.(2009). The recurrent temporal restricted Boltzmann machine. In NIPS'2008.

Sutskever,I.,Martens,J.,and Hinton,G. E.(2011). Generating text with recurrent neural networks. In ICML'2011,pages 1017–1024.

Sutskever,I.,Martens,J.,Dahl,G.,and Hinton,G.(2013). On the importance of initialization and momentum in deep learning. In ICML.

Sutskever,I.,Vinyals,O.,and Le,Q. V.(2014). Sequence to sequence learning with neural networks. In NIPS'2014,arXiv:1409.3215.

Sutton,R. and Barto,A.(1998). Reinforcement Learning: An Introduction. MIT Press.

Sutton,R. S.,McAllester,D.,Singh,S.,and Mansour,Y.(2000). Policy gradient methods for reinforcement learning with function approximation. In NIPS'1999,pages 1057–1063. MIT Press.

Swersky,K.,Ranzato,M.,Buchman,D.,Marlin,B.,and de Freitas,N.(2011). On autoencoders and score matching for energy based models. In ICML'2011. ACM.

Swersky,K.,Snoek,J.,and Adams,R. P.(2014). Freeze-thaw Bayesian optimization. arXiv preprint arXiv:1406.3896.

Szegedy,C.,Liu,W.,Jia,Y.,Sermanet,P.,Reed,S.,Anguelov,D.,Erhan,D.,Vanhoucke,V.,and Rabinovich,A.(2014a). Going deeper with convolutions. Technical report,arXiv:1409.4842.

Szegedy,C.,Zaremba,W.,Sutskever,I.,Bruna,J.,Erhan,D.,Goodfellow,I. J.,and Fergus,R.(2014b). Intriguing properties of neural networks. ICLR,abs/1312.6199.

Szegedy,C.,Vanhoucke,V.,Ioffe,S.,Shlens,J.,and Wojna,Z.(2015). Rethinking the Inception architecture for computer vision. arXiv:1512.00567.

Taigman,Y.,Yang,M.,Ranzato,M.,and Wolf,L.(2014). DeepFace: Closing the gap to human-level performance in face verification. In CVPR'2014.

Tandy,D. W.(1997). Works and Days: A Translation and Commentary for the Social Sciences. University of California Press.

Tang,Y. and Eliasmith,C.(2010). Deep networks for robust visual recognition. In Proceedings of the 27th International Conference on Machine Learning,June 21-24,2010,Haifa,Israel.

Tang,Y.,Salakhutdinov,R.,and Hinton,G.(2012). Deep mixtures of factor analysers. arXiv preprint arXiv:1206.4635.

Taylor,G. and Hinton,G.(2009). Factored conditional restricted Boltzmann machines for modeling motion style. In L. Bottou and M. Littman,editors,Proceedings of the Twenty-sixth International Conference on Machine Learning(ICML'09),pages 1025–1032,Montreal,Quebec,Canada. ACM.

Taylor,G.,Hinton,G. E.,and Roweis,S.(2007). Modeling human motion using binary latent variables. In B. Schölkopf,J. Platt,and T. Hoffman,editors,Advances in Neural Information Processing Systems 19(NIPS'06),pages 1345–1352. MIT Press,Cambridge,MA.

Teh,Y.,Welling,M.,Osindero,S.,and Hinton,G. E.(2003). Energy-based models for sparse overcomplete representations. Journal of Machine Learning Research,4,1235–1260.

Tenenbaum,J.,de Silva,V.,and Langford,J. C.(2000). A global geometric framework for nonlinear dimensionality reduction. Science,290(5500),2319–2323.

Theis,L.,van den Oord,A.,and Bethge,M.(2015). A note on the evaluation of generative models. arXiv:1511.01844.

Tompson,J.,Jain,A.,LeCun,Y.,and Bregler,C.(2014). Joint training of a convolutional network and a graphical model for human pose estimation. In NIPS'2014.

Thrun,S.(1995). Learning to play the game of chess. In NIPS'1994.

Tibshirani,R. J.(1995). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society B,58,267–288.

Tieleman,T.(2008). Training restricted Boltzmann machines using approximations to the likelihood gradient. In ICML'2008,pages 1064–1071.

Tieleman,T. and Hinton,G.(2009). Using fast weights to improve persistent contrastive divergence. In ICML'2009.

Tipping,M. E. and Bishop,C. M.(1999). Probabilistic principal components analysis. Journal of the Royal Statistical Society B,61(3),611–622.

Torralba,A.,Fergus,R.,and Weiss,Y.(2008). Small codes and large databases for recognition. In Proceedings of the Computer Vision and Pattern Recognition Conference(CVPR'08),pages 1–8.

Touretzky,D. S. and Hinton,G. E.(1985). Symbols among the neurons: Details of a connectionist inference architecture. In Proceedings of the 9th International Joint Conference on Artificial Intelligence - Volume 1,IJCAI'85,pages 238–243,San Francisco,CA,USA. Morgan Kaufmann Publishers Inc.

Tu,K. and Honavar,V.(2011). On the utility of curricula in unsupervised learning of probabilistic grammars. In IJCAI'2011.

Turaga,S. C.,Murray,J. F.,Jain,V.,Roth,F.,Helmstaedter,M.,Briggman,K.,Denk,W.,and Seung,H. S.(2010). Convolutional networks can learn to generate affinity graphs for image segmentation. Neural Computation,22,511–538.

Turian,J.,Ratinov,L.,and Bengio,Y.(2010). Word representations: A simple and general method for semi-supervised learning. In Proc. ACL'2010,pages 384–394.

Töscher,A.,Jahrer,M.,and Bell,R. M.(2009). The BigChaos solution to the Netflix grand prize.

Uria,B.,Murray,I.,and Larochelle,H.(2013). RNADE: The real-valued neural autoregressive density-estimator. In NIPS'2013.

van den Oord,A.,Dieleman,S.,and Schrauwen,B.(2013). Deep content-based music recommendation. In NIPS'2013.

van der Maaten,L. and Hinton,G. E.(2008). Visualizing data using t-SNE. J. Machine Learning Res.,9.

Vanhoucke,V.,Senior,A.,and Mao,M. Z.(2011). Improving the speed of neural networks on CPUs. In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop.

Vapnik,V. N.(1982). Estimation of Dependences Based on Empirical Data. Springer-Verlag,Berlin.

Vapnik,V. N.(1995). The Nature of Statistical Learning Theory. Springer,New York.

Vapnik,V. N. and Chervonenkis,A. Y.(1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications,16,264–280.

Vincent,P.(2011). A connection between score matching and denoising autoencoders. Neural Computation,23(7).

Vincent,P. and Bengio,Y.(2003). Manifold Parzen windows. In NIPS'2002. MIT Press.

Vincent,P.,Larochelle,H.,Bengio,Y.,and Manzagol,P.-A.(2008a). Extracting and composing robust features with denoising autoencoders. In ICML'2008,pages 1096–1103.

Vincent,P.,Larochelle,H.,Bengio,Y.,and Manzagol,P.-A.(2008b). Extracting and composing robust features with denoising autoencoders. In ICML 2008.

Vincent,P.,Larochelle,H.,Lajoie,I.,Bengio,Y.,and Manzagol,P.-A.(2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Machine Learning Res.,11.

Vincent,P.,de Brébisson,A.,and Bouthillier,X.(2015). Efficient exact gradient update for training deep networks with very large sparse targets. In C. Cortes,N. D. Lawrence,D. D. Lee,M. Sugiyama,and R. Garnett,editors,Advances in Neural Information Processing Systems 28,pages 1108–1116. Curran Associates,Inc.

Vinyals,O.,Kaiser,L.,Koo,T.,Petrov,S.,Sutskever,I.,and Hinton,G.(2014a). Grammar as a foreign language. arXiv preprint arXiv:1412.7449.

Vinyals,O.,Toshev,A.,Bengio,S.,and Erhan,D.(2014b). Show and tell:a neural image caption generator. arXiv:1411.4555.

Vinyals,O.,Fortunato,M.,and Jaitly,N.(2015a). Pointer networks. arXiv preprint arXiv:1506.03134.

Vinyals,O.,Toshev,A.,Bengio,S.,and Erhan,D.(2015b). Show and tell:a neural image caption generator. In CVPR'2015. arXiv:1411.4555.

Viola,P. and Jones,M.(2001). Robust real-time object detection. International Journal of Computer Vision.

Visin,F.,Kastner,K.,Cho,K.,Matteucci,M.,Courville,A.,and Bengio,Y.(2015). ReNet: A recurrent neural network based alternative to convolutional networks. arXiv preprint arXiv:1505.00393.

Von Melchner,L.,Pallas,S. L.,and Sur,M.(2000). Visual behaviour mediated by retinal projections directed to the auditory pathway. Nature,404(6780),871–876.

Wager,S.,Wang,S.,and Liang,P.(2013). Dropout training as adaptive regularization. In Advances in Neural Information Processing Systems 26,pages 351–359.

Waibel,A.,Hanazawa,T.,Hinton,G. E.,Shikano,K.,and Lang,K.(1989). Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics,Speech,and Signal Processing,37,328–339.

Wan,L.,Zeiler,M.,Zhang,S.,LeCun,Y.,and Fergus,R.(2013). Regularization of neural networks using dropconnect. In ICML'2013.

Wang,S. and Manning,C.(2013). Fast dropout training. In ICML'2013.

Wang,Z.,Zhang,J.,Feng,J.,and Chen,Z.(2014a). Knowledge graph and text jointly embedding. In Proc. EMNLP'2014.

Wang,Z.,Zhang,J.,Feng,J.,and Chen,Z.(2014b). Knowledge graph embedding by translating on hyperplanes. In Proc. AAAI'2014.

Warde-Farley,D.,Goodfellow,I. J.,Courville,A.,and Bengio,Y.(2014). An empirical analysis of dropout in piecewise linear networks. In ICLR'2014.

Wawrzynek,J.,Asanovic,K.,Kingsbury,B.,Johnson,D.,Beck,J.,and Morgan,N.(1996). Spert-II: A vector microprocessor system. Computer,29(3),79–86.

Weaver,L. and Tao,N.(2001). The optimal reward baseline for gradient-based reinforcement learning. In Proc. UAI'2001,pages 538–545.

Weinberger,K. Q. and Saul,L. K.(2004a). Unsupervised learning of image manifolds by semidefinite programming. In Proceedings of the Computer Vision and Pattern Recognition Conference(CVPR'04),volume 2,pages 988–995,Washington D.C.

Weinberger,K. Q. and Saul,L. K.(2004b). Unsupervised learning of image manifolds by semidefinite programming. In CVPR'2004,pages 988–995.

Weiss,Y.,Torralba,A.,and Fergus,R.(2008). Spectral hashing. In NIPS,pages 1753–1760.

Welling,M.,Zemel,R. S.,and Hinton,G. E.(2002). Self supervised boosting. In Advances in Neural Information Processing Systems,pages 665–672.

Welling,M.,Hinton,G. E.,and Osindero,S.(2003a). Learning sparse topographic representations with products of Student-t distributions. In NIPS'2002.

Welling,M.,Zemel,R.,and Hinton,G. E.(2003b). Self-supervised boosting. In S. Becker,S. Thrun,and K. Obermayer,editors,Advances in Neural Information Processing Systems 15(NIPS'02),pages 665–672. MIT Press.

Welling,M.,Rosen-Zvi,M.,and Hinton,G. E.(2005). Exponential family harmoniums with an application to information retrieval. In L. Saul,Y. Weiss,and L. Bottou,editors,Advances in Neural Information Processing Systems 17(NIPS'04),volume 17,Cambridge,MA. MIT Press.

Werbos,P. J.(1981). Applications of advances in nonlinear sensitivity analysis. In Proceedings of the 10th IFIP Conference,31.8-4.9,NYC,pages 762–770.

Weston,J.,Bengio,S.,and Usunier,N.(2010). Large scale image annotation: learning to rank with joint word-image embeddings. Machine Learning,81(1),21–35.

Weston,J.,Chopra,S.,and Bordes,A.(2014). Memory networks. arXiv preprint arXiv:1410.3916.

Widrow,B. and Hoff,M. E.(1960). Adaptive switching circuits. In 1960 IRE WESCON Convention Record,volume 4,pages 96–104. IRE,New York.

Wikipedia(2015). List of animals by number of neurons—Wikipedia,the free encyclopedia. [Online;accessed 4-March-2015].

Williams,C. K. I. and Agakov,F. V.(2002). Products of Gaussians and Probabilistic Minor Component Analysis. Neural Computation,14(5),1169–1182.

Williams,C. K. I. and Rasmussen,C. E.(1996). Gaussian processes for regression. In D. Touretzky,M. Mozer,and M. Hasselmo,editors,Advances in Neural Information Processing Systems 8(NIPS'95),pages 514–520. MIT Press,Cambridge,MA.

Williams,R. J.(1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning,8,229–256.

Williams,R. J. and Zipser,D.(1989). A learning algorithm for continually running fully recurrent neural networks. Neural Computation,1,270–280.

Wilson,D. R. and Martinez,T. R.(2003). The general inefficiency of batch training for gradient descent learning. Neural Networks,16(10),1429–1451.

Wilson,J. R.(1984). Variance reduction techniques for digital simulation. American Journal of Mathematical and Management Sciences,4(3),277–312.

Wiskott,L. and Sejnowski,T. J.(2002). Slow feature analysis: Unsupervised learning of invariances. Neural Computation,14(4),715–770.

Wolpert,D. and Macready,W.(1997). No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation,1,67–82.

Wolpert,D. H.(1996). The lack of a priori distinction between learning algorithms. Neural Computation,8(7),1341–1390.

Wu,R.,Yan,S.,Shan,Y.,Dang,Q.,and Sun,G.(2015). Deep image: Scaling up image recognition. arXiv:1501.02876.

Wu,Z.(1997). Global continuation for distance geometry problems. SIAM Journal of Optimization,7,814–836.

Xiong,H. Y.,Barash,Y.,and Frey,B. J.(2011). Bayesian prediction of tissue-regulated splicing using RNA sequence and cellular context. Bioinformatics,27(18),2554–2562.

Xu,K.,Ba,J. L.,Kiros,R.,Cho,K.,Courville,A.,Salakhutdinov,R.,Zemel,R. S.,and Bengio,Y.(2015). Show,attend and tell: Neural image caption generation with visual attention. In ICML'2015,arXiv:1502.03044.

Yildiz,I. B.,Jaeger,H.,and Kiebel,S. J.(2012). Re-visiting the echo state property. Neural networks,35,1–9.

Yosinski,J.,Clune,J.,Bengio,Y.,and Lipson,H.(2014). How transferable are features in deep neural networks? In NIPS 27,pages 3320–3328. Curran Associates,Inc.

Younes,L.(1998). On the convergence of Markovian stochastic algorithms with rapidly decreasing ergodicity rates. In Stochastics and Stochastics Models,pages 177–228.

Yu,D.,Wang,S.,and Deng,L.(2010). Sequential labeling using deep-structured conditional random fields. IEEE Journal of Selected Topics in Signal Processing.

Zaremba,W. and Sutskever,I.(2014). Learning to execute. arXiv:1410.4615.

Zaremba,W. and Sutskever,I.(2015). Reinforcement learning neural Turing machines. arXiv:1505.00521.

Zaslavsky,T.(1975). Facing Up to Arrangements: Face-Count Formulas for Partitions of Space by Hyperplanes. Number no. 154 in Memoirs of the American Mathematical Society. American Mathematical Society.

Zeiler,M. D. and Fergus,R.(2014). Visualizing and understanding convolutional networks. In ECCV'14.

Zeiler,M. D.,Ranzato,M.,Monga,R.,Mao,M.,Yang,K.,Le,Q.,Nguyen,P.,Senior,A.,Vanhoucke,V.,Dean,J.,and Hinton,G. E.(2013). On rectified linear units for speech processing. In ICASSP 2013.

Zhou,B.,Khosla,A.,Lapedriza,A.,Oliva,A.,and Torralba,A.(2015). Object detectors emerge in deep scene CNNs. ICLR'2015,arXiv:1412.6856.

Zhou,J. and Troyanskaya,O. G.(2014). Deep supervised and convolutional generative stochastic network for protein secondary structure prediction. In ICML'2014.

Zhou,Y. and Chellappa,R.(1988). Computation of optical flow using a neural network. In Neural Networks,1988,IEEE International Conference on,pages 71–78. IEEE.

Zöhrer,M. and Pernkopf,F.(2014). General stochastic networks for classification. In NIPS'2014.

索引

绝对值整流absolute value rectification

准确率accuracy

声学acoustic

激活函数activation function

AdaGrad AdaGrad

对抗adversarial

对抗样本adversarial example

对抗训练adversarial training

几乎处处almost everywhere

几乎必然almost sure

几乎必然收敛almost sure convergence

选择性剪接数据集alternative splicing dataset

原始采样ancestral sampling

退火重要采样annealed importance sampling

专用集成电路application-specific integrated circuit

近似贝叶斯计算approximate Bayesian computation

近似推断approximate inference

架构architecture

人工智能artificial intelligence

人工神经网络artificial neural network

渐近无偏asymptotically unbiased

异步随机梯度下降Asynchronous Stochastic Gradient Descent

异步asynchronous

注意力机制attention mechanism

属性attribute

自编码器autoencoder

自动微分automatic differentiation

自动语音识别Automatic Speech Recognition

自回归网络auto-regressive network

反向传播back propagation

回退back-off

反向传播backprop

通过时间反向传播back-propagation through time

词袋bag of words

Bagging bootstrap aggregating

bandit bandit

批量batch

批标准化batch normalization

贝叶斯误差Bayes error

贝叶斯规则Bayes' rule

贝叶斯推断Bayesian inference

贝叶斯网络Bayesian network

贝叶斯概率Bayesian probability

贝叶斯统计Bayesian statistics

基准benchmark

信念网络belief network

Bernoulli分布Bernoulli distribution

基准baseline

BFGS BFGS

偏置bias in affine function

偏差bias in statistics

有偏biased

有偏重要采样biased importance sampling

偏差bias

二元语法bigram

二元关系binary relation

二值稀疏编码binary sparse coding

比特bit

块坐标下降block coordinate descent

块吉布斯采样block Gibbs Sampling

玻尔兹曼分布Boltzmann distribution

玻尔兹曼机Boltzmann Machine

Boosting Boosting

桥式采样bridge sampling

广播broadcasting

磨合Burning-in

变分法calculus of variations

容量capacity

级联cascade

灾难遗忘catastrophic forgetting

范畴分布categorical distribution

因果因子causal factor

因果模型causal modeling

中心差分centered difference

中心极限定理central limit theorem

链式法则chain rule

混沌chaos

弦chord

弦图chordal graph

梯度截断clip gradient

截断梯度clipping the gradient

团clique

团势能clique potential

闭式解closed form solution

级联coalesced

编码code

协同过滤collaborative filtering

列column

列空间column space

共因common cause

完全图complete graph

复杂细胞complex cell

计算图computational graph

计算机视觉Computer Vision

概念漂移concept drift

条件计算conditional computation

条件概率conditional probability

条件独立的conditionally independent

共轭conjugate

共轭方向conjugate directions

共轭梯度conjugate gradient

联结主义connectionism

一致性consistency

约束优化constrained optimization

特定环境下的独立context-specific independences

contextual bandit contextual bandit

延拓法continuation method

收缩contractive

收缩自编码器contractive autoencoder

对比散度contrastive divergence

凸优化Convex optimization

卷积convolution

卷积玻尔兹曼机Convolutional Boltzmann Machine

卷积网络convolutional net

卷积神经网络convolutional neural network

坐标上升coordinate ascent

坐标下降coordinate descent

共父coparent

相关系数correlation

代价cost

代价函数cost function

协方差covariance

协方差矩阵covariance matrix

协方差RBM covariance RBM

覆盖coverage

准则criterion

临界点critical point

临界温度critical temperatures

互相关函数cross-correlation

交叉熵cross-entropy

累积函数cumulative function

课程学习curriculum learning

维数灾难curse of dimensionality

曲率curvature

控制论cybernetics

衰减damping

数据生成分布data generating distribution

数据生成过程data generating process

数据并行data parallelism

数据点data point

数据集dataset

数据集增强dataset augmentation

决策树decision tree

解码器decoder

分解decompose

深度信念网络deep belief network

深度玻尔兹曼机Deep Boltzmann Machine

深度回路deep circuit

深度前馈网络deep feedforward network

深度生成模型deep generative model

深度学习deep learning

深度模型deep model

深度网络deep network

点积dot product

双反向传播double backprop

双重分块循环矩阵doubly block circulant matrix

降采样downsampling

Dropout Dropout

Dropout Boosting Dropout Boosting

d-分离d-separation

动态规划dynamic programming

动态结构dynamic structure

提前终止early stopping

回声状态网络echo state network

有效容量effective capacity

特征分解eigendecomposition

特征值eigenvalue

特征向量eigenvector

基本单位向量elementary basis vectors

元素对应乘积element-wise product

嵌入embedding

经验分布empirical distribution

经验频率empirical frequency

经验风险empirical risk

经验风险最小化empirical risk minimization

编码器encoder

端到端的end-to-end

能量函数energy function

基于能量的模型Energy-based model

集成ensemble

集成学习ensemble learning

轮epoch

轮数epochs

等式约束equality constraint

均衡分布Equilibrium Distribution

等变equivariance

等变表示equivariant representations

误差条error bar

误差函数error function

误差度量error metric

错误率error rate

估计量estimator

欧几里得范数Euclidean norm

欧拉-拉格朗日方程Euler-Lagrange Equation

证据下界evidence lower bound

样本example

额外误差excess error

期望expectation

期望最大化expectation maximization

E步expectation step

期望值expected value

经验experience,E

专家网络expert network

相消解释explaining away

相消解释作用explaining away effect

解释因子explanatory factor

梯度爆炸exploding gradient

开发exploitation

探索exploration

指数分布exponential distribution

因子factor

因子分析factor analysis

因子图factor graph

因子factorial

分解factorization

分解的factorized

变差因素factors of variation

快速Dropout fast dropout

快速持续性对比散度fast persistent contrastive divergence

可行feasible

特征feature

特征提取器feature extractor

特征映射feature map

特征选择feature selection

反馈feedback

前向feedforward

前馈分类器feedforward classifier

前馈网络feedforward network

前馈神经网络feedforward neural network

现场可编程门阵列field programmable gate array

精调fine-tune

精调fine-tuning

有限差分finite difference

第一层first layer

不动点方程fixed point equation

定点运算fixed-point arithmetic

翻转flip

浮点运算floating-point arithmetic

遗忘门forget gate

前向传播forward propagation

傅里叶变换Fourier transform

中央凹fovea

自由能free energy

频率派概率frequentist probability

频率派统计frequentist statistics

Frobenius范数Frobenius norm

F分数F-score

全full

泛函functional

泛函导数functional derivative

Gabor函数Gabor function

Gamma分布Gamma distribution

门控gated

门控循环网络gated recurrent net

门控循环单元gated recurrent unit

门控RNN gated RNN

选通器gater

高斯分布Gaussian distribution

高斯核Gaussian kernel

高斯混合模型Gaussian Mixture Model

高斯混合体Gaussian mixtures

高斯输出分布Gaussian output distribution

高斯RBM Gaussian RBM

Gaussian-Bernoulli RBM Gaussian-Bernoulli RBM

通用GPU general purpose GPU

泛化generalization

泛化误差generalization error

广义函数generalized function

广义Lagrange函数generalized Lagrange function

广义Lagrangian generalized Lagrangian

广义伪似然generalized pseudolikelihood

广义伪似然估计generalized pseudolikelihood estimator

广义得分匹配generalized score matching

生成式对抗框架generative adversarial framework

生成式对抗网络generative adversarial network

生成模型generative model

生成式建模generative modeling

生成矩匹配网络generative moment matching network

生成随机网络generative stochastic network

生成器网络generator network

吉布斯分布Gibbs distribution

Gibbs采样Gibbs Sampling

吉布斯步数Gibbs steps

全局对比度归一化Global contrast normalization

全局极小值global minima

全局最小点global minimum

梯度gradient

梯度上升gradient ascent

梯度截断gradient clipping

梯度下降gradient descent

图模型graphical model

图形处理器Graphics Processing Unit

贪心greedy

贪心算法greedy algorithm

贪心逐层预训练greedy layer-wise pretraining

贪心逐层训练greedy layer-wise training

贪心逐层无监督预训练greedy layer-wise unsupervised pretraining

贪心监督预训练greedy supervised pretraining

贪心无监督预训练greedy unsupervised pretraining

网格搜索grid search

Hadamard乘积Hadamard product

汉明距离Hamming distance

硬专家混合体hard mixture of experts

硬双曲正切函数hard tanh

簧风琴harmonium

哈里斯链Harris Chain

Helmholtz机Helmholtz machine

Hessian Hessian

异方差heteroscedastic

隐藏层hidden layer

隐马尔可夫模型Hidden Markov Model

隐藏单元hidden unit

隐藏变量hidden variable

爬山hill climbing

超参数hyperparameter

超参数优化hyperparameter optimization

假设空间hypothesis space

同分布的identically distributed

可辨认的identifiable

单位矩阵identity matrix

独立同分布假设i.i.d. assumption

病态ill conditioning

不道德immorality

重要采样Importance Sampling

相互独立的independent

独立成分分析independent component analysis

独立同分布independent identically distributed

独立子空间分析independent subspace analysis

索引index of matrix

不等式约束inequality constraint

推断inference

无限infinite

信息检索information retrieval

内积inner product

输入input

输入分布input distribution

干预查询intervention query

不变invariant

求逆invert

Isomap Isomap

各向同性isotropic

Jacobian Jacobian

Jacobian矩阵Jacobian matrix

联合概率分布joint probability distribution

Karush-Kuhn-Tucker Karush-Kuhn-Tucker

核函数kernel function

核机器kernel machine

核方法kernel method

核技巧kernel trick

KL散度KL divergence

知识库knowledge base

知识图谱knowledge graph

Krylov方法Krylov method

KL散度Kullback-Leibler(KL) divergence

标签label

标注labeled

拉格朗日乘子Lagrange multiplier

语言模型language model

Laplace分布Laplace distribution

大学习步骤large learning step

潜在latent

潜层latent layer

潜变量latent variable

大数定理Law of large number

逐层的layer-wise

L-BFGS L-BFGS

渗漏整流线性单元Leaky ReLU

渗漏单元leaky unit

学成learned

学习近似推断learned approximate inference

学习器learner

学习率learning rate

勒贝格可积Lebesgue-integrable

左特征向量left eigenvector

左奇异向量left singular vector

莱布尼兹法则Leibniz's rule

似然likelihood

线搜索line search

线性自回归网络linear auto-regressive network

线性分类器linear classifier

线性组合linear combination

线性相关linear dependence

线性因子模型linear factor model

线性模型linear model

线性回归linear regression

线性阈值单元linear threshold units

线性无关linearly independent

链接预测link prediction

链接重要采样linked importance sampling

Lipschitz Lipschitz

Lipschitz常数Lipschitz constant

Lipschitz连续Lipschitz continuous

流体状态机liquid state machine

局部条件概率分布local conditional probability distribution

局部不变性先验local constancy prior

局部对比度归一化local contrast normalization

局部下降local descent

局部核local kernel

局部极大值local maxima

局部极大点local maximum

局部极小值local minima

局部极小点local minimum

对数尺度logarithmic scale

逻辑回归logistic regression

logistic sigmoid logistic sigmoid

分对数logit

对数线性模型log-linear model

长短期记忆long short-term memory

长期依赖long-term dependency

环loop

环状信念传播loopy belief propagation

损失loss

损失函数loss function

机器学习machine learning

机器学习模型machine learning model

机器翻译machine translation

主对角线main diagonal

流形manifold

流形假设manifold hypothesis

流形学习manifold learning

边缘概率分布marginal probability distribution

马尔可夫链Markov Chain

马尔可夫链蒙特卡罗Markov Chain Monte Carlo

马尔可夫网络Markov network

马尔可夫随机场Markov random field

掩码mask

矩阵matrix

矩阵逆matrix inversion

矩阵乘积matrix product

最大范数max norm

池pool

最大池化max pooling

极大值maxima

M步maximization step

最大后验Maximum A Posteriori

最大似然maximum likelihood

最大似然估计maximum likelihood estimation

最大平均偏差maximum mean discrepancy

maxout maxout

maxout单元maxout unit

平均绝对误差mean absolute error

均值和协方差RBM mean and covariance RBM

学生t分布均值乘积mean product of Student t-distribution

均方误差mean squared error

均值-协方差RBM mean-covariance restricted Boltzmann machine

均匀场mean field

均值场mean-field

测度论measure theory

零测度measure zero

记忆网络memory network

信息传输message passing

小批量minibatch

小批量随机minibatch stochastic

极小值minima

极小点minimum

混合Mixing

混合时间Mixing Time

混合密度网络mixture density network

混合分布mixture distribution

专家混合体mixture of experts

模态modality

峰值mode

模型model

模型平均model averaging

模型压缩model compression

模型可辨识性model identifiability

模型并行model parallelism

矩moment

矩匹配moment matching

动量momentum

蒙特卡罗Monte Carlo

Moore-Penrose伪逆Moore-Penrose pseudoinverse

道德化moralization

道德图moralized graph

多层感知机multilayer perceptron

多峰值multimodal

多模态学习multimodal learning

多项式分布multinomial distribution

Multinoulli分布multinoulli distribution

多预测深度玻尔兹曼机multi-prediction deep Boltzmann machine

多任务学习multitask learning

多维正态分布multivariate normal distribution

朴素贝叶斯naive Bayes

奈特nats

自然语言处理Natural Language Processing

最近邻nearest neighbor

最近邻图nearest neighbor graph

最近邻回归nearest neighbor regression

负定negative definite

负部函数negative part function

负相negative phase

半负定negative semidefinite

Nesterov动量Nesterov momentum

网络network

神经自回归密度估计器neural auto-regressive density estimator

神经自回归网络neural auto-regressive network

神经语言模型Neural Language Model

神经机器翻译Neural Machine Translation

神经网络neural network

神经网络图灵机neural Turing machine

牛顿法Newton's method

n-gram n-gram

没有免费午餐定理no free lunch theorem

噪声noise

噪声分布noise distribution

噪声对比估计noise-contrastive estimation

非凸nonconvex

非分布式nondistributed

非分布式表示nondistributed representation

非线性共轭梯度nonlinear conjugate gradients

非线性独立成分估计nonlinear independent components estimation

非参数non-parametric

范数norm

正态分布normal distribution

正规方程normal equation

归一化的normalized

标准初始化normalized initialization

数值numeric value

数值优化numerical optimization

对象识别object recognition

目标objective

目标函数objective function

奥卡姆剃刀Occam's razor

one-hot one-hot

一次学习one-shot learning

在线online

在线学习online learning

操作operation

最佳容量optimal capacity

原点origin

正交orthogonal

正交矩阵orthogonal matrix

标准正交orthonormal

输出output

输出层output layer

过完备overcomplete

过估计overestimation

过拟合overfitting

过拟合机制overfitting regime

上溢overflow

并行分布式处理Parallel Distributed Processing

并行回火parallel tempering

参数parameter

参数服务器parameter server

参数共享parameter sharing

有参情况parametric case

参数化整流线性单元parametric ReLU

偏导数partial derivative

配分函数Partition Function

性能度量performance measures

性能度量performance metrics

置换不变性permutation invariant

持续性对比散度persistent contrastive divergence

音素phoneme

语音phonetic

分段piecewise

点估计point estimator

策略policy

策略梯度policy gradient

池化pooling

池化函数pooling function

病态条件poor conditioning

正定positive definite

正部函数positive part function

正相positive phase

半正定positive semidefinite

后验概率posterior probability

幂方法power method

PR曲线PR curve

精度precision

精度矩阵precision matrix

预测稀疏分解predictive sparse decomposition

预训练pretraining

初级视觉皮层primary visual cortex

主成分分析principal components analysis

先验概率prior probability

先验概率分布prior probability distribution

概率PCA probabilistic PCA

概率密度函数probability density function

概率分布probability distribution

概率质量函数probability mass function

专家之积product of expert

乘法法则product rule

成比例proportional

提议分布proposal distribution

伪似然pseudolikelihood

象限对quadrature pair

量子力学quantum mechanics

径向基函数radial basis function

随机搜索random search

随机变量random variable

值域range

比率匹配ratio matching

召回率recall

接受域receptive field

再循环recirculation

推荐系统recommender system

重构reconstruction

重构误差reconstruction error

整流线性rectified linear

整流线性变换rectified linear transformation

整流线性单元rectified linear unit

整流网络rectifier network

循环recurrence

循环卷积网络recurrent convolutional network

循环网络recurrent network

循环神经网络recurrent neural network

回归regression

正则化regularization

正则化regularize

正则化项regularizer

强化学习reinforcement learning

关系relation

关系型数据库relational database

重参数化reparametrization

重参数化技巧reparametrization trick

表示representation

表示学习representation learning

表示容量representational capacity

储层计算reservoir computing

受限玻尔兹曼机Restricted Boltzmann Machine

反向相关reverse correlation

反向模式累加reverse mode accumulation

岭回归ridge regression

右特征向量right eigenvector

右奇异向量right singular vector

风险risk

行row

扫视saccade

鞍点saddle point

无鞍牛顿法saddle-free Newton method

相同same

样本均值sample mean

样本方差sample variance

饱和saturate

标量scalar

得分score

得分匹配score matching

二阶导数second derivative

二阶导数测试second derivative test

第二层second layer

二阶方法second-order method

自对比估计self-contrastive estimation

自信息self-information

语义哈希semantic hashing

半受限波尔兹曼机semi-restricted Boltzmann Machine

半监督semi-supervised

半监督学习semi-supervised learning

可分离的separable

分离的separate

分离separation

情景setting

浅度回路shallow circuit

香农熵Shannon entropy

香农shannons

塑造shaping

短列表shortlist

sigmoid sigmoid

sigmoid信念网络sigmoid Belief Network

简单细胞simple cell

奇异的singular

奇异值singular value

奇异值分解singular value decomposition

奇异向量singular vector

跳跃连接skip connection

慢特征分析slow feature analysis

慢性原则slowness principle

平滑smoothing

平滑先验smoothness prior

softmax softmax

softmax函数softmax function

softmax单元softmax unit

softplus softplus

softplus函数softplus function

生成子空间span

稀疏sparse

稀疏激活sparse activation

稀疏编码sparse coding

稀疏连接sparse connectivity

稀疏初始化sparse initialization

稀疏交互sparse interactions

稀疏权重sparse weights

谱半径spectral radius

语音识别Speech Recognition

sphering sphering

尖峰和平板spike and slab

尖峰和平板RBM spike and slab RBM

虚假模态spurious modes

方阵square

标准差standard deviation

标准差standard error

标准正态分布standard normal distribution

声明statement

平稳的stationary

平稳分布Stationary Distribution

驻点stationary point

统计效率statistical efficiency

统计学习理论statistical learning theory

统计量statistics

最陡下降steepest descent

随机stochastic

随机课程stochastic curriculum

随机梯度上升Stochastic Gradient Ascent

随机梯度下降stochastic gradient descent

随机矩阵Stochastic Matrix

随机最大似然stochastic maximum likelihood

流stream

步幅stride

结构学习structure learning

结构化概率模型structured probabilistic model

结构化变分推断structured variational inference

亚原子subatomic

子采样subsample

求和法则sum rule

和–积网络sum-product network

监督supervised

监督学习supervised learning

监督学习算法supervised learning algorithm

监督模型supervised model

监督预训练supervised pretraining

支持向量support vector

代理损失函数surrogate loss function

符号symbol

符号表示symbolic representation

对称symmetric

切面距离tangent distance

切平面tangent plane

正切传播tangent prop

泰勒Taylor

导师驱动过程teacher forcing

温度temperature

回火转移tempered transition

回火tempering

张量tensor

测试误差test error

测试集test set

碰撞情况the collider case

绑定的权重tied weights

Tikhonov正则Tikhonov regularization

平铺卷积tiled convolution

时延神经网络time delay neural network

时间步time step

Toeplitz矩阵Toeplitz matrix

标记token

容差tolerance

地形ICA topographic ICA

训练误差training error

训练集training set

转录transcribe

转录系统transcription system

迁移学习transfer learning

转移transition

转置transpose

三角不等式triangle inequality

三角形化triangulate

三角形化图triangulated graph

三元语法trigram

无偏unbiased

无偏样本方差unbiased sample variance

欠完备undercomplete

欠定的underdetermined

欠估计underestimation

欠拟合underfitting

欠拟合机制underfitting regime

下溢underflow

潜在underlying

潜在成因underlying cause

无向undirected

无向模型undirected model

展开图unfolded graph

展开unfolding

均匀分布uniform distribution

一元语法unigram

单峰值unimodal

单元unit

单位范数unit norm

单位向量unit vector

万能近似定理universal approximation theorem

万能近似器universal approximator

万能函数近似器universal function approximator

未标注unlabeled

未归一化概率函数unnormalized probability function

非共享卷积unshared convolution

无监督unsupervised

无监督学习unsupervised learning

无监督学习算法unsupervised learning algorithm

无监督预训练unsupervised pretraining

有效valid

验证集validation set

梯度消失与爆炸问题vanishing and exploding gradient problem

梯度消失vanishing gradient

Vapnik-Chervonenkis维度Vapnik-Chervonenkis dimension

变量消去variable elimination

方差variance

方差减小variance reduction

变分自编码器variational auto-encoder

变分导数variational derivative

变分自由能variational free energy

变分推断variational inference

向量vector

虚拟对抗样本virtual adversarial example

虚拟对抗训练virtual adversarial training

可见层visible layer

V-结构V-structure

醒眠wake sleep

warp warp

支持向量机support vector machine

无向图模型undirected graphical model

权重weight

权重衰减weight decay

权重比例推断规则weight scaling inference rule

权重空间对称性weight space symmetry

条件概率分布conditional probability distribution

白化whitening

宽度width

赢者通吃winner-take-all

正切传播tangent propagation

流形正切分类器manifold tangent classifier

词嵌入word embedding

词义消歧word-sense disambiguation

零数据学习zero-data learning

零次学习zero-shot learning
