20.15 Conclusion
Training generative models with hidden units is a powerful way to make models understand the world represented by the given training data. By learning a model and a representation, a generative model can answer many inference questions about the relationships between the input variables in x, and can offer many different ways of representing x by taking expectations of h at different layers. Generative models can provide AI systems with a framework for the many different concepts they need to understand, giving them the ability to reason about these concepts in the face of uncertainty. We hope that our readers will find new ways to make these approaches more powerful, and will continue the journey of exploring the principles that underlie learning and intelligence.
————————————————————
(1) The term "mcRBM" is pronounced by saying the names of the letters M-C-R-B-M; the "mc" is not pronounced like the "Mc" in "McDonald's".
(2) This version of the Gaussian-Bernoulli RBM energy function assumes that each pixel of the image data has zero mean. To accommodate nonzero pixel means, a pixel offset can simply be added to the model (a sketch of this modification is given after these notes).
(3) That paper describes the model as a "deep belief network," but because it can be described as a purely undirected model (with tractable layer-wise mean field fixed-point updates), it best fits the definition of a deep Boltzmann machine.
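To make footnote (2) concrete, the following is a minimal sketch under a common unit-variance Gaussian-Bernoulli parameterization, with weights W, hidden biases b, and a per-pixel offset vector \mu introduced here purely for illustration; this is not necessarily the exact form used in the model that footnote (2) refers to. With zero-mean data the energy can be written as

\[ E(v, h) = \frac{1}{2} v^\top v - v^\top W h - b^\top h, \]

and a nonzero pixel mean is handled by shifting the visible units:

\[ E(v, h) = \frac{1}{2} (v - \mu)^\top (v - \mu) - v^\top W h - b^\top h. \]

Expanding the quadratic term shows that the offset contributes a linear term $-\mu^\top v$ (plus a constant absorbed into the partition function), so adding the pixel offset is equivalent to introducing a visible bias term into the energy.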
Abadi,M.,Agarwal,A.,Barham,P.,Brevdo,E.,Chen,Z.,Citro,C.,Corrado,G. S.,Davis,A.,Dean,J.,Devin,M.,Ghemawat,S.,Goodfellow,I.,Harp,A.,Irving,G.,Isard,M.,Jia,Y.,Jozefowicz,R.,Kaiser,L.,Kudlur,M.,Levenberg,J.,Mané,D.,Monga,R.,Moore,S.,Murray,D.,Olah,C.,Schuster,M.,Shlens,J.,Steiner,B.,Sutskever,I.,Talwar,K.,Tucker,P.,Vanhoucke,V.,Vasudevan,V.,Viégas,F.,Vinyals,O.,Warden,P.,Wattenberg,M.,Wicke,M.,Yu,Y.,and Zheng,X.(2015). TensorFlow:Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.
Ackley,D. H.,Hinton,G. E.,and Sejnowski,T. J.(1985). A learning algorithm for Boltzmann machines. Cognitive Science,9,147–169.
Alain,G. and Bengio,Y.(2013). What regularized auto-encoders learn from the data generating distribution. In ICLR'2013,arXiv:1211.4246.
Alain,G.,Bengio,Y.,Yao,L.,Éric Thibodeau-Laufer,Yosinski,J.,and Vincent,P.(2015). GSNs: Generative stochastic networks. arXiv:1503.05571.
Anderson,E.(1935). The Irises of the Gaspé Peninsula. Bulletin of the American Iris Society,59,2–5.
Ba,J.,Mnih,V.,and Kavukcuoglu,K.(2014). Multiple object recognition with visual attention. arXiv:1412.7755.
Bachman,P. and Precup,D.(2015). Variational generative stochastic networks with collaborative shaping. In Proceedings of the 32nd International Conference on Machine Learning,ICML 2015,Lille,France,6-11 July 2015,pages 1964–1972.
Bacon,P.-L.,Bengio,E.,Pineau,J.,and Precup,D.(2015). Conditional computation in neural networks using a decision-theoretic approach. In 2nd Multidisciplinary Conference on Rein-forcement Learning and Decision Making(RLDM 2015).
Bagnell,J. A. and Bradley,D. M.(2009). Differentiable sparse coding. In NIPS'2009,pages 113–120.
Bahdanau,D.,Cho,K.,and Bengio,Y.(2015). Neural machine translation by jointly learning to align and translate. In ICLR'2015,arXiv:1409.0473.
Bahl,L. R.,Brown,P.,de Souza,P. V.,and Mercer,R. L.(1987). Speech recognition with continuous-parameter hidden Markov models. Computer,Speech and Language,2,219–234.
Baldi,P. and Hornik,K.(1989). Neural networks and principal component analysis:Learning from examples without local minima. Neural Networks,2,53–58.
Baldi,P.,Brunak,S.,Frasconi,P.,Soda,G.,and Pollastri,G.(1999). Exploiting the past and the future in protein secondary structure prediction. Bioinformatics,15(11),937–946.
Baldi,P.,Sadowski,P.,and Whiteson,D.(2014). Searching for exotic particles in high-energy physics with deep learning. Nature communications,5.
Ballard,D. H.,Hinton,G. E.,and Sejnowski,T. J.(1983). Parallel vision computation. Nature.
Barlow,H. B.(1989). Unsupervised learning. Neural Computation,1,295–311.
Barron,A. E.(1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. on Information Theory,39,930–945.
Bartholomew,D. J.(1987). Latent variable models and factor analysis. Oxford University Press.
Basilevsky,A.(1994). Statistical Factor Analysis and Related Methods:Theory and Applications. Wiley.
Bastien,F.,Lamblin,P.,Pascanu,R.,Bergstra,J.,Goodfellow,I.,Bergeron,A.,Bouchard,N.,Warde-Farley,D.,and Bengio,Y.(2012a). Theano:new features and speed improvements. Submited to the Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop,http://www.iro.umontreal.ca/lisa/publications2/index.php/publications/show/551.
Bastien,F.,Lamblin,P.,Pascanu,R.,Bergstra,J.,Goodfellow,I. J.,Bergeron,A.,Bouchard,N.,and Bengio,Y.(2012b). Theano:new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop.
Basu,S. and Christensen,J.(2013). Teaching classification boundaries to humans. In AAAI'2013.
Baxter,J.(1995). Learning internal representations. In Proceedings of the 8th International Conference on Computational Learning Theory(COLT'95),pages 311–320,Santa Cruz,California. ACM Press.
Bayer,J. and Osendorfer,C.(2014). Learning stochastic recurrent networks. ArXiv e-prints.
Becker,S. and Hinton,G.(1992). A self-organizing neural network that discovers surfaces in random-dot stereograms. Nature,355,161–163.
Behnke,S.(2001). Learning iterative image reconstruction in the neural abstraction pyramid. Int. J. Computational Intelligence and Applications,1(4),427–438.
Beiu,V.,Quintana,J. M.,and Avedillo,M. J.(2003). VLSI implementations of threshold logic-a comprehensive survey. Neural Networks,IEEE Transactions on,14(5),1217–1243.
Belkin,M. and Niyogi,P.(2002). Laplacian eigenmaps and spectral techniques for embedding and clustering. In T. Dietterich,S. Becker,and Z. Ghahramani,editors,Advances in Neural Information Processing Systems 14(NIPS'01),Cambridge,MA. MIT Press.
Belkin,M. and Niyogi,P.(2003a). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation,15(6),1373–1396.
Belkin,M. and Niyogi,P.(2003b). Using manifold structure for partially labeled classification. In S. Becker,S. Thrun,and K. Obermayer,editors,Advances in Neural Information Processing Systems 15(NIPS'02),Cambridge,MA. MIT Press.
Bengio,E.,Bacon,P.-L.,Pineau,J.,and Precup,D.(2015a). Conditional computation in neural networks for faster models. arXiv:1511.06297.
Bengio,S. and Bengio,Y.(2000a). Taking on the curse of dimensionality in joint distributions using neural networks. IEEE Transactions on Neural Networks,special issue on Data Mining and Knowledge Discovery,11(3),550–557.
Bengio,S.,Vinyals,O.,Jaitly,N.,and Shazeer,N.(2015b). Scheduled sampling for sequence prediction with recurrent neural networks. Technical report,arXiv:1506.03099.
Bengio,Y.(1991). Artificial Neural Networks and their Application to Sequence Recognition. Ph.D. thesis,McGill University,(Computer Science),Montreal,Canada.
Bengio,Y.(2000). Gradient-based optimization of hyperparameters. Neural Computation,12(8),1889–1900.
Bengio,Y.(2002). New distributed probabilistic language models. Technical Report 1215,Dept. IRO,Université de Montréal.
Bengio,Y.(2009). Learning deep architectures for AI. Now Publishers.
Bengio,Y.(2013). Deep learning of representations: looking forward. In Statistical Language and Speech Processing,volume 7978 of Lecture Notes in Computer Science,pages 1–37. Springer,also in arXiv at http://arxiv.org/abs/1305.0445.
Bengio,Y.(2015). Early inference in energy-based models approximates back-propagation. Technical Report arXiv:1510.02777,Universite de Montreal.
Bengio,Y. and Bengio,S.(2000b). Modeling high-dimensional discrete data with multi-layer neural networks. In NIPS 12,pages 400–406. MIT Press.
Bengio,Y. and Delalleau,O.(2009). Justifying and generalizing contrastive divergence. Neural Computation,21(6),1601–1621.
Bengio,Y. and Grandvalet,Y.(2004). No unbiased estimator of the variance of k-fold cross-validation. In JML(1),pages 1089–1105.
Bengio,Y. and LeCun,Y.(2007a). Scaling learning algorithms towards AI. In Large Scale Kernel Machines.
Bengio,Y. and LeCun,Y.(2007b). Scaling learning algorithms towards AI. In L. Bottou,O. Chapelle,D. DeCoste,and J. Weston,editors,Large Scale Kernel Machines. MIT Press.
Bengio,Y. and Monperrus,M.(2005). Non-local manifold tangent learning. In L. Saul,Y. Weiss,and L. Bottou,editors,Advances in Neural Information Processing Systems 17(NIPS'04),pages 129–136. MIT Press.
Bengio,Y. and Sénécal,J.-S.(2003). Quick training of probabilistic neural nets by importance sampling. In Proceedings of AISTATS 2003.
Bengio,Y. and Sénécal,J.-S.(2008). Adaptive importance sampling to accelerate training of a neural probabilistic language model. IEEE Trans. Neural Networks,19(4),713–722.
Bengio,Y.,De Mori,R.,Flammia,G.,and Kompe,R.(1991). Phonetically motivated acoustic parameters for continuous speech recognition using artificial neural networks. In Proceedings of EuroSpeech'91.
Bengio,Y.,De Mori,R.,Flammia,G.,and Kompe,R.(1992). Neural network-Gaussian mix-ture hybrid for speech recognition or density estimation. In NIPS 4,pages 175–182. Morgan Kaufmann.
Bengio,Y.,Frasconi,P.,and Simard,P.(1993). The problem of learning long-term dependencies in recurrent networks. In IEEE International Conference on Neural Networks,pages 1183–1195,San Francisco. IEEE Press.(invited paper).
Bengio,Y.,Simard,P.,and Frasconi,P.(1994a). Learning long-term dependencies with gradient descent is difficult. IEEE Tr. Neural Nets.
Bengio,Y.,Simard,P.,and Frasconi,P.(1994b). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks,5(2),157–166.
Bengio,Y.,Simard,P.,and Frasconi,P.(1994c). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks,5(2),157–166.
Bengio,Y.,Latendresse,S.,and Dugas,C.(1999). Gradient-based learning of hyper-parameters. In Learning Conference.
Bengio,Y.,Ducharme,R.,and Vincent,P.(2001a). A neural probabilistic language model. In T. Leen,T. Dietterich,and V. Tresp,editors,Advances in Neural Information Processing Systems 13(NIPS'00),pages 933–938. MIT Press.
Bengio,Y.,Ducharme,R.,and Vincent,P.(2001b). A neural probabilistic language model. In T. K. Leen,T. G. Dietterich,and V. Tresp,editors,NIPS'2000,pages 932–938. MIT Press.
Bengio,Y.,Ducharme,R.,Vincent,P.,and Jauvin,C.(2003). A neural probabilistic language model. JMLR,3,1137–1155.
Bengio,Y.,Delalleau,O.,and Le Roux,N.(2006a). The curse of highly variable functions for local kernel machines. In NIPS'2005.
Bengio,Y.,Larochelle,H.,and Vincent,P.(2006b). Non-local manifold Parzen windows. In NIPS'2005. MIT Press.
Bengio,Y.,Lamblin,P.,Popovici,D.,and Larochelle,H.(2007a). Greedy layer-wise training of deep networks. In NIPS'2006.
Bengio,Y.,Lamblin,P.,Popovici,D.,and Larochelle,H.(2007b). Greedy layer-wise training of deep networks. In B. Schölkopf,J. Platt,and T. Hoffman,editors,Advances in Neural Information Processing Systems 19(NIPS'06),pages 153–160. MIT Press.
Bengio,Y.,Lamblin,P.,Popovici,D.,and Larochelle,H.(2007c). Greedy layer-wise training of deep networks. In Adv. Neural Inf. Proc. Sys. 19,pages 153–160.
Bengio,Y.,Lamblin,P.,Popovici,D.,and Larochelle,H.(2007d). Greedy layer-wise training of deep networks. In NIPS 19,pages 153–160. MIT Press.
Bengio,Y.,Louradour,J.,Collobert,R.,and Weston,J.(2009). Curriculum learning. In ICML'09. ACM.
Bengio,Y.,Mesnil,G.,Dauphin,Y.,and Rifai,S.(2013a). Better mixing via deep representa-tions. In ICML'2013.
Bengio,Y.,Léonard,N.,and Courville,A.(2013b). Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv:1308.3432.
Bengio,Y.,Yao,L.,Alain,G.,and Vincent,P.(2013c). Generalized denoising auto-encoders as generative models. In NIPS'2013.
Bengio,Y.,Courville,A.,and Vincent,P.(2013d). Representation learning: A review and new perspectives. Pattern Analysis and Machine Intelligence,IEEE Transactions on,35(8),1798–1828.
Bengio,Y.,Thibodeau-Laufer,E.,Alain,G.,and Yosinski,J.(2014). Deep generative stochastic networks trainable by backprop. In ICML'2014.
Bennett,C.(1976). Efficient estimation of free energy differences from Monte Carlo data. Journal of Computational Physics,22(2),245–268.
Bennett,J. and Lanning,S.(2007). The Netflix prize.
Berger,A. L.,Della Pietra,V. J.,and Della Pietra,S. A.(1996). A maximum entropy approach to natural language processing. Computational Linguistics,22,39–71.
Berglund,M. and Raiko,T.(2013). Stochastic gradient estimate variance in contrastive diver-gence and persistent contrastive divergence. CoRR,abs/1312.6002.
Bergstra,J.(2011). Incorporating Complex Cells into Neural Networks for Pattern Classification. Ph.D. thesis,Université de Montréal.
Bergstra,J. and Bengio,Y.(2009). Slow,decorrelated features for pretraining complex cell-like networks. In NIPS 22,pages 99–107. MIT Press.
Bergstra,J. and Bengio,Y.(2011). Random search for hyper-parameter optimization. The Learning Workshop,Fort Lauderdale,Florida.
Bergstra,J. and Bengio,Y.(2012). Random search for hyper-parameter optimization. J. Machine Learning Res.,13,281–305.
Bergstra,J.,Breuleux,O.,Bastien,F.,Lamblin,P.,Pascanu,R.,Desjardins,G.,Turian,J.,Warde-Farley,D.,and Bengio,Y.(2010a). Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference(SciPy). Oral Presentation.
Bergstra,J.,Breuleux,O.,Bastien,F.,Lamblin,P.,Pascanu,R.,Desjardins,G.,Turian,J.,Warde-Farley,D.,and Bengio,Y.(2010b). Theano: a CPU and GPU math expression com-piler. In Proc. SciPy.
Bergstra,J.,Breuleux,O.,Bastien,F.,Lamblin,P.,Pascanu,R.,Desjardins,G.,Turian,J.,Warde-Farley,D.,and Bengio,Y.(2010c). Theano:a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference(SciPy).
Bergstra,J.,Bardenet,R.,Bengio,Y.,and Kégl,B.(2011). Algorithms for hyper-parameter optimization. In NIPS'2011.
Berkes,P. and Wiskott,L.(2005). Slow feature analysis yields a rich repertoire of complex cell properties. Journal of Vision,5(6),579–602.
Bertsekas,D. P. and Tsitsiklis,J.(1996). Neuro-Dynamic Programming. Athena Scientific.
Besag,J.(1975). Statistical analysis of non-lattice data. The Statistician,24(3),179–195.
Bishop,C. M.(1994). Mixture density networks.
Bishop,C. M.(1995a). Regularization and complexity control in feed-forward networks. In Proceedings International Conference on Artificial Neural Networks ICANN'95,volume 1,page 141–148.
Bishop,C. M.(1995b). Training with noise is equivalent to Tikhonov regularization. Neural Computation,7(1),108–116.
Bishop,C. M.(2006). Pattern Recognition and Machine Learning. Springer.
Blum,A. L. and Rivest,R. L.(1992). Training a 3-node neural network is NP-complete.
Blumer,A.,Ehrenfeucht,A.,Haussler,D.,and Warmuth,M. K.(1989). Learnability and the Vapnik–Chervonenkis dimension. Journal of the ACM,36(4),865–929.
Bonnet,G.(1964). Transformations des signaux aléatoires à travers les systèmes non linéaires sans mémoire. Annales des Télécommunications,19(9–10),203–220.
Bordes,A.,Weston,J.,Collobert,R.,and Bengio,Y.(2011). Learning structured embeddings of knowledge bases. In AAAI 2011.
Bordes,A.,Glorot,X.,Weston,J.,and Bengio,Y.(2012). Joint learning of words and meaning representations for open-text semantic parsing. AISTATS'2012.
Bordes,A.,Glorot,X.,Weston,J.,and Bengio,Y.(2013a). A semantic matching energy func-tion for learning with multi-relational data. Machine Learning: Special Issue on Learning Semantics.
Bordes,A.,Usunier,N.,Garcia-Duran,A.,Weston,J.,and Yakhnenko,O.(2013b). Translating embeddings for modeling multi-relational data. In C. Burges,L. Bottou,M. Welling,Z. Ghahramani,and K. Weinberger,editors,Advances in Neural Information Processing Systems 26,pages 2787–2795. Curran Associates,Inc.
Bornschein,J. and Bengio,Y.(2015). Reweighted wake-sleep. In ICLR'2015,arXiv:1406.2751.
Bornschein,J.,Shabanian,S.,Fischer,A.,and Bengio,Y.(2015). Training bidirectional Helmholtz machines. Technical report,arXiv:1506.03877.
Boser,B. E.,Guyon,I. M.,and Vapnik,V. N.(1992). A training algorithm for optimal margin classifiers. In COLT '92: Proceedings of thefifth annual workshop on Computational learning theory,pages 144–152,New York,NY,USA. ACM.
Bottou,L.(1998). Online algorithms and stochastic approximations. In D. Saad,editor,Online Learning in Neural Networks. Cambridge University Press,Cambridge,UK.
Bottou,L.(2011). From machine learning to machine reasoning. Technical report,arXiv.1102.1808.
Bottou,L.(2015). Multilayer neural networks. Deep Learning Summer School.
Bottou,L. and Bousquet,O.(2008a). The tradeoffs of large scale learning. In J. Platt,D. Koller,Y. Singer,and S. Roweis,editors,Advances in Neural Information Processing Systems 20(NIPS'07),volume 20. MIT Press,Cambridge,MA.
Bottou,L. and Bousquet,O.(2008b). The tradeoffs of large scale learning. In NIPS'2008.
Boulanger-Lewandowski,N.,Bengio,Y.,and Vincent,P.(2012). Modeling temporal dependen-cies in high-dimensional sequences: Application to polyphonic music generation and transcrip-tion. In ICML'12.
Boureau,Y.,Ponce,J.,and LeCun,Y.(2010). A theoretical analysis of feature pooling in vision algorithms. In Proc. International Conference on Machine learning(ICML'10).
Boureau,Y.,Le Roux,N.,Bach,F.,Ponce,J.,and LeCun,Y.(2011). Ask the locals: multi-way local pooling for image recognition. In Proc. International Conference on Computer Vision(ICCV'11). IEEE.
Bourlard,H. and Kamp,Y.(1988). Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics,59,291–294.
Bourlard,H. and Wellekens,C.(1989). Speech pattern discrimination and multi-layered percep-trons. Computer Speech and Language,3,1–19.
Boyd,S. and Vandenberghe,L.(2004). Convex Optimization. Cambridge University Press,New York,NY,USA.
Brady,M. L.,Raghavan,R.,and Slawny,J.(1989). Back-propagation fails to separate where perceptrons succeed. IEEE Transactions on Circuits and Systems,36(5),665–674.
Brakel,P.,Stroobandt,D.,and Schrauwen,B.(2013). Training energy-based models for time-series imputation. Journal of Machine Learning Research,14,2771–2797.
Brand,M.(2003a). Charting a manifold. In S. Becker,S. Thrun,and K. Obermayer,editors,Advances in Neural Information Processing Systems 15(NIPS'02),pages 961–968. MIT Press.
Brand,M.(2003b). Charting a manifold. In NIPS'2002,pages 961–968. MIT Press.
Breiman,L.(1994). Bagging predictors. Machine Learning,24(2),123–140.
Breiman,L.,Friedman,J. H.,Olshen,R. A.,and Stone,C. J.(1984). Classification and Regression Trees. Wadsworth International Group,Belmont,CA.
Bridle,J. S.(1990). Alphanets: a recurrent ‘neural’ network architecture with a hidden Markov model interpretation. Speech Communication,9(1),83–92.
Briggman,K.,Denk,W.,Seung,S.,Helmstaedter,M. N.,and Turaga,S. C.(2009). Maximin affinity learning of image segmentation. In NIPS'2009,pages 1865–1873.
Brown,P. F.,Cocke,J.,Pietra,S. A. D.,Pietra,V. J. D.,Jelinek,F.,Lafferty,J. D.,Mercer,R. L.,and Roossin,P. S.(1990). A statistical approach to machine translation. Computational linguistics,16(2),79–85.
Brown,P. F.,Pietra,V. J. D.,DeSouza,P. V.,Lai,J. C.,and Mercer,R. L.(1992). Class-based n-gram models of natural language. Computational Linguistics,18,467–479.
Bryson,A. and Ho,Y.(1969). Applied optimal control: optimization,estimation,and control. Blaisdell Pub. Co.
Bryson,Jr.,A. E. and Denham,W. F.(1961). A steepest-ascent method for solving optimum programming problems. Technical Report BR-1303,Raytheon Company,Missle and Space Division.
Buciluǎ,C.,Caruana,R.,and Niculescu-Mizil,A.(2006). Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining,pages 535–541. ACM.
Burda,Y.,Grosse,R.,and Salakhutdinov,R.(2015). Importance weighted autoencoders. arXiv preprint arXiv:1509.00519.
Cai,M.,Shi,Y.,and Liu,J.(2013). Deep maxout neural networks for speech recognition. In Automatic Speech Recognition and Understanding(ASRU),2013 IEEE Workshop on,pages 291–296. IEEE.
Carreira-Perpiñan,M. A. and Hinton,G. E.(2005). On contrastive divergence learning. In AISTATS'2005,pages 33–40.
Caruana,R.(1993). Multitask connectionist learning. In Proceedings of the 1993 Connectionist Models Summer School,pages 372–379.
Cauchy,A.(1847). Méthode générale pour la résolution de systèmes d'équations simultanées. In Compte rendu des séances de l'académie des sciences,pages 536–538.
Cayton,L.(2005). Algorithms for manifold learning. Technical Report CS2008-0923,UCSD.
Chandola,V.,Banerjee,A.,and Kumar,V.(2009). Anomaly detection: A survey. ACM computing surveys(CSUR),41(3),15.
Chapelle,O.,Weston,J.,and Schölkopf,B.(2003). Cluster kernels for semi-supervised learning. In S. Becker,S. Thrun,and K. Obermayer,editors,Advances in Neural Information Processing Systems 15(NIPS'02),pages 585–592,Cambridge,MA. MIT Press.
Chapelle,O.,Schölkopf,B.,and Zien,A.,editors(2006). Semi-Supervised Learning. MIT Press,Cambridge,MA.
Chellapilla,K.,Puri,S.,and Simard,P.(2006). High Performance Convolutional Neural Net-works for Document Processing. In Guy Lorette,editor,Tenth International Workshop on Frontiers in Handwriting Recognition,La Baule(France). Université de Rennes 1,Suvisoft. http://www.suvisoft.com.
Chen,B.,Ting,J.-A.,Marlin,B. M.,and de Freitas,N.(2010). Deep learning of invariant spatio-temporal features from video. NIPS*2010 Deep Learning and Unsupervised Feature Learning Workshop.
Chen,S. F. and Goodman,J. T.(1999). An empirical study of smoothing techniques for language modeling. Computer,Speech and Language,13(4),359–393.
Chen,T.,Du,Z.,Sun,N.,Wang,J.,Wu,C.,Chen,Y.,and Temam,O.(2014a). DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. In Proceedings of the 19th international conference on Architectural support for programming languages and operating systems,pages 269–284. ACM.
Chen,T.,Li,M.,Li,Y.,Lin,M.,Wang,N.,Wang,M.,Xiao,T.,Xu,B.,Zhang,C.,and Zhang,Z.(2015). MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274.
Chen,Y.,Luo,T.,Liu,S.,Zhang,S.,He,L.,Wang,J.,Li,L.,Chen,T.,Xu,Z.,Sun,N.,et al.(2014b). DaDianNao: A machine-learning supercomputer. In Microarchitecture(MICRO),2014 47th Annual IEEE/ACM International Symposium on,pages 609–622. IEEE.
Chilimbi,T.,Suzue,Y.,Apacible,J.,and Kalyanaraman,K.(2014). Project Adam: Building an efficient and scalable deep learning training system. In 11th USENIX Symposium on Operating Systems Design and Implementation(OSDI'14).
Cho,K.,Raiko,T.,and Ilin,A.(2010a). Parallel tempering is efficient for learning restricted Boltzmann machines. In Proceedings of the International Joint Conference on Neural Networks(IJCNN 2010),Barcelona,Spain.
Cho,K.,Raiko,T.,and Ilin,A.(2010b). Parallel tempering is efficient for learning restricted Boltzmann machines. In IJCNN'2010.
Cho,K.,Raiko,T.,and Ilin,A.(2011). Enhanced gradient and adaptive learning rate for training restricted Boltzmann machines. In ICML'2011,pages 105–112.
Cho,K.,Van Merriënboer,B.,Gülçehre,Ç.,Bahdanau,D.,Bougares,F.,Schwenk,H.,and Bengio,Y.(2014a). Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing(EMNLP),pages 1724–1734. Association for Computational Linguistics.
Cho,K.,van Merriënboer,B.,Gulcehre,C.,Bougares,F.,Schwenk,H.,and Bengio,Y.(2014b). Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the Empiricial Methods in Natural Language Processing(EMNLP 2014).
Cho,K.,Van Merriënboer,B.,Bahdanau,D.,and Bengio,Y.(2014c). On the properties of neural machine translation: Encoder-decoder approaches. ArXiv e-prints,abs/1409.1259.
Choromanska,A.,Henaff,M.,Mathieu,M.,Arous,G. B.,and LeCun,Y.(2014). The loss surface of multilayer networks.
Chorowski,J.,Bahdanau,D.,Cho,K.,and Bengio,Y.(2014). End-to-end continuous speech recognition using attention-based recurrent NN: First results. arXiv:1412.1602.
Christianson,B.(1992). Automatic Hessians by reverse accumulation. IMA Journal of Numerical Analysis,12(2),135–150.
Chrupala,G.,Kadar,A.,and Alishahi,A.(2015). Learning language through pictures. arXiv 1506.03694.
Chung,J.,Gulcehre,C.,Cho,K.,and Bengio,Y.(2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. NIPS'2014 Deep Learning workshop,arXiv 1412.3555.
Chung,J.,Gülçehre,Ç.,Cho,K.,and Bengio,Y.(2015a). Gated feedback recurrent neural networks. In ICML'15.
Chung,J.,Kastner,K.,Dinh,L.,Goel,K.,Courville,A.,and Bengio,Y.(2015b). A recurrent latent variable model for sequential data. In NIPS'2015.
Ciresan,D.,Meier,U.,Masci,J.,and Schmidhuber,J.(2012). Multi-column deep neural network for traffic sign classification. Neural Networks,32,333–338.
Ciresan,D. C.,Meier,U.,Gambardella,L. M.,and Schmidhuber,J.(2010). Deep big simple neural nets for handwritten digit recognition. Neural Computation,22,1–14.
Coates,A. and Ng,A. Y.(2011). The importance of encoding versus training with sparse coding and vector quantization. In ICML'2011.
Coates,A.,Lee,H.,and Ng,A. Y.(2011). An analysis of single-layer networks in unsuper-vised feature learning. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics(AISTATS 2011).
Coates,A.,Huval,B.,Wang,T.,Wu,D.,Catanzaro,B.,and Andrew,N.(2013). Deep learning with COTS HPC systems. In S. Dasgupta and D. McAllester,editors,Proceedings of the 30th International Conference on Machine Learning(ICML-13),volume 28(3),pages 1337–1345. JMLR Workshop and Conference Proceedings.
Cohen,N.,Sharir,O.,and Shashua,A.(2015). On the expressive power of deep learning: A tensor analysis. arXiv:1509.05009.
Collobert,R.(2004). Large Scale Machine Learning. Ph.D. thesis,Université de Paris VI,LIP6.
Collobert,R.(2011). Deep learning for efficient discriminative parsing. In AISTATS'2011.
Collobert,R. and Weston,J.(2008a). A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML'2008.
Collobert,R. and Weston,J.(2008b). A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML'2008.
Collobert,R.,Bengio,S.,and Bengio,Y.(2001). A parallel mixture of SVMs for very large scale problems. Technical Report 12,IDIAP.
Collobert,R.,Bengio,S.,and Bengio,Y.(2002). Parallel mixture of SVMs for very large scale problem. Neural Computation.
Collobert,R.,Weston,J.,Bottou,L.,Karlen,M.,Kavukcuoglu,K.,and Kuksa,P.(2011a). Natural language processing(almost) from scratch. The Journal of Machine Learning Research,12,2493–2537.
Collobert,R.,Kavukcuoglu,K.,and Farabet,C.(2011b). Torch7: A Matlab-like environment for machine learning. In BigLearn,NIPS Workshop.
Comon,P.(1994). Independent component analysis-a new concept?Signal Processing,36,287–314.
Cortes,C. and Vapnik,V.(1995). Support vector networks. Machine Learning,20,273–297.
Couprie,C.,Farabet,C.,Najman,L.,and LeCun,Y.(2013). Indoor semantic segmentation using depth information. In International Conference on Learning Representations(ICLR2013).
Courbariaux,M.,Bengio,Y.,and David,J.-P.(2015). Low precision arithmetic for deep learning. In Arxiv:1412.7024,ICLR'2015 Workshop.
Courville,A.,Bergstra,J.,and Bengio,Y.(2011a). Unsupervised models of images by spike-and-slab RBMs. In ICML'2011.
Courville,A.,Bergstra,J.,and Bengio,Y.(2011b). Unsupervised models of images by spike-and-slab RBMs. In ICM(1b).
Courville,A.,Desjardins,G.,Bergstra,J.,and Bengio,Y.(2014). The spike-and-slab RBM and extensions to discrete and sparse data distributions. Pattern Analysis and Machine Intelligence,IEEE Transactions on,36(9),1874–1887.
Cover,T. M. and Thomas,J. A.(2006). Elements of Information Theory,2nd Edition. Wiley-Interscience.
Cox,D. and Pinto,N.(2011). Beyond simple features: A large-scale feature search approach to unconstrained face recognition. In Automatic Face & Gesture Recognition and Workshops(FG 2011),2011 IEEE International Conference on,pages 8–15. IEEE.
Cramér,H.(1946). Mathematical methods of statistics. Princeton University Press.
Crick,F. H. C. and Mitchison,G.(1983). The function of dream sleep. Nature,304,111–114.
Cybenko,G.(1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control,Signals,and Systems,2,303–314.
Dahl,G. E.,Ranzato,M.,Mohamed,A.,and Hinton,G. E.(2010). Phone recognition with the mean-covariance restricted Boltzmann machine. In Advances in Neural Information Processing Systems(NIPS).
Dahl,G. E.,Yu,D.,Deng,L.,and Acero,A.(2012). Context-dependent pre-trained deep neural networks for large vocabulary speech recognition. IEEE Transactions on Audio,Speech,and Language Processing,20(1),33–42.
Dahl,G. E.,Sainath,T. N.,and Hinton,G. E.(2013). Improving deep neural networks for LVCSR using rectified linear units and dropout. In ICASSP'2013.
Dahl,G. E.,Jaitly,N.,and Salakhutdinov,R.(2014). Multi-task neural networks for QSAR predictions. arXiv:1406.1231.
Dauphin,Y. and Bengio,Y.(2013). Stochastic ratio matching of RBMs for sparse high-dimensional inputs. In NIP(1).
Dauphin,Y.,Glorot,X.,and Bengio,Y.(2011). Large-scale learning of embeddings with recon-struction sampling. In ICML'2011.
Dauphin,Y.,Pascanu,R.,Gulcehre,C.,Cho,K.,Ganguli,S.,and Bengio,Y.(2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In NIPS'2014.
Davis,A.,Rubinstein,M.,Wadhwa,N.,Mysore,G.,Durand,F.,and Freeman,W. T.(2014). The visual microphone: Passive recovery of sound from video. ACM Transactions on Graphics(Proc. SIGGRAPH),33(4),79:1–79:10.
Dayan,P.(1990). Reinforcement comparison. In Connectionist Models: Proceedings of the 1990 Connectionist Summer School,San Mateo,CA.
Dayan,P. and Hinton,G. E.(1996). Varieties of Helmholtz machine. Neural Networks,9(8),1385–1403.
Dayan,P.,Hinton,G. E.,Neal,R. M.,and Zemel,R. S.(1995). The Helmholtz machine. Neural computation,7(5),889–904.
Dean,J.,Corrado,G.,Monga,R.,Chen,K.,Devin,M.,Le,Q.,Mao,M.,Ranzato,M.,Senior,A.,Tucker,P.,Yang,K.,and Ng,A. Y.(2012). Large scale distributed deep networks. In NIPS'2012.
Dean,T. and Kanazawa,K.(1989). A model for reasoning about persistence and causation. Computational Intelligence,5(3),142–150.
Deerwester,S.,Dumais,S. T.,Furnas,G. W.,Landauer,T. K.,and Harshman,R.(1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science,41(6),391–407.
Delalleau,O. and Bengio,Y.(2011). Shallow vs. deep sum-product networks. In NIPS.
Deng,J.,Dong,W.,Socher,R.,Li,L.-J.,Li,K.,and Fei-Fei,L.(2009). ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09.
Deng,J.,Berg,A. C.,Li,K.,and Fei-Fei,L.(2010a). What does classifying more than 10,000 image categories tell us? In Proceedings of the 11th European Conference on Computer Vision: Part V,ECCV'10,pages 71–84,Berlin,Heidelberg. Springer-Verlag.
Deng,L. and Yu,D.(2014). Deep learning–methods and applications. Foundations and Trends in Signal Processing.
Deng,L.,Seltzer,M.,Yu,D.,Acero,A.,Mohamed,A.,and Hinton,G.(2010b). Binary coding of speech spectrograms using a deep auto-encoder. In Interspeech 2010,Makuhari,Chiba,Japan.
Denil,M.,Bazzani,L.,Larochelle,H.,and de Freitas,N.(2012). Learning where to attend with deep architectures for image tracking. Neural Computation,24(8),2151–2184.
Denton,E.,Chintala,S.,Szlam,A.,and Fergus,R.(2015). Deep generative image models using a Laplacian pyramid of adversarial networks. NIPS.
Desjardins,G. and Bengio,Y.(2008). Empirical evaluation of convolutional RBMs for vision. Technical Report 1327,Département d'Informatique et de Recherche Opérationnelle,Université de Montréal.
Desjardins,G.,Courville,A. C.,Bengio,Y.,Vincent,P.,and Delalleau,O.(2010). Tempered Markov chain Monte Carlo for training of restricted Boltzmann machines. In International Conference on Artificial Intelligence and Statistics,pages 145–152.
Desjardins,G.,Courville,A.,and Bengio,Y.(2011). On tracking the partition function. In NIPS'2011.
Devlin,J.,Zbib,R.,Huang,Z.,Lamar,T.,Schwartz,R.,and Makhoul,J.(2014). Fast and robust neural network joint models for statistical machine translation. In Proc. ACL'2014.
Devroye,L.(2013). Non-Uniform Random Variate Generation. SpringerLink: Bücher. Springer New York.
DiCarlo,J. J.(2013). Mechanisms underlying visual object recognition:Humans vs. neurons vs. machines. NIPS Tutorial.
Dinh,L.,Krueger,D.,and Bengio,Y.(2014). NICE: Non-linear independent components esti-mation. arXiv:1410.8516.
Donahue,J.,Hendricks,L. A.,Guadarrama,S.,Rohrbach,M.,Venugopalan,S.,Saenko,K.,and Darrell,T.(2014). Long-term recurrent convolutional networks for visual recognition and description. arXiv:1411.4389.
Donoho,D. L. and Grimes,C.(2003). Hessian eigenmaps: new locally linear embedding tech-niques for high-dimensional data. Technical Report 2003-08,Dept. Statistics,Stanford University.
Dosovitskiy,A.,Springenberg,J. T.,and Brox,T.(2015). Learning to generate chairs with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,pages 1538–1546.
Doya,K.(1993). Bifurcations of recurrent neural networks in gradient descent learning. IEEE Transactions on Neural Networks,1,75–80.
Dreyfus,S. E.(1962). The numerical solution of variational problems. Journal of Mathematical Analysis and Applications,5(1),30–45.
Dreyfus,S. E.(1973). The computational solution of optimal control problems with time lag. IEEE Transactions on Automatic Control,18(4),383–385.
Drucker,H. and LeCun,Y.(1992). Improving generalisation performance using double back-propagation. IEEE Transactions on Neural Networks,3(6),991–997.
Duchi,J.,Hazan,E.,and Singer,Y.(2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research.
Dudik,M.,Langford,J.,and Li,L.(2011). Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on Machine learning,ICML '11.
Dugas,C.,Bengio,Y.,Bélisle,F.,and Nadeau,C.(2001). Incorporating second-order functional knowledge for better option pricing. In T. Leen,T. Dietterich,and V. Tresp,editors,Advances in Neural Information Processing Systems 13(NIPS'00),pages 472–478. MIT Press.
Dziugaite,G. K.,Roy,D. M.,and Ghahramani,Z.(2015). Training generative neural networks via maximum mean discrepancy optimization. arXiv preprint arXiv:1505.03906.
El Hihi,S. and Bengio,Y.(1996). Hierarchical recurrent neural networks for long-term depen-dencies. In NIPS 8. MIT Press.
Elkahky,A. M.,Song,Y.,and He,X.(2015). A multi-view deep learning approach for cross domain user modeling in recommendation systems. In Proceedings of the 24th International Conference on World Wide Web,pages 278–288.
Elman,J. L.(1993). Learning and development in neural networks: The importance of starting small. Cognition,48,781–799.
Erhan,D.,Manzagol,P.-A.,Bengio,Y.,Bengio,S.,and Vincent,P.(2009). The difficulty of training deep architectures and the effect of unsupervised pre-training. In AISTATS'2009,pages 153–160.
Erhan,D.,Bengio,Y.,Courville,A.,Manzagol,P.,Vincent,P.,and Bengio,S.(2010). Why does unsupervised pre-training help deep learning? J. Machine Learning Res.
Fahlman,S. E.,Hinton,G. E.,and Sejnowski,T. J.(1983). Massively parallel architectures for AI: NETL,thistle,and Boltzmann machines. In Proceedings of the National Conference on Artificial Intelligence AAAI-83.
Fang,H.,Gupta,S.,Iandola,F.,Srivastava,R.,Deng,L.,Dollár,P.,Gao,J.,He,X.,Mitchell,M.,Platt,J. C.,Zitnick,C. L.,and Zweig,G.(2015). From captions to visual concepts and back. arXiv:1411.4952.
Farabet,C.,LeCun,Y.,Kavukcuoglu,K.,Culurciello,E.,Martini,B.,Akselrod,P.,and Talay,S.(2011). Large-scale FPGA-based convolutional networks. In R. Bekkerman,M. Bilenko,and J. Langford,editors,Scaling up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press.
Farabet,C.,Couprie,C.,Najman,L.,and LeCun,Y.(2013). Learning hierarchical features for scene labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence,35(8),1915–1929.
Fei-Fei,L.,Fergus,R.,and Perona,P.(2006). One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence,28(4),594–611.
Finn,C.,Tan,X. Y.,Duan,Y.,Darrell,T.,Levine,S.,and Abbeel,P.(2015). Learning visual feature spaces for robotic manipulation with deep spatial autoencoders. arXiv preprint arXiv:1509.06113.
Fisher,R. A.(1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics,7,179–188.
Földiák,P.(1989). Adaptive network for optimal linear feature extraction. In International Joint Conference on Neural Networks(IJCNN),volume 1,pages 401–405,Washington 1989. IEEE,New York.
Franzius,M.,Sprekeler,H.,and Wiskott,L.(2007). Slowness and sparseness lead to place,head-direction,and spatial-view cells.
Franzius,M.,Wilbert,N.,and Wiskott,L.(2008). Invariant object recognition with slow feature analysis. In Proceedings of the 18th international conference on Artificial Neural Networks,Part I,ICANN '08,pages 961–970,Berlin,Heidelberg. Springer-Verlag.
Frasconi,P.,Gori,M.,and Sperduti,A.(1997). On the efficient classification of data structures by neural networks. In Proc. Int. Joint Conf. on Artificial Intelligence.
Frasconi,P.,Gori,M.,and Sperduti,A.(1998). A general framework for adaptive processing of data structures. IEEE Transactions on Neural Networks,9(5),768–786.
Freund,Y. and Schapire,R. E.(1996a). Experiments with a new boosting algorithm. In Machine Learning: Proceedings of Thirteenth International Conference,pages 148–156,USA. ACM.
Freund,Y. and Schapire,R. E.(1996b). Game theory,on-line prediction and boosting. In Proceedings of the Ninth Annual Conference on Computational Learning Theory,pages 325–332.
Frey,B. J.(1998). Graphical models for machine learning and digital communication. MIT Press.
Frey,B. J.,Hinton,G. E.,and Dayan,P.(1996). Does the wake-sleep algorithm learn good density estimators? In D. Touretzky,M. Mozer,and M. Hasselmo,editors,Advances in Neural Information Processing Systems 8(NIPS'95),pages 661–670. MIT Press,Cambridge,MA.
Frobenius,G.(1908). Über matrizen aus positiven elementen,s. B. Preuss. Akad. Wiss. Berlin,Germany.
Fukushima,K.(1975). Cognitron: A self-organizing multilayered neural network. Biological Cybernetics,20,121–136.
Fukushima,K.(1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics,36,193–202.
Gal,Y. and Ghahramani,Z.(2015). Bayesian convolutional neural networks with Bernoulli approximate variational inference. arXiv preprint arXiv:1506.02158.
Gallinari,P.,LeCun,Y.,Thiria,S.,and Fogelman-Soulie,F.(1987). Memoires associatives distribuees. In Proceedings of COGNITIVA 87,Paris,La Villette.
Garcia-Duran,A.,Bordes,A.,Usunier,N.,and Grandvalet,Y.(2015). Combining two and three-way embeddings models for link prediction in knowledge bases. arXiv preprint arXiv:1506.00999.
Garofolo,J. S.,Lamel,L. F.,Fisher,W. M.,Fiscus,J. G.,and Pallett,D. S.(1993). Darpa timit acoustic-phonetic continous speech corpus cd-rom. nist speech disc 1-1.1. NASA STI/Recon Technical Report N,93,27403.
Garson,J.(1900). The metric system of identification of criminals,as used in Great Britain and Ireland. The Journal of the Anthropological Institute of Great Britain and Ireland,(2),177–227.
Gers,F. A.,Schmidhuber,J.,and Cummins,F.(2000). Learning to forget: Continual prediction with LSTM. Neural computation,12(10),2451–2471.
Ghahramani,Z. and Hinton,G. E.(1996). The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1,Dpt. of Comp. Sci.,Univ. of Toronto.
Gillick,D.,Brunk,C.,Vinyals,O.,and Subramanya,A.(2015). Multilingual language processing from bytes. arXiv preprint arXiv:1512.00103.
Girshick,R.,Donahue,J.,Darrell,T.,and Malik,J.(2015). Region-based convolutional networks for accurate object detection and segmentation.
Giudice,M. D.,Manera,V.,and Keysers,C.(2009). Programmed to learn? The ontogeny of mirror neurons. Dev. Sci.,12(2),350–363.
Glorot,X. and Bengio,Y.(2010). Understanding the difficulty of training deep feedforward neural networks. In AISTATS'2010.
Glorot,X.,Bordes,A.,and Bengio,Y.(2011a). Deep sparse rectifier neural networks. In AISTATS'2011.
Glorot,X.,Bordes,A.,and Bengio,Y.(2011b). Domain adaptation for large-scale sentiment classification: A deep learning approach. In ICML'2011.
Glorot,X.,Bordes,A.,and Bengio,Y.(2011c). Domain adaptation for large-scale sentiment classification: A deep learning approach. In ICM(1b),pages 97–110.
Goldberger,J.,Roweis,S.,Hinton,G. E.,and Salakhutdinov,R.(2005). Neighbourhood components analysis. In L. Saul,Y. Weiss,and L. Bottou,editors,Advances in Neural Information Processing Systems 17(NIPS'04). MIT Press.
Gong,S.,McKenna,S.,and Psarrou,A.(2000). Dynamic Vision: From Images to Face Recognition. Imperial College Press.
Goodfellow,I.,Le,Q.,Saxe,A.,and Ng,A.(2009). Measuring invariances in deep networks. In Y. Bengio,D. Schuurmans,C. Williams,J. Lafferty,and A. Culotta,editors,Advances in Neural Information Processing Systems 22(NIPS'09),pages 646–654.
Goodfellow,I.,Koenig,N.,Muja,M.,Pantofaru,C.,Sorokin,A.,and Takayama,L.(2010). Help me help you: Interfaces for personal robots. In Proc. of Human Robot Interaction(HRI),Osaka,Japan. ACM Press,ACM Press.
Goodfellow,I.,Mirza,M.,Xiao,D.,Courville,A.,and Bengio,Y.(2014a). An empirical inves-tigation of catastrophic forgetting in gradient-based neural networks. In ICLR'14.
Goodfellow,I. J.(2010). Technical report:Multidimensional,downsampled convolution for autoencoders. Technical report,Université de Montréal.
Goodfellow,I. J.(2014). On distinguishability criteria for estimating generative models. In International Conference on Learning Representations,Workshops Track.
Goodfellow,I. J.,Courville,A.,and Bengio,Y.(2011). Spike-and-slab sparse coding for unsu-pervised feature discovery. In NIPS Workshop on Challenges in Learning Hierarchical Models.
Goodfellow,I. J.,Warde-Farley,D.,Mirza,M.,Courville,A.,and Bengio,Y.(2013a). Maxout networks. In ICML'2013.
Goodfellow,I. J.,Warde-Farley,D.,Mirza,M.,Courville,A.,and Bengio,Y.(2013b). Maxout networks. In ICM(1c),pages 1319–1327.
Goodfellow,I. J.,Warde-Farley,D.,Mirza,M.,Courville,A.,and Bengio,Y.(2013c). Maxout networks. Technical Report arXiv:1302.4389,Université de Montréal.
Goodfellow,I. J.,Mirza,M.,Courville,A.,and Bengio,Y.(2013d). Multi-prediction deep Boltzmann machines. In NIP(1).
Goodfellow,I. J.,Warde-Farley,D.,Lamblin,P.,Dumoulin,V.,Mirza,M.,Pascanu,R.,Bergstra,J.,Bastien,F.,and Bengio,Y.(2013e). Pylearn2: a machine learning research library. arXiv preprint arXiv:1308.4214.
Goodfellow,I. J.,Courville,A.,and Bengio,Y.(2013f). Scaling up spike-and-slab models for unsupervised feature learning. IEEE T. PAMI,pages 1902–1914.
Goodfellow,I. J.,Courville,A.,and Bengio,Y.(2013g). Scaling up spike-and-slab models for un-supervised feature learning. IEEE Transactions on Pattern Analysis and Machine Intelligence,35(8),1902–1914.
Goodfellow,I. J.,Shlens,J.,and Szegedy,C.(2014b). Explaining and harnessing adversarial examples. CoRR,abs/1412.6572.
Goodfellow,I. J.,Pouget-Abadie,J.,Mirza,M.,Xu,B.,Warde-Farley,D.,Ozair,S.,Courville,A.,and Bengio,Y.(2014c). Generative adversarial networks. In NIPS'2014.
Goodfellow,I. J.,Bulatov,Y.,Ibarz,J.,Arnoud,S.,and Shet,V.(2014d). Multi-digit number recognition from Street View imagery using deep convolutional neural networks. In International Conference on Learning Representations.
Goodfellow,I. J.,Vinyals,O.,and Saxe,A. M.(2015). Qualitatively characterizing neural network optimization problems. In International Conference on Learning Representations.
Goodman,J.(2001). Classes for fast maximum entropy training. In International Conference on Acoustics,Speech and Signal Processing(ICASSP),Utah.
Gori,M. and Tesi,A.(1992). On the problem of local minima in backpropagation. IEEE Transactions on Pattern Analysis and Machine Intelligence,PAMI-14(1),76–86.
Gosset,W. S.(1908). The probable error of a mean. Biometrika,6(1),1–25. Originally published under the pseudonym“Student”.
Gouws,S.,Bengio,Y.,and Corrado,G.(2014). BilBOWA: Fast bilingual distributed representations without word alignments. Technical report,arXiv:1410.2455.
Graf,H. P. and Jackel,L. D.(1989). Analog electronic neural network circuits. Circuits and Devices Magazine,IEEE,5(4),44–49.
Graves,A.(2011). Practical variational inference for neural networks. In NIPS'2011.
Graves,A.(2012). Supervised Sequence Labelling with Recurrent Neural Networks. Studies in Computational Intelligence. Springer.
Graves,A.(2013). Generating sequences with recurrent neural networks. Technical report,arXiv:1308.0850.
Graves,A. and Jaitly,N.(2014). Towards end-to-end speech recognition with recurrent neural networks. In ICML'2014.
Graves,A. and Schmidhuber,J.(2005). Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks,18(5),602–610.
Graves,A. and Schmidhuber,J.(2009). Offine handwriting recognition with multidimensional recurrent neural networks. In D. Koller,D. Schuurmans,Y. Bengio,and L. Bottou,editors,NIPS'2008,pages 545–552.
Graves,A.,Fernández,S.,Gomez,F.,and Schmidhuber,J.(2006). Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In ICML'2006,pages 369–376,Pittsburgh,USA.
Graves,A.,Liwicki,M.,Bunke,H.,Schmidhuber,J.,and Fernández,S.(2008). Unconstrained on-line handwriting recognition with recurrent neural networks. In J. Platt,D. Koller,Y. Singer,and S. Roweis,editors,NIPS'2007,pages 577–584.
Graves,A.,Liwicki,M.,Fernández,S.,Bertolami,R.,Bunke,H.,and Schmidhuber,J.(2009). A novel connectionist system for unconstrained handwriting recognition. Pattern Analysis and Machine Intelligence,IEEE Transactions on,31(5),855–868.
Graves,A.,Mohamed,A.,and Hinton,G.(2013). Speech recognition with deep recurrent neural networks. In ICASSP'2013,pages 6645–6649.
Graves,A.,Wayne,G.,and Danihelka,I.(2014). Neural Turing machines. arXiv:1410.5401.
Grefenstette,E.,Hermann,K. M.,Suleyman,M.,and Blunsom,P.(2015). Learning to transduce with unbounded memory. In NIPS'2015.
Greff,K.,Srivastava,R. K.,Koutník,J.,Steunebrink,B. R.,and Schmidhuber,J.(2015). LSTM: a search space odyssey. arXiv preprint arXiv:1503.04069.
Gregor,K. and LeCun,Y.(2010a). Emergence of complex-like cells in a temporal product network with local receptivefields. Technical report,arXiv:1006.0448.
Gregor,K. and LeCun,Y.(2010b). Learning fast approximations of sparse coding. In L. Bottou and M. Littman,editors,Proceedings of the Twenty-seventh International Conference on Machine Learning(ICML-10). ACM.
Gregor,K.,Danihelka,I.,Mnih,A.,Blundell,C.,and Wierstra,D.(2014). Deep autoregressive networks. In International Conference on Machine Learning(ICML'2014).
Gregor,K.,Danihelka,I.,Graves,A.,and Wierstra,D.(2015). DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623.
Gretton,A.,Borgwardt,K. M.,Rasch,M. J.,Schölkopf,B.,and Smola,A.(2012). A kernel two-sample test. The Journal of Machine Learning Research,13(1),723–773.
Guillaume Desjardins,Karen Simonyan,R. P. K. K.(2015). Natural neural networks. Technical report,arXiv:1507.00210.
Gulcehre,C. and Bengio,Y.(2013). Knowledge matters: Importance of prior information for optimization. Technical Report arXiv:1301.4083,Universite de Montreal.
Guo,H. and Gelfand,S. B.(1992). Classification trees with neural network feature extraction. Neural Networks,IEEE Transactions on,3(6),923–933.
Gupta,S.,Agrawal,A.,Gopalakrishnan,K.,and Narayanan,P.(2015). Deep learning with limited numerical precision. CoRR,abs/1502.02551.
Gutmann,M. and Hyvarinen,A.(2010). Noise-contrastive estimation: A new estimation princi-ple for unnormalized statistical models. In Proceedings of The Thirteenth International Conference on Artificial Intelligence and Statistics(AISTATS'10).
Hadsell,R.,Sermanet,P.,Ben,J.,Erkan,A.,Han,J.,Muller,U.,and LeCun,Y.(2007). Online learning for offroad robots: Spatial label propagation to learn long-range traversability. In Proceedings of Robotics: Science and Systems,Atlanta,GA,USA.
Hajnal,A.,Maass,W.,Pudlak,P.,Szegedy,M.,and Turan,G.(1993). Threshold circuits of bounded depth. J. Comput. System. Sci.,46,129–154.
Håstad,J.(1986). Almost optimal lower bounds for small depth circuits. In Proceedings of the 18th annual ACM Symposium on Theory of Computing,pages 6–20,Berkeley,California. ACM Press.
Håstad,J. and Goldmann,M.(1991). On the power of small-depth threshold circuits. Computational Complexity,1,113–129.
Hastie,T.,Tibshirani,R.,and Friedman,J.(2001). The elements of statistical learning: data mining,inference and prediction. Springer Series in Statistics. Springer Verlag.
He,K.,Zhang,X.,Ren,S.,and Sun,J.(2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. arXiv preprint arXiv:1502.01852.
Hebb,D. O.(1949). The Organization of Behavior. Wiley,New York.
Henaff,M.,Jarrett,K.,Kavukcuoglu,K.,and LeCun,Y.(2011). Unsupervised learning of sparse features for scalable audio classification. In ISMIR'11.
Henderson,J.(2003). Inducing history representations for broad coverage statistical parsing. In HLT-NAACL,pages 103–110.
Henderson,J.(2004). Discriminative training of a neural network statistical parser. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics,page 95.
Henniges,M.,Puertas,G.,Bornschein,J.,Eggert,J.,and Lücke,J.(2010). Binary sparse coding. In Latent Variable Analysis and Signal Separation,pages 450–457. Springer.
Herault,J. and Ans,B.(1984). Circuits neuronaux à synapses modifiables: Décodage de messages composites par apprentissage non supervisé. Comptes Rendus de l'Académie des Sciences,299(III-13),525–528.
Hinton,G.,Deng,L.,Dahl,G. E.,Mohamed,A.,Jaitly,N.,Senior,A.,Vanhoucke,V.,Nguyen,P.,Sainath,T.,and Kingsbury,B.(2012a). Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine,29(6),82–97.
Hinton,G.,Vinyals,O.,and Dean,J.(2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
Hinton,G. E.(1989). Connectionist learning procedures. Artificial Intelligence,40,185–234.
Hinton,G. E.(1990). Mapping part-whole hierarchies into connectionist networks. Artificial Intelligence,46(1),47–75.
Hinton,G. E.(1999). Products of experts. In Proceedings of the Ninth International Conference on Artificial Neural Networks(ICANN),volume 1,pages 1–6,Edinburgh,Scotland. IEE.
Hinton,G. E.(2000). Training products of experts by minimizing contrastive divergence. Technical Report GCNU TR 2000-004,Gatsby Unit,University College London.
Hinton,G. E.(2006). To recognize shapes,first learn to generate images. Technical Report UTML TR 2006-003,University of Toronto.
Hinton,G. E.(2007a). How to do backpropagation in a brain. Invited talk at the NIPS'2007 Deep Learning Workshop.
Hinton,G. E.(2007b). Learning multiple layers of representation. Trends in cognitive sciences,11(10),428–434.
Hinton,G. E.(2010). A practical guide to training restricted Boltzmann machines. Technical Report UTML TR 2010-003,Comp. Sc.,University of Toronto.
Hinton,G. E.(2012). Tutorial on deep learning. IPAM Graduate Summer School: Deep Learning,Feature Learning.
Hinton,G. E. and Ghahramani,Z.(1997). Generative models for discovering sparse distributed representations. Philosophical Transactions of the Royal Society of London.
Hinton,G. E. and McClelland,J. L.(1988). Learning representations by recirculation. In NIPS'1987,pages 358–366.
Hinton,G. E. and Roweis,S.(2003). Stochastic neighbor embedding. In NIPS'2002.
Hinton,G. E. and Salakhutdinov,R.(2006). Reducing the dimensionality of data with neural networks. Science,313(5786),504–507.
Hinton,G. E. and Sejnowski,T. J.(1986). Learning and relearning in Boltzmann machines. In D. E. Rumelhart and J. L. McClelland,editors,Parallel Distributed Processing,volume 1,chapter 7,pages 282–317. MIT Press,Cambridge.
Hinton,G. E. and Sejnowski,T. J.(1999). Unsupervised learning: foundations of neural computation. MIT press.
Hinton,G. E. and Shallice,T.(1991). Lesioning an attractor network: investigations of acquired dyslexia. Psychological review,98(1),74.
Hinton,G. E. and Zemel,R. S.(1994). Autoencoders,minimum description length,and Helmholtz free energy. In NIPS'1993.
Hinton,G. E.,Sejnowski,T. J.,and Ackley,D. H.(1984a). Boltzmann machines: Constraint satisfaction networks that learn. Technical Report TR-CMU-CS-84-119,Carnegie-Mellon Uni-versity,Dept. of Computer Science.
Hinton,G. E.,Sejnowski,T. J.,and Ackley,D. H.(1984b). Boltzmann machines: Constraint satisfaction networks that learn. Technical Report TR-CMU-CS-84-119,Carnegie-Mellon Uni-versity,Dept. of Computer Science.
Hinton,G. E.,McClelland,J.,and Rumelhart,D.(1986). Distributed representations. In D. E. Rumelhart and J. L. McClelland,editors,Parallel Distributed Processing: Explorations in the Microstructure of Cognition,volume 1,pages 77–109. MIT Press,Cambridge.
Hinton,G. E.,Revow,M.,and Dayan,P.(1995a). Recognizing handwritten digits using mixtures of linear models. In G. Tesauro,D. Touretzky,and T. Leen,editors,Advances in Neural Information Processing Systems 7(NIPS'94),pages 1015–1022. MIT Press,Cambridge,MA.
Hinton,G. E.,Dayan,P.,Frey,B. J.,and Neal,R. M.(1995b). The wake-sleep algorithm for unsupervised neural networks. Science,268,1558–1161.
Hinton,G. E.,Dayan,P.,and Revow,M.(1997). Modelling the manifolds of images of hand-written digits. IEEE Transactions on Neural Networks,8,65–74.
Hinton,G. E.,Welling,M.,Teh,Y. W.,and Osindero,S.(2001). A new view of ICA. In Proceedings of 3rd International Conference on Independent Component Analysis and Blind Signal Separation(ICA'01),pages 746–751,San Diego,CA.
Hinton,G. E.,Osindero,S.,and Teh,Y.(2006a). A fast learning algorithm for deep belief nets. Neural Computation,18,1527–1554.
Hinton,G. E.,Osindero,S.,and Teh,Y.-W.(2006b). A fast learning algorithm for deep belief nets. Neural Computation,18,1527–1554.
Hinton,G. E.,Deng,L.,Yu,D.,Dahl,G. E.,Mohamed,A.,Jaitly,N.,Senior,A.,Vanhoucke,V.,Nguyen,P.,Sainath,T. N.,and Kingsbury,B.(2012b). Deep neural networks for acoustic modeling in speech recognition:The shared views of four research groups. IEEE Signal Process. Mag.,29(6),82–97.
Hinton,G. E.,Srivastava,N.,Krizhevsky,A.,Sutskever,I.,and Salakhutdinov,R.(2012c). Improving neural networks by preventing co-adaptation of feature detectors. Technical report,arXiv:1207.0580.
Hinton,G. E.,Srivastava,N.,Krizhevsky,A.,Sutskever,I.,and Salakhutdinov,R.(2012d). Improving neural networks by preventing co-adaptation of feature detectors. Technical report,arXiv:1207.0580.
Hinton,G. E.,Vinyals,O.,and Dean,J.(2014). Dark knowledge. Invited talk at the BayLearn Bay Area Machine Learning Symposium.
Hochreiter,S.(1991a). Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis,T.U. München.
Hochreiter,S.(1991b). Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis,Institut für Informatik,Lehrstuhl Prof. Brauer,Technische Universität München.
Hochreiter,S. and Schmidhuber,J.(1995). Simplifying neural nets by discoveringflat minima. In Advances in Neural Information Processing Systems 7,pages 529–536. MIT Press.
Hochreiter,S. and Schmidhuber,J.(1997). Long short-term memory. Neural Computation,9(8),1735–1780.
Hochreiter,S.,Bengio,Y.,and Frasconi,P.(2001). Gradientflow in recurrent nets: the difficulty of learning long-term dependencies. In J. Kolen and S. Kremer,editors,Field Guide to Dynamical Recurrent Networks. IEEE Press.
Holi,J. L. and Hwang,J.-N.(1993). Finite precision error analysis of neural network hardware implementations. Computers,IEEE Transactions on,42(3),281–290.
Holt,J. L. and Baker,T. E.(1991). Back propagation simulations using limited precision calculations. In Neural Networks,1991.,IJCNN-91-Seattle International Joint Conference on,volume 2,pages 121–126. IEEE.
Hornik,K.,Stinchcombe,M.,and White,H.(1989). Multilayer feedforward networks are universal approximators. Neural Networks,2,359–366.
Hornik,K.,Stinchcombe,M.,and White,H.(1990). Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural networks,3(5),551–560.
Hsu,F.-H.(2002). Behind Deep Blue: Building the Computer That Defeated the World Chess Champion. Princeton University Press,Princeton,NJ,USA.
Huang,F. and Ogata,Y.(2002). Generalized pseudo-likelihood estimates for Markov random fields on lattice. Annals of the Institute of Statistical Mathematics,54(1),1–18.
Huang,P.-S.,He,X.,Gao,J.,Deng,L.,Acero,A.,and Heck,L.(2013). Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM international conference on Conference on information & knowledge management,pages 2333–2338. ACM.
Hubel,D. and Wiesel,T.(1968). Receptivefields and functional architecture of monkey striate cortex. Journal of Physiology(London),195,215–243.
Hubel,D. H. and Wiesel,T. N.(1959). Receptivefields of single neurons in the cat's striate cortex. Journal of Physiology,148,574–591.
Hubel,D. H. and Wiesel,T. N.(1962). Receptivefields,binocular interaction,and functional architecture in the cat's visual cortex. Journal of Physiology(London),160,106–154.
Huszar,F.(2015). How(not) to train your generative model: schedule sampling,likelihood,adversary? arXiv:1511.05101.
Hutter,F.,Hoos,H.,and Leyton-Brown,K.(2011). Sequential model-based optimization for general algorithm configuration. In LION-5. Extended version as UBC Tech report TR-2010-10.
Hyötyniemi,H.(1996). Turing machines are recurrent neural networks. In STeP'96,pages 13–24.
Hyvärinen,A.(1999). Survey on independent component analysis. Neural Computing Surveys,2,94–128.
Hyvärinen,A.(2005a). Estimation of non-normalized statistical models using score matching. Journal of Machine Learning Research,6,695–709.
Hyvärinen,A.(2005b). Estimation of non-normalized statistical models using score matching. J. Machine Learning Res.,6.
Hyvärinen,A.(2007a). Connections between score matching,contrastive divergence,and pseudolikelihood for continuous-valued variables. IEEE Transactions on Neural Networks,18,1529–1531.
Hyvärinen,A.(2007b). Some extensions of score matching. Computational Statistics and Data Analysis,51,2499–2512.
Hyvärinen,A. and Hoyer,P. O.(1999). Emergence of topography and complex cell properties from natural images using extensions of ICA. In NIPS,pages 827–833.
Hyvärinen,A. and Pajunen,P.(1999). Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks,12(3),429–439.
Hyvärinen,A.,Karhunen,J.,and Oja,E.(2001a). Independent Component Analysis. Wiley-Interscience.
Hyvärinen,A.,Hoyer,P. O.,and Inki,M. O.(2001b). Topographic independent component analysis. Neural Computation,13(7),1527–1558.
Hyvärinen,A.,Hurri,J.,and Hoyer,P. O.(2009). Natural Image Statistics: A probabilistic approach to early computational vision. Springer-Verlag.
Iba,Y.(2001). Extended ensemble Monte Carlo. International Journal of Modern Physics,C12,623–656.
Inayoshi,H. and Kurita,T.(2005). Improved generalization by adding both auto-association and hidden-layer noise to neural-network-based classifiers. IEEE Workshop on Machine Learning for Signal Processing,pages 141–146.
Ioffe,S. and Szegedy,C.(2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift.
Jacobs,R. A.(1988). Increased rates of convergence through learning rate adaptation. Neural networks,1(4),295–307.
Jacobs,R. A.,Jordan,M. I.,Nowlan,S. J.,and Hinton,G. E.(1991). Adaptive mixtures of local experts. Neural Computation,3,79–87.
Jaeger,H.(2003). Adaptive nonlinear system identification with echo state networks. In Advances in Neural Information Processing Systems 15.
Jaeger,H.(2007a). Discovering multiscale dynamical features with hierarchical echo state networks. Technical report,Jacobs University.
Jaeger,H.(2007b). Echo state network. Scholarpedia,2(9),2330.
Jaeger,H.(2012). Long short-term memory in echo state networks: Details of a simulation study. Technical report,Jacobs University Bremen.
Jaeger,H. and Haas,H.(2004). Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science,304(5667),78–80.
Jaeger,H.,Lukosevicius,M.,Popovici,D.,and Siewert,U.(2007). Optimization and applications of echo state networks with leaky-integrator neurons. Neural Networks,20(3),335–352.
Jain,V.,Murray,J. F.,Roth,F.,Turaga,S.,Zhigulin,V.,Briggman,K. L.,Helmstaedter,M. N.,Denk,W.,and Seung,H. S.(2007). Supervised learning of image restoration with convolutional networks. In Computer Vision,2007. ICCV 2007. IEEE 11th International Conference on,pages 1–8. IEEE.
Jaitly,N. and Hinton,G.(2011). Learning a better representation of speech soundwaves using restricted Boltzmann machines. In Acoustics,Speech and Signal Processing(ICASSP),2011 IEEE International Conference on,pages 5884–5887. IEEE.
Jaitly,N. and Hinton,G. E.(2013). Vocal tract length perturbation(VTLP) improves speech recognition. In ICML'2013.
Jarrett,K.,Kavukcuoglu,K.,Ranzato,M.,and LeCun,Y.(2009a). What is the best multi-stage architecture for object recognition? In Proc. International Conference on Computer Vision(ICCV'09),pages 2146–2153. IEEE.
Jarrett,K.,Kavukcuoglu,K.,Ranzato,M.,and LeCun,Y.(2009b). What is the best multi-stage architecture for object recognition? In ICCV'09.
Jarzynski,C.(1997). Nonequilibrium equality for free energy differences. Phys. Rev. Lett.,78,2690–2693.
Jaynes,E. T.(2003). Probability Theory: The Logic of Science. Cambridge University Press.
Jean,S.,Cho,K.,Memisevic,R.,and Bengio,Y.(2014). On using very large target vocabulary for neural machine translation. arXiv:1412.2007.
Jelinek,F. and Mercer,R. L.(1980). Interpolated estimation of Markov source parameters from sparse data. In E. S. Gelsema and L. N. Kanal,editors,Pattern Recognition in Practice. North-Holland,Amsterdam.
Jia,Y.(2013). Caffe:An open source convolutional architecture for fast feature embedding. http://caffe.berkeleyvision.org/.
Jia,Y.,Huang,C.,and Darrell,T.(2012). Beyond spatial pyramids: Receptive field learning for pooled image features. In Computer Vision and Pattern Recognition(CVPR),2012 IEEE Conference on,pages 3370–3377. IEEE.
Jim,K.-C.,Giles,C. L.,and Horne,B. G.(1996). An analysis of noise in recurrent neural networks: convergence and generalization. IEEE Transactions on Neural Networks,7(6),1424–1438.
Jordan,M. I.(1998). Learning in Graphical Models. Kluwer,Dordrecht,Netherlands.
Joulin,A. and Mikolov,T.(2015). Inferring algorithmic patterns with stack-augmented recurrent nets. arXiv preprint arXiv:1503.01007.
Jozefowicz,R.,Zaremba,W.,and Sutskever,I.(2015). An empirical evaluation of recurrent network architectures. In ICML'2015.
Judd,J. S.(1989). Neural Network Design and the Complexity of Learning. MIT press.
Jutten,C. and Herault,J.(1991). Blind separation of sources,part I: an adaptive algorithm based on neuromimetic architecture. Signal Processing,24,1–10.
Kahou,S. E.,Pal,C.,Bouthillier,X.,Froumenty,P.,Gülçehre,Ç.,Memisevic,R.,Vincent,P.,Courville,A.,Bengio,Y.,Ferrari,R. C.,Mirza,M.,Jean,S.,Carrier,P. L.,Dauphin,Y.,Boulanger-Lewandowski,N.,Aggarwal,A.,Zumer,J.,Lamblin,P.,Raymond,J.-P.,Desjardins,G.,Pascanu,R.,Warde-Farley,D.,Torabi,A.,Sharma,A.,Bengio,E.,Côté,M.,Konda,K. R.,and Wu,Z.(2013). Combining modality specific deep neural networks for emotion recognition in video. In Proceedings of the 15th ACM on International Conference on Multimodal Interaction.
Kalchbrenner,N. and Blunsom,P.(2013). Recurrent continuous translation models. In EMNLP'2013.
Kalchbrenner,N.,Danihelka,I.,and Graves,A.(2015). Grid long short-term memory. arXiv preprint arXiv:1507.01526.
Kamyshanska,H. and Memisevic,R.(2015). The potential energy of an autoencoder. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Karpathy,A. and Li,F.-F.(2015). Deep visual-semantic alignments for generating image descriptions. In CVPR'2015. arXiv:1412.2306.
Karpathy,A.,Toderici,G.,Shetty,S.,Leung,T.,Sukthankar,R.,and Fei-Fei,L.(2014). Large-scale video classification with convolutional neural networks. In CVPR.
Karush,W.(1939). Minima of Functions of Several Variables with Inequalities as Side Constraints. Master's thesis,Dept. of Mathematics,Univ. of Chicago.
Katz,S. M.(1987). Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics,Speech,and Signal Processing,ASSP-35(3),400–401.
Kavukcuoglu,K.,Ranzato,M.,and LeCun,Y.(2008). Fast inference in sparse coding algorithms with applications to object recognition. Technical report,Computational and Biological Learning Lab,Courant Institute,NYU. Tech Report CBLL-TR-2008-12-01.
Kavukcuoglu,K.,Ranzato,M.-A.,Fergus,R.,and LeCun,Y.(2009). Learning invariant features through topographic filter maps. In CVPR'2009.
Kavukcuoglu,K.,Sermanet,P.,Boureau,Y.-L.,Gregor,K.,Mathieu,M.,and LeCun,Y.(2010). Learning convolutional feature hierarchies for visual recognition. In NIPS'2010.
Kelley,H. J.(1960). Gradient theory of optimal flight paths. ARS Journal,30(10),947–954.
Khan,F.,Zhu,X.,and Mutlu,B.(2011). How do humans teach: On curriculum learning and teaching dimension. In Advances in Neural Information Processing Systems 24(NIPS'11),pages 1449–1457.
Kim,S. K.,McAfee,L. C.,McMahon,P. L.,and Olukotun,K.(2009). A highly scalable restricted Boltzmann machine FPGA implementation. In Field Programmable Logic and Applications,2009. FPL 2009. International Conference on,pages 367–372. IEEE.
Kindermann,R.(1980). Markov Random Fields and Their Applications(Contemporary Mathematics;V. 1). American Mathematical Society.
Kingma,D. and Ba,J.(2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Kingma,D. and LeCun,Y.(2010a). Regularized estimation of image statistics by score matching. In NIPS'2010.
Kingma,D. and LeCun,Y.(2010b). Regularized estimation of image statistics by score matching. In J. Lafferty,C. K. I. Williams,J. Shawe-Taylor,R. Zemel,and A. Culotta,editors,Advances in Neural Information Processing Systems 23,pages 1126–1134.
Kingma,D.,Rezende,D.,Mohamed,S.,and Welling,M.(2014). Semi-supervised learning with deep generative models. In NIPS'2014.
Kingma,D. P.(2013). Fast gradient-based inference with continuous latent variable models in auxiliary form. Technical report,arXiv:1306.0733.
Kingma,D. P. and Welling,M.(2014a). Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations(ICLR).
Kingma,D. P. and Welling,M.(2014b). Efficient gradient-based inference through transformations between Bayes nets and neural nets. Technical report,arXiv:1402.0480.
Kirkpatrick,S.,Gelatt Jr.,C. D.,and Vecchi,M. P.(1983). Optimization by simulated annealing. Science,220,671–680.
Kiros,R.,Salakhutdinov,R.,and Zemel,R.(2014a). Multimodal neural language models. In ICML'2014.
Kiros,R.,Salakhutdinov,R.,and Zemel,R.(2014b). Unifying visual-semantic embeddings with multimodal neural language models. arXiv:1411.2539 [cs.LG].
Klementiev,A.,Titov,I.,and Bhattarai,B.(2012). Inducing crosslingual distributed representations of words. In Proceedings of COLING 2012.
Knowles-Barley,S.,Jones,T. R.,Morgan,J.,Lee,D.,Kasthuri,N.,Lichtman,J. W.,and Pfister,H.(2014). Deep learning for the connectome. GPU Technology Conference.
Koller,D. and Friedman,N.(2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press.
Konig,Y.,Bourlard,H.,and Morgan,N.(1996). REMAP:Recursive estimation and maximization of a posteriori probabilities–application to transition-based connectionist speech recognition. In D. Touretzky,M. Mozer,and M. Hasselmo,editors,Advances in Neural Information Processing Systems 8(NIPS'95). MIT Press,Cambridge,MA.
Koren,Y.(2009). The BellKor solution to the Netflix grand prize.
Kotzias,D.,Denil,M.,de Freitas,N.,and Smyth,P.(2015). From group to individual labels using deep features. In ACM SIGKDD.
Koutnik,J.,Greff,K.,Gomez,F.,and Schmidhuber,J.(2014). A clockwork RNN. In ICML'2014.
Kočiský,T.,Hermann,K. M.,and Blunsom,P.(2014). Learning Bilingual Word Representations by Marginalizing Alignments. In Proceedings of ACL.
Krause,O.,Fischer,A.,Glasmachers,T.,and Igel,C.(2013). Approximation properties of DBNs with binary hidden units and real-valued visible units. In ICML'2013.
Krizhevsky,A.(2010). Convolutional deep belief networks on CIFAR-10. Technical report,University of Toronto. Unpublished Manuscript: http://www.cs.utoronto.ca/kriz/conv-cifar10-aug2010.pdf.
Krizhevsky,A. and Hinton,G.(2009). Learning multiple layers of features from tiny images. Technical report,University of Toronto.
Krizhevsky,A. and Hinton,G. E.(2011). Using very deep autoencoders for content-based image retrieval. In ESANN.
Krizhevsky,A.,Sutskever,I.,and Hinton,G.(2012a). ImageNet classification with deep convo-lutional neural networks. In NIPS'2012.
Krizhevsky,A.,Sutskever,I.,and Hinton,G.(2012b). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25(NIPS'2012).
Krueger,K. A. and Dayan,P.(2009). Flexible shaping: how learning in small steps helps. Cognition,110,380–394.
Kuhn,H. W. and Tucker,A. W.(1951). Nonlinear programming. In Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability,pages 481–492,Berkeley,Calif. University of California Press.
Kumar,A.,Irsoy,O.,Ondruska,P.,Iyyer,M.,Bradbury,J.,Gulrajani,I.,and Socher,R.(2015a). Ask me anything: Dynamic memory networks for natural language processing. Technical report,arXiv:1506.07285.
Kumar,A.,Irsoy,O.,Su,J.,Bradbury,J.,English,R.,Pierce,B.,Ondruska,P.,Iyyer,M.,Gulrajani,I.,and Socher,R.(2015b). Ask me anything: Dynamic memory networks for natural language processing. arXiv:1506.07285.
Kumar,M. P.,Packer,B.,and Koller,D.(2010). Self-paced learning for latent variable models. In J. Lafferty,C. K. I. Williams,J. Shawe-Taylor,R. Zemel,and A. Culotta,editors,Advances in Neural Information Processing Systems 23,pages 1189–1197.
Lang,K. J. and Hinton,G. E.(1988). The development of the time-delay neural network architecture for speech recognition. Technical Report CMU-CS-88-152,Carnegie-Mellon University.
Lang,K. J.,Waibel,A. H.,and Hinton,G. E.(1990). A time-delay neural network architecture for isolated word recognition. Neural networks,3(1),23–43.
Langford,J. and Zhang,T.(2008). The epoch-greedy algorithm for contextual multi-armed bandits. In NIPS'2008,pages 1096–1103.
Lappalainen,H.,Giannakopoulos,X.,Honkela,A.,and Karhunen,J.(2000). Nonlinear independent component analysis using ensemble learning: Experiments and discussion. In Proc. ICA. Citeseer.
Larochelle,H. and Bengio,Y.(2008a). Classification using discriminative restricted Boltzmann machines. In ICML'2008.
Larochelle,H. and Bengio,Y.(2008b). Classification using discriminative restricted Boltzmann machines. In ICML'2008,pages 536–543.
Larochelle,H. and Hinton,G. E.(2010). Learning to combine foveal glimpses with a third-order Boltzmann machine. In Advances in Neural Information Processing Systems 23,pages 1243–1251.
Larochelle,H. and Murray,I.(2011). The Neural Autoregressive Distribution Estimator. In AISTATS'2011.
Larochelle,H.,Erhan,D.,and Bengio,Y.(2008). Zero-data learning of new tasks. In AAAI Conference on Artificial Intelligence.
Larochelle,H.,Bengio,Y.,Louradour,J.,and Lamblin,P.(2009). Exploring strategies for training deep neural networks. Journal of Machine Learning Research,10,1–40.
Lasserre,J. A.,Bishop,C. M.,and Minka,T. P.(2006). Principled hybrids of generative and discriminative models. In Proceedings of the Computer Vision and Pattern Recognition Conference(CVPR'06),pages 87–94,Washington,DC,USA. IEEE Computer Society.
Le,Q.,Ngiam,J.,Chen,Z.,hao Chia,D. J.,Koh,P. W.,and Ng,A.(2010). Tiled convolutional neural networks. In J. Lafferty,C. K. I. Williams,J. Shawe-Taylor,R. Zemel,and A. Culotta,editors,Advances in Neural Information Processing Systems 23(NIPS'10),pages 1279–1287.
Le,Q.,Ngiam,J.,Coates,A.,Lahiri,A.,Prochnow,B.,and Ng,A.(2011). On optimization methods for deep learning. In Proc. ICML'2011. ACM.
Le,Q.,Ranzato,M.,Monga,R.,Devin,M.,Corrado,G.,Chen,K.,Dean,J.,and Ng,A.(2012). Building high-level features using large scale unsupervised learning. In ICML'2012.
Le Roux,N. and Bengio,Y.(2008). Representational power of restricted Boltzmann machines and deep belief networks. Neural Computation,20(6),1631–1649.
Le Roux,N. and Bengio,Y.(2010). Deep belief networks are compact universal approximators. Neural Computation,22(8),2192–2207.
LeCun,Y.(1985). Une procédure d'apprentissage pour Réseau à seuil assymétrique. In Cognitiva 85: A la Frontière de l'Intelligence Artificielle,des Sciences de la Connaissance et des Neurosciences,pages 599–604,Paris 1985. CESTA,Paris.
LeCun,Y.(1986). Learning processes in an asymmetric threshold network. In E. Bienenstock,F. Fogelman-Soulié,and G. Weisbuch,editors,Disordered Systems and Biological Organization,pages 233–240. Springer-Verlag,Berlin,Les Houches 1985.
LeCun,Y.(1987). Modèles connexionistes de l'apprentissage. Ph.D. thesis,Université de Paris VI.
LeCun,Y.(1989). Generalization and network design strategies. Technical Report CRG-TR-89-4,University of Toronto.
LeCun,Y.,Jackel,L. D.,Boser,B.,Denker,J. S.,Graf,H. P.,Guyon,I.,Henderson,D.,Howard,R. E.,and Hubbard,W.(1989). Handwritten digit recognition: Applications of neural network chips and automatic learning. IEEE Communications Magazine,27(11),41–46.
LeCun,Y.,Bottou,L.,Orr,G. B.,and Müller,K.-R.(1998a). Efficient backprop. In Neural Networks,Tricks of the Trade,Lecture Notes in Computer Science LNCS 1524. Springer Verlag.
LeCun,Y.,Bottou,L.,Orr,G. B.,and Müller,K.(1998b). Efficient backprop. In Neural Networks,Tricks of the Trade.
LeCun,Y.,Bottou,L.,Bengio,Y.,and Haffner,P.(1998c). Gradient based learning applied to document recognition. Proc. IEEE.
LeCun,Y.,Kavukcuoglu,K.,and Farabet,C.(2010). Convolutional networks and applications in vision. In Circuits and Systems(ISCAS),Proceedings of 2010 IEEE International Symposium on,pages 253–256. IEEE.
L'Ecuyer,P.(1994). Efficiency improvement and variance reduction. In Proceedings of the 1994 Winter Simulation Conference,pages 122–132.
Lee,C.-Y.,Xie,S.,Gallagher,P.,Zhang,Z.,and Tu,Z.(2014). Deeply-supervised nets. arXiv preprint arXiv:1409.5185.
Lee,H.,Battle,A.,Raina,R.,and Ng,A.(2007). Efficient sparse coding algorithms. In B. Schölkopf,J. Platt,and T. Hoffman,editors,Advances in Neural Information Processing Systems 19(NIPS'06),pages 801–808. MIT Press.
Lee,H.,Ekanadham,C.,and Ng,A.(2008). Sparse deep belief net model for visual area V2. In NIPS'07.
Lee,H.,Grosse,R.,Ranganath,R.,and Ng,A. Y.(2009). Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In L. Bottou and M. Littman,editors,Proceedings of the Twenty-sixth International Conference on Machine Learning(ICML'09). ACM,Montreal,Canada.
Lee,Y. J. and Grauman,K.(2011). Learning the easy things first: self-paced visual category discovery. In CVPR'2011.
Leibniz,G. W.(1676). Memoir using the chain rule.(Cited in TMME 7:2&3,pages 321–332,2010).
Lenat,D. B. and Guha,R. V.(1989). Building large knowledge-based systems;representation and inference in the Cyc project. Addison-Wesley Longman Publishing Co.,Inc.
Leshno,M.,Lin,V. Y.,Pinkus,A.,and Schocken,S.(1993). Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks,6,861–867.
Levenberg,K.(1944). A method for the solution of certain non-linear problems in least squares. Quarterly Journal of Applied Mathematics,II(2),164–168.
L'Hôpital,G. F. A.(1696). Analyse des infiniment petits,pour l'intelligence des lignes courbes. Paris: L'Imprimerie Royale.
Li,Y.,Swersky,K.,and Zemel,R. S.(2015). Generative moment matching networks. CoRR,abs/1502.02761.
Lin,T.,Horne,B. G.,Tino,P.,and Giles,C. L.(1996). Learning long-term dependencies is not as difficult with NARX recurrent neural networks. IEEE Transactions on Neural Networks,7(6),1329–1338.
Lin,Y.,Liu,Z.,Sun,M.,Liu,Y.,and Zhu,X.(2015). Learning entity and relation embeddings for knowledge graph completion. In Proc. AAAI'15.
Linde,N.(1992). The machine that changed the world,episode 3. Documentary miniseries.
Lindsey,C. and Lindblad,T.(1994). Review of hardware neural networks: a user's perspective. In Proc. Third Workshop on Neural Networks: From Biology to High Energy Physics,pages 195–202,Isola d'Elba,Italy.
Linnainmaa,S.(1976). Taylor expansion of the accumulated rounding error. BIT Numerical Mathematics,16(2),146–160.
LISA(2008). Deep learning tutorials:Restricted Boltzmann machines. Technical report,LISA Lab,Université de Montréal.
Long,P. M. and Servedio,R. A.(2010). Restricted Boltzmann machines are hard to approximately evaluate or simulate. In Proceedings of the 27th International Conference on Machine Learning(ICML'10).
Lotter,W.,Kreiman,G.,and Cox,D.(2015). Unsupervised learning of visual structure using predictive generative networks. arXiv preprint arXiv:1511.06380.
Lovelace,A.(1842). Notes upon L. F. Menabrea's“Sketch of the Analytical Engine invented by Charles Babbage”.
Lu,L.,Zhang,X.,Cho,K.,and Renals,S.(2015). A study of the recurrent neural network encoder-decoder for large vocabulary speech recognition. In Proc. Interspeech.
Lu,T.,Pál,D.,and Pál,M.(2010). Contextual multi-armed bandits. In International Conference on Artificial Intelligence and Statistics,pages 485–492.
Luenberger,D. G.(1984). Linear and Nonlinear Programming. Addison Wesley.
Lukoševičius,M. and Jaeger,H.(2009). Reservoir computing approaches to recurrent neural network training. Computer Science Review,3(3),127–149.
Luo,H.,Shen,R.,Niu,C.,and Ullrich,C.(2011). Learning class-relevant features and class-irrelevant features via a hybrid third-order RBM. In International Conference on Artificial Intelligence and Statistics,pages 470–478.
Luo,H.,Carrier,P. L.,Courville,A.,and Bengio,Y.(2013). Texture modeling with convolutional spike-and-slab RBMs and deep extensions. In AISTATS'2013.
Lyu,S.(2009). Interpretation and generalization of score matching. In Proceedings of the Twenty-fifth Conference on Uncertainty in Artificial Intelligence(UAI'09).
Ma,J.,Sheridan,R. P.,Liaw,A.,Dahl,G. E.,and Svetnik,V.(2015). Deep neural nets as a method for quantitative structure–activity relationships. Journal of Chemical Information and Modeling.
Maas,A. L.,Hannun,A. Y.,and Ng,A. Y.(2013). Rectifier nonlinearities improve neural network acoustic models. In ICML Workshop on Deep Learning for Audio,Speech,and Language Processing.
Maass,W.(1992). Bounds for the computational power and learning complexity of analog neural nets(extended abstract). In Proc. of the 25th ACM Symp. Theory of Computing,pages 335–344.
Maass,W.,Schnitger,G.,and Sontag,E. D.(1994). A comparison of the computational power of sigmoid and Boolean threshold circuits. Theoretical Advances in Neural Computation and Learning,pages 127–151.
Maass,W.,Natschlaeger,T.,and Markram,H.(2002). Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation,14(11),2531–2560.
MacKay,D.(2003). Information Theory,Inference and Learning Algorithms. Cambridge University Press.
Maclaurin,D.,Duvenaud,D.,and Adams,R. P.(2015). Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492.
Mao,J.,Xu,W.,Yang,Y.,Wang,J.,and Yuille,A.(2014). Deep captioning with multimodal recurrent neural networks(m-RNN). arXiv:1412.6632[cs.CV].
Marcotte,P. and Savard,G.(1992). Novel approaches to the discrimination problem. Zeitschrift für Operations Research(Theory),36,517–545.
Marlin,B. and de Freitas,N.(2011). Asymptotic efficiency of deterministic estimators for discrete energy-based models: Ratio matching and pseudolikelihood. In UAI'2011.
Marlin,B.,Swersky,K.,Chen,B.,and de Freitas,N.(2010). Inductive principles for restricted Boltzmann machine learning. In AISTATS'2010,pages 509–516.
Marquardt,D. W.(1963). An algorithm for least-squares estimation of non-linear parameters. Journal of the Society of Industrial and Applied Mathematics,11(2),431–441.
Marr,D. and Poggio,T.(1976). Cooperative computation of stereo disparity. Science,194.
Martens,J.(2010). Deep learning via Hessian-free optimization. In ICML'2010,pages 735–742.
Martens,J. and Medabalimi,V.(2014). On the expressive efficiency of sum product networks. arXiv:1411.7717.
Martens,J. and Sutskever,I.(2011). Learning recurrent neural networks with Hessian-free optimization. In Proc. ICML'2011. ACM.
Mase,S.(1995). Consistency of the maximum pseudo-likelihood estimator of continuous state space Gibbsian processes. The Annals of Applied Probability,5(3),603–612.
McClelland,J.,Rumelhart,D.,and Hinton,G.(1995). The appeal of parallel distributed processing. In Computation & intelligence,pages 305–341. American Association for Artificial Intelligence.
McCulloch,W. S. and Pitts,W.(1943). A logical calculus of ideas immanent in nervous activity. Bulletin of Mathematical Biophysics,5,115–133.
Mead,C. and Ismail,M.(2012). Analog VLSI implementation of neural systems,volume 80. Springer Science & Business Media.
Melchior,J.,Fischer,A.,and Wiskott,L.(2013). How to center binary deep Boltzmann machines. arXiv preprint arXiv:1311.1354.
Memisevic,R. and Hinton,G. E.(2007). Unsupervised learning of image transformations. In Proceedings of the Computer Vision and Pattern Recognition Conference(CVPR'07).
Memisevic,R. and Hinton,G. E.(2010). Learning to represent spatial transformations with factored higher-order Boltzmann machines. Neural Computation,22(6),1473–1492.
Mesnil,G.,Dauphin,Y.,Glorot,X.,Rifai,S.,Bengio,Y.,Goodfellow,I.,Lavoie,E.,Muller,X.,Desjardins,G.,Warde-Farley,D.,Vincent,P.,Courville,A.,and Bergstra,J.(2011). Unsupervised and transfer learning challenge: a deep learning approach. In JMLR W&CP: Proc. Unsupervised and Transfer Learning,volume 7.
Mesnil,G.,Rifai,S.,Dauphin,Y.,Bengio,Y.,and Vincent,P.(2012). Surfing on the manifold. Learning Workshop,Snowbird.
Miikkulainen,R. and Dyer,M. G.(1991). Natural language processing with modular PDP networks and distributed lexicon. Cognitive Science,15,343–399.
Mikolov,T.(2012). Statistical Language Models based on Neural Networks. Ph.D. thesis,Brno University of Technology.
Mikolov,T.,Deoras,A.,Kombrink,S.,Burget,L.,and Cernocky,J.(2011a). Empirical evaluation and combination of advanced language modeling techniques. In Proc. 12th annual conference of the international speech communication association(INTERSPEECH 2011).
Mikolov,T.,Deoras,A.,Povey,D.,Burget,L.,and Cernocky,J.(2011b). Strategies for training large scale neural network language models. In Proc. ASRU'2011.
Mikolov,T.,Chen,K.,Corrado,G.,and Dean,J.(2013a). Efficient estimation of word representations in vector space. In International Conference on Learning Representations: Workshops Track.
Mikolov,T.,Le,Q. V.,and Sutskever,I.(2013b). Exploiting similarities among languages for machine translation. Technical report,arXiv:1309.4168.
Minka,T.(2005). Divergence measures and message passing. Technical Report MSR-TR-2005-173,Microsoft Research,Cambridge,UK.
Minsky,M. L. and Papert,S. A.(1969). Perceptrons. MIT Press,Cambridge.
Mirza,M. and Osindero,S.(2014). Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.
Mishkin,D. and Matas,J.(2015). All you need is a good init. arXiv preprint arXiv:1511.06422.
Misra,J. and Saha,I.(2010). Artificial neural networks in hardware: A survey of two decades of progress. Neurocomputing,74(1),239–255.
Mitchell,T. M.(1997). Machine Learning. McGraw-Hill,New York.
Miyato,T.,Maeda,S.,Koyama,M.,Nakae,K.,and Ishii,S.(2015). Distributional smoothing with virtual adversarial training. In ICLR. Preprint: arXiv:1507.00677.
Mnih,A. and Gregor,K.(2014). Neural variational inference and learning in belief networks. In ICML'2014.
Mnih,A. and Hinton,G. E.(2007). Three new graphical models for statistical language modelling. In Z. Ghahramani,editor,Proceedings of the Twenty-fourth International Conference on Machine Learning(ICML'07),pages 641–648. ACM.
Mnih,A. and Hinton,G. E.(2009). A scalable hierarchical distributed language model. In D. Koller,D. Schuurmans,Y. Bengio,and L. Bottou,editors,Advances in Neural Information Processing Systems 21(NIPS'08),pages 1081–1088.
Mnih,A. and Kavukcuoglu,K.(2013). Learning word embeddings efficiently with noise-contrastive estimation. In C. Burges,L. Bottou,M. Welling,Z. Ghahramani,and K. Weinberger,editors,Advances in Neural Information Processing Systems 26,pages 2265–2273. Curran Associates,Inc.
Mnih,A. and Teh,Y. W.(2012). A fast and simple algorithm for training neural probabilistic language models. In ICML'2012,pages 1751–1758.
Mnih,V. and Hinton,G.(2010). Learning to detect roads in high-resolution aerial images. In Proceedings of the 11th European Conference on Computer Vision(ECCV).
Mnih,V.,Larochelle,H.,and Hinton,G.(2011). Conditional restricted Boltzmann machines for structured output prediction. In Proc. Conf. on Uncertainty in Artificial Intelligence(UAI).
Mnih,V.,Kavukcuoglu,K.,Silver,D.,Graves,A.,Antonoglou,I.,and Wierstra,D.(2013). Playing Atari with deep reinforcement learning. Technical report,arXiv:1312.5602.
Mnih,V.,Heess,N.,Graves,A.,and Kavukcuoglu,K.(2014). Recurrent models of visual attention. In Z. Ghahramani,M. Welling,C. Cortes,N. Lawrence,and K. Weinberger,editors,NIPS'2014,pages 2204–2212.
Mnih,V.,Kavukcuoglu,K.,Silver,D.,Rusu,A. A.,Veness,J.,Bellemare,M. G.,Graves,A.,Riedmiller,M.,Fidjeland,A. K.,Ostrovski,G.,Petersen,S.,Beattie,C.,Sadik,A.,Antonoglou,I.,King,H.,Kumaran,D.,Wierstra,D.,Legg,S.,and Hassabis,D.(2015). Human-level control through deep reinforcement learning. Nature,518,529–533.
Mobahi,H. and Fisher,III,J. W.(2015). A theoretical analysis of optimization by Gaussian continuation. In AAAI'2015.
Mobahi,H.,Collobert,R.,and Weston,J.(2009). Deep learning from temporal coherence in video. In L. Bottou and M. Littman,editors,Proceedings of the 26th International Conference on Machine Learning,pages 737–744,Montreal. Omnipress.
Mohamed,A.,Dahl,G.,and Hinton,G.(2009). Deep belief networks for phone recognition.
Mohamed,A.,Sainath,T. N.,Dahl,G.,Ramabhadran,B.,Hinton,G. E.,and Picheny,M. A.(2011). Deep belief networks using discriminative features for phone recognition. In Acoustics,Speech and Signal Processing(ICASSP),2011 IEEE International Conference on,pages 5060–5063. IEEE.
Mohamed,A.,Dahl,G.,and Hinton,G.(2012a). Acoustic modeling using deep belief networks. IEEE Trans. on Audio,Speech and Language Processing,20(1),14–22.
Mohamed,A.,Hinton,G.,and Penn,G.(2012b). Understanding how deep belief networks perform acoustic modelling. In Acoustics,Speech and Signal Processing(ICASSP),2012 IEEE International Conference on,pages 4273–4276. IEEE.
Møller,M.(1993). Efficient Training of Feed-Forward Neural Networks. Ph.D. thesis,Aarhus University,Aarhus,Denmark.
Montavon,G. and Müller,K.-R.(2012). Deep Boltzmann machines and the centering trick. In G. Montavon,G. Orr,and K.-R. Müller,editors,Neural Networks: Tricks of the Trade,volume 7700 of Lecture Notes in Computer Science,pages 621–637. Preprint: http://arxiv.org/abs/1203.3783.
Montúfar,G.(2014). Universal approximation depth and errors of narrow belief networks with discrete units. Neural Computation,26.
Montúfar,G. and Ay,N.(2011). Refinements of universal approximation results for deep belief networks and restricted Boltzmann machines. Neural Computation,23(5),1306–1319.
Montufar,G. F.,Pascanu,R.,Cho,K.,and Bengio,Y.(2014). On the number of linear regions of deep neural networks. In NIPS'2014.
Mor-Yosef,S.,Samueloff,A.,Modan,B.,Navot,D.,and Schenker,J. G.(1990). Ranking the risk factors for cesarean: logistic regression analysis of a nationwide study. Obstet Gynecol,75(6),944–7.
Morin,F. and Bengio,Y.(2005). Hierarchical probabilistic neural network language model. In AISTATS'2005.
Mozer,M. C.(1992). The induction of multiscale temporal structure. In J. Moody,S. Hanson,and R. Lippmann,editors,Advances in Neural Information Processing Systems 4(NIPS'91),pages 275–282,San Mateo,CA. Morgan Kaufmann.
Murphy,K. P.(2012). Machine Learning: a Probabilistic Perspective. MIT Press,Cambridge,MA,USA.
Uria,B.,Murray,I.,and Larochelle,H.(2014). A deep and tractable density estimator. In ICML'2014.
Nair,V. and Hinton,G.(2010a). Rectified linear units improve restricted Boltzmann machines. In ICML'2010.
Nair,V. and Hinton,G. E.(2009). 3d object recognition with deep belief nets. In Y. Bengio,D. Schuurmans,J. D. Lafferty,C. K. I. Williams,and A. Culotta,editors,Advances in Neural Information Processing Systems 22,pages 1339–1347. Curran Associates,Inc.
Nair,V. and Hinton,G. E.(2010b). Rectified linear units improve restricted Boltzmann machines. In L. Bottou and M. Littman,editors,Proceedings of the Twenty-seventh International Conference on Machine Learning(ICML-10),pages 807–814. ACM.
Narayanan,H. and Mitter,S.(2010). Sample complexity of testing the manifold hypothesis. In J. Lafferty,C. K. I. Williams,J. Shawe-Taylor,R. Zemel,and A. Culotta,editors,Advances in Neural Information Processing Systems 23,pages 1786–1794.
Naumann,U.(2008). Optimal Jacobian accumulation is NP-complete. Mathematical Programming,112(2),427–441.
Navigli,R. and Velardi,P.(2005). Structural semantic interconnections: a knowledge-based approach to word sense disambiguation. IEEE Trans. Pattern Analysis and Machine Intelligence,27(7),1075–1086.
Neal,R. and Hinton,G.(1999). A view of the EM algorithm that justifies incremental,sparse,and other variants. In M. I. Jordan,editor,Learning in Graphical Models. MIT Press,Cambridge,MA.
Neal,R. M.(1990). Learning stochastic feedforward networks. Technical report.
Neal,R. M.(1993). Probabilistic inference using Markov chain Monte-Carlo methods. Technical Report CRG-TR-93-1,Dept. of Computer Science,University of Toronto.
Neal,R. M.(1994). Sampling from multimodal distributions using tempered transitions. Technical Report 9421,Dept. of Statistics,University of Toronto.
Neal,R. M.(1996). Bayesian Learning for Neural Networks. Lecture Notes in Statistics. Springer.
Neal,R. M.(2001). Annealed importance sampling. Statistics and Computing,11(2),125–139.
Neal,R. M.(2005). Estimating ratios of normalizing constants using linked importance sampling.
Nesterov,Y.(1983). A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady,27,372–376.
Nesterov,Y.(2004). Introductory lectures on convex optimization: a basic course. Applied optimization. Kluwer Academic Publ.,Boston,Dordrecht,London.
Netzer,Y.,Wang,T.,Coates,A.,Bissacco,A.,Wu,B.,and Ng,A. Y.(2011). Reading digits in natural images with unsupervised feature learning. Deep Learning and Unsupervised Feature Learning Workshop,NIPS.
Ney,H. and Kneser,R.(1993). Improved clustering techniques for class-based statistical language modelling. In European Conference on Speech Communication and Technology(Eurospeech),pages 973–976,Berlin.
Ng,A.(2015). Advice for applying machine learning. https://see.stanford.edu/materials/aimlcs229/ML-advice.pdf.
Niesler,T. R.,Whittaker,E. W. D.,and Woodland,P. C.(1998). Comparison of part-of-speech and automatically derived category-based language models for speech recognition. In International Conference on Acoustics,Speech and Signal Processing(ICASSP),pages 177–180.
Ning,F.,Delhomme,D.,LeCun,Y.,Piano,F.,Bottou,L.,and Barbano,P. E.(2005). Toward automatic phenotyping of developing embryos from videos. Image Processing,IEEE Transactions on,14(9),1360–1371.
Nocedal,J. and Wright,S.(2006). Numerical Optimization. Springer.
Norouzi,M. and Fleet,D. J.(2011). Minimal loss hashing for compact binary codes. In ICML'2011.
Nowlan,S. J.(1990). Competing experts: An experimental investigation of associative mixture models. Technical Report CRG-TR-90-5,University of Toronto.
Nowlan,S. J. and Hinton,G. E.(1992). Adaptive soft weight tying using Gaussian mixtures. In J. Moody,S. Hanson,and R. Lippmann,editors,Advances in Neural Information Processing Systems 4(NIPS'91),pages 993–1000,San Mateo,CA. Morgan Kaufmann.
Olshausen,B. and Field,D. J.(2005). How close are we to understanding V1? Neural Computation,17,1665–1699.
Olshausen,B. A. and Field,D. J.(1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature,381,607–609.
Olshausen,B. A.,Anderson,C. H.,and Van Essen,D. C.(1993). A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. J. Neurosci.,13(11),4700–4719.
Opper,M. and Archambeau,C.(2009). The variational Gaussian approximation revisited. Neural computation,21(3),786–792.
Oquab,M.,Bottou,L.,Laptev,I.,and Sivic,J.(2014). Learning and transferring mid-level image representations using convolutional neural networks. In Computer Vision and Pattern Recognition(CVPR),2014 IEEE Conference on,pages 1717–1724. IEEE.
Osindero,S. and Hinton,G. E.(2008). Modeling image patches with a directed hierarchy of Markov random fields. In J. Platt,D. Koller,Y. Singer,and S. Roweis,editors,Advances in Neural Information Processing Systems 20(NIPS'07),pages 1121–1128,Cambridge,MA. MIT Press.
Ovid and Martin,C.(2004). Metamorphoses. W.W. Norton.
Paccanaro,A. and Hinton,G. E.(2000). Extracting distributed representations of concepts and relations from positive and negative propositions. In International Joint Conference on Neural Networks(IJCNN),Como,Italy. IEEE,New York.
Paine,T. L.,Khorrami,P.,Han,W.,and Huang,T. S.(2014). An analysis of unsupervised pre-training in light of recent advances. arXiv preprint arXiv:1412.6597.
Palatucci,M.,Pomerleau,D.,Hinton,G. E.,and Mitchell,T. M.(2009). Zero-shot learning with semantic output codes. In Y. Bengio,D. Schuurmans,J. D. Lafferty,C. K. I. Williams,and A. Culotta,editors,Advances in Neural Information Processing Systems 22,pages 1410–1418. Curran Associates,Inc.
Parker,D. B.(1985). Learning-logic. Technical Report TR-47,Center for Comp. Research in Economics and Management Sci.,MIT.
Pascanu,R.,Mikolov,T.,and Bengio,Y.(2013a). On the difficulty of training recurrent neural networks. In ICML'2013.
Pascanu,R.,Mikolov,T.,and Bengio,Y.(2013b). On the difficulty of training recurrent neural networks. In ICML'2013.
Pascanu,R.,Gulcehre,C.,Cho,K.,and Bengio,Y.(2014a). How to construct deep recurrent neural networks. In ICLR.
Pascanu,R.,Montufar,G.,and Bengio,Y.(2014b). On the number of inference regions of deep feed forward networks with piece-wise linear activations. In ICLR'2014.
Pati,Y.,Rezaiifar,R.,and Krishnaprasad,P.(1993). Orthogonal matching pursuit:Recursive function approximation with applications to wavelet decomposition. In Proceedings of the 27th Annual Asilomar Conference on Signals,Systems,and Computers,pages 40–44.
Pearl,J.(1985). Bayesian networks: A model of self-activated memory for evidential reasoning. In Proceedings of the 7th Conference of the Cognitive Science Society,University of California,Irvine,pages 329–334.
Pearl,J.(1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann.
Perron,O.(1907). Zur Theorie der Matrices. Mathematische Annalen,64(2),248–263.
Petersen,K. B. and Pedersen,M. S.(2006). The matrix cookbook. Version 20051003.
Peterson,G. B.(2004). A day of great illumination: B. F. Skinner's discovery of shaping. Journal of the Experimental Analysis of Behavior,82(3),317–328.
Pham,D.-T.,Garat,P.,and Jutten,C.(1992). Separation of a mixture of independent sources through a maximum likelihood approach. In EUSIPCO,pages 771–774.
Pham,P.-H.,Jelaca,D.,Farabet,C.,Martini,B.,LeCun,Y.,and Culurciello,E.(2012). NeuFlow: dataflow vision processing system-on-a-chip. In Circuits and Systems(MWSCAS),2012 IEEE 55th International Midwest Symposium on,pages 1044–1047. IEEE.
Pinheiro,P. H. O. and Collobert,R.(2014). Recurrent convolutional neural networks for scene labeling. In ICML'2014.
Pinheiro,P. H. O. and Collobert,R.(2015). From image-level to pixel-level labeling with convolutional networks. In Conference on Computer Vision and Pattern Recognition(CVPR).
Pinto,N.,Cox,D. D.,and DiCarlo,J. J.(2008). Why is real-world visual object recognition hard? PLoS Comput Biol,4.
Pinto,N.,Stone,Z.,Zickler,T.,and Cox,D.(2011). Scaling up biologically-inspired computer vision: A case study in unconstrained face recognition on facebook. In Computer Vision and Pattern Recognition Workshops(CVPRW),2011 IEEE Computer Society Conference on,pages 35–42. IEEE.
Pollack,J. B.(1990). Recursive distributed representations. Artificial Intelligence,46(1),77–105.
Polyak,B. and Juditsky,A.(1992). Acceleration of stochastic approximation by averaging. SIAM J. Control and Optimization,30(4),838–855.
Polyak,B. T.(1964). Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics,4(5),1–17.
Poole,B.,Sohl-Dickstein,J.,and Ganguli,S.(2014). Analyzing noise in autoencoders and deep networks. CoRR,abs/1406.1831.
Poon,H. and Domingos,P.(2011). Sum-product networks for deep learning. In Learning Workshop,Fort Lauderdale,FL.
Presley,R. K. and Haggard,R. L.(1994). A fixed point implementation of the backpropagation learning algorithm. In Southeastcon '94: Creative Technology Transfer-A Global Affair,Proceedings of the 1994 IEEE,pages 136–138. IEEE.
Price,R.(1958). A useful theorem for nonlinear devices having Gaussian inputs. IEEE Transactions on Information Theory,4(2),69–72.
Quiroga,R. Q.,Reddy,L.,Kreiman,G.,Koch,C.,and Fried,I.(2005). Invariant visual representation by single neurons in the human brain. Nature,435(7045),1102–1107.
Radford,A.,Metz,L.,and Chintala,S.(2015). Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
Raiko,T.,Yao,L.,Cho,K.,and Bengio,Y.(2014). Iterative neural autoregressive distribution estimator(NADE-k). Technical report,arXiv:1406.1485.
Raina,R.,Madhavan,A.,and Ng,A. Y.(2009a). Large-scale deep unsupervised learning using graphics processors. In L. Bottou and M. Littman,editors,Proceedings of the Twenty-sixth International Conference on Machine Learning(ICML'09),pages 873–880,New York,NY,USA. ACM.
Raina,R.,Madhavan,A.,and Ng,A. Y.(2009b). Large-scale deep unsupervised learning using graphics processors. In ICML'2009.
Ramsey,F. P.(1926). Truth and probability. In R. B. Braithwaite,editor,The Foundations of Mathematics and other Logical Essays,chapter 7,pages 156–198. McMaster University Archive for the History of Economic Thought.
Ranzato,M. and Hinton,G. E.(2010). Modeling pixel means and covariances using factorized third-order Boltzmann machines. In CVPR'2010,pages 2551–2558.
Ranzato,M.,Poultney,C.,Chopra,S.,and LeCun,Y.(2007a). Efficient learning of sparse representations with an energy-based model. In NIPS'2006.
Ranzato,M.,Poultney,C.,Chopra,S.,and LeCun,Y.(2007b). Efficient learning of sparse representations with an energy-based model. In B. Schölkopf,J. Platt,and T. Hoffman,editors,Advances in Neural Information Processing Systems 19(NIPS'06),pages 1137–1144. MIT Press.
Ranzato,M.,Huang,F.,Boureau,Y.,and LeCun,Y.(2007c). Unsupervised learning of invariant feature hierarchies with applications to object recognition. In CVPR'07.
Ranzato,M.,Boureau,Y.,and LeCun,Y.(2008). Sparse feature learning for deep belief networks. In NIPS'2007.
Ranzato,M.,Krizhevsky,A.,and Hinton,G. E.(2010a). Factored 3-way restricted Boltzmann machines for modeling natural images. In Proceedings of AISTATS 2010.
Ranzato,M.,Mnih,V.,and Hinton,G.(2010b). Generating more realistic images using gated MRFs. In NIPS'2010.
Rao,C.(1945). Information and the accuracy attainable in the estimation of statistical param-eters. Bulletin of the Calcutta Mathematical Society,37,81–89.
Rasmus,A.,Valpola,H.,Honkala,M.,Berglund,M.,and Raiko,T.(2015). Semi-supervised learning with ladder network. arXiv preprint arXiv:1507.02672.
Recht,B.,Re,C.,Wright,S.,and Niu,F.(2011). Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In NIPS'2011.
Reichert,D. P.,Seriès,P.,and Storkey,A. J.(2011). Neuronal adaptation for sampling-based probabilistic inference in perceptual bistability. In Advances in Neural Information Processing Systems,pages 2357–2365.
Rezende,D. J.,Mohamed,S.,and Wierstra,D.(2014). Stochastic backpropagation and approximate inference in deep generative models. In ICML'2014. Preprint:arXiv:1401.4082.
Rifai,S.,Vincent,P.,Muller,X.,Glorot,X.,and Bengio,Y.(2011a). Contractive auto-encoders: Explicit invariance during feature extraction. In ICML'2011.
Rifai,S.,Mesnil,G.,Vincent,P.,Muller,X.,Bengio,Y.,Dauphin,Y.,and Glorot,X.(2011b). Higher order contractive auto-encoder. In ECML PKDD.
Rifai,S.,Dauphin,Y.,Vincent,P.,Bengio,Y.,and Muller,X.(2011c). The manifold tangent classifier. In NIPS'2011.
Rifai,S.,Dauphin,Y.,Vincent,P.,Bengio,Y.,and Muller,X.(2011d). The manifold tangent classifier. In NIPS'2011. Student paper award.
Rifai,S.,Bengio,Y.,Dauphin,Y.,and Vincent,P.(2012). A generative process for sampling contractive auto-encoders. In ICML'2012.
Ringach,D. and Shapley,R.(2004). Reverse correlation in neurophysiology. Cognitive Science,28(2),147–166.
Roberts,S. and Everson,R.(2001). Independent component analysis: principles and practice. Cambridge University Press.
Robinson,A. J. and Fallside,F.(1991). A recurrent error propagation network speech recognition system. Computer Speech and Language,5(3),259–274.
Rockafellar,R. T.(1997). Convex Analysis. Princeton Landmarks in Mathematics. Princeton University Press.
Romero,A.,Ballas,N.,Ebrahimi Kahou,S.,Chassang,A.,Gatta,C.,and Bengio,Y.(2015). FitNets:Hints for thin deep nets. In ICLR'2015,arXiv:1412.6550.
Rosen,J. B.(1960). The gradient projection method for nonlinear programming. Part I: Linear constraints. Journal of the Society for Industrial and Applied Mathematics,8(1),181–217.
Rosenblatt,F.(1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review,65,386–408.
Rosenblatt,F.(1962). Principles of Neurodynamics. Spartan,New York.
Rosenblatt,M.(1956). Remarks on some nonparametric estimates of a density function. The Annals of Mathematical Statistics,27(3),832–837.
Roweis,S. and Saul,L. K.(2000). Nonlinear dimensionality reduction by locally linear embedding. Science,290(5500).
Roweis,S.,Saul,L.,and Hinton,G.(2002). Global coordination of local linear models. In T. Dietterich,S. Becker,and Z. Ghahramani,editors,Advances in Neural Information Processing Systems 14(NIPS'01),Cambridge,MA. MIT Press.
Rubin,D. B.(1984). Bayesianly justifiable and relevant frequency calculations for the applied statistician. The Annals of Statistics,12(4),1151–1172.
Rumelhart,D.,Hinton,G.,and Williams,R.(1986a). Learning representations by back-propagating errors. Nature,323,533–536.
Rumelhart,D. E.,Hinton,G. E.,and Williams,R. J.(1986b). Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland,editors,Parallel Distributed Processing,volume 1,chapter 8,pages 318–362. MIT Press,Cambridge.
Rumelhart,D. E.,Hinton,G. E.,and Williams,R. J.(1986c). Learning representations by back-propagating errors. Nature,323,533–536.
Rumelhart,D. E.,McClelland,J. L.,and the PDP Research Group(1986d). Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press,Cambridge.
Russakovsky,O.,Deng,J.,Su,H.,Krause,J.,Satheesh,S.,Ma,S.,Huang,Z.,Karpathy,A.,Khosla,A.,Bernstein,M.,Berg,A. C.,and Fei-Fei,L.(2014a). ImageNet Large Scale Visual Recognition Challenge.
Russakovsky,O.,Deng,J.,Su,H.,Krause,J.,Satheesh,S.,Ma,S.,Huang,Z.,Karpathy,A.,Khosla,A.,Bernstein,M.,et al.(2014b). Imagenet large scale visual recognition challenge. arXiv preprint arXiv:1409.0575.
Russell,S. J. and Norvig,P.(2003). Artificial Intelligence: A Modern Approach. Prentice Hall.
Rust,N.,Schwartz,O.,Movshon,J. A.,and Simoncelli,E.(2005). Spatiotemporal elements of macaque V1 receptive fields. Neuron,46(6),945–956.
Sainath,T.,Mohamed,A.,Kingsbury,B.,and Ramabhadran,B.(2013). Deep convolutional neural networks for LVCSR. In ICASSP 2013.
Salakhutdinov,R.(2010). Learning in Markov random fields using tempered transitions. In Y. Bengio,D. Schuurmans,C. Williams,J. Lafferty,and A. Culotta,editors,Advances in Neural Information Processing Systems 22(NIPS'09).
Salakhutdinov,R. and Hinton,G.(2009a). Deep Boltzmann machines. In Proceedings of the International Conference on Artificial Intelligence and Statistics,volume 5,pages 448–455.
Salakhutdinov,R. and Hinton,G.(2009b). Semantic hashing. International Journal of Approximate Reasoning.
Salakhutdinov,R. and Hinton,G. E.(2007a). Learning a nonlinear embedding by preserving class neighbourhood structure. In Proceedings of AISTATS-2007.
Salakhutdinov,R. and Hinton,G. E.(2007b). Semantic hashing. In SIGIR'2007.
Salakhutdinov,R. and Hinton,G. E.(2008). Using deep belief nets to learn covariance kernels for Gaussian processes. In J. Platt,D. Koller,Y. Singer,and S. Roweis,editors,Advances in Neural Information Processing Systems 20(NIPS'07),pages 1249–1256,Cambridge,MA. MIT Press.
Salakhutdinov,R. and Larochelle,H.(2010). Efficient learning of deep Boltzmann machines. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics(AISTATS 2010),JMLR W&CP,volume 9,pages 693–700.
Salakhutdinov,R. and Mnih,A.(2008). Probabilistic matrix factorization. In NIPS'2008.
Salakhutdinov,R. and Murray,I.(2008). On the quantitative analysis of deep belief networks. In W. W. Cohen,A. McCallum,and S. T. Roweis,editors,Proceedings of the Twenty-fifth International Conference on Machine Learning(ICML'08),volume 25,pages 872–879. ACM.
Salakhutdinov,R.,Mnih,A.,and Hinton,G.(2007). Restricted Boltzmann machines for collaborative filtering. In ICML.
Sanger,T. D.(1994). Neural network learning control of robot manipulators using gradually increasing task difficulty. IEEE Transactions on Robotics and Automation,10(3).
Saul,L. K. and Jordan,M. I.(1996). Exploiting tractable substructures in intractable networks. In D. Touretzky,M. Mozer,and M. Hasselmo,editors,Advances in Neural Information Processing Systems 8(NIPS'95). MIT Press,Cambridge,MA.
Saul,L. K.,Jaakkola,T.,and Jordan,M. I.(1996). Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research,4,61–76.
Savich,A. W.,Moussa,M.,and Areibi,S.(2007). The impact of arithmetic representation on implementing MLP-BP on FPGAs: A study. Neural Networks,IEEE Transactions on,18(1),240–252.
Saxe,A. M.,Koh,P. W.,Chen,Z.,Bhand,M.,Suresh,B.,and Ng,A.(2011). On random weights and unsupervised feature learning. In Proc. ICML'2011. ACM.
Saxe,A. M.,McClelland,J. L.,and Ganguli,S.(2013). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In ICLR.
Schaul,T.,Antonoglou,I.,and Silver,D.(2014). Unit tests for stochastic optimization. In International Conference on Learning Representations.
Schmidhuber,J.(1992). Learning complex,extended sequences using the principle of history compression. Neural Computation,4(2),234–242.
Schmidhuber,J.(1996). Sequential neural text compression. IEEE Transactions on Neural Networks,7(1),142–146.
Schmidhuber,J.(2012). Self-delimiting neural networks. arXiv preprint arXiv:1210.0118.
Schölkopf,B. and Smola,A. J.(2002). Learning with kernels: Support vector machines,regularization,optimization,and beyond. MIT Press.
Schölkopf,B.,Burges,C. J. C.,and Smola,A. J.(1998a). Advances in kernel methods: support vector learning. MIT Press,Cambridge,MA.
Schölkopf,B.,Smola,A.,and Müller,K.-R.(1998b). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation,10,1299–1319.
Schölkopf,B.,Burges,C. J. C.,and Smola,A. J.(1999). Advances in Kernel Methods—Support Vector Learning. MIT Press,Cambridge,MA.
Schölkopf,B.,Janzing,D.,Peters,J.,Sgouritsa,E.,Zhang,K.,and Mooij,J.(2012). On causal and anticausal learning. In ICML'2012,pages 1255–1262.
Schuster,M.(1999). On supervised learning from sequential data with applications for speech recognition.
Schuster,M. and Paliwal,K.(1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing,45(11),2673–2681.
Schwenk,H.(2007). Continuous space language models. Computer speech and language,21,492–518.
Schwenk,H.(2010). Continuous space language models for statistical machine translation. The Prague Bulletin of Mathematical Linguistics,93,137–146.
Schwenk,H.(2014). Cleaned subset of WMT '14 dataset.
Schwenk,H. and Bengio,Y.(1998). Training methods for adaptive boosting of neural networks. In M. Jordan,M. Kearns,and S. Solla,editors,Advances in Neural Information Processing Systems 10(NIPS'97),pages 647–653. MIT Press.
Schwenk,H. and Gauvain,J.-L.(2002). Connectionist language modeling for large vocabulary continuous speech recognition. In International Conference on Acoustics,Speech and Signal Processing(ICASSP),pages 765–768,Orlando,Florida.
Schwenk,H.,Costa-jussà,M. R.,and Fonollosa,J. A. R.(2006). Continuous space language models for the IWSLT 2006 task. In International Workshop on Spoken Language Translation,pages 166–173.
Seide,F.,Li,G.,and Yu,D.(2011). Conversational speech transcription using context-dependent deep neural networks. In Interspeech 2011,pages 437–440.
Sejnowski,T.(1987). Higher-order Boltzmann machines. In AIP Conference Proceedings 151 on Neural Networks for Computing,pages 398–403. American Institute of Physics Inc.
Series,P.,Reichert,D. P.,and Storkey,A. J.(2010). Hallucinations in Charles Bonnet syndrome induced by homeostasis: a deep Boltzmann machine model. In Advances in Neural Information Processing Systems,pages 2020–2028.
Sermanet,P.,Chintala,S.,and LeCun,Y.(2012). Convolutional neural networks applied to house numbers digit classification. In International Conference on Pattern Recognition(ICPR 2012).
Sermanet,P.,Kavukcuoglu,K.,Chintala,S.,and LeCun,Y.(2013). Pedestrian detection with unsupervised multi-stage feature learning. In Proc. International Conference on Computer Vision and Pattern Recognition(CVPR'13). IEEE.
Shilov,G.(1977). Linear Algebra. Dover Books on Mathematics Series. Dover Publications.
Siegelmann,H.(1995). Computation beyond the Turing limit. Science,268(5210),545–548.
Siegelmann,H. and Sontag,E.(1991). Turing computability with neural nets. Applied Mathematics Letters,4(6),77–80.
Siegelmann,H. T. and Sontag,E. D.(1995). On the computational power of neural nets. Journal of Computer and Systems Sciences,50(1),132–150.
Sietsma,J. and Dow,R.(1991). Creating artificial neural networks that generalize. Neural Networks,4(1),67–79.
Simard,P. Y.,Steinkraus,D.,and Platt,J. C.(2003). Best practices for convolutional neural networks. In ICDAR'2003.
Simard,P. and Graf,H. P.(1994). Backpropagation without multiplication. In Advances in Neural Information Processing Systems,pages 232–239.
Simard,P.,Victorri,B.,LeCun,Y.,and Denker,J.(1992). Tangent prop - A formalism for specifying selected invariances in an adaptive network. In NIPS'1991.
Simard,P. Y.,LeCun,Y.,and Denker,J.(1993). Efficient pattern recognition using a new transformation distance. In NIPS'92.
Simard,P. Y.,LeCun,Y. A.,Denker,J. S.,and Victorri,B.(1998). Transformation invariance in pattern recognition—tangent distance and tangent propagation. Lecture Notes in Computer Science,1524.
Simons,D. J. and Levin,D. T.(1998). Failure to detect changes to people during a real-world interaction. Psychonomic Bulletin & Review,5(4),644–649.
Simonyan,K. and Zisserman,A.(2015). Very deep convolutional networks for large-scale image recognition. In ICLR.
Sjöberg,J. and Ljung,L.(1995). Overtraining,regularization and searching for a minimum,with application to neural networks. International Journal of Control,62(6),1391–1407.
Skinner,B. F.(1958). Reinforcement today. American Psychologist,13,94–99.
Smolensky,P.(1986). Information processing in dynamical systems: Foundations of harmony theory. In D. E. Rumelhart and J. L. McClelland,editors,Parallel Distributed Processing,volume 1,chapter 6,pages 194–281. MIT Press,Cambridge.
Snoek,J.,Larochelle,H.,and Adams,R. P.(2012). Practical Bayesian optimization of machine learning algorithms. In NIPS'2012.
Socher,R.,Huang,E. H.,Pennington,J.,Ng,A. Y.,and Manning,C. D.(2011a). Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In NIPS'2011.
Socher,R.,Manning,C.,and Ng,A. Y.(2011b). Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the Twenty-Eighth International Conference on Machine Learning(ICML'2011).
Socher,R.,Pennington,J.,Huang,E. H.,Ng,A. Y.,and Manning,C. D.(2011c). Semi-supervised recursive autoencoders for predicting sentiment distributions. In EMNLP'2011.
Socher,R.,Perelygin,A.,Wu,J. Y.,Chuang,J.,Manning,C. D.,Ng,A. Y.,and Potts,C.(2013a). Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP'2013.
Socher,R.,Ganjoo,M.,Manning,C. D.,and Ng,A. Y.(2013b). Zero-shot learning through cross-modal transfer. In 27th Annual Conference on Neural Information Processing Systems(NIPS 2013).
Sohl-Dickstein,J.,Weiss,E. A.,Maheswaranathan,N.,and Ganguli,S.(2015). Deep unsupervised learning using nonequilibrium thermodynamics.
Sohn,K.,Zhou,G.,and Lee,H.(2013). Learning and selecting features jointly with point-wise gated Boltzmann machines. In ICML'2013.
Solomonoff,R. J.(1989). A system for incremental learning based on algorithmic probability.
Sontag,E. D.(1998). VC dimension of neural networks. NATO ASI Series F Computer and Systems Sciences,168,69–96.
Sontag,E. D. and Sussman,H. J.(1989). Backpropagation can give rise to spurious local minima even for networks without hidden layers. Complex Systems,3,91–106.
Sparkes,B.(1996). The Red and the Black: Studies in Greek Pottery. Routledge.
Spitkovsky,V. I.,Alshawi,H.,and Jurafsky,D.(2010). From baby steps to leapfrog: how “less is more” in unsupervised dependency parsing. In HLT'10.
Squire,W. and Trapp,G.(1998). Using complex variables to estimate derivatives of real functions. SIAM Rev.,40(1),110–112.
Srebro,N. and Shraibman,A.(2005). Rank,trace-norm and max-norm. In Proceedings of the 18th Annual Conference on Learning Theory,pages 545–560. Springer-Verlag.
Srivastava,N.(2013). Improving Neural Networks With Dropout. Master's thesis,U. Toronto.
Srivastava,N. and Salakhutdinov,R.(2012). Multimodal learning with deep Boltzmann machines. In NIPS'2012.
Srivastava,N.,Salakhutdinov,R. R.,and Hinton,G. E.(2013). Modeling documents with deep Boltzmann machines. arXiv preprint arXiv:1309.6865.
Srivastava,N.,Hinton,G.,Krizhevsky,A.,Sutskever,I.,and Salakhutdinov,R.(2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research,15,1929–1958.
Srivastava,R. K.,Greff,K.,and Schmidhuber,J.(2015). Highway networks. arXiv:1505.00387.
Steinkrau,D.,Simard,P. Y.,and Buck,I.(2005). Using GPUs for machine learning algorithms. In Eighth International Conference on Document Analysis and Recognition(ICDAR'05),pages 1115–1119.
Stoyanov,V.,Ropson,A.,and Eisner,J.(2011). Empirical risk minimization of graphical model parameters given approximate inference,decoding,and model structure. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics(AISTATS),volume 15 of JMLR Workshop and Conference Proceedings,pages 725–733,Fort Lauderdale. Supplementary material (4 pages) also available.
Sukhbaatar,S.,Szlam,A.,Weston,J.,and Fergus,R.(2015). Weakly supervised memory networks. arXiv preprint arXiv:1503.08895.
Supancic,J. and Ramanan,D.(2013). Self-paced learning for long-term tracking. In CVPR'2013.
Sussillo,D.(2014). Random walks: Training very deep nonlinear feed-forward networks with smart initialization. CoRR,abs/1412.6558.
Sutskever,I.(2012). Training Recurrent Neural Networks. Ph.D. thesis,Department of computer science,University of Toronto.
Sutskever,I. and Hinton,G. E.(2008). Deep narrow sigmoid belief networks are universal approximators. Neural Computation,20(11),2629–2636.
Sutskever,I. and Tieleman,T.(2010). On the Convergence Properties of Contrastive Divergence. In AISTATS'2010.
Sutskever,I.,Hinton,G.,and Taylor,G.(2009). The recurrent temporal restricted Boltzmann machine. In NIPS'2008.
Sutskever,I.,Martens,J.,and Hinton,G. E.(2011). Generating text with recurrent neural networks. In ICML'2011,pages 1017–1024.
Sutskever,I.,Martens,J.,Dahl,G.,and Hinton,G.(2013). On the importance of initialization and momentum in deep learning. In ICML.
Sutskever,I.,Vinyals,O.,and Le,Q. V.(2014). Sequence to sequence learning with neural networks. In NIPS'2014,arXiv:1409.3215.
Sutton,R. and Barto,A.(1998). Reinforcement Learning: An Introduction. MIT Press.
Sutton,R. S.,McAllester,D.,Singh,S.,and Mansour,Y.(2000). Policy gradient methods for reinforcement learning with function approximation. In NIPS'1999,pages 1057–1063. MIT Press.
Swersky,K.,Ranzato,M.,Buchman,D.,Marlin,B.,and de Freitas,N.(2011). On autoencoders and score matching for energy based models. In ICML'2011. ACM.
Swersky,K.,Snoek,J.,and Adams,R. P.(2014). Freeze-thaw Bayesian optimization. arXiv preprint arXiv:1406.3896.
Szegedy,C.,Liu,W.,Jia,Y.,Sermanet,P.,Reed,S.,Anguelov,D.,Erhan,D.,Vanhoucke,V.,and Rabinovich,A.(2014a). Going deeper with convolutions. Technical report,arXiv:1409.4842.
Szegedy,C.,Zaremba,W.,Sutskever,I.,Bruna,J.,Erhan,D.,Goodfellow,I. J.,and Fergus,R.(2014b). Intriguing properties of neural networks. ICLR,abs/1312.6199.
Szegedy,C.,Vanhoucke,V.,Ioffe,S.,Shlens,J.,and Wojna,Z.(2015). Rethinking the Inception Architecture for Computer Vision. ArXiv e-prints.
Taigman,Y.,Yang,M.,Ranzato,M.,and Wolf,L.(2014). DeepFace: Closing the gap to human-level performance in face verification. In CVPR'2014.
Tandy,D. W.(1997). Works and Days: A Translation and Commentary for the Social Sciences. University of California Press.
Tang,Y. and Eliasmith,C.(2010). Deep networks for robust visual recognition. In Proceedings of the 27th International Conference on Machine Learning,June 21-24,2010,Haifa,Israel.
Tang,Y.,Salakhutdinov,R.,and Hinton,G.(2012). Deep mixtures of factor analysers. arXiv preprint arXiv:1206.4635.
Taylor,G. and Hinton,G.(2009). Factored conditional restricted Boltzmann machines for modeling motion style. In L. Bottou and M. Littman,editors,Proceedings of the Twenty-sixth International Conference on Machine Learning(ICML'09),pages 1025–1032,Montreal,Quebec,Canada. ACM.
Taylor,G.,Hinton,G. E.,and Roweis,S.(2007). Modeling human motion using binary latent variables. In B. Schölkopf,J. Platt,and T. Hoffman,editors,Advances in Neural Information Processing Systems 19(NIPS'06),pages 1345–1352. MIT Press,Cambridge,MA.
Teh,Y.,Welling,M.,Osindero,S.,and Hinton,G. E.(2003). Energy-based models for sparse overcomplete representations. Journal of Machine Learning Research,4,1235–1260.
Tenenbaum,J.,de Silva,V.,and Langford,J. C.(2000). A global geometric framework for nonlinear dimensionality reduction. Science,290(5500),2319–2323.
Theis,L.,van den Oord,A.,and Bethge,M.(2015). A note on the evaluation of generative models. arXiv:1511.01844.
Tompson,J.,Jain,A.,LeCun,Y.,and Bregler,C.(2014). Joint training of a convolutional network and a graphical model for human pose estimation. In NIPS'2014.
Thrun,S.(1995). Learning to play the game of chess. In NIPS'1994.
Tibshirani,R. J.(1995). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society B,58,267–288.
Tieleman,T.(2008). Training restricted Boltzmann machines using approximations to the likelihood gradient. In ICML'2008,pages 1064–1071.
Tieleman,T. and Hinton,G.(2009). Using fast weights to improve persistent contrastive divergence. In ICML'2009.
Tipping,M. E. and Bishop,C. M.(1999). Probabilistic principal components analysis. Journal of the Royal Statistical Society B,61(3),611–622.
Torralba,A.,Fergus,R.,and Weiss,Y.(2008). Small codes and large databases for recognition. In Proceedings of the Computer Vision and Pattern Recognition Conference(CVPR'08),pages 1–8.
Touretzky,D. S. and Hinton,G. E.(1985). Symbols among the neurons: Details of a connectionist inference architecture. In Proceedings of the 9th International Joint Conference on Artificial Intelligence-Volume 1,IJCAI'85,pages 238–243,San Francisco,CA,USA. Morgan Kaufmann Publishers Inc.
Tu,K. and Honavar,V.(2011). On the utility of curricula in unsupervised learning of probabilistic grammars. In IJCAI'2011.
Turaga,S. C.,Murray,J. F.,Jain,V.,Roth,F.,Helmstaedter,M.,Briggman,K.,Denk,W.,and Seung,H. S.(2010). Convolutional networks can learn to generate affinity graphs for image segmentation. Neural Computation,22,511–538.
Turian,J.,Ratinov,L.,and Bengio,Y.(2010). Word representations: A simple and general method for semi-supervised learning. In Proc. ACL'2010,pages 384–394.
Töscher,A.,Jahrer,M.,and Bell,R. M.(2009). The BigChaos solution to the Netflix grand prize.
Uria,B.,Murray,I.,and Larochelle,H.(2013). RNADE: The real-valued neural autoregressive density-estimator. In NIPS'2013.
van den Oord,A.,Dieleman,S.,and Schrauwen,B.(2013). Deep content-based music recommendation. In NIPS'2013.
van der Maaten,L. and Hinton,G. E.(2008). Visualizing data using t-SNE. J. Machine Learning Res.,9.
Vanhoucke,V.,Senior,A.,and Mao,M. Z.(2011). Improving the speed of neural networks on CPUs. In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop.
Vapnik,V. N.(1982). Estimation of Dependences Based on Empirical Data. Springer-Verlag,Berlin.
Vapnik,V. N.(1995). The Nature of Statistical Learning Theory. Springer,New York.
Vapnik,V. N. and Chervonenkis,A. Y.(1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications,16,264–280.
Vincent,P.(2011). A connection between score matching and denoising autoencoders. Neural Computation,23(7).
Vincent,P. and Bengio,Y.(2003). Manifold Parzen windows. In NIPS'2002. MIT Press.
Vincent,P.,Larochelle,H.,Bengio,Y.,and Manzagol,P.-A.(2008). Extracting and composing robust features with denoising autoencoders. In ICML'2008,pages 1096–1103.
Vincent,P.,Larochelle,H.,Lajoie,I.,Bengio,Y.,and Manzagol,P.-A.(2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Machine Learning Res.,11.
Vincent,P.,de Brébisson,A.,and Bouthillier,X.(2015). Efficient exact gradient update for training deep networks with very large sparse targets. In C. Cortes,N. D. Lawrence,D. D. Lee,M. Sugiyama,and R. Garnett,editors,Advances in Neural Information Processing Systems 28,pages 1108–1116. Curran Associates,Inc.
Vinyals,O.,Kaiser,L.,Koo,T.,Petrov,S.,Sutskever,I.,and Hinton,G.(2014a). Grammar as a foreign language. arXiv preprint arXiv:1412.7449.
Vinyals,O.,Toshev,A.,Bengio,S.,and Erhan,D.(2014b). Show and tell: a neural image caption generator. arXiv:1411.4555.
Vinyals,O.,Fortunato,M.,and Jaitly,N.(2015a). Pointer networks. arXiv preprint arXiv:1506.03134.
Vinyals,O.,Toshev,A.,Bengio,S.,and Erhan,D.(2015b). Show and tell: a neural image caption generator. In CVPR'2015. arXiv:1411.4555.
Viola,P. and Jones,M.(2001). Robust real-time object detection. International Journal of Computer Vision.
Visin,F.,Kastner,K.,Cho,K.,Matteucci,M.,Courville,A.,and Bengio,Y.(2015). ReNet: A recurrent neural network based alternative to convolutional networks. arXiv preprint arXiv:1505.00393.
Von Melchner,L.,Pallas,S. L.,and Sur,M.(2000). Visual behaviour mediated by retinal projections directed to the auditory pathway. Nature,404(6780),871–876.
Wager,S.,Wang,S.,and Liang,P.(2013). Dropout training as adaptive regularization. In Advances in Neural Information Processing Systems 26,pages 351–359.
Waibel,A.,Hanazawa,T.,Hinton,G. E.,Shikano,K.,and Lang,K.(1989). Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics,Speech,and Signal Processing,37,328–339.
Wan,L.,Zeiler,M.,Zhang,S.,LeCun,Y.,and Fergus,R.(2013). Regularization of neural networks using dropconnect. In ICML'2013.
Wang,S. and Manning,C.(2013). Fast dropout training. In ICML'2013.
Wang,Z.,Zhang,J.,Feng,J.,and Chen,Z.(2014a). Knowledge graph and text jointly embedding. In Proc. EMNLP'2014.
Wang,Z.,Zhang,J.,Feng,J.,and Chen,Z.(2014b). Knowledge graph embedding by translating on hyperplanes. In Proc. AAAI'2014.
Warde-Farley,D.,Goodfellow,I. J.,Courville,A.,and Bengio,Y.(2014). An empirical analysis of dropout in piecewise linear networks. In ICLR'2014.
Wawrzynek,J.,Asanovic,K.,Kingsbury,B.,Johnson,D.,Beck,J.,and Morgan,N.(1996). Spert-II: A vector microprocessor system. Computer,29(3),79–86.
Weaver,L. and Tao,N.(2001). The optimal reward baseline for gradient-based reinforcement learning. In Proc. UAI'2001,pages 538–545.
Weinberger,K. Q. and Saul,L. K.(2004a). Unsupervised learning of image manifolds by semidefinite programming. In Proceedings of the Computer Vision and Pattern Recognition Conference(CVPR'04),volume 2,pages 988–995,Washington D.C.
Weinberger,K. Q. and Saul,L. K.(2004b). Unsupervised learning of image manifolds by semidefinite programming. In CVPR'2004,pages 988–995.
Weiss,Y.,Torralba,A.,and Fergus,R.(2008). Spectral hashing. In NIPS,pages 1753–1760.
Welling,M.,Zemel,R. S.,and Hinton,G. E.(2002). Self supervised boosting. In Advances in Neural Information Processing Systems,pages 665–672.
Welling,M.,Hinton,G. E.,and Osindero,S.(2003a). Learning sparse topographic representations with products of Student-t distributions. In NIPS'2002.
Welling,M.,Zemel,R.,and Hinton,G. E.(2003b). Self-supervised boosting. In S. Becker,S. Thrun,and K. Obermayer,editors,Advances in Neural Information Processing Systems 15(NIPS'02),pages 665–672. MIT Press.
Welling,M.,Rosen-Zvi,M.,and Hinton,G. E.(2005). Exponential family harmoniums with an application to information retrieval. In L. Saul,Y. Weiss,and L. Bottou,editors,Advances in Neural Information Processing Systems 17(NIPS'04),volume 17,Cambridge,MA. MIT Press.
Werbos,P. J.(1981). Applications of advances in nonlinear sensitivity analysis. In Proceedings of the 10th IFIP Conference,31.8-4.9,NYC,pages 762–770.
Weston,J.,Bengio,S.,and Usunier,N.(2010). Large scale image annotation: learning to rank with joint word-image embeddings. Machine Learning,81(1),21–35.
Weston,J.,Chopra,S.,and Bordes,A.(2014). Memory networks. arXiv preprint arXiv:1410.3916.
Widrow,B. and Hoff,M. E.(1960). Adaptive switching circuits. In 1960 IRE WESCON Convention Record,volume 4,pages 96–104. IRE,New York.
Wikipedia(2015). List of animals by number of neurons—Wikipedia,the free encyclopedia. [Online;accessed 4-March-2015].
Williams,C. K. I. and Agakov,F. V.(2002). Products of Gaussians and Probabilistic Minor Component Analysis. Neural Computation,14(5),1169–1182.
Williams,C. K. I. and Rasmussen,C. E.(1996). Gaussian processes for regression. In D. Touretzky,M. Mozer,and M. Hasselmo,editors,Advances in Neural Information Processing Systems 8(NIPS'95),pages 514–520. MIT Press,Cambridge,MA.
Williams,R. J.(1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning,8,229–256.
Williams,R. J. and Zipser,D.(1989). A learning algorithm for continually running fully recurrent neural networks. Neural Computation,1,270–280.
Wilson,D. R. and Martinez,T. R.(2003). The general inefficiency of batch training for gradient descent learning. Neural Networks,16(10),1429–1451.
Wilson,J. R.(1984). Variance reduction techniques for digital simulation. American Journal of Mathematical and Management Sciences,4(3),277–312.
Wiskott,L. and Sejnowski,T. J.(2002). Slow feature analysis: Unsupervised learning of invariances. Neural Computation,14(4),715–770.
Wolpert,D. and Macready,W.(1997). No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation,1,67–82.
Wolpert,D. H.(1996). The lack of a priori distinction between learning algorithms. Neural Computation,8(7),1341–1390.
Wu,R.,Yan,S.,Shan,Y.,Dang,Q.,and Sun,G.(2015). Deep image: Scaling up image recognition. arXiv:1501.02876.
Wu,Z.(1997). Global continuation for distance geometry problems. SIAM Journal of Optimization,7,814–836.
Xiong,H. Y.,Barash,Y.,and Frey,B. J.(2011). Bayesian prediction of tissue-regulated splicing using RNA sequence and cellular context. Bioinformatics,27(18),2554–2562.
Xu,K.,Ba,J. L.,Kiros,R.,Cho,K.,Courville,A.,Salakhutdinov,R.,Zemel,R. S.,and Bengio,Y.(2015). Show,attend and tell: Neural image caption generation with visual attention. In ICML'2015,arXiv:1502.03044.
Yildiz,I. B.,Jaeger,H.,and Kiebel,S. J.(2012). Re-visiting the echo state property. Neural Networks,35,1–9.
Yosinski,J.,Clune,J.,Bengio,Y.,and Lipson,H.(2014). How transferable are features in deep neural networks? In NIPS 27,pages 3320–3328. Curran Associates,Inc.
Younes,L.(1998). On the convergence of Markovian stochastic algorithms with rapidly decreasing ergodicity rates. In Stochastics and Stochastics Models,pages 177–228.
Yu,D.,Wang,S.,and Deng,L.(2010). Sequential labeling using deep-structured conditional random fields. IEEE Journal of Selected Topics in Signal Processing.
Zaremba,W. and Sutskever,I.(2014). Learning to execute. arXiv:1410.4615.
Zaremba,W. and Sutskever,I.(2015). Reinforcement learning neural Turing machines. arXiv:1505.00521.
Zaslavsky,T.(1975). Facing Up to Arrangements: Face-Count Formulas for Partitions of Space by Hyperplanes. Number 154 in Memoirs of the American Mathematical Society. American Mathematical Society.
Zeiler,M. D. and Fergus,R.(2014). Visualizing and understanding convolutional networks. In ECCV'14.
Zeiler,M. D.,Ranzato,M.,Monga,R.,Mao,M.,Yang,K.,Le,Q.,Nguyen,P.,Senior,A.,Vanhoucke,V.,Dean,J.,and Hinton,G. E.(2013). On rectified linear units for speech processing. In ICASSP 2013.
Zhou,B.,Khosla,A.,Lapedriza,A.,Oliva,A.,and Torralba,A.(2015). Object detectors emerge in deep scene CNNs. ICLR'2015,arXiv:1412.6856.
Zhou,J. and Troyanskaya,O. G.(2014). Deep supervised and convolutional generative stochastic network for protein secondary structure prediction. In ICML'2014.
Zhou,Y. and Chellappa,R.(1988). Computation of optical flow using a neural network. In IEEE International Conference on Neural Networks,pages 71–78. IEEE.
Zöhrer,M. and Pernkopf,F.(2014). General stochastic networks for classification. In NIPS'2014.
绝对值整流absolute value rectification
准确率accuracy
声学acoustic
激活函数activation function
AdaGrad AdaGrad
对抗adversarial
对抗样本adversarial example
对抗训练adversarial training
几乎处处almost everywhere
几乎必然almost sure
几乎必然收敛almost sure convergence
选择性剪接数据集alternative splicing dataset
原始采样ancestral sampling
退火重要采样annealed importance sampling
专用集成电路application-specific integrated circuit
近似贝叶斯计算approximate Bayesian computation
近似推断approximate inference
架构architecture
人工智能artificial intelligence
人工神经网络artificial neural network
渐近无偏asymptotically unbiased
异步随机梯度下降Asynchronous Stochastic Gradient Descent
异步asynchronous
注意力机制attention mechanism
属性attribute
自编码器autoencoder
自动微分automatic differentiation
自动语音识别Automatic Speech Recognition
自回归网络auto-regressive network
反向传播back propagation
回退back-off
反向传播backprop
通过时间反向传播back-propagation through time
词袋bag of words
Bagging bootstrap aggregating
bandit bandit
批量batch
批标准化batch normalization
贝叶斯误差Bayes error
贝叶斯规则Bayes' rule
贝叶斯推断Bayesian inference
贝叶斯网络Bayesian network
贝叶斯概率Bayesian probability
贝叶斯统计Bayesian statistics
基准benchmark
信念网络belief network
Bernoulli分布Bernoulli distribution
基准baseline
BFGS BFGS
偏置bias in affine function
偏差bias in statistics
有偏biased
有偏重要采样biased importance sampling
偏差bias
二元语法bigram
二元关系binary relation
二值稀疏编码binary sparse coding
比特bit
块坐标下降block coordinate descent
块吉布斯采样block Gibbs Sampling
玻尔兹曼分布Boltzmann distribution
玻尔兹曼机Boltzmann Machine
Boosting Boosting
桥式采样bridge sampling
广播broadcasting
磨合Burning-in
变分法calculus of variations
容量capacity
级联cascade
灾难遗忘catastrophic forgetting
范畴分布categorical distribution
因果因子causal factor
因果模型causal modeling
中心差分centered difference
中心极限定理central limit theorem
链式法则chain rule
混沌chaos
弦chord
弦图chordal graph
梯度截断clip gradient
截断梯度clipping the gradient
团clique
团势能clique potential
闭式解closed form solution
级联coalesced
编码code
协同过滤collaborative filtering
列column
列空间column space
共因common cause
完全图complete graph
复杂细胞complex cell
计算图computational graph
计算机视觉Computer Vision
概念漂移concept drift
条件计算conditional computation
条件概率conditional probability
条件独立的conditionally independent
共轭conjugate
共轭方向conjugate directions
共轭梯度conjugate gradient
联结主义connectionism
一致性consistency
约束优化constrained optimization
特定环境下的独立context-specific independences
contextual bandit contextual bandit
延拓法continuation method
收缩contractive
收缩自编码器contractive autoencoder
对比散度contrastive divergence
凸优化Convex optimization
卷积convolution
卷积玻尔兹曼机Convolutional Boltzmann Machine
卷积网络convolutional net
卷积神经网络convolutional neural network
坐标上升coordinate ascent
坐标下降coordinate descent
共父coparent
相关系数correlation
代价cost
代价函数cost function
协方差covariance
协方差矩阵covariance matrix
协方差RBM covariance RBM
覆盖coverage
准则criterion
临界点critical point
临界温度critical temperatures
互相关函数cross-correlation
交叉熵cross-entropy
累积函数cumulative function
课程学习curriculum learning
维数灾难curse of dimensionality
曲率curvature
控制论cybernetics
衰减damping
数据生成分布data generating distribution
数据生成过程data generating process
数据并行data parallelism
数据点data point
数据集dataset
数据集增强dataset augmentation
决策树decision tree
解码器decoder
分解decompose
深度信念网络deep belief network
深度玻尔兹曼机Deep Boltzmann Machine
深度回路deep circuit
深度前馈网络deep feedforward network
深度生成模型deep generative model
深度学习deep learning
深度模型deep model
深度网络deep network
点积dot product
双反向传播double backprop
双重分块循环矩阵doubly block circulant matrix
降采样downsampling
Dropout Dropout
Dropout Boosting Dropout Boosting
d-分离d-separation
动态规划dynamic programming
动态结构dynamic structure
提前终止early stopping
回声状态网络echo state network
有效容量effective capacity
特征分解eigendecomposition
特征值eigenvalue
特征向量eigenvector
基本单位向量elementary basis vectors
元素对应乘积element-wise product
嵌入embedding
经验分布empirical distribution
经验频率empirical frequency
经验风险empirical risk
经验风险最小化empirical risk minimization
编码器encoder
端到端的end-to-end
能量函数energy function
基于能量的模型Energy-based model
集成ensemble
集成学习ensemble learning
轮epoch
轮数epochs
等式约束equality constraint
均衡分布Equilibrium Distribution
等变equivariance
等变表示equivariant representations
误差条error bar
误差函数error function
误差度量error metric
错误率error rate
估计量estimator
欧几里得范数Euclidean norm
欧拉-拉格朗日方程Euler-Lagrange Equation
证据下界evidence lower bound
样本example
额外误差excess error
期望expectation
期望最大化expectation maximization
E步expectation step
期望值expected value
经验experience,E
专家网络expert network
相消解释explaining away
相消解释作用explaining away effect
解释因子explanatory factor
梯度爆炸exploding gradient
开发exploitation
探索exploration
指数分布exponential distribution
因子factor
因子分析factor analysis
因子图factor graph
因子factorial
分解factorization
分解的factorized
变差因素factors of variation
快速Dropout fast dropout
快速持续性对比散度fast persistent contrastive divergence
可行feasible
特征feature
特征提取器feature extractor
特征映射feature map
特征选择feature selection
反馈feedback
前向feedforward
前馈分类器feedforward classifier
前馈网络feedforward network
前馈神经网络feedforward neural network
现场可编程门阵列field programmable gate array
精调fine-tune
精调fine-tuning
有限差分finite difference
第一层first layer
不动点方程fixed point equation
定点运算fixed-point arithmetic
翻转flip
浮点运算floating-point arithmetic
遗忘门forget gate
前向传播forward propagation
傅里叶变换Fourier transform
中央凹fovea
自由能free energy
频率派概率frequentist probability
频率派统计frequentist statistics
Frobenius范数Frobenius norm
F分数F-score
全full
泛函functional
泛函导数functional derivative
Gabor函数Gabor function
Gamma分布Gamma distribution
门控gated
门控循环网络gated recurrent net
门控循环单元gated recurrent unit
门控RNN gated RNN
选通器gater
高斯分布Gaussian distribution
高斯核Gaussian kernel
高斯混合模型Gaussian Mixture Model
高斯混合体Gaussian mixtures
高斯输出分布Gaussian output distribution
高斯RBM Gaussian RBM
Gaussian-Bernoulli RBM Gaussian-Bernoulli RBM
通用GPU general purpose GPU
泛化generalization
泛化误差generalization error
广义函数generalized function
广义Lagrange函数generalized Lagrange function
广义Lagrangian generalized Lagrangian
广义伪似然generalized pseudolikelihood
广义伪似然估计generalized pseudolikelihood estimator
广义得分匹配generalized score matching
生成式对抗框架generative adversarial framework
生成式对抗网络generative adversarial network
生成模型generative model
生成式建模generative modeling
生成矩匹配网络generative moment matching network
生成随机网络generative stochastic network
生成器网络generator network
吉布斯分布Gibbs distribution
Gibbs采样Gibbs Sampling
吉布斯步数Gibbs steps
全局对比度归一化Global contrast normalization
全局极小值global minima
全局最小点global minimum
梯度gradient
梯度上升gradient ascent
梯度截断gradient clipping
梯度下降gradient descent
图模型graphical model
图形处理器Graphics Processing Unit
贪心greedy
贪心算法greedy algorithm
贪心逐层预训练greedy layer-wise pretraining
贪心逐层训练greedy layer-wise training
贪心逐层无监督预训练greedy layer-wise unsuper-vised pretraining
贪心监督预训练greedy supervised pretraining
贪心无监督预训练greedy unsupervised pretraining
网格搜索grid search
Hadamard乘积Hadamard product
汉明距离Hamming distance
硬专家混合体hard mixture of experts
硬双曲正切函数hard tanh
簧风琴harmonium
哈里斯链Harris Chain
Helmholtz机Helmholtz machine
Hessian Hessian
异方差heteroscedastic
隐藏层hidden layer
隐马尔可夫模型Hidden Markov Model
隐藏单元hidden unit
隐藏变量hidden variable
爬山hill climbing
超参数hyperparameter
超参数优化hyperparameter optimization
假设空间hypothesis space
同分布的identically distributed
可辨认的identifiable
单位矩阵identity matrix
独立同分布假设i.i.d. assumption
病态ill conditioning
不道德immorality
重要采样Importance Sampling
相互独立的independent
独立成分分析independent component analysis
独立同分布independent identically distributed
独立子空间分析independent subspace analysis
索引index of matrix
不等式约束inequality constraint
推断inference
无限infinite
信息检索information retrieval
内积inner product
输入input
输入分布input distribution
干预查询intervention query
不变invariant
求逆invert
Isomap Isomap
各向同性isotropic
Jacobian Jacobian
Jacobian矩阵Jacobian matrix
联合概率分布joint probability distribution
Karush-Kuhn-Tucker Karush-Kuhn-Tucker
核函数kernel function
核机器kernel machine
核方法kernel method
核技巧kernel trick
KL散度KL divergence
知识库knowledge base
知识图谱knowledge graph
Krylov方法Krylov method
KL散度Kullback-Leibler(KL) divergence
标签label
标注labeled
拉格朗日乘子Lagrange multiplier
语言模型language model
Laplace分布Laplace distribution
大学习步骤large learning step
潜在latent
潜层latent layer
潜变量latent variable
大数定理law of large numbers
逐层的layer-wise
L-BFGS L-BFGS
渗漏整流线性单元Leaky ReLU
渗漏单元leaky unit
学成learned
学习近似推断learned approximate inference
学习器learner
学习率learning rate
勒贝格可积Lebesgue-integrable
左特征向量left eigenvector
左奇异向量left singular vector
莱布尼兹法则Leibniz's rule
似然likelihood
线搜索line search
线性自回归网络linear auto-regressive network
线性分类器linear classifier
线性组合linear combination
线性相关linear dependence
线性因子模型linear factor model
线性模型linear model
线性回归linear regression
线性阈值单元linear threshold units
线性无关linearly independent
链接预测link prediction
链接重要采样linked importance sampling
Lipschitz Lipschitz
Lipschitz常数Lipschitz constant
Lipschitz连续Lipschitz continuous
流体状态机liquid state machine
局部条件概率分布local conditional probability distribution
局部不变性先验local constancy prior
局部对比度归一化local contrast normalization
局部下降local descent
局部核local kernel
局部极大值local maxima
局部极大点local maximum
局部极小值local minima
局部极小点local minimum
对数尺度logarithmic scale
逻辑回归logistic regression
logistic sigmoid logistic sigmoid
分对数logit
对数线性模型log-linear model
长短期记忆long short-term memory
长期依赖long-term dependency
环loop
环状信念传播loopy belief propagation
损失loss
损失函数loss function
机器学习machine learning
机器学习模型machine learning model
机器翻译machine translation
主对角线main diagonal
流形manifold
流形假设manifold hypothesis
流形学习manifold learning
边缘概率分布marginal probability distribution
马尔可夫链Markov Chain
马尔可夫链蒙特卡罗Markov Chain Monte Carlo
马尔可夫网络Markov network
马尔可夫随机场Markov random field
掩码mask
矩阵matrix
矩阵逆matrix inversion
矩阵乘积matrix product
最大范数max norm
池pool
最大池化max pooling
极大值maxima
M步maximization step
最大后验Maximum A Posteriori
最大似然maximum likelihood
最大似然估计maximum likelihood estimation
最大平均偏差maximum mean discrepancy
maxout maxout
maxout单元maxout unit
平均绝对误差mean absolute error
均值和协方差RBM mean and covariance RBM
学生t分布均值乘积mean product of Student t-distribution
均方误差mean squared error
均值-协方差RBM mean-covariance restricted Boltzmann machine
均匀场mean field
均值场mean-field
测度论measure theory
零测度measure zero
记忆网络memory network
信息传输message passing
小批量minibatch
小批量随机minibatch stochastic
极小值minima
极小点minimum
混合Mixing
混合时间Mixing Time
混合密度网络mixture density network
混合分布mixture distribution
专家混合体mixture of experts
模态modality
峰值mode
模型model
模型平均model averaging
模型压缩model compression
模型可辨识性model identifiability
模型并行model parallelism
矩moment
矩匹配moment matching
动量momentum
蒙特卡罗Monte Carlo
Moore-Penrose伪逆Moore-Penrose pseudoinverse
道德化moralization
道德图moralized graph
多层感知机multilayer perceptron
多峰值multimodal
多模态学习multimodal learning
多项式分布multinomial distribution
Multinoulli分布multinoulli distribution
多预测深度玻尔兹曼机multi-prediction deep Boltzmann machine
多任务学习multitask learning
多维正态分布multivariate normal distribution
朴素贝叶斯naive Bayes
奈特nats
自然语言处理Natural Language Processing
最近邻nearest neighbor
最近邻图nearest neighbor graph
最近邻回归nearest neighbor regression
负定negative definite
负部函数negative part function
负相negative phase
半负定negative semidefinite
Nesterov动量Nesterov momentum
网络network
神经自回归密度估计器neural auto-regressive density estimator
神经自回归网络neural auto-regressive network
神经语言模型Neural Language Model
神经机器翻译Neural Machine Translation
神经网络neural network
神经网络图灵机neural Turing machine
牛顿法Newton's method
n-gram n-gram
没有免费午餐定理no free lunch theorem
噪声noise
噪声分布noise distribution
噪声对比估计noise-contrastive estimation
非凸nonconvex
非分布式nondistributed
非分布式表示nondistributed representation
非线性共轭梯度nonlinear conjugate gradients
非线性独立成分估计nonlinear independent components estimation
非参数non-parametric
范数norm
正态分布normal distribution
正规方程normal equation
归一化的normalized
标准初始化normalized initialization
数值numeric value
数值优化numerical optimization
对象识别object recognition
目标objective
目标函数objective function
奥卡姆剃刀Occam's razor
one-hot one-hot
一次学习one-shot learning
在线online
在线学习online learning
操作operation
最佳容量optimal capacity
原点origin
正交orthogonal
正交矩阵orthogonal matrix
标准正交orthonormal
输出output
输出层output layer
过完备overcomplete
过估计overestimation
过拟合overfitting
过拟合机制overfitting regime
上溢overflow
并行分布式处理Parallel Distributed Processing
并行回火parallel tempering
参数parameter
参数服务器parameter server
参数共享parameter sharing
有参情况parametric case
参数化整流线性单元parametric ReLU
偏导数partial derivative
配分函数Partition Function
性能度量performance measures
性能度量performance metrics
置换不变性permutation invariant
持续性对比散度persistent contrastive divergence
音素phoneme
语音phonetic
分段piecewise
点估计point estimator
策略policy
策略梯度policy gradient
池化pooling
池化函数pooling function
病态条件poor conditioning
正定positive definite
正部函数positive part function
正相positive phase
半正定positive semidefinite
后验概率posterior probability
幂方法power method
PR曲线PR curve
精度precision
精度矩阵precision matrix
预测稀疏分解predictive sparse decomposition
预训练pretraining
初级视觉皮层primary visual cortex
主成分分析principal components analysis
先验概率prior probability
先验概率分布prior probability distribution
概率PCA probabilistic PCA
概率密度函数probability density function
概率分布probability distribution
概率质量函数probability mass function
专家之积product of expert
乘法法则product rule
成比例proportional
提议分布proposal distribution
伪似然pseudolikelihood
象限对quadrature pair
量子力学quantum mechanics
径向基函数radial basis function
随机搜索random search
随机变量random variable
值域range
比率匹配ratio matching
召回率recall
接受域receptive field
再循环recirculation
推荐系统recommender system
重构reconstruction
重构误差reconstruction error
整流线性rectified linear
整流线性变换rectified linear transformation
整流线性单元rectified linear unit
整流网络rectifier network
循环recurrence
循环卷积网络recurrent convolutional network
循环网络recurrent network
循环神经网络recurrent neural network
回归regression
正则化regularization
正则化regularize
正则化项regularizer
强化学习reinforcement learning
关系relation
关系型数据库relational database
重参数化reparametrization
重参数化技巧reparametrization trick
表示representation
表示学习representation learning
表示容量representational capacity
储层计算reservoir computing
受限玻尔兹曼机Restricted Boltzmann Machine
反向相关reverse correlation
反向模式累加reverse mode accumulation
岭回归ridge regression
右特征向量right eigenvector
右奇异向量right singular vector
风险risk
行row
扫视saccade
鞍点saddle point
无鞍牛顿法saddle-free Newton method
相同same
样本均值sample mean
样本方差sample variance
饱和saturate
标量scalar
得分score
得分匹配score matching
二阶导数second derivative
二阶导数测试second derivative test
第二层second layer
二阶方法second-order method
自对比估计self-contrastive estimation
自信息self-information
语义哈希semantic hashing
半受限波尔兹曼机semi-restricted Boltzmann Machine
半监督semi-supervised
半监督学习semi-supervised learning
可分离的separable
分离的separate
分离separation
情景setting
浅度回路shallow circuit
香农熵Shannon entropy
香农shannons
塑造shaping
短列表shortlist
sigmoid sigmoid
sigmoid信念网络sigmoid Belief Network
简单细胞simple cell
奇异的singular
奇异值singular value
奇异值分解singular value decomposition
奇异向量singular vector
跳跃连接skip connection
慢特征分析slow feature analysis
慢性原则slowness principle
平滑smoothing
平滑先验smoothness prior
softmax softmax
softmax函数softmax function
softmax单元softmax unit
softplus softplus
softplus函数softplus function
生成子空间span
稀疏sparse
稀疏激活sparse activation
稀疏编码sparse coding
稀疏连接sparse connectivity
稀疏初始化sparse initialization
稀疏交互sparse interactions
稀疏权重sparse weights
谱半径spectral radius
语音识别Speech Recognition
sphering sphering
尖峰和平板spike and slab
尖峰和平板RBM spike and slab RBM
虚假模态spurious modes
方阵square
标准差standard deviation
标准差standard error
标准正态分布standard normal distribution
声明statement
平稳的stationary
平稳分布Stationary Distribution
驻点stationary point
统计效率statistical efficiency
统计学习理论statistical learning theory
统计量statistics
最陡下降steepest descent
随机stochastic
随机课程stochastic curriculum
随机梯度上升Stochastic Gradient Ascent
随机梯度下降stochastic gradient descent
随机矩阵Stochastic Matrix
随机最大似然stochastic maximum likelihood
流stream
步幅stride
结构学习structure learning
结构化概率模型structured probabilistic model
结构化变分推断structured variational inference
亚原子subatomic
子采样subsample
求和法则sum rule
和–积网络sum-product network
监督supervised
监督学习supervised learning
监督学习算法supervised learning algorithm
监督模型supervised model
监督预训练supervised pretraining
支持向量support vector
代理损失函数surrogate loss function
符号symbol
符号表示symbolic representation
对称symmetric
切面距离tangent distance
切平面tangent plane
正切传播tangent prop
泰勒Taylor
导师驱动过程teacher forcing
温度temperature
回火转移tempered transition
回火tempering
张量tensor
测试误差test error
测试集test set
碰撞情况the collider case
绑定的权重tied weights
Tikhonov正则Tikhonov regularization
平铺卷积tiled convolution
时延神经网络time delay neural network
时间步time step
Toeplitz矩阵Toeplitz matrix
标记token
容差tolerance
地形ICA topographic ICA
训练误差training error
训练集training set
转录transcribe
转录系统transcription system
迁移学习transfer learning
转移transition
转置transpose
三角不等式triangle inequality
三角形化triangulate
三角形化图triangulated graph
三元语法trigram
无偏unbiased
无偏样本方差unbiased sample variance
欠完备undercomplete
欠定的underdetermined
欠估计underestimation
欠拟合underfitting
欠拟合机制underfitting regime
下溢underflow
潜在underlying
潜在成因underlying cause
无向undirected
无向模型undirected model
展开图unfolded graph
展开unfolding
均匀分布uniform distribution
一元语法unigram
单峰值unimodal
单元unit
单位范数unit norm
单位向量unit vector
万能近似定理universal approximation theorem
万能近似器universal approximator
万能函数近似器universal function approximator
未标注unlabeled
未归一化概率函数unnormalized probability function
非共享卷积unshared convolution
无监督unsupervised
无监督学习unsupervised learning
无监督学习算法unsupervised learning algorithm
无监督预训练unsupervised pretraining
有效valid
验证集validation set
梯度消失与爆炸问题vanishing and exploding gradient problem
梯度消失vanishing gradient
Vapnik-Chervonenkis维度Vapnik-Chervonenkis dimension
变量消去variable elimination
方差variance
方差减小variance reduction
变分自编码器variational auto-encoder
变分导数variational derivative
变分自由能variational free energy
变分推断variational inference
向量vector
虚拟对抗样本virtual adversarial example
虚拟对抗训练virtual adversarial training
可见层visible layer
V-结构V-structure
醒眠wake sleep
warp warp
支持向量机support vector machine
无向图模型undirected graphical model
权重weight
权重衰减weight decay
权重比例推断规则weight scaling inference rule
权重空间对称性weight space symmetry
条件概率分布conditional probability distribution
白化whitening
宽度width
赢者通吃winner-take-all
正切传播tangent propagation
流形正切分类器manifold tangent classifier
词嵌入word embedding
词义消歧word-sense disambiguation
零数据学习zero-data learning
零次学习zero-shot learning