INTERVIEWS by Stephen Ibaraki
Chat with Juergen Schmidhuber: Artificial Intelligence & Deep Learning - Now & Future
This week, Stephen Ibaraki has an exclusive interview with Juergen Schmidhuber. (Photo Credits: FAZ/Bieber)
The Deep Learning Neural Networks developed since the early 1990s in Prof. Jürgen Schmidhuber's group at TU Munich and the Swiss AI Lab IDSIA (USI and SUPSI) have revolutionized machine learning and AI, and are now available to billions of users through Google, Apple, Microsoft, IBM, Baidu, and many other companies. His research group also established the field of mathematically rigorous universal AI and optimal universal problem solvers. His formal theory of creativity and curiosity and fun explains art, science, music, and humor. He is the recipient of numerous awards including the 2016 IEEE Neural Networks Pioneer Award, and is president of NNAISENSE, which aims at building the first practical general purpose AI.
Q: Before we talk about the 21st century: What was the most influential innovation of the previous one?
A: "In 1999, the journal Nature made a list of the most influential inventions of the 20th century. Number 1 was the invention that made the century stand out among all centuries, by "detonating the population explosion" (V. Smil), from 1.6 billion people in 1900 to soon 10 billion. This invention was the Haber process, which extracts nitrogen from thin air, to make artificial fertilizer. Without it, 1 in 2 persons would not even exist. Soon this ratio will be 2 in 3. Billions and billions would never have lived without it. Nothing else has had so much existential impact. (And nothing in the past 2 billion years had such an effect on the global nitrogen cycle.)"
Q: How about the 21st century?
A: "The Grand Theme of the 21st century is even grander: True Artificial Intelligence (AI). AIs will learn to do almost everything that humans can do, and more. There will be an AI explosion, and the human population explosion will pale in comparison."
Q: How did it become obvious to you, early on, that AI would be this great new development for the world?
A: "This seemed obvious to me as a teenager in the 1970s, when my goal became to build a self-improving AI much smarter than myself, then retire and watch AIs start to colonize and transform the solar system and the galaxy and the rest of the universe in a way infeasible for humans. So I studied maths and computer science. For the cover of my 1987 diploma thesis, I drew a robot that bootstraps itself in seemingly impossible fashion [1]."
Q: Can you share more about this thesis that foreshadows where AI is going?
A: "The thesis was very ambitious and described the first concrete research on a self-rewriting "meta-program" which not only learns to improve its performance in some limited domain but also learns to improve the learning algorithm itself, and the way it meta-learns the way it learns etc. This was the first in a decades-spanning series of papers on concrete algorithms for recursive self-improvement, with the goal of laying the foundations for super-intelligences."
Q: And you thought that the basic ultimate AI algorithm will be very elegant and simple?
A: "I predicted that in hindsight the ultimate self-improver will seem so simple that high school students will be able to understand and implement it. I said it's the last significant thing a man can create, because all else follows from that. I am still saying the same thing. The only difference is that more people are listening. Why? Because methods we have developed on the way to this goal are now massively used by the world's most valuable public companies."
Q: What kind of computational device should we use to build practical AIs?
A: "Physics dictates that future efficient computational hardware will look a lot like a brain-like recurrent neural network (RNN), a general purpose computer with many processors packed in a compact 3-dimensional volume, connected by many short and few long wires, to minimize communication costs. Your cortex has over 10 billion neurons, each connected to 10,000 other neurons on average. Some are input neurons that feed the rest with data (sound, vision, tactile, pain, hunger). Others are output neurons that move muscles. Most are hidden in between, where thinking takes place. All learn by changing the connection strengths, which determine how strongly neurons influence each other, and which seem to encode all your lifelong experience. Same for our artificial RNNs, which learn better than previous methods to recognize speech or handwriting or video, minimize pain, maximize pleasure, drive simulated cars, etc."
Q: How did your early work on neural networks differ from that of others?
A: "The difference between our neural networks (NNs) and others is that we figured out ways of making NNs deeper and more powerful, especially recurrent NNs (RNNs), the most general and deepest NNs, which have feedback connections and can, in principle, run arbitrary algorithms or programs interacting with the environment. In 1991, I published the first "Very Deep Learners" [2], systems much deeper than the 8-layer nets of the Ukrainian mathematician Ivakhnenko, the father of "Deep Learning" in the 1960s [3]. By the early 1990s, our systems could learn to solve many previously unlearnable problems. But this was just the beginning."
Q: Looking back from the 1990s, how was this just the beginning? How does this relate to Moore’s Law and new future developments in compute power?
A: "Back then it was already clear that every 5 years computers are getting roughly 10 times faster per dollar. Unlike Moore's Law (which recently broke), this trend has held since Konrad Zuse built the first working program-controlled computer between 1935 and 1941. Today, 75 years later, hardware is roughly a million billion times faster per unit price. We have greatly profited from this acceleration. Soon we'll have cheap devices with the raw computational power of a human brain, and a few decades later, of all 10 billion human brains together, which collectively probably cannot execute more than 10^30 meaningful elementary operations per second. And it won't stop there: Bremermann's physical limit (1982) [4] for 1 kg of computational substrate is still over 10^20 times bigger than that. Even if the trend holds up, this limit won't be approached before the next century, which is still "soon" though - a century is just 1 percent of 10,000 years of human civilization."
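The arithmetic behind the "million billion" figure can be checked with a minimal sketch. It assumes only the two numbers stated in the answer (a factor of 10 every 5 years, over 75 years); the function name and parameters are illustrative and not from the interview.

```python
# Rough check of the trend quoted above: 10x faster per dollar every 5 years.
# speedup_factor and its parameter names are hypothetical helpers.

def speedup_factor(years: float,
                   factor_per_period: float = 10.0,
                   period_years: float = 5.0) -> float:
    """Cumulative speedup per dollar after `years` of steady exponential growth."""
    return factor_per_period ** (years / period_years)


if __name__ == "__main__":
    # 75 years at 10x per 5 years is 15 tenfold steps, i.e. 10^15,
    # which is "a million billion" (10^6 * 10^9).
    print(f"speedup after 75 years: {speedup_factor(75):.0e}")
```

The same helper can be called with other horizons to see how quickly the exponent grows under the stated assumption of an unbroken trend.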
Q: You talked earlier about Neural Networks (NNs). Please say more about developments in NNs such as LSTM, which your team pioneered.
A: "Most current commercial NNs need teachers. They rely on a method called backpropagation, whose present elegant and efficient form was first formulated by Seppo Linnainmaa in 1970 [5] (extending earlier work in control theory [5a-c]), and applied to teacher-based supervised learning NNs in 1982 by Paul Werbos [6] - see survey [22]. However, backpropagation did not work well for deep NNs and RNNs. In 1991, Sepp Hochreiter, my first student ever (now professor) working on my first deep learning project, identified the reason for this failure, namely, the problem of vanishing or exploding gradients [7]. This was then overcome by the now widely used Deep Learning RNN called "Long Short-Term Memory (LSTM)" (first peer-reviewed publication in 1997 [8]), which was further developed with the help of my outstanding students and postdocs including Felix Gers [9], Alex Graves, Santi Fernandez, Faustino Gomez, Daan Wierstra, Justin Bayer, Marijn Stollenga, Wonmin Byeon, Rupesh Srivastava, Klaus Greff and others. The LSTM principle has become a basis of much of what's now called deep learning, especially for sequential data. (BTW, today's largest LSTMs have a billion connections or so. That is, in 25 years, 100 years after Zuse, for the same price we should have human brain-sized LSTMs with 100,000 billion connections, extrapolating the trend mentioned above.)"
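The vanishing or exploding gradient problem mentioned above can be illustrated with a toy example. This is a simplified sketch, not Hochreiter's analysis: it assumes a one-unit linear recurrence h_t = w * h_{t-1}, for which the backpropagated error is scaled by w at every time step. The function name is hypothetical.

```python
# Toy illustration: error scaling when backpropagating through many time steps
# of the scalar linear recurrence h_t = w * h_{t-1} (an assumed, simplified model).

def backprop_factor(weight: float, steps: int) -> float:
    """Factor by which an error signal is scaled after `steps` steps back in time."""
    grad = 1.0
    for _ in range(steps):
        # Chain rule: each step contributes dh_t/dh_{t-1} = weight.
        grad *= weight
    return grad


if __name__ == "__main__":
    for w in (0.9, 1.0, 1.1):
        print(f"w = {w}: factor after 100 steps = {backprop_factor(w, 100):.3g}")
```

With a weight below 1 the signal shrinks towards zero, above 1 it blows up, and only a weight of exactly 1.0 carries it through unchanged. The 1997 LSTM paper builds on that last case with a self-connection of fixed weight 1.0 inside each memory cell, with gates controlling what enters and leaves the cell.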
Q: Do you have an LSTM demo or examples that we can relate to?
A: "Do you have a smartphone? Because since mid 2015, Google's speech recognition is based on LSTM [8] with forget gates for recurrent units [9] trained by our "Connectionist Temporal Classification (CTC)" (2006) [10]. This approach dramatically improved Google Voice not only by 5% or 10% (which already would have been great) but by almost 50% - now available to billions of smartphone users."
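One ingredient of CTC can be sketched in a few lines: the many-to-one mapping from per-frame network outputs to a label sequence, which merges repeated labels and then drops a special blank symbol. The sketch below shows only this collapsing step (as used in simple best-path decoding), not the training procedure, which sums over all alignments. The blank character "_" and the function name are assumptions made for the example.

```python
from itertools import groupby

# Assumed blank symbol for this example; real systems reserve a label index.
BLANK = "_"


def ctc_collapse(frame_labels):
    """Collapse per-frame labels: merge adjacent repeats, then remove blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return [label for label in merged if label != BLANK]


if __name__ == "__main__":
    # One label per audio frame; the blank between the two "l" runs keeps
    # the double letter from being merged into one.
    frames = list("hh_e_ll_lloo")
    print("".join(ctc_collapse(frames)))  # hello
```

Because many different frame-level paths collapse to the same output, the network does not need pre-segmented training data, which is the point made in the title of the 2006 paper.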
Q: How was this research funded?
A: "Funding for the development of LSTM was provided by European taxpayers, in particular, through my long series of Swiss SNF grants for LSTM at the Swiss AI Lab IDSIA (USI & SUPSI) since 1995."
Q: Can you provide other success stories based on LSTM and related concepts?
A: "In 2009, LSTM became the first RNN to win international pattern recognition contests, through the efforts of my former PhD student and postdoc Alex Graves. Microsoft’s recent ImageNet 2015 winner [12] is very similar to our "highway networks" [11], the first very deep feedforward networks with hundreds of layers, also based on the LSTM principle. The Chinese search giant Baidu is building on our methods such as CTC [10], and announced this in Forbes Magazine. Apple explained at its recent WWDC 2016 developer conference how it is using LSTM to improve its operating system iOS. Google is applying the rather universal LSTM not only to speech recognition but also to natural language processing, machine translation, image caption generation, automated email answering, and other fields. Eventually they'll end up as one huge LSTM."
Q: Your team has also made advancements in deep supervised feedforward NNs. Can you talk more about that and examples of competitions?
A: "A less fundamental but very practical contribution of our lab was to greatly speed up deep supervised feedforward NNs on fast graphics processors originally developed for the video game industry, in particular, convolutional NN architectures (Fukushima 1979 [13]; Weng 1993 [14]) trained (LeCun et al 1989 [15], Ranzato et al 2007 [16]) by Linnainmaa's old back-propagation technique [5] mentioned above. In 2009, many still believed that unsupervised pretraining is required to train deep NNs. However, without any such pretraining, my team headed by Dan Ciresan [18, 19] could win a whole string of machine learning competitions, dramatically outperforming previous systems, achieving the first superhuman image recognition in 2011, the first deep learners to win object detection and image segmentation contests in 2012, the best cancer detection in medical images (2012), the victory in the MICCAI Grand Challenge (2013), etc. Today, many famous companies are using this approach for a multitude of applications."
Q: What was your path to success during the so-called “Neural Network Winter” of the 1990s and how did others perceive your pioneering work?
A: "In hindsight it is funny that for a long time, even well-known neural net experts in Canada and the US and other places failed to realize the potential of the very deep and recurrent NNs developed in our little pre-Alpine labs since the early 1990s. For example, in 1995, the first paper on LSTM got rejected by the well-known NIPS conference. But eventually those scientists have come around - today, they (and their companies) are heavily using our methods. "
Q: Can you talk about unsupervised learning and what it might have to do with consciousness?
A: "True AI goes far beyond merely imitating teachers through deep NNs. This explains the interest in UNsupervised learning (UL). There are two types of UL: passive and active. Passive UL is simply about detecting regularities in observation streams. This means learning to encode data with fewer computational resources such as space and time and energy, e.g., data compression through predictive coding, which can be achieved to a certain extent by backpropagation, and can facilitate subsequent goal-directed learning, as shown by the very deep learner of 1991 mentioned earlier [2]. It could deal with hundreds of subsequent neural processing stages. A variant thereof even emulates aspects of consciousness, through a recurrent "automatiser RNN" that absorbs or distills formerly "conscious" but later "subconscious" compressive subprograms learned earlier by a "conscious chunker RNN." Generally speaking, consciousness and self-awareness are overrated. I have always viewed them as natural by-products of compressing the observation history of a problem solver, by efficiently encoding frequent observations, including self-observations of the problem solver itself."
Q: You have introduced a simple theory of curiosity and creativity. Can you talk about this pioneering work, and have you already created simple AIs illustrating it?
A: "Our active UL or "artificial curiosity" goes beyond passive UL: it is about learning to shape the observation stream through action sequences or experiments that help the learning agent figure out how the world works and what can be done in it. Which unsupervised experiments should some AI's reward-maximizing controller C conduct to achieve self-invented goals and to collect data that quickly improves its predictive world model M, which learns to predict what will happen if C does this or that, and which can be used to plan future goal-directed actions? M could be an unsupervised RNN trained on the entire history of actions and observations so far. My simple formal theory of curiosity and creativity (developed since 1991 - see survey [23]) says: Use the learning progress of M (in particular, compression progress) as an extra intrinsic reward (or fun) for C, to motivate C to come up with additional promising experiments. I have argued that this simple but general principle explains all kinds of curious and creative behaviour in art and music and science and comedy, and we have indeed already built simple artificial "scientists" based on it. There is no reason why machines cannot be curious and creative."
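The principle "use the learning progress of M as intrinsic reward for C" can be sketched with a deliberately tiny stand-in. In this toy version, which is an assumption for illustration and not the systems described in the interview, the world model M just predicts the running mean of what it has seen, and the reward is how much M's squared error on an observation drops after M learns from it. This is a crude proxy for compression progress; all class and function names are hypothetical.

```python
class RunningMeanModel:
    """Toy world model M: predicts each observation as the mean of past ones."""

    def __init__(self) -> None:
        self.mean = 0.0
        self.count = 0

    def error(self, observation: float) -> float:
        """Squared prediction error of the current model on one observation."""
        return (observation - self.mean) ** 2

    def update(self, observation: float) -> None:
        """Incrementally fold one observation into the running mean."""
        self.count += 1
        self.mean += (observation - self.mean) / self.count


def curiosity_reward(model: RunningMeanModel, observation: float) -> float:
    """Intrinsic reward: error before learning minus error after learning."""
    before = model.error(observation)
    model.update(observation)
    after = model.error(observation)
    return before - after


if __name__ == "__main__":
    m = RunningMeanModel()
    # A constant stream is rewarding once, then boring: nothing left to learn.
    print([curiosity_reward(m, 4.0) for _ in range(3)])
```

A controller C that maximizes this signal would be drawn to observations where M can still improve and would lose interest in what M already predicts well, which is the qualitative behaviour the answer describes.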
Q: You have done pioneering and foundational work in AI and continue to do so. Your work is used by all the major technology companies. Your team is also the top team in AI, and some of your lab's former students are co-founders of AI companies, many of them doing incredible work. One example is DeepMind, which was heavily influenced by your lab's former students, including DeepMind's first PhDs in AI & Machine Learning and two of DeepMind's first four members. DeepMind gained international fame as a Google investment and, most recently, with AlphaGo beating the top Go player, an event that many had predicted for 2026 rather than 2016. It has been said that the next major breakthrough instantiated in a startup can create 10 Microsofts, meaning a company with more than 4 trillion USD in market capitalization. Can you talk about commercializing your work?
A: "Although our work has influenced many companies large and small, most pioneers of the basic learning algorithms and methods for Artificial General Intelligence (AGI) are still based in Switzerland or affiliated with our own company NNAISENSE, pronounced like "nascence," because it’s about the birth of a general purpose Neural Network-based Artificial Intelligence (NNAI). It has 5 co-founders, several employees, and revenues through ongoing state-of-the-art applications in industry and finance, and is also talking to strategic and financial investors. We believe we can go far beyond what's possible today, and pull off the big practical breakthrough that will change everything, in line with my old motto since the 1970s: "build an AI smarter than myself such that I can retire.""
Q: There is a lot of attention to brain research and its impact on AI. For example, we have the EU brain project, the US brain initiative, and DARPA-funded brain projects. Can your AI research benefit from advances in brain research?
A: "Hardly. The last time neuroscience contributed useful inspiration to AI was many decades ago. Recent successes of deep learning are mostly due to insights of math and engineering quite disconnected from neuroscience. Since the mathematically optimal universal AIs and problem solvers developed in my lab at IDSIA in the early 2000s (e.g. Marcus Hutter's AIXI(tl) model [20], or my self-referential Gödel Machine [21]) consist of just a few formulas, I believe that it will be much easier to synthesize practical intelligence from first principles, rather than analyzing the existing role model, namely, the brain. In lectures since the 1990s, I have always used the example of a 19th century engineer who already knows something about electricity. How would he study the intelligence of a modern cell phone? Perhaps he’d poke needles into the chip to measure characteristic curves of transistors (akin to neuroscientists who measure details of calcium channels in neurons), without realizing that the transistor's main raison d'etre is its value as a simple binary switch. Perhaps he’d monitor the time-varying heat distribution of the microprocessor (akin to neuroscientists who study large-scale phenomena such as brain region activity during thought), without realizing the simple nature of the address-sorting program running on it. Understanding the principles of intelligence does not require neurophysiology and electrical engineering, but math and algorithms, in particular, machine learning and techniques for program search."
Q: What do you see as the nearer-term future in AI advancements and where will this lead?
A: "Kids and even certain little animals are still smarter than our best self-learning robots. But I think that within not so many years we'll be able to build an NN-based AI (an NNAI) that incrementally LEARNS to become at least as smart as a little animal, curiously and creatively and continually learning to plan and reason and decompose a wide variety of problems into quickly solvable (or already solved) subproblems, in a very general way.
Once animal-level AI has been achieved, the next step towards human-level AI may be small: it took billions of years to evolve smart animals, but only a few million years on top of that to evolve humans. Technological evolution is much faster than biological evolution, because dead ends are weeded out much faster. That is, once we have animal-level AI, a few years or decades later we may have human-level AI, with truly limitless applications, and every business will change, and all of civilization will change, and EVERYTHING will change."
Q: What will be the near-term social implications of AI?
A: "Smart robots and/or their owners will have to pay sufficient taxes to prevent social revolutions. What remains to be done for humans? Freed from hard work, "Homo Ludens" (the playing man) will (as always) invent new ways of professionally interacting with other humans. Already today, most people (probably you too) are working in "luxury jobs" which unlike farming are not really necessary for the survival of our species. Machines are much faster than Usain Bolt, but he still can make hundreds of millions by defeating other humans on the race track. In South Korea, the most wired country, new jobs emerged, such as the professional video game player. Remarkably, countries with many robots per capita (Japan, Germany, Korea, Switzerland) have relatively low unemployment rates. My old statement from the 1980s is still valid: It’s easy to predict which jobs will disappear, but hard to predict which new jobs will be created. "
Q: Shouldn't we be afraid of AIs?
A: "Many talk about AIs. Few build them. Prominent entrepreneurs, philosophers, physicists and others with not so much AI expertise have recently warned of the dangers of AI. I have tried to allay their fears, pointing out that there is immense commercial pressure to use artificial neural networks such as our LSTM to build friendly AIs that make their users healthier and happier. Nevertheless, one cannot deny that armies use clever robots, too. Here is my old trivial example from 1994 when Ernst Dickmanns had the first truly self-driving cars in highway traffic: similar machines can also be used by the military as self-driving land mine seekers.
We should be much more afraid, however, of half-century-old tech in the form of H-bomb rockets. A single H-bomb can have more destructive power than all conventional weapons (or all weapons of WW-II) combined. Many forget that despite the dramatic nuclear disarmament since the 1980s, there are still enough H-bomb rockets to wipe out civilization within a few hours, without any AI. AI does not introduce a new quality of existential threat."
Q: What will be the long-term implications of AI, and what about the “technological singularity” and your concept of "Omega"?
A: "In the 1950s, Stanislaw Ulam introduced the concept of exponentially accelerating technical progress converging within finite time in a technological "singularity." Vernor Vinge popularized it through science fiction novels in the 1980s (that's how I learned about it). Most of its proponents have focused on the apparent technological acceleration in recent decades, e.g., Moore's Law. In 2014, however, I discovered a much older, beautiful, incredibly precise exponential acceleration pattern that reaches all the way back to the Big Bang: The history of the perhaps most important events from a human perspective suggests that human-dominated history might "converge" around the year Omega = 2050 or so. (I like to call the convergence point "Omega," because that’s what Teilhard de Chardin called humanity's next level 100 years ago, and because "Omega" sounds better than "singularity" - a bit like "Oh my God.") The error bars on most dates below seem less than 10% or so. I have no idea why crucial historic events have kept hitting 1/4 points so precisely:
Ω = 2050 or so
Ω - 13.8 B years: Big Bang
Ω - 1/4 of this time: Ω - 3.5 B years: first life on Earth
Ω - 1/4 of this time: Ω - 0.9 B years: first animal-like mobile life
Ω - 1/4 of this time: Ω - 220 M years: first mammals (your ancestors)
Ω - 1/4 of this time: Ω - 55 M years: first primates (your ancestors)
Ω - 1/4 of this time: Ω - 13 M years: first hominids (your ancestors)
Ω - 1/4 of this time: Ω - 3.5 M years: first stone tools ("dawn of technology")
Ω - 1/4 of this time: Ω - 850 K years: controlled fire (next big tech breakthrough)
Ω - 1/4 of this time: Ω - 210 K years: first anatomically modern man (your ancestors)
Ω - 1/4 of this time: Ω - 50 K years: behaviorally modern man colonizing earth
Ω - 1/4 of this time: Ω - 13 K years: neolithic revolution, agriculture, domesticated animals
Ω - 1/4 of this time: Ω - 3.3 K years: onset of 1st population explosion, iron age
Ω - 1/4 of this time: Ω - 800 years: first guns & rockets (in China)
Ω - 1/4 of this time: Ω - 200 years: onset of 2nd population explosion, industrial revolution
Ω - 1/4 of this time: Ω - 50 years: digital nervous system, WWW, cell phones for all
Ω - 1/4 of this time: Ω - 12 years: human-competitive AIs?
Ω - 1/4 of this time: Ω - 3 years: ??
Ω - 1/4 of this time: Ω - 9 months:????
Ω - 1/4 of this time: Ω - 2 months:????????
Ω - 1/4 of this time: Ω - 2 weeks: ????????????????"
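The quarter-by-quarter pattern in the list above can be reproduced numerically. The following is a minimal sketch that assumes only the two numbers given in the list (13.8 billion years back to the Big Bang, and a factor of 1/4 per step); the constant and function names are illustrative.

```python
# Reproduce the "1/4 of this time" intervals from the list above.
# BIG_BANG_YEARS_BEFORE_OMEGA and omega_intervals are hypothetical names.

BIG_BANG_YEARS_BEFORE_OMEGA = 13.8e9


def omega_intervals(steps: int = 20, ratio: float = 0.25) -> list:
    """Years before Omega for each step, starting with the Big Bang (step 0)."""
    return [BIG_BANG_YEARS_BEFORE_OMEGA * ratio ** k for k in range(steps)]


if __name__ == "__main__":
    for k, years in enumerate(omega_intervals()):
        print(f"step {k:2d}: Omega - {years:,.2f} years")
```

Comparing the computed values with the rounded dates in the list shows how closely each event sits to an exact quarter point, for example roughly 3.45 billion years at step 1 against the listed 3.5 billion, and roughly 51 years at step 14 against the listed 50.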
Q: So what will happen after Omega?
A: "Of course, time won't stop. My kids were born around 2000. Some insurance mathematicians expect them to see the year 2100, because they are girls. For a substantial fraction of their lives, the smartest and most important decision makers might not be human. "
Q: And what will those AIs do?
A: "Space is hostile to humans but friendly to appropriately designed robots, and offers many more resources than the thin film of biosphere around the earth, which gets less than a billionth of the sun's light. While some AIs will remain fascinated with life, at least as long as they don't fully understand it, most will be more interested in the incredible new opportunities for robots and software life out there in space. Through innumerable self-replicating robot factories in the asteroid belt and elsewhere they will transform the rest of the solar system and then within a few million years the entire galaxy and within billions of years the rest of the reachable universe, held back only by the light speed limit. (AIs or parts thereof like to travel by radio from senders to receivers, whose first establishment takes time though.)
Many SF novels of the past century have featured a single AI dominating everything. I have argued that it seems much more realistic to expect an incredibly diverse variety of AIs trying to optimize all kinds of partially conflicting (and quickly evolving) utility functions, many of them generated automatically (we already evolved utility functions in the previous millennium), where each AI is continually trying to survive and adapt to rapidly changing niches in AI ecologies driven by intense competition and collaboration beyond current imagination."
Q: Should our children and young people worry about future AIs pursuing their own goals, being curious and creative, in a way similar to the way humans and other mammals are creative, but on a much grander scale?
A: "I think they may hope that unlike in Schwarzenegger movies there won't be many goal conflicts between "us" and "them." Humans and others are interested in those they can compete and collaborate with, because they share the same goals. Politicians are mostly interested in other politicians, kids in other kids of the same age, goats in other goats. Supersmart AIs will be mostly interested in other supersmart AIs, not in humans. Just like humans are mostly interested in other humans, not in ants. Note that the weight of all ants is still comparable to the weight of all humans.
Humans won't play a significant role in the spreading of intelligence across the universe. But that's ok. Don’t think of humans as the crown of creation. Instead view human civilization as part of a much grander scheme, an important step (but not the last one) on the path of the universe towards more and more unfathomable complexity. Now it seems ready to make its next step, a step comparable to the invention of life itself over 3 billion years ago. This is more than just another industrial revolution. This is something new that transcends humankind and even biology. It is a privilege to witness its beginnings, and contribute something to it."
Q: Is that your last word for today, or do you have an even more all-encompassing philosophy?
A: "I do. What is the simplest explanation of our universe? Since 1997, I have published [24, 25] on the very simple, asymptotically fastest [26], optimal, most efficient way of computing ALL logically possible universes, including ours, assuming ours is computable indeed (no physical evidence against this), thus generalizing Everett's many-worlds theory of physics [27]. Any “Great Programmer” with any self-respect should use this optimal method to create and master all logically possible universes (or to search for solutions to sufficiently complex problems), thus generating as by-products many histories of deterministic computable universes, many of them inhabited by observers. Although we cannot know some Great Programmer's goals (or whether there is one at all), due to certain properties of the optimal method, at any given time in its computational process, most of the universes computed so far that contain yourself will be due to one of the shortest and fastest programs computing you. This insight allows for making highly non-trivial and encouraging predictions about your future [28]!"
Interview References:
[1] Schmidhuber, J. (1987). Evolutionary principles in self-referential learning, or on learning how to learn: The meta-meta-... hook. Diploma thesis, TU Munich, 1987. More: http://people.idsia.ch/~juergen/metalearner.html
[2] Schmidhuber, J. (1992). Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234–242. Based on TR FKI-148-91, TUM, 1991. More: http://people.idsia.ch/~juergen/firstdeeplearner.html
[3] Ivakhnenko, A. G. and Lapa, V. G. (1965). Cybernetic Predicting Devices. CCM Information Corporation. (See also survey in IEEE TSMC (4):364–378, 1971.)
[4] Bremermann, H. J. (1982). Minimum energy requirements of information transfer and computing. International Journal of Theoretical Physics, 21(3):203–217.
[5] Linnainmaa, S. (1970). The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master’s thesis, Univ. Helsinki. (See also BIT Numerical Mathematics, 16(2):146–160, 1976.)
[5a] Kelley, H. J. (1960). Gradient theory of optimal flight paths. ARS Journal, 30(10):947–954.
[5b] Bryson, A. E. (1961). A gradient method for optimizing multi-stage allocation processes. In Proc. Harvard Univ. Symposium on digital computers and their applications.
[5c] Dreyfus, S. E. (1962). The numerical solution of variational problems. Journal of Mathematical Analysis and Applications, 5(1):30–45.
[6] Werbos, P. J. (1982). Applications of advances in nonlinear sensitivity analysis. In Proceedings of the 10th IFIP Conference, 31.8 - 4.9, NYC, pp. 762–770. (Extending thoughts in his 1974 thesis.)
[7] Hochreiter, S. (1991). Diploma thesis, TU Munich. Advisor: J. Schmidhuber.
[8] Hochreiter, S. and Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8):1735–1780. Based on TR FKI-207-95, TUM (1995). More: http://people.idsia.ch/~juergen/rnn.html
[9] Gers, F. A., Schmidhuber, J., and Cummins, F. (2000). Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10):2451–2471.
[10] Graves, A., Fernandez, S., Gomez, F. J., and Schmidhuber, J. (2006). Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural nets. Proc. ICML'06, pp. 369–376.
[11] Srivastava, R. K., Greff, K., and Schmidhuber, J. (2015). Highway networks. http://arxiv.org/abs/1505.00387 (May 2015). Also at NIPS'2015.
[12] He, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385 (Dec 2015).
[13] Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4):193–202. http://www.scholarpedia.org/article/Neocognitron
[14] Weng, J., Ahuja, N., and Huang, T. S. (1993). Learning recognition and segmentation of 3-D objects from 2-D images. Proc. 4th Intl. Conf. Computer Vision, Berlin, Germany, pp. 121–128.
[15] LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551.
[16] Ranzato, M. A. and LeCun, Y. (2007). A sparse and locally shift invariant feature extractor applied to document images. Proc. ICDAR, 2007.
[17] Scherer, D., Mueller, A., and Behnke, S. (2010). Evaluation of pooling operations in convolutional architectures for object recognition. In Proc. ICANN 2010.
[18] Ciresan, D. C., Meier, U., Gambardella, L. M., and Schmidhuber, J. (2010). Deep big simple neural nets for handwritten digit recognition. Neural Computation, 22(12):3207–3220.
[19] Ciresan, D. C., Meier, U., and Schmidhuber, J. (2012c). Multi-column deep neural networks for image classification. Proc. CVPR, June 2012. Long preprint arXiv:1202.2745v1 [cs.CV], Feb 2012.
[20] Hutter, M. (2005). Universal artificial intelligence: Sequential decisions based on algorithmic probability. Springer. (On J. Schmidhuber's SNF project 20-61847 http://people.idsia.ch/~juergen/unilearn.html )
[21] Schmidhuber, J. (2006). Gödel machines: Fully Self-Referential Optimal Universal Self-Improvers. In B. Goertzel and C. Pennachin, eds.: Artificial General Intelligence, p. 119-226. http://people.idsia.ch/~juergen/goedelmachine.html
[22] Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61, 85-117. http://people.idsia.ch/~juergen/deep-learning-overview.html Short version at scholarpedia: http://www.scholarpedia.org/article/Deep_Learning
[23] Schmidhuber, J. (2010). Formal Theory of Creativity, Fun, and Intrinsic Motivation (1990-2010). IEEE Transactions on Autonomous Mental Development, 2(3):230-247. http://people.idsia.ch/~juergen/creativity.html
[24] Schmidhuber, J. (1997). A Computer Scientist's View of Life, the Universe, and Everything. LNCS 201-288, Springer. http://people.idsia.ch/~juergen/computeruniverse.html
[25] Schmidhuber, J. (2000). Algorithmic theories of everything. http://arxiv.org/abs/quant-ph/0011122. Also at IJFCS 2002 and COLT 2002.
[26] Levin, L. A. (1973). Universal sequential search problems. Problemy Peredachi Informatsii, 9(3):115–116.
[27] Everett III, H. (1973). The theory of the universal wave function. In: The many-worlds interpretation of quantum mechanics.
[28] In the beginning was the code. Transcript of TEDx talk, 2013. http://www.kurzweilai.net/in-the-beginning-was-the-code