VTT TECHNICAL RESEARCH CENTRE OF FINLAND LTD

Future scientific developments of Machine Learning
Data-driven business seminar (Datalähtöinen liiketoiminta -seminaari), 17.11.2016
Dr. Tomi Räty

Definitions of terminology

Artificial intelligence (AI): Intelligence exhibited by machines. In computer science, an ideal intelligent machine is a flexible rational agent that perceives its environment and takes actions that maximize its chance of success at some goal.

Machine learning (ML): Branch of artificial intelligence which focuses on systems that learn from their environment (viz. data).

Data science: Gartner expects that, within a few years, the term "data science" will gain widespread recognition as an umbrella term for many forms of sophisticated analytics.

Big data: A broad term for data sets so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, curation, search, sharing, storage, transfer, visualization, and information privacy.

Sources: Hype Cycle for Advanced Analytics and Data Science, 2014. Gartner, July 2014 & https://en.wikipedia.org/wiki/Artificial_intelligence

Machine learning 101

Machine learning tasks are typically classified into three broad categories:

Supervised learning: The computer is presented with example inputs and their desired outputs, and the goal is to learn a general rule that maps inputs to outputs.

Unsupervised learning: No labels are given to the learning algorithm, leaving it on its own to find structure in its input.

Unsupervised learning can be a goal in itself (discovering hidden patterns in data) or a means towards an end (feature learning).

Reinforcement learning: A computer program interacts with a dynamic environment in which it must achieve a certain goal, without being explicitly told whether it has come close to that goal.

Source: https://en.wikipedia.org/wiki/Machine_learning

Short History of Machine Learning (1/6)

1642: Mechanical adder, known as the Pascaline (could perform addition and subtraction), by Blaise Pascal.

1694: Gottfried Wilhelm von Leibniz created a machine similar to the Pascaline which could perform all four basic arithmetic operations (addition, subtraction, multiplication and division). Leibniz also developed the binary (0-1) number system used by all modern computers.

1801: Joseph Marie Jacquard invented a weaving loom that stored its instructions as data. Cards punched with holes positioned the threads. A collection of these cards coded a program which directed the loom. The process could be repeated with a consistent result every time.

Short History of Machine Learning (2/6)

1847: George Boole created Boolean logic, which uses Boolean operators (AND, OR, NOT).

1890: Herman Hollerith created the first combined system of mechanical calculation and punch cards to rapidly calculate statistics gathered from millions of people.

1936: Alan Turing created the Turing machine, which manipulates symbols on a strip of tape according to a table of rules (a mathematical model of computation).

1944: The Harvard Mark I, the first combined electric and mechanical computer, was designed by Howard Aiken and built by IBM. The Mark I could store 72 numbers and perform a multiplication in 6 seconds and a division in 16 seconds.

Short History of Machine Learning (3/6)

1946: ENIAC (Electronic Numerical Integrator and Computer) was built by John Mauchly and John Eckert. ENIAC was a thousand times faster than the Mark I (1944). ENIAC weighed about 30 000 kilograms.

1952: Arthur Samuel (IBM) used the game of checkers to build a learning program. The program used a supervised learning mode: it observed winning strategies and adapted its programming to use those strategies.

1957: Frank Rosenblatt created the perceptron, which is a type of neural network.

Short History of Machine Learning (4/6)

1967: First programs able to recognize patterns, based on the nearest-neighbor algorithm. When the program is given a new object, it compares it with the data from the training set and classifies the new object according to its nearest neighbor.

1981: EBL (Explanation-Based Learning) was presented by Gerald DeJong. Training examples are provided and analyzed; irrelevant information is discarded in the formation of a general rule.

1985: Neural network breakthrough: multiple people independently discovered the backpropagation algorithm, which allowed more powerful neural networks with hidden layers to be trained.

Short History of Machine Learning (5/6)

1990s: Widespread application of machine learning in data mining, adaptive software, web applications, text learning and language learning. Both supervised and unsupervised learning methods were used. Reinforcement learning was developed.

In 1997, IBM's Deep Blue defeated chess grandmaster Garry Kasparov.

Short History of Machine Learning (6/6)

2000s: Adaptive programming is capable of recognizing patterns, learning from experience, abstracting new information from data, and optimizing the efficiency and accuracy of its processing and output.

In 2010, Microsoft Kinect could track 20 human features at a rate of 30 times per second, allowing people to interact with a computer through movements and gestures.

In 2011, IBM's Watson defeated human competitors (Brad Rutter and Ken Jennings) at Jeopardy!

In 2011, Google Brain was developed. Its deep neural network can learn to discover and categorize objects.

In 2014, Facebook developed DeepFace, which is able to recognize and verify individuals in pictures at a level similar to that of a human being.

In 2015, Skype Translator was released, offering near real-time speech translation to and from English, French, German, Chinese (Mandarin), Italian, Spanish, Portuguese, and Arabic.

In 2016, Google's AlphaGo defeated Lee Sedol (9 dan rank).

By 2016, Google had test-driven its fleet of driverless cars in autonomous mode for a total of 2,400,000 km.

Machine learning: where are we NOW?

Guruduth Banavar (IBM Research):
- 2.5 exabytes of data created per day
- 90% of the data in the world was created in the last two years
- Global scientific output doubles every nine years
- 1100 terabytes of data generated per lifetime
- More than 90% of the data is unstructured
- Lots of companies are sitting on data that they haven't exploited yet; they should start by analyzing their datasets

Vision for the future: Every professional will have a cognitive assistant.

Machine learning: current developments and the future (1/2)

Methods for high-dimensional data
- Tensor methods
- Non-convex optimization
- Spectral theory & Spectral theorem
- Graph theory
- Spectral graph theory
- Spectral clustering
- Evolutionary computation

Deep learning
- Deep Belief Networks (DBN)
- Deep Neural Networks (DNN)
- Convolutional Neural Networks (CNN)
- Recurrent Neural Networks (RNN)
- Long Short-Term Memory (LSTM)
- Gated Recurrent Units (GRU)

Sources: The Future of Machine Intelligence: Perspectives from Leading Practitioners by David Beyer. O'Reilly (2016)

Machine learning: current developments and the future (2/2)

- Causality
- Fiducial inference

Machine learning: current developments and the future
Methods for high-dimensional data (1/7)

Tensor methods
- Tensors are geometric objects that describe linear relations between geometric vectors, scalars, and other tensors.
- Tensors are higher-dimensional extensions of matrices. Matrices can represent pairwise correlations (two-dimensional array); tensors can represent higher-order correlations (multidimensional array).
- Tensors model a much richer class of data. The different modes of the tensor, or the different directions in the tensor, represent different kinds of data.
- Tensor methods are effective because they draw on highly optimized linear algebra libraries and can run on modern systems for large-scale computation.

Sources: https://en.wikipedia.org/wiki/Tensor

Machine learning: current developments and the future
Methods for high-dimensional data (2/7)

Convex optimization
- Unique optimum: every local optimum is also the global optimum

Non-convex optimization
- Multiple local optima
- In higher dimensions, potentially exponentially many local optima
- Most optimization problems are not convex by nature
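
The practical consequence of multiple local optima can be illustrated with a small sketch. It assumes a hypothetical one-dimensional objective, f(x) = x^4 - 3x^2 + x, which is not from the slides, and plain gradient descent with a fixed step size. The function names and parameter values are illustrative only. Started from two different points, the same procedure settles in two different minima, and only one of them is the global one.

```python
def f(x):
    # Hypothetical non-convex objective with two local minima (an assumption).
    return x**4 - 3 * x**2 + x


def grad_f(x):
    # Analytic derivative of f.
    return 4 * x**3 - 6 * x + 1


def gradient_descent(x0, learning_rate=0.01, steps=2000):
    # Plain fixed-step gradient descent starting from x0.
    x = x0
    for _ in range(steps):
        x -= learning_rate * grad_f(x)
    return x


left = gradient_descent(-2.0)   # starts left of both minima
right = gradient_descent(2.0)   # starts right of both minima

print("from x0=-2.0: x=%.4f, f(x)=%.4f" % (left, f(left)))
print("from x0=+2.0: x=%.4f, f(x)=%.4f" % (right, f(right)))
```

For a convex objective both runs would end at the same point. Here the result depends on the starting point, which is one reason why restarts or population-based methods are often used on non-convex problems.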

Machine learning: current developments and the future
Methods for high-dimensional data (3/7)

Spectral theory
- An inclusive term for theories extending the eigenvector and eigenvalue theory of a single square matrix to a much broader theory of the structure of operators in a variety of mathematical spaces.
- Spectral theory generalizes linear algebra and the solution of systems of linear equations.

Spectral theorem
- The spectral theorem provides conditions under which an operator or a matrix can be diagonalized.
- Diagonal matrices are computationally quite manageable.

Machine learning: current developments and the future
Methods for high-dimensional data (4/7)

Graph theory
- Studies mathematical structures used to model pairwise relations between objects.
- A graph in this context is made up of vertices (also called nodes or points) which are connected by edges (also called arcs or lines).
- A graph may be undirected, meaning that there is no distinction between the two vertices associated with each edge, or its edges may be directed from one vertex to another.
- Edges may encode probabilistic dependencies, resulting in probabilistic graphical models.

Examples:
- People on social media: each user is a vertex, and connected users are joined by an edge.
- GPS/Google Maps: locating a route based on the shortest path.
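
The route-finding example above can be sketched with Dijkstra's algorithm, one common shortest-path method for graphs with non-negative edge weights. This is a minimal illustration and not a description of how any particular navigation service works. The graph `roads`, its vertex names and its weights are made up for the example.

```python
import heapq


def shortest_path(graph, source, target):
    """Dijkstra's algorithm on a dict-of-dicts graph with non-negative weights."""
    dist = {source: 0.0}          # best known distance to each vertex
    prev = {}                     # predecessor on the best known path
    queue = [(0.0, source)]       # priority queue ordered by distance
    visited = set()
    while queue:
        d, node = heapq.heappop(queue)
        if node in visited:
            continue              # stale queue entry
        visited.add(node)
        if node == target:
            break
        for neighbor, weight in graph[node].items():
            new_d = d + weight
            if new_d < dist.get(neighbor, float("inf")):
                dist[neighbor] = new_d
                prev[neighbor] = node
                heapq.heappush(queue, (new_d, neighbor))
    if target not in dist:
        return float("inf"), []   # target unreachable
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return dist[target], path[::-1]


# Hypothetical undirected road network: vertices are places, weights are distances.
roads = {
    "A": {"B": 4, "C": 2},
    "B": {"A": 4, "C": 1, "D": 5},
    "C": {"A": 2, "B": 1, "D": 8},
    "D": {"B": 5, "C": 8},
}

distance, route = shortest_path(roads, "A", "D")
print(distance, route)
```

The undirected graph is stored by listing each edge in both directions. A directed graph would list each edge only once.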

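Machine learning: current developments and the future
Methods for high-dimensional data (5/7)

Spectral graph theory
- Study of the properties of a graph in relationship to the characteristic polynomial, eigenvalues, and eigenvectors of matrices associated with the graph, such as its adjacency matrix or Laplacian matrix.
- The representation of a graph in matrix form: Laplacian matrix = Degree matrix - Adjacency matrix

[Figure: labeled graph with six vertices]

Degree matrix D, adjacency matrix A and Laplacian matrix L = D - A of the labeled graph:

$$
D = \begin{pmatrix}
2 & 0 & 0 & 0 & 0 & 0 \\
0 & 3 & 0 & 0 & 0 & 0 \\
0 & 0 & 2 & 0 & 0 & 0 \\
0 & 0 & 0 & 3 & 0 & 0 \\
0 & 0 & 0 & 0 & 3 & 0 \\
0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix},
\quad
A = \begin{pmatrix}
0 & 1 & 0 & 0 & 1 & 0 \\
1 & 0 & 1 & 0 & 1 & 0 \\
0 & 1 & 0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 & 1 & 1 \\
1 & 1 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0
\end{pmatrix},
\quad
L = D - A = \begin{pmatrix}
2 & -1 & 0 & 0 & -1 & 0 \\
-1 & 3 & -1 & 0 & -1 & 0 \\
0 & -1 & 2 & -1 & 0 & 0 \\
0 & 0 & -1 & 3 & -1 & -1 \\
-1 & -1 & 0 & -1 & 3 & 0 \\
0 & 0 & 0 & -1 & 0 & 1
\end{pmatrix}
$$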
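
The matrices on this slide can be reproduced and inspected with a short sketch. It assumes NumPy is available; the variable names are illustrative. The adjacency matrix is copied from the slide, the degree matrix is derived from its row sums, and the eigenvalues of the Laplacian are computed with `numpy.linalg.eigh`, which is meant for symmetric matrices.

```python
import numpy as np

# Adjacency matrix of the six-vertex labeled graph from the slide.
A = np.array([
    [0, 1, 0, 0, 1, 0],
    [1, 0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0],
])

# Degree matrix: the number of edges at each vertex, placed on the diagonal.
D = np.diag(A.sum(axis=1))

# Graph Laplacian.
L = D - A

# L is symmetric, so eigh returns real eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(L)

# The multiplicity of eigenvalue 0 equals the number of connected components.
num_components = int(np.sum(np.isclose(eigenvalues, 0.0, atol=1e-9)))

# The second-smallest eigenvalue (algebraic connectivity) is positive
# exactly when the graph is connected.
fiedler_value = eigenvalues[1]

print(np.round(eigenvalues, 4))
print("connected components:", num_components)
```

Each row of L sums to zero, so the all-ones vector is always an eigenvector with eigenvalue 0. This is the kind of structural property that spectral graph theory reads off the spectrum.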

Machine learning: current developments and the future
Methods for high-dimensional data (6/7)

Spectral clustering
- Uses the spectrum (eigenvalues) of the similarity matrix of the data to perform dimensionality reduction before clustering in fewer dimensions.
- The similarity matrix is provided as an input and consists of a quantitative assessment of the relative similarity of each pair of points in the dataset.

Application of spectral clustering in non-convex optimization
- Appropriate moment (differentiation) tensors (usually the third- or fourth-order correlations) are decomposed to estimate the model parameters consistently.

Machine learning: current developments and the future
Methods for high-dimensional data (7/7)

Evolutionary computation
- Applies trial-and-error problem solving and is considered a global optimization method of a metaheuristic or stochastic character.
- Uses a population of numerous candidate solutions (instead of iterating over one point in the search space).
- A metaheuristic is a higher-level procedure designed to find, generate, or select a heuristic (partial search algorithm) that may provide a sufficiently good solution to an optimization problem, especially with incomplete or imperfect information or limited computation capacity.
- The application of recombination and evolutionary strategies makes these methods less prone to getting stuck in local optima than alternative methods.

Machine learning: current developments and the future
Deep learning (1/4)

Deep learning
- Branch of machine learning based on modelling high-level abstractions in data through a deep graph with multiple processing layers, composed of multiple linear and non-linear transformations.
- Numerous applications, including speech recognition, image recognition, etc.

Machine learning: current developments and the future
Deep learning (2/4)

Deep Belief Networks (DBN)
- Composed of multiple layers of latent variables ("hidden units"), with connections between the layers but not between units within each layer.
- A DBN can be viewed as a stack of simple sub-networks: each sub-network's hidden layer serves as the visible layer for the next.

- This also leads to a fast, layer-by-layer unsupervised training procedure.
- With unsupervised training, a DBN can learn to probabilistically reconstruct its inputs. The layers then act as feature detectors on the inputs.
- After this learning step, a DBN can be trained further in a supervised way to perform classification.

Machine learning: current developments and the future
Deep learning (3/4)

Convolutional Neural Networks (CNN)
- Feed-forward neural network whose units respond to overlapping regions of the input
- E.g., an image is decomposed into overlapping tiles
- Each tile is fed into a single neural network (with shared weights), which creates its own output (the essential information) (convolutional layer)

- The output arrays are downsampled into smaller arrays (subsampling layer)
- After the required levels of convolutional and subsampling layers, the high-level reasoning is performed in the fully connected layer.
- Neurons in the fully connected layer are connected to all activations of the previous layer, so their activations can be computed with a matrix multiplication.

Sources: https://medium.com/@ageitgey/machine-learning-is-fun-part-3-deep-learning-and-convolutional-neural-networks-f40359318721#.7otxt7uww
https://en.wikipedia.org/wiki/Convolutional_neural_network#Pooling_layer

Machine learning: current developments and the future
Deep learning (4/4)

Recurrent neural networks (RNN)
- Artificial neural network where connections between units form a directed cycle.
- This creates an internal state which allows the network to exhibit dynamic temporal behavior.

- Unlike feedforward neural networks, RNNs can use their internal memory to process arbitrary sequences of inputs.
- Successful application in, e.g., handwriting recognition.

Long short-term memory (LSTM)
- RNN which can exploit very long lags of unknown length between important events
- Exceptional in handwriting recognition
- https://www.youtube.com/watch?v=izGl1YSH_JA (8:40->)

Gated Recurrent Units (GRU)
- Similar to LSTM
- Fewer parameters than LSTM
- Lack an output gate

Sources: http://deeplearning.net/tutorial/lstm.html, https://en.wikipedia.org/wiki/Gated_recurrent_unit, https://arxiv.org/pdf/1412.3555v1.pdf
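
The recurrent update and the parameter comparison above can be made concrete with a minimal NumPy sketch of a single GRU layer. It follows the usual update-gate and reset-gate formulation, with the convention h_new = (1 - z) * h + z * h_candidate; some references swap the roles of z and 1 - z. The helper names, the layer sizes and the random weights are assumptions made for illustration, and no training is performed.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def init_gru(input_size, hidden_size, rng):
    # One (W, U, b) block each for update gate z, reset gate r and candidate h.
    params = {}
    for block in ("z", "r", "h"):
        params["W_" + block] = rng.normal(0.0, 0.1, (hidden_size, input_size))
        params["U_" + block] = rng.normal(0.0, 0.1, (hidden_size, hidden_size))
        params["b_" + block] = np.zeros(hidden_size)
    return params


def gru_step(params, x, h):
    # Update gate: how much of the state is replaced by the candidate.
    z = sigmoid(params["W_z"] @ x + params["U_z"] @ h + params["b_z"])
    # Reset gate: how much of the old state feeds the candidate.
    r = sigmoid(params["W_r"] @ x + params["U_r"] @ h + params["b_r"])
    h_candidate = np.tanh(params["W_h"] @ x + params["U_h"] @ (r * h) + params["b_h"])
    # There is no output gate: the new state is exposed directly.
    return (1.0 - z) * h + z * h_candidate


def run_gru(params, sequence, hidden_size):
    # The same weights are reused at every step, so any sequence length works.
    h = np.zeros(hidden_size)
    for x in sequence:
        h = gru_step(params, x, h)
    return h


def gated_rnn_param_count(input_size, hidden_size, num_blocks):
    # Each block has an input matrix, a recurrent matrix and a bias vector.
    per_block = hidden_size * input_size + hidden_size * hidden_size + hidden_size
    return num_blocks * per_block


rng = np.random.default_rng(0)
params = init_gru(input_size=3, hidden_size=4, rng=rng)
h_short = run_gru(params, rng.normal(size=(2, 3)), hidden_size=4)
h_long = run_gru(params, rng.normal(size=(7, 3)), hidden_size=4)

gru_params = gated_rnn_param_count(3, 4, num_blocks=3)   # z, r, candidate
lstm_params = gated_rnn_param_count(3, 4, num_blocks=4)  # input, forget, output, candidate
print(gru_params, lstm_params)
```

The count assumes a standard LSTM without peephole connections. Under that assumption a GRU layer of the same size has three quarters of the LSTM's parameters.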

Machine learning: current developments and the future
Causality

Causality
- Shifting from statistics to interventions and counterfactuals

Five lessons of causal inference
- Every causal task depends on assumptions (or experiments)
- Encode assumptions mathematically and test their implications
- Use mathematical machinery on the assumptions, combine them with the data and derive answers to questions of interest
- Evaluate the scientific plausibility of the assumptions and derive their ramifications
- Exploit the first four lessons to perform manageable causal inference

Machine learning: current developments and the future
Fiducial inference (1/2)

Fiducial inference

Background
- Type of statistical inference

- More popular approaches are frequentist inference, Bayesian inference and decision theory.
- Current research has rediscovered fiducial inference

Basics
- Attempt to perform inverse probability without using prior probability distributions
- Quickly attracted controversy and was never widely accepted

Currently
- Studied under the concept of confidence distributions

https://en.wikipedia.org/wiki/Fiducial_inference

Machine learning: current developments and the future
Fiducial inference (2/2)

Comparison to interval estimation in other modes of statistical inference

Confidence interval, in frequentist inference
- Among all confidence intervals which are computed by the same method with coverage probability γ, a proportion γ will contain the true value which needs to be estimated.

In Bayesian inference
- Credible intervals allow a probability to be given for the event that an interval, once calculated, contains the true value. This proceeds on the basis that a probability distribution can be associated with the state of knowledge about the true value, both before and after the data sample has been acquired.

In fiducial inference
- Provides an inverse probability distribution based on the actual data observed.
- Derives a fiducial distribution: a measure of the degree of faith that can be placed on any given value of the unknown parameter, which is faithful to the data (the method uses all available information).
- No general definition; examples exist only for a single parameter, and different generalizations exist for multiple parameters.

https://en.wikipedia.org/wiki/Fiducial_inference

Machine learning: current developments and the future

AI can be a major driver of economic growth and social progress. Report on Preparing for the Future of Artificial Intelligence, Executive Office of the President, National Science and Technology Council, Committee on Technology (October 2016)

Joseph Sirosh, corporate vice president of Data Group and Machine Learning at Microsoft: ACID (Algorithms, Cloud, Internet-of-Things & Data)

According to the KDnuggets article Machine Learning Trends and the Future of Artificial Intelligence:

- Every company can now be a data company
- Every company can now access algorithmic intelligence
- Every app can now be an intelligent app

Gurjeet Singh, CEO of Ayasdi: This is the golden age of machine learning.

Tim O'Reilly, founder of O'Reilly: We're just at the beginning of an explosion of intelligent software.

Sources: http://www.geekwire.com/2016/future-machine-learning-5-trends-watch-around-algorithms-cloud-iot-big-data/
http://www.kdnuggets.com/2016/06/machine-learning-trends-future-ai.html

Additional material

Online courses, training material and books:

- Beginner:

  Coursera class on Machine Learning by Andrew Ng: https://www.coursera.org/course/ml
  Pattern Recognition and Machine Learning by Christopher M. Bishop
  Building Machine Learning Systems with Python (2nd edition) by Willi Richert and Luis Pedro Coelho (programming)
  Python Machine Learning by Sebastian Raschka (programming)
  Mastering Machine Learning with scikit-learn by Gavin Hackeling (programming)

- Intermediate:
  Pattern Recognition by Sergios Theodoridis and Konstantinos Koutroumbas
  The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani and Jerome Friedman: http://web.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf

- Advanced:
  Machine Learning: A Probabilistic Perspective by Kevin P. Murphy
  Probabilistic Graphical Models by Daphne Koller and Nir Friedman

Contact

Dr. Tomi Räty
Tel: +358 40 520 7639
Email: [email protected]

TECHNOLOGY FOR BUSINESS