Time-Space Hardness for Learning Problems Avishay Tal (Stanford) Based on joint works with Sumegha Garg, Gillat Kol & Ran Raz Learning The Streaming Model Black Box (0,1,1,0,1) (0,0,1,0,1) (1,1,0,1,0)

stream of examples [Shamir2014] [Steinhardt-Valiant-Wager'2015] aka Online Learning 1 0 Learner Examples of Learning Problems Parity Learning: for DNF Learning: is a small size DNF formula Decision Tree Learning:

is a small size decision tree Junta Learning: depends only on of the input bits. Parity Learning Problem is unknown to the learner Given a stream of examples where and , the learner needs to learn with high probability. Parity Learning Problem is chosen uniformly at random

is unknown to the learner Given a stream of examples where and , the learner needs to learn with high probability. :Algorithms for Parity Learning 1. Gaussian Elimination memory bits, samples. 2. Trying all possibilities memory bits, samples. Razs Breakthrough Theorem [Raz16]: Any algorithm for parity learning requires either memory bits or an

exponential number of samples. Sparse Parities Could we learn better if we knew that is -sparse (i.e., )? Note: any -sparse parity is also: size DNF formula, size decision-tree, Junta on variables. Lower bounds for learning -sparse parities Lower bounds for learning all of the above

Upper Bounds 1. Trying all possibilities: samples memory bits 2. Record and Eliminate (like Gaussian Elim.) i. Record equations in memory. ii. Check which of all possible -sparse vectors satisfies the recorded equations. samples memory bits Algorithm #3: O(n) memory and samples.

Can we learn -sparse parities in O(n) memory and polynomial number of samples? !No Theorem [Kol-Raz-T17] Any algorithm for -sparse parity learning requires either memory bits or samples. -sparse parity learning requires either memory or samples. Motivation: Cryptography [Raz 16, Valiant-Valiant 16] Applications to Bounded Storage Crypto:

Encryption/Decryption scheme with: Keys length: Encryption/Decryption time: Unconditional security, if the attackers memory size is at most Previous works assumed that the attackers memory size is at most linear in the time needed for encryption/decryption Motivation: Cryptography [Raz 16, Valiant-Valiant 16, Kol-Raz-T 16] Applications to Bounded Storage Crypto: Encryption/Decryption scheme with:

Keys length: Encryption/Decryption time: Unconditional security, if the attackers memory size is at most In the second part of the talk: Keys length: Encryption/Decryption time: Secure against memory size Motivation: Complexity Theory Time-Space Lower Bounds have been studied in many models [Beame-Jayram-Saks 98, Ajtai 99, Beame-Saks-Sun-Vee00, Fortnow 97,

Fortnow-Lipton-van Melkebeek-Viglas05, Williams06,] Main difference: the online model is easier to prove lower bounds against, since the input is read only once. The Branching Program (BP) Model ( , ) ( , ) (width) (length)

Each layer represents a time step. Each vertex represents a memory state of the learner. Each non-leaf vertex has outgoing edges, one for each . The Branching Program (BP) Model ( , ) ( , ) (width) (length)

A sequence of random examples defines a computation path in the BP. The path finally reaches a leaf and outputs , a guess for the value of . The program is successful if . Affine Branching Programs (ABP) ( , ) ( , ) (width) (length)

An ABP is a BP where each vertex remembers a set of linear equations in the variables , such that, if is reached by the computation-path then all equations in are satisfied (by the true unknown ). Accurate Affine BPs Let be the vertex reached by the computational path of the ABP in layer . is a random variable that depends on . = the distribution of conditioned on reaching a specific vertex in layer . Accurate ABP: for every , is close to uniform over the set of (-sparse) solutions to the eqs .

Proof Plan We follow Razs two steps plan: 1. Simulate any BP for sparse parity learning with an accurate ABP. 2. Prove that ABP for sparse parity learning must be either wide or long. Fix some parameter . In the ABP, all vertices will be labeled with at most equations. Once we reach a vertex with equations in the ABP we declare success. Proof Highlights Simulation Part Layer by layer, we convert the BP to an ABP. For , we convert the -th layer of the program.

Every vertex in the -th layer is split into many vertices by regrouping the edges entering . ( , ) ( , ) (

( , ) ) , Regrouping We partition the edges going into to (not too many) groups,

and associate with each group a set of accurate equations. Layer i-1 Layer i

....

Main Lemma Each edge going into remembers a set of equations Main Lemma ( , )

( , ) ( , ) Either:

1. There exists an equation that is shared by many of the edges. 2. is close to uniform (over all -sparse vectors). Regrouping from Main Lemma ( , )

( , ) ( , ) Main Lemma: Either

1. There exists an equation that is shared by many of the edges. 2. is close to uniform (over all -sparse vectors). Applying the main lemma recursively times, we find a large fraction of the edges with common eqs s.t. conditioned on passing through one of these edges, is close to uniform over all (-sparse) solutions to the eqs.

Proof on White Board Lower Bounds on the Affine BP Recall: all subspaces in the Affine BP are defined by at most equations. Success = learned equations. ( , ) ( , )

( ( , ) ) , Fix a node in the Affine BP with linearly independent eqs. [Raz16]: prob. of reaching is at most To succeed whp, the width should be .

Proof on White Board Conclusion First Part Main Theorem: Learning -sparse parities requires either memory bits or number of samples. Implies same bounds for learning size DNF formula size Decision trees Juntas on variables Open: proving tight samples-memory hardness for learning DNFs, Decision Trees, or Juntas

Lower Bounds more Generally Q: Can we generalize the lower bounds to hold for problems not involving parities? [Raz17, Moshkovitz-Moshkovitz17, MoshkovitzMoshkovitz18]: Yes A new and general proof technique (we shall focus on Razs proof technique) As a special case: a new proof for the memory-samples lower bound for parity learning. [Garg-Raz-T18, Beame-Oveis Gharan-Yang18]: Further generalizations of the method & more applications A Learning Problem as a Matrix : finite sets : concept class

: possible samples : a matrix is chosen uniformly at random A learner tries to learn from a stream , where : and Extractor-Based Lower Bounds for Learning Thm [Garg-Raz-T18] Assume that any submatrix of of fraction has bias of at most . Then, any learning algorithm for the learning problem defined by requires either: memory bits,

or samples. 2 Independently, [BeameOveis Gharan-Yang18] got a similar result 2

Applications of Extractor-Based Theorem Learning Parities Learning Sparse Parities and implications Learning from low-degree equations: A learner tries to learn , from random polynomial equations of degree at most , over . memory or samples Learning low-degree polynomials: A learner tries to learn an -variate multilinear polynomial of degree at most over , from random evaluations of over . memory or samples and more Technique to Prove Extractor Property

: {1,1} : the learning matrix Defn: We say that the columns of are -almost orthogonal if for each column , at most of the columns have . Claim: Suppose the columns of are -almost orthogonal, for . Then, learning requires either memory bits or samples Recall: The Branching Program Model ( , ) ( , )

(width) (length) Each layer represents a time step. Each vertex represents a memory state of the learner. Each non-leaf vertex has outgoing edges, one for each . Proof Overview = the distribution of conditioned on reaching a specific vertex .

| |=2 Significant vertices: s.t. = probability that the path reaches . We prove: If is significant, Hence, there are at least significant vertices. = same as the computational path, but stops when atypical things happen (stopping rules) is exp small Proof Overview If is significant, Progress Function: For layer ,

1) 2) is very slowly growing: (as long as number of steps is at most ) 3) If , then Hence: If is significant, Open Problems Optimal tradeoffs for DNFs, Juntas, Decision Trees. What are the limits of the Extractor-Based lower bounds for these problems? Characterize memory-samples complexity from properties of the learning matrix . Generalize to Real-Valued Domains Generalize to k-passes (some progress)

Open Problem: Understanding Neural Nets Expressiveness and Learnability are empirically different in Neural Nets. Consider the following experiment: Generate (input,output) pairs from a depth-2 NN with a fixed structure & randomly chosen weights. Try to learn weights from (input,output) pairs using stochastic gradient descent. This usually fails. Can this be explained by the low-memory of the learner? !Thank You