An Introduction To Matrix Decom position Lei Zhang/Lead Researcher Microsoft Research Asia 2012-04-17 Outline Matrix Decomposition

PCA, SVD, NMF LDA, ICA, Sparse Coding, etc. What Is Matrix Decomposition? We wish to decompose the matrix A by writing it as a product of two or more matrices: Anm = BnkCkm Suppose A, B, C are column matrices

Anm = (a1, a2, , am), each ai is a n-dim data sample Bnk = (b1, b2, , bk), each bj is a n-dim basis, and space B consists of k bases. Ckm = (c1, c2, , cm), each ci is the k-dim coordinates of ai projected to space B Why We Need Matrix Decomposition? Given one data sample:

a1 = Bnkc1 (a11, a12, , a1n)T = (b1, b2, , bk) (c11, c12, , c1k)T Another data sample: a2 = Bnkc2 More data sample: am = Bnkcm Together (m data samples): (a1, a2, , am) = Bnk (c1, c2, , cm)

Anm = BnkCkm Why We Need Matrix Decomposition? (a1, a2, , am) = Bnk (c1, c2, , cm) Anm = BnkCkm We wish to find a set of new basis B to represent data samples A, and A will become C in the new space. In general, B captures the common features in A, while C carries spec

ific characteristics of the original samples. In PCA: B is eigenvectors In SVD: B is right (column) eigenvectors

In LDA: B is discriminant directions In NMF: B is local features PRINCIPLE COMPONENT ANALYSIS Definition Eigenvalue & Eigenvector Given a m x m matrix C, for any and w, if

Then is called eigenvalue, and w is called eigenvector. Definition Principle Component Analysis Principle Component Analysis (PCA) Karhunen-Loeve transformation (KL transformation) Let A be a n m data matrix in which the rows represent data samples Each row is a data vector, each column represents a variable

A is centered: the estimated mean is subtracted from each col umn, so each column has zero mean. Covariance matrix C (m x m): Principle Component Analysis C can be decomposed as follows: C=UUUUUT U is a diagonal matrix diag(1 2,,n), each i is an eigenvalue

UU is an orthogonal matrix, each column is an eigenvector UUT UU=I UU-1=UUT Maximizing Variance The objective of the rotation transformation is to find the max imal variance Projection of data along w is Aw. Variance: 2w= (Aw)T(Aw) = wTATAw = wTCw

where C = ATA is the covariance matrix of the data (A is cent ered!) Task: maximize variance subject to constraint wTw=1. Optimization Problem Maximize is the Lagrange multiplier Differentiating with respect to w yields

Eigenvalue equation: Cw = ww, where C = ATA. Once the first principal component is found, we continue in the same fashi on to look for the next one, which is orthogonal to (all) the principal comp onent(s) already found.

Property: Data Decomposition PCA can be treated as data decomposition a=UUUUTa =(u1,u2,,un) (u1,u2,,un)T a =(u1,u2,,un) (,,)T =(u1,u2,,un) (b1, b2, , bn)T = biui

Face Recognition Eigenface Turk, M.A.; Pentland, A.P. Face recognition using eigenfaces, C VPR 1991 (Citation: 2654) The eigenface approach

images are points in a vector space use PCA to reduce dimensionality face space compare projections onto face space to recognize faces PageRank Power Iteration

Column j has nonzero elements in positions corresponding to outlinks of j (Nj in total) Row i has nonzero element in positions corresponding to inlin ks Ii. Column-Stochastic & Irreducible Column-Stochastic where

Irreducible Iterative PageRank Calculation For k=1,2, Equivalently (w=1, A is a Markov chain transition matrix)

Why can we use power iteration to find the first eigenvector? Convergence of the power iteration Expand the initial approximation r0 in terms of the eigenvector s SINGULAR VALUE DECOMPOSITION

SVD - Definition Any m x n matrix A, with m n, can be factorized Singular Values And Singular Vectors The diagonal elements j of are the singular values of the ma trix A.

The columns of UU and V are the left singular vectors and right singular vectors respectively. Equivalent form of SVD: Matrix approximation Theorem: Let UUk = (u1 u2 uk), Vk = (v1 v2 vk) and k = diag( 1, 2, , k), and define

Then It means, that the best approximation of rank k for the matrix A is SVD and PCA We can write: Remember that in PCA, we treat A as a row matrix

V is just eigenvectors for A Each column in V is an eigenvector of row matrix A we use V to approximate a row in A Equivalently, we can write: UU is just eigenvectors for AT Each column in UU is an eigenvector of column matrix A We use UU to approximate a column in A

Example - LSI Build a term-by-document matrix A Compute the SVD of A: Approximate A by A = UUVT

UUk: Orthogonal basis, that we use to approximate all the documents Dk: Column j hold the coordinates of document j in the new basis Dk is the projection of A onto the subspace spanned by UUk. SVD and PCA For symmetric A, SVD is closely related to PCA PCA: A = UUUUUT UU and U are eigenvectors and eigenvalues.

SVD: A = UUUVT UU is left(column) eigenvectors V is right(row) eigenvectors U is the same eigenvalues For symmetric A, column eigenvectors equal to row eigenvectors Note the difference of A in PCA and SVD

SVD: A is directly the data, e.g. term-by-document matrix PCA: A is covariance matrix, A=XTX, each row in X is a sample Latent Semantic Indexing (LSI) 1. Document file preparation/ preprocessing: Indexing: collecting terms Use stop list: eliminate meaningless words Stemming

2. Construction term-by-document matrix, sparse matrix storag e. 3. Query matching: distance measures. 4. Data compression by low rank approximation: SVD 5. Ranking and relevance feedback. Latent Semantic Indexing

Assumption: there is some underlying latent semantic structur e in the data. E.g. car and automobile occur in similar documents, as do cow s and sheep. This structure can be enhanced by projecting the data (the ter m-by-document matrix and the queries) onto a lower dimensi onal space using SVD.

Similarity Measures Term to term AAT = UU2UUT = (UU)(UU)T UU are the coordinates of A (rows) projected to space V Document to document ATA = V2VT = (V)(V)T

V are the coordinates of A (columns) projected to space UU Similarity Measures Term to document A = UUVT = (UU)(V)T UU are the coordinates of A (rows) projected to space V V are the coordinates of A (columns) projected to space UU

HITS (Hyperlink Induced Topic Search) Idea: Web includes two flavors of prominent pages: authorities contain high-quality information, hubs are comprehensive lists of links to authorities A page is a good authority, if many hubs point to it. A page is a good hub if it points to many authorities.

Good authorities are pointed to by good hubs and good hubs point to good authorities. Hubs Authorities Power Iteration

Each page i has both a hub score hi and an authority score ai. HITS successively refines these scores by computing Define the adjacency matrix L of the directed web graph: Now HITS and SVD

L: rows are outlinks, columns are inlinks. a will be the dominant eigenvector of the authority matrix LTL h will be the dominant eigenvector of the hub matrix LLT They are in fact the first left and right singular vectors of L!! We are in fact running SVD on the adjacency matrix. HITS vs PageRank

PageRank may be computed once, HITS is computed per quer y. HITS takes query into account, PageRank doesnt. PageRank has no concept of hubs HITS is sensitive to local topology: insertion or deletion of a s mall number of nodes may change the scores a lot. PageRank more stable, because of its random jump step.

NMF NON-NEGATIVE MATRIX FAC TORIZATION Definition Given a nonnegative matrix Vnm, find non-negative matrix factor s Wnk and Hkm such that VnmWnkHkm V: column matrix, each column is a data sample (n-dimension)

Wi: k-basis represents one base H: coordinates of V projected to W vj Wnkhj Motivation Non-negativity is natural in many applications... Probability is also non-negative Additive model to capture local structure

Multiplicative Update Algorithm Cost function Euclidean distance Multiplicative Update Multiplicative Update Algorithm Cost function Divergence

Reduce to Kullback-Leibler divergence when A and B can be regarded as normalized probability distributions. Multiplicative update PLSA is NMF with KL divergence

NMF vs PCA n = 2429 faces, m = 19x19 pixels Positive values are illustrated with black pixels and negative values wit h red pixels NMF Parts-based representation PCA Holistic representations

Reference D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorizatio n. (pdf) NIPS, 2001. D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. (pdf) Nature 401, 788-791 (1999).

Major Reference Saara Hyvnen, Linear Algebra Methods for Data Mining, Spring 2007, Un iversity of Helsinki (Highly recommend)