Next generation data mining tools

Next generation data mining tools

Carnegie Mellon Powerful Tools for Data Mining Fractals, power laws, SVD C. Faloutsos Carnegie Mellon University Carnegie Mellon Part I: Singular Value Decomposition (SVD) = Eigenvalue decomposition www.cs.cmu.edu/~christos/ MDIC-04 Copyright: C. Faloutsos (2004) 2 Carnegie Mellon SVD - outline Motivation Definition - properties

Interpretation; more properties Complexity Applications - success stories Conclusions MDIC-04 Copyright: C. Faloutsos (2004) 3 Carnegie Mellon SVD - Motivation problem #1: text - LSI: find concepts problem #2: compression / dim. reduction MDIC-04 Copyright: C. Faloutsos (2004) 4 Carnegie Mellon SVD - Motivation problem #1: text - LSI: find concepts MDIC-04 Copyright: C. Faloutsos (2004)

5 Carnegie Mellon SVD - Motivation problem #2: compress / reduce dimensionality MDIC-04 Copyright: C. Faloutsos (2004) 6 Carnegie Mellon Problem - specs ~10**6 rows; ~10**3 columns; no updates; Compress / extract features MDIC-04 Copyright: C. Faloutsos (2004) 7 Carnegie Mellon SVD - Motivation MDIC-04

Copyright: C. Faloutsos (2004) 8 Carnegie Mellon SVD - Motivation MDIC-04 Copyright: C. Faloutsos (2004) 9 Carnegie Mellon SVD - outline Motivation Definition - properties Interpretation; more properties Complexity Applications - success stories Conclusions MDIC-04

Copyright: C. Faloutsos (2004) 10 Carnegie Mellon SVD - Definition (reminder: matrix multiplication 1 2 3 4 5 6 3x2 MDIC-04 1 x -1 = 2x1 Copyright: C. Faloutsos (2004) 11 Carnegie Mellon SVD - Definition (reminder: matrix multiplication

1 2 3 4 5 6 3x2 MDIC-04 1 x -1 = 2x1 3x1 Copyright: C. Faloutsos (2004) 12 Carnegie Mellon SVD - Definition (reminder: matrix multiplication 1 2 3 4 5 6 3x2 MDIC-04 -1

1 x -1 = 2x1 3x1 Copyright: C. Faloutsos (2004) 13 Carnegie Mellon SVD - Definition (reminder: matrix multiplication 1 2 3 4 5 6 3x2 MDIC-04 -1 = -1 1 x -1

2x1 3x1 Copyright: C. Faloutsos (2004) 14 Carnegie Mellon SVD - Definition (reminder: matrix multiplication 1 2 3 4 5 6 MDIC-04 1 x -1 -1 = -1 -1 Copyright: C. Faloutsos (2004) 15 Carnegie Mellon

SVD - notation Conventions: bold capitals -> matrix (eg. A, U, , V) bold lower-case -> column vector (eg., x, v1, u3) regular lower-case -> scalars (eg., 1 , r ) MDIC-04 Copyright: C. Faloutsos (2004) 16 Carnegie Mellon SVD - Definition A[n x m] = U[n x r] r x r] (V[m x r])T A: n x m matrix (eg., n documents, m terms) U: n x r matrix (n documents, r concepts) : r x r diagonal matrix (strength of each concept) (r : rank of the matrix) V: m x r matrix (m terms, r concepts) MDIC-04 Copyright: C. Faloutsos (2004) 17 Carnegie Mellon

SVD - Definition A = U VT - example: MDIC-04 Copyright: C. Faloutsos (2004) 18 Carnegie Mellon SVD - Properties THEOREM [Press+92]: always possible to decompose matrix A into A = U VT , where U, V: unique (*) U, V: column orthonormal (ie., columns are unit vectors, orthogonal to each other) UT U = I; VT V = I (I: identity matrix) : eigenvalues are positive, and sorted in decreasing order MDIC-04 Copyright: C. Faloutsos (2004) 19 Carnegie Mellon SVD - Example A = U VT - example: retrieval

inf. lung brain data CS MD 1 2 1 5 0 0 0 1 2 1 5 0 0 0 MDIC-04 1 2 1 5 0

0 0 0 0 0 0 2 3 1 0 0 0 0 2 3 1 = 0.18 0.36 0.18 0.90 0 0 0 0 0 0

0 0.53 0.80 0.27 x 9.64 0 0 5.29 x 0.58 0.58 0.58 0 0 0 0 0 0.71 0.71 Copyright: C. Faloutsos (2004) 20 Carnegie Mellon SVD - Example A = U VT - example: retrieval CS-concept inf. lung MD-concept

brain data CS MD 1 2 1 5 0 0 0 1 2 1 5 0 0 0 MDIC-04 1 2 1 5 0 0 0

0 0 0 0 2 3 1 0 0 0 0 2 3 1 = 0.18 0.36 0.18 0.90 0 0 0 0 0 0 0 0.53

0.80 0.27 x 9.64 0 0 5.29 x 0.58 0.58 0.58 0 0 0 0 0 0.71 0.71 Copyright: C. Faloutsos (2004) 21 Carnegie Mellon SVD - Example doc-to-concept similarity matrix retrieval CS-concept inf. lung MD-concept brain

data A = U VT - example: CS MD 1 2 1 5 0 0 0 1 2 1 5 0 0 0 MDIC-04 1 2 1 5 0 0

0 0 0 0 0 2 3 1 0 0 0 0 2 3 1 = 0.18 0.36 0.18 0.90 0 0 0 0 0 0 0

0.53 0.80 0.27 x 9.64 0 0 5.29 x 0.58 0.58 0.58 0 0 0 0 0 0.71 0.71 Copyright: C. Faloutsos (2004) 22 Carnegie Mellon SVD - Example A = U VT - example: retrieval inf. lung brain data

CS MD 1 2 1 5 0 0 0 1 2 1 5 0 0 0 MDIC-04 1 2 1 5 0 0 0 0

0 0 0 2 3 1 0 0 0 0 2 3 1 = strength of CS-concept 0.18 0.36 0.18 0.90 0 0 0 0 0 0 0 0.53 0.80

0.27 x 9.64 0 0 5.29 x 0.58 0.58 0.58 0 0 0 0 0 0.71 0.71 Copyright: C. Faloutsos (2004) 23 Carnegie Mellon SVD - Example A = U VT - example: term-to-concept similarity matrix retrieval inf. lung

brain data CS MD 1 2 1 5 0 0 0 1 2 1 5 0 0 0 MDIC-04 1 2 1 5 0 0 0

0 0 0 0 2 3 1 0 0 0 0 2 3 1 = 0.18 0.36 0.18 0.90 0 0 0 0 0 0 0 0.53

0.80 0.27 CS-concept x 9.64 0 0 5.29 x 0.58 0.58 0.58 0 0 0 0 0 0.71 0.71 Copyright: C. Faloutsos (2004) 24 Carnegie Mellon SVD - Example A = U VT - example: term-to-concept similarity matrix retrieval

inf. lung brain data CS MD 1 2 1 5 0 0 0 1 2 1 5 0 0 0 MDIC-04 1 2 1 5 0

0 0 0 0 0 0 2 3 1 0 0 0 0 2 3 1 = 0.18 0.36 0.18 0.90 0 0 0 0 0 0

0 0.53 0.80 0.27 CS-concept x 9.64 0 0 5.29 x 0.58 0.58 0.58 0 0 0 0 0 0.71 0.71 Copyright: C. Faloutsos (2004) 25 Carnegie Mellon SVD - outline

Motivation Definition - properties Interpretation; more properties Complexity Applications - success stories Conclusions MDIC-04 Copyright: C. Faloutsos (2004) 26 Carnegie Mellon SVD - Interpretation #1 documents, terms and concepts: [Foltz+92] U: document-to-concept similarity matrix V: term-to-concept sim. matrix : its diagonal elements: strength of each concept MDIC-04 Copyright: C. Faloutsos (2004) 27

Carnegie Mellon SVD - Interpretation #1 documents, terms and concepts: Q: if A is the document-to-term matrix, what is AT A? Q: A AT ? MDIC-04 Copyright: C. Faloutsos (2004) 28 Carnegie Mellon SVD - Interpretation #1 documents, terms and concepts: Q: if A is the document-to-term matrix, what is AT A? A: term-to-term ([m x m]) similarity matrix Q: A AT ? A: document-to-document ([n x n]) similarity matrix MDIC-04 Copyright: C. Faloutsos (2004) 29 Carnegie Mellon

SVD - Interpretation #2 best axis to project on: (best = min sum of squares of projection errors) MDIC-04 Copyright: C. Faloutsos (2004) 30 Carnegie Mellon SVD - Motivation MDIC-04 Copyright: C. Faloutsos (2004) 31 Carnegie Mellon SVD - interpretation #2 SVD: gives best axis to project v1 minimum RMS error MDIC-04 Copyright: C. Faloutsos (2004)

32 Carnegie Mellon SVD - Interpretation #2 MDIC-04 Copyright: C. Faloutsos (2004) 33 Carnegie Mellon SVD - Interpretation #2 A = U VT - example: 1 2 1 5 0 0 0 1 2 1 5 0 0

0 MDIC-04 1 2 1 5 0 0 0 0 0 0 0 2 3 1 0 0 0 0 2 3 1 = 0.18 0.36

0.18 0.90 0 0 0 0 0 0 0 0.53 0.80 0.27 x 9.64 0 0 5.29 x v1 0.58 0.58 0.58 0 0 0 0 0 0.71 0.71 Copyright: C. Faloutsos (2004)

34 Carnegie Mellon SVD - Interpretation #2 A = U VT - example: variance (spread) on the v1 axis 1 2 1 5 0 0 0 1 2 1 5 0 0 0 MDIC-04 1 2 1 5 0 0 0

0 0 0 0 2 3 1 0 0 0 0 2 3 1 = 0.18 0.36 0.18 0.90 0 0 0 0 0 0 0 0.53

0.80 0.27 x 9.64 0 0 5.29 x 0.58 0.58 0.58 0 0 0 0 0 0.71 0.71 Copyright: C. Faloutsos (2004) 35 Carnegie Mellon SVD - Interpretation #2 A = U VT - example: U gives the coordinates of the points in the projection axis 1 2 1 5

0 0 0 1 2 1 5 0 0 0 MDIC-04 1 2 1 5 0 0 0 0 0 0 0 2 3 1 0 0

0 0 2 3 1 = 0.18 0.36 0.18 0.90 0 0 0 0 0 0 0 0.53 0.80 0.27 x 9.64 0 0 5.29 x

0.58 0.58 0.58 0 0 0 0 0 0.71 0.71 Copyright: C. Faloutsos (2004) 36 Carnegie Mellon SVD - Interpretation #2 More details Q: how exactly is dim. reduction done? 1 2 1 5 0 0 0 1 2 1 5 0 0 0

MDIC-04 1 2 1 5 0 0 0 0 0 0 0 2 3 1 0 0 0 0 2 3 1 = 0.18 0.36 0.18 0.90

0 0 0 0 0 0 0 0.53 0.80 0.27 x 9.64 0 0 5.29 x 0.58 0.58 0.58 0 0 0 0 0 0.71 0.71 Copyright: C. Faloutsos (2004) 37 Carnegie Mellon

SVD - Interpretation #2 More details Q: how exactly is dim. reduction done? A: set the smallest eigenvalues to zero: 1 2 1 5 0 0 0 1 2 1 5 0 0 0 MDIC-04 1 2 1 5 0 0 0 0

0 0 0 2 3 1 0 0 0 0 2 3 1 = 0.18 0.36 0.18 0.90 0 0 0 0 0 0 0 0.53 0.80 0.27

x 9.64 0 0 5.29 x 0.58 0.58 0.58 0 0 0 0 0 0.71 0.71 Copyright: C. Faloutsos (2004) 38 Carnegie Mellon SVD - Interpretation #2 1 2 1 5 0 0 0

1 2 1 5 0 0 0 MDIC-04 1 2 1 5 0 0 0 0 0 0 0 2 3 1 0 0 0 0 2 3

1 ~ 0.18 0.36 0.18 0.90 0 0 0 0 0 0 0 0.53 0.80 0.27 x 9.64 0 0 0 x 0.58 0.58 0.58 0 0 0 0

0 0.71 0.71 Copyright: C. Faloutsos (2004) 39 Carnegie Mellon SVD - Interpretation #2 1 2 1 5 0 0 0 1 2 1 5 0 0 0 MDIC-04 1 2 1

5 0 0 0 0 0 0 0 2 3 1 0 0 0 0 2 3 1 ~ 0.18 0.36 0.18 0.90 0 0 0 0

0 0 0 0.53 0.80 0.27 x 9.64 0 0 0 x 0.58 0.58 0.58 0 0 0 0 0 0.71 0.71 Copyright: C. Faloutsos (2004) 40 Carnegie Mellon SVD - Interpretation #2 1 2

1 5 0 0 0 1 2 1 5 0 0 0 MDIC-04 1 2 1 5 0 0 0 0 0 0 0 2 3 1

0 0 0 0 2 3 1 ~ 0.18 0.36 0.18 0.90 0 0 0 x 9.64 x 0.58 0.58 0.58 0 Copyright: C. Faloutsos (2004) 0 41 Carnegie Mellon

SVD - Interpretation #2 1 2 1 5 0 0 0 1 2 1 5 0 0 0 MDIC-04 1 2 1 5 0 0 0 0 0 0

0 2 3 1 0 0 0 0 2 3 1 ~ 1 2 1 5 0 0 0 1 2 1 5 0 0 0 1

2 1 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Copyright: C. Faloutsos (2004) 42 Carnegie Mellon SVD - Interpretation #2 Exactly equivalent: spectral decomposition of the matrix:

1 2 1 5 0 0 0 1 2 1 5 0 0 0 MDIC-04 1 2 1 5 0 0 0 0 0 0 0 2 3

1 0 0 0 0 2 3 1 = 0.18 0.36 0.18 0.90 0 0 0 0 0 0 0 0.53 0.80 0.27 x 9.64 0 0

5.29 x 0.58 0.58 0.58 0 0 0 0 0 0.71 0.71 Copyright: C. Faloutsos (2004) 43 Carnegie Mellon SVD - Interpretation #2 Exactly equivalent: spectral decomposition of the matrix: 1 2 1 5 0 0 0 1 2 1 5

0 0 0 MDIC-04 1 2 1 5 0 0 0 0 0 0 0 2 3 1 0 0 0 0 2 3 1 = u1

u2 x 1 2 x v1 v2 Copyright: C. Faloutsos (2004) 44 Carnegie Mellon SVD - Interpretation #2 Exactly equivalent: spectral decomposition of the matrix: m n 1 2 1 5 0 0 0

1 2 1 5 0 0 0 MDIC-04 1 2 1 5 0 0 0 0 0 0 0 2 3 1 0 0 0 0 2 3

1 = 1 u1 vT1 + Copyright: C. Faloutsos (2004) 2 u2 vT2 +... 45 Carnegie Mellon SVD - Interpretation #2 Exactly equivalent: spectral decomposition of the matrix: m n 1 2 1 5 0 0 0 1 2 1

5 0 0 0 MDIC-04 1 2 1 5 0 0 0 0 0 0 0 2 3 1 0 0 0 0 2 3 1 r terms

= 1 u1 vT1 + nx1 2 u2 vT2 +... 1xm Copyright: C. Faloutsos (2004) 46 Carnegie Mellon SVD - Interpretation #2 approximation / dim. reduction: by keeping the first few terms (Q: how many?) m n 1 2 1 5 0 0 0 1 2

1 5 0 0 0 MDIC-04 1 2 1 5 0 0 0 0 0 0 0 2 3 1 0 0 0 0 2 3 1

= 1 u1 vT1 + 2 u2 vT2 +... assume: 1 >= 2 >= ... Copyright: C. Faloutsos (2004) 47 Carnegie Mellon SVD - Interpretation #2 A (heuristic - [Fukunaga]): keep 80-90% of energy (= sum of squares of i s) m n 1 2 1 5 0 0 0 1 2 1 5

0 0 0 MDIC-04 1 2 1 5 0 0 0 0 0 0 0 2 3 1 0 0 0 0 2 3 1 =

1 u1 vT1 + 2 u2 vT2 +... assume: 1 >= 2 >= ... Copyright: C. Faloutsos (2004) 48 Carnegie Mellon SVD - outline Motivation Definition - properties Interpretation #1: documents/terms/concepts #2: dim. reduction #3: picking non-zero, rectangular blobs #4: fixed points - graph interpretation Complexity ... MDIC-04 Copyright: C. Faloutsos (2004) 49

Carnegie Mellon SVD - Interpretation #3 finds non-zero blobs in a data matrix 1 2 1 5 0 0 0 1 2 1 5 0 0 0 MDIC-04 1 2 1 5 0 0 0 0

0 0 0 2 3 1 0 0 0 0 2 3 1 = 0.18 0.36 0.18 0.90 0 0 0 0 0 0 0 0.53 0.80 0.27

x 9.64 0 0 5.29 x 0.58 0.58 0.58 0 0 0 0 0 0.71 0.71 Copyright: C. Faloutsos (2004) 50 Carnegie Mellon SVD - Interpretation #3 finds non-zero blobs in a data matrix 1 2 1 5 0 0 0

1 2 1 5 0 0 0 MDIC-04 1 2 1 5 0 0 0 0 0 0 0 2 3 1 0 0 0 0 2 3

1 = 0.18 0.36 0.18 0.90 0 0 0 0 0 0 0 0.53 0.80 0.27 x 9.64 0 0 5.29 x 0.58 0.58 0.58 0 0 0 0

0 0.71 0.71 Copyright: C. Faloutsos (2004) 51 Carnegie Mellon SVD - Interpretation #3 finds non-zero blobs in a data matrix 1 2 1 5 0 0 0 1 2 1 5 0 0 0 MDIC-04 1 2 1

5 0 0 0 0 0 0 0 2 3 1 0 0 0 0 2 3 1 = 0.18 0.36 0.18 0.90 0 0 0 0

0 0 0 0.53 0.80 0.27 x 9.64 0 0 5.29 x 0.58 0.58 0.58 0 0 0 0 0 0.71 0.71 Copyright: C. Faloutsos (2004) 52 Carnegie Mellon SVD - Interpretation #3 finds non-zero blobs in a data matrix 1 2

1 5 0 0 0 1 2 1 5 0 0 0 MDIC-04 1 2 1 5 0 0 0 0 0 0 0 2 3 1

0 0 0 0 2 3 1 = 0.18 0.36 0.18 0.90 0 0 0 0 0 0 0 0.53 0.80 0.27 x 9.64 0 0 5.29

x 0.58 0.58 0.58 0 0 0 0 0 0.71 0.71 Copyright: C. Faloutsos (2004) 53 Carnegie Mellon SVD - Interpretation #4 A as vector transformation x 2 1 = MDIC-04 A 2 1 x 1

3 1 0 Copyright: C. Faloutsos (2004) x x 54 Carnegie Mellon SVD - Interpretation #4 if A is symmetric, then, by defn., its eigenvectors remain parallel to themselves (fixed points) v1 1 0.52 3.62 * 0.85 = MDIC-04 v1 A

2 1 1 3 0.52 0.85 Copyright: C. Faloutsos (2004) 55 Carnegie Mellon SVD - Interpretation #4 fixed point operation (AT A is symmetric) AT A vi = i2 vi (i=1,.., r) [Property 1] eg., A: transition matrix of a Markov Chain; then v1 is related to the steady state probability distribution (= a particle, doing a random walk on the graph) convergence: for (almost) any v: (AT A)k v ~ (constant) v1 [Property 2] where v1: strongest right eigenvector MDIC-04 Copyright: C. Faloutsos (2004)

56 Carnegie Mellon SVD - Interpretation #4 convergence - illustration: MDIC-04 Copyright: C. Faloutsos (2004) 57 Carnegie Mellon SVD - Interpretation #4 convergence - illustration: MDIC-04 Copyright: C. Faloutsos (2004) 58 Carnegie Mellon SVD - Interpretation #4 v1 ... vr : right eigenvectors symmetrically: u1 ... ur: left eigenvectors for matrix A (= U VT ): A v i = i u i u i T A = i v i T

(right eigenvectors -> authority scores in Kleinbergs algo; left ones -> hub scores) MDIC-04 Copyright: C. Faloutsos (2004) 59 Carnegie Mellon SVD - outline Motivation Definition - properties Interpretation Complexity Applications - success stories Conclusions MDIC-04 Copyright: C. Faloutsos (2004) 60

Carnegie Mellon SVD - Complexity O( n * m * m) or O( n * n * m) (whichever is less) less work, if we just want eigenvalues ... or if we want first k eigenvectors ... or if the matrix is sparse [Berry] Implemented: in any linear algebra package (LINPACK, matlab, Splus, mathematica ...) MDIC-04 Copyright: C. Faloutsos (2004) 61 Carnegie Mellon SVD - conclusions so far SVD: A= U VT : unique (*) U: document-to-concept similarities V: term-to-concept similarities : strength of each concept MDIC-04 Copyright: C. Faloutsos (2004) 62

Carnegie Mellon SVD - conclusions so far dim. reduction: keep the first few strongest eigenvalues (80-90% of energy) SVD: picks up linear correlations SVD: picks up non-zero blobs v1: fixed point (-> steady-state prob.) MDIC-04 Copyright: C. Faloutsos (2004) 63 Carnegie Mellon SVD - outline Motivation Definition - properties Interpretation Complexity Applications - success stories Conclusions

MDIC-04 Copyright: C. Faloutsos (2004) 64 Carnegie Mellon SVD - Case studies ... Applications - success stories MDIC-04 multi-lingual IR; LSI queries compression PCA - ratio rules Karhunen-Lowe transform Kleinberg/google algorithm Copyright: C. Faloutsos (2004) 65 Carnegie Mellon

Case study - LSI Q1: How to do queries with LSI? Q2: multi-lingual IR (english query, on spanish text?) MDIC-04 Copyright: C. Faloutsos (2004) 66 Carnegie Mellon Case study - LSI Q1: How to do queries with LSI? Problem: Eg., find documents with data retrieval inf. brain lung data CS MD 1 2 1 5 0 0 0 MDIC-04

1 2 1 5 0 0 0 1 2 1 5 0 0 0 0 0 0 0 2 3 1 0 0 0 0 2 3 1

= 0.18 0.36 0.18 0.90 0 0 0 0 0 0 0 0.53 0.80 0.27 x 9.64 0 0 5.29 Copyright: C. Faloutsos (2004) x 0.58 0.58 0.58 0 0 0

0 0 0.71 0.71 67 Carnegie Mellon Case study - LSI Q1: How to do queries with LSI? A: map query vectors into concept space how? retrieval inf. brain lung data CS MD 1 2 1 5 0 0 0 MDIC-04 1 2 1 5 0

0 0 1 2 1 5 0 0 0 0 0 0 0 2 3 1 0 0 0 0 2 3 1 = 0.18 0.36 0.18

0.90 0 0 0 0 0 0 0 0.53 0.80 0.27 x 9.64 0 0 5.29 Copyright: C. Faloutsos (2004) x 0.58 0.58 0.58 0 0 0 0 0 0.71 0.71 68 Carnegie Mellon

Case study - LSI Q1: How to do queries with LSI? A: map query vectors into concept space how? retrieval inf. brain lung data q T= 1 0 0 0 0 term2 q v2 v1 term1 MDIC-04 Copyright: C. Faloutsos (2004) 69

Carnegie Mellon Case study - LSI Q1: How to do queries with LSI? A: map query vectors into concept space how? retrieval inf. brain lung data q T= 1 0 0 0 0 term2 v2 A: inner product (cosine similarity) with each concept vector vi MDIC-04 q Copyright: C. Faloutsos (2004) v1

term1 70 Carnegie Mellon Case study - LSI Q1: How to do queries with LSI? A: map query vectors into concept space how? retrieval inf. brain lung data q T= 1 0 0 0 0 term2 v2 A: inner product (cosine similarity) with each concept vector vi MDIC-04 q

Copyright: C. Faloutsos (2004) v1 q o v1 term1 71 Carnegie Mellon Case study - LSI compactly, we have: qTconcept = qT V Eg: retrieval inf. brain lung data qT= MDIC-04 1 0 0 0 0 0.58 0.58

0.58 0 0 0 0 0 0.71 0.71 term-to-concept similarities Copyright: C. Faloutsos (2004) CS-concept = 0.58 0 72 Carnegie Mellon Case study - LSI Drill: how would the document (information, retrieval) handled by LSI? MDIC-04 Copyright: C. Faloutsos (2004)

73 Carnegie Mellon Case study - LSI Drill: how would the document (information, retrieval) handled by LSI? A: SAME: dTconcept = dT V CS-concept retrieval 0.58 0 Eg: inf. brain lung data d= T MDIC-04 0 1 1 0 0 0.58 0.58 0 0

0 0 0.71 0.71 term-to-concept similarities Copyright: C. Faloutsos (2004) = 1.16 0 74 Carnegie Mellon Case study - LSI Observation: document (information, retrieval) will be retrieved by query (data), although it does not contain data!! CS-concept retrieval inf. brain lung data d= T q=

T MDIC-04 0 1 1 0 1 0 0 0 0 1.16 0 0.58 0 0 Copyright: C. Faloutsos (2004) 75 Carnegie Mellon Case study - LSI Q1: How to do queries with LSI? Q2: multi-lingual IR (english query, on

spanish text?) MDIC-04 Copyright: C. Faloutsos (2004) 76 Carnegie Mellon Case study - LSI Problem: given many documents, translated to both languages (eg., English and Spanish) answer queries across languages MDIC-04 Copyright: C. Faloutsos (2004) 77 Carnegie Mellon Case study - LSI Solution: ~ LSI informacion datos retrieval inf. brain lung data

CS MD 1 2 1 5 0 0 0 MDIC-04 1 2 1 5 0 0 0 1 2 1 5 0 0 0 0 0

0 0 2 3 1 0 0 0 0 2 3 1 1 1 1 5 0 0 0 1 2 1 5 0 0 0 1 2

1 4 0 0 0 0 0 0 0 2 2 1 0 0 0 0 2 3 1 Copyright: C. Faloutsos (2004) 78 Carnegie Mellon Case study - LSI Solution: ~ LSI informacion datos

retrieval inf. brain lung data CS MD 1 2 1 5 0 0 0 MDIC-04 1 2 1 5 0 0 0 1 2 1 5 0 0

0 0 0 0 0 2 3 1 0 0 0 0 2 3 1 1 1 1 5 0 0 0 1 2 1 5 0 0

0 1 2 1 4 0 0 0 0 0 0 0 2 2 1 0 0 0 0 2 3 1 Copyright: C. Faloutsos (2004) 79 Carnegie Mellon

SVD - outline - Case studies ... Applications - success stories MDIC-04 multi-lingual IR; LSI queries compression PCA - ratio rules Karhunen-Lowe transform Kleinberg/google algorithm Copyright: C. Faloutsos (2004) 80 Carnegie Mellon Case study: compression [Korn+97] Problem: given a matrix compress it, but maintain random access MDIC-04 Copyright: C. Faloutsos (2004)

81 Carnegie Mellon Problem - specs ~10**6 rows; ~10**3 columns; no updates; random access to any cell(s) ; small error: OK MDIC-04 Copyright: C. Faloutsos (2004) 82 Carnegie Mellon Idea MDIC-04 Copyright: C. Faloutsos (2004) 83 Carnegie Mellon SVD - reminder space savings: 2:1 minimum RMS error MDIC-04

Copyright: C. Faloutsos (2004) 84 Carnegie Mellon Performance - scaleup error space MDIC-04 Copyright: C. Faloutsos (2004) 85 Carnegie Mellon SVD - outline - Case studies ... Applications - success stories MDIC-04 multi-lingual IR; LSI queries compression

PCA - ratio rules Karhunen-Lowe transform Kleinberg/google algorithm Copyright: C. Faloutsos (2004) 86 Carnegie Mellon PCA - Ratio Rules [Korn+00] Typically: Association Rules (eg., {bread, milk} -> {butter} But: which set of rules is better? how to reconstruct missing/corrupted values? MDIC-04 Copyright: C. Faloutsos (2004) 87 Carnegie Mellon PCA - Ratio Rules Idea: try to find concepts: Eigenvectors dictate rules about ratios: bread:milk:butter = 2:4:3 $ on butter $ on milk

4 3 2 MDIC-04 $ on bread Copyright: C. Faloutsos (2004) 88 Carnegie Mellon PCA - Ratio Rules [Korn+00] Typically: Association Rules (eg., {bread, milk} -> {butter} But: how to reconstruct missing/corrupted values? which set of rules is better? MDIC-04 Copyright: C. Faloutsos (2004) 89 Carnegie Mellon PCA - Ratio Rules Moreover: visual data mining, for free:

MDIC-04 Copyright: C. Faloutsos (2004) 90 Carnegie Mellon PCA - Ratio Rules no Gaussian clusters; Zipf-like distribution phonecalls MDIC-04 Copyright: C. Faloutsos (2004) stocks 91 Carnegie Mellon NBA dataset ~500 players; ~30 attributes MDIC-04 PCA - Ratio Rules Copyright: C. Faloutsos (2004) 92

Carnegie Mellon PCA - Ratio Rules PCA: get eigenvectors v1, v2, ... ignore entries with small abs. value try to interpret the rest MDIC-04 Copyright: C. Faloutsos (2004) 93 Carnegie Mellon PCA - Ratio Rules NBA dataset - V matrix (term to concept similarities) MDIC-04 v1 Copyright: C. Faloutsos (2004) 94 Carnegie Mellon Ratio Rules - example RR1: minutes:points = 2:1 corresponding concept?

v1 MDIC-04 Copyright: C. Faloutsos (2004) 95 Carnegie Mellon Ratio Rules - example RR1: minutes:points = 2:1 corresponding concept? A: goodness of player MDIC-04 Copyright: C. Faloutsos (2004) 96 Carnegie Mellon Ratio Rules - example RR2: points: rebounds negatively correlated(!) MDIC-04 Copyright: C. Faloutsos (2004)

97 Carnegie Mellon Ratio Rules - example RR2: points: rebounds negatively correlated(!) - concept? v2 MDIC-04 Copyright: C. Faloutsos (2004) 98 Carnegie Mellon Ratio Rules - example RR2: points: rebounds negatively correlated(!) - concept? A: position: offensive/defensive MDIC-04 Copyright: C. Faloutsos (2004) 99 Carnegie Mellon SVD - Case studies ...

Applications - success stories MDIC-04 multi-lingual IR; LSI queries compression PCA - ratio rules Karhunen-Lowe transform Kleinberg/google algorithm Copyright: C. Faloutsos (2004) 100 Carnegie Mellon K-L transform [Duda & Hart]; [Fukunaga] A subtle point: SVD will give vectors that go through the origin v1 MDIC-04 Copyright: C. Faloutsos (2004)

101 Carnegie Mellon K-L transform v1 A subtle point: SVD will give vectors that go through the origin Q: how to find v1 ? v1 MDIC-04 Copyright: C. Faloutsos (2004) 102 Carnegie Mellon K-L transform v1 v1 MDIC-04 A subtle point:

SVD will give vectors that go through the origin Q: how to find v1 ? A: centered PCA, ie., move the origin to center of gravity Copyright: C. Faloutsos (2004) 103 Carnegie Mellon K-L transform v1 A subtle point: SVD will give vectors that go through the origin Q: how to find v1 ? A: centered PCA, ie., move the origin to center of gravity and THEN do SVD MDIC-04 Copyright: C. Faloutsos (2004) 104

Carnegie Mellon SVD - outline Motivation Definition - properties Interpretation Complexity Applications - success stories Conclusions MDIC-04 Copyright: C. Faloutsos (2004) 105 Carnegie Mellon SVD - Case studies ... Applications - success stories

MDIC-04 multi-lingual IR; LSI queries compression PCA - ratio rules Karhunen-Lowe transform Kleinberg/google algorithm Copyright: C. Faloutsos (2004) 106 Carnegie Mellon Kleinbergs algorithm Problem dfn: given the web and a query find the most authoritative web pages for this query Step 0: find all pages containing the query terms Step 1: expand by one move forward and backward MDIC-04 Copyright: C. Faloutsos (2004) 107 Carnegie Mellon Kleinbergs algorithm

Step 1: expand by one move forward and backward MDIC-04 Copyright: C. Faloutsos (2004) 108 Carnegie Mellon Kleinbergs algorithm on the resulting graph, give high score (= authorities) to nodes that many important nodes point to give high importance score (hubs) to nodes that point to good authorities) hubs MDIC-04 authorities Copyright: C. Faloutsos (2004) 109 Carnegie Mellon Kleinbergs algorithm observations recursive definition! each node (say, i-th node) has both an

authoritativeness score ai and a hubness score hi MDIC-04 Copyright: C. Faloutsos (2004) 110 Carnegie Mellon Kleinbergs algorithm Let A be the adjacency matrix: the (i,j) entry is 1 if the edge from i to j exists Let h and a be [n x 1] vectors with the hubness and authoritativiness scores. Then: MDIC-04 Copyright: C. Faloutsos (2004) 111 Carnegie Mellon Kleinbergs algorithm Then: k l

i m MDIC-04 ai = h k + h l + h m that is ai = Sum (hj) over all j that (j,i) edge exists or a = AT h Copyright: C. Faloutsos (2004) 112 Carnegie Mellon Kleinbergs algorithm i n p q MDIC-04 symmetrically, for the hubness: hi = an + ap + aq that is hi = Sum (qj) over all j that (i,j) edge exists or

h=Aa Copyright: C. Faloutsos (2004) 113 Carnegie Mellon Kleinbergs algorithm In conclusion, we want vectors h and a such that: h=Aa a = AT h That is: a = A TA a MDIC-04 Copyright: C. Faloutsos (2004) 114 Carnegie Mellon Kleinbergs algorithm a is a right- eigenvector of the adjacency matrix A (by dfn!) Starting from random a and iterating, well eventually converge (Q: to which of all the eigenvectors? why?) MDIC-04 Copyright: C. Faloutsos (2004)

115 Carnegie Mellon Kleinbergs algorithm (Q: to which of all the eigenvectors? why?) A: to the one of the strongest eigenvalue, because of [Property 2]: (AT A ) k v ~ (constant) v1 MDIC-04 Copyright: C. Faloutsos (2004) 116 Carnegie Mellon Kleinbergs algorithm - results Eg., for the query java: 0.328 www.gamelan.com 0.251 java.sun.com 0.190 www.digitalfocus.com (the java developer) MDIC-04 Copyright: C. Faloutsos (2004) 117

Carnegie Mellon Kleinbergs algorithm discussion authority score can be used to find similar pages (how?) closely related to citation analysis, social networks / small world phenomena MDIC-04 Copyright: C. Faloutsos (2004) 118 Carnegie Mellon google/page-rank algorithm closely related: imagine a particle randomly moving along the edges (*) compute its steady-state probabilities (*) with occasional random jumps MDIC-04 Copyright: C. Faloutsos (2004) 119 Carnegie Mellon SVD - outline

Motivation Definition - properties Interpretation Complexity Applications - success stories Conclusions MDIC-04 Copyright: C. Faloutsos (2004) 120 Carnegie Mellon SVD - conclusions SVD: a valuable tool - two major properties: (A) it gives the optimal low-rank approximation (= least squares best hyperplane) it finds concepts = principal components = features; (linear) correlations dim. reduction; visualization (solves over- and under-constraint problems) MDIC-04

Copyright: C. Faloutsos (2004) 121 Carnegie Mellon SVD - conclusions - contd (B) eigenvectors: fixed points AT A vi = i vi steady-state prob. in Markov Chains / graphs hubs/authorities MDIC-04 Copyright: C. Faloutsos (2004) 122 Carnegie Mellon Conclusions - in detail: given a (document-term) matrix, it finds concepts (LSI) ... and can reduce dimensionality (KL) ... and can find rules (PCA; RatioRules) MDIC-04 Copyright: C. Faloutsos (2004) 123

Carnegie Mellon Conclusions contd ... and can find fixed-points or steady-state probabilities (Kleinberg / google/ Markov Chains) (... and can solve optimally over- and under-constraint linear systems - see Appendix, (Eq. C1) - Moore/Penrose pseudo-inverse) MDIC-04 Copyright: C. Faloutsos (2004) 124 Carnegie Mellon References Berry, Michael: http://www.cs.utk.edu/~lsi/ Brin, S. and L. Page (1998). Anatomy of a LargeScale Hypertextual Web Search Engine. 7th Intl World Wide Web Conf. Chen, C. M. and N. Roussopoulos (May 1994). Adaptive Selectivity Estimation Using Query Feedback. Proc. of the ACM-SIGMOD , Minneapolis, MN. MDIC-04 Copyright: C. Faloutsos (2004) 125

Carnegie Mellon References Duda, R. O. and P. E. Hart (1973). Pattern Classification and Scene Analysis. New York, Wiley. Faloutsos, C. (1996). Searching Multimedia Databases by Content, Kluwer Academic Inc. Foltz, P. W. and S. T. Dumais (Dec. 1992). "Personalized Information Delivery: An Analysis of Information Filtering Methods." Comm. of ACM (CACM) 35(12): 51-60. MDIC-04 Copyright: C. Faloutsos (2004) 126 Carnegie Mellon References Fukunaga, K. (1990). Introduction to Statistical Pattern Recognition, Academic Press. Jolliffe, I. T. (1986). Principal Component Analysis, Springer Verlag. Kleinberg, J. (1998). Authoritative sources in a hyperlinked environment. Proc. 9th ACM-SIAM Symposium on Discrete Algorithms. MDIC-04

Copyright: C. Faloutsos (2004) 127 Carnegie Mellon References Korn, F., H. V. Jagadish, et al. (May 13-15, 1997). Efficiently Supporting Ad Hoc Queries in Large Datasets of Time Sequences. ACM SIGMOD, Tucson, AZ. Korn, F., A. Labrinidis, et al. (1998). Ratio Rules: A New Paradigm for Fast, Quantifiable Data Mining. VLDB, New York, NY. MDIC-04 Copyright: C. Faloutsos (2004) 128 Carnegie Mellon References Korn, F., A. Labrinidis, et al. (2000). "Quantifiable Data Mining Using Ratio Rules." VLDB Journal 8(3-4): 254-266. Press, W. H., S. A. Teukolsky, et al. (1992). Numerical Recipes in C, Cambridge University Press. Strang, G. (1980). Linear Algebra and Its Applications, Academic Press.

MDIC-04 Copyright: C. Faloutsos (2004) 129 Carnegie Mellon Appendix: SVD properties (A): obvious (B): less obvious (C): least obvious (and most powerful!) MDIC-04 Copyright: C. Faloutsos (2004) 130 Carnegie Mellon Properties - by defn.: A(0): A[n x m] = U [ n x r ] [ r x r ] VT [ r x m] A(1): UT [r x n] U [n x r ] = I [r x r ] (identity matrix) A(2): VT [r x n] V [n x r ] = I [r x r ] A(3): k = diag( 1k, 2k, ... rk ) (k: ANY real number) A(4): AT = V UT MDIC-04 Copyright: C. Faloutsos (2004)

131 Carnegie Mellon Less obvious properties A(0): A[n x m] = U [ n x r ] [ r x r ] VT [ r x m] B(1): A [n x m] (AT) [m x n] = ?? MDIC-04 Copyright: C. Faloutsos (2004) 132 Carnegie Mellon Less obvious properties A(0): A[n x m] = U [ n x r ] [ r x r ] VT [ r x m] B(1): A [n x m] (AT) [m x n] = U 2 UT symmetric; Intuition? MDIC-04 Copyright: C. Faloutsos (2004) 133 Carnegie Mellon Less obvious properties A(0): A[n x m] = U [ n x r ] [ r x r ] VT [ r x m] B(1): A [n x m] (AT) [m x n] = U 2 UT

symmetric; Intuition? document-to-document similarity matrix B(2): symmetrically, for V (AT) [m x n] A [n x m] = V 2 VT Intuition? MDIC-04 Copyright: C. Faloutsos (2004) 134 Carnegie Mellon Less obvious properties A: term-to-term similarity matrix B(3): ( (AT) [m x n] A [n x m] ) k= V 2k VT and B(4): (AT A ) k ~ v1 12k v1T for k>>1 where v1: [m x 1] first column (eigenvector) of V 1: strongest eigenvalue MDIC-04 Copyright: C. Faloutsos (2004) 135 Carnegie Mellon Less obvious properties

B(4): (AT A ) k ~ v1 12k v1T for k>>1 B(5): (AT A ) k v ~ (constant) v1 ie., for (almost) any v, it converges to a vector parallel to v1 Thus, useful to compute first eigenvector/value (as well as the next ones, too...) MDIC-04 Copyright: C. Faloutsos (2004) 136 Carnegie Mellon Less obvious properties repeated: A(0): A[n x m] = U [ n x r ] [ r x r ] VT [ r x m] B(1): A [n x m] (AT) [m x n] = U 2 UT B(2): (AT) [m x n] A [n x m] = V 2 VT B(3): ( (AT) [m x n] A [n x m] ) k= V 2k VT B(4): (AT A ) k ~ v1 12k v1T B(5): (AT A ) k v ~ (constant) v1 MDIC-04 Copyright: C. Faloutsos (2004) 137 Carnegie Mellon Least obvious properties

A(0): A[n x m] = U [ n x r ] [ r x r ] VT [ r x m] C(1): A [n x m] x [m x 1] = b [n x 1] let x0 = V (-1) UT b if under-specified, x0 gives shortest solution if over-specified, it gives the solution with the smallest least squares error (see Num. Recipes, p. 62) MDIC-04 Copyright: C. Faloutsos (2004) 138 Carnegie Mellon Least obvious properties Illustration: under-specified, eg [1 2] [w z] T = 4 (ie, 1 w + 2 z = 4) z shortest-length solution x0 2 all possible solutions 1 1 2 MDIC-04 3 4

w Copyright: C. Faloutsos (2004) 139 Carnegie Mellon Least obvious properties Illustration: over-specified, eg [3 2]T [w] = [1 2]T (ie, 3 w = 1; 2 w = 2 ) desirable point b 2 reachable points (3w, 2w) 1 1 2 MDIC-04 3 4 Copyright: C. Faloutsos (2004) 140 Carnegie Mellon Least obvious properties - contd A(0): A[n x m] = U [ n x r ] [ r x r ] VT [ r x m] C(2): A [n x m] v1 [m x 1] = 1 u1 [n x 1] where v1 , u1 the first (column) vectors of V, U. (v1 == right-eigenvector)

C(3): symmetrically: u1T A = 1 v1T u1 == left-eigenvector Therefore: MDIC-04 Copyright: C. Faloutsos (2004) 141 Carnegie Mellon Least obvious properties - contd A(0): A[n x m] = U [ n x r ] [ r x r ] VT [ r x m] C(4): AT A v1 = 12 v1 (fixed point - the usual dfn of eigenvector for a symmetric matrix) MDIC-04 Copyright: C. Faloutsos (2004) 142 Carnegie Mellon Least obvious properties altogether A(0): A[n x m] = U [ n x r ] [ r x r ] VT [ r x m] C(1): A [n x m] x [m x 1] = b [n x 1] then, x0 = V (-1) UT b: shortest, actual or leastsquares solution C(2): A [n x m] v1 [m x 1] = 1 u1 [n x 1] C(3): u1T A = 1 v1T

C(4): AT A v1 = 12 v1 MDIC-04 Copyright: C. Faloutsos (2004) 143 Carnegie Mellon Properties - conclusions A(0): A[n x m] = U [ n x r ] [ r x r ] VT [ r x m] B(5): (AT A ) k v ~ (constant) v1 C(1): A [n x m] x [m x 1] = b [n x 1] then, x0 = V (-1) UT b: shortest, actual or leastsquares solution C(4): AT A v1 = 12 v1 MDIC-04 Copyright: C. Faloutsos (2004) 144

Recently Viewed Presentations

  • The Meniscus: Injuries, Management and Interventions in ...

    The Meniscus: Injuries, Management and Interventions in ...

    Lateral vs medial meniscectomy. Partial Meniscectomy. Indicated for flap tears, cleavage tears, and radial tears in the inner or vascular areas. Leads to a >350% increase in focal contact forces on the articular cartilage. Medial meniscectomy .
  • Hydrogeochemistry - CLAS Users

    Hydrogeochemistry - CLAS Users

    H. 2 O cycle: glacier, lake, ocean, river etc.Gases (atmosphere) Solutes in water etc. Flux = transfer of mass between reservoirs. Water, other fluids, solutes. Units = mass per area per time ( e.g., m3/m2/yr) Requires physical transport - advection...
  • The National Service Ride Project Helping move America

    The National Service Ride Project Helping move America

    About Us. Inspired by Travels with Harley, the National Service Ride is a community-based initiative that leverages motorcycling's appeal to freedom, adventure, and moving forward to help promote citizenship and service, starting right at home: "When we become better citizens,...
  • Climate Change and Arizona: A brief overview Mike

    Climate Change and Arizona: A brief overview Mike

    Cold Spells. Less frequent. Just as cold (NCA-SW 2013) SOUTHWEST REPORT - CHAPTER 7. NOVEMBER-MARCH is the season measured. In other words, cold outbreaks occur when temperature drops below the local levels defining the coldest 5% of winter days or...
  • NON-fiction

    NON-fiction

    A short work of nonfiction that deals with one subject. Often found in newspapers and magiazines. Expository Essay. A formal essay which is tightly structured and presents information and ideas. ... NON-fiction Last modified by: Neely, Shannon
  • Volcanoes and Igneous Activity Earth - Chapter 4

    Volcanoes and Igneous Activity Earth - Chapter 4

    Chapter 12 Geologic Time Sedimentary Rocks: Rocks formed from the weathered products of preexisting rocks that have been transported, deposited, compacted, and cemented.
  • Company Profile Startex Marketing Services Lanka Startex Mystery

    Company Profile Startex Marketing Services Lanka Startex Mystery

    Customer Satisfaction Study. Net Promoter Program (Customer Loyalty) Service Plan Preparation. Staff Training (for effective implementation of Service Plan) Telephone Mystery Shopping. Brand Standard Audits. Service Plan implementation. Post-implementation review and strategy
  • A importância da Gestão do Conhecimento para os Processos de ...

    A importância da Gestão do Conhecimento para os Processos de ...

    A importância da Gestão do Conhecimento para os Processos de Inovação Mario Costa