# Random Projection for High Dimensional Data Clustering: A ...

Random Projection for High Dimensional Data Clustering: A Cluster Ensemble Approach Xiaoli Zhang Fern, Carla E. Brodley ICML2003 Presented by Dehong Liu Contents

Motivation Random projection and the cluster ensemble approach Experimental results Conclusion Motivation

High dimensionality poses two challenges for unsupervised learning The presence of irrelevant and noisy features can mislead the clustering algorithm. In high dimensions, data may be sparse, making it difficult to find any structure in the data.

Two basic approaches to reduce the dimensionality Feature subset selection; Feature transformation-PCA, random projection. Motivation Random projection

Advantage A general data reduction technique; Has been shown to have special promise for high dimensional data clustering. Disadvantage Highly unstable. Different random projections may

lead to radically different clustering results. Idea Aggregate multiple runs of clusterings to achieve better clustering performance. A single run of clustering consists of applying ra ndom projection to the high dimensional data an

d clustering the reduced data using EM. Multiple runs of clustering are performed and the results are aggregated to form an nn similarity matrix. An agglomerative clustering algorithm is then ap plied to the matrix to produce the final clusters.

A single run Random projection: X=X R X: n d, reduced-dimension data set X : n d , high-dimensional data set R: d d, which is generated by first setting each en try of the matrix to a value drawn from an i.i.d N(0,1) distribution and then normalizing the columns to unit

length. EM clustering Aggregating multiple clustering results The probability that data point i belongs to

each cluster under the model : The probability that data point i and j belongs to the same cluster under the model : Pij forms a similarity matrix.

Producing final clusters How to decide k? We can use the occurrence of a sudden similarity drop as a heuristic to determine k.

Experimental results Evaluation Criteria Conditional Entropy (CE): measures the uncertainty of the class labels given a clustering solution. Normalized Mutual Information (NMI) between the distribution of class labels and the distribution of cluster labels.

CE: the smaller the better. NMI: the larger the better. Experimental results Cluster ensemble versus single RP+EM Experimental results

Cluster ensemble versus PCA+EM Experimental results Cluster ensemble versus PCA+EM Analysis of Diversity for Cluster Ensembles

Diversity: the NMI between each pair of clustering solutions. Quality: average the NMI values between each of the solutions and the class labels Conclusion Techniques have been investigated to produce and com

bine multiple clusterings in order to achieve an improved final clustering. The major contribution of this paper:1)Examined random projection for high dimensional data clustering and identif ied its instability problem; 2)formed a novel cluster ense mble framework based on random projection and demon strated its effectiveness for high dimensional data cluster

ing; and 3) identified the importance of the quality and di versity of individual clustering solutions and illustrated th eir influence on the ensemble performance with empirical results.

## Recently Viewed Presentations

• The critical value is 5.991. Decision rule for the Χ2 test:• If the test statistic is less than the critical value then the data supports the null hypothesis• If the test statistic is equal to or greater than the critical...
• Tree Structure: Recognizing the coefficients of the same spatial location Zero tree: set of insignificant coefficients DWT for Image decomposition Zero Tree A coefficient is part of a zero tree if it's zero and all of its descendents are zero...
• Monet was born in Paris, France in 1840. His full name is Oscar-Claude Monet. His family owned a grocery store and his father wanted him to take over the family business. He only wanted to paint; eventually he went to...
• There were no chalkboards, maps, or paper. School teachers were strict and were allowed to hit their students or make them wear a dunce hat if they were bad or said the wrong answer. In the New England colonies, children...
• Ending Report (two versions: short for Website, Complete as Deliverable). Press release. 57 experts (120 accepted) 194 experts invited. 401 invited. Total invited experts Male Female 61 59 Working panel Male Female 0 0 Member States represented Participants by country...
• The sequential semantics of anyproducer effect system forms a productor. ... Formalized sequential structure of. Effects: effectors. Semantics: productors. Proved generality/applicability of productors. All producer effect systems. Thunking guarantees producer effects. Opportunities for future ...
• Chin-Sang Lab Volunteer Training ... Worm Picking (cont'd) Putting the Worms Down: Refocus the microscope on the edge of the bacterial lawn Gently touch the surface of the agar and hold the pick there The worm will crawl off towards...
• Why do we have all these rules? These laws and regulations provide the guidance framework for how the State does business. When properly implemented and followed, they will safeguard our resources, protect our staff, and provide internal controls to ensure...