KDD-07 Invited Innovation Talk
August 12, 2007
Usama Fayyad, Ph.D.
Chief Data Officer & Executive VP, Yahoo! Inc. Research

Thanks and Gratitude
My family: my wife Kristina and my 4 kids; my parents and my sisters
My academic roots: The University of Michigan, Ann Arbor; my Ph.D. committee, including Ramasamy Uthurusamy (then at GM Research Labs); grad student colleagues (Jie Cheng); internships at GM Research and at NASA's JPL
My mentors and collaborators: Caltech Astronomy (G. Djorgovski, Nick Weir), Pietro Perona and M.C. Burl
JPL/NASA colleagues: Padhraic Smyth, Rich Doyle, Steve Chien, Paul Stolorz, Peter Cheeseman, David Atkinson, many others
Microsoft colleagues: Decision Theory Group, Surajit Chaudhuri, Jim Gray, Paul Bradley, Bassel Ojjeh, Nick Besbeas, Heikki Mannila, Rick Rashid, many others
Fellows in KDD: Gregory Piatetsky-Shapiro, Daryl Pregibon, Christos Faloutsos, Geoff Webb, Bob Grossman, Jiawei Han, Eric Tsui, Tharam Dillon, Chengqi Zhang, many, many colleagues
My business partners: Bassel Ojjeh, Nick Besbeas, many VCs, many advisers and strategic clients including Microsoft SQL Server and sales teams
My Yahoo! colleagues: Zod Nazem, Jerry Yang, David Filo, Yahoo! exec team, Prabhakar Raghavan, Pavel Berkhin, Nick Weir, Hunter Madsen, Nitin Sharma, Raghu Ramakrishnan, Y! Research folks, many at Yahoo SDS and current and previous Yahoo! employees

A Data Miner's Story: Getting to Know the Grand Challenges
Personal Observations of a Data Mining Disciple
Usama Fayyad, Ph.D.
Chief Data Officer & Executive VP, Yahoo! Inc.

Overview
The setting
Why data mining is a must?
Why data mining is not happening?
A Data Miner's Story
Grand Challenges: Pragmatic
Grand Challenges: Technical
Some case studies
Concluding Remarks

The data gap
The Machinery Moves on:
Moore's law: processing capacity doubles every 18 months (CPU, cache, memory)
Its more aggressive cousin: disk storage capacity doubles every 9 months
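A quick arithmetic check of the two doubling rates above: with doubling period T, capacity grows by a factor of 2^(t/T) after t months. The sketch below is only an illustration; the function name and the 36-month horizon are assumptions, not figures from the talk.

```python
def growth_factor(months: float, doubling_months: float) -> float:
    """Return how many times capacity multiplies after `months`."""
    return 2.0 ** (months / doubling_months)


HORIZON_MONTHS = 36  # hypothetical horizon, chosen only for illustration

cpu_growth = growth_factor(HORIZON_MONTHS, 18)  # processing: doubles every 18 months
disk_growth = growth_factor(HORIZON_MONTHS, 9)  # disk: doubles every 9 months
gap = disk_growth / cpu_growth                  # how much faster storage grew

print(cpu_growth, disk_growth, gap)  # prints: 4.0 16.0 4.0
```

Under these assumptions, stored data can outgrow the capacity to process it by a factor of four in three years, which is one way to read "the data gap".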

The Demand is exploding:
Every business is an eBusiness
Scientific instruments and Moore's law
Government
The Internet: the ubiquity of the Web
The Talent Shortage

What is Data Mining?
Finding interesting structure in data
Structure: refers to statistical patterns, predictive models, hidden relationships
Interesting: ?
Examples of tasks addressed by Data Mining:
Predictive Modeling (classification, regression)
Segmentation (Data Clustering)
Affinity (Summarization): relations between fields, associations, visualization

Beyond Data Analysis
Scaling analysis to large databases
How to deal with data without having to move it out?
Are there abstract primitive accesses to the data, in database systems, that can provide mining algorithms with the information to drive the search for patterns?
How do we minimize, or sometimes even avoid, having to scan the large database in its entirety?
Automated search
Enumerate and create numerous hypotheses
Fast search
Useful data reductions
More emphasis on understandable models
Finding patterns and models that are interesting or novel to users
Scaling to high-dimensional data and models

Data Mining and Databases
Many interesting analysis queries are difficult to state precisely
Examples:
Which records represent fraudulent transactions?
Which households are likely to prefer a Ford over a Toyota?
Who's a good credit risk in my customer DB?
Yet the database contains the information:
good/bad customer, profitability
did/did not respond to mailout/survey/...

Data Mining Grand Vision
ACME CORP ULTIMATE DATA MINING BROWSER
What's New? What's Interesting? Predict for me

The myths
Companies have built up some large and impressive data warehouses
Data mining is pervasive nowadays
Large corporations know how to do it
There are tools and applications that discover valuable information in enterprise databases

The truths
Data is a shambles; most data mining efforts end up not benefiting from existing data infrastructure
Corporations care a lot about data, and are obsessed with customer behavior and understanding it
They talk a lot about it
An extremely small number of businesses are successfully mining data
The successful efforts are one-off, lucky strikes

Current state of Databases: Ancient Egypt
Data navigation, exploration, & exploitation technology is fairly primitive:
we know how to build massive data stores
we do not know how to exploit them
we do the book-keeping really well (OLTP)
Inadequate basic understanding of navigation/systems
many large data stores are write-only (= data tomb)

A Data Miner's Story
Started out in pure research
Professional student
Math and algorithms

Researcher view
[Diagram labels: Database, Algorithms and Theory, Systems]

Practitioner view
[Diagram labels: Database, Customer, Systems and integration, Algorithms]

Business view
[Diagram labels: Customer, Database, Systems, $$$s, Algorithms]

A Data Miner's Story
Started out in pure research
At NASA-JPL, did basic research and applied techniques to Science Data Analysis problems
Worked with top scientists in several fields: astronomy, planetary geology, atmospherics, space science, remote sensing imagery
Great results, strong group, lots of funding, high demand
So why move to Microsoft Research?

Example: Cataloging Sky Objects

Data Mining Based Solution
94% accuracy in recognizing sky objects
Speed up catalog generation by one to two orders of magnitude (unrealistic to perform manually)
Classify objects that are at least one magnitude fainter than catalogs to date
Tripled the data yield
Generate sky catalogs with much richer content, on the order of billions of objects: > 2x10^7 galaxies, > 2x10^8 stars, 10^5 quasars
Discovered new quasars 40 times more efficiently

A Data Miner's Story
Started out in pure research
At NASA-JPL
At Microsoft Research
Basic research in algorithms and scalability
Began to worry about building products and integrating with database server
Two groups established: research and product
So why move out to a start-up?

Working with Large Databases
One scan (or less) of the database; terminate early if appropriate
Work within confines of a given limited RAM buffer
Cluster a Gigabyte or Terabyte in, say, 10 or 100 Megabytes of RAM
Anytime algorithm: best answer always handy
Pause/resume enabled, incremental
Operate on forward-only cursor over a view (essentially a data stream)

Business Results Gap
Business users are unable to apply the power of existing data mining tools to achieve results

Business challenges: Acquisition, Conversion, Average Order, Retention, Loyalty
Technologies (technical tools): Neural Networks, OLAP, Logistic Regressions, Bayesian Networks, CART, Segmentation, Decision Trees, Genetic Algorithms, CHAID

Business Results Gap
Business users are unable to apply the power of existing data mining tools to achieve results
Business challenges: Acquisition, Conversion, Average Order, Retention, Loyalty
Specialists: Statisticians, Data Mining PhDs, DBAs, Consultants
Technologies (technical tools): Neural Networks, OLAP, Logistic Regressions, Bayesian Networks, CART, Segmentation, Decision Trees, Genetic Algorithms, CHAID

Evolving Data Mining
Evolution on the technical front:

New algorithms
Embedded applications
Make the analyst's life easier
Evolution on the usability front:
New metaphors
Vertical applications embedding
Used by the business user
In both cases, success means invisibility

Grand Challenges
Pragmatic: achieving integration and invisibility
Research/Technical: solving some serious unaddressed problems

Pragmatic Grand Challenge 1: Where is the data?
There is a glut of stored data
Very little of that data is ready for mining
Data warehousing has proven that it will not solve the problem for us
Solution: integration with operational systems
Take a serious database approach to solving the storage management problem

digiMine Background

Started as a venture capital-funded company, digiMine, Inc., in March 2000
Built, operated and hosted data warehouses with built-in data mining apps
Headquartered in Bellevue, Washington
$45 million in funding: Mayfield, Mohr Davidow, American Express, Deutsche Bank
Grew to over 120 employees
50+ patents in technology and processes
Both technology and services

Sample Customers

A Data Miner's Story
Started out in pure research
At NASA-JPL
At Microsoft Research
At digiMine
Lots of VC funding, great team, great press coverage, and fast moving great customers
So why move to DMX Group?

Why DMX Group?
At digiMine, we grew a large Professional Services organization
We learned a lot from these engagements
VC-funded companies cannot do much consulting
A fork in the road appeared
digiMine re-focused on a market vertical: behavioral targeting for media and publishers
Renamed to Revenue Science, Inc.
Formed DMX Group, which was eventually acquired by Yahoo!

DMX Group Mission
Make enterprise data a working asset in the enterprise:
Data strategy for the business
Implementation of Business Intelligence and data mining capabilities
Business issues around data
What is possible?
How to expose it to business users
How to train people and change processes
Integration with operational systems

Data Strategy
How can your data influence your revenues?
How do you optimize operations based on data?
How do you increase customer retention based on data?
How do you utilize enterprise data assets to spot new opportunities:
Cross-sell to existing customers
Grow new markets
Avoid problems such as fraud, abuse, churn, etc.?

A Data Miner's Story
Started out in pure research
At NASA-JPL
At Microsoft Research
At digiMine/Revenue Science Inc.
At DMX Group

Pragmatic Grand Challenge 2: Embedding within Operational Systems
We all worry about algorithms; they are fascinating
Most of us know that data mining in practice is mostly data prep work
Go where the data is when the data does not come to you
But how much of the problem is data mining?
Facts:
The effort in embedding an application is huge, and often not discussed
Without it, all the algorithms are useless

Case Study: Wireless Telco Churn Modelling and Prediction

Modeling Process
[Diagram, numbered steps as extracted: 1 Customer Interaction Base (SMS, WAP, CDR, Billing); 2 Sample Database; 3 Build Churn Model; 4 Score Database; 5 Assign Customer Value; 6 High/Med/Low Risk and High/Med/Low Value. The output is a 3x3 grid of Value by Risk segments, from "High Val, High Risk" to "Low Val, Low Risk".]
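The value-by-risk grid from the modeling process can be sketched as a small binning step applied to each scored customer. This is a minimal sketch, not the system described in the talk: the cut-off values, the function names and the sample customers are all hypothetical, since the slides do not give them.

```python
from collections import Counter

# Hypothetical cut-offs; the talk does not state the thresholds that were used.
RISK_CUTS = (0.3, 0.7)       # churn probability: < 0.3 Low, < 0.7 Med, else High
VALUE_CUTS = (100.0, 500.0)  # customer value: < 100 Low, < 500 Med, else High


def band(x: float, cuts: tuple[float, float]) -> str:
    """Map a number to 'Low', 'Med' or 'High' using two ascending cut-offs."""
    low_cut, high_cut = cuts
    if x < low_cut:
        return "Low"
    if x < high_cut:
        return "Med"
    return "High"


def segment(churn_probability: float, customer_value: float) -> str:
    """Return the grid cell label for one scored customer."""
    value_band = band(customer_value, VALUE_CUTS)
    risk_band = band(churn_probability, RISK_CUTS)
    return f"{value_band} Val / {risk_band} Risk"


# Made-up scored customers: (id, churn probability, assigned value).
scored = [("a", 0.9, 800.0), ("b", 0.5, 50.0), ("c", 0.1, 300.0), ("d", 0.85, 20.0)]
segments = {cid: segment(p, v) for cid, p, v in scored}
counts = Counter(segments.values())

print(segments)
```

Each of the nine cells can then be tied to a different treatment, for example reserving the most expensive retention offers for the high-value, high-risk cell.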

LTV and Its Application
A customer's life-time value (LTV) is the net value that a customer brings in to a business by the end of their service, i.e. their profit contribution.
LTV allows decisions for individual customers that optimize the return on investment (ROI).
Examples:
Aggressive retention programs, such as equipment upgrade and contract renewal, for customers with high LTV
Differentiated customer care treatment for reactivations by customers with low LTV

What is Required?
Detailed data
Integration of CDR, WIG, SMS, Billing
Maintained at detailed level
Integrated data mining
Algorithms tuned to model thousands of variables and millions of rows
Accurate forecasts
System robustness
Massively scalable back end system
Flexible architecture to create new variables quickly and easily
Collaborative service model
Service model which guarantees success
Combined IQ Model to optimize science and business knowledge
Low cost to create and maintain models

Map Segments to Actions

[Diagram: grid of actions with Churn Probability (Low to High) on one axis and Forecasted LTV (Negative, Low, High) on the other. Recovered labels: Save Program, Aggressively Defend, Cautiously Defend, Let them go, Cost Reducing Programs, Contract Renewal, Equipment Upgrade, Feature Add, Elite Program, Change Plan, Bad Behavior Migration, Grow Margin, Feature Use, Nurture/Loyalty Programs, Maintain.]

Cost Rules Applied
Cost Rules are introduced to define scoring
For Example:
Network System Usage Cost

Mobile to Land Connections Costs
Technical Operations/Support Costs
Long Distance Costs
Inter-Carrier/International subsidy costs
Roaming Costs
Bad Debt Allocation
Many others

Cost Rules for a Bank?
Cost Rules are introduced to define value
For Example:
Deposit Value
Product mix
Average daily balance
Monthly service fees
Technical operations/Support costs
Branch/teller usage
Late payment/Overdraft history
Interest rate
Contract term
Credit Score
Employment history/Income

Pragmatic Grand Challenge 3: Integrating domain knowledge
Data mining algorithms are knowledge free
There is no notion of common sense reasoning
Do we have to solve an AI-hard problem?

Robust and deep domain knowledge utilization
Solution:
Very deep and very narrow integration
Ability to model business strategy
Reasoning capability just evolves (cf. chess players)

Cross-Sell / Up-Sell Example
Customer looking for pants
[Diagram labels: Help Me Decide, Complete the Assortment, Any Related Products, Recommendations, Collaborative Filtering, Context Sensitive Approach, Alternates, Up Sells, Complement, Add-on, Impulse Buy]

Pragmatic Grand Challenge 4: Managing and maintaining models
When was the last time you thought about the lifetime of a mining model?
What happens when a model is changed?
Have you tried to merge the results of two different clustering models over time?
How many data droppings (aka temp files, quick transformations, quick fixes) do you generate in an analysis session?
A framework for managing, updating, and retiring mining models
Solution: use techniques that have been invented for this: databases, systems management, software engineering, etc.

Pragmatic Grand Challenge 5: Effectiveness Measurement
How do we measure [honestly] the effectiveness of a model in a context?
Return on Investment (ROI) measurement
Evaluation in the context of the application
A framework and methodology for measurement and evaluation
Build the measurement method as part of the design of the model
An engineering recipe for measurements, and a set of metrics

Technical Challenges

Technical Challenges
0. Public benchmark data sets

As a field we have failed to define a common data collection
Very difficult to judge research and systems advances
Not an easy task, but not impossible
A mix of synthetic (but realistic) data sets and real data sets

Technical Challenges
1. How does the data grow?
A theory for how large data sets get to be large
Definitely not IID sampling from a static distribution
Inappropriateness of a single-population model
2. Complexity/understandability tradeoff
Explaining how, when and why a model works
Explaining when a model fails
A tuning dial for reducing the complex into the understandable

Technical Challenges
3. Interestingness
What is an interesting pattern or summary?
How do you measure novelty?
What is unusual? When is it worthy of attention?
Is it low probability events? High summarization ability?
Outliers? Good fits? Bad fits?

Technical Challenges
4. Scalability
Beyond just dealing with a large data set:
Principled feature reduction: what is the SVD equivalent?
Graceful degradation with dimensionality
Uncovering graphical structure in data: communities, relations, link analysis
Dealing with multiple data types: structured, sparse, dense, text, images, video, audio, sequence data, etc. I have yet to see an algorithm that deals with more than one type.
Integration with DBMS
Appropriate sampling
Appropriate operator abstractions
Taking care of minor details
Initialization?
Determining k
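As one illustration of the "appropriate sampling" item under DBMS integration, the sketch below draws a fixed-size uniform sample in a single forward-only pass using reservoir sampling (Algorithm R). This is a standard technique chosen for illustration, not something the talk prescribes; `fake_cursor` is a hypothetical stand-in for a real database cursor.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int, rng: random.Random) -> List[T]:
    """Return up to k rows chosen uniformly at random in one forward-only pass."""
    reservoir: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Row i survives with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = row
    return reservoir


def fake_cursor(n: int) -> Iterator[int]:
    """Hypothetical stand-in for a forward-only database cursor over n rows."""
    for row_id in range(n):
        yield row_id


sample = reservoir_sample(fake_cursor(100_000), k=100, rng=random.Random(0))
print(len(sample))  # prints: 100
```

Memory use depends only on k, not on the number of rows scanned, and the table size does not need to be known in advance.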

Technical Challenges
5. A theory for what we do
What are the fundamental abstractions?
What are the basic operations?
What are the basic components of an algorithm?
What is it that we are optimizing?
What is hard? What is doable? Why?
What is a data summary?
When are two attributes similar? Can you measure efficiently?
How do we extract the right representation?

A new theory is needed
What are the fundamental problems?
What do partial models or summaries of data really mean?
What are the implications of post hoc data analysis?
When is it/is it not reasonable to conclude a task is appropriate?
A new algebra for dealing with highly-summarized views of the world
Effect of sparse spaces on dimensionality
What is the true dimensionality of data? What are the limits?
A theory for adaptive sampling

Summary: Pragmatic and Technical Grand Challenges

Challenges
0. Public and challenging benchmark data sets
Pragmatic:
1. Where's the Data?
2. In Situ mining
3. Domain knowledge
4. Life-cycle maintenance
5. Metrics
Technical:
1. Understanding large
2. Simplicity knob
3. Interestingness
4. Scalability
5. Theory of what we do
A Scorecard for the field: at least 2 advances in the next 10 years!!!

Data Mining Grand Vision
ACME CORP ULTIMATE DATA MINING BROWSER
What's New? What's Interesting? Predict for me

In the meantime, there is an understanding gap
The technical community speaks of tech problems
The business strategic thinking hit an understandability wall
Traditionally, the thinking of business strategy never included data
A new generation of business challenges is born

Data Strategy
Is the mapping of the capabilities enabled by data in driving the business
The integration of data-driven capabilities in revenue-driving activities
The integration of data-derived metrics to feed back into the measurement of the success of the business
Evolving to an operational state where planning includes data, measurability, and data-driven feedback loops

A Data Miner's Story
Started out in pure research
At NASA-JPL
At Microsoft Research
At digiMine/Revenue Science Inc.
At DMX Group
So why join Yahoo!?

Yahoo! Case Study
Evolving the Data Strategy as Chief Data Officer

Yahoo! is the #1 Destination on the Web
73% of the U.S. Internet population uses Yahoo!
About 500 million users per month globally!
Global network of content, commerce, media, search and access products
100+ properties including mail, TV, news, shopping, finance, autos, travel, games, movies, health, etc.
25 terabytes of data collected each day and growing
Representing thousands of cataloged consumer behaviors
More people visited Yahoo! in the past month than:
Use coupons
Vote
Recycle
Exercise regularly
Have children living at home
Wear sunscreen regularly
Data is used to develop content, consumer, category and campaign insights for our key content partners and large advertisers
Sources: Mediamark Research, Spring 2004 and comScore Media Metrix, February 2005.

Yahoo! Data: A league of its own
Terabytes of Warehoused Data
Millions of Events Processed Per Day


[Chart comparing Walmart, NYSE, Y! Panama Warehouse, VISA, Y! LiveStor, SABRE, AT&T, Korea Telecom and Amazon. Values in extraction order, pairing with entities not recoverable: 94, 49, 500, 25, 225, 1,000, 120, 2,000, 50, 5,000.]

To be continued

Will cover the Yahoo! case study in Tuesday's invited talk
Will include:
Strategic importance of data
Evolving the data strategy
Evolving towards the need to invent the new sciences of the Internet
Hope the Data Miner's Story continues
Perhaps to a happy ending?

Thank You! & Questions?
[email protected]
