Big Ambient Data - VLDB

Is it Still Big Data if it Fits in my Pocket?
Dave Campbell, Microsoft
8/31/2011 VLDB 2011

The Journey: Objective
- Try to separate hype from reality
- Identify unique new value
- Is map-reduce a giant step backwards?
- What are the dominant dimensions of Big Data?

The Journey: Process
- Engaged and connected with many people
- Many interesting debates

- Created, tested, and refined a frame to explain the Big Data phenomenon
- Major driving forces
- Encountered independently evolved common patterns
- Wrote code & prototyped

The Journey: Results
- An (one) explanation for the phenomenon
- A set of design & architecture patterns
- Material to inform an R&D agenda

The Story

The Knowledge Hierarchy
[Figure: knowledge hierarchy. Axes: Structure / Value and Effort / Latency. Levels: Knowledge Application, Knowledge, Information, Data, Signal.]

The Current Paradigm

Lifecycle of a Question
[Figure: lifecycle of a question. Labels: Question; Validation; Different Question; Not interesting; Worth asking again?; Make it repeatable; Bring it to production.]

A Tightly Coupled System
- Available data prepared on basis of scope of analysis
[Figure: Available Data; Model (Conceptual, Logical, Physical); Scope of Analysis.]

Models have traditionally been coupled
[Figure: Conceptual, Logical, Physical.]
- Logical model has been scaffolding for physical:
  - Relational: Indexes
  - MOLAP: Aggregations
- In-memory technologies breaking logical/physical knot!
- Knowledge domain coupled to conceptual model

Today's Challenge
[Figure: Available Data; Model; Scope of Analysis.]
- Data are freely available
- Ability to model it is much more of a gating factor than raw size
- Particularly when considering new forms of data

Sensemaking: Intelligence Analysis
Reference: The Sensemaking Process and Leverage Points for Analyst Technology as Identified Through Cognitive Task Analysis, Peter Pirolli and Stuart Card, 2005

Sensemaking
[Figure: Explanation; Explanatory Frame; Data; Support.]
- Interdependent relationship
- Supports abductive logic
- Current systems support sensemaking within a modeled domain. Big Data expands this.
Reference: A Data-Frame Theory of Sensemaking, G.A. Klein, et al.

Key Value Proposition
[Figure: Available Data and Models; Traditional System, Traditional System, New System.]
Key elements of Big Data value:
- Reducing friction to produce valuable information
- Enabling sensemaking over a broader space
- Enabling model / algorithm generation

Reworking The Knowledge Hierarchy
[Figure: two knowledge hierarchies (Knowledge Application, Knowledge, Information, Data, Signal). Axes: Structure / Value and Effort / Latency. Annotation: t, Time to insight improvement.]

Objective: Change Shape of Two Curves

Emergent Architectural Patterns

Big Data Patterns
- Have observed some common patterns
- Many appear to occur via independent evolution
- Prototyped over personal sources
- Patterns:
  - Digital Shoebox
  - Information Production
  - Transform & Load
  - Model Development
  - Monitor, Mine, Manage

Pattern: Digital Shoebox
Intent: Retain all ambient data to enable sensemaking over all available signals
Applicability: Use to create a source data pool to bootstrap subsequent information generation
Description:
- Enabling trends:
  - Cost of data acquisition: $0
  - Cost of data storage: $0
- Tipping point occurs if: [inequality; terms lost in extraction]
- Must keep modeling and storage costs low to achieve this
Implementation: Augment raw data with sourceID and instanceID, and retain on inexpensive but reliable storage

Pattern: Digital Shoebox
- Source Model: The natural model in which the data are produced
- Acquisition Model: An augmented source model which contains source identifier and instance (typically timestamp)

AcquisitionModel = {sourceID, instanceID, sourceData}

[Figure: acquired records, each carrying a SourceID (A, B, C) and an InstanceID (1, 2, 3) alongside its source data.]
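The acquisition model above can be sketched in a few lines of Python. This is a minimal illustration, not the prototype from the talk: the function names (`acquire`, `retain`, `read_source`), the JSON-lines file standing in for "inexpensive but reliable storage", and the sample records are all assumptions made for the example.

```python
import json
import os
import tempfile
import time


def acquire(source_id, source_data, instance_id=None):
    """Wrap one raw record as {sourceID, instanceID, sourceData}.

    instance_id defaults to a timestamp, the typical choice for the instance.
    """
    if instance_id is None:
        instance_id = time.time()
    return {"sourceID": source_id, "instanceID": instance_id, "sourceData": source_data}


def retain(record, shoebox_path):
    """Append the record to a JSON-lines file (a stand-in for the shoebox)."""
    with open(shoebox_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def read_source(shoebox_path, source_id):
    """Return every retained record for one source, in arrival order."""
    with open(shoebox_path, encoding="utf-8") as f:
        return [r for r in map(json.loads, f) if r["sourceID"] == source_id]


# Illustrative usage with made-up records; the source data is kept unmodelled.
shoebox = os.path.join(tempfile.mkdtemp(), "shoebox.jsonl")
retain(acquire("GPS", "47.60,-122.30", instance_id=1), shoebox)
retain(acquire("HA", "front door opened", instance_id=1), shoebox)
retain(acquire("GPS", "47.61,-122.29", instance_id=2), shoebox)
gps_records = read_source(shoebox, "GPS")
```

The source data is stored as an opaque value on purpose: nothing is modelled at acquisition time, which is what keeps the modeling cost low.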

Personal Example
[Figure: acquired records from sources GPS, Outlook, and HA, each with instances 1, 2, 3.]

- GPS: Have been carrying a GPS data logger for 5 months
- HA: Log file from home automation system
- Outlook: Have a script that records when I send mail, to whom, and, if a reply, my response latency

Pattern: Information Production
Intent: Turn acquired data from digital shoebox into other events and states
Applicability: Used to transform raw data into information for subsequent processing
Description: Often requires temporal processing & correlation of acquired data

Key point: Cleansing often much easier in transformed domain
Implementation: Requires environment for parsing, grouping, aggregation, and often joining of acquired data

Information Production
[Figure: Transform.]
- Transforms source data into events & states
- Data cleanup, cleansing & imputation
- Quite often cleansing happens in transformed domain
  - E.g. Nights on the road vs. @ Home
- Wind up with a set of composable transforms
- Produced information stored in Digital Shoebox or downstream system

Personal Example - GPS
[Figure: tree of transforms and filters: Source, T1, T2, T3, T4, T5.]
- Tree of transforms and filters
- Cleansing often happens in transformed domain
  - E.g. Where I slept each night
- Can produce higher level information
  - [DwellAtHome], [RouteToWork], [DwellAtWork] = Commute to work
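A chain of composable transforms like the one above might look as follows in Python. This is a sketch under stated assumptions, not the code behind the slides: the place coordinates, the match radius, the `State` type, and the sample fixes are all hypothetical, and a flat-earth distance in degrees stands in for real geodesic distance. A fix that matches no known place is labelled simply "Route", which plays the role of [RouteToWork] when it falls between a home dwell and a work dwell.

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import groupby
from math import hypot

# Hypothetical place coordinates (degrees) and match radius; illustrative only.
PLACES = {"Home": (47.60, -122.30), "Work": (47.64, -122.13)}
RADIUS_DEG = 0.01


@dataclass(frozen=True)
class State:
    label: str  # "DwellAtHome", "DwellAtWork", or "Route"
    start: datetime
    end: datetime


def label_fix(lat, lon):
    """Transform 1: map a raw GPS fix to a coarse label."""
    for name, (plat, plon) in PLACES.items():
        if hypot(lat - plat, lon - plon) <= RADIUS_DEG:
            return f"DwellAt{name}"
    return "Route"


def to_states(fixes):
    """Transform 2: collapse runs of identically labelled fixes into states."""
    labelled = [(label_fix(lat, lon), ts) for ts, lat, lon in fixes]
    states = []
    for label, run in groupby(labelled, key=lambda pair: pair[0]):
        times = [ts for _, ts in run]
        states.append(State(label, times[0], times[-1]))
    return states


def commutes(states):
    """Transform 3: DwellAtHome, Route, DwellAtWork in sequence is one commute."""
    found = []
    for a, b, c in zip(states, states[1:], states[2:]):
        if (a.label, b.label, c.label) == ("DwellAtHome", "Route", "DwellAtWork"):
            minutes = (c.start - a.end).total_seconds() / 60
            found.append({"leave_time": a.end, "minutes": minutes})
    return found


# Made-up fixes: (timestamp, latitude, longitude).
fixes = [
    (datetime(2011, 6, 10, 6, 0), 47.600, -122.300),
    (datetime(2011, 6, 10, 6, 45), 47.601, -122.300),
    (datetime(2011, 6, 10, 6, 55), 47.620, -122.200),
    (datetime(2011, 6, 10, 7, 10), 47.640, -122.130),
    (datetime(2011, 6, 10, 12, 0), 47.640, -122.131),
]
result = commutes(to_states(fixes))
```

Each transform consumes the previous one's output, so filters or cleansing steps (for example, dropping implausibly short dwells) can be slotted in at the state level, where they are easier to express than on raw fixes.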

Using higher level information: Commute duration f(leavingTime)

Commute Time as f(leaveTime)
[Figure: commute time as a function of leave time.]

Event & State Correlation
Dwell geolocation + Outlook statistics = How much email do I send from home vs. at work?

Outlook statistics sample rows (columns appear to be send time, time of the message replied to, and response latency in hours):
2011-06-10 06:18:26, 2011-06-10 06:16:18, 0.04
2011-06-10 06:21:18, 2011-06-09 08:27:50, 21.89
2011-06-10 06:24:37, 2011-06-09 07:43:58, 22.68
2011-06-10 06:26:48, None, 0.00
2011-06-10 06:29:37, 2011-06-09 06:53:34, 23.60
2011-06-10 06:34:41, 2011-06-09 12:00:25, 18.57
2011-06-10 06:39:52, 2011-06-09 17:44:54, 12.92
2011-06-10 06:43:18, 2011-06-09 14:28:49, 16.24

Pattern: Transform & Load
Intent: Transform acquired data and produced information to load into traditional systems, e.g. Data Warehouse, OLAP cube, etc.

Applicability: Used to load other systems for production use or other analysis
Description:
- Transformations and queries over the Digital Shoebox are used to load downstream systems
- Jobs can be scheduled or invoked by other systems
Implementation:
- Requires repeatable transform mechanism
- Adapters to downstream systems
- Scheduling mechanism

Transform & Load
[Figure: Acquisition Model; Information Models; Data Mart; Data Warehouse; CEP System.]

Pattern: Model Development
Intent: Enable sensemaking directly over the Digital Shoebox without extensive up-front modeling
Applicability: Used to create knowledge from Digital Shoebox contents
Description:

Provide a suite of tools which operate efficiently to enable model discovery, refinement, and validation
Implementation: Requires exploration, visualization, and statistical tools

Model Development Example
- It's clear that I'm an early to bed, early to rise guy
- When not home, only activity is from the pet-sitter & cleaners
- Marcia gets up after me and likes to read in bed

Pattern: Monitor, Mine, Manage
Intent: Develop and use generated models to perform active management or intervention

Applicability: Use for fraud detection, system alerting, intrusion detection, user classification
Description: Historical data is used to develop a model (algorithm) which is installed in an active system
Implementation: Requires model generation pattern, active monitoring system [e.g. Complex Event Processing (CEP)]

Pattern: Monitor, Mine, Manage
1. Monitor & collect data
2. Mine and create online model
3. Deploy online model to actively manage
This is about reducing Time to Action!
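The three steps can be illustrated with a deliberately small Python sketch. It assumes a simple mean-plus-k-standard-deviations threshold as the "mined" model, which is a stand-in chosen for brevity rather than the method used in the talk; the function names, the event counts, and the callback standing in for an active system such as a CEP engine are likewise hypothetical.

```python
from statistics import mean, stdev


def mine_model(history, k=3.0):
    """Step 2: derive a simple threshold model from collected history."""
    mu, sigma = mean(history), stdev(history)
    return {"low": mu - k * sigma, "high": mu + k * sigma}


def make_monitor(model, on_alert):
    """Step 3: deploy the model as an online check applied to each new event."""
    def monitor(value):
        out_of_range = not (model["low"] <= value <= model["high"])
        if out_of_range:
            on_alert(value)  # the intervention hook
        return out_of_range
    return monitor


# Step 1: monitor and collect (made-up hourly event counts).
history = [10, 12, 11, 9, 10, 11, 12, 10]

alerts = []
monitor = make_monitor(mine_model(history), alerts.append)
flags = [monitor(v) for v in [11, 10, 40, 12]]
```

The point of the split is that mining happens offline over retained history, while the deployed check is cheap enough to run on every incoming event, which is where the reduction in time to action comes from.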

Examples:
- Financial fraud detection and prevention
- Audience intelligence
- Personal: Home & Away settings for home automation

Pattern Map
[Figure: Digital Shoebox; Model Development; Information Production; Monitor, Mine & Manage; Transform & Load.]

Tying it Together
[Figure: patterns placed on the knowledge hierarchy. Axes: Structure / Value and Effort / Latency. Levels: Knowledge Application, Knowledge, Information, Data, Signal. Patterns: Monitor, Mine, Manage; Model Generation; Transform & Load; Information Production; Digital Shoebox. Annotation: t, Time to Insight.]

R&D Agenda
- Improved sensemaking tools:
  - Visualization
  - Temporal and spatial correlation
- Machine learning
  - Large Ambient Data can eclipse existing methods
  - E.g. language translation
- Robust big-data query processing
  - Leverage various degrees of structure and modeling
  - General locality awareness
  - Checkpoint vs. restart tradeoff
  - Emergent intermediate structure: infer and reify dimensions
- Re-stating history
  - Re-feed downstream systems sourcing from big-data environment
  - Re-think slowly changing dimensions

Wrap up

- Big Data is multi-faceted
- Interesting architecture/design patterns emerging
- Realizing new value requires re-thinking existing system assumptions
- Time to insight/action should be a driving metric
- Complements existing data platform
- Intersection with HPC/TC world
- This is reshaping information management

