Big Ambient Data - VLDB

Is it Still Big Data if it Fits in my Pocket?
  Dave Campbell
  Microsoft
  8/31/2011 VLDB 2011

The Journey: Objective
  Try to separate hype from reality
  Identify unique new value
  Is map-reduce a giant step backwards?
  What are the dominant dimensions of Big Data?

The Journey: Process
  Engaged and connected with many people
  Many interesting debates
  Created, tested, and refined a frame to explain Big Data phenomenon
  Major driving forces
  Encountered independently evolved common patterns
  Wrote code & prototyped

The Journey: Results
  An (one) explanation for the phenomenon
  A set of design & architecture patterns
  Material to inform an R&D agenda

The Story

The Knowledge Hierarchy
  [Figure: levels from top to bottom: Knowledge Application, Knowledge, Information, Data, Signal. Vertical axis: Structure / Value. Horizontal axis: Effort / Latency.]

The Current Paradigm
  [Figure: step labels are not recoverable from the extracted text]
  Time to Insight: Weeks to Months

Lifecycle of a Question
  [Figure: Question, Validation, Different Question]
  Not interesting
  Worth asking again?
  Make it repeatable
  Bring it to production

A Tightly Coupled System
  Available data prepared on basis of scope of analysis
  [Figure: Available Data; Model (Conceptual, Logical, Physical); Scope of Analysis]

Models have traditionally been coupled
  Conceptual, Logical, Physical
  Logical model has been scaffolding for physical:
    Relational: Indexes
    MOLAP: Aggregations
  In-memory technologies breaking logical/physical knot!
  Knowledge domain coupled to conceptual model

Today's Challenge
  [Figure: Available Data; Model; Scope of Analysis]
  Data are freely available
  Ability to model it is much more of a gating factor than raw size
  Particularly when considering new forms of data

Sensemaking: Intelligence Analysis
  Reference: The Sensemaking Process and Leverage Points for Analyst Technology as Identified Through Cognitive Task Analysis, Peter Pirolli and Stuart Card, 2005

Sensemaking
  [Figure: Explanation; Explanatory Frame; Data; Support]
  Interdependent relationship
  Supports abductive logic
  Current systems support sensemaking within a modeled domain. Big Data expands this.
  Reference: A Data-Frame Theory of Sensemaking, G.A. Klein, et al.

Key Value Proposition
  [Figure: Available Data; Models; Traditional Systems; New System]
  Key elements of Big Data value:
    Reducing friction to produce valuable information
    Enabling sensemaking over a broader space
    Enabling model / algorithm generation

Reworking The Knowledge Hierarchy
  [Figure: two Knowledge Hierarchy curves (Knowledge Application, Knowledge, Information, Data, Signal). Vertical axis: Structure / Value. Horizontal axis: Effort / Latency. Annotation: Time to insight improvement.]

Objective: Change Shape of Two Curves

Emergent Architectural Patterns

Big Data Patterns
  Have observed some common patterns
  Many appear to occur via independent evolution
  Prototyped over personal sources
  Patterns:
    Digital Shoebox
    Information Production
    Transform & Load
    Model Development
    Monitor, Mine, Manage

Pattern: Digital Shoebox
  Intent: Retain all ambient data to enable sensemaking over all available signals
  Applicability: Use to create a source data pool to bootstrap subsequent information generation
  Description:
    Enabling Trends:
      Cost of data acquisition $0
      Cost of data storage $0
    Tipping point occurs if: [inequality not recoverable from the extracted text]
    Must keep modeling and storage costs low to achieve this
  Implementation: Augment raw data with sourceID and instanceID and retain on inexpensive but reliable storage

Pattern: Digital Shoebox
  Source Model: The natural model in which the data are produced
  Acquisition Model: An augmented source model which contains source identifier and instance (typically timestamp)
  AcquisitionModel = {sourceID, instanceID, sourceData}
  [Figure: source data from sources A, B, and C, keyed by SourceID and InstanceID]
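The acquisition model above can be sketched in a few lines of Python. This is a minimal illustration, not the prototype described in the talk: the names `AcquisitionRecord`, `acquire`, and `to_line`, the JSON-lines serialization, and the sample payload are all assumptions made for the example. The only point it tries to show is that raw data is wrapped with a source identifier and an instance identifier (here a timestamp) and otherwise left unmodeled.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
import json


@dataclass(frozen=True)
class AcquisitionRecord:
    source_id: str    # which source produced the data, e.g. "gps" (hypothetical)
    instance_id: str  # which instance; here an ISO-8601 timestamp
    source_data: str  # raw payload, kept in its natural source model


def acquire(source_id: str, raw: str, when: datetime) -> AcquisitionRecord:
    """Wrap raw data with sourceID and instanceID without parsing it."""
    return AcquisitionRecord(source_id, when.isoformat(), raw)


def to_line(record: AcquisitionRecord) -> str:
    """Serialize one record as a JSON line (an assumed storage format)."""
    return json.dumps({
        "sourceID": record.source_id,
        "instanceID": record.instance_id,
        "sourceData": record.source_data,
    })


# Made-up sample payload, for illustration only.
rec = acquire("gps", "47.6,-122.3",
              datetime(2011, 6, 10, 6, 18, 26, tzinfo=timezone.utc))
line = to_line(rec)
```

Keeping `source_data` as an opaque string is deliberate in this sketch: it keeps modeling cost at acquisition time close to zero, which is the property the pattern asks for.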

Personal Example
  [Figure: the acquisition model applied to three personal sources: GPS, Outlook, HA]
  GPS: Have been carrying a GPS data logger for 5 months
  HA: Log file from home automation system
  Outlook: Have script that produces when I send mail, to whom, and, if a reply, my response latency

Pattern: Information Production
  Intent: Turn acquired data from digital shoebox into other events and states
  Applicability: Used to transform raw data into information for subsequent processing
  Description: Often requires temporal processing & correlation of acquired data

  Key point: Cleansing often much easier in transformed domain
  Implementation: Requires environment for parsing, grouping, aggregation, and often joining of acquired data

Information Production
  Transform
    Transforms source data into events & states
    Data cleanup, cleansing & imputation
    Quite often cleansing happens in transformed domain
    E.g. Nights on the road vs. @ Home
    Wind up with a set of composable transforms
  Produced information stored in Digital Shoebox or downstream system

Personal Example - GPS
  [Figure: Source, T1, T2, T3, T4, T5]
  Tree of transforms and filters
  Cleansing often happens in transformed domain
  E.g. Where I slept each night
  Can produce higher level information
  [DwellAtHome], [RouteToWork], [DwellAtWork] = Commute to work
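One way to read the composition [DwellAtHome], [RouteToWork], [DwellAtWork] = Commute to work is as a transform over already-produced events. The sketch below assumes that earlier transforms have emitted `Event` records with a kind, a start, and an end; the `Event` type, the `commutes` function, and the sample times are hypothetical and are not taken from the talk's prototype.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    kind: str        # "DwellAtHome", "RouteToWork", or "DwellAtWork"
    start: datetime
    end: datetime


def commutes(events):
    """Yield (leave_time, duration_minutes) for every consecutive
    DwellAtHome -> RouteToWork -> DwellAtWork run, in time order."""
    ordered = sorted(events, key=lambda e: e.start)
    for a, b, c in zip(ordered, ordered[1:], ordered[2:]):
        if (a.kind, b.kind, c.kind) == ("DwellAtHome", "RouteToWork", "DwellAtWork"):
            yield b.start, (b.end - b.start).total_seconds() / 60.0


# Made-up events for illustration only.
sample = [
    Event("DwellAtHome", datetime(2011, 6, 9, 19, 0), datetime(2011, 6, 10, 7, 0)),
    Event("RouteToWork", datetime(2011, 6, 10, 7, 0), datetime(2011, 6, 10, 7, 25)),
    Event("DwellAtWork", datetime(2011, 6, 10, 7, 25), datetime(2011, 6, 10, 17, 0)),
]
result = list(commutes(sample))
```

Each pair in `result` is a leave time and a commute duration, which is the shape of data needed to look at commute duration as a function of leaving time.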

  Using higher level information: Commute duration f(leavingTime)

Commute Time as f(leaveTime)

Event & State Correlation
  Dwell geolocation + Outlook statistics = How much email do I send from home vs. at work?
  2011-06-10 06:18:26, 2011-06-10 06:16:18, 0.04
  2011-06-10 06:21:18, 2011-06-09 08:27:50, 21.89
  2011-06-10 06:24:37, 2011-06-09 07:43:58, 22.68
  2011-06-10 06:26:48, None, 0.00
  2011-06-10 06:29:37, 2011-06-09 06:53:34, 23.60
  2011-06-10 06:34:41, 2011-06-09 12:00:25, 18.57
  2011-06-10 06:39:52, 2011-06-09 17:44:54, 12.92
  2011-06-10 06:43:18, 2011-06-09 14:28:49, 16.24

Pattern: Transform & Load
  Intent: Transform acquired data and produced information to load into traditional systems e.g. Data Warehouse, OLAP cube, etc.

  Applicability: Used to load other systems for production use or other analysis
  Description: Transformations and queries over the Digital Shoebox are used to load downstream systems
    Jobs can be scheduled or invoked by other systems
  Implementation:
    Requires repeatable transform mechanism
    Adapters to downstream systems
    Scheduling mechanism

Transform & Load
  [Figure: Acquisition Model; Information Models; Data Mart; Data Warehouse; CEP System]

Pattern: Model Development
  Intent: Enable sensemaking directly over the Digital Shoebox without extensive up-front modeling
  Applicability: Used to create knowledge from Digital Shoebox contents
  Description: Provide a suite of tools which operate efficiently to enable model discovery, refinement and validation
  Implementation: Requires exploration, visualization, and statistical tools

Model Development Example
  It's clear that I'm an early to bed, early to rise, guy
  When not home, only activity is from the pet-sitter & cleaners
  Marcia gets up after me and likes to read in bed

Pattern: Monitor, Mine, Manage
  Intent: Develop and use generated models to perform active management or intervention

  Applicability: Use for fraud detection, system alerting, intrusion detection, user classification
  Description: Historical data is used to develop a model (algorithm) which is installed in an active system
  Implementation: Requires model generation pattern, active monitoring system [e.g. Complex Event Processing (CEP)]

Pattern: Monitor, Mine, Manage
  1. Monitor & collect data
  2. Mine and create online model
  3. Deploy online model to actively manage
  This is about reducing Time to Action!
  Examples:
    Financial fraud detection and prevention
    Audience intelligence
    Personal: Home & Away settings for home automation

Pattern Map
  [Figure: Digital Shoebox; Model Development; Information Production; Monitor, Mine & Manage; Transform & Load]
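The Monitor, Mine, Manage steps (monitor and collect data, mine an online model, deploy it to actively manage) can be illustrated with the personal Home & Away example. The sketch below is an assumption-laden toy, not the talk's implementation: it assumes the collected history is a list of (hour of day, occupied) observations, and it uses a per-hour majority vote as the mined model. The names `mine_model` and `manage` are hypothetical.

```python
from collections import defaultdict


def mine_model(history):
    """Step 2: learn, per hour of day, whether the home is usually occupied.

    history is an iterable of (hour, occupied) pairs collected in step 1.
    """
    counts = defaultdict(lambda: [0, 0])  # hour -> [away_count, home_count]
    for hour, occupied in history:
        counts[hour][int(occupied)] += 1
    return {hour: home > away for hour, (away, home) in counts.items()}


def manage(model, hour, default=True):
    """Step 3: online decision; falls back to `default` for unseen hours."""
    return "Home" if model.get(hour, default) else "Away"


# Made-up observations for illustration only.
history = [(8, True), (8, True), (8, False),
           (13, False), (13, False), (13, True)]
model = mine_model(history)
setting_morning = manage(model, 8)
setting_midday = manage(model, 13)
```

The mined model is a plain dictionary, so the online step is a single lookup. That separation between offline mining and a cheap online decision is what keeps time to action short in this sketch.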

Tying it Together
  [Figure: the reworked Knowledge Hierarchy (Knowledge Application, Knowledge, Information, Data, Signal) annotated with the patterns: Digital Shoebox; Information Production; Transform & Load; Model Generation; Monitor, Mine, Manage. Vertical axis: Structure / Value. Horizontal axis: Effort / Latency. Annotation: Time to Insight.]

R&D Agenda
  Improved sensemaking tools:
    Visualization
    Temporal and spatial correlation
  Machine learning
    Large Ambient Data can eclipse existing methods
    E.g. language translation
  Robust big-data query processing
    Leverage various degrees of structure and modeling
    General locality awareness
    Checkpoint vs. restart tradeoff
  Emergent intermediate structure: infer and reify dimensions
  Re-stating history
    Re-feed downstream systems sourcing from big-data environment
    Re-think slowly changing dimensions

Wrap up
  Big Data is multi-faceted
  Interesting architecture/design patterns emerging
  Realizing new value requires re-thinking existing system assumptions
  Time to insight/action should be a driving metric
  Complements existing data platform
  Intersection with HPC/TC world
  This is reshaping information management
