Data Warehousing Overview: Issues, Terminology, Products and ...

Data Warehousing Overview: Issues, Terminology, Products and ...

Data Warehousing CPS216 Notes 13 Shivnath Babu Warehousing Growing industry: $8 billion way back in 1998 Range from desktop to huge: Walmart: 900-CPU, 2,700 disk, 23TB Teradata system Lots of buzzwords, hype slice & dice, rollup, MOLAP, pivot, ... 2 Outline What is a data warehouse? Why a warehouse? Models & operations Implementing a warehouse Future directions 3 What is a Warehouse? Collection of diverse data

subject oriented aimed at executive, decision maker often a copy of operational data with value-added data (e.g., summaries, history) integrated time-varying non-volatile more 4 What is a Warehouse? Collection of tools gathering data cleansing, integrating, ... querying, reporting, analysis data mining monitoring, administering warehouse 5 Warehouse Architecture Client

Client Query & Analysis Metadata Warehouse Integration Source Source Source 6 Motivating Examples Forecasting Comparing performance of units Monitoring, detecting fraud Visualization 7 Why a Warehouse? Two Approaches: Query-Driven (Lazy) Warehouse (Eager) ? Source Source

8 Query-Driven Approach Client Client Mediator Wrapper Source Wrapper Wrapper Source Source 9 Advantages of Warehousing High query performance Queries not visible outside warehouse Local processing at sources unaffected Can operate when sources unavailable Can query data not stored in a DBMS Extra information at warehouse Modify, summarize (store aggregates) Add historical information 10

Advantages of Query-Driven No need to copy data less storage no need to purchase data More up-to-date data Query needs can be unknown Only query interface needed at sources May be less draining on sources 11 OLTP vs. OLAP OLTP: On Line Transaction Processing Describes processing at operational sites OLAP: On Line Analytical Processing Describes processing at warehouse 12 OLTP vs. OLAP OLTP

Mostly updates Many small transactions Mb-Gb of data Raw data Clerical users Up-to-date data Consistency, recoverability critical OLAP Mostly reads Queries long, complex Gb-Tb of data Summarized, consolidated data Decision-makers, analysts as users 13 Data Marts Smaller warehouses Spans part of organization

e.g., marketing (customers, products, sales) Do not require enterprise-wide consensus but long term integration problems? 14 Warehouse Models & Operators Data Models relations stars & snowflakes cubes Operators slice & dice roll-up, drill down pivoting other 15

Star product prodId p1 p2 name price bolt 10 nut 5 sale oderId date o100 1/7/97 o102 2/7/97 105 3/8/97 customer custId 53 81 111 store custId 53 53 111 name joe fred sally prodId p1

p2 p1 storeId c1 c1 c3 address 10 main 12 main 80 willow qty 1 2 5 storeId c1 c2 c3 city nyc sfo la amt 12 11 50 city sfo sfo la 16

Star Schema product prodId name price sale orderId date custId prodId storeId qty amt customer custId name address city store storeId city 17 Terms Fact table Dimension tables Measures product prodId name price

sale orderId date custId prodId storeId qty amt customer custId name address city store storeId city 18 Dimension Hierarchies sType store store storeId s5 s7 s9 city cityId sfo sfo la tId t1 t2

t1 mgr joe fred nancy snowflake schema constellations region sType tId t1 t2 city size small large cityId pop sfo 1M la 5M location downtown suburbs regId north south region regId name north cold region south warm region 19

Cube Fact table view: sale prodId p1 p2 p1 p2 storeId c1 c1 c3 c2 Multi-dimensional cube: amt 12 11 50 8 p1 p2 c1 12 11 c2 c3 50 8 dimensions = 2

20 3-D Cube Fact table view: sale prodId p1 p2 p1 p2 p1 p1 storeId c1 c1 c3 c2 c1 c2 Multi-dimensional cube: date 1 1 1 1 2 2 amt 12 11 50 8 44 4

day 2 day 1 p1 p2 c1 p1 12 p2 11 c1 44 c2 4 c2 c3 c3 50 8 dimensions = 3 21 ROLAP vs. MOLAP ROLAP: Relational On-Line Analytical Processing MOLAP: Multi-Dimensional On-Line Analytical Processing 22 Aggregates Add up amounts for day 1

In SQL: SELECT sum(amt) FROM SALE WHERE date = 1 sale prodId storeId p1 c1 p2 c1 p1 c3 p2 c2 p1 c1 p1 c2 date 1 1 1 1 2 2 amt 12 11 50 8 44 4 81 23 Aggregates

Add up amounts by day In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date sale prodId storeId p1 c1 p2 c1 p1 c3 p2 c2 p1 c1 p1 c2 date 1 1 1 1 2 2 amt 12 11 50 8 44 4 ans date 1

2 sum 81 48 24 Another Example Add up amounts by day, product In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date, prodId sale prodId storeId p1 c1 p2 c1 p1 c3 p2 c2 p1 c1 p1 c2 date 1 1 1 1 2 2 amt 12 11

50 8 44 4 sale prodId p1 p2 p1 date 1 1 2 amt 62 19 48 rollup drill-down 25 Aggregates Operators: sum, count, max, min, median, ave Having clause Using dimension hierarchy average by region (within store) maximum by month (within date)

26 Cube Aggregation day 2 day 1 p1 p2 c1 p1 12 p2 11 p1 p2 c1 56 11 c1 44 c2 4 c2 c3 Example: computing sums ... c3 50 8 c2 4

8 rollup drill-down c3 50 sum c1 67 c2 12 c3 50 129 p1 p2 sum 110 19 27 Cube Operators day 2 day 1 p1 p2 c1 p1 12 p2 11

p1 p2 c1 56 11 c1 44 c2 4 c2 c3 ... c3 50 sale(c1,*,*) 8 c2 4 8 c3 50 sale(c2,p2,*) sum c1 67

c2 12 c3 50 129 p1 p2 sum 110 19 sale(*,*,*) 28 Extended Cube c2 4 8 c312 p1 p2 c1 * 12 p1 p2 c1* 44 c1 56 11 c267

4 c2 44 c3 4 50 11 23 8 8 50 * 62 19 81 * day 2 day 1 p1 p2 * c3 50 * 50 48 48

* 110 19 129 sale(*,p2,*) 29 Aggregation Using Hierarchies day 2 day 1 p1 p2 c1 p1 12 p2 11 c1 44 c2 4 c2 c3 c3 50 8 customer region country p1

p2 region A region B 56 54 11 8 (customer c1 in Region A; customers c2, c3 in Region B) 30 Pivoting Fact table view: sale prodId storeId p1 c1 p2 c1 p1 c3 p2 c2 p1 c1 p1 c2 Multi-dimensional cube: date 1 1 1 1 2 2

amt 12 11 50 8 44 4 Pivot turns unique values from one column into unique columns in the output day 2 day 1 p1 p2 c1 p1 12 p2 11 p1 p2 c1 56 11 c1 44 c2 4 c2 c3 c3

50 8 c2 4 8 c3 50 31 Derived Data Derived Warehouse Data indexes aggregates materialized views (next slide) When to update derived data? Incremental vs. refresh 32 Materialized Views sale Define new warehouse relations using SQL expressions prodId storeId p1

c1 p2 c1 p1 c3 p2 c2 p1 c1 p1 c2 joinTb date 1 1 1 1 2 2 prodId p1 p2 p1 p2 p1 p1 amt 12 11 50 8 44 4 name

bolt nut bolt nut bolt bolt product price 10 5 10 5 10 10 storeId c1 c1 c3 c2 c1 c2 date 1 1 1 1 2 2 id p1 p2 amt 12

11 50 8 44 4 name price bolt 10 nut 5 does not exist at any source 33 Processing ROLAP servers vs. MOLAP servers Index Structures What to Materialize? Algorithms Client Client Query & Analysis Metadata Warehouse Integration Source Source

Source 34 ROLAP Server Relational OLAP Server sale prodId p1 p2 p1 date 1 1 2 sum 62 19 48 tools utilities ROLAP server Special indices, tuning; Schema is denormalized relational DBMS

35 MOLAP Server Ci ty Multi-Dimensional OLAP Server B A M.D. tools Product milk soda eggs soap 1 utilities multidimensional server Sales 2 3 4 Date could also sit on relational DBMS

36 Index Structures Traditional Access Methods B-trees, hash tables, R-trees, grids, Popular in Warehouses inverted lists bit map indexes join indexes text indexes 37 Inverted Lists 18 19 20 21 22 23 25 26 age index

r5 r19 r37 r40 rId r4 r18 r19 r34 r35 r36 r5 r41 name age joe 20 fred 20 sally 21 nancy 20 tom 20 pat 25 dave 21 jeff 26 ... 20 23 r4

r18 r34 r35 inverted lists data records 38 Using Inverted Lists Query: Get people with age = 20 and name = fred List for age = 20: r4, r18, r34, r35 List for name = fred: r18, r52 Answer is intersection: r18 39 Bit Maps 20 23 20 21 22 1 1 0 1

1 0 0 0 0 23 25 26 age index bit maps 0 0 1 0 0 0 1 0 1 1 id 1 2 3 4 5 6 7 8 name age joe

20 fred 20 sally 21 nancy 20 tom 20 pat 25 dave 21 jeff 26 ... 18 19 data records 40 Using Bit Maps Query: Get people with age = 20 and name = fred List for age = 20: 1101100000 List for name = fred: 0100000001 Answer is intersection: 010000000000 Good if domain cardinality small Bit vectors can be compressed

41 Join Combine SALE, PRODUCT relations In SQL: SELECT * FROM SALE, PRODUCT WHERE ... sale prodId storeId p1 c1 p2 c1 p1 c3 p2 c2 p1 c1 p1 c2 joinTb date 1 1 1 1 2 2 prodId p1 p2 p1 p2 p1

p1 amt 12 11 50 8 44 4 name bolt nut bolt nut bolt bolt product price 10 5 10 5 10 10 storeId c1 c1 c3 c2 c1 c2 date 1 1

1 1 2 2 id p1 p2 name price bolt 10 nut 5 amt 12 11 50 8 44 4 42 Join Indexes join index product sale id p1 p2 rId r1 r2 r3 r4

r5 r6 name price bolt 10 nut 5 jIndex r1,r3,r5,r6 r2,r4 prodId storeId p1 c1 p2 c1 p1 c3 p2 c2 p1 c1 p1 c2 date 1 1 1 1 2 2 amt 12 11 50

8 44 4 43 What to Materialize? Store in warehouse results useful for common queries Example: total sales day 2 day 1 c1 c2 c3 p1 44 4 p2 c1 c2 c3 p1 12 50 p2 11 8 p1 p2 materialize c1 56

11 c2 4 8 c3 50 ... p1 c1 67 c2 12 c3 50 129 p1 p2 c1 110 19 44 Materialization Factors Type/frequency of queries Query response time Storage cost Update cost

45 Cube Aggregates Lattice 129 c1 67 p1 c2 12 c3 50 city city, product p1 p2 c1 56 11 c2 4 8 all product city, date date product, date

c3 50 day 2 day 1 c1 c2 c3 p1 44 4 p2 c1 c2 c3 p1 12 50 p2 11 8 city, product, date use greedy algorithm to decide what to materialize 46 Dimension Hierarchies all state cities city

c1 c2 state CA NY city 47 Dimension Hierarchies all city city, product product city, date city, product, date date product, date state state, date state, product state, product, date not all arcs shown... 48 Interesting Hierarchy time all years

weeks quarters months day 1 2 3 4 5 6 7 8 week 1 1 1 1 1 1 1 2 month 1 1 1 1 1 1 1 1 quarter 1 1 1

1 1 1 1 1 year 2000 2000 2000 2000 2000 2000 2000 2000 conceptual dimension table days 49

Recently Viewed Presentations

  • Future 802.16 Networks: Challenges and Possibilities IEEE 802.16

    Future 802.16 Networks: Challenges and Possibilities IEEE 802.16

    The mobile web will reach this milestone 18 years after the first text message was sent. Cisco claims that every smartphone added to a network is akin to adding 30 feature phones, while adding a laptop is like adding 450...
  • Fracking Safety & Economics November 9, 2017 America

    Fracking Safety & Economics November 9, 2017 America

    50-200 Ft. 2,500 Ft. 5,000 Ft. ... Economic trends indicate America's tight oil and gas resources remain viable even at oil prices in the 40-50 $/Bbl range. Long term (3-5 yrs from initial production) well production - uncertainties remain. New...
  • peacechem.weebly.com

    peacechem.weebly.com

    VSEPR. The electron pairs try to get as far away as possible to minimize repulsion. VSEPR is based on the number of pairs of valence electrons, both bonded and non-bonded. An non-bonded pair of electrons is referred to as a...
  • Word List: pre

    Word List: pre

    Marie Curie. a scientist who won two Nobel prizes in the early 1900s for her work with radioactivity. physical science. the study of matter and energy. ... constant. factor in an experiment that stays the same. independent variable. independent variable.
  • Establishing a Mexican Breeding Bird Survey program

    Establishing a Mexican Breeding Bird Survey program

    Establishing a Mexican Breeding Bird Survey (BBS) Program Keith L. Pardieck1, Humberto A. Berlanga2, Connie M. Downes3, Bruce G. Peterjohn1, David J. Ziolkowski, Jr.1, and Brian Collins3 1USGS Patuxent Wildlife Research Center, 12100 Beech Forest Road, Laurel, MD 20708-4038 U.S.A.
  • Nuclear Chemistry

    Nuclear Chemistry

    Alpha Decay: occurs when an unstable nucleus emits an alpha particle which is a helium atom - 2 protons, 2 neutrons. Nuclear Chemistry. Beta Decay: an unstable nucleus emits an electron. A neutron changes into a proton and an electron...
  • Accountable in an Unaccountable World - 1 Corinthians 5

    Accountable in an Unaccountable World - 1 Corinthians 5

    Accountable in an Unaccountable World. 1 Corinthians 5. Corinthian Sin. The man's sin: ... "judged" = pass a verdict and sentence to a punishment. removal from fellowship of the church "deliver such a one to Satan" ... keep fervent in...
  • The Body in Motion - BIOLOGY JUNCTION

    The Body in Motion - BIOLOGY JUNCTION

    Kingdom Fungi Kingdom Fungi contains more than 81,000 known species (most terrestrial) All fungi are eukaryotes Cells contain Membrane-enclosed nuclei Mitochondria Other membranous organelles Fungi are heterotrophs Most are decomposers Some are parasites Found universally wherever organic material is available...