StreamApprox: Approximate Computing for Stream Analytics
https://streamapprox.github.io
Do Le Quoc, Ruichuan Chen, Pramod Bhatotia, Christof Fetzer, Volker Hilt, Thorsten Strufe
12/2017

Modern online services
Modern online services -> Stream aggregator -> Stream analytics system -> Useful information

Modern online services
Tension: low latency vs. efficient resource utilization
Approximate computing addresses this tension.

Approximate Computing
Many applications: approximate output is good enough!
The trend of the data is more important than the precise numbers.
E.g.: Google Trends, Bitcoin vs. USD (Sep/2017 to Nov/2017)
[Figure: Google Trends chart, scale 0 to 100 with average, Sep 7 to Nov 30]

Approximate Computing
Idea: to achieve low latency, compute over a subset of the data items instead of the entire data set.
Approximate computing: take a sample -> compute -> approximate output ± error bound

State-of-the-art systems
BlinkDB [EuroSys13]: using pre-existing samples
ApproxHadoop [ASPLOS15]: using multi-stage sampling
Quickr [SIGMOD16]: injecting samplers into the query plan
Not designed for stream analytics

Outline
Motivation
Design
Evaluation

StreamApprox: Overview
Inputs: input data stream, streaming query, query budget
S1, S2, ..., Sn -> Stream aggregator (e.g., Kafka) -> Data stream -> StreamApprox -> Approximate output ± error bound
Query budget:
  Latency/throughput guarantees
  Desired computing resources for query processing
  Desired accuracy

Key idea: Sampling
Simple random sampling (SRS)
Stratified sampling (STS)
[Figure: SRS draws one sample from the whole data set; STS applies SRS to each stratum]
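The difference between SRS and STS can be illustrated with a small sketch. This is plain Python rather than any StreamApprox code; the function names, the two sub-streams "A" and "B", and their sizes are assumptions made up for the illustration.

```python
import random
from collections import defaultdict

def simple_random_sample(items, k, rng):
    """SRS: draw k items uniformly from the whole data set."""
    return rng.sample(items, k)

def stratified_sample(items, k_per_stratum, rng):
    """STS: group items by stratum key, then apply SRS inside each stratum."""
    strata = defaultdict(list)
    for key, value in items:
        strata[key].append((key, value))
    sample = []
    for members in strata.values():
        # A stratum smaller than k_per_stratum is kept in full.
        sample.extend(rng.sample(members, min(k_per_stratum, len(members))))
    return sample

rng = random.Random(42)
# Hypothetical data: sub-stream "A" is large, sub-stream "B" is rare.
data = [("A", rng.gauss(10, 1)) for _ in range(10_000)]
data += [("B", rng.gauss(1000, 10)) for _ in range(10)]

srs = simple_random_sample(data, 20, rng)
sts = stratified_sample(data, 10, rng)
print("strata seen by SRS:", sorted({key for key, _ in srs}))
print("strata seen by STS:", sorted({key for key, _ in sts}))
```

With the same total sample size, STS is guaranteed to include the rare sub-stream, while SRS may overlook it entirely.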

Key idea: Sampling
Reservoir sampling (RS), size of reservoir = k. For each incoming item i (i > k):
  With probability k/i: replace a reservoir item by item i
  With probability 1 - k/i: drop item i

Spark-based Sampling
Spark-based Simple Random Sampling (Spark-based SRS)
Step #1: Assign each item a random number in [0, 1]
Step #2: Sort items based on the assigned value
Step #3: Take out the k smallest items
[Figure: example items tagged with random values, sorted, and the k smallest selected]
Sorting big data is very expensive.

Spark-based Sampling
Spark-based Stratified Sampling (Spark-based STS)
Step #1: Create strata using groupByKey()
Step #2: Apply SRS to each stratum Si
Step #3: Synchronize between worker nodes to select a sample of size k
These steps are very expensive.

StreamApprox: Core idea
Online Adaptive Stratified Reservoir Sampling (OASRS)
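Since OASRS is built on reservoir sampling, a minimal sketch contrasting the sort-based steps with the reservoir step from the earlier slides may help. This is plain Python, not the Spark API; the names `sort_based_sample` and `reservoir_sample` are hypothetical, and the k/i replacement probability is the standard reservoir rule, assumed here.

```python
import random

def sort_based_sample(items, k, rng):
    """Sort-based SRS: tag every item, sort by tag, keep the k smallest."""
    tagged = [(rng.random(), item) for item in items]  # Step 1: random number in [0, 1)
    tagged.sort(key=lambda pair: pair[0])              # Step 2: O(n log n) sort
    return [item for _, item in tagged[:k]]            # Step 3: k smallest tags

def reservoir_sample(stream, k, rng):
    """Reservoir sampling: one pass, O(k) memory, no sort."""
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(item)             # fill the reservoir first
        elif rng.random() < k / i:
            reservoir[rng.randrange(k)] = item  # replace a random slot
        # otherwise item i is dropped (probability 1 - k/i)
    return reservoir

rng = random.Random(7)
print(sort_based_sample(list(range(1000)), 4, rng))
print(reservoir_sample(iter(range(1000)), 4, rng))
```

The sort-based version needs all items in memory before it can sort them, whereas the reservoir version consumes an iterator and never holds more than k items.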

Size of reservoir: k = 4 (RS: reservoir sampling)
S1: RS, weight = #items / k = 8/4
S2: RS, weight = #items / k = 6/4
S3: RS, weight = 1
Easy to parallelize; doesn't need any synchronization between workers.

StreamApprox: Core idea
Worker 1: OASRS, weights = 2, 1.5, 1
Worker 2: OASRS, weights = 1, 2, 1.5
Size of reservoir = 4

Implementation
S1, S2, ..., Sn -> Stream aggregator -> Data stream -> StreamApprox -> Approximate output ± error bound
[Figure: two engine options joined by "OR"; the next two slides show the Spark-based and Flink-based variants]

Implementation: Spark-based StreamApprox
Pipeline: S1, S2, ..., Sn -> Stream aggregator -> Sampling module -> Batch generator -> Batched RDDs -> Spark computation engine -> Output ± error bound
Feedback: Error estimation module -> Refined sampling parameters -> Sampling module

Implementation: Flink-based StreamApprox
Pipeline: S1, S2, ..., Sn -> Stream aggregator -> Sampling module -> Flink computation engine -> Output ± error bound
Feedback: Error estimation module -> Refined sampling parameters -> Sampling module
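The per-stratum reservoirs and weights that the sampling module relies on can be sketched in a single process as follows. This is an illustration under assumptions, not the StreamApprox implementation: the class name `OASRSSketch` and its methods are hypothetical, the reservoir size k is fixed (the adaptive sizing and the error estimation module are omitted), and the weight rule (#items / k when a stratum has more than k items, otherwise 1) is inferred from the 8/4, 6/4, and 1 values on the Core idea slide.

```python
import random

class OASRSSketch:
    """One reservoir of size k per stratum, filled online in a single pass."""

    def __init__(self, k, rng=None):
        self.k = k
        self.rng = rng or random.Random()
        self.reservoirs = {}  # stratum -> sampled values
        self.counts = {}      # stratum -> number of items seen so far

    def add(self, stratum, value):
        count = self.counts.get(stratum, 0) + 1
        self.counts[stratum] = count
        reservoir = self.reservoirs.setdefault(stratum, [])
        if count <= self.k:
            reservoir.append(value)
        elif self.rng.random() < self.k / count:
            reservoir[self.rng.randrange(self.k)] = value

    def weight(self, stratum):
        # Assumed rule: #items / k if the stratum overflowed its reservoir, else 1.
        count = self.counts[stratum]
        return count / self.k if count > self.k else 1.0

    def estimate_sum(self):
        # Each sampled value stands in for `weight` original items.
        return sum(self.weight(s) * sum(r) for s, r in self.reservoirs.items())

# Item counts 8 and 6 mirror the slide; 3 items for S3 is an assumption.
worker = OASRSSketch(k=4, rng=random.Random(0))
for stratum, n_items in [("S1", 8), ("S2", 6), ("S3", 3)]:
    for _ in range(n_items):
        worker.add(stratum, 1.0)

print({s: worker.weight(s) for s in worker.counts})
print(worker.estimate_sum())
```

Because each worker only needs its own counters and reservoirs, per-worker estimates of a sum can simply be added together, which is consistent with the slide's point that no synchronization between workers is required.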

Outline
Motivation
Design
Evaluation

Experimental setup
Evaluation questions:
  Throughput vs. sample size
  Throughput vs. accuracy
  See the paper for more results!
Testbed:
  Cluster: 17 nodes
  Datasets:
    Synthetic: Gaussian distribution and Poisson distribution datasets
    CAIDA network traffic traces; NYC Taxi ride records

Throughput
[Figure: throughput (M) #items/s, higher is better, vs. sampling fraction (%) at 10, 20, 40, 60, 80, for Flink-based StreamApprox, Spark-based StreamApprox, and Spark-based STS]
With sampling fraction < 60%:
  Spark-based StreamApprox: ~2X higher throughput than Spark-based STS
  Flink-based StreamApprox: 1.3X higher throughput than Spark-based StreamApprox

Throughput vs Accuracy
[Figure: throughput (M) #items/s, higher is better, vs. accuracy loss (%) at 0.5 and 1, for Flink-based StreamApprox, Spark-based StreamApprox, and Spark-based STS]
With the same accuracy loss:
  Spark-based StreamApprox: ~1.32X higher throughput than Spark-based STS
  Flink-based StreamApprox: 1.62X higher throughput than Spark-based StreamApprox

Conclusion
StreamApprox: approximate computing for stream analytics
Transparent: supports applications w/ minor code changes
Practical: adaptive execution based on query budget
Efficient: online stratified sampling technique

Thank you!
Details: StreamApprox [Middleware17]
https://streamapprox.github.io
