Dynamic Scheduling for Reduced Energy in Configuration-Subsetted Heterogeneous Multicore Systems
Hammam Alsafrjalani and Ann Gordon-Ross+
Department of Electrical and Computer Engineering, University of Florida, Gainesville, Florida, USA
+ Also affiliated with the NSF Center for High-Performance Reconfigurable Computing
This work was supported by National Science Foundation (NSF) grant CNS-0953447

Introduction and Motivation (2/22)
- Reducing energy in computing devices is a key goal
- An application's hardware requirements significantly impact energy consumption
  - An application's workload can thrash a cache that has an improper size/associativity
- Hardware resources can be specialized for energy efficiency
  - Voltage, clock frequency, cache size/associativity, etc.
- Example application requirements: 2 GHz CPU with a 512 KB cache; 1 GHz CPU with a 64 KB cache; 2+ GHz CPU with a 1024 KB cache
- Domain-similar applications have similar resource requirements
- Hardware resources can be specialized to meet application requirements

Introduction and Motivation (3/22)
- Heterogeneous multicore systems provide specialization
  - Examples: TI OMAP3530 (Cortex A8, Cortex M3, SGX530 GPU, C674x DSP), ARM big.LITTLE (big and LITTLE cores), Intel Atom E6x5C (Intel Atom processor, Altera FPGA)
  - Various hardware resources meet disparate application-domain requirements
- Limitations
  - Fixed, limited number of heterogeneous options (e.g., number of cores)
  - Only coarse-grained specialization
    - Different applications within the same domain may need finer-grained specialization, such as different cache associativities for the same cache size
  - Laborious: designer-expended effort to profile applications to determine their hardware requirements
    - Profiling info: cache miss rate, pipeline stalls, branch miss rate, etc.

Profiling Challenges (4/22)
- Static profiling: known applications, profiling at design time, scheduling to the best core at run time
  - Good optimization potential: the profiling information dictates the best core
  - Requires a priori knowledge of applications
  - Does not react to runtime input, stimuli, environment, etc.
- Dynamic profiling: unknown applications, profiled on a base/default core (a small sketch follows this slide)
  - The application's best core is known for the next execution; the scheduler uses the profiling information during scheduling to select a core
  - No a priori knowledge of applications is needed
  - Flexible: reacts to runtime input, stimuli, environment, etc.
  - Potentially less energy savings if improper cores are selected
  - Incurs profiling overhead

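To make the static/dynamic profiling distinction concrete, below is a minimal sketch of a dynamic-profiling scheduler: an unknown application runs once on the base core while it is profiled, and later executions use the stored profile to pick a core. The helper names, the profile fields, and the working-set-based core choice are illustrative assumptions, not details from the slides.

```python
# Minimal sketch of dynamic profiling (all names and values are illustrative).
# First execution: run on the base/default core and record profiling information.
# Later executions: the stored profile is used to pick the best core.

profiles = {}   # application name -> profiling information gathered at run time

def profile_on_base_core(app_name):
    """Stand-in for running the application on the base core while hardware
    counters collect cache miss rate, pipeline stalls, branch miss rate, etc."""
    return {"cache_miss_rate": 0.02, "working_set_kb": 96}

def best_core_for(profile, cores):
    """One plausible policy: the smallest core whose cache covers the profiled
    working set (the slides only say the profiling information dictates the best core)."""
    fitting = [c for c in cores if c["cache_kb"] >= profile["working_set_kb"]]
    return min(fitting or cores, key=lambda c: c["cache_kb"])

def schedule(app_name, cores, base_core):
    if app_name not in profiles:                             # unknown application
        profiles[app_name] = profile_on_base_core(app_name)  # one-time profiling overhead
        return base_core                                     # best core known for the next run
    return best_core_for(profiles[app_name], cores)          # reacts to observed runtime behavior
```

A static-profiling flow would instead populate the profile table at design time, which is why it cannot react to runtime input or stimuli.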
More Flexibility with Configurable Cores (5/22)
- Cores have configurable parameters
  - Cache size, core frequency and/or voltage, etc.
- Configurations must be tuned
  - Evaluate the application's requirements
  - Determine the best configuration with respect to the design goals
- More flexible
  - More configurations as compared to the total number of cores
  - Finer-grained specialization
- Tuning incurs overhead
[Figure: energy vs. cache size during tuning -- execution starts in the base configuration, and tuning spends execution time searching for the lowest-energy configuration]

Tuning Overhead (6/22)
- If EVERY core offers ALL configurations, tuning overhead is significant
  - Example: quad-core system in which each core has the same 8 configurations -- tuning searches 8 configurations regardless of the core
- Reducing the configurations per core reduces the tuning overhead (see the sketch after this slide)
  - Example: quad-core system in which each core has 2 configurations -- tuning searches only 2 configurations
  - The application must first be scheduled to the core with the best configuration for that application

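A minimal sketch of the tuning search described above, assuming a quad-core with 8 hypothetical (cache size, associativity) configurations: when every core offers all 8, tuning evaluates 8 candidates per application; with 2-configuration subsets per core, it evaluates only 2. The configuration values and the `app_energy` callback are illustrative.

```python
# Sketch of an exhaustive tuning search over a core's configuration subset.
# Configuration values and the energy callback are illustrative, not from the slides.

ALL_CONFIGS = [(size_kb, assoc) for size_kb in (64, 128, 256, 512)
                                for assoc in (1, 2)]        # 8 configurations

CORE_SUBSETS = {                     # subsetted quad-core: 2 configurations per core
    "A": [(64, 1), (64, 2)],
    "B": [(128, 1), (128, 2)],
    "C": [(256, 1), (256, 2)],
    "D": [(512, 1), (512, 2)],
}

def tune(app_energy, configs):
    """Run the application once per candidate configuration and keep the
    lowest-energy one; the tuning overhead grows with len(configs)."""
    return min((app_energy(cfg), cfg) for cfg in configs)

# All-configurations cores: tune(energy_of_app, ALL_CONFIGS) searches 8 candidates.
# Subsetted cores: tune(energy_of_app, CORE_SUBSETS["B"]) searches only 2, but the
# application must first be scheduled to the core whose subset holds its best config.
```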
Summary (7/22)
- Application hardware requirements impact energy consumption
  - Hardware must be specialized to meet application requirements
- Heterogeneous multicore systems provide specialization
  - Require profiling
  - Limited number of heterogeneous options
- Configurable cores provide specialization
  - Greater optimization potential
  - Must limit tuning overhead: configurations can be eliminated/subsetted
  - Disparate application requirements, and thus configurations, must be distributed across cores
  - Potential for core bottlenecks

Problem Definition (8/22)
- Given: disparate application requirements and vast hardware specialization options (growing numbers of applications, heterogeneous cores, and configurable cores)
- Goal: specialized cores for all application requirements while minimizing profiling effort and tuning overhead

Prior Work (9/22)
- Heterogeneous multicore systems
  - Statically schedule applications to cores for reduced energy (Kumar et al.)
  - Statically schedule applications to cores with various cache configurations for reduced cache misses (Silva et al.)
- Configurable-core systems
  - Tune cores after scheduling for configurable issue width, clock rate, dynamic voltage, caches, etc.
  - Reduce core configurations to small subsets of configurations (Viana et al.)
- No prior work holistically considered heterogeneous, subsetted, configurable cores across the heterogeneous multicore, configurable multicore, and configuration design spaces

Our Solution (10/22)
- A heterogeneous and configurable multicore system architecture
  - Domain-specific core configuration subsets
  - Associated scheduling and tuning (SaT) algorithm
- Core heterogeneity: distinct, unchangeable per-core configuration subsets that meet an application domain's hardware requirements
- Core configurability: per-core configurable parameters and parameter values
- SaT algorithm
  - Dynamically profiles the application
  - Based on designer goals (e.g., reduced energy), determines the core with the needed hardware requirements
  - Tunes the core's configurable parameters

Example Heterogeneous Configurable Quad-Cores and SaT (11/22)
- Heterogeneous, configurable multicore platform: cores with 512 KB, 512 KB, 128 KB, and 64 KB caches, tunable among 1-way/32 B line size, 2-way/32 B line size, and 2-way/16 B line size options (a sketch of this platform follows the slide)
- Heterogeneity is defined by domain-specific requirements (e.g., cache size has the most impact on energy, so the cores have various cache sizes)
- Configurability is defined by application-specific requirements (e.g., cache associativity and line size)
- Scheduling and tuning algorithm (SaT): profile applications, determine hardware requirements, schedule to a core, tune the core

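The example platform can be captured as data: each core's cache size is fixed (domain-specific heterogeneity) while its associativity and line size are tunable (application-specific configurability). The cache sizes and tunable options below are the ones shown on the slide, but how the options pair up with individual cores is an assumption.

```python
# Heterogeneous, configurable quad-core platform from the example slide.
# Cache size is fixed per core; associativity and line size are tunable per core.
# The pairing of tunable options to cores is assumed for illustration.

PLATFORM = {
    "core0": {"cache_kb": 512, "tunable": [{"assoc": 1, "line_b": 32},
                                           {"assoc": 2, "line_b": 32}]},
    "core1": {"cache_kb": 512, "tunable": [{"assoc": 2, "line_b": 32},
                                           {"assoc": 2, "line_b": 16}]},  # also profiles apps
    "core2": {"cache_kb": 128, "tunable": [{"assoc": 1, "line_b": 32},
                                           {"assoc": 2, "line_b": 16}]},
    "core3": {"cache_kb": 64,  "tunable": [{"assoc": 1, "line_b": 32},
                                           {"assoc": 2, "line_b": 16}]},
}

def cores_matching(required_cache_kb):
    """Return the cores whose fixed cache size meets a profiled requirement."""
    return [name for name, core in PLATFORM.items()
            if core["cache_kb"] >= required_cache_kb]
```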
Determining Configuration Subsets (12/22)
- Prior work evaluated domain-similar applications
  - The applications had execution/profiling similarity
  - The applications had similar, but not necessarily the same, best configurations
- The design space can be subsetted to domain-specific similar configurations (a sketch follows this slide)
  - A small fraction of the complete design space
  - Still offers the best, or a near-best, configuration for each application
- Accurate subset determination: profile several domain-similar applications
- Three subsets are sufficient to meet the varying domain-specific requirements

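A sketch of the subset-determination step the slide outlines: profile several domain-similar applications against the complete design space, record each application's best configuration, and collect those configurations per domain into a small subset. The data layout and the example values are assumptions.

```python
# Determining per-domain configuration subsets (assumed data layout and values).
from collections import defaultdict

def determine_subsets(best_configs):
    """best_configs: list of (domain, configuration) pairs obtained by profiling
    several domain-similar applications against the complete design space.
    Returns a small, domain-specific subset per domain."""
    subsets = defaultdict(set)
    for domain, config in best_configs:
        subsets[domain].add(config)
    return dict(subsets)

# Example: three domain-similar applications whose best configurations are similar
# but not identical, yielding a three-configuration subset for that domain.
subsets = determine_subsets([
    ("automotive", (64, 2, 32)),   # (cache KB, associativity, line size B) -- illustrative
    ("automotive", (64, 1, 32)),
    ("automotive", (64, 2, 16)),
])
```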
Configurable Cache Architecture (13/22)
- Complete cache design space: 18 configurations
- Three domain-specific configuration subsets
- Quad-core heterogeneous, configurable multicore architecture
  - One core for each domain; the fourth core replicates the core with the largest cache size and serves as an additional profiling core
  - Subset configurations are mapped to cores based on cache size
  - Tuning a core changes its cache line size and associativity

Software Support (14/22)
- SaT is integrated into the OS scheduler
- Process control block (PCB)
  - Used by a typical OS scheduler for application execution status, etc.
  - Typical process information: process state, process number, process counter, registers, ...
- SaT's PCB additions (a sketch follows this slide)
  - The profiling information necessary to make the scheduling and tuning decision
  - Additional per-core configuration information used by SaT: Energy(core, config.) and Ex Time(core, config.)
  - The additions increase the PCB size by m bits per application

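A sketch of the PCB with SaT's additions, using a Python dataclass as a stand-in for the kernel structure. The typical fields and the Energy(core, config.)/Ex Time(core, config.) tables follow the slide; the concrete types and the best_config field are assumptions.

```python
# Sketch of a process control block (PCB) with SaT's additions.
# Field names mirror the slide; concrete types are assumptions.

from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

Config = Tuple[int, int, int]   # e.g., (cache KB, associativity, line size B) -- illustrative

@dataclass
class PCB:
    # Typical process information used by the OS scheduler
    process_state: str = "ready"
    process_number: int = 0
    process_counter: int = 0
    registers: Dict[str, int] = field(default_factory=dict)

    # SaT additions (m extra bits per application): profiling information plus
    # per-(core, configuration) energy and execution-time tables
    profiling_info: Dict[str, float] = field(default_factory=dict)          # miss rate, stalls, ...
    energy: Dict[Tuple[str, Config], float] = field(default_factory=dict)   # Energy(core, config.)
    ex_time: Dict[Tuple[str, Config], float] = field(default_factory=dict)  # Ex Time(core, config.)
    best_config: Optional[Config] = None    # filled in once tuning has found it
```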
Scheduling and Tuning Algorithm (SaT) (15/22)
- Scheduling stage (a sketch of both stages follows this slide)
  - Applications wait in the ready queue
  - SaT first profiles an unprofiled application to determine its best domain/core; the profiling information is saved in the PCB
  - If the application is already profiled, SaT attempts scheduling
    - First checks whether the best core is idle
    - If the best core is busy, checks for an idle non-best core
    - Based on condition (1), SaT either schedules the application to the non-best core or leaves it in the queue
- Condition (1) is the energy-advantageous scheduling decision: using the Energy(core, config.) and Ex Time(core, config.) values in the PCB, SaT schedules to an idle non-best core only when executing there consumes less energy than waiting for, and executing on, the busy best core
- Tuning stage (begins after the scheduling stage completes)
  - If the best configuration is in the PCB, SaT tunes the core directly to that configuration
  - If not, SaT selects an unused configuration and stores the resulting information in the PCB
[Flowchart: ready queue -> application profiled? (no: profile on an idle profiling core, or leave the application in the queue) (yes: best core idle? -> execute on the best core; otherwise, if a non-best core is idle, apply the energy-advantageous scheduling decision to execute on the non-best core or leave the application in the queue); tuning stage: best configuration known? (yes: tune to the best configuration; no: tune to an unused configuration)]

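Putting the two stages together, the sketch below follows the decision structure on this slide: profile unprofiled applications on an idle profiling core, otherwise schedule to the best core if it is idle, fall back to an idle non-best core only when the (paraphrased) condition (1) says it is energy advantageous, and then tune to the best or an unused configuration. The Core class, its attributes, the best-core policy, and the exact form of condition (1) are assumptions.

```python
# Compact sketch of SaT's scheduling and tuning stages (helper details assumed).
from dataclasses import dataclass

@dataclass
class Core:
    name: str
    subset: list                             # this core's configuration subset
    is_profiling_core: bool = False
    idle: bool = True
    idle_energy_while_waiting: float = 0.0   # stand-in used by condition (1)

    def profile(self, app):                  # stand-in for hardware profiling
        return {"profiled": str(app)}

    def set_config(self, cfg):               # stand-in for reconfiguring the cache
        self.config = cfg

def min_energy_on(pcb, core_name):
    """Lowest recorded Energy(core, config.) for this application on a given core."""
    return min((e for (name, _), e in pcb.energy.items() if name == core_name),
               default=float("inf"))

def energy_advantageous(pcb, best, other):
    """Condition (1), paraphrased: run now on `other` only if that costs less energy
    than waiting for `best` to free up and running there."""
    return min_energy_on(pcb, other.name) < (min_energy_on(pcb, best.name)
                                             + best.idle_energy_while_waiting)

def sat_schedule(app, pcb, cores, ready_queue):
    """Scheduling stage: returns the core to execute on, or None if the app must wait."""
    if not pcb.profiling_info:                              # application not yet profiled
        profiler = next((c for c in cores if c.is_profiling_core and c.idle), None)
        if profiler is None:
            ready_queue.append(app)                         # leave application in the queue
            return None
        pcb.profiling_info = profiler.profile(app)          # profiling info saved in the PCB
        return profiler
    # Assumed policy: the core with the lowest recorded energy for this application.
    best = min(cores, key=lambda c: min_energy_on(pcb, c.name))
    if best.idle:
        return best                                         # execute on the best core
    idle_rest = [c for c in cores if c.idle and c is not best]
    if idle_rest and energy_advantageous(pcb, best, idle_rest[0]):
        return idle_rest[0]                                 # execute on a non-best core
    ready_queue.append(app)                                 # otherwise keep waiting
    return None

def sat_tune(pcb, core):
    """Tuning stage: reuse the best configuration if known, else try an unused one."""
    if pcb.best_config is not None:
        core.set_config(pcb.best_config)                    # tune directly to the best config
    else:
        cfg = next((c for c in core.subset if (core.name, c) not in pcb.energy),
                   core.subset[0])
        core.set_config(cfg)    # energy/time measured during execution go back into the PCB
```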
Experimental Setup (16/22)
- Software setup
  - Diverse benchmark of 36 applications from EEMBC Automotive, MediaBench, and Motorola's Powerstone
  - Replicated persistent application behavior
  - Random queue of 1,000 applications drawn from the benchmark applications, generated using a discrete uniform distribution
  - Arrival times: normal distribution centered at the mean, with one standard deviation from the average execution time
- Hardware setup
  - Quad-core platform with private level-1 data/instruction caches
  - Used SimpleScalar for cache statistics; CACTI and the energy model below to obtain energy values
- Cache hierarchy energy model for the level-one instruction and data caches (transcribed in code after this slide):
  - E(total) = E(sta) + E(dyn)
  - E(dyn) = cache_hits * E(hit) + cache_misses * E(miss)
  - E(miss) = E(off_chip_access) + miss_cycles * E(CPU_stall) + E(cache_fill)
  - miss_cycles = cache_misses * miss_latency + (cache_misses * (line_size / 16)) * memory_bandwidth
  - E(sta) = total_cycles * E(static_per_cycle)
  - E(static_per_cycle) = E(per_Kbyte) * cache_size_in_Kbytes
  - E(per_Kbyte) = (E(dyn_of_base_cache) * 10%) / base_cache_size_in_Kbytes

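The energy model on this slide translates directly into code. The functions below are a transcription of the equations as written; the per-access and per-cycle energy constants (E(hit), E(off_chip_access), E(CPU_stall), E(cache_fill), the base cache's dynamic energy) are parameters to be filled in from CACTI and the memory system.

```python
# Direct transcription of the slide's level-one cache energy model.
# Energy constants are left as named parameters (obtained from CACTI / the memory system).

def miss_cycles(cache_misses, miss_latency, line_size, memory_bandwidth):
    return (cache_misses * miss_latency
            + (cache_misses * (line_size / 16)) * memory_bandwidth)

def miss_energy(e_off_chip_access, n_miss_cycles, e_cpu_stall, e_cache_fill):
    return e_off_chip_access + n_miss_cycles * e_cpu_stall + e_cache_fill

def dynamic_energy(cache_hits, cache_misses, e_hit, e_miss):
    return cache_hits * e_hit + cache_misses * e_miss

def static_energy(total_cycles, e_dyn_of_base_cache, base_cache_size_kb, cache_size_kb):
    e_per_kbyte = (e_dyn_of_base_cache * 0.10) / base_cache_size_kb
    e_static_per_cycle = e_per_kbyte * cache_size_kb
    return total_cycles * e_static_per_cycle

def total_energy(e_static, e_dynamic):
    return e_static + e_dynamic
```

Note that, per the model, the static energy per cycle scales 10% of the base cache's dynamic energy by the configured cache size in kilobytes, so static energy grows with larger cache configurations.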
Evaluation Methodology (17/22)
- Evaluated a base system against three proposed systems
  - Base system: quad-core with a fixed configuration representing a good, average configuration across all applications
  - The three systems have similar core configurations but distinct scheduling algorithms
- System-1 and System-2: a priori profiling
  - System-1 (energy conservative): applications must wait for their best core; serves as a near-optimal system for comparison purposes and provides insight into wasted idle energy
  - System-2 (performance centric): maximizes throughput and core utilization using a round-robin scheduling algorithm; provides insight into the tradeoffs of scheduling decisions
- System-3: dynamic profiling
  - Uses the SaT scheduling algorithm with energy and performance criteria to schedule applications (e.g., condition (1))
  - Provides insight into SaT and the tradeoffs between performance and energy

Results: Base System vs. Systems-1, -2, -3 (18/22)
[Figures: dynamic, idle, and total energy of System 1 (energy conservative), System 2 (performance centric), and System 3 (SaT), normalized to the base system, for the data cache and the instruction cache]
- Data cache: System-1 has lower dynamic energy but higher idle energy, and lower total energy; System-2 has lower idle energy but is not guaranteed lower dynamic/total energy; System-3 has lower dynamic energy, greater idle energy, and lower total energy
- Key observations:
  1) Wasted idle energy can overcome dynamic energy savings, especially in smaller technologies
  2) Energy savings are uncertain for the performance-centric system
  3) System-3 saves total energy despite its increased idle energy

Results: System-3 vs. Systems-1, -2 (19/22)
[Figures: dynamic, idle, and total energy normalized to the base system for the data cache and the instruction cache]
- Data cache: System-3 has lower total energy than System-1 and System-2
- Instruction cache: System-3 consumes only 4.8% more total energy than System-1 and less energy than System-2
- No a priori knowledge of applications is required

Profiling and Tuning Overhead Evaluation (20/22)
- Measured energy savings with and without a priori knowledge of application profiling information
  - System 3-A: with a priori knowledge of the profiling information
  - System 3-B: without a priori knowledge of the profiling information
- Profiling energy overhead: 1.8% for the data cache and 0.9% for the instruction cache
- The overhead is amortized due to the persistent nature of the applications
[Figure: energy consumption of System 3-B normalized to System 3-A for the instruction and data caches]

Conclusions (21/22)
- Heterogeneous and configurable multicore systems provide hardware specialization for disparate application requirements
- Leveraged application-domain-specific configuration subsets
- Associated scheduling and tuning (SaT) algorithm
  - Dynamically profiles applications
  - Determines the best core
  - Tunes the core's configuration
- Average energy savings of 31.6% and 17.0% for the data and instruction caches, respectively
- Only 1.8% and 0.9% profiling and tuning overhead

Questions (22/22)
