Dependability where the Mobile World Meets the Enterprise World
Amiya Kumar Maji
Advisor: Prof. Saurabh Bagchi
Feb 27, 2015
School of Electrical and Computer Engineering, Purdue University, West Lafayette, Indiana
Slide 1
Introduction
Large Scale Internet Mobility
End-to-end services need dependability of both components Slide 2 Summary of Contributions Dependability of Smartphones Study of failures in Android and Symbian. Analyze location of failure manifestation, bug fixes, customizability related failures. ISSRE2010 Evaluation of robustness of Android ICC. Designed and implemented our testing tool JarJarBinks, evaluated and analyzed crashes, suggestions for improving robustness. DSN2012 Dependability of Cloud Applications Evaluated impact of performance interference in public (Amazon
EC2) and private clouds. Mitigate performance interference by intelligent application reconfiguration. MW2014 Mitigate interference by two-level reconfiguration of web server clusters. Improves the previous work by making the controller more agile and effective. ICAC2015 (submitted) Slide 3 Publications A. K. Maji, K. Hao, S. Sultana, S. Bagchi. Characterizing Failures in Mobile OSes: A Case Study with Android and Symbian, in 21st International Symposium on Software Reliability Engineering, ISSRE 2010, November 1-4, 2010, San Jose, California. A. K. Maji, F. A. Arshad, S. Bagchi, J. S. Rellermeyer. An Empirical Study of the Robustness of Inter-component Communication in Android, in 42nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2012, June 25-28, 2012, Boston, MA.
A. K. Maji, S. Mitra, B. Zhou, S. Bagchi, A. Verma. Mitigating Interference in Cloud Services by Middleware Reconfiguration, in 15th International Middleware Conference, MIDDLEWARE 2014, December 8-12, 2014, Bordeaux, France. Provisional application for patent A. K. Maji, S. Mitra, S. Bagchi. ICE: An Integrated Configuration Engine for Interference Mitigation in Cloud Services, in 12th International Conference on Autonomic Computing, ICAC 2015, July 7-10, 2015, Grenoble, France. (Under review) Slide 4 Agenda Introduction Contributions Prelim Review: Dependability of Smartphones
Study of failures in Android and Symbian Robustness testing of Android ICC Dependability of Cloud Applications IC2: Mitigating interference by middleware reconfiguration ICE: Two-level configuration engine for WS clusters Directions for Future Research Summary Slide 5 Part I (Prelim Review) Dependability of Smartphones Slide 6
Study of Failures in Android and Symbian
Analyzed 628 bugs in Android and 153 bugs in Symbian (Oct 2008-Nov 2009).
Most bugs (> 90%) are permanent in nature.
The majority of Android bugs are in the middleware; fewer bugs are in the Kernel layer.
Both platforms had a significant number of bugs in the Dev Tools, Web, Multimedia, and Build segments.
Analyzed 233 bug fixes in Android. Presented a categorization of bug fixes.
Only 22% of fixes required major code changes (> 10 lines).
Customizability in Android has its cost (more bugs).
Question from Preliminary Exam
Q. How does Android compare against Linux in terms of bug density?
A. Bug density of Android (< 2.5*10^-4) is lower than that reported for Windows (2.66*10^-4). Linux has a bug density of 1*10^-3 /LOC in Kernel version 2.6.30.
[Alhazmi 2007] [Palix et al., Faults in Linux: Ten Years Later, ASPLOS 2011]
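To put the quoted densities side by side, the short Python sketch below converts them to bugs per KLOC and computes each ratio against the Android figure. It assumes all three numbers are expressed in bugs per line of code (the slide states the unit only for Linux), and it treats the Android value as an upper bound because the slide gives only "< 2.5*10^-4". The function and dictionary names are illustrative.

```python
# Bug densities as quoted on the slide, assumed to be bugs per line of code.
DENSITY_PER_LOC = {
    "Android (upper bound)": 2.5e-4,   # the slide states "< 2.5*10^-4"
    "Windows": 2.66e-4,
    "Linux kernel 2.6.30": 1e-3,
}


def per_kloc(density_per_loc: float) -> float:
    """Convert bugs per line of code to bugs per thousand lines (KLOC)."""
    return density_per_loc * 1000.0


def ratio_to_android(name: str) -> float:
    """Density of `name` divided by the Android upper bound."""
    return DENSITY_PER_LOC[name] / DENSITY_PER_LOC["Android (upper bound)"]


if __name__ == "__main__":
    for name, density in DENSITY_PER_LOC.items():
        print(f"{name}: {per_kloc(density):.3f} bugs/KLOC, "
              f"{ratio_to_android(name):.2f}x the Android bound")
```

Because the Android number is an upper bound, the computed ratios are lower bounds: on these figures the Linux kernel density is at least about four times the Android one.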
Collaborators: Kangli Hao, Salmin Sultana
Slide 7
Robustness Evaluation of Android ICC
Presented JarJarBinks (JJB), a tool for evaluating ICC robustness in Android.
JJB tests the Intent-handling capabilities of Android components by sending a large number of Semi-valid Intents (Explicit or Implicit).
More than 6 million Intents were sent to 800+ Android components over a week.
We found that ~10% of Activities crashed with Semi-valid Intents.
All crashes manifest as Exceptions in the runtime system. NullPointerException (NPE) is the most prevalent in both Android 2.2 and 4.0.
Exception handling improved from 2.2 to 4.0 but is still a major concern.
Similar results with Implicit Intents. Components often crashed with valid Intents (since extra data is not captured in the Intent-filter definition).
Collaborators: Fahad Arshad (Purdue), Jan S. Rellermeyer (IBM Research, Austin)
Slide 8
System Crash from User-level Application
3 Activities crashed Android-Runtime
Slide 9
Recommendation for Improving ICC Robustness
A. Intent Sub-typing
    class CallIntent extends Intent {
        String action = "ACTION_DIAL";
        telUri data;
        ComponentName cmp;
        getAction() { };
        setData() { };
        getData() { };
        ..
    }
B. Checking input constraints
    Static (Java Annotations)
    Dynamic (Runtime)
C. Full input-validation
    Use domain specific languages (e.g. WSDL)
Make Intent/Intent-filter descriptions more expressive Slide 10 Agenda Introduction Contributions Dependability of Smartphones Study of failures in Android and Symbian Robustness testing of Android ICC Dependability of Cloud Applications IC2: Mitigating interference by middleware reconfiguration ICE: Two-level configuration engine for WS clusters Directions for Future Research Summary
Slide 11
Part II
Dependability of Cloud Applications
Collaborators: Subrata Mitra, Bowen Zhou (Purdue), Akshat Verma (IBM Research, Delhi)
Slide 12
Running Web Applications in the Cloud
[Figure: WS1, WS2, DB1, DB2, and App1 .. Appm run in VMs (VM1, VM2, .. VMn and VM1, VM2, .. VMm) on the Hypervisors of Host1 and Host2, with shared Network and Storage]
Slide 13
Imperfect Performance Isolation due to Shared Hardware Resources
[Figure: Processor with cores P1 and P2, each with an L1 Cache, sharing the L2 Cache (last level)]
Other shared resources: Memory bandwidth, Network/IO, Translation Lookaside Buffer (TLB)
Multi-core Cache Sharing
Slide 14
Mitigating Performance Interference in Clouds
Performance Interference: the performance of one VM suffers due to the activity of another co-located VM.
Why does it happen? Low-level hardware resources are not partitioned well. Contention for cache, memory bandwidth, and network can degrade performance.
Our experiments with Amazon EC2: performance of web servers can suffer drastically during interference.
Cloudsuite application benchmark, m1.large VM instances (2 cores, 7.5GB), run for 100 hours.
[Figure: panels EC2 and Private Cloud; annotations: Tail ~ 4 X median, Tail ~ 55 X median]
Slide 15
Remediation Techniques
Traditional techniques for remediation:
Better VM placement [Paragon ASPLOS2013]
Hypervisor scheduling [QCloud Eurosys 2010]
Dynamic live migration [Deepdive ATC2013]
These require changes in the hypervisor and are not feasible in a public cloud.
Our approach
Requirements: need user-level control; fast response during interference.
Key idea: reconfigure the application to handle the change in operating environment (interference).
IC2: Interference-aware Cloud application Configuration
Slide 16
Solution Overview Slide 17 IC2: Agenda Performance Interference in Cloud Our approach
Solution Overview
Interference vs. Middleware Parameters
Interference Detection
Configuration Controller
IC2 in Operation
Key Results
Slide 18
Interference vs. Middleware Parameters
Setup
[Figure: Server 1, Server 2, Server 3; labels: Web Server, Database, Clients, Interference, KVM, KVM]
Servers are Poweredge T320 servers, Xeon E5-2440 processor, 6(12) cores, 16GB Memory
Application: Cloudsuite (Olio, social media calendar)
Middlewares: Apache + Php-fpm
Slide 19
Interference vs. Middleware Parameters
Setup
Middleware Parameters
Thread-pool parameters: Apache MaxClients (MXC); Php-fpm pm.max_children (PhpMaxChildren, PHP)
Timeout parameter: Apache KeepaliveTimeout (KAT)
Interference
Dcopy from BLAS (cache r+w)
LLCProbe from Ristenpart CCS12 (cache r)
Varying sizes of Dcopy to create different levels of contention
Slide 20
Choice of Optimal Apache Parameters
Optimal MXC changes with interference
Optimal KAT changes with interference
Depends on degree of interference
Need dynamic reconfiguration
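The Dcopy-style interference used in the setup above can be approximated with a small memory-copy load generator. The Python sketch below is only a stand-in under stated assumptions: it is not the BLAS dcopy routine (which copies vectors of doubles), the function name copy_load and the example sizes are hypothetical, and interpreter overhead means the contention it produces will differ from a native implementation. It illustrates the idea that the buffer size is the knob controlling how much of the cache hierarchy the copy touches.

```python
import time


def copy_load(buffer_bytes: int, duration_s: float) -> int:
    """Repeatedly copy a buffer of `buffer_bytes` bytes for about `duration_s` seconds.

    Returns the number of completed copies. Each iteration reads the whole
    source buffer and writes the whole destination buffer, so larger buffers
    spill out of more cache levels and load the shared cache and memory
    bandwidth more heavily.
    """
    src = bytearray(buffer_bytes)
    dst = bytearray(buffer_bytes)
    copies = 0
    deadline = time.monotonic() + duration_s
    while True:
        dst[:] = src  # same-length slice assignment: bulk copy, no reallocation
        copies += 1
        if time.monotonic() >= deadline:
            break
    return copies


if __name__ == "__main__":
    # Arbitrary example sizes; not the sizes used in the experiments.
    for size_mb in (1, 16):
        n = copy_load(size_mb * 1024 * 1024, duration_s=0.2)
        print(f"{size_mb} MB buffer: {n} copies in about 0.2 s")
```

Running one such process in a co-located VM while sweeping the buffer size gives a rough way to produce different levels of contention; a read-only probe in the spirit of LLCProbe would need a different access pattern and is not shown here.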
Slide 21
Parameter Dependency
Parameter dependency changes with interference.
KAT = MXC / (#new_connections/sec) is no longer valid during interference.
With interference, need smaller MXC and larger KAT.
Slide 22
Observations
Optimal configuration values with interference: optimal MXC decreases; optimal KAT and PHP increase.
Server capacity with interference: CPU saturates sooner with interference.
IdleCPU with different interferences (MXC=1100):
    No-Intf        17%
    Dcopy-15MB      7%
    Dcopy-1.5GB     1%
Lots of cache misses. CPI increases.
Slide 23
Agenda: IC2
Performance Interference in Cloud Our approach Solution Overview Interference vs. Middleware Parameters Interference Detection Configuration Controller IC2 in Operation Key Results
Slide 24
Solution Overview
Questions that we answer:
How to detect interference?
Which parameters to reconfigure during interference?
How to determine new parameter values?
Slide 25
IC2 workflow: Interference Detection
Use a Decision Tree classifier.
In EC2, use system and application metrics to detect interference.
Load per operation (LPO) is a key indicator.
Challenge: capture metric variations with configuration changes.
More details on the Decision Tree in the paper.
Slide 26
State Manager
In EC2, use buffer states to deal with transient interference/noisy data.
Reconfigure only after two successive periods under interference.
Also masks classifier errors.
Slide 27
Configuration Controller
Choice of parameter driven by a knowledge base: created from the empirical results shown earlier; can also be created by expert administrators.
Our heuristic:
Decrease MXC based on the proportional increase in LPO.
Increase KAT based on the proportional increase in response time.
For PHP use two constant values (no-interference, interference).
Implementation
Modified Apache to handle graceful parameter updates. Called httpd-online: https://github.com/amaji/httpd-online-2.4.3
Slide 28
Agenda
Performance Interference in Cloud Our approach Interference vs. Middleware Parameters Solution Overview Interference Detection Configuration Controller IC2 in Operation Key Results Conclusion Slide 29 IC2 in Operation Setup EC2 m1.large VMs Web server co-located with interference VM
Periodic interference of varying intensity and type (LLCProbe, Dcopy)
Private testbed: VMs configured to match EC2 specifications
Metrics to consider: improvement in response time during interference; detection latency; detection accuracy
Slide 30
IC2 Improves Response Time
[Figure annotations: Httpd-online reduces overhead; New values]
Effects of interference last longer in EC2.
The default Apache distribution has a high reconfiguration overhead. Httpd-online solves this.
Slide 31
Results
IC2 improved response time by up to 40% in the private testbed and up to 29% in EC2 during interference.
Median interference detection latency: 15 sec in the private testbed, 20 sec in the EC2 testbed.
Classifier accuracy: interference detection showed 89% recall and 73% precision.
The majority of misclassifications are due to Interference and No-interference being detected as Transient.
Our labeling does not account for ambient interference.
Slide 32
Summary: IC2
Interference causes severe performance degradation in the cloud.
Optimal application configurations change during interference.
Web services can mitigate the effects of interference by reconfiguration.
We presented the design and implementation of IC2, which reconfigures web servers during interference.
Our evaluations showed up to 40% reduction of response time in the private testbed and up to 29% reduction in EC2.
Slide 33
Agenda
Introduction
Contributions
Review: Dependability of Smartphones
Study of failures in Android and Symbian
Robustness testing of Android ICC
Dependability of Cloud Applications
IC2: Mitigating interference by middleware reconfiguration
ICE: Two-level configuration engine for WS clusters
Directions for Future Research Summary Slide 34 ICE: An Integrated Configuration Engine for Interference Mitigation Motivation IC2 improves response time by configuring WS parameters WS reconfiguration is costly and limited Use residual capacity in a WS cluster efficiently Objectives Make reconfiguration (interference mitigation) faster Make existing load-balancers interference-aware Get better response time during interference (than IC2)
We use HAProxy as our baseline load-balancer Slide 35 ICE Overview Two-level reconfiguration 1. Update load balancer weight Less overhead. More agile. 2. Update Middleware parameters Only for long interferences. Reduces overhead of idle threads. Slide 36
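The two-level scheme above can be sketched as a single decision function. This is a minimal illustration under stated assumptions, not the ICE implementation: the constants WEIGHT_REDUCTION and LONG_INTERFERENCE_PERIODS, the ServerState type, and the function reconfigure are all hypothetical, and the level-2 formulas are just one reading of the IC2 heuristic described earlier (decrease MXC in proportion to the LPO increase, increase KAT in proportion to the response-time increase). How the resulting values are pushed to the load balancer and to Apache is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerState:
    lb_weight: int            # weight of this web server in the load balancer
    max_clients: int          # Apache MaxClients (MXC)
    keepalive_timeout: float  # Apache KeepaliveTimeout (KAT), in seconds


# Hypothetical tuning constants; the slides do not give these values.
WEIGHT_REDUCTION = 0.5          # level 1: fraction of the weight kept under interference
LONG_INTERFERENCE_PERIODS = 3   # level 2 starts after this many consecutive periods


def reconfigure(baseline: ServerState, periods_under_interference: int,
                lpo_ratio: float, rt_ratio: float) -> ServerState:
    """Return the configuration to apply in the current monitoring period.

    periods_under_interference: consecutive periods with interference detected
        (0 means no interference).
    lpo_ratio: current load per operation divided by its no-interference value.
    rt_ratio: current response time divided by its no-interference value.
    """
    if periods_under_interference == 0:
        return baseline  # no interference: run with the baseline configuration

    # Level 1 (cheap, immediate): send less traffic to the affected server.
    weight = max(1, int(baseline.lb_weight * WEIGHT_REDUCTION))
    if periods_under_interference < LONG_INTERFERENCE_PERIODS:
        return ServerState(weight, baseline.max_clients, baseline.keepalive_timeout)

    # Level 2 (costlier, long interference only): shrink the thread pool and
    # lengthen the keep-alive timeout, never moving past the baseline values.
    mxc = max(1, int(baseline.max_clients / max(lpo_ratio, 1.0)))
    kat = baseline.keepalive_timeout * max(rt_ratio, 1.0)
    return ServerState(weight, mxc, kat)


if __name__ == "__main__":
    base = ServerState(lb_weight=100, max_clients=1100, keepalive_timeout=5.0)
    print(reconfigure(base, 1, lpo_ratio=2.0, rt_ratio=1.5))  # level 1 only
    print(reconfigure(base, 3, lpo_ratio=2.0, rt_ratio=1.5))  # levels 1 and 2
```

Keeping the middleware change behind a persistence threshold reflects the slide's point that weight updates are the agile, low-overhead action, while web server reconfiguration is reserved for long interferences.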
ICE Design
We use hardware counters for interference detection: faster detection; hypervisor access is not required if counters are virtualized.
Slide 37
ICE: Load Balancer Reconfiguration
Objective: keep each WS VM's CPU utilization below a threshold Uthres.
If the predicted CPU utilization is above the threshold, find a new request rate such that it goes below the threshold.
Request rate (RPS) determines the server weight value in the load balancer configuration.
Use the following empirical function for load estimation:
[Equation not recoverable from the slide; labels: Predicted Util, Past Util, CPI, RPS, Indicator of Interference]
Slide 38
Evaluation
Experimental Setup
Cloudsuite benchmark with different interferences
We look at ICE with two different load balancer scheduling policies:
Weighted Round Robin (WRR or simply RR): shows the performance of a static configuration.
Weighted Least Connection (WLC or simply LC): shows the performance of an out-of-box dynamic load balancer.
Slide 39
Response Time
[Figure: response time panels for Least Connection (LC) and Round Robin (RR); axis marks 200ms, 400ms]
ICE improves response time in both RR and LC.
LC (out-of-box) reduces the effect of interference significantly, but occasional spikes remain.
ICE reduces the frequency of these spikes.
Slide 40
Results
ICE improves median response time by up to 94% compared to a static configuration (RR).
ICE improves median response time by up to 39% compared to a dynamic load balancer (LC).
Median interference detection latency: 3 sec using ICE (15-20 sec for IC2).
Slide 41
ICE: Summary
The effect of interference can be mitigated by reducing load on the affected VM.
We presented ICE for two-level configuration in WS clusters.
ICE improves median RT by up to 94% compared to a static configuration and up to 39% compared to a dynamic out-of-box load balancer.
Median interference detection latency: 3s.
Slide 42
Agenda
Introduction Contributions Review: Dependability of Smartphones Study of failures in Android and Symbian Robustness testing of Android ICC Dependability of Cloud Applications IC2: Mitigating interference by middleware reconfiguration ICE: Two-level configuration engine for WS clusters Directions for Future Research Summary Slide 43 Directions for Future Research
Reliability with software evolution in Android Enhance JJB by instrumenting ActivityManager IC2: Automated generation of KB How to find which parameters to reconfigure in unknown applications? ICE: Handling long-lasting sessions. Move some sessions to other servers during interference. Slide 44 Summary of Contributions Presented failure characterization of Android and Symbian Robustness testing of Android ICC Designed and implemented JarJarBinks
Analysis of crashes Suggestions for robust ICC Mitigating interference in clouds Presented two solutions for handling interference without hypervisor modification IC2: mitigates interference by middleware reconfiguration ICE: mitigates interference by load-balancer and WS reconfiguration Slide 45 Publications A. K. Maji, K. Hao, S. Sultana, S. Bagchi. Characterizing Failures in Mobile OSes: A Case Study with Android and Symbian, in 21st International Symposium on Software Reliability Engineering, ISSRE 2010, November 1-4, 2010, San Jose, California. [*49]
A. K. Maji, F. A. Arshad, S. Bagchi, J. S. Rellermeyer. An Empirical Study of the Robustness of Inter-component Communication in Android, in 42nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2012, June 25-28, 2012, Boston, MA. [*23] A. K. Maji, S. Mitra, B. Zhou, S. Bagchi, A. Verma. Mitigating Interference in Cloud Services by Middleware Reconfiguration, in 15th International Middleware Conference, MIDDLEWARE 2014, December 8-12, 2014, Bordeaux, France. A. K. Maji, S. Mitra, S. Bagchi. ICE: An Integrated Configuration Engine for Interference Mitigation in Cloud Services, in 12th International Conference on Autonomic Computing, ICAC 2015, July 7-10, 2015, Grenoble, France. (Under review) [*] is Google Scholar Citations Slide 46 Questions
Slide 47 Acknowledgements Prof. Saurabh Bagchi Committee members Collaborators: Akshat Verma (IBM Research, MakeMyTrip)
Jan S. Rellermeyer (IBM Research) Subrata Mitra (Purdue University) Fahad Arshad (Purdue University) Bowen Zhou (Purdue University) Kangli Hao (Purdue University, Samsung) Salmin Sultana (Purdue University, Intel Research) Slide 48 Thank You! Slide 49 Backup Slides Slide 50