
Fault Tolerant MapReduce-MPI for HPC Clusters
Yanfei Guo*, Wesley Bland+, Pavan Balaji+, Xiaobo Zhou*
* Dept. of Computer Science, University of Colorado, Colorado Springs
+ Mathematics and Computer Science Division, Argonne National Laboratory

Outline
- Overview
- Background
- Challenges
- FT-MRMPI Design
  - Checkpoint-Restart
  - Detect-Resume
- Evaluation
- Conclusion

MapReduce on HPC Clusters
- What MapReduce provides
  - Write serial code and run it in parallel
  - Reliable execution with a detect-restart fault tolerance model

- HPC clusters: high-performance CPUs, storage, and networks
- MapReduce on HPC clusters
  - High-performance big data analytics
  - Reduced data movement between systems
  - MapReduce runs on top of the HPC software stack

Comparison of an HPC system (Mira) and a typical Hadoop cluster:

                 Mira                             Hadoop cluster
  CPU            16 x 1.6 GHz PPC A2 cores        16 x 2.4 GHz Intel Xeon cores
  Memory         16 GB (1 GB/core)                64-128 GB (4-8 GB/core)
  Storage        Local: N/A; Shared: 24 PB SAN    Local: 500 GB x 8; Shared: N/A
  Network        5D Torus                         10/40 Gbps Ethernet
  Software Env   MPI                              Java
  File System    GPFS                             HDFS
  Scheduler      Cobalt                           Hadoop
  MapReduce Lib  MapReduce-MPI                    Hadoop (with fault tolerance)

Fault Tolerance Model of MapReduce
- Master/worker model
  - Detect: the master monitors all workers
  - Restart: affected tasks are rescheduled to another worker
[Figure: master/scheduler dispatching map and reduce tasks into map slots and reduce slots on workers]

No Fault Tolerance in MPI
- MPI: Message Passing Interface
  - Inter-process communication
  - Communicator (COMM)
- Frequent failures at large scale
  - MTTF = 4.2 hr (NCSA Blue Waters)
  - MTTF < 1 hr expected for future systems
- MPI Standard 3.1
  - Custom error handlers
  - No guarantee that all processes enter the error handler
  - No way to repair a broken COMM
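For reference, a minimal sketch of how an application attaches a custom error handler to a communicator with the standard MPI-3.1 calls (MPI_Comm_create_errhandler / MPI_Comm_set_errhandler). The handler body is only illustrative; as noted above, the standard does not guarantee that every process enters the handler, and there is no way to repair the broken COMM afterwards.

    #include <mpi.h>
    #include <cstdio>

    // Called by the MPI library on the ranks that notice the error.
    // MPI gives no guarantee that *all* ranks reach this handler.
    static void comm_error_handler(MPI_Comm *comm, int *errcode, ...)
    {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char msg[MPI_MAX_ERROR_STRING];
        int len = 0;
        MPI_Error_string(*errcode, msg, &len);
        std::fprintf(stderr, "rank %d saw MPI error: %s\n", rank, msg);
        // Application-specific reaction goes here (save state, abort, ...).
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        // Replace the default MPI_ERRORS_ARE_FATAL behavior with our handler.
        MPI_Errhandler eh;
        MPI_Comm_create_errhandler(comm_error_handler, &eh);
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, eh);

        // ... MapReduce work over MPI_COMM_WORLD ...

        MPI_Finalize();
        return 0;
    }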

Scheduling Restrictions
- Gang scheduling
  - Schedules all processes at the same time
  - Preferred by HPC applications with extensive synchronization
- MapReduce scheduling
  - Per-task scheduling: schedule each task as early as possible
  - Compatible with the detect-restart fault tolerance model
- Resizing a running job
  - Many platforms do not support it
  - Large overhead (re-queueing)
- The detect-restart fault tolerance model is not compatible with HPC schedulers

Overall Design
- Fault-tolerant MapReduce using MPI
  - Reliable failure detection and propagation
  - Compatible fault tolerance models
- FT-MRMPI components: Task Runner, Distributed Master & Load Balancer, Failure Handler

[Figure: FT-MRMPI architecture - each MapReduce process runs a Distributed Master, Task Runner, Load Balancer, and Failure Handler on top of MPI]
- Built on MPI features
- Traceable job interfaces

- Fault tolerance models compatible with HPC schedulers
  - Checkpoint-Restart
  - Detect-Resume

Task Runner
- Tracing: establishes consistent states
- Delegates operations to the library: the user program calls MR-MPI (e.g., map()), and the library reads each record, calls (*func)(), and writes the resulting key-value pairs
- New interfaces
  - Highly extensible
  - Embedded tracing
  - Record-level consistency

Interfaces and word-count driver as shown on the slide (template parameters and elided code are omitted in the original):

    template <...> class RecordReader;
    template <...> class RecordWriter;
    class WordRecordReader : public RecordReader<...> { /* ... */ };
    template <...> class Mapper;
    template <...> class Reducer;

    void Mapper::map(int& key, string& value, BaseRecordWriter* out, void* param)
    {
        out->add(value, 1);
    }

    int main(int narg, char** args)
    {
        MPI_Init(&narg, &args);
        <***snip***>
        mr->map(new WCMapper(), new WCReader(), NULL, NULL);
        mr->collate(NULL);
        mr->reduce(new WCReducer(), NULL, new WCWriter(), NULL);
        <***snip***>
    }
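To make the extensibility point concrete, here is a self-contained sketch of what a record reader plugged into the Task Runner could look like. The RecordReader/WordRecordReader names come from the slide, but the interface shown here (next(), pos()) and the word-splitting logic are assumptions for illustration, not the actual FT-MRMPI API; the record index is the kind of marker record-level tracing would checkpoint.

    #include <fstream>
    #include <string>

    // Simplified stand-in for the RecordReader interface on the slide.
    template <class K, class V>
    class RecordReader {
    public:
        virtual ~RecordReader() {}
        virtual bool next(K &key, V &value) = 0;  // fetch one record
        virtual long pos() const = 0;             // records consumed so far
    };

    // Reads whitespace-separated words; pos() lets the library trace the
    // last consumed record for record-level consistency.
    class WordRecordReader : public RecordReader<long, std::string> {
    public:
        explicit WordRecordReader(const std::string &path)
            : in_(path.c_str()), index_(0) {}

        bool next(long &key, std::string &value) override {
            if (!(in_ >> value)) return false;
            key = index_++;
            return true;
        }

        long pos() const override { return index_; }

    private:
        std::ifstream in_;
        long index_;
    };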

Distributed Master & Load Balancer
- Task dispatching
  - Job init: the MapReduce job is split into a global task pool

  - Task recovery: affected tasks are re-dispatched from the global task pool
- Global consistent state
  - The distributed master traces the shuffle buffer
- Load balancing
  - The Task Runner monitors the processing speed of tasks
  - Linear job performance model (a sketch follows below)
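The slides only name a "linear job performance model", so the following is a sketch of the simplest model consistent with that description: each process reports its observed record rate, the master projects its remaining time, and the load balancer shifts tasks away from the process predicted to finish last. The structure and names are hypothetical.

    #include <cstddef>
    #include <vector>

    // Per-process progress report gathered by the distributed master
    // (hypothetical structure).
    struct Progress {
        long records_done;
        long records_total;
        double elapsed_sec;
    };

    // Linear model: assume the observed rate stays constant, so
    // remaining time = remaining records / (records_done / elapsed).
    inline double estimated_remaining(const Progress &p)
    {
        if (p.records_done == 0) return 0.0;           // no data yet
        double rate = p.records_done / p.elapsed_sec;  // records per second
        return (p.records_total - p.records_done) / rate;
    }

    // The load balancer would move tasks from the process predicted to
    // finish last toward the ones predicted to finish first.
    inline int slowest_process(const std::vector<Progress> &all)
    {
        int worst = 0;
        for (std::size_t i = 1; i < all.size(); ++i)
            if (estimated_remaining(all[i]) > estimated_remaining(all[worst]))
                worst = static_cast<int>(i);
        return worst;
    }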

Fault Tolerance Model: Checkpoint-Restart
- Failed process: custom error handler
  - Save state and exit gracefully
  - Propagate the failure event with MPI_Abort()
- Checkpoint
  - Asynchronous within each phase (map, shuffle, reduce)
  - Saved locally
  - Multiple granularities
- Restart to recover
  - Resubmit the job with -recover
  - Picks up from where it left off
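A minimal sketch of the save-and-propagate step under Checkpoint-Restart: the custom error handler (installed as in the earlier sketch) flushes the local checkpoint and then calls MPI_Abort() so the failure reaches every rank and the job can be resubmitted with -recover. save_local_checkpoint() is a hypothetical placeholder for the actual state-saving code.

    #include <mpi.h>
    #include <cstdio>

    // Placeholder: write the last consumed record index and buffered
    // key-value pairs to node-local storage.
    static void save_local_checkpoint()
    {
        std::fprintf(stderr, "local checkpoint flushed\n");
    }

    // Error handler for the Checkpoint-Restart model: save and exit
    // gracefully, then propagate the failure to all other processes.
    static void checkpoint_on_error(MPI_Comm *comm, int *errcode, ...)
    {
        save_local_checkpoint();
        MPI_Abort(*comm, *errcode);
    }

    // Installed once after MPI_Init(), e.g.:
    //   MPI_Errhandler eh;
    //   MPI_Comm_create_errhandler(checkpoint_on_error, &eh);
    //   MPI_Comm_set_errhandler(MPI_COMM_WORLD, eh);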

Where to Write the Checkpoint
- Write to GPFS
  - Performance issues due to small I/O
  - Interference on shared hardware
- Write to node-local disk
  - Fast, no interference
  - But is it globally available during recovery?

- Background data copier
  - Write locally, sync to GPFS in the background
  - Overlaps I/O with computation (a sketch follows below)
[Figure: job completion time for Wordcount, 100 GB input, 256 processes, ppn=8, writing checkpoints to GPFS, to local disk, or via the background copier]
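The background data copier is only described at a high level, so the sketch below assumes the simplest possible design: the task runner writes checkpoints to node-local disk and hands finished files to a helper thread that copies them to the shared file system, overlapping I/O with computation. Class and path names are illustrative, not the actual implementation.

    #include <condition_variable>
    #include <filesystem>
    #include <mutex>
    #include <queue>
    #include <string>
    #include <thread>

    // Copies finished local checkpoint files to a shared (e.g., GPFS)
    // directory in the background.
    class BackgroundCopier {
    public:
        explicit BackgroundCopier(std::string shared_dir)
            : shared_dir_(std::move(shared_dir)), done_(false),
              worker_(&BackgroundCopier::run, this) {}

        ~BackgroundCopier() {
            { std::lock_guard<std::mutex> lk(mu_); done_ = true; }
            cv_.notify_one();
            worker_.join();
        }

        // Called by the task runner after it closes a local checkpoint file.
        void enqueue(const std::string &local_path) {
            { std::lock_guard<std::mutex> lk(mu_); pending_.push(local_path); }
            cv_.notify_one();
        }

    private:
        void run() {
            std::unique_lock<std::mutex> lk(mu_);
            while (!done_ || !pending_.empty()) {
                cv_.wait(lk, [this] { return done_ || !pending_.empty(); });
                while (!pending_.empty()) {
                    std::string src = pending_.front();
                    pending_.pop();
                    lk.unlock();   // copy without holding the lock
                    namespace fs = std::filesystem;
                    fs::copy_file(src,
                                  fs::path(shared_dir_) / fs::path(src).filename(),
                                  fs::copy_options::overwrite_existing);
                    lk.lock();
                }
            }
        }

        std::string shared_dir_;
        bool done_;
        std::queue<std::string> pending_;
        std::mutex mu_;
        std::condition_variable cv_;
        std::thread worker_;   // declared last so it starts after the members above
    };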

Recover Point
- Recover to the last file (ft-file)
  - Less frequent checkpointing
  - Must reprocess data on recovery; some work is lost
- Recover to the last record (ft-rec)
  - Requires fine-grained checkpointing
  - Skips records instead of reprocessing them (see the sketch below)
[Figure: job recover time broken into init, skip/reprocess, recover, and runtime for ft-rec vs. ft-file on wordcount and pagerank]
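To illustrate the difference between the two recover points, here is a sketch that assumes the checkpoint stores the index of the last committed record (names are hypothetical): under ft-rec the reader fast-forwards past records that were already processed, while under ft-file the whole input is reprocessed from the beginning.

    #include <fstream>
    #include <string>

    // Position a word-oriented reader according to the chosen recover point.
    // Returns the number of records that will NOT be handed to map() again.
    long resume_reader(std::ifstream &in, long checkpointed_records, bool ft_rec)
    {
        long skipped = 0;
        if (ft_rec) {
            // ft-rec: skip records already covered by the checkpoint.
            std::string word;
            while (skipped < checkpointed_records && (in >> word))
                ++skipped;
        }
        // ft-file: skipped stays 0 and the whole file is reprocessed.
        return skipped;
    }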

Drawbacks of Checkpoint-Restart
- Checkpoint-Restart works, but it is not perfect
  - Large overhead from reading and writing checkpoints
  - Requires human intervention
  - Failures can also occur during recovery

Fault Tolerance Model: Detect-Resume
- Detect
  - Global knowledge of the failure
  - Identify failed processes by comparing groups
- Resume
  - Fix the COMM by excluding failed processes
  - Balanced distribution of affected tasks
  - Work-conserving vs. non-work-conserving
- User Level Failure Mitigation (ULFM)
  - MPIX_Comm_revoke()
  - MPIX_Comm_shrink()
[Figure: the failed process triggers the error handler; the other processes revoke and shrink the communicator and resume the map/shuffle/reduce phases]
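The ULFM calls named on the slide combine roughly as follows. This is a sketch of the generic ULFM recovery pattern (revoke, shrink, compare groups), not the exact FT-MRMPI code; redistribute_tasks() is a hypothetical hook, and mpi-ext.h is where Open MPI usually exposes the MPIX prototypes.

    #include <mpi.h>
    #include <mpi-ext.h>   // ULFM extensions (MPIX_Comm_revoke, MPIX_Comm_shrink)

    // Hypothetical hook: rebalance the tasks of the failed ranks across the
    // survivors (work-conserving or non-work-conserving).
    static void redistribute_tasks(int num_failed) { (void)num_failed; }

    // Called when an operation fails with MPIX_ERR_PROC_FAILED or the COMM
    // has been revoked by a peer: repair the communicator and resume.
    static MPI_Comm resume_after_failure(MPI_Comm comm)
    {
        // Make sure every surviving process learns about the failure.
        MPIX_Comm_revoke(comm);

        // Build a working communicator that excludes the failed processes.
        MPI_Comm newcomm;
        MPIX_Comm_shrink(comm, &newcomm);

        // Identify the failed processes by comparing the old and new groups.
        MPI_Group old_grp, new_grp, failed_grp;
        MPI_Comm_group(comm, &old_grp);
        MPI_Comm_group(newcomm, &new_grp);
        MPI_Group_difference(old_grp, new_grp, &failed_grp);

        int num_failed = 0;
        MPI_Group_size(failed_grp, &num_failed);
        redistribute_tasks(num_failed);

        MPI_Group_free(&old_grp);
        MPI_Group_free(&new_grp);
        MPI_Group_free(&failed_grp);
        MPI_Comm_free(&comm);      // discard the broken communicator
        return newcomm;
    }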

Evaluation Setup
- LCRC Fusion Cluster [1]
  - 256 nodes
  - CPU: 2-way 8-core Intel Xeon X5550
  - Memory: 36 GB
  - Local disk: 250 GB
  - Network: Mellanox InfiniBand QDR
- Benchmarks
  - Wordcount, BFS, Pagerank
  - mrmpiBLAST

[1] http://www.lcrc.anl.gov

Job Performance
- 10%-13% overhead from checkpointing
- Up to 39% shorter completion time under failures

[Figure: job completion time vs. number of processes (32-2048) for MR-MPI, checkpoint-restart, detect-resume (work-conserving), and detect-resume (non-work-conserving)]

Checkpoint Overhead Factors
- Granularity: number of records per checkpoint
- Size of records
[Figure: checkpoint overhead (%) vs. number of records per checkpoint, from 1 to 10,000,000]

Time Decomposition
- Performance with failure and recovery
- Wordcount, all processes together
- Detect-Resume has less data that needs to be recovered

[Figure: execution time breakdown (map, recover, aggregate, convert, reduce) for Checkpoint-Restart and Detect-Resume at 64-2048 processes]

Continuous Failures
- Pagerank, 256 processes, randomly kill 1 process every 5 seconds
[Figure: job completion time under 2-128 continuous failures for work-conserving and non-work-conserving recovery vs. the no-failure reference]

Conclusion
- First fault-tolerant MapReduce implementation in MPI
  - Redesigned MR-MPI to provide fault tolerance
  - Highly extensible while providing the essential features for fault tolerance
- Two fault tolerance models
  - Checkpoint-Restart
  - Detect-Resume

Thank you! Q&A

Backup Slides

Prefetching Data Copier
- Recovering from GPFS
  - Everything is read from GPFS
  - Processes wait for I/O
- Prefetching during recovery
  - Move data from GPFS to local disk
  - Overlaps I/O with computation

2-Pass KV-KMV Conversion
- 4-pass conversion in MR-MPI
  - Excessive disk I/O during shuffle
  - Hard to make checkpoints
- 2-pass conversion
  - Log-structured file layout with a hash table of keys
  - KV -> sketch, sketch -> KMV

Recover Time
- Recovery from local disk, GPFS, and GPFS with prefetching

[Figure: recover time (s) at 64-2048 processes for local disk, GPFS, and GPFS with prefetching]
