CS 5412: Topics in Cloud Computing

CS5412 / LECTURE 15: HARDWARE ACCELERATORS
Ken Birman, CS5412 Spring 2019
HTTP://WWW.CS.CORNELL.EDU/COURSES/CS5412/2019SP

IN THE EARLY DAYS, DIVIDE AND CONQUER SUFFICED

People broke web-page computations into a first tier, plus a bank of specialized µ-services optimized for highly parallel computation. They sharded data, held it in memory, and created huge in-memory (key,value) layers. Batched programming techniques helped amortize overheads; batching introduces delays, but weak cache consistency made some delay tolerable.
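A minimal sketch of the sharding idea behind those (key,value) layers (hash-mod routing; this generic scheme is our illustration, not any vendor's exact design):

    # Route each key to one of N in-memory shard servers.
    NUM_SHARDS = 16
    shards = [{} for _ in range(NUM_SHARDS)]    # stand-ins for shard servers

    def shard_for(key: str) -> dict:
        # hash() is fine for a sketch; real systems use a stable hash, and
        # often consistent hashing so shards can be added or removed.
        return shards[hash(key) % NUM_SHARDS]

    shard_for("user:42")["user:42"] = {"name": "Ann"}    # put
    value = shard_for("user:42").get("user:42")          # get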

YET THIS TURNED OUT TO BE EXPENSIVE!
Cloud computing companies began to look closely at their cost of operations and their use of energy. An efficient cloud would fully utilize its hardware while also minimizing energy consumption. The early steps above were valuable and improved these metrics. But as the model matured, inefficiencies became more apparent: a lot of resources were owned but not fully used, and time, money, and energy were being spent waiting.

TENSION: GENERALITY VS. EFFICIENCY
If we understand a workload deeply, we can often create extremely efficient specialized solutions, and could even create specialized chips that include only the exact hardware ideal for the task. But because computing workloads evolve, such a solution would be ideal for a few years at best; then it would start to seem inflexible and inefficient! Conversely, if we are overly general, we face the issue of copying data from place to place, and perhaps computing in less-than-ideal ways.

CAN WE HAVE IT ALL?
Modern datacenter hardware designers are asking: can we create general-purpose solutions in a normal way, yet leverage specialized hardware where the benefits are large, in a way that can still be upgraded periodically or repurposed, and cut back on the work done on the general-purpose CPUs?

BROAD HARDWARE TRENDS
[Figure: Amazon AWS server card]
There has always been a tradeoff between generality and efficiency. A general-purpose CPU has considerable advantages: it is very cost-effective (high-volume sales drive costs down), highly performant (Moore's Law until ~2010; multicore plus hyperthreading since then), flexible (lots of languages and computing models), and familiar. Virtualization (VMs and containers) easily supports sharing, so the cloud can pack jobs to keep machines busy.

BUT FOR CERTAIN TASKS, SPECIALIZED HARDWARE IS REALLY NEEDED
Basically, these are devices that can either do something in hardware that normal CPU instructions don't support (like direct operations on analog signals), or can do parallel operations very efficiently. The parallel computing opportunity is the most intriguing today; someday, the analog dimension may get more attention.

ACCELERATORS: THE SECRET TO AZURE PERFORMANCE!
It is important to understand how vital these accelerators are in the cloud. People who pretend the cloud is just a rent-a-server model lose access to the accelerators (the vendors all have security features that block you). So because the accelerators are so amazing, you must use µ-services!

HOW MUCH SPEEDUP CAN WE HOPE FOR?
This was a debated topic in the 1970s. Some people imagined that there could be magic ways to speed computation up, and the people building the actual chips needed a way to limit these unrealistic expectations! Eventually, Gene Amdahl found a way to explain the limits.

AMDAHL'S LAW
Consider a computational task. We can express the code in terms of actions that can occur in parallel, and actions that can only be done sequentially. Measure the path length of the sequential portion: it is performance-limiting for the whole computation! If F is the fraction of a calculation that is sequential, and (1-F) is the fraction that can be parallelized, then the maximum speedup achievable using P processors is 1/(F + (1-F)/P).

EXAMPLES
If 90% of a calculation can be parallelized, then the maximum speedup on 2 processors is 1/(0.1 + (1-0.1)/2), or 1.8x (i.e., investing twice as much hardware speeds the calculation up by almost 2x). But with 10 processors we only get a 5.2x speedup, and on 20 processors our speedup is 6.9x: diminishing returns!
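These numbers are easy to check. A minimal sketch (this helper is ours, not from the lecture) that evaluates the formula, including the 1000-processor case below:

    # Amdahl's Law: maximum speedup with P processors when a fraction F
    # of the work is inherently sequential.
    def amdahl_speedup(F: float, P: int) -> float:
        return 1.0 / (F + (1.0 - F) / P)

    for P in (2, 10, 20, 1000):
        print(f"P={P:4d}: {amdahl_speedup(0.1, P):.2f}x")
    # Prints 1.82x, 5.26x, 6.90x, 9.91x: never more than 1/F = 10x.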

Even on 1000 processors, the speedup is only 1/(0.1 + (1-0.1)/1000), or 9.9x.

HIGHWAY ANALOGY
You buy a Tesla, take it out on California Route 101, and mash the Ludicrous Acceleration button. It can instantly accelerate to the speed of light! But you won't get far: your commute will be limited by stragglers.

THE OTHER LIMITING FACTOR: HEAT!
The clock rate might seem like the limiting factor, but a faster clock rate pumps more energy into the circuits and logic gates. The heat dissipated will be proportional to the square of the clock rate.

In a parallel computing device, the whole surface might be active, so very fast clock rates make a chip run very hot.

BUT IF A DEVICE GETS TOO HOT
Even a general-purpose CPU is close to the heat-dissipation limits! Operating systems like Linux run the clock as slowly as possible for less active computing elements, and even disable hardware components that are not currently in use. This helps.

But the clock rate on an accelerator might actually be lower than on a standard CPU! The (only) big win is parallelism.

SO ACCELERATION OPTIONS ARE LIMITED TO HIGHLY PARALLEL TASKS, OR "BUMP IN THE WIRE"
Hardware might be able to perform highly parallel steps rapidly. We can also use hardware to reduce the work the host computer is doing.

And if host computers can't actually keep up with the network, we could perhaps wire the network directly to the hardware accelerator, and if we're lucky, the device might keep up with the incoming data!

ACRONYM CITY!
So now we'll review a staggering list of incomprehensible 4-letter terms; sort of a survey of the options.

(Cartoon dialog: "Dude! They run Verilog on a Xilinx Virtex-5QV!" "Cool! Can't wait to tell Mom!" You should memorize these to impress people... but we wouldn't see them on exams!)

FIRST, STANDARD CPUS
As you know, prior to 2010 Moore's Law was still in control, and we had general-purpose CPUs with associated DRAM, caches, and rotating disks.

Around 2010, rotating disks were displaced by flash memory drives. These are actually kind of slow, so they often have some DRAM as a buffer. Simultaneously, chip designers added branch prediction, data prefetching, speculative execution, hyperthreading, and more.

AFTER 2010 WE SAW NUMA
Today, a cloud computing data center server probably has 12 or more cores per CPU chip, with DRAM organized into clusters, perhaps 4 chunks of DRAM with 3 cores each. (More cores per server are likely in the future.)

An on-board coherency protocol allows any core to access any memory, but the fastest data path is to the local DRAM. Then, with container virtualization, we can run lots of jobs side by side on each such machine.
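On Linux, a process (or a container runtime on its behalf) can pin itself to the cores beside one DRAM cluster so that most memory accesses stay on the fast local path. A minimal sketch using only the standard library; the core numbering is invented for illustration:

    import os

    # Hypothetical layout (made up for this sketch): cores 0-2 sit next
    # to one chunk of DRAM, forming one NUMA cluster.
    LOCAL_CORES = {0, 1, 2}

    # Pin this process (pid 0 means "self"; Linux-only call) to those cores.
    # With the default first-touch policy, its memory then tends to be
    # allocated in the nearby DRAM.
    os.sched_setaffinity(0, LOCAL_CORES)
    print("running on cores:", os.sched_getaffinity(0))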

STORAGE DEVICES ARE IMPROVING TOO
Disk I/O (even with flash SSD drives) often limits performance. New non-volatile memory options like Intel's Optane NVMe drives, which use phase-change memory technology, are much faster. Today:
NVMe is the new flash (somewhat expensive, but very fast).
Flash is the new disk (slow, but cheaper and higher capacity).
Disk is the new tape (even slower, but massive capacity).
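A quick way to see which rung of this ladder a given device occupies is to time a large sequential read. A rough sketch (the file path is a placeholder, and beware: the OS page cache can make re-reads look far faster than the device really is):

    import time

    PATH = "/tmp/bigfile"    # placeholder: any large file on the device
    CHUNK = 1 << 20          # 1 MiB per read

    total = 0
    start = time.perf_counter()
    with open(PATH, "rb") as f:
        while block := f.read(CHUNK):
            total += len(block)
    elapsed = time.perf_counter() - start
    print(f"{total / elapsed / 1e6:.0f} MB/s sequential read")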

NETWORKS HAVE EVOLVED TOO
The Network Interface Card (NIC) on your server now has a small operating system in it, and runs programs in C (written by the vendor)! You can perform DMA transfers directly from machine to machine, not just from the network into and out of one machine as before. Remote DMA (RDMA) is like TCP (reliable, ordered, etc.), but the hardware does all the work. RDMA is far faster than TCP: we have RDMA at 200 Gbps today, while the fastest TCP solutions are easily 4x to 6x slower.

RDMA FEATURES
With RDMA you can do some cool tricks. Recall that in a NUMA machine, one core can access memory on any DRAM, so every machine shares the full memory pool. With RDMA, any core in the data center can potentially DMA-transfer to memory anywhere else in the data center (but only if authorized). Moreover, RDMA allows direct access to variables or data structures in the memory of remote machines.
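To make that one-sided style concrete, here is a sketch. Everything in it (RemoteRegion, its fields, read()) is invented for illustration; real programs go through a verbs library such as libibverbs, from C or via a binding. The point the stub mimics is that the remote host's CPU runs no code at all:

    from dataclasses import dataclass

    @dataclass
    class RemoteRegion:
        # HYPOTHETICAL stand-in for an RDMA-registered memory region.
        addr: int    # virtual address registered with the remote NIC
        rkey: int    # access key the remote host granted to us

        def read(self, offset: int, length: int) -> bytes:
            # A real implementation posts a work request to the local NIC,
            # which fetches the bytes from the remote NIC directly; the
            # remote CPU is never interrupted. Placeholder result here:
            return b"\x00" * length

    counters = RemoteRegion(addr=0x7F3A00001000, rkey=0x1234)
    snapshot = counters.read(offset=0, length=64)   # one-sided "read"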

GPU: A COMMON OPTION
[Figure: Titan GPU cluster]
Desktop computers generally have a multicore general-purpose computing infrastructure, but servers may also have a GPU that the general-purpose machine controls. The vendor creates a software library, so that a general-purpose program can ask the GPU to perform a computation:
1. A DMA transfer copies the data from general-purpose DRAM into the specialized GPU memory, which allows highly parallel access.
2. A GPU program executes to perform the desired actions.
3. Finally, the results are copied back to the general-purpose host.
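The vendor library hides all three steps. A minimal sketch using CuPy as one concrete example of such a library (assumes an NVIDIA GPU and the cupy package; the matrix size is arbitrary):

    import numpy as np
    import cupy as cp    # GPU array library with a NumPy-like interface

    host_a = np.random.rand(4096, 4096)   # data starts in host DRAM

    dev_a = cp.asarray(host_a)    # step 1: DMA copy into GPU memory
    dev_b = dev_a @ dev_a         # step 2: the matmul runs on the GPU
    host_b = cp.asnumpy(dev_b)    # step 3: copy the result back to the host

    print(host_b.shape)           # (4096, 4096)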

GPU PROGRAMMING IS HARD!
[Figure: example of CUDA code]
There has been work on taking a general program coded in Java or C# and automatically finding patterns that can run on a GPU, like a new compilation model in which the GPU offers special instructions. This gets to within 5x or 10x of hand-coded CUDA, but that isn't good enough.

EVEN USING PREBUILT GPU LIBRARIES IS AN ART
If our goal were just to recolor photos, it might be easier. But graphics and vision algorithms often do very elaborate, long sequences of matrix operations, and they may be designed with the specialized graphics display cards in mind (those cards can see data directly in memory, and can perform some operations on their own, like rescaling). As a result, hard-core gaming and imaging companies hire specialists.

LIMITING ISSUES WITH GPU?
First, heat: on a GPU chip we do a form of single-instruction, multiple-data processing ("multiply every pixel by this value"), and this expends a lot of energy. But also copying: you do a lot of copying from the general-purpose host memory to the GPU memory, and then back (hidden in the GPU library, but costly). And the GPU has extra logic not really needed for machine learning; if we could just power those features down, we could reduce these costs.

GPU FOR MACHINE LEARNING?
There has been more and more pressure to equip every computer in the cloud with a GPU, but this is very costly if those GPU units aren't all in use. Still, many data centers take this approach.

Google is betting that GPUs just aren't cost-effective at scale, and decided to strip the concept down to a minimum: TPUs, Tensor Processing Units.

TPU IDEA
[Figure: Google's first TPU unit]
If the main demand for GPUs is from machine learning, not full graphics code, we only need to support a subset of the operations. Google focuses on tensor arithmetic (a tensor is just a matrix, but with d dimensions rather than just 1 or 2).

Because less heat is produced once the unnecessary circuitry is gone, we can reduce energy costs, or even run the clock a bit faster.

TENSORFLOW LANGUAGE?
TensorFlow extends Python with a built-in concept of tensor objects and TPUs; it is easy to express machine-learning code this way. The typical program is a kind of dataflow graph in which nodes compute and tensors flow from input sources to outputs. Mostly, TensorFlow programs run on one NUMA machine, taking advantage of the attached TPU unit to accelerate the mathematical steps.
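A tiny taste of the dataflow style, in modern TensorFlow (the matrices are arbitrary). On a machine with an attached GPU or TPU, TensorFlow places the matmul node on the accelerator automatically:

    import tensorflow as tf

    # Two small tensors; here, just 2-dimensional matrices.
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    b = tf.constant([[5.0, 6.0], [7.0, 8.0]])

    # A dataflow node: tensors flow in, a tensor flows out.
    c = tf.matmul(a, b)
    print(c.numpy())    # [[19. 22.] [43. 50.]]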

FIELD PROGRAMMABLE GATE ARRAYS (FPGA)
As you know, a CPU chip just maps instructions down to gate-level operations like AND, OR, XOR, and NOT. Xilinx invented a way to take a chip and download a wiring diagram and a logic diagram to it. So you could configure your chip to have, say, an ARM or x86 processor on it. Of course that would be silly. But you can also design your own specialized chip, and in theory it could do anything a GPU or TPU could do, or anything else, really.

FPGA PROGRAMMING IS HARD!
Chip designers use Verilog, and FPGA designers do too. Normal chips are carefully debugged; if you try to create your own FPGA design and it has bugs, you can cause the FPGA device to hang. So FPGAs are often built up from libraries of carefully tested logic blocks, but this makes FPGA programming a specialized task.

EVEN SO...
FPGAs turn out to be a very cost-effective option for some important cases seen in today's cloud! Cryptography is one example: as a "bump in the wire," we can use FPGA chips to perform whatever cryptographic action is needed for the various network security protocols (there are many). You could create an Application-Specific Integrated Circuit (ASIC) for each protocol, but an FPGA solution can be reconfigured as needed.

EVEN SO...
There is also very exciting work on mapping deep neural networks to clusters of FPGA chips. This yields very rapid and cost-effective image classification or voice recognition solutions. The future of vision and speech could easily depend on these FPGA clusters. But these ideas depend on having a cloud full of data (the models used are massive, and there may be one model per situation).

NICS AND ROUTERS
We mentioned that NICs are able to do RDMA and run a kind of TCP on the card, in a dedicated processor. But NICs and routers are actually becoming programmable. This is useful for many tasks: smart routing that looks into packets and directs them based on content; network virtualization; and there is even talk of running machine-learning tasks directly in the network itself.

AND THERE IS EVEN MORE!
I didn't even mention:

SGX: Intel's hardware-based privacy and security model.
Bitcoin mining and blockchain proof-of-work chips.
Special chips for performing tasks like FFTs and sonar/radar/lidar processing.
Chips with analog components, or optical components.
Quantum computing accelerators.

OPTICAL NETWORKS
Looking to the future, people are noticing that we can run multiple wavelengths on a typical strand of optical network fiber.

So why not have multiple side-by-side networks? And if an application owns its own wavelength, why not specialize that network in various ways: a software-defined optical wavelength just for that particular application!

HYPERSTORAGE
[Figure: Microsoft Silica zettabyte storage unit]
Other cool ideas involve storage systems with insane capacity!

A single cube of computing-grade silica glass can record 7-bit numbers using a tricky laser-zap system (in a write-once model), and the data will then persist for as long as 100,000 years! One silica chip of this kind could hold all the movies ever made, plus the whole Library of Congress, and would only be about one-third full. There is another massive-scale concept that uses DNA as the medium; DNA memory can potentially even be read-write.

DISAGGREGATION
Some people believe that data centers will eventually evolve into what they call a disaggregated model. In this model we will have racks of identical components: one kind of rack specialized for CPUs, another for memory, others for storage, TPUs/GPUs, and so on. The network wiring will then evolve to let us assemble the ideal virtual processor on the fly, with exactly the hardware the specific use case needs.

SO WHAT'S THE PROBLEM?
The cloud is becoming massively complex and specialized!

And yet if you ignore all this stuff, your performance will be very poor. The only practical answer is to learn to use the vendor-provided µ-services, because those take full advantage of this special hardware. When you use their services and don't try to roll your own, you get cost-effective and scalable performance (and don't have to master each device yourself).

VENDORS HAVE A PROBLEM TOO: LINUX ISN'T UP TO THE NEW ROLE

You can bolt stuff onto a Model T, but it is still a Model T underneath! Employees at cloud companies want to write programs in a normal way. (Even Microsoft has become a mostly-Linux company.) But Linux wasn't designed for specialized accelerators, cross-computer memory access, and computation occurring at every level of every device.

LINUX ISSUES
Linux's basic model is of a program with memory. But for most of this hardware, the program actually controls some sort of device outside the computer, with its own memory, and maybe its own network connection. Even copying the data just once might be 10x or 100x slower than what the device could have done directly on the wire.

We also lack a security model for this kind of distributed, device-level computing.

TODAY'S SITUATION?
As the owner of an infrastructure, a company like Google, Microsoft, or Amazon can build µ-services that leverage all sorts of specialized hardware to accelerate important tasks. But as a developer, you can only benefit by using their services, not by trying to leverage these devices directly. Why? Because the vendors worry about the risk of instabilities (very real).

TECH TALES FROM THE CRYPT
"They that live by the sword shall die by it"
The data center that couldn't stop rebooting
Bob forgot to verify his Verilog
Oracle maxed out... on Black Friday
Attack of the zombie refrigerators

BUT WHY ARE THOSE RELEVANT?
Only some involve new hardware, but they all illustrate how, when we change the way we do things, unexpected disruptions are a common consequence. Even success can be a cause of technology collapse: risks are inevitable. But don't panic. The game is to accept that we need to climb this endless slope: better technology creates opportunity, but brings issues.

SUMMARY?
Specialized hardware is a key to cost-effective modern cloud computing, yet these innovations are also creating new risks. Accelerators will be even more critical if the future IoT edge needs a lot of support for very rapid computer vision, speech recognition, and intelligence; those devices will need to live near the edge. Accelerators can only be used through the µ-service model. The main exception is that end users do have ways to access TPU and GPU accelerators from their own code, via libraries of numerical methods.
