# CS 179: GPU Programming - California Institute of Technology

Lecture 1: Introduction

Images: http://en.wikipedia.org, http://www.pcper.com, http://northdallasradiationoncology.com/

## The Problem

- Are our computers fast enough? (Source: XKCD Comics, http://xkcd.com/676/)
- Are our computers really fast enough? (Images: http://lauraskelton.github.io/images/posts/5deepnetworklayer.png, http://www.dmi.unict.it/nicosia/research/proteinFolding3.png, http://www.cnet.com/)

- What does it mean to solve a computational problem?

## The CPU

- The central processing unit
- Traditionally, applications use the CPU for primary calculations
- Powerful, general-purpose capabilities
- R&D -> Moore's Law!
- Established technology

Image: Wikimedia Commons, Intel_CPU_Pentium_4_640_Prescott_bottom.jpg

## The GPU

- Designed for our graphics
- For graphics problems, much faster than the CPU!
- What about other problems?

## This course in 30 seconds

For certain problems, use the GPU instead of the CPU.

Images: http://www.nvidia.com; Wikimedia Commons, Intel_CPU_Pentium_4_640_Prescott_bottom.jpg

## This course in 60 seconds

- GPU: hundreds of cores! (vs. 2, 4, or 8 cores on a CPU)
- Good for highly parallelizable problems: speedups of 10x, 100x, or more

## Questions

- What is a GPU?
- What is a parallelizable problem?
- What does GPU-accelerated code look like?
- Who cares?

## Outline

- Motivations
- Brief history
- A simple problem
- A simple solution
- Administrivia

## GPUs: The Motivation

Screens!

- ~1e5 to 1e7 pixels
- Refresh rate: ~60 Hz
- Total: ~1e7 to 1e9 pixels/sec (very approximate orders of magnitude: 1e5 x 60 is about 6e6, and 1e7 x 60 is about 6e8)

Lots of calculations are the same!

e.g. Raytracing:

- Goal: Trace light rays and calculate object interactions to produce a realistic image

Images: Superquadric Cylinders, exponent 0.1, yellow glass balls, Barr, 1981; Watt, 3D Computer Graphics (from http://courses.cms.caltech.edu/cs171/)

Raytracing, in pseudocode:

    for all pixels (i,j):
        calculate ray point and direction in 3D space
        if ray intersects object:
            calculate lighting at closest object
            store color of (i,j)

e.g. Simple shading:

    for all pixels (i,j):
        replace previous color with new color according to rules

Image: "Example of a Shader" by TheReplay - Taken/shaded with YouFX webcam software, composited next to each other in Photoshop. Licensed under CC BY-SA 3.0 via Wikipedia: http://en.wikipedia.org/wiki/File:Example_of_a_Shader.png#/media/File:Example_of_a_Shader.png

e.g. Transformations (camera, perspective, ...):

    for all vertices (x,y,z) in scene:
        obtain new vertex (x',y',z') = T(x,y,z)
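The per-vertex transformation loop is a natural fit for one-thread-per-element execution, because no vertex depends on any other. As a preview of the CUDA style this course builds toward, here is a minimal sketch. It assumes T is a 4x4 matrix applied to homogeneous coordinates; the names `Vertex`, `Mat4`, `applyTransform`, and `transformVertices` are illustrative and do not come from the lecture.

```cuda
#include <cuda_runtime.h>

// Homogeneous vertex (x, y, z, w). The layout is an assumption for this sketch.
struct Vertex { float x, y, z, w; };

// Row-major 4x4 matrix standing in for the transformation T.
struct Mat4 { float m[16]; };

// Computes T * v. Marked __host__ __device__ so it can run on the CPU or the GPU.
__host__ __device__ inline Vertex applyTransform(const Mat4& T, const Vertex& v) {
    Vertex r;
    r.x = T.m[0]  * v.x + T.m[1]  * v.y + T.m[2]  * v.z + T.m[3]  * v.w;
    r.y = T.m[4]  * v.x + T.m[5]  * v.y + T.m[6]  * v.z + T.m[7]  * v.w;
    r.z = T.m[8]  * v.x + T.m[9]  * v.y + T.m[10] * v.z + T.m[11] * v.w;
    r.w = T.m[12] * v.x + T.m[13] * v.y + T.m[14] * v.z + T.m[15] * v.w;
    return r;
}

// Each GPU thread transforms one vertex; no thread needs another thread's result.
__global__ void transformVertices(const Vertex* in, Vertex* out, Mat4 T, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's vertex index
    if (i < n) out[i] = applyTransform(T, in[i]);   // skip spare threads past the end
}
```

Because `applyTransform` also compiles for the host, the same function can be checked on the CPU before any GPU launch is involved.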

## GPUs: Brief History

Fixed-function pipelines

- Pre-set functions, limited options

Source: http://gamedevelopment.tutsplus.com/articles/the-end-of-fixed-function-rendering-pipelines-and-how-to-move-on--cms-21469; image from Super Mario 64, by Nintendo

Shaders

- Could implement one's own functions!
- GLSL (C-like language)
- Could sneak in general-purpose programming!

Image: http://minecraftsix.com/glsl-shaders-mod/

CUDA (Compute Unified Device Architecture)

- General-purpose parallel computing platform for NVIDIA GPUs

OpenCL (Open Computing Language)

- General heterogeneous computing framework

Accessible as extensions to C (and other languages)!

## GPUs Today

- General-purpose computing on GPUs (GPGPU)
- Demonstrations

## A simple problem

Add two arrays: A[] + B[] -> C[]

On the CPU:

    float *C = malloc(N * sizeof(float));
    for (int i = 0; i < N; i++)
        C[i] = A[i] + B[i];

Operates sequentially... can we do better?

On the CPU (multi-threaded, pseudocode):

    (allocate memory for C)
    Create # of threads equal to number of cores on processor (around 2, 4, perhaps 8)
    (Indicate portions of A, B, C to each thread...)
    ...
    In each thread,
        For (i from beginning region of thread)
            C[i] <- A[i] + B[i]
            // lots of waiting involved for memory reads, writes, ...
    Wait for threads to synchronize...
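One possible rendering of the multi-threaded pseudocode uses the C++ standard library's `std::thread`. This is host-only code (no GPU involved) and only a sketch: the function name `addArraysThreaded` and the contiguous-chunk split are assumptions, and the caller is expected to pass `numThreads >= 1`.

```cuda
#include <algorithm>
#include <thread>
#include <vector>

// Host-only sketch: splits [0, n) into contiguous chunks, one CPU thread per chunk.
// Assumes numThreads >= 1 and that C already points to n floats of storage.
void addArraysThreaded(const float* A, const float* B, float* C, int n, int numThreads) {
    std::vector<std::thread> workers;
    const int chunk = (n + numThreads - 1) / numThreads;  // ceiling division
    for (int t = 0; t < numThreads; ++t) {
        const int begin = t * chunk;
        const int end = std::min(n, begin + chunk);
        if (begin >= end) break;  // no elements left for this or later threads
        // Each worker touches only its own region, so no locking is needed.
        workers.emplace_back([=] {
            for (int i = begin; i < end; ++i) C[i] = A[i] + B[i];
        });
    }
    // "Wait for threads to synchronize": join every worker before returning.
    for (auto& w : workers) w.join();
}
```

A reasonable choice for `numThreads` is `std::thread::hardware_concurrency()`, falling back to a small constant when it returns 0.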

Slightly faster: 2-8x (slightly more with other tricks)

- How many threads? How does performance scale?
- Context switching:
  - High penalty on the CPU
  - Low penalty on the GPU

On the GPU:

    (allocate memory for A, B, C on GPU)
    Create the kernel: each thread will perform one (or a few) additions
    Specify the following kernel operation:
        For (all i's assigned to this thread)
            C[i] <- A[i] + B[i]
    Start ~20000 (!) threads
    Wait for threads to synchronize...

## GPU: Strengths Revealed

- Parallelism / lots of cores
- Low context switch penalty!
- We can cover up performance loss by creating more threads!

## GPU Computing: Step by Step

1. Set up inputs on the host (CPU-accessible memory)
2. Allocate memory for inputs on the GPU
3. Allocate memory for outputs on the host
4. Allocate memory for outputs on the GPU
5. Copy inputs from host to GPU
6. Start GPU kernel
7. Copy output from GPU to host

(Copying can be asynchronous)
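The steps above can be sketched with the CUDA runtime API, applied to the array-addition problem. This is a minimal sketch rather than the lecture's own code: `addKernel` and `addOnGpu` are hypothetical names, the block size of 256 is an arbitrary assumption, and the host-side setup (steps 1 and 3) is left to the caller.

```cuda
#include <cuda_runtime.h>

// Kernel: each thread adds one pair of elements.
__global__ void addKernel(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];  // guard: the last block may have spare threads
}

// hostA/hostB are inputs prepared by the caller (step 1); hostC is the caller's
// output buffer (step 3). Returns the first CUDA error seen, or cudaSuccess.
cudaError_t addOnGpu(const float* hostA, const float* hostB, float* hostC, int n) {
    if (n <= 0) return cudaSuccess;  // nothing to do; also avoids a zero-block launch
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float *devA = nullptr, *devB = nullptr, *devC = nullptr;

    // Steps 2 and 4: allocate GPU memory for the inputs and the output.
    cudaError_t err = cudaMalloc((void**)&devA, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void**)&devB, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void**)&devC, bytes);

    // Step 5: copy inputs from host to GPU.
    if (err == cudaSuccess) err = cudaMemcpy(devA, hostA, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(devB, hostB, bytes, cudaMemcpyHostToDevice);

    // Step 6: start the kernel with enough 256-thread blocks to cover n elements.
    if (err == cudaSuccess) {
        const int threadsPerBlock = 256;
        const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        addKernel<<<blocks, threadsPerBlock>>>(devA, devB, devC, n);
        err = cudaGetLastError();  // reports launch failures
    }

    // Step 7: copy the output back. This blocking copy waits for the kernel.
    if (err == cudaSuccess) err = cudaMemcpy(hostC, devC, bytes, cudaMemcpyDeviceToHost);

    cudaFree(devA);  // cudaFree on a null pointer is a no-op
    cudaFree(devB);
    cudaFree(devC);
    return err;
}
```

The sketch uses blocking `cudaMemcpy` calls for simplicity; the runtime also offers `cudaMemcpyAsync` for the asynchronous copying mentioned above.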

## The Kernel

- Our parallel function
- Simple implementation

## Indexing

- Can get a block ID and a thread ID within the block, which combine into a unique thread ID!

## Calling the Kernel
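A minimal sketch of indexing and launching, assuming a one-dimensional grid. It is not necessarily the code shown in lecture: `blocksNeeded`, `addKernelStrided`, and `launchAdd` are hypothetical names, and the block size (256) and block cap (80, giving 80 x 256 = 20480 threads) are assumptions chosen to land near the "~20000 threads" figure. Each thread derives a unique ID from its block ID and its thread ID within the block, then handles every element that ID maps to, so one thread may perform one or a few additions.

```cuda
#include <cuda_runtime.h>

// Number of blocks so that blocks * threadsPerBlock >= n (ceiling division).
__host__ __device__ inline int blocksNeeded(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Each thread starts at its unique ID and steps by the total thread count,
// so the kernel stays correct even when n exceeds the number of threads.
__global__ void addKernelStrided(const float* a, const float* b, float* c, int n) {
    const int first = blockIdx.x * blockDim.x + threadIdx.x;  // unique thread ID
    const int stride = gridDim.x * blockDim.x;                // threads in the whole grid
    for (int i = first; i < n; i += stride) c[i] = a[i] + b[i];
}

// Launches the kernel on arrays that already live in GPU memory.
void launchAdd(const float* devA, const float* devB, float* devC, int n) {
    const int threadsPerBlock = 256;  // assumption: a common block size
    const int maxBlocks = 80;         // assumption: caps the grid at 20480 threads
    int blocks = blocksNeeded(n, threadsPerBlock);
    if (blocks > maxBlocks) blocks = maxBlocks;
    if (blocks > 0) {
        // <<<number of blocks, threads per block>>> is the launch configuration.
        addKernelStrided<<<blocks, threadsPerBlock>>>(devA, devB, devC, n);
    }
}
```

The launch returns immediately on the host; a later blocking copy or `cudaDeviceSynchronize()` is what waits for the threads to finish.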

## Summary

- For many highly parallelizable problems, the GPU offers a massive performance increase!
- Making difficult problems easy
- Putting impossible problems within reach

## This Course

General topics:

- GPU computing / parallelization
- Audio, linear algebra, medical engineering, machine learning, finance, ...
- CUDA (parallel computing platform)
- Libraries, optimizations, etc.

Prerequisites:

- C/C++ knowledge

## Administrivia

- Course Instructors/TAs: Kevin Yuh ([email protected]), Eric Martin ([email protected])
- CS179: GPU Programming website: http://courses.cms.caltech.edu/cs179/
- Overseeing Instructor: Al Barr ([email protected])
- Class time: ANB 107, MWF 3:00 PM

## Course Requirements

Option 1:

- Homework: 7 assignments, each worth 10% of grade
- Due Wednesdays, 3 PM (changed from 5 PM on 4/3/2015)
- Final project: 3-week project, 30% of grade

Option 2:

- Homework: 5 assignments, each worth 10% of grade
- Due Wednesdays, 3 PM (changed from 5 PM on 4/3/2015)
- Final project: 5-week project, 50% of grade

Difference: Exchange sets 6 and 7 for more time on the project

## Projects

- Topic: your choice!
- Project scale:
  - 5-week projects: significantly more extensive
  - Solo or pairs: expectations set accordingly
- Idea generation:
  - Keep eyes open!
  - Talk to us
  - We hope to bring guests!

## Administrivia

- Collaboration policy: Discuss ideas and strategies freely, but all code must be your own
- 50-foot rule (in spirit): don't consult your code when helping others with their code
- Office Hours: located in ANB 104
  - Kevin: Mondays, 9-11 PM
  - Eric: Tuesdays, 7-9 PM
- Extensions on request
  - Talk to TAs

## Machines

Primary machines (multi-GPU, remote access):

- haru.caltech.edu
- mako.caltech.edu (pending)
- E-mail me your preferred username!
- Change your password
  - Separately on each machine (once mako is up)
  - Use the passwd command

Secondary (CMS) machines:

- mx.cms.caltech.edu
- minuteman.cms.caltech.edu
- Use your CMS login
- Not all assignments work here!

Alternative: Use your own machine! (Harder)

- Must have an NVIDIA CUDA-capable GPU
- Virtual machines won't work!
  - Exception: machines with I/O MMU virtualization and certain GPUs
- Special requirements for:
  - Hybrid/Optimus systems
  - Mac/OS X
- Setup is difficult! (But we have some instructions)
- May need to modify assignment makefiles

## Final remarks for the day

Image: "Three RAAF FA-18 Hornets in formation after refueling" by U.S. Air Force photo by Senior Airman Matthew Bruch

Welcome to the course!
