

#### **VIP Briefing on GPU Accelerator Technology**

- Steve Scott, CTO of Tesla
- Ian Buck, GM of GPU Computing
- Dr. Dirk Pleiter, Jülich

# Long Term Goals for Tesla







- Power Efficiency
- Ease of Programming and Portability
- Application Space Coverage



#### KEPLER THE WORLD'S FASTEST, MOST EFFICIENT HPC ACCELERATOR

- SMX (power efficiency)
- Hyper-Q and Dynamic Parallelism (programmability and application coverage)







**Dual GK104 GPUs**

- **3x Single Precision**
- Video, Signal, Life Sciences, Seismic
- Available Now





- **3x Double Precision**
- Hyper-Q & Dynamic Parallelism
- CFD, FEA, Finance, Physics, etc.
- Available Q4 2012

#### Kepler GK110 Block Diagram

- 7.1B Transistors
- 15 SMX units
- > 1 TFLOP FP64
- 1.5 MB L2 Cache
- 384-bit GDDR5, ~250 GB/s
- PCI Express Gen3
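
The headline figures above can be cross-checked with simple arithmetic. The following is a rough sketch, not a specification: it derives the per-pin data rate implied by a 384-bit bus delivering about 250 GB/s, and the minimum clock needed to exceed 1 TFLOP FP64 on 15 SMX units. The count of 64 double-precision units per SMX and the convention of two FLOPs per fused multiply-add are assumptions that do not appear on the slide, and all names are illustrative.

```python
# Figures taken from the slide.
BUS_WIDTH_BITS = 384
BANDWIDTH_BYTES_PER_S = 250e9   # "~250 GB/s"
SMX_COUNT = 15
TARGET_FP64_FLOPS = 1e12        # "> 1 TFLOP FP64"

# Assumptions that are NOT on the slide.
DP_UNITS_PER_SMX = 64           # assumed double-precision units per SMX
FLOPS_PER_FMA = 2               # one fused multiply-add counted as 2 FLOPs


def implied_pin_rate_gbps(bandwidth_bytes_per_s, bus_width_bits):
    """Per-pin data rate (Gbit/s) needed to reach the given bandwidth."""
    return bandwidth_bytes_per_s * 8 / bus_width_bits / 1e9


def min_clock_mhz(target_flops, smx_count, dp_units_per_smx):
    """Lowest clock (MHz) at which peak FP64 throughput reaches the target."""
    flops_per_cycle = smx_count * dp_units_per_smx * FLOPS_PER_FMA
    return target_flops / flops_per_cycle / 1e6


pin_rate = implied_pin_rate_gbps(BANDWIDTH_BYTES_PER_S, BUS_WIDTH_BITS)
min_clock = min_clock_mhz(TARGET_FP64_FLOPS, SMX_COUNT, DP_UNITS_PER_SMX)

print(f"Implied GDDR5 per-pin rate: {pin_rate:.2f} Gbit/s")
print(f"Minimum clock for 1 TFLOP FP64: {min_clock:.1f} MHz")
```

Under these assumptions the two claims are mutually plausible: the memory figure needs roughly 5.2 Gbit/s per pin, and the FP64 figure needs a clock only slightly above 520 MHz.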



[Figure: GK110 block diagram showing the PCI Express 3.0 host interface, GigaThread Engine, 15 SMX units, L2 cache, and memory controllers.]

#### Kepler GK110 SMX vs Fermi SM

[Figure: Fermi SM diagram showing warp schedulers, cores, LD/ST units, SFUs, and texture units.]

#### 3x sustained perf/W

Ground-up redesign for perf/W:

- 6x the SP FP units
- 4x the DP FP units
- Significantly slower FU clocks

[Figure: Kepler SMX diagram showing the instruction cache, four warp schedulers with two dispatch units each, the register file (65,536 x 32-bit), cores, DP units, LD/ST units, SFUs, the interconnect network, 64 KB shared memory / L1 cache, a read-only cache, and texture units.]

#### Selected Kepler ISA Enhancements

#### Larger number of registers per thread

- 63 in Fermi  $\rightarrow$  255 in Kepler
- Register spilling was a common performance limiter on Fermi
- Significant performance improvement for some codes (e.g., 5.3x on QUDA QCD)

#### Atomic operations

- Added int64 to match int32
- Added functional units  $\rightarrow$  2-10x performance gains

#### SHFL instruction for data exchange amongst threads of a warp

- Broadcast, shifts, butterflies
- Useful for sorts, reductions, etc.
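
To make the butterfly pattern concrete, here is a small host-side simulation in Python rather than device code. It models a 32-lane warp as a list and shows how repeated XOR exchanges leave the full sum in every lane without going through shared memory. The function names `shfl_xor` and `warp_reduce_sum` are illustrative only and are not the CUDA intrinsics.

```python
WARP_SIZE = 32


def shfl_xor(values, lane_mask):
    """Simulated butterfly exchange: lane i reads the value of lane i XOR lane_mask."""
    return [values[lane ^ lane_mask] for lane in range(len(values))]


def warp_reduce_sum(values):
    """Butterfly reduction: after log2(32) = 5 exchange steps every lane holds the sum."""
    assert len(values) == WARP_SIZE
    vals = list(values)
    offset = WARP_SIZE // 2
    while offset >= 1:
        partner = shfl_xor(vals, offset)              # exchange with partner lane
        vals = [a + b for a, b in zip(vals, partner)]  # each lane adds what it received
        offset //= 2
    return vals


result = warp_reduce_sum(list(range(WARP_SIZE)))
print(result[0], result[31])  # both lanes hold 0 + 1 + ... + 31
```

A broadcast corresponds to every lane reading from one fixed lane, and a shift to every lane reading from a neighbour at a fixed distance; only the index expression inside the exchange changes.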

#### Loads through texture memory

- Higher bandwidth and flexibility for read-only data (`const __restrict__`)

#### Hyper-Q

- **FERMI**: 1 work queue
- **KEPLER**: 32 concurrent work queues







## Fermi Concurrency



#### Fermi allows 16-way concurrency

- Up to 16 grids can run at once
- But CUDA streams multiplex into a single queue
- Overlap only at stream edges

# **Kepler Improved Concurrency**



#### **Kepler allows 32-way concurrency**

- One work queue per stream
- Concurrency at full-stream level
- No inter-stream dependencies
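
The difference between one multiplexed queue and one queue per stream can be illustrated with a toy scheduling model. This is a simplified sketch under stated assumptions, not a description of the hardware: kernels within a stream run in order, a single queue issues strictly in order so a blocked head stalls everything behind it, and the limits on concurrent grids are ignored. The function names are hypothetical.

```python
def single_queue_makespan(streams):
    """All streams multiplexed depth-first into one in-order queue."""
    queue = [(s, duration) for s, kernels in enumerate(streams) for duration in kernels]
    stream_ready = [0] * len(streams)  # finish time of the last kernel in each stream
    last_issue = 0                     # in-order issue: cannot start before the previous entry
    makespan = 0
    for s, duration in queue:
        start = max(last_issue, stream_ready[s])
        finish = start + duration
        stream_ready[s] = finish
        last_issue = start
        makespan = max(makespan, finish)
    return makespan


def per_stream_makespan(streams):
    """One work queue per stream: streams progress independently."""
    return max(sum(kernels) for kernels in streams)


# Three streams, each with three dependent kernels of unit duration.
streams = [[1, 1, 1] for _ in range(3)]
fermi_like = single_queue_makespan(streams)
kepler_like = per_stream_makespan(streams)
print(fermi_like, kepler_like)
```

In the single-queue case the first kernel of a stream can overlap only with the last kernel of the previous stream, which is the "overlap only at stream edges" behaviour; with per-stream queues the three streams run side by side.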



[Figure: multiple CPU processes sharing one GPU.]





# Kepler Hyper-Q: Simultaneous Multiprocess



[Figure: two timeline panels, "Without Hyper-Q" and "With Hyper-Q", each plotted against time.]

# **Dynamic Parallelism**

#### The ability for any GPU thread to launch a parallel GPU kernel

- Dynamically
- Simultaneously
- Independently



Fermi: Only CPU can generate GPU work

Kepler: GPU can generate work for itself




# **Dynamic Work Generation**



- Lower accuracy
- Higher accuracy, lower performance
- Target performance where accuracy is required
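
The idea of spending work only where accuracy is required can be sketched on the host with adaptive numerical integration. This is an analogy written in Python, not GPU code: each work item either accepts its interval or generates two finer child items, much as a running kernel could launch further kernels. The function `adaptive_integrate` and its tolerance-splitting rule are illustrative assumptions.

```python
import math
from collections import deque


def adaptive_integrate(f, a, b, tol=1e-6):
    """Integrate f on [a, b], refining only intervals whose error estimate exceeds tol."""
    work = deque([(a, b, tol)])  # pending work items: (lower, upper, local tolerance)
    total = 0.0
    launched = 0
    while work:
        lo, hi, local_tol = work.popleft()
        launched += 1
        mid = 0.5 * (lo + hi)
        coarse = 0.5 * (hi - lo) * (f(lo) + f(hi))                  # one trapezoid
        fine = 0.25 * (hi - lo) * (f(lo) + 2.0 * f(mid) + f(hi))    # two trapezoids
        if abs(fine - coarse) <= local_tol:
            total += fine  # accurate enough: no further work generated
        else:
            # Not accurate enough: this item generates two finer child items.
            work.append((lo, mid, 0.5 * local_tol))
            work.append((mid, hi, 0.5 * local_tol))
    return total, launched


total, launched = adaptive_integrate(math.sin, 0.0, math.pi)
print(f"integral = {total:.6f} using {launched} work items")
```

A uniform fine grid would pay the highest resolution everywhere; the adaptive version reaches the same tolerance while refining only where the error estimate demands it.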



# Familiar Syntax and Programming Model



# Simpler Code: LU Example



#### LU decomposition (Kepler)



# **CUDA** By the Numbers:

>375,000,000 CUDA-Capable GPUs

>1,000,000 Toolkit Downloads

>120,000 Active Developers

>500 Universities Teaching CUDA

#### CUDA 5

- Nsight<sup>™</sup> for Linux & Mac
- NVIDIA GPUDirect<sup>™</sup>
- Library Object Linking

Preview Release Now Available



## NVIDIA<sup>®</sup> Nsight<sup>™</sup> Eclipse Edition



[Screenshot: Nsight Eclipse Edition debugging the `cudaFindMax` kernel from `findmax.cu`, showing CUDA thread and block state, source code, and per-thread register values.]





- Automated CPU to GPU code refactoring
- Semantic highlighting of CUDA code
- Integrated code samples & docs

#### **Nsight Debugger**

- Simultaneous debugging of CPU and GPU
- Inspect variables across CUDA threads
- Use breakpoints & single-step debugging

#### **Nsight Profiler**


Quickly identifies performance issues

- Integrated expert system
- Automated analysis
- Source line correlation

[Screenshot: profiler analysis results flagging low global memory store efficiency and low global memory load efficiency.]

#### Available for Linux and Mac OS

# Kepler Enables Full NVIDIA GPUDirect<sup>™</sup>



[Figure: GPUDirect between Server 1 and Server 2.]

#### **GPU Computing with LLVM**

Developers want to build front-ends for Java, Python, R, DSLs

Target other processors like ARM, FPGA, GPUs, x86





**NVIDIA Confidential** 

#### **OpenACC Directives**



Simple Compiler hints

Compiler Parallelizes code

Portability, Productivity, Performance

Your original Fortran or C code


# Performance: Leveraging GPU





#### Enabling ARM Ecosystem: CARMA DevKit (CUDA on ARM)

- Tegra 3 Quad-core ARM A9
- Quadro 1000M (96 CUDA cores)
- Ubuntu
- Gigabit Ethernet
- SATA Connector
- HDMI, DisplayPort, USB


# The Day Job That Makes It All Possible...

Leverage volume graphics market to serve HPC

- HPC needs outstrip HPC market's ability to fund the development
- Computational graphics and compute are highly aligned



Tegra

GeForce

Quadro





# Jülich-NVIDIA Application Lab

19. June 2012 | Dirk Pleiter (JSC)



#### Supercomputing at Forschungszentrum Jülich

# Role of the Jülich Supercomputing Centre (JSC):

- Operation of supercomputers for local, national and European scientists
- User support including support of research communities by means of simulation laboratories
- R&D on future IT technologies, algorithms, tools, GRID, etc.
- Education and training of users





#### **Our view on GPU computing**

- Performance acceleration for a significant set of relevant scientific applications
- **JUDGE** = Jülich Dedicated GPU Environment
  - 206 node IBM iDataPlex cluster
  - Dual-CPU, dual-GPU nodes
  - About 240 TFlops (peak)
  - Partitions dedicated to astrophysics and brain research
- Large potential for energy efficient computing
  - JUDGE is #14 on Green500 (Nov. 2011)
  - Need for efficient utilisation of all computing devices





#### **Jülich-NVIDIA Application Lab**

- Lab hosted at JSC
- Mission statement
  - Enable scientific applications for GPU-based architectures
  - Provide support for optimization
  - Investigate performance and scaling

#### Targeted research areas

- Astrophysics and astronomy
- Computational medicine and neuroscience
- Elementary particle physics
- Material science
- Protein folding





[R. Spurzem et al., 2012]



[O. Zimmermann, 2011]

[G. Sutmann et al., 2011]



#### **Pilot application: JuBrain**

- The Jülich Brain Model will display selected aspects of the brain's structural organization such as cortical areas and fiber tracts
  - Improve understanding of fiber operation
  - Help treat neurological diseases
- Procedure
  - Preparation of brain sections
  - Image processing
  - 3D reconstruction and fiber tractography
- GPUs already deliver a significant speed-up today



[M. Axer et al., 2012]



# **Questions?**