Productive Parallel Programming
for Intel® Xeon Phi™
Coprocessors
    Bill Magro
    Director and Chief Technologist
    Technical Computing Software
    Intel Software & Services Group
Still an Insatiable Need For Computing
[Chart: projected peak system performance from 1993 to 2029, rising from roughly 100 MFlops toward 1 ZFlops, with application markers for Medical Imaging, Genomics Research, and Weather Prediction; the right-hand portion is a forecast.]

The PetaFlop systems of today are the client and handheld systems 10 years later.
Source: www.top500.org
Approaching Exascale
Some believe…
•  Virtually none of today's hardware or software technologies can be improved or modified to reach exascale
•  A complete revolution is needed

We believe…
•  Evolution of today's technologies + hardware and software innovation can get us there
•  A systems approach, with co-design, is critical
Moore’s Law: Alive and Well

[Chart: Intel process technology cadence]
•  2003: 90 nm, invented SiGe strained silicon
•  2005: 65 nm, 2nd-generation SiGe strained silicon
•  2007: 45 nm, invented gate-last high-k metal gate
•  2009: 32 nm, 2nd-generation gate-last high-k metal gate
•  2011: 22 nm, first to implement tri-gate transistors; a revolutionary leap in process technology with a 37% performance gain at low voltage* and >50% active power reduction at constant performance*

The foundation for all computing… including exascale
Source: Intel
*Compared to Intel 32nm Technology
Intel® Xeon Phi™ Coprocessor [Knights Corner]: Power Efficiency
Performance per watt of a prototype Knights Corner cluster compared to the two top graphics-accelerated clusters (MFLOPS/Watt, higher is better):

•  Intel Corp, Knights Corner: 1381 MFLOPS/Watt (Top500 #150, June 2012, 72.9 kW)
•  Nagasaki Univ., ATI Radeon: 1380 MFLOPS/Watt (Top500 #456, June 2012, 47 kW)
•  Barcelona Supercomputing Center, Nvidia Tesla 2090: 1266 MFLOPS/Watt (Top500 #177, June 2012, 81.5 kW)

Source: www.green500.org
Myth: explicitly managed locality is in and caches are out!
Reality: Caches remain the path to high performance and efficiency


[Bar chart: relative bandwidth and relative bandwidth/watt for memory bandwidth, L2 cache bandwidth, and L1 cache bandwidth.]
#1 Green 500 Cluster
WORLD RECORD!
"Beacon" at NICS
Intel® Xeon® processor + Intel® Xeon Phi™ coprocessor cluster
Most power efficient on the list
2.449 GigaFLOPS / Watt
70.1% efficiency




Other brands and names are the property of their respective owners.
Source: www.green500.org as of Nov 2012
Reaching Exascale Power Goals Requires
     Architectural & Systems Focus
  •  Memory (2x-5x)
     –  New memory interfaces (optimized memory control and xfer)
     –  Extend DRAM with non-volatile memory
  •  Processor (10x-20x)
     –  Reducing data movement (functional reorganization, > 20x)
     –  Domain/Core power gating and aggressive voltage scaling
  •  Interconnect (2x-5x)
     –  More interconnect on package
     –  Replace long haul copper with integrated optics
  •  Data Center Energy Efficiencies (10%-20%)
     –  Higher operating temperature tolerance
     –  480V to the rack and free air/water cooling efficiencies



Reliability of these machines requires a systems approach

[Chart: Extreme Parallelism, top system concurrency trend rising from about 1E+02 in 1993 to beyond 1E+06 by 2009]
[Chart: MTTI measured in minutes; projected MTTI (hours) falls from 2004 to 2016 as DRAM chip count and socket count climb toward 1E+08, assuming 0.1 failures per socket per year, and crosses the time needed to save a global checkpoint]

•  Transparent process migration
•  Holistic fault detection and recovery
•  Reliable end-to-end communications
•  Integrated memory in storage layer for fast checkpoint and workflow
•  N+1 scale reliable architectures consistent with stacked memory constraints
•  System-wide power management and dynamic optimization
•  Must design for system-level debug capability

Reliability is the primary force driving next generation designs
Source: Exascale Computing Study: Technology Challenges in Achieving Exascale Systems (2008)
Foundation of Performance:
Computing Architecture for Discovery

     Intel® Xeon® processor
     •    Ground-breaking real-world application performance
     •    Industry-leading energy efficiency
     •    Meets broad set of HPC challenges


     Intel® Xeon Phi™ product family
     •  Based on Intel® Many Integrated Core (MIC) architecture
     •  Leading performance for highly parallel workloads
     •  Common Intel Xeon programming model
     •  Productive solution for highly-parallel computing




Intel® Xeon® E5-2600 processors


•  Up to 8 cores, up to 20 MB cache
•  Up to 4 channels of DDR3 1600 memory
•  Integrated PCI Express*
•  Up to 73% performance boost vs. prior generation1 on an HPC suite of applications
•  Over 2X improvement on key industry benchmarks
•  Significantly reduced compute time on large, complex data sets with Intel® Advanced Vector Extensions
•  Integrated I/O cuts latency while adding capacity and bandwidth

1 Over previous generation Intel® processors. Intel internal estimate. For more legal information on performance forecasts go to http://www.intel.com/performance
Introducing Intel® Xeon Phi™ Coprocessors
                                                  Highly-parallel Processing for Unparalleled Discovery


Groundbreaking differences
•  Up to 61 IA cores / 1.1 GHz / 244 threads
•  Up to 8 GB memory with up to 352 GB/s bandwidth
•  512-bit SIMD instructions
•  Linux operating system, IP addressable
•  Standard programming languages and tools

Leading to groundbreaking results
•  Up to 1 TeraFlop/s double-precision peak performance1
•  Up to 2.2x higher memory bandwidth than on an Intel® Xeon® processor E5 family-based server2
•  Up to 4x more performance per watt than with an Intel® Xeon® processor E5 family-based server3

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance. Notes 1, 2 & 3: see backup for system configuration details.
BIG GAINS FOR SELECT APPLICATIONS



[Chart: theoretical performance as a function of the vectorized fraction (0-100%) and the parallel fraction of an application; performance rises steeply only when the code is parallelized, vectorized, and scaled to many cores.]

* Theoretical acceleration using a highly-parallel Intel® Xeon Phi™ coprocessor versus a standard multi-core Intel® Xeon® processor
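One way to read this surface (a simple illustrative model, not Intel's stated methodology) is a two-factor Amdahl's-law estimate, assuming the parallel fraction f_p (spread across N cores) and the vectorized fraction f_v (spread across W vector lanes) act independently:

    S(f_p, f_v) = \frac{1}{\left(1 - f_p + \frac{f_p}{N}\right)\left(1 - f_v + \frac{f_v}{W}\right)}

Only when both fractions approach 100% does the estimate approach the N·W ceiling, which is why the surface climbs steeply only in the far corner of the chart.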
Performance Potential of
Intel® Xeon Phi™ Coprocessors
Synthetic Benchmark Summary (Intel® MKL)

(Higher is better; 2S Intel® Xeon® processor vs. 1 Intel® Xeon Phi™ coprocessor)

•  SGEMM (GF/s): 640 → 1,860, up to 2.9X (86% efficient)
•  DGEMM (GF/s): 309 → 883, up to 2.8X (82% efficient)
•  HPLinpack (GF/s): 303 → 803, up to 2.6X (75% efficient)
•  STREAM Triad (GB/s): 79 → 175 (ECC on) / 181 (ECC off), up to 2.2X

Notes
1.  Intel® Xeon® Processor E5-2670 used for all: SGEMM matrix = 13824 x 13824, DGEMM matrix 7936 x 7936, SMP Linpack matrix 30720 x 30720
2.  Intel® Xeon Phi™ coprocessor SE10P (ECC on) with "Gold Release Candidate" SW stack: SGEMM matrix = 15360 x 15360, DGEMM matrix 7680 x 7680, SMP Linpack matrix 26872 x 28672

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Source: Intel measured results as of October 26, 2012. Configuration details: please reference slide speaker notes.
For more information go to http://www.intel.com/performance


PARALLELIZING FOR HIGH PERFORMANCE
                                       Example: SAXPY




STARTING POINT
Typical serial code running on multi-core Intel® Xeon® processors
67.097 seconds: current performance

STEP 1. OPTIMIZE CODE
Parallelize and vectorize the code and continue to run on multi-core Intel Xeon processors
0.46 seconds: 145X faster

STEP 2. USE COPROCESSORS
Run all or part of the optimized code on Intel® Xeon Phi™ coprocessors
0.197 seconds: 2.3X faster than Step 1, 340X faster than the starting point
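The deck does not show the SAXPY source, so here is a minimal sketch of what Step 1 can look like, assuming C/C++ with OpenMP; the array size, values, and the OpenMP 4.0 "parallel for simd" construct are illustrative choices, not taken from the slide:

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // SAXPY: y = a*x + y
    // Step 1: parallelize across cores and ask the compiler to vectorize.
    // (With the 2013-era Intel compiler, "#pragma omp parallel for" plus
    // "#pragma simd" on the loop is an equivalent spelling.)
    static void saxpy(std::size_t n, float a, const float *x, float *y) {
        #pragma omp parallel for simd
        for (std::size_t i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

    int main() {
        const std::size_t n = 1 << 24;
        std::vector<float> x(n, 1.0f), y(n, 2.0f);
        saxpy(n, 3.0f, x.data(), y.data());
        std::printf("y[0] = %f\n", y[0]);   // expect 5.0
        return 0;
    }

Step 2 then runs the same loop on the coprocessor, for example via the offload pragma illustrated later in the deck.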
Performance Proof-Point: Government and Academic Research
WEATHER RESEARCH AND FORECASTING (WRF)

[Chart: speedup, higher is better; 1.0 for a 2S Intel® Xeon® processor E5-2670 four-node cluster vs. 1.45 for the same cluster with an Intel® Xeon Phi™ coprocessor per node (pre-production HW/SW)]

•  Application: Weather Research and Forecasting (WRF)
•  Status: WRF v3.5 coming soon
•  Code optimization:
   –  Approximately two dozen files with less than 2,000 lines of code were modified (out of approximately 700,000 lines of code in about 800 files, all Fortran standard compliant)
   –  Most modifications improved performance for both the host and the coprocessors
•  Performance measurements: V3.5Pre and the NCAR-supported CONUS2.5KM benchmark (a high-resolution weather forecast)
•  Acknowledgments: There were many contributors to these results, including the National Renewable Energy Laboratory and The Weather Channel Companies

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of
those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Source: Intel or third party measured results as of December, 2012. Configuration details: please see backup slides. For more information go to http://www.intel.com/performance. Any difference in system hardware or software design or configuration may affect actual performance.

PROVEN PERFORMANCE BENEFITS
       Intel® Xeon Phi™ Coprocessor


Finite Element Analysis
•  Sandia National Labs MiniFE2: up to 2X

Seismic
•  Acceleware 8th Order Isotropic Variable Velocity1: up to 2.23X
•  China Oil & Gas Geoeast Pre-stack Time Migration3: up to 2.05X

1.  2S Xeon* vs. 1 Xeon Phi* (preproduction HW/SW; application running 100% on coprocessor unless otherwise noted)
2.  8-node cluster, each node with 2S Xeon* (comparison is cluster performance with and without 1 Xeon Phi* per node) (hetero)
3.  2S Xeon* vs. 2S Xeon* + 2 Xeon Phi* (offload)
PROVEN PERFORMANCE BENEFITS
      Intel® Xeon Phi™ Coprocessor


Embree Ray Tracing
•  Intel Labs Ray Tracing3: 1.8X speed-up

Physics
•  Jefferson Lab Lattice QCD: up to 2.7X

Finance
•  Black-Scholes SP2: up to 7X
•  Monte Carlo SP2: up to 10.75X

1.  2S Xeon* vs. 1 Xeon Phi* (preproduction HW/SW; application running 100% on coprocessor unless otherwise noted)
2.  Includes additional FLOPS from transcendental function unit
3.  Intel measured, Oct. 2012
Achieving Productive Parallelism
with Intel® Xeon Phi™
Coprocessors
More Cores. Wider Vectors. Performance Delivered
                                   Intel® Parallel Studio XE 2013 and Intel® Cluster Studio XE 2013


More cores: multicore → many-core (50+ cores)
Wider vectors: 128 bits → 256 bits → 512 bits

Scaling performance efficiently, from serial performance to task & data parallel performance to distributed performance:
•  Industry-leading performance from advanced compilers
•  Comprehensive libraries
•  Parallel programming models
•  Insightful analysis tools
Parallel Performance Potential

•  If your performance needs are met by an Intel® Xeon® processor, they will be achieved with fewer threads than on a coprocessor

•  On a coprocessor:
   –  More threads are needed to achieve the same performance
   –  The same thread count can yield less performance

Intel® Xeon Phi™ excels on highly parallel applications
Maximizing Parallel Performance

        •  Two fundamental considerations:
                – Scaling: Does the application already scale to the limits of Intel®
                  Xeon® processors?


                – Vectorization and memory usage: Does the application make good use
                  of vectors or is it memory bandwidth bound?


        •  If both are true for an application, then the highly parallel and
           power-efficient Intel Xeon Phi coprocessor is most likely to be
           worth evaluating.



Intel® Family of Parallel Programming Models
•  Intel® Cilk™ Plus: C/C++ language extensions to simplify parallelism; open sourced and also an Intel product
•  Intel® Threading Building Blocks: widely used C++ template library for parallelism; open sourced and also an Intel product
•  Domain-specific libraries: Intel® Math Kernel Library
•  Established standards: Message Passing Interface (MPI), OpenMP*, Coarray Fortran, OpenCL*
•  Research and development: Intel® Concurrent Collections, Offload Extensions, Intel® SPMD Parallel Compiler

Choice of high-performance parallel programming models
Applicable to multi-core and many-core programming

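As a concrete illustration of one option in the list above (Intel® Threading Building Blocks), here is a minimal sketch of a parallel loop with tbb::parallel_for; the vector contents and scale factor are illustrative:

    #include <cstddef>
    #include <vector>
    #include <tbb/parallel_for.h>
    #include <tbb/blocked_range.h>

    // Scale every element of a vector in parallel; TBB splits the index
    // range into sub-ranges and schedules them across the available cores.
    void scale(std::vector<float>& data, float factor) {
        tbb::parallel_for(
            tbb::blocked_range<std::size_t>(0, data.size()),
            [&](const tbb::blocked_range<std::size_t>& r) {
                for (std::size_t i = r.begin(); i != r.end(); ++i)
                    data[i] *= factor;
            });
    }

The same source runs on a multicore Xeon host or, compiled natively, on the many-core coprocessor.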
Single-source approach to Multi- and Many-Core

One source, compiled with the same compilers, libraries, and parallel models, runs on:
•  Multicore CPU
•  Multicore CPU + Intel® MIC architecture coprocessor
•  Multicore cluster
•  Clusters with multicore and many-core nodes

"Unparalleled productivity… most of this software does not run on a GPU" - Robert Harrison, NICS, ORNL
R. Harrison, "Opportunities and Challenges Posed by Exascale Computing - ORNL's Plans and Perspectives," National Institute of Computational Sciences, Nov 2011


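A minimal sketch of the single-source idea, assuming the 2013-era Intel compiler: the same OpenMP source builds for the host or, with a flag such as -mmic, natively for the coprocessor (the file name, build lines, and loop are illustrative):

    // sum.cpp: one source for multicore and many-core targets.
    //   Host build (illustrative):        icpc -openmp sum.cpp -o sum_host
    //   Coprocessor native build:         icpc -openmp -mmic sum.cpp -o sum_mic
    #include <cstdio>

    int main() {
        const int n = 1000000;
        double sum = 0.0;
        #pragma omp parallel for reduction(+ : sum)
        for (int i = 0; i < n; ++i)
            sum += 1.0 / (1.0 + i);   // same loop, more threads on the coprocessor
        std::printf("sum = %f\n", sum);
        return 0;
    }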
Intel® Parallel Studio XE
•  Intel® C/C++ and Fortran Compilers w/OpenMP
•  Intel® MKL, Intel® Cilk™ Plus, Intel® TBB, and Intel® IPP
•  Intel® Inspector XE, Intel® VTune™ Amplifier XE, Intel® Advisor

For clusters:
•  Intel® MPI Library
•  Intel® Trace Analyzer and Collector
Intel® Xeon Phi™ Coprocessor - Game Changer for HPC
                  Build your applications on a known compute platform…
                             and watch them take off sooner.



With restrictive special-purpose hardware: new learning, complex code porting
With Intel® Xeon Phi™ coprocessors: familiar tools and runtimes

"We ported millions of lines of code in only days and completed accurate runs. Unparalleled productivity… most of this software does not run on a GPU and never will."
— Robert Harrison, National Institute for Computational Sciences, Oak Ridge National Laboratory

Harrison, Robert. "Opportunities and Challenges Posed by Exascale Computing—ORNL's Plans and Perspectives." National Institute of Computational Sciences (NICS), 2011.
Achieving Parallelism in Applications

IA Benefit: Wide Range of Development Options

Parallelization options (ease of use → fine control):
•  Intel® Math Kernel Library
•  MPI*
•  OpenMP*
•  Intel® Threading Building Blocks
•  Intel® Cilk™ Plus
•  Pthreads*

Vector options (ease of use → fine control):
•  Intel® Math Kernel Library
•  Auto vectorization
•  Semi-auto vectorization: #pragma (vector, ivdep, simd)
•  Array notation: Intel® Cilk™ Plus
•  C/C++ vector classes (F32vec16, F64vec8)
•  OpenCL*
•  Intrinsics
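To make the vector-options spectrum concrete, here is the same loop at three points on it; this is a sketch assuming the 2013-era Intel compiler for the #pragma simd and Cilk Plus forms (function and variable names are illustrative):

    #include <cstddef>

    // (1) Auto vectorization: a simple, dependence-free loop the compiler
    //     can usually vectorize on its own.
    void triad_auto(std::size_t n, float a, const float *x, float *y) {
        for (std::size_t i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

    // (2) Semi-auto vectorization: assert that iterations are independent.
    //     #pragma simd is the Intel Cilk Plus spelling; OpenMP 4.0 offers
    //     the portable "#pragma omp simd" equivalent.
    void triad_pragma(std::size_t n, float a, const float *x, float *y) {
        #pragma simd
        for (std::size_t i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

    // (3) Array notation (Intel Cilk Plus, Intel compiler only):
    //         y[0:n] = a * x[0:n] + y[0:n];
    //     Beyond this sit the C/C++ vector classes (F32vec16) and intrinsics.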
Spectrum of Programming Models and Mindsets

Multi-core centric (CPU) ←→ many-core centric (coprocessor):

•  Multi-core hosted: general-purpose serial and parallel computing; main(), foo(), and MPI_*() all run on the CPU
•  Offload: codes with highly parallel phases; main(), foo(), and MPI_*() run on the CPU, and the parallel work in foo() is offloaded to the coprocessor
•  Symmetric: codes with balanced needs; main(), foo(), and MPI_*() run on both the CPU and the coprocessor
•  Many-core hosted: highly parallel codes; main(), foo(), and MPI_*() all run on the coprocessor

Range of models to meet application needs

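A minimal sketch of the offload model named above, assuming the Intel compiler's Language Extensions for Offload (the array size, values, and names are illustrative):

    #include <cstdio>
    #include <cstdlib>

    int main() {
        const int n = 1 << 20;
        float *x = (float *)std::malloc(n * sizeof(float));
        float *y = (float *)std::malloc(n * sizeof(float));
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        // main() runs on the Xeon host; this region runs on the coprocessor.
        // The clauses copy x in, and copy y both in and back out, over PCIe.
        #pragma offload target(mic) in(x : length(n)) inout(y : length(n))
        {
            #pragma omp parallel for
            for (int i = 0; i < n; ++i)
                y[i] = 3.0f * x[i] + y[i];
        }

        std::printf("y[0] = %f\n", y[0]);   // expect 5.0
        std::free(x);
        std::free(y);
        return 0;
    }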
Operating Environment View

•  Intel® Xeon® processor (host): file I/O, runtimes, platform ABI; connected to the coprocessor over PCIe via sockets/OFED and SCIF
•  Intel® Xeon Phi™ coprocessor (MIC): runs Linux with a Linux Standard Base ("LSB") environment and its own ABI; IP addressable, with SSH and NFS

A flexible, familiar, compatible operating environment


Programming View

The same parallel models are available on the Intel® Xeon® processor (host) and the Intel® Xeon Phi™ coprocessor (MIC), connected over PCIe via the Language Extensions for Offload and SCIF/OFED/IP:

•  Intra-node parallel: Intel® MKL, Intel® TBB, Intel® Cilk™ Plus, OpenMP, Pthreads
•  Intra- and inter-node parallel: MPI, automatic-offload MKL (AO-MKL)
•  Node performance and offload: OpenCL, C++/Fortran

Same parallel models for processor and coprocessor
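As one concrete row of the list above, a sketch of the AO-MKL path: a sufficiently large Intel® MKL call such as cblas_dgemm can be split between host and coprocessor with no source change when automatic offload is enabled (per the MKL documentation of that era, e.g. by setting MKL_MIC_ENABLE=1 in the environment); the matrix size and values here are illustrative:

    #include <vector>
    #include <mkl.h>   // Intel MKL: cblas_dgemm

    int main() {
        const int n = 4096;   // large enough for offload to pay off
        std::vector<double> a(n * n, 1.0), b(n * n, 2.0), c(n * n, 0.0);

        // C = 1.0 * A * B + 0.0 * C. With automatic offload enabled, MKL may
        // transparently run part of this GEMM on the Xeon Phi coprocessor.
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, a.data(), n, b.data(), n, 0.0, c.data(), n);
        return 0;
    }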
Programming Intel® Xeon Phi™ based Systems (MPI+Offload)

•  MPI ranks on Intel® Xeon® processors (only)
•  All messages into/out of the Xeon processors
•  Offload models used to accelerate MPI ranks (see the sketch below)
•  TBB, OpenMP*, Cilk Plus, Pthreads within the coprocessor
•  Homogeneous network of hybrid nodes

[Diagram: each node pairs a CPU with a MIC coprocessor; MPI messages travel over the network between the CPUs, while each CPU offloads data and work to its local coprocessor.]
Offload Code Examples

•  C/C++ Offload Pragma

   #pragma offload target (mic)
   #pragma omp parallel for reduction(+:pi)
   for (i=0; i<count; i++) {
       float t = (float)((i+0.5)/count);
       pi += 4.0/(1.0+t*t);
   }
   pi /= count;

•  Fortran Offload Directive

   !dir$ omp offload target(mic)
   !$omp parallel do
   do i=1,10
       A(i) = B(i) * C(i)
   enddo

•  Function Offload Example

   #pragma offload target(mic) \
           in(transa, transb, N, alpha, beta) \
           in(A:length(matrix_elements)) \
           in(B:length(matrix_elements)) \
           inout(C:length(matrix_elements))
       sgemm(&transa, &transb, &N, &N, &N,
             &alpha, A, &N, B, &N, &beta, C, &N);

•  C/C++ Language Extension

   class _Shared common {
       int data1;
       char *data2;
       class common *next;
       void process();
   };
   _Shared class common obj1, obj2;

   _Cilk_spawn _Offload obj1.process();
   _Cilk_spawn obj2.process();

(A complete, compilable version of the pi example follows this slide.)
Copyright © 2013 Intel Corporation. All rights reserved.          *Other brands and names are the property of their respective owners
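For completeness, the C/C++ pi example above can be dropped into a small main() as sketched below. The wrapper is editor-added and assumes an offload-capable Intel compiler with OpenMP enabled (e.g. icc -openmp); it is not part of the original slide.

   /* pi_offload.c - the pi reduction from the slide wrapped in a minimal,
    * compilable program (illustrative sketch). */
   #include <stdio.h>

   int main(void)
   {
       const int count = 100000000;
       float pi = 0.0f;
       int i;

       #pragma offload target(mic)
       #pragma omp parallel for reduction(+:pi)
       for (i = 0; i < count; i++) {
           float t = (float)((i + 0.5) / count);
           pi += 4.0 / (1.0 + t * t);
       }
       pi /= count;

       printf("pi ~= %f\n", pi);
       return 0;
   }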
Programming Intel® Xeon Phi™ based Systems (MIC Native)

•  MPI ranks on Intel MIC (only)
•  All messages into/out of the coprocessor
•  TBB, OpenMP*, Cilk Plus, Pthreads used directly within MPI processes (see the sketch below)
•  Programmed as a homogeneous network of many-core CPUs

[Diagram: every MPI rank and its data live on a MIC coprocessor; messages travel over the network directly between the coprocessors, and the host CPUs do not run MPI ranks.]
Programming Intel® MIC-based Systems (Symmetric)

•  MPI ranks on the coprocessor and on Intel® Xeon® processors
•  Messages to/from any core
•  TBB, OpenMP*, Cilk Plus, Pthreads used directly within MPI processes
•  Programmed as a heterogeneous network of homogeneous nodes (see the sketch below)

[Diagram: each node holds MPI ranks and data on both the CPU and the MIC; messages flow between CPU and MIC within a node and across the network between nodes.]
Approved for Public Presentation



                                                           A GROWING ECOSYSTEM:
                                        Developing today on Intel® Xeon Phi™ coprocessors




Other brands and names are the property of their respective owners.

Copyright © 2013 Intel Corporation. All rights reserved.   *Other brands and names are the property of their respective owners
Software, Drivers, Tools & Online Resources

                                                                                                                                Tools & Software Downloads



                                                                                                                                Getting Started Development Guides



                                                                                                                                Video Workshops, Tutorials, & Events



                                                                                                                                Code Samples & Case Studies



                                                                                                                                Articles, Forums, & Blogs



                                                                                                                                Associated Product Links



                                                           http://software.intel.com/mic-developer
Copyright © 2013 Intel Corporation. All rights reserved.      *Other brands and names are the property of their respective owners
Keys to Productive Parallel Performance
Determine the best platform target for your application
  •  Intel® Xeon® processors or Intel® Xeon Phi™ coprocessors, or both

Choose the right Xeon-centric or MIC-centric model for your application

Vectorize your application

Parallelize your application
  •  With MPI (or other multi-process model)
  •  With threads (via Pthreads, TBB, Cilk Plus, OpenMP, etc.)
  •  Go asynchronous: overlap computation and communication (see the sketch below)

Maintain unified source code for CPU and coprocessor
Copyright © 2013 Intel Corporation. All rights reserved.   *Other brands and names are the property of their respective owners
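To make the "go asynchronous" advice concrete, the editor-added sketch below overlaps a non-blocking halo exchange with interior computation; the decomposition, buffer names, and stub routines are illustrative and not part of the original slide.

   /* overlap.c - overlap communication with computation using non-blocking
    * MPI (illustrative sketch). */
   #include <mpi.h>
   #include <stdlib.h>

   #define N 4096

   /* Placeholders for the real work on the interior and boundary points. */
   static void compute_interior(double *u) { (void)u; }
   static void compute_boundary(double *u) { (void)u; }

   int main(int argc, char **argv)
   {
       int rank, nranks;
       double *u = calloc(N, sizeof *u);
       double halo_left, halo_right;
       MPI_Request req[4];

       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
       MPI_Comm_size(MPI_COMM_WORLD, &nranks);

       int left  = (rank + nranks - 1) % nranks;
       int right = (rank + 1) % nranks;

       /* Start the halo exchange... */
       MPI_Irecv(&halo_left,  1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
       MPI_Irecv(&halo_right, 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &req[1]);
       MPI_Isend(&u[0],       1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &req[2]);
       MPI_Isend(&u[N - 1],   1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[3]);

       /* ...and compute the interior while the messages are in flight. */
       compute_interior(u);

       /* Only the boundary update has to wait for the halos to arrive. */
       MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
       compute_boundary(u);

       free(u);
       MPI_Finalize();
       return 0;
   }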
Legal Disclaimer & Optimization Notice
                 INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR
                 OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO
                 LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION
                 INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR
                 INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

                 Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.
                 Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software,
                 operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and
                 performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when
                 combined with other products.

                 Copyright © 2012 , Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Xeon Phi, Xeon Phi logo, Core, VTune, and Cilk
                 are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of
                 others.



                 Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor
                 family, not across different processor families: Go to: Learn About Intel® Processor Numbers




                    Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not
                    unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations.
                    Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured
                    by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain
                    optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable
                    product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
                                                                                                                              Notice revision #20110804

                                                                                    Copyright© 2012, Intel Corporation. All rights reserved.
                                                                                                                                               intel.com/software/products
Copyright © 2013 Intel Corporation. All rights reserved.   *Other brands and names are the property of their respective owners.
This slide MUST be used with any slides removed from this presentation




                                                                       Legal Disclaimers
          •  All products, computer systems, dates, and figures specified are preliminary based on current expectations,
             and are subject to change without notice.
          •  Intel processor numbers are not a measure of performance. Processor numbers differentiate features within
             each processor family, not across different processor families. Go to:
             http://www.intel.com/products/processor_number
          •  Intel processors, chipsets, and desktop boards may contain design defects or errors known as errata, which
             may cause the product to deviate from published specifications. Current characterized errata are available on
             request.
          •  Intel® Virtualization Technology requires a computer system with an enabled Intel® processor, BIOS, virtual
             machine monitor (VMM). Functionality, performance or other benefits will vary depending on hardware and
             software configurations. Software applications may not be compatible with all operating systems. Consult
             your PC manufacturer. For more information, visit http://www.intel.com/go/virtualization
          •  No computer system can provide absolute security under all conditions. Intel® Trusted Execution Technology
             (Intel® TXT) requires a computer system with Intel® Virtualization Technology, an Intel TXT-enabled
             processor, chipset, BIOS, Authenticated Code Modules and an Intel TXT-compatible measured launched
             environment (MLE). Intel TXT also requires the system to contain a TPM v1.2. For more information, visit
             http://www.intel.com/technology/security
          •  Requires a system with Intel® Turbo Boost Technology capability. Consult your PC manufacturer.
             Performance varies depending on hardware, software and system configuration. For more information, visit
             http://www.intel.com/technology/turboboost
          •  Intel® AES-NI requires a computer system with an AES-NI enabled processor, as well as non-Intel software to
             execute the instructions in the correct sequence. AES-NI is available on select Intel® processors. For
             availability, consult your reseller or system manufacturer. For more information, see
             http://software.intel.com/en-us/articles/intel-advanced-encryption-standard-instructions-aes-ni/
          •  Intel product is manufactured on a lead-free process. Lead is below 1000 PPM per EU RoHS directive
             (2002/95/EC, Annex A). No exemptions required
          •  Halogen-free: Applies only to halogenated flame retardants and PVC in components. Halogens are below
             900ppm bromine and 900ppm chlorine.
          •  Intel, Intel Xeon, Intel Core microarchitecture, the Intel Xeon logo and the Intel logo are trademarks or
             registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
          •  Copyright © 2011, Intel Corporation. All rights reserved.


Copyright © 2013 Intel Corporation. All rights reserved.   *Other brands and names are the property of their respective owners
This slide MUST be used with any slides with performance data removed from this presentation




                                                  Legal Disclaimers: Performance
         •  Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate
            performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration
            may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or
            components they are considering purchasing. For more information on performance tests and on the performance of Intel
            products, Go to: http://www.intel.com/performance/resources/benchmark_limitations.htm.
         •  Intel does not control or audit the design or implementation of third party benchmarks or Web sites referenced in this document.
            Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmarks are
            reported and confirm whether the referenced benchmarks are accurate and reflect performance of systems available for purchase.
         •  Relative performance is calculated by assigning a baseline value of 1.0 to one benchmark result, and then dividing the actual
            benchmark result for the baseline platform into each of the specific benchmark results of each of the other platforms, and
            assigning them a relative performance number that correlates with the performance improvements reported.
         •  SPEC, SPECint, SPECfp, SPECrate, SPECpower, SPECjAppServer, SPECjEnterprise, SPECjbb, SPECompM, SPECompL, and
            SPEC MPI are trademarks of the Standard Performance Evaluation Corporation. See http://www.spec.org for more information.
         •  TPC Benchmark is a trademark of the Transaction Processing Council. See http://www.tpc.org for more information.
         •  SAP and SAP NetWeaver are the registered trademarks of SAP AG in Germany and in several other countries. See
            http://www.sap.com/benchmark for more information.
         •  INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR
            OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO
            LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS
            INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE,
            MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
         •  Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate
            performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration
            may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or
            components they are considering purchasing. For more information on performance tests and on the performance of Intel
            products, reference www.intel.com/software/products.




Copyright © 2013 Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners
WRF Configuration (Backup)

            •  Measured by Intel 3/27/2013 (Endeavor Cluster)
            •  Runs in symmetric model.
            •  Citation: WRF Graphic and detail: John Michalakes 11/12/12; Hardware: 2
               socket server with Intel® Xeon® processor E5-2670 (8C, 2.6GHz. 115W),
               each node equipped with one Intel® Xeon Phi™ coprocessor (SE10X B1, 61
               core, 1.1GHz, 8GB @ 5.5GT/s) in a 8 node FDR cluster. WRF is available
               from the US National Center for Atmospheric Research in Boulder, Colorado
            •  It is available from http://www.wrf-model.org/
            •  All KNC optimizations are in the V3.5 svn today
            •  Results obtained under MPSS 3552, Compiler rev 146, MPI rev 30 on SE10X
               B1 KNC (61 cores, 1.1 GHz, 5.5 GT/s)
            •  WRF CONUS2.5km workload available from
               www.mmm.ucar.edu/wrf/WG2/bench/
            •  Performance comparison is based upon the average timestep; initialization and
               post-simulation file operations are ignored.

Copyright © 2013 Intel Corporation. All rights reserved.     *Other brands and names are the property of their respective owners

UiPath Studio Web workshop series - Day 8
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6
 
201610817 - edge part1
201610817 - edge part1201610817 - edge part1
201610817 - edge part1
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptx
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdf
 
VoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXVoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBX
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URL
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™
 

Productive Parallel Programming for Intel® Xeon Phi™ Coprocessors

  • 1. Productive Parallel Programming for Intel® Xeon Phi™ Coprocessors Bill Magro! Director and Chief Technologist! Technical Computing Software! Intel Software & Services Group!
  • 2. Still an Insatiable Need For Computing Weather Prediction 1 ZFlops 100 EFlops 10 EFlops Genomics Research 1 EFlops 100 PFlops 10 PFlops Medical Imaging 1 PFlops 100 TFlops 10 TFlops 1 TFlops 100 GFlops 10 GFlops 1 GFlops Forecast 100 MFlops 1993 1999 2005 2011 2017 2023 2029 PetaFlop Systems of Today Are The Client And Handheld Systems 10 years Later Source: www.top500.org
  • 4. Some believe… •  Virtually none of today’s hardware or software technologies can be improved or modified to reach exascale •  A complete revolution is needed We believe… •  Evolution of today’s technologies + hardware and software innovation can get us there •  A systems approach – with co-design – is critical
  • 5. Moore’s Law: Alive and Well 2003 2005 2007 2009 2011 90 nm 65 nm 45 nm 32 nm 22 22nm nm A Revolutionary Leap in Process SiGe SiGe Technology Invented SiGe 2nd Gen. SiGe Invented Gate-Last 2nd Gen. Gate-Last First to Implement 37% Strained Silicon Strained Silicon High-k High-k Tri-Gate Performance Gain at Metal Gate Metal Gate Low Voltage* STRAINED SILICON HIGH-k METAL GATE >50% Active Power Reduction at Constant TRI-GATE Performance* The foundation for all computing… including Exa-Scale Source: Intel *Compared to Intel 32nm Technology
  • 6. Intel® Xeon Phi™ Coprocessor [Knights Corner]: Power Efficiency
Performance per Watt of a prototype Knights Corner cluster compared to the 2 top graphics-accelerated clusters (MFLOPS/Watt, higher is better):
•  Intel Corp, Knights Corner: 1381 (Top500 #150, June 2012, 72.9 kW)
•  Nagasaki Univ., ATI Radeon: 1380 (Top500 #456, June 2012, 47 kW)
•  Barcelona Supercomputing Center, Nvidia Tesla 2090: 1266 (Top500 #177, June 2012, 81.5 kW)
Source: www.green500.org. Visual and Parallel Computing Group. Copyright © 2012 Intel Corporation. All rights reserved.
  • 7. Myth: explicitly managed locality is in and caches are out! Reality: Caches remain path to high performance and efficiency. (Chart: relative bandwidth and relative bandwidth per watt for memory, L2 cache, and L1 cache.)
  • 8. #1 Green 500 Cluster WORLD RECORD! “Beacon” at NICS Intel® Xeon® Processor + Intel Xeon Phi™ Coprocessor Cluster Most Power Efficient on the List 2.449 GigaFLOPS / Watt 70.1% efficiency Other brands and names are the property of their respective owners. Source: www.green500.org as of Nov 2012
  • 9. Reaching Exascale Power Goals Requires Architectural & Systems Focus •  Memory (2x-5x) –  New memory interfaces (optimized memory control and xfer) –  Extend DRAM with non-volatile memory •  Processor (10x-20x) –  Reducing data movement (functional reorganization, > 20x) –  Domain/Core power gating and aggressive voltage scaling •  Interconnect (2x-5x) –  More interconnect on package –  Replace long haul copper with integrated optics •  Data Center Energy Efficiencies (10%-20%) –  Higher operating temperature tolerance –  480V to the rack and free air/water cooling efficiencies 9
  • 10. Reliability of these machines requires a systems approach
•  Transparent process migration
•  Holistic fault detection and recovery
•  Reliable end-to-end communications
•  Integrated memory in storage layer for fast checkpoint and workflow
•  N+1 scale reliable architectures consistent with stacked memory constraints
•  System-wide power management and dynamic optimization
•  Must design for system-level debug capability
(Charts: top system concurrency trend, 1993 to 2009, showing extreme parallelism; projected chip, DRAM, and socket counts through 2016 against MTTI in hours, with MTTI measured in minutes; and the crossover point between failures per socket per year and the time to save a global checkpoint.)
Reliability is the primary force driving next generation designs. Source: Exascale Computing Study: Technology Challenges in achieving Exascale Systems (2008)
  • 12. Architecture for Discovery Intel® Xeon® processor •  Ground-breaking real-world application performance •  Industry-leading energy efficiency •  Meets broad set of HPC challenges Intel® Xeon Phi™ product family •  Based on Intel® Many Integrated Core (MIC) architecture •  Leading performance for highly parallel workloads •  Common Intel Xeon programming model •  Productive solution for highly-parallel computing 12
  • 13. Intel® Xeon® E5-2600 processors
•  Up to 73% performance boost vs. prior gen1 on HPC suite of applications
•  Over 2X improvement on key industry benchmarks
•  Up to 8 cores, up to 20 MB cache
•  Up to 4 channels DDR3 1600 memory
•  Significantly reduce compute time on large, complex data sets with Intel® Advanced Vector Extensions
•  Integrated PCI Express*: integrated I/O cuts latency while adding capacity & bandwidth
1 Over previous generation Intel® processors. Intel internal estimate. For more legal information on performance forecasts go to http://www.intel.com/performance
  • 14. Introducing Intel® Xeon Phi™ Coprocessors Highly-parallel Processing for Unparalleled Discovery Groundbreaking: differences Up to 61 IA cores/1.1 GHz/ 244 Threads Up to 8GB memory with up to 352 GB/s bandwidth 512-bit SIMD instructions Linux operating system, IP addressable Standard programming languages and tools Leading to Groundbreaking results Up to 1 TeraFlop/s double precision peak performance1 Enjoy up to 2.2x higher memory bandwidth than on an Intel® Xeon® processor E5 family-based server.2 Up to 4x more performance per watt than with an Intel® Xeon® processor E5 family-based server. 3 Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. 14 For more information go to http://www.intel.com/performance Notes 1, 2 & 3, see backup for system configuration details.
  • 15. BIG GAINS FOR SELECT APPLICATIONS
(Chart: theoretical acceleration* plotted against the fraction of the application that is parallel, 0% to 100%, and the fraction that is vectorized; gains grow only as code is parallelized, vectorized, and scaled to many-core.)
* Theoretical acceleration using a highly-parallel Intel® Xeon Phi™ coprocessor versus a standard multi-core Intel® Xeon® processor
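The shape of that curve can be reasoned about with a simple Amdahl-style model: the serial, unvectorized remainder bounds the gain no matter how many cores or SIMD lanes are available. The sketch below is illustrative only and is not taken from the deck; the function name and the assumed counts of cores and SIMD lanes are hypothetical.

    #include <stdio.h>

    /* Illustrative Amdahl-style model (an assumption, not the slide's exact model):
     * p     = fraction of runtime that has been parallelized
     * v     = fraction of runtime that has been vectorized
     * cores, lanes = available cores and SIMD lanes.
     * Vectorization is applied uniformly; the serial remainder (1 - p) gets no
     * benefit from extra cores, which flattens the curve at low parallel fractions. */
    static double modeled_speedup(double p, double v, int cores, int lanes)
    {
        double per_thread = (1.0 - v) + v / lanes;   /* time of one thread's work after vectorization */
        return 1.0 / ((1.0 - p) * per_thread + p * per_thread / cores);
    }

    int main(void)
    {
        /* Hypothetical many-core target: 60 cores, 16 single-precision lanes. */
        printf("50%% parallel, 50%% vector: %5.1fx\n", modeled_speedup(0.50, 0.50, 60, 16));
        printf("90%% parallel, 90%% vector: %5.1fx\n", modeled_speedup(0.90, 0.90, 60, 16));
        printf("99%% parallel, 99%% vector: %5.1fx\n", modeled_speedup(0.99, 0.99, 60, 16));
        return 0;
    }

The slide's y-axis is the ratio against a multi-core Intel Xeon processor rather than against serial code, so its absolute numbers are smaller, but the qualitative message is the same: both the parallel fraction and the vectorized fraction must be high before a many-core part pays off.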
  • 16. Performance Potential of Intel® Xeon Phi™ Coprocessors
  • 17. Synthetic Benchmark Summary (Intel® MKL)
•  SGEMM (GF/s): up to 2.9X higher; 2S Intel® Xeon® processor 640, 1 Intel® Xeon Phi™ coprocessor 1,860
•  DGEMM (GF/s): up to 2.8X higher; 2S Intel® Xeon® processor 309, 1 Intel® Xeon Phi™ coprocessor 883
•  HPLinpack (GF/s): up to 2.6X higher; 2S Intel® Xeon® processor 303, 1 Intel® Xeon Phi™ coprocessor 803
•  STREAM Triad (GB/s): up to 2.2X higher; 2S Intel® Xeon® processor 79, 1 Intel® Xeon Phi™ coprocessor 175 (ECC on) / 181 (ECC off)
Higher is better. (The chart also reports coprocessor efficiencies of 82%, 75%, and 86% for the three compute benchmarks.)
Notes 1. Intel® Xeon® Processor E5-2670 used for all: SGEMM Matrix = 13824 x 13824, DGEMM Matrix 7936 x 7936, SMP Linpack Matrix 30720 x 30720 2. Intel® Xeon Phi™ coprocessor SE10P (ECC on) with “Gold Release Candidate” SW stack: SGEMM Matrix = 15360 x 15360, DGEMM Matrix 7680 x 7680, SMP Linpack Matrix 26872 x 28672
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Source: Intel Measured results as of October 26, 2012. Configuration Details: Please reference slide speaker notes. For more information go to http://www.intel.com/performance Copyright © 2013 Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners
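These results are produced with Intel® MKL routines. The fragment below is a minimal, hedged illustration of calling MKL's SGEMM through the standard CBLAS interface; it is not the benchmark code behind the slide, and the matrix size and the mention of the MKL_MIC_ENABLE environment variable for automatic offload are assumptions based on MKL's documented behavior of that era.

    #include <stdlib.h>
    #include <mkl.h>   /* Intel MKL CBLAS interface */

    int main(void)
    {
        int n = 4096;   /* assumed size; the slide used larger matrices */
        float *a = (float *)mkl_malloc((size_t)n * n * sizeof(float), 64);
        float *b = (float *)mkl_malloc((size_t)n * n * sizeof(float), 64);
        float *c = (float *)mkl_malloc((size_t)n * n * sizeof(float), 64);
        for (long i = 0; i < (long)n * n; i++) { a[i] = 1.0f; b[i] = 2.0f; c[i] = 0.0f; }

        /* C = 1.0*A*B + 0.0*C. With MKL automatic offload enabled in the
         * environment (e.g., MKL_MIC_ENABLE=1), large GEMMs may be split
         * between the host and the coprocessor without source changes. */
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0f, a, n, b, n, 0.0f, c, n);

        mkl_free(a); mkl_free(b); mkl_free(c);
        return 0;
    }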
  • 18. PARALLELIZING FOR HIGH PERFORMANCE. Example: SAXPY
•  Starting point: typical serial code running on current multi-core Intel® Xeon® processors: 67.097 seconds
•  Step 1, optimize code: parallelize and vectorize the code and continue to run on multi-core Intel Xeon processors: 0.46 seconds (145X faster)
•  Step 2, use coprocessors: run all or part of the optimized code on Intel® Xeon Phi™ coprocessors: 0.197 seconds (2.3X faster than Step 1, 340X faster than the serial starting point)
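The slide does not show the SAXPY source itself, so the following is a minimal sketch of what Step 1 (parallelize and let the compiler vectorize on the host) and Step 2 (offload with the offload pragma) might look like. The function names, problem size, and data initialization are assumptions, not the code used for the timings above.

    #include <stdlib.h>

    /* Step 1: parallelize across cores with OpenMP; the unit-stride loop
     * is a good candidate for compiler auto-vectorization. */
    void saxpy_host(int n, float a, float *x, float *y)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    /* Step 2: run the same loop on the Intel Xeon Phi coprocessor via the
     * offload pragma; x is copied in, y is copied in and back out. */
    void saxpy_offload(int n, float a, float *x, float *y)
    {
        #pragma offload target(mic) in(x:length(n)) inout(y:length(n))
        {
            #pragma omp parallel for
            for (int i = 0; i < n; i++)
                y[i] = a * x[i] + y[i];
        }
    }

    int main(void)
    {
        int n = 1 << 24;   /* assumed problem size */
        float *x = malloc(n * sizeof(float));
        float *y = malloc(n * sizeof(float));
        for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }
        saxpy_host(n, 3.0f, x, y);      /* Step 1: multi-core host */
        saxpy_offload(n, 3.0f, x, y);   /* Step 2: coprocessor */
        free(x); free(y);
        return 0;
    }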
  • 19. Performance Proof-Point: Government and Academic Research. WEATHER RESEARCH AND FORECASTING (WRF)
•  Application: Weather Research and Forecasting (WRF)
•  Status: WRF v3.5 coming soon
•  Code optimization: approximately two dozen files with less than 2,000 lines of code were modified (out of approximately 700,000 lines of code in about 800 files, all Fortran standard compliant); most modifications improved performance for both the host and the coprocessors
•  Performance measurements: V3.5Pre and NCAR-supported CONUS2.5KM benchmark (a high resolution weather forecast)
•  Speedup (higher is better): 1.45 for the four-node cluster of 2S Intel® Xeon® processor E5-2670 plus Intel® Xeon Phi™ coprocessor (pre-production HW/SW), versus 1.0 for the four-node cluster of 2S Intel® Xeon® processor E5-2670 alone
•  Acknowledgments: There were many contributors to these results, including the National Renewable Energy Laboratory and The Weather Channel Companies
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Source: Intel or third party measured results as of December, 2012. Configuration Details: Please see backup slides. For more information go to http://www.intel.com/performance Any difference in system hardware or software design or configuration may affect actual performance. Copyright © 2013 Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners
  • 20. PROVEN PERFORMANCE BENEFITS: Intel® Xeon Phi™ Coprocessor
•  Finite Element Analysis, Sandia National Labs, MiniFE2: up to 2X
•  Seismic, Acceleware, 8th Order Isotropic Variable Velocity1: up to 2.23X
•  China Oil & Gas, Geoeast Pre-stack Time Migration3: up to 2.05X
1. 2S Xeon* vs. 1 Xeon Phi* (preproduction HW/SW & application running 100% on coprocessor, unless otherwise noted) 2. 8 node cluster, each node with 2S Xeon* (comparison is cluster performance with and without 1 Xeon Phi* per node) (Hetero) 3. 2S Xeon* vs. 2S Xeon* + 2 Xeon Phi* (offload)
  • 21. PROVEN PERFORMANCE BENEFITS: Intel® Xeon Phi™ Coprocessor
•  Embree Ray Tracing, Intel Labs, Ray Tracing3: 1.8X speed-up
•  Physics, Jefferson Lab, Lattice QCD: up to 2.7X
•  Finance, Black-Scholes SP2: up to 7X
•  Finance, Monte Carlo SP2: up to 10.75X
1. 2S Xeon* vs. 1 Xeon Phi* (preproduction HW/SW & application running 100% on coprocessor, unless otherwise noted) 2. Includes additional FLOPS from transcendental function unit 3. Intel Measured Oct. 2012
  • 22. Achieving Productive Parallelism with Intel® Xeon Phi™ Coprocessors
  • 23. More Cores. Wider Vectors. Performance Delivered. Intel® Parallel Studio XE 2013 and Intel® Cluster Studio XE 2013
•  Industry-leading performance from advanced compilers
•  Comprehensive libraries
•  Parallel programming models
•  Insightful analysis tools
(Diagram: scaling performance efficiently from serial, to task & data parallel, to distributed performance; more cores, from multicore to many-core with 50+ cores; wider vectors, from 128 bits to 256 bits to 512 bits.)
Copyright © 2013 Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners
  • 24. Parallel Performance Potential
•  If your performance needs are met by an Intel Xeon® processor, they will be achieved with fewer threads than on a coprocessor
•  On a coprocessor:
–  Need more threads to achieve the same performance
–  The same thread count can yield less performance
Intel® Xeon Phi™ excels on highly parallel applications
Copyright © 2013 Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners
  • 25. Maximizing Parallel Performance •  Two fundamental considerations: – Scaling: Does the application already scale to the limits of Intel® Xeon® processors? – Vectorization and memory usage: Does the application make good use of vectors or is it memory bandwidth bound? •  If both are true for an application, then the highly parallel and power-efficient Intel Xeon Phi coprocessor is most likely to be worth evaluating. Copyright © 2013 Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners
  • 26. Intel® Family of Parallel Programming Models
•  Intel® Cilk™ Plus: C/C++ language extensions to simplify parallelism; open sourced and also an Intel product
•  Intel® Threading Building Blocks: widely used C++ template library for parallelism; open sourced and also an Intel product
•  Domain-Specific Libraries: Intel® Math Kernel Library
•  Established Standards: Message Passing Interface (MPI), OpenMP*, Offload Extensions, Coarray Fortran, OpenCL*
•  Research and Development: Intel® Concurrent Collections, Intel® SPMD Parallel Compiler
Choice of high-performance parallel programming models, applicable to multi-core and many-core programming
Copyright © 2013 Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners
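As one concrete taste of these options, the same element-wise operation can be written with Intel® Cilk™ Plus, one of the models listed above. This sketch is illustrative rather than taken from the deck; it assumes a compiler with Cilk Plus support (such as the Intel compilers of that generation), and the function names are hypothetical.

    #include <cilk/cilk.h>   /* cilk_for and Cilk Plus array notation */

    /* Whole-array expression using array notation: the compiler is free to
     * vectorize the a[0:n] = b[0:n] * c[0:n] section in one shot. */
    void multiply_array_notation(int n, const float *b, const float *c, float *a)
    {
        a[0:n] = b[0:n] * c[0:n];
    }

    /* The same operation with cilk_for, which spreads the iterations
     * across worker threads. */
    void multiply_cilk_for(int n, const float *b, const float *c, float *a)
    {
        cilk_for (int i = 0; i < n; i++)
            a[i] = b[i] * c[i];
    }

Because these models are source-level, the same code targets both the multi-core processor and the many-core coprocessor, which is the point of the "family" picture above.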
  • 27. Single-source approach to Multi- and Many-Core
(Diagram: a single source base feeds common compilers, libraries, and parallel models, which target a multicore CPU, the Intel® MIC architecture coprocessor, multicore clusters, and clusters with multicore and many-core nodes.)
“Unparalleled productivity… most of this software does not run on a GPU” - Robert Harrison, NICS, ORNL. R. Harrison, “Opportunities and Challenges Posed by Exascale Computing - ORNL's Plans and Perspectives”, National Institute of Computational Sciences, Nov 2011
Copyright © 2013 Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners
  • 28. Intel® C/C++ and Fortran Compilers w/OpenMP Intel® MKL, Intel® Cilk Plus, Intel® TBB, and Intel® IPP Intel® Inspector XE, Intel® VTune™ Amplifier XE, Intel® Advisor Copyright © 2013 Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners
  • 29. Intel® C/C++ and Fortran Compilers w/OpenMP; Intel® MKL, Intel® Cilk Plus, Intel® TBB, and Intel® IPP; Intel® Inspector XE, Intel® VTune™ Amplifier XE, Intel® Advisor: together, Intel® Parallel Studio XE
Copyright © 2013 Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners
  • 30. Intel® C/C++ and Fortran Compilers w/OpenMP; Intel® MPI Library; Intel® MKL, Intel® Cilk Plus, Intel® TBB, and Intel® IPP; Intel® Trace Analyzer and Collector; Intel® Inspector XE, Intel® VTune™ Amplifier XE, Intel® Advisor; Intel® Parallel Studio XE
Copyright © 2013 Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners
  • 31. (Slide 30 repeated.)
  • 32. Intel® Xeon Phi™ Coprocessor: Game Changer for HPC
Build your applications on a known compute platform… and watch them take off sooner.
(Diagram: with restrictive special-purpose hardware, complex code porting and new learning are required; with the Intel® Xeon Phi™ coprocessor, familiar tools and runtimes are used.)
“We ported millions of lines of code in only days and completed accurate runs. Unparalleled productivity… most of this software does not run on a GPU and never will.” Robert Harrison, National Institute for Computational Sciences, Oak Ridge National Laboratory. Harrison, Robert. “Opportunities and Challenges Posed by Exascale Computing - ORNL's Plans and Perspectives.” National Institute of Computational Sciences (NICS), 2011.
  • 33. Achieving Parallelism in Applications. IA Benefit: Wide Range of Development Options (ordered from ease of use to fine control)
•  Parallelization options: Intel® Math Kernel Library; MPI*; OpenMP*; Intel® Threading Building Blocks; Intel® Cilk™ Plus; OpenCL*; Pthreads*
•  Vector options: Intel® Math Kernel Library; auto vectorization; semi-auto vectorization: #pragma (vector, ivdep, simd); array notation: Intel® Cilk™ Plus; C/C++ vector classes (F32vec16, F64vec8); intrinsics
Copyright © 2013 Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners
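To make the vectorization end of that spectrum concrete, the sketch below contrasts a semi-automatic hint with explicit 512-bit intrinsics for the same triad loop. This is an illustration under assumptions, not code from the deck: #pragma simd is an Intel compiler extension of that era, the intrinsic form assumes Knights Corner's 512-bit instructions exposed through immintrin.h, 64-byte-aligned pointers, and n divisible by 16, and the function names are hypothetical.

    #include <immintrin.h>

    /* Ease of use: assert independence and let the compiler vectorize. */
    void triad_pragma(int n, float a, const float *x, const float *y, float *z)
    {
        #pragma simd
        for (int i = 0; i < n; i++)
            z[i] = a * x[i] + y[i];
    }

    /* Fine control: 512-bit intrinsics, 16 single-precision lanes per iteration.
     * Assumes 64-byte-aligned pointers and n a multiple of 16 (no remainder loop). */
    void triad_intrinsics(int n, float a, const float *x, const float *y, float *z)
    {
        __m512 va = _mm512_set1_ps(a);
        for (int i = 0; i < n; i += 16) {
            __m512 vx = _mm512_load_ps(x + i);
            __m512 vy = _mm512_load_ps(y + i);
            _mm512_store_ps(z + i, _mm512_fmadd_ps(va, vx, vy));   /* a*x + y */
        }
    }

The pragma form is usually where to start; intrinsics trade portability and maintainability for control, which is why they sit at the "fine control" end of the slide's spectrum.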
  • 34. Spectrum of Programming Models and Mindsets
From multi-core centric (CPU) to many-core centric (coprocessor):
•  Multi-Core Hosted: general purpose computing
•  Offload: codes with highly-parallel phases
•  Symmetric: codes with balanced serial and parallel needs
•  Many-Core Hosted: highly-parallel codes
Range of models to meet application needs
Copyright © 2013 Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners
  • 35. Operating Environment View
•  Intel® Xeon® processor host: Linux Standard Base (“LSB”) platform and ABI
•  Intel® Xeon Phi™ coprocessor: Linux, with standard runtimes and ABI
•  Connected over PCIe with IP networking, SSH, NFS file I/O, and sockets/OFED, layered on SCIF
A flexible, familiar, compatible operating environment
Copyright © 2013 Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners
  • 36. Programming View
Same parallel models for the processor and the coprocessor:
•  Intra-node parallel: Intel® MKL, Intel® TBB, Intel® Cilk™ Plus, OpenMP, Pthreads on both the Intel® Xeon® processor and the Intel® Xeon Phi™ coprocessor
•  Intra- and inter-node parallel: MPI on both
•  Node performance: automatic-offload MKL (AO-MKL) and OpenCL on both
•  Offload: Language Extensions for Offload connect C++/Fortran on the host with C++/Fortran on the coprocessor
•  Transport across PCIe: SCIF / OFED / IP
Copyright © 2013 Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners
  • 37. Spectrum of Programming Models and Mindsets (slide 34, repeated as a section transition)
  • 38. Programming Intel® Xeon Phi™ based Systems (MPI+Offload)
•  MPI ranks on Intel® Xeon® processors (only)
•  All messages into/out of the Xeon processors
•  Offload models used to accelerate MPI ranks
•  TBB, OpenMP*, Cilk Plus, Pthreads within the coprocessor
•  Homogeneous network of hybrid nodes (each node pairs a CPU with a coprocessor; MPI traffic runs between CPUs over the network, and data moves to the coprocessor via offload)
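A minimal sketch of this MPI+Offload style: each MPI rank runs on the host and accelerates its local work with an offload region, while MPI messages stay between the host ranks. The domain decomposition, buffer size, and function name are hypothetical; this is not code from the deck.

    #include <mpi.h>
    #include <stdlib.h>

    /* Hypothetical per-rank kernel: sum of squares of the local chunk,
     * executed on the coprocessor through the offload pragma. */
    static double local_work(double *chunk, int n)
    {
        double sum = 0.0;
        #pragma offload target(mic) in(chunk:length(n)) inout(sum)
        {
            #pragma omp parallel for reduction(+:sum)
            for (int i = 0; i < n; i++)
                sum += chunk[i] * chunk[i];
        }
        return sum;
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int n = 1 << 20;                          /* assumed local problem size */
        double *chunk = malloc(n * sizeof(double));
        for (int i = 0; i < n; i++) chunk[i] = rank + 1.0;

        double local = local_work(chunk, n), global = 0.0;
        /* MPI messages remain host-to-host, as in the picture above. */
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        free(chunk);
        MPI_Finalize();
        return 0;
    }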
  • 39. Offload Code Examples
•  C/C++ offload pragma:
    #pragma offload target (mic)
    #pragma omp parallel for reduction(+:pi)
    for (i = 0; i < count; i++) {
        float t = (float)((i + 0.5) / count);
        pi += 4.0 / (1.0 + t*t);
    }
    pi /= count;
•  Fortran offload directive:
    !dir$ omp offload target(mic)
    !$omp parallel do
    do i = 1, 10
        A(i) = B(i) * C(i)
    enddo
•  C/C++ language extension:
    class _Shared common {
        int data1;
        char *data2;
        class common *next;
        void process();
    };
    _Shared class common obj1, obj2;
    _Cilk_spawn _Offload obj1.process();
    _Cilk_spawn obj2.process();
•  Function offload example:
    #pragma offload target(mic) in(transa, transb, N, alpha, beta) in(A:length(matrix_elements)) in(B:length(matrix_elements)) inout(C:length(matrix_elements))
    sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N, &beta, C, &N);
Copyright © 2013 Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners
  • 40. Spectrum of Programming Models and Mindsets Multi-Core Centric Many-Core Centric CPU Coprocessor MulB-­‐Core  Hosted   Symmetric   Many  Core  Hosted   General  purpose   Codes  with  balanced   serial  and  parallel   Highly-­‐parallel  codes   compu0ng   needs   Offload   Codes  with  highly-­‐   parallel  phases   Main(  )   Main(  )   Main(  )   CPU Foo(  )   Foo(  )   Foo(  )   MPI_*(  )   MPI_*(  )   MPI_*(  )   Main(  )   Main(  )   Many-core Foo(  )   Foo(  )   Foo(  )   Co-processor MPI_*(  )   MPI_*(  )   Range of models to meet application needs Copyright © 2013 Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners
  • 41. Programming Intel® Xeon Phi™ based Systems (MIC Native)
•  MPI ranks on Intel® MIC coprocessors (only)
•  All messages into/out of the coprocessors
•  TBB, OpenMP*, Cilk Plus, Pthreads used directly within MPI processes
•  Programmed as a homogeneous network of many-core CPUs
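For the native model, the program itself is ordinary MPI code; what changes is how it is built and launched. The sketch below is a hello-world under assumptions: the -mmic compiler option and the mpirun host syntax in the comments reflect the typical workflow of that era, and the host name is hypothetical; none of it is taken from the deck.

    /* Assumed build for the coprocessor:
     *   mpiicc -mmic -O2 hello_native.c -o hello.mic
     * Assumed launch of ranks directly on a coprocessor (hypothetical host "mic0"):
     *   mpirun -host mic0 -n 60 ./hello.mic
     */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size, len;
        char name[MPI_MAX_PROCESSOR_NAME];
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(name, &len);
        printf("rank %d of %d running natively on %s\n", rank, size, name);
        MPI_Finalize();
        return 0;
    }

The symmetric model on the next slides uses the same source, built once for the host and once for the coprocessor, with ranks of both binaries launched in a single MPI job.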
  • 42. Spectrum of Programming Models and Mindsets (slide 34, repeated as a section transition)
  • 43. Programming Intel® MIC-based Systems (Symmetric)
•  MPI ranks on the coprocessors and the Intel® Xeon® processors
•  Messages to/from any core
•  TBB, OpenMP*, Cilk Plus, Pthreads used directly within MPI processes
•  Programmed as a heterogeneous network of homogeneous nodes
  • 44. Spectrum of Programming Models and Mindsets (slide 34, repeated as a section transition)
  • 45. Approved for Public Presentation A GROWING ECOSYSTEM: Developing today on Intel® Xeon Phi™ coprocessors Other brands and names are the property of their respective owners. Copyright © 2013 Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners
  • 46. Software, Drivers, Tools & Online Resources Tools & Software Downloads Getting Started Development Guides Video Workshops, Tutorials, & Events Code Samples & Case Studies Articles, Forums, & Blogs Associated Product Links http://software.intel.com/mic-developer Copyright © 2013 Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners
  • 47. Keys to Productive Parallel Performance
•  Determine the best platform target for your application: Intel® Xeon® processors or Intel® Xeon Phi™ coprocessors, or both
•  Choose the right Xeon-centric or MIC-centric model for your application
•  Vectorize your application
•  Parallelize your application: with MPI (or another multi-process model); with threads (via Pthreads, TBB, Cilk Plus, OpenMP, etc.); go asynchronous: overlap computation and communication
•  Maintain unified source code for the CPU and the coprocessor
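One way to "go asynchronous" in the offload model is the signal/wait form of the offload pragmas, sketched below. This is illustrative only: the kernel, buffer size, and signal tag are assumptions, and the same overlap can equally be achieved with non-blocking MPI in the host code.

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical kernel, compiled for both host and coprocessor. */
    __attribute__((target(mic)))
    void transform(const float *in, float *out, int n)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            out[i] = in[i] * 2.0f;
    }

    int main(void)
    {
        int n = 1 << 20;                        /* assumed buffer size */
        float *in_buf  = malloc(n * sizeof(float));
        float *out_buf = malloc(n * sizeof(float));
        char tag;                               /* its address serves only as a signal tag */

        for (int i = 0; i < n; i++) in_buf[i] = (float)i;

        /* Start the offload without blocking the host thread. */
        #pragma offload target(mic:0) signal(&tag) \
                in(in_buf:length(n)) out(out_buf:length(n))
        transform(in_buf, out_buf, n);

        /* ... host computation or MPI communication can proceed here ... */

        /* Block only when the coprocessor result is actually needed. */
        #pragma offload_wait target(mic:0) wait(&tag)

        printf("first result: %f\n", out_buf[0]);
        free(in_buf); free(out_buf);
        return 0;
    }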
  • 48. Copyright © 2013 Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners
  • 49. Legal Disclaimer & Optimization Notice
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Copyright © 2012, Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Xeon Phi, Xeon Phi logo, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others. Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. Go to: Learn About Intel® Processor Numbers
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804
Copyright © 2012, Intel Corporation. All rights reserved. intel.com/software/products
Copyright © 2013 Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners
  • 50. This slide MUST be used with any slides removed from this presentation Legal Disclaimers •  All products, computer systems, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice. •  Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. Go to: http://www.intel.com/products/processor_number •  Intel, processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request. •  Intel® Virtualization Technology requires a computer system with an enabled Intel® processor, BIOS, virtual machine monitor (VMM). Functionality, performance or other benefits will vary depending on hardware and software configurations. Software applications may not be compatible with all operating systems. Consult your PC manufacturer. For more information, visit http://www.intel.com/go/virtualization •  No computer system can provide absolute security under all conditions. Intel® Trusted Execution Technology (Intel® TXT) requires a computer system with Intel® Virtualization Technology, an Intel TXT-enabled processor, chipset, BIOS, Authenticated Code Modules and an Intel TXT-compatible measured launched environment (MLE). Intel TXT also requires the system to contain a TPM v1.s. For more information, visit http://www.intel.com/technology/security •  Requires a system with Intel® Turbo Boost Technology capability. Consult your PC manufacturer. Performance varies depending on hardware, software and system configuration. For more information, visit http://www.intel.com/technology/turboboost •  Intel® AES-NI requires a computer system with an AES-NI enabled processor, as well as non-Intel software to execute the instructions in the correct sequence. AES-NI is available on select Intel® processors. For availability, consult your reseller or system manufacturer. For more information, see http://software.intel.com/en-us/articles/intel-advanced-encryption-standard-instructions-aes-ni/ •  Intel product is manufactured on a lead-free process. Lead is below 1000 PPM per EU RoHS directive (2002/95/EC, Annex A). No exemptions required •  Halogen-free: Applies only to halogenated flame retardants and PVC in components. Halogens are below 900ppm bromine and 900ppm chlorine. •  Intel, Intel Xeon, Intel Core microarchitecture, the Intel Xeon logo and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. •  Copyright © 2011, Intel Corporation. All rights reserved. Copyright © 2013 Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners
  • 51. This slide MUST be used with any slides with performance data removed from this presentation Legal Disclaimers: Performance •  Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, Go to: http://www.intel.com/performance/resources/benchmark_limitations.htm. •  Intel does not control or audit the design or implementation of third party benchmarks or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmarks are reported and confirm whether the referenced benchmarks are accurate and reflect performance of systems available for purchase. •  Relative performance is calculated by assigning a baseline value of 1.0 to one benchmark result, and then dividing the actual benchmark result for the baseline platform into each of the specific benchmark results of each of the other platforms, and assigning them a relative performance number that correlates with the performance improvements reported. •  SPEC, SPECint, SPECfp, SPECrate. SPECpower, SPECjAppServer, SPECjEnterprise, SPECjbb, SPECompM, SPECompL, and SPEC MPI are trademarks of the Standard Performance Evaluation Corporation. See http://www.spec.org for more information. •  TPC Benchmark is a trademark of the Transaction Processing Council. See http://www.tpc.org for more information. •  SAP and SAP NetWeaver are the registered trademarks of SAP AG in Germany and in several other countries. See http://www.sap.com/benchmark for more information. •  INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. •  Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference www.intel.com/software/products. Copyright © 2013 Intel Corporation. All rights reserved. 51 *Other brands and names are the property of their respective owners
  • 52. WRF Configuration (Backup) •  Measured by Intel 3/27/2013 (Endeavor Cluster) •  Runs in symmetric model. •  Citation: WRF Graphic and detail: John Michalakes 11/12/12; Hardware: 2 socket server with Intel® Xeon® processor E5-2670 (8C, 2.6GHz. 115W), each node equipped with one Intel® Xeon Phi™ coprocessor (SE10X B1, 61 core, 1.1GHz, 8GB @ 5.5GT/s) in a 8 node FDR cluster. WRF is available from the US National Center for Atmospheric Research in Boulder, Colorado •  It is available from http://www.wrf-model.org/ •  All KNC optimizations are in the V3.5 svn today •  Results obtained under MPSS 3552, Compiler rev 146, MPI rev 30 on SE10X B1 KNC (61c, 1.1Ghz, 5.5Gts) •  WRF CONUS2.5km workload available from www.mmm.ucar.edu/wrf/WG2/bench/ •  Performance comparison is based upon average timestep, we ignore initialization and post simulation file operations. Copyright © 2013 Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners