Pages

Wednesday, 7 November 2018

AMD shows commitment to high-performance data centre computing

- Amazon Web Services introduces three new AMD EPYC processor-based cloud instances

- New 7 nm AMD Radeon Instinct MI60 and MI50 (Vega 20) data centre graphics processors

- Zen 2 CPU core and modular design methodology revealed

- Rome demonstrated for the first time

AMD has demonstrated its commitment to data centre computing innovation at its Next Horizon event in the US with new details of its upcoming 7 nm compute and graphics product portfolio, positioned to extend the capabilities of the modern data centre.

During the event, AMD shared new specifics on its upcoming Zen 2 processor core architecture, detailed its chiplet-based x86 CPU design, launched the 7 nm AMD Radeon Instinct MI60 graphics accelerator and conducted the first public demonstration of its next-generation 7 nm EPYC server processor, codenamed Rome.

Amazon Web Services (AWS) joined AMD at the event to announce the availability of three of its popular instance families on the Amazon Elastic Compute Cloud (EC2) that are powered by the AMD EPYC processor.

“The multi-year investments we have made in our data centre hardware and software roadmaps are driving growing adoption of our CPUs and GPUs across cloud, enterprise and high performance computing (HPC) customers,” said Dr Lisa Su, President and CEO, AMD.

“We are well positioned to accelerate our momentum as we introduce the industry’s broadest, most powerful portfolio of data centre CPUs and GPUs featuring industry-leading 7 nm process technology over the coming quarters.”

Zen 2 details

AMD shared more details of the modular design methodology in its upcoming Zen 2 high-performance x86 processor core at the event. This modular system design uses an enhanced version of the AMD Infinity Fabric interconnect to link separate pieces of silicon (called chiplets) within a single processor package. The multi-chip processor uses 7 nm process technology for the Zen 2 CPU cores while leveraging mature 14 nm process technology for the input/output portion of the chip. The result is much higher performance, with more CPU cores at the same power, and more cost-effective manufacturing than traditional monolithic chip designs.

Combining this breakthrough design methodology with the benefits of TSMC’s 7 nm process technology, Zen 2 delivers significant performance, power consumption and density generational improvements that can help reduce data centre operating costs, carbon footprint and cooling requirements, AMD said. Other generational advances over the award-winning Zen core include:

• An improved execution pipeline.

• Front-end advances such as an improved branch predictor and better instruction pre-fetching.

• Floating point enhancements.

• Advanced security features, such as hardware-enhanced mitigations for Spectre, a class of speculative execution vulnerabilities affecting modern processors.

Multiple 7 nm-based AMD products are now in development, including next-generation AMD EPYC CPUs and AMD Radeon Instinct GPUs, both of which AMD detailed and demonstrated at the event. Additionally, the company said its follow-on 7 nm+-based Zen 3 and Zen 4 x86 core architectures are on track.

AMD EPYC on AWS

The immediate availability of the first AMD EPYC processor-based instances on Amazon Elastic Compute Cloud (EC2) was also announced. Part of AWS’ popular instance families, the new AMD EPYC processor-powered offerings feature industry-leading core density and memory bandwidth.

This results in exceptional per-dollar performance for general purpose and memory-optimised workloads. The core density of AMD EPYC processors offers customers a balance of compute, memory and networking resources for web and application servers, backend servers for enterprise applications, and test/development environments, with seamless application migration. The memory bandwidth advantage of AMD EPYC processors is ideal for in-memory processing, data mining and dynamic data processing.

“The availability of multiple AMD EPYC processor-powered instances on Amazon EC2 instances marks a significant milestone in the growing adoption of our high-performance CPUs with cloud service providers,” said Forrest Norrod, Senior VP and GM, Datacenter and Embedded Solutions Business Group, AMD.

“The powerful combination of cores, memory bandwidth and I/O on AMD EPYC processors creates a highly differentiated solution that can offer lower TCO for our customers and lower prices for the end user. Working with AWS, the No. 1 provider in cloud services, has been amazing for the AMD team and we are excited to see the new instances come online today for their customers.”

"These new instances should be a great fit for customers who are looking to further cost-optimise their Amazon EC2 compute environment. As always, we recommend that you measure performance and cost on your own workloads when choosing your instance types," said AWS' Chief Evangelist Jeff Barr in a blog post introducing the new AMD instances.

The new instances are available as variants of Amazon EC2’s memory-optimised and general purpose instance families. AMD-based R5 and M5 instances can be launched via the AWS Management Console or AWS Command Line Interface and are available today in the US, Europe and Asia Pacific regions, with availability in additional regions planned soon. AMD-based T3 instances will be available in the coming weeks.

AMD-based M5 and R5 instances are available in six sizes with up to 96 vCPUs and up to 768 GB of memory. AMD-based T3 instances will be available in seven sizes with up to eight vCPUs and 32 GB of memory. The new instances can be purchased as On-Demand, Reserved, or Spot instances.
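The sizing above implies a fixed memory-to-vCPU ratio at the top end; a quick sketch of the arithmetic (illustrative only, the variable names are mine):

```python
# Relating the largest quoted AMD-based instance size (96 vCPUs) to its
# memory ceiling (768 GB) gives the memory-to-vCPU ratio of the top
# memory-optimised (R5-class) size.
max_vcpus = 96
max_memory_gb = 768

memory_per_vcpu = max_memory_gb / max_vcpus  # GB of memory per vCPU
print(memory_per_vcpu)  # 8.0
```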

More about Rome

AMD additionally disclosed new details and delivered performance previews of its next-generation EPYC processors codenamed Rome:

• Processor enhancements including up to 64 Zen 2 cores, increased instructions-per-cycle1 and leadership compute, input-output (I/O) and memory bandwidth2.

• Platform enhancements including the industry’s first PCIe 4.0-capable x86 server processor with double the bandwidth per channel3 to improve data centre accelerator performance.

• Double the compute performance per socket4 and four times the floating point performance per socket5 compared to current AMD EPYC processors.

• Socket compatibility with today’s AMD EPYC server platforms.

AMD demonstrated the performance and platform advantages of its next-generation EPYC processor with two demonstrations during the event:

• A pre-production single-socket next-generation AMD EPYC processor outperforming a commercially available top-of-the-line dual-processor Intel Xeon server running the computationally intensive, industry-standard C-Ray benchmark6.

• The industry’s first x86 PCIe 4.0-capable platform demo, featuring a Radeon Instinct MI60 processor to accelerate image recognition.

Rome is sampling with customers now and is expected to be the world’s first high-performance x86 7 nm CPU.

World's first 7 nm GPUs

Source: AMD. Dr Su with a Radeon Instinct GPU.

AMD further launched the world’s first 7 nm GPUs and the industry’s only hardware-virtualised GPUs – the AMD Radeon Instinct MI60 and MI50.

These new graphics cards are based on the Vega architecture and are specifically designed for machine learning and artificial intelligence (AI) applications such as rapidly training complex neural networks, delivering higher levels of floating-point performance7, greater efficiencies8 and new features for data centre deployments.

“Legacy GPU architectures limit IT managers from effectively addressing the constantly evolving demands of processing and analysing huge datasets for modern cloud data centre workloads,” said David Wang, Senior VP of Engineering, Radeon Technologies Group at AMD.

“Combining world class performance and a flexible architecture with a robust software platform and the industry’s leading edge ROCm open software ecosystem, the new AMD Radeon Instinct accelerators provide the critical components needed to solve the most difficult cloud computing challenges today and into the future.”

Key features of the new accelerators include:

• Optimised deep learning operations.

• World’s fastest double precision PCIe accelerator: The AMD Radeon Instinct MI60 is the world’s fastest double precision PCIe9 accelerator10.

• Up to 6X faster data transfer: Two Infinity Fabric Links per GPU deliver up to 200 GBps of peer-to-peer bandwidth, up to 6X faster than PCIe 3.0 alone11.

• Ultra-fast HBM2 memory: The AMD Radeon Instinct MI60 provides 32 GB of HBM2 error-correcting code (ECC) memory12, and the Radeon Instinct MI50 provides 16 GB of HBM2 ECC memory.

• Secure virtualised workload support.

A live demonstration during the event showed the flagship AMD Radeon Instinct MI60 running real-time training, inference and image classification.

Open software platform upgrade

In addition to the new hardware announcements, AMD announced ROCm 2.0, a new version of its open software platform for accelerated computing that includes new math libraries, broader software framework support, and optimised deep learning operations. ROCm 2.0 has been extended to Linux distributions including CentOS, Red Hat Enterprise Linux and Ubuntu.

“Google believes that open source is good for everyone,” said Rajat Monga, Engineering Director, TensorFlow, Google. “We've seen how helpful it can be to open source machine learning technology, and we’re glad to see AMD embracing it. With the ROCm open software platform, TensorFlow users will benefit from GPU acceleration and a more robust open source machine learning ecosystem.”

The AMD Radeon Instinct MI60 accelerator is expected to ship by end-2018. The AMD Radeon Instinct MI50 accelerator is expected to begin shipping by the end of Q1 2019. The ROCm 2.0 open software platform is expected to be available by the end of 2018.

Details:

Presentations from the event are available online.

1 Estimated increase in instructions per cycle (IPC) is based on AMD internal testing for Zen 2 across microbenchmarks, measured at 4.53 IPC for DKERN + RSA, compared to the prior Zen 1 generation CPU (measured at 3.5 IPC for DKERN + RSA) using combined floating point and integer benchmarks.
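The footnote's two IPC figures can be turned into a percentage directly; a small sketch using the numbers quoted above:

```python
# IPC figures quoted in footnote 1 (AMD internal DKERN + RSA testing).
zen2_ipc = 4.53
zen1_ipc = 3.5

# Fractional generational uplift: roughly 29%.
ipc_uplift = zen2_ipc / zen1_ipc - 1
print(f"{ipc_uplift:.0%}")
```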

2 NAP-42 – AMD EPYC 7601 processor supports up to eight channels of DDR4-2667, versus the Xeon Platinum 8180 processor at six channels of DDR4-2667. NAP-43 – AMD EPYC 7601 processor includes up to 32 CPU cores versus the Xeon Platinum 8180 processor with 28 CPU cores.

NAP-44 – A single AMD EPYC 7601 processor offers up to 2 TB/processor (x 2 = 4 TB), versus a single Xeon Platinum 8180 processor at 768 GB/processor (x 2 = 1.54 TB). NAP-56 – AMD EPYC processor supports up to 128 PCIe Gen 3 I/O lanes (in both one and two-socket configurations), versus the Intel Xeon SP Series processor supporting a maximum of 48 lanes PCIe Gen 3 per CPU, plus 20 lanes in the chipset (maximum of 68 lanes on one socket and 116 lanes on two sockets).

Based on Zen 2 design parameters versus Zen 1 and currently shipping products: core count increases from 32 to up to 64 per socket; memory speed increases to DDR4-3200 across eight memory channels; and I/O leadership extends to PCIe Gen4.

3 Per Silicon Labs, provider of PCIe Gen4 solutions. PCIe Gen4 is a new standardised data transfer bus that doubles the per-lane data transfer rate of the prior Gen3 revision from 8 GTps (gigatransfers/second) to 16 GTps. This means that a single PCIe Gen4 lane will allow data transfers of up to 2 GBps (gigabytes/second), and a full 16-lane (x16) PCIe Gen4 link for graphics cards and high-end solid state drives will allow data transfer rates of up to 32 GBps.

4 Testing performed by AMD Engineering as of October 2018 using the AMD reference system with a pre-production Rome engineering sample. Rome scored approximately 2x higher compared to the Naples system.

5 Estimated generational increase based upon AMD internal design specifications for Zen 2 compared to Zen 1. Zen 2 has 2X the core density of Zen 1, and when multiplied by 2X peak FLOPs per core, at the same frequency, results in 4X the FLOPs in throughput.
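The 4X claim is the product of two independent doublings; in code:

```python
# Footnote 5's reasoning: Zen 2 doubles core density versus Zen 1, and each
# core doubles its peak FLOPs per clock at the same frequency.
core_density_factor = 2
flops_per_core_factor = 2

socket_flops_factor = core_density_factor * flops_per_core_factor
print(socket_flops_factor)  # 4x the per-socket floating point throughput
```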

6 Estimates based on AMD internal testing as of November 6, 2018 on an AMD EPYC-based reference system (AMD Rome Development Chassis with an EthanolX development board) featuring a single next-generation AMD EPYC (Rome) processor with a total of XXXXX DIMMs at XXXXX; versus an Intel-based system configured with Supermicro’s SYS-1029U-TRTP with 2x Intel Xeon Platinum 8180M CPUs and 24x32 GB memory at 2666 MHz; OS: Ubuntu 18.04; Linux: 4.15.0-36-generic; Compiler: GCC 7.3.0 (7.3.0-27ubuntu1~18.04). The AMD Rome 1P server completes the C-Ray demo in ~XXXXX sec and the Intel 8180M completes the benchmark in ~XXXXX sec. Benchmark testing data redacted by AMD for confidentiality purposes; full disclosure will be available after launch.

7 As of October 22, 2018. The results calculated for the Radeon Instinct MI60, designed with Vega 7 nm FinFET process technology, were 29.5 TFLOPS half precision (FP16), 14.8 TFLOPS single precision (FP32) and 7.4 TFLOPS double precision (FP64) peak theoretical floating-point performance. This performance increase is achieved with an improved transistor count of 13.2 billion on a smaller die size of 331.46 mm2 than the previous Gen MI25 GPU products, within the same 300W power envelope.

The results calculated for Radeon Instinct MI50 designed with Vega 7nm FinFET process technology resulted in 26.8 TFLOPS peak half precision (FP16), 13.4 TFLOPS peak single precision (FP32) and 6.7 TFLOPS peak double precision (FP64) floating-point performance. This performance increase is achieved with an improved transistor count of 13.2 billion on a smaller die size of 331.46 mm2 than previous Gen MI25 GPU products with the same 300W power envelope.

The results calculated for Radeon Instinct MI25 GPU based on the Vega10 architecture resulted in 24.6 TFLOPS peak half precision (FP16), 12.3 TFLOPS peak single precision (FP32) and 768 GFLOPS peak double precision (FP64) floating-point performance. This performance is achieved with a transistor count of 12.5 billion on a die size of 494.8 mm2 with a 300W power envelope.

AMD TFLOPS calculations were conducted with the following equation for Radeon Instinct MI25, MI50, and MI60 GPUs:

FLOPS calculations are performed by taking the engine clock from the highest dynamic power management (DPM) state and multiplying it by xx compute units (CUs) per GPU, then multiplying that number by the xx stream processors in each CU, then multiplying that number by 2 FLOPS per clock for FP32 or 4 FLOPS per clock for FP16. To calculate the FP64 TFLOPS rate, a 1/2 rate is used for the Vega 7 nm products MI50 and MI60, and a 1/16 rate for the Vega10 architecture-based MI25.
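The formula above, applied with the publicly listed MI60 configuration (64 CUs of 64 stream processors each and a roughly 1800 MHz peak engine clock; these figures are assumptions here, since the footnote redacts the exact counts), reproduces the quoted peak rates:

```python
# Assumed MI60 parameters (not stated in the footnote above).
peak_clock_hz = 1800e6          # highest-DPM-state engine clock, ~1800 MHz
compute_units = 64
stream_processors_per_cu = 64

stream_processors = compute_units * stream_processors_per_cu  # 4096

fp32 = peak_clock_hz * stream_processors * 2 / 1e12  # 2 FLOPS/clock -> ~14.7 TFLOPS
fp16 = peak_clock_hz * stream_processors * 4 / 1e12  # 4 FLOPS/clock -> ~29.5 TFLOPS
fp64 = fp32 / 2                                      # half rate     -> ~7.4 TFLOPS
print(round(fp16, 1), round(fp32, 1), round(fp64, 1))
```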

TFLOP calculations for MI50 and MI60 GPUs

TFLOPS per Watt
                MI25           MI50           MI60 
FP16        0.082          0.089           0.098
FP32        0.041          0.045           0.049
FP64        0.003          0.022           0.025
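The per-watt figures above follow from dividing each card's peak floating-point rate (in TFLOPS, from footnote 7) by the shared 300 W power envelope; a quick check:

```python
power_envelope_w = 300  # all three cards share a 300 W envelope

# Peak rates quoted in footnote 7 (TFLOPS).
peak_tflops = {
    "MI60 FP16": 29.5, "MI60 FP64": 7.4,
    "MI50 FP16": 26.8, "MI25 FP64": 0.768,
}

per_watt = {card: round(rate / power_envelope_w, 3)
            for card, rate in peak_tflops.items()}
print(per_watt)  # matches the table, e.g. MI60 FP16 -> 0.098, MI25 FP64 -> 0.003
```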

Industry supporting documents (PDF) and web page

AMD has not independently tested or verified external/third party results/data and bears no responsibility for any errors or omissions therein.

8 The Radeon Instinct MI60 contains 13.2 billion transistors on a die size of 331.46 mm2, while the previous generation Radeon Instinct MI25 had 12.5 billion transistors on a die size of 494.8 mm2: a 58% improvement in the number of transistors per mm2.
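The 58% figure follows directly from the transistor counts and die areas quoted; a quick check:

```python
# Figures from footnote 8.
mi60_transistors, mi60_area_mm2 = 13.2e9, 331.46
mi25_transistors, mi25_area_mm2 = 12.5e9, 494.8

# Ratio of transistor densities, minus one, gives the fractional gain.
density_gain = (mi60_transistors / mi60_area_mm2) / (mi25_transistors / mi25_area_mm2) - 1
print(f"{density_gain:.0%}")  # ~58% more transistors per mm^2
```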

9 Pending.

10 Calculated on October 22, 2018, the Radeon Instinct MI60 GPU resulted in 7.4 TFLOPS peak theoretical double precision floating-point (FP64) performance. AMD TFLOPS calculations conducted with the following equation: 

FLOPS calculations are performed by taking the engine clock from the highest DPM state and multiplying it by xx CUs per GPU. Then, multiplying that number by xx stream processors, which exist in each CU. Then, that number is multiplied by 1/2 FLOPS per clock for FP64. TFLOP calculations for MI60 can be found at https://www.amd.com/en/products/professional-graphics/instinct-mi60. External results on the NVIDIA Tesla V100 (16 GB card) GPU accelerator resulted in 7 TFLOPS peak double precision (FP64) floating-point performance. Results found at: https://images.nvidia.com/content/technologies/volta/pdf/437317-Volta-V100-DS-NV-US-WEB.pdf. 

AMD has not independently tested or verified external/third party results/data and bears no responsibility for any errors or omissions therein. 

11 As of October 22, 2018. Radeon Instinct MI50 and MI60 Vega 7 nm technology-based accelerators are PCIe Gen 4.0 capable providing up to 64 GBps peak bandwidth per GPU card with PCIe Gen 4.0 x16 certified servers. Peak theoretical transport rate performance guidelines are estimated only and may vary. Previous Gen Radeon Instinct compute GPU cards are based on PCIe Gen 3.0 providing up to 32 GBps peak theoretical transport rate bandwidth performance. 

Peak theoretical transport rate performance is calculated as: baud rate × width in bytes × number of directions = GBps

PCIe Gen 3: 8 * 2 * 2 = 32 GBps 

PCIe Gen 4: 16 * 2 * 2 = 64 GBps 
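The two worked examples above can be folded into one small helper (names are mine, not from the footnote):

```python
def pcie_peak_gbps(baud_rate_gtps, width_bytes=2, directions=2):
    """Peak theoretical transport rate: baud rate x width in bytes x directions.

    width_bytes=2 corresponds to an x16 link (16 lanes, 2 bytes per transfer).
    """
    return baud_rate_gtps * width_bytes * directions

print(pcie_peak_gbps(8))   # PCIe Gen 3: 32 GBps
print(pcie_peak_gbps(16))  # PCIe Gen 4: 64 GBps
```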

Refer to server manufacturers’ PCIe Gen 4.0 compatibility and performance guidelines for the potential peak performance of the specified server models. Server manufacturers may vary configuration offerings, yielding different results.

https://pcisig.com/
https://www.chipestimate.com/PCI-Express-Gen-4-a-Big-Pipe-for-Big-Data/Cadence/Technical-Article/2014/04/15
https://www.tomshardware.com/news/pcie-4.0-power-speed-express,32525.html

AMD has not independently tested or verified external/third party results/data and bears no responsibility for any errors or omissions therein. 

12 ECC support on 2nd Gen Radeon Instinct GPU cards, based on the Vega 7 nm technology has been extended to full-chip ECC including HBM2 memory and internal GPU structures. 
