Companies Manufacturing AI Chips for Data Centers
New computing models such as machine learning and quantum computing are becoming more important for delivering cloud services. The most immediate computing change has been the rapid adoption of ML/AI for consumer and business applications. This new model requires processing vast amounts of data to develop usable information and eventually build knowledge models. These models are rapidly growing in complexity, doubling every 3.5 months. At the same time, performance requirements for quick response are increasing.
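To put the 3.5-month doubling figure in perspective, here is a minimal sketch that compounds it over longer periods. The function name `growth_factor` is illustrative, and the calculation assumes the doubling rate stays constant, which the article does not claim.

```python
def growth_factor(months: float, doubling_period_months: float = 3.5) -> float:
    """Return how many times complexity multiplies after `months`,
    assuming it doubles every `doubling_period_months` (3.5 per the article)."""
    return 2.0 ** (months / doubling_period_months)


if __name__ == "__main__":
    # Print the compounded growth for a few horizons.
    for months in (3.5, 6, 12, 24):
        print(f"{months:>4} months: ~{growth_factor(months):,.1f}x")
```

Under that assumption, a 3.5-month doubling period works out to roughly a 10.8x increase per year.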
AI hardware in the data center can be broken into two distinct processing phases: training and inference. Each has distinctly different characteristics in terms of computing and responsiveness. Training requires handling extraordinarily large data sets to build processing models. This can take hours, days, or even weeks. Inference, on the other hand, is the use of those trained models to process individual inputs that are very time-sensitive; the result may be required within a few milliseconds.
Because it requires large computational modeling, training is usually performed in 32-bit floating-point precision, but it is sometimes handled in lower precision such as 16-bit floating-point or the alternative Bfloat16, provided the same level of accuracy can be maintained. Inference, on the other hand, is usually performed using integer math for speed and lower power and doesn’t require the extended dynamic range of floating-point. Therefore, accelerators are often specialized for one task or the other. Even where a chip can perform both training and inference, it’s typically optimized for one or the other.
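As a rough illustration of the trade-off between 16-bit floating-point and Bfloat16, the sketch below uses only Python's standard `struct` module. The helper names `to_bfloat16` and `to_float16` are illustrative, not from any library, and the Bfloat16 conversion uses simple truncation of a 32-bit float rather than the rounding that real hardware typically applies.

```python
import struct


def to_bfloat16(x: float) -> float:
    """Round-trip x through bfloat16 by keeping only the top 16 bits
    of its float32 encoding (truncation; an assumption for simplicity)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits &= 0xFFFF0000  # keep sign, 8 exponent bits, 7 mantissa bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def to_float16(x: float) -> float:
    """Round-trip x through IEEE half precision; report overflow as infinity."""
    try:
        return struct.unpack(">e", struct.pack(">e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")


if __name__ == "__main__":
    for value in (1.001, 70000.0):
        print(value, "-> fp16:", to_float16(value), " bf16:", to_bfloat16(value))
```

Bfloat16 keeps the 8-bit exponent of a 32-bit float, so a value like 70000 stays finite where 16-bit floating-point overflows, but it has fewer mantissa bits, so a value like 1.001 loses its fractional detail.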
The more computationally intensive training side of AI is presently dominated by Nvidia GPUs. Until mid-May, when Nvidia released its Ampere A100 powerhouse, the flagship was its Tesla V100 GPU. Companies positioning themselves to challenge Nvidia include Cerebras, Graphcore, and Intel/Habana Labs.
The more time-sensitive and less computationally intensive inference side is more contested. In fact, some hyperscale cloud providers are building their own inference solutions. The classic example is the Google Tensor Processing Unit, initially designed for low-power, responsive inference. Once again, Nvidia has a significant offering here in the form of the Tesla T4, but there’s also significant competition in data center inference. Currently, the vast majority of inference processing is still performed by CPUs (mostly Intel Xeons).
To help solve the benchmarking issues, a group called MLPerf was formed to develop up-to-date benchmarks. The group includes a mix of academic and industry players from startups and established companies. The industry is aligning behind this benchmark, but the training and inference benchmarks are still a work in progress (presently at revision 0.6). MLPerf will require more industry input and will evolve over time, but some early results are available.
The training results are dominated by Nvidia and Google, but more results are expected later this year. One limitation is that these results don’t take power or cost into account, which makes the performance calculations more complex.
Data Center Training
Nvidia’s Tesla V100 is a massive chip that represents the peak of GPU acceleration for AI training. In addition to the individual chips, Nvidia has the NVLink interface, which allows multiple GPUs to be networked together to form a larger virtual GPU.
Given Nvidia’s early lead in training as well as its extremely mature software stack, some startups are reluctant to take on the company directly. But others are far more willing to go head-to-head with the GPU leader.
Among them is Graphcore. Its Intelligence Processing Unit (IPU) has shown significant power and performance benchmark numbers. The IPU has a massive on-chip memory to avoid copying data on and off the chip to local DRAM. But the IPU is heavily reliant on the Poplar compiler to schedule resources for maximum performance. This is an ongoing challenge for many AI chip vendors that have offloaded control complexity to software to make the hardware simpler and more power-efficient. While it’s relatively straightforward to tune the compilers for well-known benchmarks, the real challenge is to work with each customer to optimize the solution for their specific data set and workload.
Elsewhere, Intel has been acquiring AI chip startups over the past few years, including the recently completed purchase of another data center training and inference vendor. Intel’s first deal, for Nervana, was slow to get off the ground with a new product. Meanwhile, Israeli startup Habana Labs made significant inroads with hyperscale customers like Facebook. Intel eventually acquired Habana Labs to help it get to market quicker. After the Habana acquisition, Intel discontinued development of the Nervana chips. While the initial Habana Goya chip, introduced in 2018, focused on inference, the company has recently started shipping the Gaudi training chip to customers.
Like Nervana, the Gaudi training chip supports the BFLOAT16 format. It’s also available in the Open Compute Project (OCP) accelerator module physical format. Gaudi’s chip-to-chip interconnect uses standard RDMA RoCE over PCIe 4.0, while the Nervana chips use a proprietary interconnect.
The Habana Gaudi processor and HLS-1 system go head-to-head against Nvidia’s V100-based cards and Nvidia’s DGX rack systems. The Habana HLS-1 uses a PCIe switch to connect multiple Gaudi processors within the HLS-1 rack, versus the proprietary NVLink bus used by Nvidia. The key to Habana’s performance is the RoCE v2 interface, which uses ten 100G Ethernet links through a non-blocking Ethernet switch.
The Gaudi cards are rated at 300W maximum power for a mezzanine card and 200W for a PCIe card. Habana has yet to release MLPerf training benchmarks, but the ResNet-50 comparisons look very competitive with Nvidia.
Cerebras, another challenger, has built the ultimate training and inference AI chip: a single wafer-scale processor. The solution is an entire wafer (46,225 mm²) with 1.2 trillion transistors and 18 GB of on-chip memory. The Cerebras wafer-chip system needs 20,000 watts of power, putting it in a category all its own. Performance numbers aren’t yet available.
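For a sense of scale, the figures quoted above can be turned into a few derived quantities. This is only back-of-the-envelope arithmetic: the variable names are illustrative, and dividing the 20,000-watt system figure by the wafer area assumes all of that power reaches the silicon, which overstates the true on-wafer density.

```python
import math

# Figures as quoted in the article.
WAFER_AREA_MM2 = 46_225
TRANSISTORS = 1.2e12
SYSTEM_POWER_W = 20_000

# Side length if the processor is a square die (an assumption).
side_mm = math.sqrt(WAFER_AREA_MM2)

# Average transistor density across the wafer.
transistors_per_mm2 = TRANSISTORS / WAFER_AREA_MM2

# Upper bound on power density: system power spread over the wafer area.
power_w_per_mm2 = SYSTEM_POWER_W / WAFER_AREA_MM2

if __name__ == "__main__":
    print(f"Side length:        {side_mm:.0f} mm")
    print(f"Transistor density: {transistors_per_mm2 / 1e6:.1f} million per mm^2")
    print(f"Power density:      {power_w_per_mm2:.2f} W per mm^2 (upper bound)")
```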
AMD remains a dark horse. While its GPUs can be used for training, the company has not optimized the architecture for tensor processing, and its software stack is behind Nvidia’s CUDA. The company’s two recent design wins with the Department of Energy’s exascale computers should help it build out a more robust software stack, but machine learning seems to be a lower priority for AMD.
While each chip’s performance is important for training, the ability to scale to larger models is also a critical feature. Scaling often requires interconnecting multiple chips through a high-speed link (unless you’re Cerebras, that is).
The newest players are Groq and Tenstorrent, both shipping early samples. Their chips are highly configurable and, like Graphcore’s, highly dependent on the compiler software to deliver performance.
Data Center Inference
The needs of machine learning inference are quite different from those of training. While batching many jobs is common for training, inference performance is often judged by latency at low batch sizes. Every new query must be addressed quickly and accurately. Also, inference data is lower resolution and neural nets are shorter. The weights for inference are provided by the model developed in the training phase and optimized for the inference workload. Here, packed 8-bit integer performance is often key for typical workloads. Optimized compute values can be reduced even further, to 6-, 4-, 2- or even 1-bit precision, as long as accuracy isn’t reduced.
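The step from trained floating-point weights to 8-bit integers can be sketched as follows. This is a minimal example of symmetric per-tensor quantization, one common scheme among several; the article does not say which scheme any particular chip or toolchain uses, and the function names are illustrative.

```python
def quantize_int8(weights):
    """Map float weights to integers in [-127, 127] using a single scale
    chosen so the largest-magnitude weight lands on +/-127."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [0.5, -1.27, 0.0, 1.0]
    q, scale = quantize_int8(weights)
    print("int8 values:", q)
    print("recovered:  ", dequantize(q, scale))
```

The rounding error per weight is at most half the scale, which is why accuracy has to be rechecked as the bit width shrinks and the scale grows.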
The leading Nvidia solution is the Tesla T4, available in half-height, 70W PCIe cards that fit into a 2U rack chassis. The T4 is based on Nvidia’s Turing architecture and is scaled down relative to the V100. The T4 supports a range of precision levels with different performance levels, including 8.1 TFLOPS using single-precision floating-point, 65 TFLOPS using mixed precision (FP16/FP32), 130 TOPS using 8-bit integer, and 260 TOPS using 4-bit integer. The T4 offers the flexibility to handle a variety of workloads.
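Dividing the quoted peak figures by the 70W card rating gives a rough sense of how much efficiency is gained at lower precision. This is a sketch using only the numbers above; the names are illustrative, and peak throughput over rated power is not a measured efficiency.

```python
T4_BOARD_POWER_W = 70

# Peak throughput figures as quoted in the article.
T4_PEAK = {
    "FP32 (TFLOPS)": 8.1,
    "FP16/FP32 mixed (TFLOPS)": 65,
    "INT8 (TOPS)": 130,
    "INT4 (TOPS)": 260,
}


def per_watt(peak, power_w):
    """Return peak throughput divided by rated power for each precision."""
    return {name: value / power_w for name, value in peak.items()}


if __name__ == "__main__":
    for name, value in per_watt(T4_PEAK, T4_BOARD_POWER_W).items():
        print(f"{name:<26} {value:.2f} per watt")
```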
The Intel/Habana Goya chip has proven very competitive on performance and power. Goya’s benchmark results were strong in version 0.5, and the chip has higher ResNet-50 results, but it also draws over 100W. Goya is also flexible on data formats, and with Intel behind the chip, customers no longer depend on a small startup as a supplier.
The most recent new entrant in the inference race was the surprise resurgence of Via/Centaur and its x86 server chip with an ML co-processor. The CHA chip showed significant potential in last year’s MLPerf open category, but the company is still developing its software stack.
Qualcomm, meanwhile, announced the Cloud AI chip last year, with production expected this year. Few details are available, but what’s known is that the design uses an array of AI processor and memory tiles that Qualcomm says are scalable from mobile to data center. The chip will support the latest LPDDR DRAM for low power and is rated for 350 TOPS (8-bit integer values). Qualcomm is targeting automotive, 5G infrastructure, 5G edge, and data center inference for the chip. The company is leveraging its extensive experience in low-power inference from its Snapdragon smartphone processors.
FPGAs can also serve as inference engines as well as run high-performance computing tasks, offering low latency and flexible ML model support. Microsoft has for several years been using Intel’s Altera FPGAs for accelerating text string searches. Intel’s Vision Accelerator Design includes its Arria 10 FPGA PCIe card and software support in the OpenVINO toolkit. Xilinx has an Alveo line of PCIe cards for data centers, supported by its Vitis software platform. The Alveo cards are offered with power ratings from 100 to 225W. Both companies are working to make their software and programming tools more approachable.
Following Altera and Xilinx, Achronix has also brought its FPGA technology to accelerated data center computing. Its new 7-nm Speedster7t product ships in the VectorPath PCIe accelerator card. It’s also available for IP licensing, setting the company apart from its rivals. The company has focused on fast I/O, with PCIe Gen 5 support and SerDes rates up to 112 Gbps. Machine learning inference performance will exceed 80 TOPS using INT8 processing. The chip will also support INT16, INT4, FP24, FP16, and BFloat16. Achronix says the Speedster7t delivers up to 86 TOPS INT8 performance and ResNet-50 performance of 8,600 images per second.
Also entering the very-low-power inference market is Flex Logix, with its InferX X1 edge inference coprocessor optimized for INT8 operation. Fast (batch-1) ResNet-50 inference at only 13.5W makes the InferX the likely low-power leader.
Proceed With Caution
Nvidia remains the leading training accelerator vendor but faces increasing competition. The data center inference market is becoming more fragmented as more competitors enter the fray. Later this summer, MLPerf should release the next iteration of its benchmarks, and more results from the newer chips will become available.
Proceed with caution, though: chips can be tuned to perform well on specific benchmarks but deliver disappointing results on a wider variety of workloads.