Linley Newsletter: November 30, 2017


Linley Newsletter

(Formerly Processor Watch, Linley Wire, and Linley on Mobile)

Please feel free to forward this to your colleagues

Issue #575

November 30, 2017

Independent Analysis of Microprocessors and the Semiconductor Industry

Editor: Tom R. Halfhill

Contributors: Mike Demler, Bob Wheeler

In This Issue:

- Xeon Scalable Scales Networks

- Microsoft Brainwave Uses FPGAs

- Centriq Aces Scale-Out Performance

- Mellanox Brings More Smarts to NICs

- Arm Gets Assertive With 4K Displays

- FPGAs Accelerate Deep Learning

Subscribe to Microprocessor Report!

Microprocessor Report is the leading technical publication for the microprocessor industry, covering PC, server, mobile, embedded, and networking chips. We publish new articles every week, keeping subscribers up to date on the newest products and technology trends.

Since 1987, readers have counted on Microprocessor Report for credible, high-level, timely, and actionable insight. MPR is exclusively subscriber-supported, avoiding any ties to advertisers, and is dedicated to providing unbiased and critical analysis of the semiconductor industry.

Only The Linley Group, the leading provider of technology analysis on semiconductors, has the resources and deep domain expertise to deliver in-depth coverage across such a broad range of processors. With extensive industry experience, our analyst team delivers informed, real-world perspective on today's critical issues.

Don't take our word for it. Check out these articles for a free sample of MPR's coverage:

We offer individual subscriptions and corporate site licenses. You can subscribe to read articles online, download PDFs, or receive a monthly printed newsletter. All subscriptions include the same content and access to a rich archive with thousands of past articles. Don't miss out! Subscribe today:

Xeon Scalable Scales Networks

By Bob Wheeler

Because cloud and service-provider markets are colliding, Intel's latest Xeon family is as important to network processing as it is to servers. We've written about many of the improvements the Purley platform delivers, including the new Lewisburg south bridge. But Skylake-SP implements many microarchitecture and hardware upgrades that increase network-processing performance, too.

The obvious changes to Skylake-SP relative to Broadwell-EP include more cores/threads as well as greater memory and PCI Express bandwidth. In addition, Lewisburg boosts QuickAssist crypto acceleration compared with Coleto Creek. For some time, Intel has been using DPDK Layer 3-forwarding benchmarks to demonstrate Xeon's generational improvements. In this case, Skylake-SP delivers 281Gbps of throughput for 256-byte packets using 16 CPUs (32 threads), whereas Broadwell-EP delivers 159Gbps using 10 CPUs (20 threads). (The test employed only a single socket for both generations.) On a per-core basis, Skylake-SP increases throughput by 10.5%.
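The per-core figure follows directly from the throughput and core counts quoted above. As a quick check, here is a minimal Python sketch; the function name is ours for illustration, and it assumes throughput divides evenly across cores:

```python
def per_core_gbps(total_gbps: float, cores: int) -> float:
    """Throughput per core, assuming an even split across cores."""
    return total_gbps / cores

skylake = per_core_gbps(281, 16)    # Skylake-SP: 281Gbps on 16 cores
broadwell = per_core_gbps(159, 10)  # Broadwell-EP: 159Gbps on 10 cores

# Generational gain in per-core throughput, as a percentage
gain_pct = (skylake / broadwell - 1) * 100

print(f"Skylake-SP:   {skylake:.2f}Gbps per core")
print(f"Broadwell-EP: {broadwell:.2f}Gbps per core")
print(f"Per-core gain: {gain_pct:.1f}%")
```

The result rounds to the 10.5% cited above.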

Hardware acceleration yields straightforward performance gains, and Intel's DPDK IPSec benchmark shows Lewisburg delivering 110Gbps of large-packet throughput. But Skylake-SP also includes microarchitecture improvements that boost the performance of software-based crypto processing. They comprise AES-instruction enhancements, the use of AVX-512 for hashing (SHA), and availability of two AVX-512 units in some models. For an AES-256+SHA-1 example, Skylake-SP increases the bytes encrypted per clock cycle by 36% compared with Broadwell-EP.

Unlike most competitors' products, Intel Xeons aren't purpose-built for network processing. Their integration lags that of Cavium's Octeons and NXP's QorIQ SoCs, which provide Ethernet ports, crypto and compression accelerators, and at least Layer 2-processing hardware. But Intel's relentless architecture upgrades in each Xeon generation, combined with its ongoing investments in DPDK software, enable SDN and NFV approaches to flexibly deliver more throughput than SoC competitors.

Microprocessor Report subscribers can access the full article:

Microsoft Brainwave Uses FPGAs

By Linley Gwennap

Microsoft's Brainwave project demonstrates the advantages of the FPGA approach to deep learning. The company uses deep neural networks (DNNs) in many of its cloud services, including Bing web search, Cortana voice assistant, and Skype Translator. Microsoft wanted a flexible design that could easily adapt as its workloads evolve. Having direct access to production DNNs and large data sets, the company could experiment with alternative hardware designs and data formats to optimize performance. This experimentation led it to develop a new 8-bit floating-point format (FP8) to represent the neural-network weight and activation values.

To maximize flexibility, Brainwave is a "soft" vector processor that has a custom instruction-set architecture for DNN acceleration. Like Google's TPU, Brainwave doesn't run general-purpose code, so the ISA contains a few special-purpose instructions such as matrix multiply, vector operations, convolutions, and nonlinear activations. Each one can operate on hundreds or thousands of data units.

Because it's instantiated in FPGAs, the instruction set is easily changed. These changes are rolled into a compiler that the team created, maintaining a consistent interface to software. The compiler takes pretrained DNN models developed in multiple frameworks -- TensorFlow, Caffe, and Microsoft's Cognitive Toolkit -- and converts them to run on the Brainwave ISA.

The microarchitecture is based on the company's FP8 format, which halves the size of the compute units and register files compared with standard FP16. The design reads weights from DRAM in FP16 format and converts them to FP8 before storing them in the vector register file. The multiply units then apply the weights to incoming activation data; the results are accumulated and converted back to FP16. Two multifunction units can optionally perform additional multiply-add operations before looping the data back or storing the result in memory. These units can perform activation, normalization, and pooling.
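The data flow described above (narrow the weights on load, multiply against activations, accumulate at higher precision) can be sketched in a few lines of Python. This is only an illustration: the article doesn't give the FP8 bit layout, so the two-bit mantissa below is an assumption, exponent-range limits are ignored, and the function names are hypothetical rather than anything from Microsoft's design.

```python
import math

MANT_BITS = 2  # ASSUMPTION: the article does not specify the FP8 bit layout

def quantize(x: float, mant_bits: int = MANT_BITS) -> float:
    """Round x to a float with `mant_bits` explicit mantissa bits.

    Exponent-range limits and special values are ignored for simplicity.
    """
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)           # x = m * 2**e, with 0.5 <= |m| < 1
    scale = 1 << (mant_bits + 1)   # implicit leading bit plus mantissa bits
    return math.ldexp(round(m * scale) / scale, e)

def dot_low_precision(weights, activations):
    """Narrow the weights, then multiply and accumulate at full precision."""
    narrow = [quantize(w) for w in weights]   # models the FP16-to-FP8 step
    # A Python float stands in for the wider accumulator and FP16 result.
    return sum(w * a for w, a in zip(narrow, activations))

weights = [0.30, -1.70, 0.05]
activations = [1.0, 2.0, 4.0]

exact = sum(w * a for w, a in zip(weights, activations))
approx = dot_low_precision(weights, activations)
print(f"full precision: {exact:.4f}, narrowed weights: {approx:.4f}")
```

With these sample values, the narrowed weights become 0.3125, -1.75, and 0.046875, so the dot product shifts from about -2.9 to -3.0. The sketch shows only why a narrower format trades a small rounding error for smaller multipliers and registers; it says nothing about the accuracy Microsoft measured on production networks.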

Microprocessor Report subscribers can access the full article:

Centriq Aces Scale-Out Performance

By Tom R. Halfhill

Qualcomm has its head in the clouds, but in a good way. Early benchmarking indicates its new Centriq server processors deliver excellent scale-out performance for cloud applications and data centers. Although the company's ARMv8-compatible CPUs can't match the per-core throughput of the best x86 CPUs, they rank high in throughput per thread, per watt, per dollar, and per square millimeter of silicon. These metrics translate into bargain prices for competitive performance and power consumption -- and a strong debut for a newcomer to the nearly impregnable server-processor market.

The Centriq 2400 family (code-named Amberwing) initially comprises three models based on the same die: the 48-core 2460, the 46-core 2452, and the 40-core 2434. Clock frequencies hover within a narrow range (2.2-2.3GHz base, 2.5-2.6GHz turbo), as do the power ratings (110-120W TDP). List pricing, however, varies from $1,995 to a surprisingly low $888 -- posing a credible challenge to Intel and AMD in view of Centriq's competitive SPEC scores.

One observation is that Centriq offers relatively little differentiation within the family. Indeed, we suspect most customers will bypass the 48-core 2460 in favor of the 46-core 2452, which delivers 97% of the performance for only 69% of the price. And the 46- and 40-core chips differ in throughput by only 13%. The low-end 2434 offers a phenomenal deal in performance per dollar. Although Centriq falls short of the highest-end AMD and Intel products in peak performance, it matches well against mainstream models.
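To put the performance-per-dollar comparison in numbers, the following Python sketch normalizes everything to the 48-core 2460. It uses only figures quoted above, with one assumption flagged in the code: the text doesn't say which chip the 13% throughput gap is measured against, so the sketch takes it as a 13% drop from the 46-core part.

```python
LIST_PRICE = {"2460": 1995, "2434": 888}  # list prices quoted above

# 46-core 2452 relative to the 2460: 97% of the performance, 69% of the price
perf_2452, price_2452 = 0.97, 0.69
value_2452 = perf_2452 / price_2452

# ASSUMPTION: the 13% gap is a drop measured from the 46-core chip.
perf_2434 = perf_2452 * (1 - 0.13)
price_2434 = LIST_PRICE["2434"] / LIST_PRICE["2460"]
value_2434 = perf_2434 / price_2434

print(f"2452 performance per dollar vs. 2460: {value_2452:.2f}x")
print(f"2434 performance per dollar vs. 2460: {value_2434:.2f}x")
```

Under that assumption, the 2452 delivers about 1.4x the performance per dollar of the 2460, and the 2434 about 1.9x. Reading the 13% the other way (the 46-core chip being 13% faster than the 40-core one) changes the second figure only slightly.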

Centriq chips are shipping in limited volume this quarter and are ramping production. That Qualcomm is underpricing Xeon Scalable processors is no surprise, but Centriq also offers a good deal when compared with AMD's low-priced Epyc processors. With a smaller hardware and software ecosystem behind them, ARM server chips must offer potential customers a bargain.

Microprocessor Report subscribers can access the full article:

Mellanox Brings More Smarts to NICs

By Bob Wheeler

Is your cloud team having religious debates over the merits of FPGAs versus processors for workload acceleration? Mellanox doesn't care who wins, as its new adapters fit either approach. This quarter, it's sampling Innova-2 NICs for 25G and 100G Ethernet that integrate FPGAs. In January, the company will sample 25G Ethernet NICs based on its new BlueField ARM SoC.

Innova-2 follows last year's 10/40G Ethernet Innova NICs. Aside from handling new Ethernet rates, Innova-2 upgrades the on-board FPGA from the Xilinx UltraScale (20nm) to the UltraScale+ (16nm) generation. It also moves from Mellanox's ConnectX-4 Lx controller to ConnectX-5, which enables a new lookaside mode for FPGA access. As before, the company offers both plug-and-play security acceleration and a framework for customers' FPGA designs.

The new BlueField SmartNIC requires customer programming. It includes an eight-core version of the BlueField SoC, which itself integrates an Ethernet controller based on ConnectX-5. The Cortex-A72 CPUs can perform fully symmetric multiprocessing, enabling the NIC to run Linux. Customers can then program the SoC using a standard Linux environment and GNU tools.

The BlueField SmartNIC primarily serves networking applications, offloading workloads such as virtual switches, compression, packet capture, virtual network functions, and network security. The last of these overlaps with Innova-2, which also serves IPSec and SSL/TLS offload. By providing its new lookaside feature, however, Innova-2 can also handle compute acceleration for non-network tasks. That is, the FPGA and Ethernet controller share a PCI Express slot but can operate independently.

The smart-NIC market comprises a variety of merchant and customer-proprietary designs. For processor-based NICs, merchant vendors include Broadcom, Cavium, and Netronome, so the BlueField SmartNIC faces direct competition. To date, merchant FPGA-based NICs have served only vertical applications; Mellanox hopes to crack high-volume applications with Innova-2.

Microprocessor Report subscribers can access the full article:

Arm Gets Assertive With 4K Displays

By Mike Demler

To address growing demand for 4K content, Arm has developed a new display-processing subsystem comprising the Mali-D71 DPU, the fifth-generation Assertive Display (AD-5) coprocessor, and a specialized version of its CoreLink memory-management unit (MMU-600). The new DPU can output 120fps to a single 4K display. It also supports 60fps to dual displays, along with side-by-side modes for virtual-reality headsets.
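The two output modes work out to the same aggregate pixel rate, as this short Python check shows. It assumes "4K" means 3,840 x 2,160 (UHD), which the text doesn't state, and it counts only active pixels, ignoring blanking intervals.

```python
UHD_WIDTH, UHD_HEIGHT = 3840, 2160  # ASSUMPTION: "4K" means UHD resolution

def pixel_rate(displays: int, fps: int) -> int:
    """Active pixels per second across all displays (blanking ignored)."""
    return displays * UHD_WIDTH * UHD_HEIGHT * fps

single = pixel_rate(displays=1, fps=120)  # one 4K display at 120fps
dual = pixel_rate(displays=2, fps=60)     # two 4K displays at 60fps

print(f"single 4K at 120fps: {single:,} pixels/s")
print(f"dual 4K at 60fps:    {dual:,} pixels/s")
```

Both cases come to 995,328,000 pixels per second, which is consistent with the two modes sharing one output-bandwidth budget, although the article doesn't say that is how Arm set the limits.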

The AD-5 enables picture in picture as well as mixing of high- and standard-dynamic-range (HDR and SDR) regions of interest. It uses a proprietary tone-mapping engine, which algorithmically increases contrast and detail by analyzing image features and making adjustments pixel by pixel. The AD-5 further improves picture quality by dynamically adjusting display brightness on the basis of input from a device's ambient-light sensor. The backlight control also saves power by enabling content-dependent adjustments.

All Mali-D71 customers receive a script that automatically generates a display-optimized version of the CoreLink MMU-600. Whereas designers can use the full SoC-level model to manage the virtual-memory interface to CPUs, GPUs, and other processing cores, the company developed the reduced version for direct integration with the D71's memory subsystem.

According to Arm's measurements, the tight integration decreases memory latency by 50% and takes 55% less die area than its previous display solution, which employs separate CoreLink MMU-500 and Mali-DP650 DPU components. The new MMU also enables the digital-rights-management (DRM) features in the latest version of TrustZone Media Protection (TZMP). All of the components in the Komeda display-processor platform are available for licensing now.

Microprocessor Report subscribers can access the full article:

FPGAs Accelerate Deep Learning

By Linley Gwennap

Nvidia GPUs are the most popular chips for accelerating deep learning, but some large cloud-service providers (CSPs) are getting better results using FPGAs. For example, Microsoft has deployed FPGAs across its data centers, where they accelerate its deep-learning algorithms as part of its Brainwave project. Other CSPs such as Amazon and Baidu also employ this approach for some of their deep neural networks (DNNs). DeePhi, TeraDeep, and other startups offer preprogrammed FPGA accelerators for DNNs. These accelerators mainly target neural-network inferencing, as GPUs dominate the training function.

The FPGA's flexibility is well suited to deep learning, a relatively new field that's developing quickly. Researchers continue to revise DNN algorithms, convolution functions, and data formats. As CSPs gain more experience with their DNN workloads, they discover new optimizations. FPGAs can be reprogrammed in minutes, whereas conventional hardware designs take months or even years to update.

Although some people think of FPGAs as a pile of unstructured gates, these devices also include a set of "hard" features designed directly into the silicon. In addition to basic memory and register blocks, most FPGAs feature hard DSP blocks, because their customers frequently perform signal processing. These DSP blocks are much denser and operate at a higher speed than similar compute blocks constructed from programmable gates. They're usually configurable for either integer or floating-point calculations.

These DSP blocks offload the multiply-accumulate (MAC) function critical to many DNN algorithms. Some FPGAs have thousands of DSP blocks, generating a huge amount of compute power. Unleashing this performance, however, requires custom gate-level design, a specialized and time-consuming task. Both Intel (formerly Altera) and Xilinx offer tools to ease this burden.
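For readers unfamiliar with the MAC function, the Python sketch below shows the operation a DSP block performs and how quickly the count grows in a fully connected DNN layer. The function names and the layer size are illustrative assumptions, not figures from any vendor.

```python
def mac(acc: float, weight: float, activation: float) -> float:
    """One multiply-accumulate: the step a hard DSP block performs."""
    return acc + weight * activation

def neuron_output(weights, activations, bias=0.0):
    """Weighted sum for one neuron, built from repeated MACs."""
    acc = bias
    for w, a in zip(weights, activations):
        acc = mac(acc, w, a)
    return acc

def layer_macs(n_inputs: int, n_outputs: int) -> int:
    """MACs needed for one pass through a fully connected layer."""
    return n_inputs * n_outputs

print(neuron_output([1, 2, 3], [4, 5, 6]))  # 1*4 + 2*5 + 3*6
print(f"{layer_macs(1024, 1024):,} MACs")   # hypothetical 1,024 x 1,024 layer
```

A single hypothetical 1,024 x 1,024 layer needs more than a million MACs per input vector, which is why spreading the work across thousands of DSP blocks matters.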

Microprocessor Report subscribers can access the full article:

About Linley Newsletter

Linley Newsletter is a free electronic newsletter that reports and analyzes advances in microprocessors, networking chips, and mobile-communications chips. It is published by The Linley Group and consolidates our previous electronic newsletters: Processor Watch, Linley Wire, and Linley on Mobile. To subscribe, please visit:
