This year has been really fruitful for AMD. Thanks to the Ryzen micro-architecture and its record-breaking price-performance ratio, AMD managed to thoroughly shake up both the CPU market and Intel to its very core. We’ve been hearing reports of 4-core/8-thread i3 processors in the consumer market, which would be a first since Intel adopted the i3/i5/i7 naming scheme, all thanks to Ryzen 3. On the enthusiast side, Intel has been in complete chaos: it announced the Skylake-X family of CPUs, half of which are pretty much identical to their predecessors. Enough about Intel. Coming back to AMD, they let loose the first two CPUs of the ThreadRipper family yesterday, namely the 1950X and 1920X, with 16 and 12 cores respectively, to be followed by the 8-core 1900X on August 31st and the 1920 sometime later.
Intel has been rather stingy when it comes to core count, but for AMD it seems like it’s all about the cores. Ryzen’s IPC may not be quite as high as Intel’s, but the competitive pricing and all those cores more than make up for it. Earlier this year, AMD launched their Ryzen CPUs. The three members of the Ryzen 7 family all had eight cores plus simultaneous multi-threading, achieving performance near comparable Intel processors at half the price (or better). Next came four Ryzen 5 CPUs, competing in price against the quad-core i5 parts, and at that price Ryzen 5 offered up to twelve threads, triple that of the Core i5. Finally, Ryzen 3 hit the ~$120 market against the Core i3s, with double the cores of Intel’s parts. AMD’s EPYC family also made its debut in the enterprise space, offering up to 32 cores.
AMD is aiming TR at enthusiasts: gamers, streamers, YouTubers, basically anyone who might need a super-charged CPU, or just wants one. The core counts that AMD is releasing with ThreadRipper were until now seen only on Intel’s server products, which feature up to 28 cores for a hefty $10,000. With TR, AMD is blurring the line between consumer, prosumer, and enterprise.
RYZEN: SPECS AND FEATURES
AMD’s Ryzen CPUs are built differently from Intel’s. Intel CPUs are centered around a monolithic piece of silicon that holds all of the cores. AMD, however, has designed Ryzen to be modular at the chip level. The basic building block of every Ryzen CPU is a die containing two 4-core complexes, or CCXes, joined by AMD’s high-speed Infinity Fabric interconnect. ThreadRipper is based on the same dual-CCX die, but instead of one die, you get two.
To get up to 16 cores in ThreadRipper, AMD uses the same high-speed Infinity Fabric to join two 8-core dies. The 12-core version also joins two 8-core dies, but each of the four 4-core CCXes has one core disabled. ThreadRipper reuses hardware from AMD’s 32-core, server-focused Epyc CPU, so the package carries four “chips”. Two of those are actually dummy pieces that add structural integrity and support for the cooler that will be clamped on top of the CPU.
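The core counts above follow from simple multiplication, which a short Python sketch can make explicit. The function and constant names here are made up for illustration; only the numbers (two CCXes per die, four cores per CCX, one core disabled per CCX on the 12-core part) come from the description above.

```python
# Sketch of the core-count arithmetic; names are illustrative only.
CCX_PER_DIE = 2      # each Ryzen die holds two core complexes (CCXes)
CORES_PER_CCX = 4    # each CCX is built with four cores


def core_count(active_dies, disabled_per_ccx=0):
    """Active cores for a package with `active_dies` working dies."""
    return active_dies * CCX_PER_DIE * (CORES_PER_CCX - disabled_per_ccx)


print(core_count(2))      # two full dies: the 16-core part
print(core_count(2, 1))   # one core off per CCX: the 12-core part
print(core_count(1))      # a single die, as in an 8-core Ryzen 7
```

Note that the 12-core part still has all four CCXes active, which matters for the cache totals discussed next.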
Both AMD and Intel use private L2 caches for each core, followed by a victim L3 cache that leads to the main memory. A victim cache is a cache that receives data when it is evicted from the cache level below it, and it cannot pre-fetch data. But the size of those caches, and how AMD and Intel have the cores interact with them, is different.
AMD uses 512 KB of L2 cache per core and 8 MB of L3 victim cache per core complex of four cores. In a 16-core ThreadRipper there are four core complexes, for a total of 32 MB of L3 cache; however, each core can directly access only the data in its local L3. Accessing the L3 of a different complex requires additional time and snooping. As a result, latency varies depending on whether the data sits in the local L3 or in another complex’s L3.
Intel’s Skylake-X uses 1 MB of L2 cache per core, leading to a higher hit rate in the L2, and 1.375 MB of L3 victim cache per core. This L3 cache has associated tags, and the mesh topology used to communicate between the cores means that, as with AMD, there is still latency associated with snooping other caches, although the design keeps that latency fairly uniform. This differs from the Broadwell-E cache structure, which had 256 KB of L2 and 2.5 MB of L3 per core, both inclusive caches.
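To put the two cache layouts side by side, here is a small Python sketch that totals L2 and L3 from the per-core and per-CCX figures quoted above. The function names are hypothetical, and the 10-core Skylake-X example is only there to show the arithmetic.

```python
# Cache totals in MB, using only the per-core / per-CCX sizes quoted above.
def zen_cache_mb(cores, ccx_count):
    """AMD: 512 KB L2 per core, 8 MB L3 per CCX (only the local 8 MB is 'near')."""
    return {"l2": 0.5 * cores, "l3": 8 * ccx_count, "l3_local": 8}


def skylake_x_cache_mb(cores):
    """Intel Skylake-X: 1 MB L2 and 1.375 MB L3 per core."""
    return {"l2": 1.0 * cores, "l3": 1.375 * cores}


print(zen_cache_mb(16, 4))        # 16-core ThreadRipper, four CCXes
print(skylake_x_cache_mb(10))     # a 10-core Skylake-X part
```

The `l3_local` entry is the point of the comparison: the AMD total is large, but any one core sees only its own complex’s 8 MB at the lowest latency.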
On paper, the Ryzen CPUs come out far ahead of the Intel competition. Forget the cores; just look at the PCIe lanes and the memory channels. The regular Ryzen lineup supports dual-channel DDR4 memory, while ThreadRipper supports quad-channel DDR4 and technically supports up to 2 TB of RAM. As for the PCIe lanes, while the mainstream Ryzen chips sport 20 lanes, ThreadRipper increases the number to an almost unbelievable 64. In layman’s terms, that means four GPUs along with three NVMe PCIe drives. Then have a look at Intel’s spec sheet. The 10-core Core i9-7900X tops out at 44 lanes, and the 8-core Core i7-7820X has just 28. AMD’s cheapest ThreadRipper, the 8-core 1900X, on the other hand, rocks 64 lanes. Sure, probably no one needs that many PCIe lanes, but just look at the gulf between the two competitors.
AMD is launching the X399 platform alongside ThreadRipper, much like Intel does with its X/E processors. This new platform will run the hulking TR CPU, with its 16C/32T count, along with quad-channel memory and 64 PCIe lanes. The socket is vastly different from previous AMD sockets. Rather than a PGA socket with a simple latch system to provide enough force between the pads and pins, the LGA TR4 socket has three Torx screws that have to be removed in order. On doing so, the socket bracket immediately flips open, revealing a small tray that takes the CPU. All of the ThreadRipper CPUs will come in this little tray, and there’s no need to take the CPU out of it.
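As a rough illustration of how “four GPUs along with three NVMe drives” could fit into 64 lanes, here is a small lane-budget sketch in Python. The layout is an assumption, not something stated above: two GPUs at x16, two at x8, each NVMe drive at x4, and 4 lanes reserved for the chipset link. The helper name is made up as well.

```python
def remaining_lanes(total_lanes, device_widths, reserved=0):
    """Lanes left after reserving some and attaching the given devices."""
    left = total_lanes - reserved - sum(device_widths)
    if left < 0:
        raise ValueError(f"over budget by {-left} lanes")
    return left


# Assumed layout: GPUs at x16/x16/x8/x8, three x4 NVMe drives,
# plus 4 lanes assumed to be reserved for the chipset.
layout = [16, 16, 8, 8, 4, 4, 4]

print(remaining_lanes(64, layout, reserved=4))   # ThreadRipper: exactly used up

try:
    remaining_lanes(44, layout)                  # the same layout on 44 lanes
except ValueError as err:
    print("44-lane CPU:", err)
```

Under these assumptions the same set of devices simply does not fit on a 44-lane or 28-lane part without dropping link widths further.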
Because of the design of the socket and the size of the CPUs, the screw holes for CPU coolers are different as well. As each CPU is currently geared for 180W, AMD recommends liquid cooling at a bare minimum, and will bundle an Asetek CPU bracket with every CPU sold (a Torx screwdriver is also supplied).
GAME MODE AND NUMA
As discussed above, AMD TR CPUs have four silicon dies, similar to EPYC processors, making ThreadRipper a Multi-Chip Module (MCM) design. Two of these are reinforcing spacers, empty silicon with no use other than to help distribute the weight of the cooler and assist in cooling. The other two dies are identical to the ones in other Ryzen CPUs, each containing eight cores and each having access to two memory channels. They communicate through Infinity Fabric, which according to AMD has a 102 GB/s die-to-die bandwidth.
This configuration is referred to as a NUMA configuration: non-uniform memory access. Unless it is optimized for this, code cannot rely on a consistent latency between requesting something from DRAM and receiving it. This may cause problems, which is why some programs are made NUMA-aware, so that they can place the memory they need behind the closest DRAM controller, giving up some bandwidth in exchange for lower latency.
NUMA isn’t anything new, especially for AMD, which helped popularize it on x86 servers. But it has been limited to server, workstation, and professional platforms. NUMA has never surfaced in the consumer market, which is something of an issue since consumer programs are not NUMA-aware, at least not yet. However, that doesn’t leave the CPU paralyzed. In the absence of such programs, the extra work is done by the OS, which helps unaware programs by keeping threads and memory accesses together on the same NUMA node to ensure optimal performance. The downside is that this also discourages applications from using the other die and its 8 cores. Also, software is slow to change, and it’s unlikely that NUMA awareness will become common in consumer software in the near future. On top of that, NUMA can be tricky to program for, especially for workloads and algorithms that inherently struggle with this kind of core/memory configuration. So the hiccups of NUMA are probably not going away anytime soon.
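For a feel of what “keeping threads together on one NUMA node” means in practice, here is a minimal Python sketch for Linux. It assumes the standard sysfs layout (/sys/devices/system/node/nodeN/cpulist) and uses os.sched_setaffinity, which exists only on Linux; the helper names are made up. It pins CPUs only, and relies on the kernel’s usual preference for allocating memory on the node a thread runs on, which is an assumption about the default policy rather than something this code enforces.

```python
import glob
import os


def parse_cpulist(text):
    """Turn a Linux cpulist string such as '0-7,16-23' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


def numa_nodes():
    """Map NUMA node id -> CPU set, read from Linux sysfs (empty elsewhere)."""
    nodes = {}
    for path in glob.glob("/sys/devices/system/node/node[0-9]*/cpulist"):
        node_id = int(path.split("/")[-2][len("node"):])
        with open(path) as f:
            nodes[node_id] = parse_cpulist(f.read())
    return nodes


def pin_to_node(node_id):
    """Restrict the current process to one node's CPUs (Linux only)."""
    cpus = numa_nodes().get(node_id)
    if cpus and hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, cpus)
        return True
    return False


if __name__ == "__main__":
    for node, cpus in sorted(numa_nodes().items()):
        print(f"node {node}: CPUs {sorted(cpus)}")
```

On a system that exposes only one node (or on a non-Linux OS) the listing shows a single node or nothing at all, and pin_to_node simply returns False for any node it cannot find.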
To deal with NUMA’s hiccups, AMD provides two switches:
Simultaneous Multi-Threading, on or off (on by default).
Memory Mode: UMA vs NUMA (UMA by default).
The first switch is simple enough, as it turns simultaneous multi-threading (SMT) on or off. When the SMT switch is on, each core runs two threads, bringing the CPU thread count up to 32. When it is off, each core handles one thread. This alleviates issues in older programs that don’t support such a high thread count. The confusing thing about this switch is that AMD is calling it ‘Legacy Compatibility Mode’ in their documents. So when this mode is on, legacy games will work. That means that when legacy mode is on, SMT is off, and when legacy mode is off, SMT is on.
The second switch, Memory Mode, allows the user to switch between uniform memory access (UMA) and non-uniform memory access (NUMA) modes. Under the default setting, UMA, the system sees the memory and CPU cores as one massive block, with maximum bandwidth and an average latency. This takes care of compatibility issues, but latency may suffer depending on which memory bank holds the data.
NUMA slices the memory and cores into two nodes, depending on which memory channels are nearest to the core(s) that need them. The OS will keep the data for a core as near to it as possible, resulting in the lowest possible latency. For a particular core, that means it will fill up the memory nearest to it first, at half the total bandwidth but a low latency, and then the other half of the memory at the same half bandwidth but a higher latency. This mode is designed for latency-sensitive workloads where performance is hindered by high latencies. In some games, lower latency can shift averages or 99th percentiles in benchmarks by a considerable degree.
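The “nearest memory first, then the other half” behaviour can be sketched as a tiny placement function in Python. This is a toy model of the idea, not how any OS allocator is actually implemented, and the function name and the 16 GB per-node capacities are made up for illustration.

```python
def place_allocation(request_gb, local_free_gb, remote_free_gb):
    """Split a request between the near and far node, near node first."""
    local = min(request_gb, local_free_gb)
    remote = min(request_gb - local, remote_free_gb)
    if local + remote < request_gb:
        raise MemoryError("request exceeds free memory on both nodes")
    return local, remote


# Illustrative capacities: 16 GB free on each of the two nodes.
print(place_allocation(12, 16, 16))   # fits entirely in near memory
print(place_allocation(24, 16, 16))   # fills near memory, spills to the far node
```

Only the spilled portion pays the higher cross-die latency, which is why a workload that fits in one node’s memory benefits most from NUMA mode.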
Now, for the fun part, the benchmarks!
After going through the benchmarks, certain conclusions can be drawn:
Firstly, these benchmarks are somewhat similar to the older Ryzen vs. Kaby Lake i7 comparisons.
Secondly, in video games and single-threaded applications, Intel still leads thanks to its higher IPC, although these apps aren’t really relevant here.
Thirdly, there are some apps, like HandBrake and WinRAR, that don’t play nice with the Zen architecture, but that is expected to change in the not-so-distant future.
Lastly, AMD leads in the majority of benchmarks, but the gap is mostly not that significant, around 20% on average. However, if you compare the prices, it’s more or less justified.
The power draw isn’t exactly one of TR’s strong suits. But again, as I said earlier, given the price it’s acceptable, at least for now, especially if you consider the impact Ryzen had on the CPU market and how Intel has been scrambling ever since.
Even when idle, TR consumes 50 watts, which is almost equal to the load consumption of some other consumer CPUs. This is most probably because of its NUMA layout and multi-module design.
CONCLUSION AND THOUGHTS
I can’t shake off a feeling of disappointment after going over ThreadRipper. Ryzen CPUs, with their efficiency and competitive prices, straight up made the Intel line-up sort of obsolete, before Intel stepped in and reduced its prices of course. Maybe it’s because I’m a consumer, or maybe because the gaming aspect of TR is less than impressive (I’m a hardcore PC gamer), or perhaps it’s because TR didn’t wipe the floor with the highest-end i7s. Sure, it did beat them, but not by big margins. The real competition, the i9s, is yet to roll in; sure, they’ll be ridiculously expensive, but that’s where I suppose TR shines. Perhaps ThreadRipper’s biggest achievement is that it’ll make ultra-high-end CPUs affordable-ish for the masses. Now we just wait for the Intel i9 CPUs to arrive.