My background is in the FPGA industry which gives an absolutely fantastic example of this exact dynamic. FPGA companies produce a new generation of chips every 3-5 years or so. There are only two companies, with market share split 60:40. In the last 2 decades we've seen one produce a fantastic product and the other screw up, and then it flips, and then it flips back. Over that time the market share of these companies went 60:40, 62:38, 58:42, 60:40. The actual number of customers who switch to the better product is tiny. Why?
* All the existing knowledge in the company is about one platform; it's incredibly expensive to develop skills on the other toolchain, port all your software, and so on.
* The existing products are all from one vendor so you save loads of effort if all the products are basically similar.
* There are existing relationships with the company you're with.
* You know that if you do switch, all that cost of switching may result in only a few years of using the better product.
* The IP you're buying works better with the vendor you're currently on.
* You don't know what the real world performance will actually be for your application.
So yeah, you could move from Intel to AMD, but the chip is only a tiny part of that cost.
I also work with FPGA (on the software side, but obviously I have a lot of interaction with the RTL people) and I think it's not entirely comparable.
Switching from an Intel x86-64 CPU to an AMD x86-64 CPU is absolutely trivial compared to switching between, say, a Xilinx Zynq SoC and an Altera Arria 10 SoC (to take two chips I'm familiar with).
Switching from one FPGA to another means changing all vendor-provided IPs (things like serializers/deserializers, encoders, decoders, MACs, DMA etc...), you probably have to change almost all your software stack, you need different tools, you need a different kernel, a different bootloader. You need to redesign your motherboards because obviously the pinout is different. And of course even beyond that it's never a 1:1 match in terms of capabilities; there are not always exact equivalents of one chip in the other vendor's inventory. So you may end up having to buy a more expensive chip to have the same number of memory blocks, I/O or whatever.
I think the main reason people are not switching to AMD en masse for servers is simply inertia. The vendor lock-in is pretty tame compared to FPGAs.
I agree with you but would somebody care to enumerate all the ways in which switching is difficult?
All I can think of is:
* Intel amt (all enterprises use)
* avx-512 (nobody uses)
There are a couple more reasons, all related to loss of optimization that can offset any nominal price-performance gains. Counterintuitively, people who are the most sensitive to performance often have the most to lose.
AMD implements some important scalar instruction set extensions as microcode, not in silicon, so if you have an application that uses them heavily (and some of these instructions are significant optimizations over generic C code) you will see a drop-off in performance.
Highly optimized/efficient code for Intel microarchitectures becomes a lot less so on the significantly different AMD microarchitecture. The effects are not small, and re-optimizing for a different microarchitecture can be a lot of work depending on the application.
> AMD implements some important scalar instruction set extensions as microcode, not in silicon
Do you have any examples other than pdep and pext? Although these happen to be my two favorite scalar instructions, I would hesitate to call them important. Compilers won't just generate these from normal source, and I would call their use extremely niche at the moment (things like chess engines, I'm looking at you). They aren't even available on Intel Ivy Bridge and Sandy Bridge machines, which still make up a big enough fraction of data center machines.
So I'm pretty sure the number of entities avoiding switching to AMD because of heavy pdep and pext use is pretty close to zero.
Maybe you have some other instructions in mind though?
> Highly optimized/efficient code for Intel microarchitectures become a lot less so on the significantly different AMD microarchitecture.
This was somewhat true in the past, and probably hit its peak in the P4 vs Athlon/Opteron era. However, it is pretty much incorrect for Zen. Although the details of the hardware implementation might differ (and unless you are an insider you can mostly only guess at this), as an optimization target for software, Zen is very similar. It has a similar width, similar cache design both for data and instructions, similar instruction latencies and throughput, and so on. In fact something like Zen is as similar to Haswell as Haswell is to say Ivy Bridge.
The primary exception is AVX/AVX2 code, where Zen implements everything internally as 128-bit operations. In this area you might make some different decisions if targeting Zen - but the gap is not huge.
What I mean is they won't generate them in any scenario other than directly calling the x86-specific builtin/intrinsic for that exact instruction.
PDEP/PEXT are the big ones, they are extremely important for real-time sensor and event processing (plus a few other things like join parallelization). Those instructions let you trivially compute ad hoc intersections between arbitrary and mixed dimensionality constraints in high dimensionality spaces that would lead to some very ugly and much slower code in pure C++. Also useful for massively parallel graph analytics. Ironically, the instructions were not designed for this purpose. We are talking about a 10x improvement in throughput, it isn't trivial.
I lived in the HPC world prior to the existence of these instructions. I wouldn't want to go back. I used to design insanely complex and inscrutable bit-twiddling libraries to achieve the result of what is a handful of instructions now. It is one of the very few intrinsics I can't live without for most of the high-performance codes I write. The only other non-standard instructions with similar value are the AES intrinsics (which are useful for more than encryption).
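As a concrete illustration of what these two instructions compute, here is a minimal Python model (not the commenter's library; the real instructions do this in a few cycles, and the loops below are exactly the kind of generic bit-twiddling code they replace):

```python
def pext(value: int, mask: int) -> int:
    """Software model of x86 PEXT: gather the bits of `value` selected
    by `mask` and pack them contiguously into the low bits of the result."""
    result, out_bit = 0, 0
    while mask:
        low = mask & -mask          # lowest set bit of the mask
        if value & low:
            result |= 1 << out_bit
        out_bit += 1
        mask &= mask - 1            # clear that bit and move on
    return result

def pdep(value: int, mask: int) -> int:
    """Software model of x86 PDEP: scatter the low bits of `value`
    into the positions selected by `mask`."""
    result, in_bit = 0, 0
    while mask:
        low = mask & -mask
        if value & (1 << in_bit):
            result |= low
        in_bit += 1
        mask &= mask - 1
    return result
```

The two are inverses over a fixed mask, which is what makes them composable into the ad hoc extract/deposit sequences described above.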
Vector instruction support is important but more spotty in its value, at least in my case. I have applications where I expect the details of vector performance will matter but I have insufficient data thus far. Early AVX implementations were marginal but I could see use cases for AVX-512, though I have no anecdotal data to support that conjecture.
Thanks, that is really interesting. It is hard to believe that pdep/ext alone could result in a 10x throughput improvement - but I acknowledge it is possible since that is one very slow to emulate instruction in the general case, and if you needed exactly that...
It actually isn't clear to me exactly what Intel was targeting with that pair of instructions, but they sure are useful in all sorts of scenarios.
> The only other non-standard instructions with similar value are the AES intrinsics
If I can ask, what are the interesting uses outside of encryption? The main use I am aware of is as a handy fast and high-quality hash function implemented in hardware (and you don't need all the rounds when you are just after quality, and not adversarial collision resistance).
For PDEP/PEXT it is the general case of ad hoc and unpredictable bit extract/deposit sequences. A decade ago, I spent a lot of time designing clever libraries that could dynamically effect this but even if you could amortize the overhead of setting up the machinery, it still was ~20 cycles. These instructions eliminated the need to code gen at all, and each instruction runs a lot faster than ~20 cycles. When those instructions showed up with Haswell, it wiped out a lot of code I had written, and in a good way. You can compose them to effect algorithms that would be very complicated (and slow) to implement otherwise.
I've read some things from Intel that suggest PDEP/PEXT were designed for cryptographic applications. However, they are a straightforward implementation of generalized shift networks (there is literature on this), so their potential applications are much broader.
For AES, those instructions have interesting properties for integer manipulation beyond encryption, and even beyond providing the basis for the fastest generic non-cryptographic hash functions currently available for both large and small keys. For example, you can compute a perfect hash (e.g. collision-free hashing from 32-bits to 32-bits) in a few clock cycles for scalar primitives using AES intrinsics. If you understand the construction, which superficially seems like it should not be possible, the result is virtually ideal statistically. Brilliant for hash tables, which still spend a lot of their time hashing, so I am surprised no one seems to be doing it (I figured it out myself, studying the statistical peculiarities of the AES instructions).
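The exact AES-based construction isn't spelled out above, but the property it relies on can be modeled: an invertible (bijective) 32-to-32-bit mix is collision-free by construction. Here is a sketch using MurmurHash3's well-known fmix32 finalizer as a stand-in for the AES rounds:

```python
def fmix32(x: int) -> int:
    """MurmurHash3's 32-bit finalizer. Each step (xorshift, multiply by
    an odd constant) is invertible mod 2**32, so the whole function is a
    bijection on 32-bit values -- i.e. a collision-free 32->32 hash,
    the same structural property the AES-round construction exploits."""
    x ^= x >> 16
    x = (x * 0x85EBCA6B) & 0xFFFFFFFF
    x ^= x >> 13
    x = (x * 0xC2B2AE35) & 0xFFFFFFFF
    x ^= x >> 16
    return x
```

Because the function is a bijection, no two distinct 32-bit keys can ever land in the same bucket before the final table-size reduction.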
> Do you have any examples other than pdep and pext?
They're my favorite instructions too, and I really hope Zen2 fixes their performance problems. But as you say: they're not really used much in practice. I can only point to Stockfish, which uses pdep / pext to calculate where bishops and/or rooks can move on 64-bit (8x8) chess boards.
Side note: Figuring out where bishops / rooks can move is damn cool. https://www.chessprogramming.org/BMI2#PEXTBitboards
"occ" is an occupied square. Remember that in Chess, bishops and rooks are blocked by both allied and enemy pieces. EnumSquare is a value between [0 and 64) that represents where the Bishop (or rook) is located.
The other instructions I came across that are microcode-based were:
1. vgather -- though I'm pretty sure Intel's is microcode-based as well.
2. PCLMULQDQ -- carry-less multiply, used for GCM-mode encryption. Intel's allegedly has 1-clock-per-instruction throughput, while I've measured AMD's at ~2 clocks per instruction, and AMD says it's microcode (it doesn't say which FP pipelines are used).
Neither are scalar code though.
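For reference, carry-less multiplication is ordinary long multiplication with XOR in place of addition. A minimal Python model of the operation PCLMULQDQ performs on its 64-bit inputs:

```python
def clmul(a: int, b: int) -> int:
    """Carry-less (GF(2)[x]) multiply: XOR in a shifted copy of `a` for
    each set bit of `b`. PCLMULQDQ does this for 64-bit operands in
    hardware; GCM's GHASH builds its polynomial arithmetic on top of it."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r
```

Note the absence of carries: (x+1) squared is x^2+1 in GF(2)[x], so `clmul(0b11, 0b11)` is `0b101`, not `9`.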
Allegedly, Intel improved the integer-division instruction to ~20 clock cycles on the 9000-series, but that isn't implemented on servers yet. So I guess 64-bit division / 64-bit modulo is now a major advantage to Intel. But this is a very recent event and not widely deployed yet.
> The primary exception is AVX/AVX2 code, where Zen implements everything internally as 128-bit operations. In this area you might make some different decisions if targeting Zen - but the gap is not huge.
Even then, AVX2 code is more efficient to decode and run. So even if it's emulated on AMD's platform, there are benefits to writing AVX2 code.
Remember that it's not a pure win on Intel systems either: use of any YMM register begins to downclock the whole chip, since those registers draw significantly more power. There are also some vzeroupper issues (vzeroupper is mostly used to avoid this downclocking problem).
In effect: you need to use AVX2 and AVX512 code with a degree of caution on Intel platforms. It's probably a win if you're reaching for it, but for very small loops, the downclock may slow down the rest of your scalar code.
Otherwise, I think I agree with you fundamentally. Optimizing for Zen or Skylake is incredibly similar: use SIMD where possible and cut dependencies.
The Branch predictor is different, but I don't think anyone (aside from Meltdown / Spectre code) relies on the details of either branch predictor. The number of execution pipes are different, but the programmer's focus should remain on cutting dependencies and maximizing ILP, regardless of the number of execution pipes that exist.
> They're my favorite instructions too, and I really hope Zen2 fixes their performance problems.
Me too. AFAIK their slowness is probably due to requiring a specialized functional unit to implement. Something like the unit described in this paper.
> Allegedly, Intel improved the integer-division instruction to ~20 clock cycles on the 9000-series
Do you have a reference?
That would be weird if it applied only to the 9000 series, and not other Coffee Lake cores. After all, it's the same core, reportedly unchanged all the way back to Skylake, so how could the divider be faster?
FWIW, even for Skylake, Agner reports 26 cycles for a 32-bit idiv, so the chip is already close (if you were talking 32-bit division).
> Even then, AVX2 code is more efficient to decode and run. So even if its emulated on AMD's platform, there are benefits to writing AVX2 code.
Yes, that's why I said you _might_ make _some_ different decisions, such as in any algorithm that doesn't scale cleanly to 256 bits, but still ends up faster when the CPU offers full 256 bit ALUs (so 256 bit and 128 bit ops have the same performance).
One real-world example would be something that uses a vector-width lookup table, say for a shuffle mask. With 2 possibilities for each DWORD element, a 128-bit shuffle mask only needs 16 entries, but 256-bit masks need 256 and they are twice as large (8 KiB in total!). With fast 256-bit units you might suck up this penalty, since it might end up faster overall, but with 128-bit units you might be better off going with the much smaller table and 128-bit lookups, at the same total throughput.
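The table sizes above follow from a quick count (assuming 4-byte elements and one full-width mask stored per entry):

```python
def shuffle_table_bytes(vector_bits, elem_bits=32):
    """Bytes needed for a lookup table holding one vector-wide shuffle
    mask per combination of a 2-way choice in each element."""
    lanes = vector_bits // elem_bits      # elements per vector
    entries = 2 ** lanes                  # one entry per choice pattern
    return entries * (vector_bits // 8)   # each entry is a full vector
```

So the 128-bit table is 16 entries x 16 bytes = 256 bytes, while the 256-bit table balloons to 256 entries x 32 bytes = 8 KiB.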
> Remember that its not a pure win on Intel systems either: use of any YMM register begins to downclock the whole chip,
Well not really anymore. Most (all?) recent chips don't downclock for use of 256-bit registers (not counting "high lane powerup"). Only some server chips downclock for "heavy" AVX2 use, which really means a lot of back-to-back FMAs or other heavy FP operations. In general the penalty for 256-bit instructions is small on recent cores (a larger penalty is paid for AVX-512), and compilers generally use them freely (the same is not true for 512-bit) and effectively.
 I think there must be some small changes, since the LSD was re-enabled, implying that they fixed the bug where registers could be corrupted when using the high half of the GP byte registers.
> Do you have a reference?
Yes and no. Apparently my mind messed up my memory: it was a leaked post on /r/intel. I thought it was for the 9000-series, but apparently it was a leak for Cannon Lake. So I was mistaken.
Second: the post has since been deleted. You can see the claims in the comments still however.
> FWIW, even for Skylake, Agner reports 26 cycles for a 32-bit idiv, so the chip is already close (if you were talking 32-bit division).
The post alleged 20-ish cycles for 64-bit division (!!). So I guess that's something to look forward to testing.
I just ran some tests on CNL and indeed the behavior is very different from earlier chips. I am seeing ~15 cycle divs with no pipelining (i.e., the latency and inverse throughput are both 15), versus 36+ cycles latency and 25+ cycles inverse throughput on Skylake.
Interesting. I found only a few other changes beyond that, so far.
CNL also added another AES unit, so you can now dispatch aesenc and its ilk to ports 0 and 1.
>avx-512 (nobody uses)
An example of people using those cpu instructions would be buyers of Intel's proprietary C++/Fortran compiler. The reason companies pay ~$1700 license instead of using free compilers such as GCC and Clang is to specifically take advantage of the latest advanced Intel cpu instructions.
Example buyers of Intel's C++ compiler would include high-frequency trading firms and HPC labs. I wouldn't be surprised if Google, Facebook, and Amazon also bought Intel Parallel Studio compiler licenses for some of their workloads.
Recent LLVM and GCC make good use of AVX-512, although for GCC you need to pass the option -mprefer-vector-width=512.
Unfortunately benchmarks on websites like Phoronix do not make use of them.
But when it comes to numerical computing, it is a boon.
Much easier to take advantage of than the GPU.
I run (Monte Carlo) simulations that take hours or days. These can be vectorized, but I've never heard of someone being able to run them on a GPU.
However, folks bring graphics cards up every time I mention (my love of) avx512. There is always a first, so I do really want to find the time to play around with it, and see how many mid-sized chunks can be woven together. And how memory/cache plays out when breaking things into small pieces.
The last Monte Carlo simulation I ran took a few days to get 100 iterations.
The MC iterations themselves were chains of Markov Chain Monte Carlo iterations. Each of these MCMC iterations takes several seconds.
Therefore, to move to a GPU, I'd like to parallelize between MC iterations, and also within the MCMC iterations.
On a CPU all you have to do is vectorize the MCMC iterations, and then run the chains in parallel.
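A minimal numpy sketch of that CPU strategy — many chains advanced in lockstep so each MCMC update is one vectorized operation across all chains (toy random-walk Metropolis on a standard-normal target, not the commenter's actual model):

```python
import numpy as np

def metropolis_chains(n_chains=1024, n_steps=500, step=0.8, seed=0):
    """Random-walk Metropolis targeting a standard normal. Every chain
    is advanced in lockstep, so each update is a single vectorized op
    over an array of chain states rather than a per-chain loop."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_chains)          # one state per chain
    logp = -0.5 * x * x                        # log-density up to a constant
    for _ in range(n_steps):
        prop = x + step * rng.standard_normal(n_chains)
        logp_prop = -0.5 * prop * prop
        accept = np.log(rng.random(n_chains)) < logp_prop - logp
        x = np.where(accept, prop, x)          # branch-free accept/reject
        logp = np.where(accept, logp_prop, logp)
    return x

samples = metropolis_chains()
```

The branch-free `np.where` accept/reject step is exactly the masking pattern a SIMD (or GPU) implementation would use in place of fine-grained control flow.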
> I've never heard of someone being able to run them on a GPU
I'm surprised; intuitively (though, mind you, as someone who has never done GPGPU programming, only read articles about it), I'd think some combination of 1. a CPU-RNG-seeded simplex-noise kernel, for per-core randomness; and 2. a cellular-automata kernel embedding of your simulation logic, would let you do MC just fine.
I've often brought up GPUs while talking to folks, because they're interesting and offer a world of potential.
I have two computers, one with a Ryzen 1950X, and the other an i9 7900X. Both CPUs cost about the same, but the i9 (with avx-512) is close to 4 times faster at matrix multiplication. Yet it is still about 10x slower than a cheaper Vega 64 GPU.
But the folks I talk to aren't generally computer scientists. They're statisticians and academics, mostly. A few have tried, but they haven't been successful.
There are libraries like rocRAND / cuRAND for random number generators.
It's probably possible, and I just need to sit down and really experiment. For the MCMC chains (going on within MC), Hamiltonian Monte Carlo sounds more feasible than Gibbs sampling. In Gibbs sampling, you need lots of different conditional random numbers. You often get these from accept/reject algorithms -- i.e., lots of fine-grained control flow.
And ideally, each MCMC run has at least an entire work group dedicated to it. You don't want the entire work group calculating a small handful of gamma random numbers (with all the rest masked). The parameters of the gammas are not known in advance, so they cannot be pre-sampled.
Hamiltonian Monte Carlo is probably much friendlier. However, I have heard concerns that the symplectic integrator used needs a high degree of accuracy to avoid diverging. That is, it needs 64 bits of precision.
GPUs with more than 32 bits are well outside of my budget. Although, I could look into tricks like double-singles for the accuracy-critical parts of the computation.
The simulation I mentioned in my previous comment was using Hamiltonian Monte Carlo. However, each iteration was rather involved, and while much of it is vectorizable (e.g., matrix factorizations and inversions), doing so on a GPU is AFAIK not trivial.
It seems like a gigantic leap in complexity.
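On the precision point: "double-single" tricks carry the rounding error of each float32 operation in a second float32. The simplest member of that family is compensated (Kahan) summation — a sketch of the idea, not tied to any particular integrator:

```python
import numpy as np

def kahan_sum_f32(values):
    """Sum float32 values while carrying the rounding error of each
    addition in a second float32 (`c`), recovering much of the accuracy
    a plain float32 accumulator loses -- the core double-single idea."""
    s = np.float32(0.0)
    c = np.float32(0.0)          # running compensation term
    for v in values:
        y = np.float32(v) - c
        t = s + y                # float32 + float32 stays float32
        c = (t - s) - y         # low-order bits that were lost in `t`
        s = t
    return float(s)

def naive_sum_f32(values):
    """Plain float32 accumulation, for comparison."""
    s = np.float32(0.0)
    for v in values:
        s = s + np.float32(v)
    return float(s)
```

Summing 0.1 a hundred thousand times, the naive float32 accumulator drifts by whole units while the compensated version stays within rounding error of 10000.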
clang and gcc supported AVX-512 before any chips using it were even available.
icc might still have an edge on vectorized code, but it is not that big.
Parallel Studio is more than just latest-ISA support, though. VTune tells you all about performance at the processor level, so you get much more accurate perf profiles from the hardware itself.
You can also get VTune as part of System Studio through a 90 day perpetually renewable community support license - https://software.intel.com/en-us/system-studio/choose-downlo...
No talking to Intel's engineers, but kind of cool if you're alright going without a real support channel.
This isn't a huge one, but we're a VMware shop, and you can only vMotion (move live machines) VMs between servers on the same chip class or lower. When we had a mix of AMD and Intel servers, we had to put them in different clusters, and could only migrate between clusters when the VM was powered off.
You can, however, simply power down the VM and then use vMotion to migrate from Intel to AMD. Not "live", but pretty close to it when migrating to the new CPU cluster.
> * Intel amt (all enterprises use)
Is that true?
I work for a hundred billion $ tech company, and Intel AMT is disabled fleet wide due to security issues.
Tech companies would tend to disable them. Non-tech enterprises like things like that.
I work at one of those as well. It's disabled on all our machines too.
Odds aren't that bad that you work for the same one :D
Given that I mostly work on embedded software I'm not the best person to ask, but I suppose that if you're writing software that uses heavy performance optimizations, going from one CPU model to another is not necessarily trivial.
On top of that most server farms wouldn't completely ditch their old computers to replace them in one go, so you end up with two different vendors to deal with, potentially two versions of your software etc... That increases the maintenance burden quite significantly.
Intel might be giving major buyers sweetheart deals on their chip prices. So while an EPYC 7371 might get discounted 20% for an Amazon, Intel may be discounting 50-70% (grain of salt: just speculation on my part). And these CPUs sit in systems where most of the cost is likely in RAM, NVMe, and supporting infrastructure.
These are high margin parts, so I imagine there's wiggle room for volume.
Not a chance. Intel is selling every CPU they can print. Enterprise customers are paying a premium at the moment.
STH previously reported that Intel began offering Xeon discounts to smaller organizations that asked for AMD EPYC price quotes.
That's a niche market. Intel has the larger players by the balls. Very few companies are comparison shopping AMD vs Intel.
You say that, but AWS has still launched EC2 instances with both AMD and ARM CPU cores recently. I imagine Amazon buy enough Intel CPUs for them to take notice.
I wonder if the primary driver for buying and offering alternative platforms for EC2 instances was to send Intel a message. "We aren't afraid to go to AMD if you don't make your offering more competitive."
That is not because there is far more demand than they can cope with; it is because they had to fulfil Apple's modem order.
It's an open secret that Intel adds custom accelerator hardware to their server chips for the big corps, hardware that is undocumented/disabled for everyone else. I have no idea if AMD does this too.
In the HPC segment, there might also be compiler issues.
Move aside, I'm an HPC administrator! :)
In the HPC world, things are not as clear-cut as benchmarks, or the vendors' own marketing materials/numbers, make them appear.
First of all, the application you're running may be developed for a specific compiler, and the code sometimes depends on optimization behavior of a compiler. So, changing compilers changes a lot of things. This is why we have both Intel's tools, and GCC toolchain fully supported. For example, LAPACK and its siblings take compiler behavior and CPU specifications into consideration while compiling in an optimized way to maximize its performance IIRC.
Also, there's no guarantee that Intel's compilers are fastest on Intel hardware. In the days of Opteron 6100s, using Intel compilers, we were able to beat Intel processors of the same era. You heard it right: Compile using intel compiler with specific flags, run on AMD CPUs, get higher performance, profit!
Intel's AVX512 is well used and abused in the HPC world; however, AMD's HPC performance is not as bad as jandrewrogers implied in his comment. AMD is originally an FPU company, and while their scalar instructions may lack on paper, they run really fast.
In the HPC world, the CPU/board architecture becomes irrelevant after some point. SpecCPU benchmarks are the ultimate benchmarks, because their behavior is compiler-agnostic and pushes every aspect of the CPU very, very hard. If you can get the same SpecFP with an Intel part, you can get more or less the same performance on real workloads.
If you have any other questions, you can AMA. I'll try my best to answer.
Funny addenda: We have some applications used widely by users, and when fully optimized, some older Intel CPUs outpace the newer ones by a significant margin. This is some heavy handed, exotic optimization.
> Intel's AVX512 is well used and abused in HPC world, however AMD's HPC performance is not as bad as jandrewrogers implied in his comment .
I know earlier AMD processors didn't actually have 256-bit support, so AVX instructions were actually implemented by soaking up two 128-bit lanes (it helps that AVX doesn't have many instructions that actually permit you to move data between the two 128-bit slices of a 256-bit vector). For their AVX-512 performance to not be absolutely horrible, I take it they've actually built real AVX-512 units at some point?
They're doing the same thing they are now with AVX2, but with AVX-512. So 512b instructions will translate into two uops.
It's still useful to implement the AVX-512 instructions because they fill in some holes in the existing AVX instruction sets (eg lack of scatter/broadcast instructions) and implement a new SIMT-like op-masking functionality.
Wikichip doesn't list AVX512 as supported. Am I missing something in the spec sheet?
Maybe compilers are emulating AVX512 behavior with other instructions when targeting Zen/Zen+/Zen2 directly?
From what I've read now, it looks like AMD still uses 2 x 128-bit AVX units to execute AVX2 instructions. Also, AMD is always a generation behind Intel in terms of FP instruction sets, so Zen doesn't support AVX512.
According to WikiChip, Zen 2 actually has 256-bit FPU paths. I was unable to find a credible benchmark for Zen 2, so I can't talk about its performance. However, when analyzed from the perspective I've given below, it's not hard to assume that Zen 2 is a heavy hitter in terms of floating-point performance.
However, the interesting part is, when you look at the SpecCPU 2017 FP Rate, an AMD Epyc 7601 system has similar per-core performance to a much bigger Intel Xeon Platinum 8180 system.
* AMD's per core base (lowest) rate is 4.1875.
* Intel's per core base (lowest) rate is 4.3482.
* AMD is running GCC compiled code.
* Intel is running Intel compiled code.
* Intel has higher clock speed.
As I said before, it looks like Zen 2 is going to be a better HPC processor than Zen. Zen looks like a very good Enterprise processor now.
So with my HPC hat on, I can conclude that not having 512-bit hardware is not a crippling omission.
Addenda: I forgot to say that Intel has something called "AVX frequency". Since AVX, AVX2 and AVX512 have tremendous power requirements compared to other operations, Intel lowers the CPU to an undisclosed frequency. When I last checked, the AVX frequencies of the Intel CPUs that we use weren't in the technical guides and were not public in any way. So the peak SpecFP Rate is not very different from the base ones.
Also, since the CPU's thermal budget is very constrained during AVXx operations, the other ports' speed is also reduced. So at the end of the day, AVX512 is not a free turbo boost in HPC environments under heavy/continuous loads.
A large part of the reason why AMD can reach similar sustained real throughput to Intel despite having a fraction of the FPU throughput is that they run the FPU as a separate unit on different issue ports, and their core is slightly wider when you measure the amount of instructions it can retire.
So even though the Intel CPU can in theory do 4x the computation AMD can in the vector units, in reality even the tightest real vector code does all kinds of things other than vector computation, in the middle of that vector stuff, like computing addresses for loads and stores and managing loop variables. On AMD, those intermixed scalar instructions go into separate scalar ports, on Intel CPUs they take space in the same issue slots that the vector code uses.
Then on top of that, the memory bandwidth is a great equalizer. Doesn't matter how many multiplies you can compute if you cannot load the operands, and the AMD systems are much closer there than they are in the pure computation, especially as they have a lot more L3 cache per core.
On Zen 2, AMD does two big things that are going to really help them in HPC loads. They are doubling vector unit width, and they are doubling the amount of L3 per core. I honestly think the second change will help more than the first.
You're right. Also Intel's AVX implementation is very power heavy, and they need to lower CPU frequency to fit into their thermal budget (see "Addenda:" in my previous comment).
Also yes, AMD's memory subsystem has much lower latency, and has higher bandwidth. Also their direct-attach approach is better than Intel. I forgot that advantage TBH :)
However, I can argue about L3's effect on speed. In some cases, the code and the data are so small, but the computation so heavy, that you can fit almost everything into the caches. I had a 2MB binary which required 200MBs of memory at most, but it completely saturated the CPU in every way imaginable.
So, in some cases caches have a great effect on speed, especially if the data you're invalidating and pulling in is huge. However, if the circulation is slow, a faster FPU always trumps a bigger cache.
> Also yes, AMD's memory subsystem has much lower latency
No, AMD's latency is generally worse than Intel's on Zen chips. Here's the first example I could Google, but the same trends repeat themselves across many benchmarks.
My overall impression is that the typical gap is 5-10 ns.
Thanks for the link. I will take a look.
As I clarified below, in my other comment, we were unable to get new Zen systems. So I’m not knowledgeable about their behavior.
However, I need to make my own benchmarks to see how this increased latency affects performance of different work loads and scenarios.
Out of interest, what do you mean by this? Are you talking Zen1 or Zen2, because in my experience playing with Zen1 EPYC the memory latency was worse than Xeon Broadwells, and on top of that you had worse NUMA issues that could affect certain cores which weren't directly attached to the memory and this added additional latency more than on the Xeons I was comparing against.
> Are you talking Zen1 or Zen2...
Unfortunately, neither. The last AMDs I was able to play with were the Opteron 6xxx series. The later ones weren't as fast, and Zen 1 was not easy to obtain, so we were unable to acquire them.
The last ones I used were better than their competitors of the era. I also had a desktop system from that era which was way better, at least for my workloads.
I'd love to play with Zen 1/2 and compare "benchmarks" to "real workloads", because as I said before, in HPC, benchmarks are just numbers.
e.g. Your memory bandwidth may be low, but if it's low-latency and you're hammering the bus, bandwidth may not be limiting. OTOH if you're streaming something continuously, your latency becomes moot, because the bus has already queued up everything you need and can keep piling up stuff you need until you process what's at hand. For the second scenario, I once listened to a talk about an embedded system in which the developers were able to accelerate the system 10x by using an in-CPU accelerator unit to copy required memory segments to cache independently of the CPU.
HPC compilers have supported AMD pretty well for the past 15 years, from back when Opteron was the best x64 for a couple of generations.
That's true, universities for example will often use Intel compilers on academic licenses, other sites may use other performance-oriented commercial compilers such as PGI.
avx-256 performance per core is half of Intel's, and people use that.
AMD Zen has 4 128-bit SIMD units, while Skylake-S (and earlier) Intel chips have 3 256-bit SIMD units, and Skylake-X/SP (hereafter SKX) chips additionally have 1 or 2 512-bit SIMD units (which overlap with the 256-bit units).
Now not all units can run all instructions. E.g., Intel chips run FP instructions on only 2 of those units, but AMD can run FP instructions on all 4, so in that sense they are even.
However, not all FP instructions can run on all units on AMD: FP multiplications run on only 2 units, and FP additions run on the other 2 - so if you are doing all multiplies, both AMD and Intel can do 2 per cycle, and since Intel's units are twice as wide (quadruple on SKX with 2 FMA units), Intel is twice as fast. If you do a 1:1 mix of mul and add, however, then AMD and Intel may be tied - but then AMD is further hamstrung by only having 2 128-bit load units vs Intel's 2 256-bit load units (512-bit on SKX) - so it is entirely possible that many kernels are limited by load/store throughput on AMD.
For things like in-lane shuffles, AMD and Intel (pre-SKX) have the same throughput: AMD has 2 128-bit shuffle units, and Intel 1 256-bit one. For cross-lane shuffles the 128-bit AMD units really struggle since multiple ops are required and Intel wins big.
So AMD's AVX-256 perf being half of Intel's is more or less the worst case, and many cases will see closer performance. If Zen 2 doubles everything to 256-bits everything will change dramatically.
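The back-of-envelope arithmetic above can be written out. The unit counts below just restate the figures from this comment; real throughput also depends on ports, latencies, and load/store pressure:

```python
# Peak FP32 operations per cycle from the unit counts discussed above
# (a sketch of the comment's own numbers, not measured throughput).
def peak_fp32_per_cycle(units, unit_bits):
    return units * (unit_bits // 32)  # one op per lane per unit

zen_mul_only = peak_fp32_per_cycle(2, 128)  # Zen: 2 units can multiply
skl_mul_only = peak_fp32_per_cycle(2, 256)  # SKL: both 256-bit FMA units can
skx_mul_only = peak_fp32_per_cycle(2, 512)  # SKX: 2x 512-bit FMA units

zen_mixed = peak_fp32_per_cycle(4, 128)  # 1:1 mul/add mix uses all 4 units
skl_mixed = peak_fp32_per_cycle(2, 256)  # still bounded by the same 2 units

assert skl_mul_only == 2 * zen_mul_only  # Intel 2x on pure multiplies
assert skx_mul_only == 4 * zen_mul_only  # SKX 4x
assert zen_mixed == skl_mixed            # tied on a balanced mix
```

Which is exactly the "half in the worst case, roughly tied in the best case" spread, before load/store pressure tips the balance further.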
SKX is what Zen competes with, for now. I don't think Zen 2 will arrive for Epyc any time soon, and when it does, maybe Intel will have a new part out. The current timeline for Zen 2 is desktop parts maybe 1Q2019, server parts traditionally lag. We have to look at what's actually in front of us.
No one has announced 256 AVX in Zen 2; while I'm sure it's possible, I'd think AMD would be advertising that at some point if it were happening. On the other hand, they've doubled the chiplet per socket count, which may compensate somewhat for the reduced AVX width per core.
For the record, I'm firmly in AMD's camp here; I appreciate both the underdog aspect and the renewed competition in the x86 space. I own a first-gen threadripper myself. But it's still important to acknowledge where Zen falls short compared to Skylake-X.
I mentioned Zen2 in a single sentence at the end of my reply, in the context that if Zen2 has 256-bit units (I think the CTO has confirmed it will), the comparison with respect to Intel will change. Intel may have another chip out at that point, but it seems very likely that 2x 512-bit FMA units will still be the top end. My comment was also in the context of Zen2 chips in general, not necessarily just EPYC (since at this point we are comparing uarches). It seems relevant enough to mention it since the first of these chips are apparently imminent.
The rest of my comparison was Zen vs SKL and SKX. Zen competes against both of those, SKL in the laptop, desktop and (some) workstation space, and SKX in the server, (some) workstation and HEDT space. As a practical matter, for things like choosing on which hardware to deploy to in the cloud, it still also competes against Broadwell all the way back to Sandy Bridge, since chips of that era still dominate in the data center (and Intel still sells a ton of those chips).
> But it's still important to acknowledge where Zen falls short compared to Skylake-X.
I don't see how you could read my post and come to another conclusion? AVX-256 performance falls somewhere between half (worst case) and approximately equal (best case) to Intel's, depending on your load. FMA-heavy and L1/L2-hit-heavy loads will be close to the worst case, and some integer, shuffle/permute-heavy or memory-bound loads will be closer to the best case.
Yeah, all of this is mostly fair. I dispute the idea that SKL and the laptop/desktop space is significant in a discussion of AVX — I don't think the laptop/desktop space cares much about AVX, and thus SKL (and non-EPYC Zen) is mostly irrelevant (IMO). That isn't a hard fact, though, and your opinion is also reasonable.
I'm not sure it makes too much sense to compare against older Cloud platforms — the same reason older Intel µarchs dominate (deployment of newer hardware takes time and money) also limits the availability of new AMD µarchs. But it is a reasonable point, if the relative prices of the offerings don't reflect the cost of new deployment in the way I imagine they would.
> I don't see how you could read my post and come to another conclusion?
I guess I misread your post! I'm sorry. My initial impression was that it was highly defensive of AMD. But I think I read too much into it. I'm sorry about that.
> I dispute the idea that SKL and the laptop/desktop space is significant in a discussion of AVX
Well I think it is significant. I'd say that overall the laptop/desktop space makes reasonable use of AVX and AVX2, probably more than your average load in the data center.
HPC certainly makes the most use of AVX/AVX2, but on the laptop/desktop you have at least:
- Media encoding and in some cases decoding (this also happens on GPU)
- Rendering and graphics work (this also happens on GPU)
- All sorts of random AVX2 use in compiler generated code and runtime libraries (e.g., AVX2 is used widely in perf sensitive libc routines like memcpy) and even in JIT-generated code
Yes, good points. To nitpick (sorry):
> AVX2 is used widely in perf sensitive libc routines like memcpy
Depends on the libc. I believe FreeBSD's libc avoids AVX to avoid the additional context switching cost for libc-using programs that don't already use AVX.
All Skylake-X chips actually have the second 512b SIMD unit, including the i7s. Intel's initial documentation here was incorrect, InstLatX64 determined that both units are enabled.
Yes, for now - but I don't think I implied otherwise?
I said Skylake-X/SP. Certainly not all Skylake-SP (aka "Scalable Xeon" or whatever Intel calls them) server chips have 2 FMA units.
BTW, I believe Intel also documents this fact about X chips on ARK now as well.
I almost know zero about FPGA market but it seems that it's infinitely harder to port programs between FPGAs than from CPUs.
Other than that thanks for the insights, interesting facts.
That is probably generally true, but if you have serious compute workloads that you've specialized for your server's CPU, it could be hard to port between AMD and Intel as well, when people have customized the code for particular cache sizes, SIMD instruction sets, core counts, throttling schemes, RAM throughput, branch predictors and so on.
That's quite true, but what I've heard from FPGA users is that the toolchains and conventions are alien to each other. I think these are two different kinds of hell; I don't know which would cost a company more.
When you deliver an FPGA you're normally also making your own hardware. That's a big part of the FPGA porting cost: you have to redesign your boards. I don't think the same is generally true in the x86 realm. Even if you're specifically using some feature of the Intel microarchitecture, it's still essentially a code change rather than a full redesign.
Only when you design with unportable IP which the FPGA vendors are dying for you to design in. Some proprietary things like clock multipliers are unavoidable but they can be isolated in portable wrappers. Most FPGA devs are not this forward thinking and lock designs into a platform that makes future migration a problem. No different than writing code against Win32 vs Posix.
It is quite routine for ASIC development to prototype on FPGAs and that code will necessarily be made more portable to accommodate the architecture changes an ASIC requires.
But aren't the biggest customers of these chips server owners? It's not like these companies/people will program the CPU itself, and software support for Intel and AMD is exactly the same (is there anything gcc can compile for Intel but not for AMD? Maybe one could argue the other way around, given the recent security bugs, Spectre et al.). So I can't see how this can be a factor.
When it comes to servers it's even more true, as you're buying the platform as well as the CPU. The CPU is important to be sure, but the platform features are even more so for running the hardware at scale.
A simple example: for a long time you couldn't PXE boot a server from a 10Gb NIC while using AMD chipsets. So every AMD system needed a 1Gb NIC cabled and maintained just to build the server, vs a single 10Gb NIC on an Intel platform. That scales out: for hundreds of servers you now need a 1Gb fabric and the associated switches etc. just to be able to build your systems.
That, uh, doesn't sound right. None of your networking equipment can do auto-negotiation?
I think he meant that the workload still required the 10Gb link, so the AMD servers ended up with two NICs and two sets of cables (a 10Gb connection and a 1Gb connection). This leads to higher resource requirements, as they effectively need double the number of switches.
I'm not familiar with the problem at all, that is just how I interpreted his statements.
If you use an Intel CPU you can use the Intel compiler, which for some workloads does a better job of optimization than gcc. The Intel compiler doesn't know the right optimizations for AMD processors (at best; Intel has been known to pessimize AMD processors).
Though overall you are correct that switching AMD to Intel is generally not a big deal, the above factors need to be accounted for.
> which in some workloads does a better job of optimization than gcc
Is this really true in a meaningful way? I mean, unless you happen to have an identical workload to the benchmarks?
I ask because the benchmarks and reports I've seen over the past ten years or so have been quite disappointing in terms of icc's performance. I always assumed it must produce better-performing code because it's a proprietary product from the people who created the architecture, so they must know how to optimise for it, but from what I've seen there's really not a big difference: on some things icc has a slight lead, on others gcc does, but overall they don't seem to perform much differently. Unless you're making use of all the Intel libraries (Performance Building Blocks, Math Kernel Library, etc.), which I imagine are tuned for icc, there doesn't seem to be much reason to use icc, especially since it's been known to pessimize for AMD CPUs.
Am I wrong?
I think icc's biggest advantage is automatically translating appropriate calls into calls to those performance libraries and linking them.
If you manually make calls in GCC, I bet you'd see similar performance much of the time.
icc also seems to more aggressively parallelize, even places you didn't ask it to.
Depends on your workload. A 1% difference between compilers can be very significant in a tight loop across a few thousand cores over several days. In other cases even a 100% difference wouldn't be noticed.
The Intel compiler does not test for CPU capabilities, so it simply doesn't enable optimizations on AMD CPUs: https://software.intel.com/en-us/articles/optimization-notic...
I wonder why Intel uses images instead of text on that page. Do they want to hide the information from web search?
According to https://www.agner.org/optimize/blog/read.php?i=49#49 it also checks the CPUID.
There was also a guy who benchmarked a Blizzard game in a VM where he changed the CPUID of his AMD CPU to an Intel one and received better performance than on the host (with GPU pass-through btw).
Yeah, that is what I meant: the Intel compiler does not check whether the CPU supports a feature or not; instead it checks the CPUID and applies optimizations for that specific CPU, but of course the AMD CPUs get no optimizations.
I was hoping someone would have an alternative reason why those web pages are images and not text, or confirm my suspicion that it is SEO-related.
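For the record, the check discussed above is on the vendor string that CPUID leaf 0 returns packed into EBX/EDX/ECX. A small sketch of how a dispatcher can tell the two vendors apart; the register constants are the well-known values these CPUs report, but the dispatch function is a simplification, not ICC's actual code:

```python
import struct

def vendor_string(ebx, edx, ecx):
    """CPUID leaf 0 packs the 12-byte vendor ID into EBX, EDX, ECX, in that order."""
    return struct.pack("<III", ebx, edx, ecx).decode("ascii")

# Well-known register values returned by Intel and AMD parts:
intel = vendor_string(0x756E6547, 0x49656E69, 0x6C65746E)
amd = vendor_string(0x68747541, 0x69746E65, 0x444D4163)

assert intel == "GenuineIntel"
assert amd == "AuthenticAMD"

# Simplified picture of the dispatch decision: non-Intel vendors fall
# through to the generic code path regardless of actual capabilities.
def pick_code_path(vendor):
    return "optimized" if vendor == "GenuineIntel" else "generic"

assert pick_code_path(amd) == "generic"
```

This is also why the VM trick mentioned above works: faking the CPUID vendor flips the dispatch decision without changing the hardware.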
Oh, I missed the "not" in your previous comment, sorry ;)
I do work with FPGAs as well. The toolchains are a lot more tightly coupled to the device. Also, IP tends to use chip-specific features a lot of the time to get better performance or space utilization. So it's a lot harder to justify a switch. AMD's CPUs cover the cases of what Intel provides 99% of the time.
This is especially apt since Altera is now owned by Intel, and they're the one that put out a dud this generation.
The future doesn't look very bright for Intel, with ARM on one hand and AMD on the other (and Intel's 10 nm and 7 nm processes having had multiple setbacks). I wouldn't count on Intel doing well; it might end up like Nokia or Microsoft. I wouldn't invest in Intel...
Microsoft Windows is losing relevance, and Microsoft has been unable to leverage any market position with Windows Phone.
Microsoft is the #1 company by market cap right now. It's... kinda weird to claim that they're dead and/or dying.
I never said they're dead or dying. They're losing relevance because they're losing their dominant (monopolist) market position (Windows and Office).
If Intel loses their dominant monopolist market position in the processor market but stays relevant in e.g. the GPU, network card, SSD, and what have you business then I (from my consumer PoV not paying too much plus ethical PoV) am happy. But the shareholders would gain far more from a dominant monopolist market position...
But Microsoft has been successfully repositioning itself as a cloud provider, it will be a few decades before they become irrelevant.
Latest news is that Intel's 7nm is doing quite well, and is on track to start high-volume manufacturing in 2H 2019. And given that Intel 10nm is roughly comparable to TSMC 7nm, it's certainly possible that Intel will regain their process advantage very soon.
There are caveats in the above statement, but I certainly wouldn't bet against Intel.
> Latest news is that Intel's 7nm is doing quite well
This was the very first time I ever heard of this. Do you happen to have a source?
I think parent typo'ed. Latest news is volume production for Intel 10nm (~ TSMC 7nm) in 2H19.
He's probably referring to the likes of this:
You've quoted a press release by Intel on how Intel is awesome and there is no problem and everything is peachy.
Not the most trustworthy source of info, I may say.
Press releases are material statements; if Intel is lying or being unrealistically optimistic then investors can sue them and likely win.
I got a bridge to sell you if you think press releases are reliable sources of information.
There are more than two companies.
What other player exists in this space, producing comparable performance in the same cost ballpark?
Performance and price are not the only characteristics that might be important for a project. In the aerospace industry, Microsemi has a large market share with their range of rad-hard/rad-tolerant FPGAs.
Are rad-hard CPUs going to generate billions in sales to replace intel and AMD gear?
For the market segment we are talking about, there are exactly 2 actual choices.
A large market share in aerospace doesn't translate into a large market share overall.
I think most people know that there are more than 2 FPGA suppliers, but all the other ones are essentially niche or low-end market players.
The moment you’re looking at FPGAs with a serious amount of logic, you’re restricted to Intel and Xilinx.
Microsemi is no more. They got purchased by Microchip.
It's probably more that they're clearing out the old chips before launching their 7nm-based Zen2 Epyc chips next quarter. They've already sampled out to hyperscalers.
But, generally, Epyc does have some deficiencies. It's essentially NUMA-on-a-package, and each NUMA node is itself essentially a pair of 4-core processors jammed together (each with its own cache, with all the usual problems that brings). That doesn't work for everything... for example GPGPU compute doesn't really like to be split across NUMA. Inter-core and memory latency is also much higher than on Intel platforms. Intel is playing with much larger building blocks: their die is 28C and they can scale up to 8 sockets, while AMD has 32C per package but can only scale up to 2 sockets, because there are already four dies inside each package (both systems scale to 8 dies).
A lot of stuff is fine with those tradeoffs, particularly the stuff you use server processors for. And it's pretty cute if you want a lot of storage or lanes. And it's pretty cheap to manufacture due to their smaller dies (although of course TCO is much larger than just the cost of the CPU). But it's not for everyone.
Next quarter they're moving to 8 dies in a package, on 7nm, with updated AVX2 and probably support for the AVX-512 instruction set (at half throughput).
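The topology arithmetic in the comment above works out as follows (a sketch of the comment's own figures, not a spec sheet):

```python
# Die/core arithmetic: both vendors top out at 8 dies per system,
# reached very differently.
amd_dies_per_socket = 4       # Epyc: four 8-core dies per package
amd_cores_per_die = 8
intel_cores_per_die = 28      # monolithic Skylake-SP die

assert amd_dies_per_socket * amd_cores_per_die == 32  # 32C per Epyc socket
assert 2 * amd_dies_per_socket == 8                   # 2-socket Epyc: 8 dies
assert 8 * 1 == 8                                     # 8-socket Xeon: 8 dies
assert 8 * intel_cores_per_die == 224                 # but 224 cores total
```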
The Zen 2 line is definitely looking interesting, to say the least. I think where Ryzen is going could be interesting in the server space as well: the rumored Ryzen 3600G, for example, would have GPU and CPU cores connected through the common I/O die. It wouldn't surprise me to see similar approaches in the workstation/server space. I wonder if there's been consideration of ARM compute units even.
It's definitely an interesting approach.
The mindshare of mid- to upper-level execs within its customers' organizations. E.g., in today's "Nobody ever got fired for buying IBM", "IBM" is Intel.
Why do you think $1500 for a 16-core part is giving it away? There was a time not long ago when $1500 for an Intel Xeon was a solidly high-end part. This part is really nice, but it looks pretty low-end when compared with dual-socket machines with 128 cores.
It's likely AMD is making a healthy profit on it and has an estimate of how much market share they'll gain by significantly undercutting Intel versus the profit per part. I.e., if knocking 10-20% off their profit increases their market share by 30-40%, it's a no-brainer.
AVX2 and AVX-512
But probably more of a factor is that people are used to them and trust them.
ISTR seeing some benchmarks where using AVX512 really hurts your throughput if you're running multithreaded code. Again, vague memory, but the reason was that it winds up heating up the core that's running the AVX512 code and causing all the other cores to thermally throttle.
Yes, exercising AVX-512 on Intel (very much at all) causes thermal throttling in most workloads.
AVX-512 is very new, especially to the server market. There likely aren't many people absolutely relying on it yet.
Isn't Zen 2 adding support for AVX512?
Yes, but a slow one. They just bump AVX units from 128-bit to 256-bit, making it Haswell-style. Skylake-X has a full 512-bit wide AVX unit.
Doesn't Skylake-X have the whole downclocking issue when running AVX-512 workloads though? Unless you can use almost exclusively AVX-512 then it doesn't make sense.
AVX and AVX512 both do downclocking.
Likely AMD will have to downclock too, pretty fundamental limits at work there.
AMD verified they would be running AVX2 at full clocks. Doing AVX-512 would happen at half-rate, but the processor would continue at full speed. In contrast, skylake downclocks base clocks by 30+%.
Not only does the AVX calculation downclock, but all the other calculations in other ALUs/FPUs also downclock. For pure AVX-512 workloads, Intel would be a little faster, but for mixed loads, AMD should be much faster overall.
AMD says they don't downclock AVX2 (about 2/5 down the page)
See page 13 for AVX clockspeeds
With 7nm they might not need to downclock that much.
AVX-512 isn't just a wider vector, it also fills in some holes in the AVX instruction sets. It's still useful to support it, even if they are only doing it at AVX2 throughput levels.
Citation please. I never heard of Zen2 supporting AVX512.
A near monopoly
Brand value, I guess? All AMD hardware (CPUs and GPUs) I've bought in the past was crap, I'm never buying anything from them again, no matter what they do. It's always the same: on paper they are better and cheaper, but once I have them at home they are worse and, in case of the GPUs, they have pathetic drivers. I've always felt ripped off after buying AMD, it's never worth it just to save some pennies.
This has been my experience, I wonder if others have felt the same.
Whenever AMD's CPUs have been price/performance competitive they've always been a fantastic choice. Especially now, when with Ryzen/Threadripper/Epyc you not only got more cores for the less money, you also got better security.
AMD's GPU drivers have been perfectly fine since the 9700 Pro. That's a 16 year old myth at this point, let it go.
ATI started rewriting their OpenGL driver in 2004-2007. Long after 9700 Pro’s prime.
As told by the author, it was a long-lasting disaster.
Do you have an actual argument here? The story, while interesting, doesn't really contribute either way. Per the author's own story the switch to the new driver broadly didn't happen until after it was finally stable, and the legacy driver continued to receive performance optimizations in the meantime.
It's an interesting story of project management nightmares, but it doesn't provide any argument to the state of AMD's drivers as experienced by end users either way.
In terms of stability we do have some large-scale metrics on that front, such as Vista's crash blaming. Those metrics don't support claims that AMD's drivers are less stable than Nvidia's: Nvidia was responsible for 28.8% of Vista crashes, while ATI was responsible for 9.3%. Given there were more Nvidia than ATI users that's not necessarily damning, but it also clearly disagrees with the notion that ATI is deeply unstable while Nvidia is rock solid.
And keep in mind during the 9700 Pro's prime all the way up to today OpenGL is used by nearly nothing on Windows. We're already talking about the niche use case.
If the driver became highly unstable after 2004, you can’t claim that the driver has been stable since 2002. That is all.
> driver became highly unstable after 2004
The ATI dev on twitter made no such claim. In fact he never made any claim about stability at all. Just that the new driver was missing functionality, so only specific things (like Doom3) got it and they were shipping 2 OpenGL drivers as a result during 2004-2007. End-users weren't broken during that timeframe. The old driver didn't suddenly break and get super unstable. If anything the complaint is that the old driver was too stable, it wasn't getting new features & changes fast enough.
It's really not a myth; AMD drivers on Windows have completely terrible OpenGL performance, for instance.
Do you have any benchmark comparisons to support this claim? All I can find in searching are random threads of people saying this, but nobody providing any actual evidence or comparisons.
The few games using OpenGL on Windows I can find show AMD's performance being perfectly competent:
The nvidia cards were on average a bit faster, but that was true in DX as well. Meaning it wasn't just "lol amd opengl drivers"
And of course Linux testing, where OpenGL is far more common, shows no major disparity either, which you already know, hence the qualifier "on Windows". But that qualifier makes no sense: the driver is going to share the bulk of its code between Windows and Linux, as has also been rather well tested and verified.
Which gets back to "it's a goddamn 16-year-old myth" that literally spawned out of Nvidia's hyper-aggressive OpenGL optimization of Quake 3.
You’re confusing performance with stability. See my other reply about OpenGL drivers.
I’ll take the word of a well respected former ATI employee.
You're clearly confused as to the topic you're replying to. I responded with benchmarks to someone who said performance was bad. Stability was not the topic of discussion in this subthread.
I've had both Intel and AMD CPUs, and both AMD and Nvidia GPUs (I guess you could add Intel too, for their integrated GPUs), and my AMD experience has been generally quite pleasant and problem-free.
Now that AMD has closed the gap, and passed Intel’s per core performance in the 16-core CPU market, it has a platform with more RAM capacity and more PCIe lanes along with more performance than the Intel Xeon Gold 6142M, at around a quarter of the price.
What does Intel have so far up its sleeve that AMD has to virtually give away its chips like this?
Not only is AMD providing good performance, value and features (omg, so many PCIe lanes), AMD can actually deliver their chips without the wait.
The lead time for Intel desktop models (i7-9700) is months. I know that enterprise vendors are also experiencing Intel CPU shortages and long wait times. Some discussion: https://www.reddit.com/r/sysadmin/comments/9ea8y2/intel_cant...
If you need that many cores why don't you get a dedicated server?
I don't need that many cores, as in 64 cores or 128 vCPUs; I just hope we get better pricing per core. I do want a dedicated server, but having cloud VMs is much easier for scaling. It would be great if there were a cloud VM provider that offered dedicated servers as the baseline, but so far only Vultr has that.
I can't wait for DO to offer EPYC, or hopefully Zen 2, as they are very close to launch. I need more CPU cores, but most cloud vendors offer me 1 core (actually 1 thread) and 2 GB of memory. I would much rather see a 1:1 core-to-memory config.
SQL Server switched from CPU-based licensing to core-based licensing as of their 2012 version and they included a "core factor" that reduced licensing costs if you were running on AMD cores to 75% of the cost for Intel cores to account for AMD's then-lower per-core performance.
Are you sure that lower core factor applies to EPYC?
The document http://download.microsoft.com/download/4/4/5/445627B4-9AB0-4...
mentions only old Opterons.
That applied only to certain AMD CPUs that were available in 2012. It's just an example in which there was some effort to take a "fairer" approach to per-core licensing. It's especially notable since the EPYC line is evidence of the great strides that AMD has made since then when MS just gave you a blanket discount for running SQL Server on AMD cores.
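As a worked example of the core-factor math: the 0.75 factor is the one mentioned earlier in the thread, but the 16-core box is hypothetical:

```python
# Hypothetical 16-core servers under SQL Server 2012 core licensing.
def licensed_cores(physical_cores, core_factor=1.0):
    return physical_cores * core_factor

intel_box = licensed_cores(16)        # full price: 16 core licenses
amd_box = licensed_cores(16, 0.75)    # eligible AMD parts billed at 75%

assert intel_box == 16
assert amd_box == 12
```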
OT, but isn't the entire "per core" licensing scheme completely absurd for many products?
Since we know that in any significant system the effects of latencies in various cache hierarchies, buses, networks, storage, and cross-process synchronization make an enormous difference in performance, it's easy to see that CPU performance alone is not really a good predictor of system performance.
Thus, with a price calculated per CPU core, the solution space narrows significantly, and experience tells us that solutions involving per-core licensing tend to include expensive, hard-to-support networking and storage hardware which few would buy if per-core licensing hadn't taken the alternatives away.
If you want to charge per unit of work, then figure out how to do just that, and don't charge for what is effectively a theoretical peak unit of load?
IBM does this with their PVU model, everyone HATES it because it's a logistical nightmare to deal with (except IBM, of course).
Oracle tried that. It was called "Universal Power Unit" pricing. http://houseofbrick.com/oracle-universal-power-unit-licensin...
Interesting. Why did they get rid of it? "Moore’s law could quickly make this type of licensing financially unattractive to the licensees. At the rate in which compute performance increases, the cost of a UPU license doubles about every two years. The industry quickly caught on to this and called Oracle out, which is why the UPU licensing model was very short lived." Well, clock speeds are no longer changing very much, so today I don't think there would be that much difference between "price = cores * speed" and "price = cores", so I suspect UPUs would be fine.
Currently, "price = cores" gives this weird market distortion where there's an incentive to get individually-fast, disproportionately-expensive cores to run software that might well be embarrassingly parallel. Perhaps it functions as price discrimination for those willing to put in the effort to customize systems like that. Eh, who knows.
These days, there are more viable alternatives to Oracle than there were back then. UPU was complex to administer, and had the primary purpose of extracting more money from customers. I suspect Oracle doesn't want to give customers a reason to start looking for alternatives.
I'm not sure there are really viable alternatives. If all you need is a big CRUD database, sure there are alternatives, but there always were. Anyone buying Oracle for that was throwing money away. If your enterprise is more fully committed to the Oracle platform, or if you need a support model where you can open a ticket had have people working on it 24x7, possibly on-site, until it's resolved, then no, you can't just drop in PostGres or EnterpriseDB.
Teradata is a competitor at the high end, and Microsoft SQL Server is an example at the low end where customers still want significant support. The enterprise market is not competitive the way gas stations are, but major vendors do compete.
In many ways Oracle is the low end of these solutions, but few companies actually need giant databases with thousands of drives.
Nit-pick, it's "Postgres" not "PostGres". :)
Whatever metric you use people will game it.
Sum of all frequencies, and you might see AVX-512 being used a lot or something, since the resulting downclocking would lower the metric.
Great point about core frequency being important for per-core licensing.
I wonder if any software with per-core licensing has tried to take a possibly 'fairer' approach, for example by summing the frequency of all cores? E.g. 4x cores at 2GHz is 8GHz?
It's not that straightforward a comparison, I know, just wondering if anyone has tried something different here.
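A sketch of that frequency-summed metric with made-up configurations. Note that it still ignores IPC, so very different machines can score identically:

```python
# Hypothetical "sum of core frequencies" licensing metric.
def freq_sum_ghz(cores, ghz_per_core):
    return cores * ghz_per_core

# Four slow cores and two fast cores score the same, despite very
# different per-thread performance:
assert freq_sum_ghz(4, 2.0) == freq_sum_ghz(2, 4.0) == 8.0
```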
Intel will likely respond with the minimum amount of changes that combined with their brand inertia will result in maximum revenue for them.
That, at first, will likely seem like the "rational" way to balance this "new competition" from AMD with keeping investors happy, but they're forgetting one thing -- in that equation the "brand inertia" they love to take advantage of is going to erode with each such "minimal response", until there is none or very little left.
What I'm trying to say is that consumers will put up with a company releasing sub-par/low-value products because they "trust the brand" only for so long, before they give in and start embracing the competition's brands -- as they should.
I mean, unless you, as a consumer, suffer from the Stockholm syndrome, you shouldn't be rewarding Intel for being forced to lower prices on some of its products or add more cores, just like you shouldn't have rewarded Comcast for offering fiber in places where Google Fiber arrived. You should be rewarding the competitor that caused that to happen -- that is if you'd still like that competition to continue in the future.
In Google Fiber's case, that competition disappeared because people were unwilling to reward it and stuck with Comcast/AT&T. And now they'll suffer from it for another decade or so, until another major disruption/competition appears (SpaceX satellites maybe?).
Intel has created a situation for themselves where they seem to have forgotten how to care about the purpose of innovation.
I love how you are just making up some possible reaction by Intel, and then lambasting them for an atrocious choice they did not make.
If you continue to post uncivil and/or flamewar comments, we are going to ban you. I just warned you about this.
This isn't completely made up. For example, in the desktop market AMD went from 4 cores to 8 to (rumored) 16 while Intel does 4, 6, 8, 10.
Nobody is selling them in quantities/systems that could in any way threaten Intel. AMD might have great tech right now (and possibly even better with Zen 2), but it won't help them financially if nobody can buy them, or only in offerings overall inferior to Intel's. EPYC still has a lot to overcome in the DC/server space. I am happy with my TR in a deep learning machine (all the PCIe lanes for multiple GPUs are giving me insane value), but server contracts are way more complicated than the enthusiast space, and Intel has firm ground there.
Amazon, Microsoft, and Baidu would have to disagree with you, since they all have already done large Epyc implementations.
Also, I can go on CDW right now and get an Epyc server, no problem.
Availability isn't an issue for AMD.
...and that's why AMD didn't miss revenue forecasts, right? EPYC was pretty anemic, I would have expected an explosion in sales with such a product, not the underwhelming sales performance it experienced. There are obviously other factors holding it back.
How do you know AWS/Azure etc. aren't using them just for price haggling with Intel, as was done with AMD all the time in the past? The fact that you can get EPYC servers doesn't mean they are widespread anyway.
>I would have expected an explosion in sales with such a product
That is just an unrealistic expectation in the server space. These aren't consumer products where adoption is fast. Businesses don't upgrade their infrastructure as quickly as consumers, and when they do decide to upgrade, it takes months of planning. No company is going to jump ship to AMD when their current servers aren't fully depreciated by their accounting standards and they still have several years left on their support contracts.
The EPYC sales will come, but not overnight.
EDIT: And I am not sure why you are so disappointed here. Epyc sales and adoption have been in line with what AMD has given as guidance. Why would you expect adoption to wildly exceed AMD's own guidance? I think AMD had aggressive but realistic guidance; so far Epyc has been a great success and will only continue to chip away at Intel. Also, I'm not sure what revenue miss you are referring to. Overall, AMD beat their expected earnings per share by 1 cent last quarter. If there was a slight miss on Epyc sales, then they made up for it somewhere else, but it must not have been a very big miss, otherwise they would have missed the EPS target.
Obviously this dude bought the FOMO at $34 and is now mad he's stuck bag-holding. I dare you to explain it otherwise.
They missed revenue because of video cards affecting them more than they expected; look at Nvidia, which expects backstock stretching several quarters into next year. Stop spreading FUD: their CPU revenue, IP, etc. are right on target, where they said they would be.
2% market share for EPYC. Do you think I bought Threadripper to spread FUD about AMD?
I read it on optocrypto.com it must be true!
The source for that is literally stated in the article. Also, I hold no positions/stocks in the red, green, or blue team.
Amazon seems to be pushing AMD-based instances lately, at a lower cost than their Intel-based counterparts. So things are definitely changing in this space.
All providers, including Azure.
Google Cloud Platform?
*not google cloud platform
Long-term supply contracts: Intel has many gigabucks a year in revenue from them. This is what AMD has to attack.
On the tech side, AMD is going for a continuous increase in density. A lot depends on the cooperation of memory makers. Cheap HBM2 or HBM3 on package with the CPU will probably be the only thing that endangers Intel's position.
Linked from the article: https://www.servethehome.com/intel-is-serving-major-xeon-dis...
Will be interesting to see the Intel response.
AMD is doing really well.
They already have a foothold in the server market: AWS just released new families of instance types based on AMD CPUs.
AMD needs to go as low as possible, get a foothold in the server market then start increasing prices.
They stumbled on 10nm is what happened. They expected their manufacturing prowess to continue to let them build massive dies.
So now they're stuck with CPU architecture designs that mandate huge dies to scale up the core counts, which they can't manufacture with anything close to reasonable yields on 14nm.
Rock, meet hard place.
AMD by contrast designed with the expectation that they couldn't make big chips, so went with a "glue a bunch of small ones together" design. Which seems to have played out stunningly well for them. Now they can bin the golden cores, slap them together, and ramp the clocks.
It's not so much Intel were ahead in terms of architecture design, but rather AMD was way behind. Bulldozer was a disaster, and Zen is a ground-up redesign that means AMD have a decent core at the level of Intel's again.
But AMD have two advantages Intel don't. Their Zen architecture is designed for multi-chip module scalability, so they can deliver higher core counts at much better yields (especially important on new process nodes!) and thus manufacturing costs than Intel's monolithic designs. And AMD uses 3rd-party fabs that, unlike Intel, are already doing great on the new process node.
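The yield advantage is worth quantifying. As a rough illustration (not AMD's actual numbers), a first-order Poisson yield model shows why four small dies fare far better than one monolithic die of the same total area; the die sizes and defect density below are assumptions for the sketch:

```python
import math

def poisson_yield(area_mm2, defects_per_mm2):
    """First-order Poisson yield model: fraction of dice with zero defects."""
    return math.exp(-area_mm2 * defects_per_mm2)

D0 = 0.002  # assumed defect density (defects per mm^2), illustrative only

monolithic = poisson_yield(700, D0)      # one big ~700 mm^2 server die
chiplet    = poisson_yield(700 / 4, D0)  # one ~175 mm^2 chiplet-style die

print(f"monolithic die yield: {monolithic:.1%}")  # ~24.7%
print(f"per-chiplet yield:    {chiplet:.1%}")     # ~70.5%
```

With the same assumed defect density, quartering the die area roughly triples the fraction of defect-free dice, and dice with a broken core can still be salvaged as lower-core-count parts.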
2019 will be a reckoning for Intel.
> Bulldozer was a disaster
I dunno if it was as much a "disaster" as it was Intel's Sandy Bridge doing so well.
Intel's Sandy Bridge (2600k / 2700k) were HUGE improvements back in 2011. Bulldozer was a step wider (the cores were roughly the same as K10, but you'd get 8 cores instead of 4), and Piledriver / Steamroller incrementally improved on the formula.
Bulldozer managed to increase core counts from ~4 (AMD Phenom) to 8. Sure, the 8 "cores" of Bulldozer shared a decoder and were perhaps more appropriately a 4-core with hyperthreading... but it was still a core-to-core improvement compared to K10.
But the incremental upgrades to AMD's K10 were just no match for the 20%+ boosts that Intel was doing with their Sandy Bridge architecture. Ultimately, Intel's Hyperthreads (4c/8t) were roughly the same as AMD's "8 core (4-decoders)" setup... because Sandy Bridge was just so far ahead of the game.
Bulldozer was less performant than its predecessor once you factor in the process node. That's why it was a disaster: AMD managed to design a worse processor and were stuck with it for years. They did make it less bad over time, but it still sucked.
This. Bulldozer was (ironically) AMD's Netburst moment. They made a speculative play ("modules" in the case of Bulldozer, long pipelines in the case of Netburst) to chase high core counts/clock speeds (respectively) and the technology didn't end up panning out like they expected on top of performing worse out-of-the-gate than the architectures they were meant to succeed.
The difference is that Intel had the cash and political clout to wait out Netburst and force the market to take it while Bulldozer nearly killed AMD (which only held on thanks to its GPU division, itself now in crisis due to lack of competitive products due to under-investment).
As others have pointed out, it's not entirely about Intel doing badly, but rather Intel putting out a mediocre performance and AMD finally really succeeding.
And while the various technical reasons for that are interesting, I do think that at least some of the credit/blame needs to go to the leadership. AMD had a string of bad leaders who drove the company to near bankruptcy, but then they hired Lisa Su, who is basically a superstar engineer-turned-manager who has excelled in everything she's touched, from very low-level transistor research to leading large teams (and now the entire company). At the same time, Intel hasn't had a good CEO for a long while now, with the company's top leadership chasing the latest fads and spending billions on weird acquisitions that only get written off later, while the key areas of the business are not doing nearly as well as they did under previous leadership.
AMD was essentially dead in the water in the CPU market, and struggling in the GPU market.
Then Zen happened, and this current resurgence.
Companies aren't dead until they're dead. Intel has fingers in a lot of pies and good revenue streams from all over the place.
Yeah, if AMD could run on fumes for the better part of a decade (without ever having dominated the market, just offering reasonable alternatives to the market leader), I'm pretty sure the 40-year market leader can survive a bad arch refresh cycle or two.
Intel isn't finished, not by a long shot. It's very likely they are working on a chiplet design of their own.
Next iteration is hard. Breakthroughs are even harder. When Intel (or whoever) does it, they will have years of advantage - unless the information leaks. (And it probably will. At the latest when the true next gen will hit the shelves.)
Other than that, it's business as usual. The news is just shiny mirrors, spectacle. We still don't know how durable AMD's "luck" is, how sales, stock price, and other relevant numbers will ebb and flow, and so on.
They still sell more processors than AMD. They need like five years of bad performance to give away any market share. Sadly.
So what happened to Intel that they’re now no longer way ahead in both performance and manufacturing technology? They’re being squeezed from all sides, and don’t seem to be pulling ahead...
Small nitpick, but if I'm reading this correctly, the EPYC 7371 is a 16-core part. The Threadripper 2950x with 16 cores and the 2970wx with 24 cores both have full speed memory access for all cores. It is only the 32 core 2990wx that has half the cores running without direct access to memory. Do correct me if I'm wrong.
You are missing important details. The fundamental building block of AMD is the CCX: CPU Complex. AMD's design has 2-CCX per manufactured die, and then they glue dies together. Each CCX is manufactured with 8MB L3 cache + 4-cores.
All of AMD's chips are combinations of this "Zeppelin" die. AMD Ryzen is 1xdie, Threadripper is 2x or 4x dies, while EPYC is always 4x dies per chip.
Threadripper is 2-dies of (4+4)x2 == 16 total cores. The 2970wx is 4-dies of (3+3)x4 == 24 total cores (1-core broken in each CCX). Ryzen 2600X is a single die of (3+3) == 6 core.
The EPYC 7371 is the "most broken" of them all, (2+2)x4 == 16 total cores. But each 2-core CCX has the full 8MB of L3 cache to work with still, so it ends up being a good performer anyway. And I guess with so many cores broken, they can send a lot of power into the part and ramp up the GHz.
In short: the 7371 can command a low price because it's technically an incredibly defective design. Of the 32 cores that were attempted to be manufactured, only 16 passed quality control. AMD then configures them to run at a high GHz and... what do you know? It performs pretty well.
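The per-SKU arithmetic above can be written out as a quick sanity check (die and per-CCX core counts as described in this thread; the helper function is just for illustration):

```python
def total_cores(dies, cores_per_ccx, ccx_per_die=2):
    """Each Zeppelin die carries two CCXes; salvaged parts fuse off cores per CCX."""
    return dies * ccx_per_die * cores_per_ccx

parts = {
    "Ryzen 2600X":         total_cores(dies=1, cores_per_ccx=3),  # (3+3)x1 = 6
    "Threadripper 2950X":  total_cores(dies=2, cores_per_ccx=4),  # (4+4)x2 = 16
    "Threadripper 2970WX": total_cores(dies=4, cores_per_ccx=3),  # (3+3)x4 = 24
    "EPYC 7371":           total_cores(dies=4, cores_per_ccx=2),  # (2+2)x4 = 16
}
for name, cores in parts.items():
    print(f"{name}: {cores} cores")
```

Note how the 7371 and the 2950X reach the same 16 cores by opposite routes: fewer dies with healthy CCXes versus more dies with heavily fused-off CCXes.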
Thanks for the detailed breakdown, it makes a lot more sense now.
The 24-core Threadripper doesn't have full memory bandwidth from all dies. It's built from 4 dies with 6 enabled cores each, and as with the 32-core Threadripper, only 2 of those dies have memory channels enabled.
You're right about the 16-core Threadripper being built from 2 dies, though the EPYC should still be able to do better at I/O, since I believe it's built from 4 dies, all of which have active memory controllers. So you basically get 2x the memory bandwidth.
Direct access is half the equation; 8 memory channels vs. 4 means Threadripper gets half the memory bandwidth.
You could be right. I was under the impression that all mainstream parts even the threadripper ones were artificially handicapped.
EPYC supports up to 8 memory channels, whereas Threadripper supports only 4.
Sure, but that's 32-core EPYC (4 dies, each die has 2 channels). Does this 16-core part use 2 dies or 4 dies (with half of each die's cores crippled)?
All EPYC CPUs support 8 memory channels and the full set of PCIe lanes. AMD has chosen to segment the market so that every CPU plugged into every existing SP3 motherboard can make full use of all its features.
This CPU has 4 dies, each of which have 2 CCX, each of which have 2 cores and 8MB of cache.
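To see what those channel counts mean in practice, here is a back-of-the-envelope peak bandwidth calculation, assuming DDR4-2666 and a 64-bit (8-byte) bus per channel; both figures are assumptions for the sketch, not measured numbers:

```python
def peak_mem_bandwidth_gbs(channels, mt_per_s=2666, bus_bytes=8):
    """Theoretical peak: channels x transfer rate (MT/s) x bus width (bytes)."""
    return channels * mt_per_s * 1e6 * bus_bytes / 1e9

epyc = peak_mem_bandwidth_gbs(8)  # all 4 dies have active memory controllers
tr   = peak_mem_bandwidth_gbs(4)  # only 2 dies wired to DIMMs on Threadripper

print(f"EPYC:         {epyc:.1f} GB/s")  # ~170.6 GB/s
print(f"Threadripper: {tr:.1f} GB/s")    # ~85.3 GB/s
```

Theoretical peak only, but it matches the thread's point: doubling the active channels doubles the ceiling.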
Great value here. Unlike the desktop chips, all modules get full memory bandwidth. I'd get one for sure.
I have the thinkpad a485 with ryzen pro. It supports ECC but I haven't tried that yet.
I would hold out, though. AMD is really bad at managing drivers for the video card. It's supposed to be fixed soon, but there is also a new chip on the horizon.
Battery life is about 8 to 10 hours depending on what I'm doing.
Now, does anyone sell them in a laptop, with ECC memory? I have no idea. But that's the APU (CPU+GPU) mobile part you'd want.
If AMD came out with a platform with decent power/battery/performance and ECC memory available in a laptop it would go to the front of my next purchase list.
For realtime audio processing, Intel still wears the crown. Hopefully AMD will get closer, so that Intel gets on with improving.
And what would be an Insane value? Something like "I want to see my enemies driven before me, hear the lamentations of their women," something like that? :)
You'd have to ask Crazy Eddie https://www.youtube.com/watch?v=4yYGoO5imyY
IIRC, his prices were IRS-trouble insane.
Just to be clear, is your main point about grammar? I'm genuinely unsure.
And 'a savings of $...'
By Grabthar's hammer, what a savings!
I am not very good at article positioning. Care to share more?
> AMD EPYC 7371 Pricing Update [Is] An Insane Value
Off topic, but I can't take it anymore. Enough with "it's a good value". "A" good value? "Excellence deserves admiration" is a good value. $1550 for an EPYC 7371 is just ... good value.
> AMD EPYC 7371 Pricing Update [Is] Insanely Good Value