Everything old is new again. Popular High-Level Synthesis compilers like to generate systolic arrays, because they map so well to FPGA architecture.
I believe Google's TPUs use a systolic array architecture.
It seems that systolic arrays trade flexibility for performance. This makes me wonder in what way TPUs are less flexible than GPUs.
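To make the trade-off a bit more concrete, here is a toy software simulation of an output-stationary systolic array doing a matrix multiply. It is only a sketch under my own assumptions, not how any TPU or HLS tool actually implements it: the function name `systolic_matmul`, the register layout, and the skewed input schedule are all made up for illustration. The point is that each processing element only ever talks to its left and upper neighbours and does one multiply-accumulate per cycle, so the dataflow is fixed by the wiring.

```python
def systolic_matmul(A, B):
    """Simulate C = A @ B on an n x m grid of multiply-accumulate cells.

    Hypothetical illustration: A is n x k, B is k x m, given as lists of lists.
    Rows of A enter from the left, columns of B enter from the top, and
    cell (i, j) accumulates C[i][j] in place (output-stationary).
    """
    n, k, m = len(A), len(B), len(B[0])
    acc = [[0] * m for _ in range(n)]    # running sum held in each cell
    a_reg = [[0] * m for _ in range(n)]  # values travelling left to right
    b_reg = [[0] * m for _ in range(n)]  # values travelling top to bottom

    # The last useful product reaches cell (n-1, m-1) at cycle n + m + k - 3.
    for t in range(n + m + k - 2):
        # Shift A values one cell to the right; row i is fed with an i-cycle delay.
        for i in range(n):
            for j in range(m - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
            s = t - i
            a_reg[i][0] = A[i][s] if 0 <= s < k else 0
        # Shift B values one cell down; column j is fed with a j-cycle delay.
        for j in range(m):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
            s = t - j
            b_reg[0][j] = B[s][j] if 0 <= s < k else 0
        # Every cell does one multiply-accumulate on whatever it currently holds.
        for i in range(n):
            for j in range(m):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc


if __name__ == "__main__":
    print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Nothing in the loop body depends on the data, only on the cycle count and the cell position. That regularity is what makes the hardware cheap and fast, and also why a problem that does not look like a dense matrix product is awkward to map onto it.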
HT did work on a systolic machine with Intel, the iWarp: https://en.wikipedia.org/wiki/IWarp
From what I can recall, there was never a magic compiler to lay out arbitrary problems on the 2D mesh. There was a 2D image processing framework that (I think) presented a SIMD-like kernel model.
I have long wondered about systolic arrays. It used to be hard to find out much about them because they were almost always military.