Wednesday, May 21st 2025

AMD Updates ROCm to Support Ryzen AI Max and Radeon RX 9000 Series
AMD announced an update to its Radeon Open Compute (ROCm) platform that adds hardware acceleration support for the Ryzen AI Max 300 "Strix Halo" client processors and the Radeon RX 9000 series gaming GPUs. For the Ryzen AI Max 300 "Strix Halo," this unlocks the compute power of its 40 RDNA 3.5 compute units, with their 80 AI accelerators and 2,560 stream processors, in addition to the AI-specific ISA of the up to 16 "Zen 5" CPU cores, including their full-width 512-bit FPU for executing AVX-512 instructions. For the Radeon RX 9000 series, it means putting the up to 64 RDNA 4 compute units, with up to 128 AI accelerators and up to 4,096 stream processors, to use.
AMD also announced that it has updated the ROCm product stack with support for the major Linux distributions, including openSUSE (available now), Ubuntu, and Red Hat EPEL, with the latter two getting ROCm support in the second half of 2025. Lastly, ROCm gets full Windows support, including PyTorch and ONNX-EP. A preview of the PyTorch support can be expected in Q3 2025, while a preview of ONNX-EP could arrive in July 2025.
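As a point of reference, ROCm builds of PyTorch expose the GPU through the usual torch.cuda API, so a quick sanity check on a supported card looks roughly like the sketch below; the same flow should apply to the Windows preview once it ships, but treat that as an assumption until it is actually out.

```python
# Minimal sanity check for a ROCm-enabled PyTorch build (Linux today,
# Windows once the Q3 2025 preview lands). ROCm builds reuse the
# torch.cuda API surface, so no HIP-specific calls are needed.
import torch

print("HIP/ROCm runtime:", torch.version.hip)        # None on CUDA/CPU-only builds
print("GPU visible:", torch.cuda.is_available())

if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))   # e.g. a Radeon RX 9070 XT
    x = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
    y = x @ x                                          # dispatched through the ROCm stack
    print("Matmul OK:", y.shape)
```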
24 Comments on AMD Updates ROCm to Support Ryzen AI Max and Radeon RX 9000 Series
rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html
There is no doubt about what AMD does: copy-paste Nvidia tech. At some point they will get sued.
I don't know much about how this works, but I do know AMD would want CUDA workloads and software to be easily translated to ROCm with little to no performance penalty; if the two are similar enough, they could just do this.
I also know FSR 4 is a copy-paste job, and the recently announced FSR Redstone is also a copy-paste: they call it machine learning ray regeneration, Nvidia calls it Ray Reconstruction. AMD will introduce neural radiance caching; Nvidia did that in 2021 for path tracing.
Do you see a pattern here? RDNA 4 has everything; it just needs the ML models and support.
FSR 4, for example, uses both CNNs and transformers. In the end, both DLSS and FSR 4 are based on image reconstruction techniques that have been on the market for a long time.
RDNA 4 still doesn't use matrix cores as beefy as CDNA 3's (which are much stronger than Nvidia's equivalents); UDNA is where that changes.
By the above logic, programmable shaders are a copy-paste job too, and Nvidia and AMD should have sued each other over that long ago, but alas, they didn't.
This will change next gen with UDNA (unified-DNA).
BOINC uses OpenCL, and (afaik) the OpenCL driver on Windows has nothing to do with ROCm or anything of the sort. Just because FSR 4 uses machine learning, it doesn't mean that anything that can run "machine learning" can suddenly run FSR 4. FSR 4 requires specific operations to run, likely ones that are exclusive to RDNA 4; no equivalent has been made for previous generations, nor has it been validated on them.
The OP also has nothing to do with FSR 4 whatsoever; ROCm is a stack of drivers and runtimes to train and run models in a generic manner, from a developer's point of view rather than an end user's.
As said above, CDNA is a totally different thing from the consumer offerings, and ROCm has had far better support for those products than for the consumer ones.
Yes, the RDNA 3 and 4 cards have matrix cores; they are just more limited than the CDNA cards', and the RDNA 4 cores are beefier. AMD calls them AI cores, and they are accessed via the WMMA instructions.
Wave Matrix Multiply Accumulate, so... same same but different. It's like saying the V100 doesn't have tensor cores because they are insanely limited compared to modern tensor cores.
9070 XTs have been usable on Windows via LM Studio; AMD has a small subset of HIP applications from ROCm on Windows, as well as supporting accelerated code paths through Microsoft's APIs. That said, they are bringing full ROCm to Windows in Q3? I expect delays. It finally gets official ROCm support on Linux now (active as of yesterday), as does Strix Halo.
IDK where you got the idea that FSR 4 requires "900 TOPS"; that seems like nonsense, especially given that a 9070 XT won't be reaching anywhere near that in FP8. That's the number with sparsity. Without sparsity the 9070 XT does ~390 TFLOPs FP8 / TOPS INT8. Double that for INT4, half it for FP16 (so ~195 TFLOPs FP16 vs. ~120 for the 7900 XTX).
Btw, those numbers from AMD are really misleading, since they assume one is able to dual-issue VOPD or pack stuff into a Wave64, which is often not the case. HIP even has Wave64 disabled for any RDNA GPU (source), so you can pretty much halve all of those compute numbers for any RDNA product. RDNA 3 does not have any kind of tensor/matrix cores: the WMMA instructions are executed through the regular ALUs, not a separate unit, as is listed clearly in the programming guide. RDNA 4's ISA guide has no such note, but I could not get clear info on whether RDNA 4 has proper new units for that or if they're just beefed up, so I'll leave it at that.
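To put rough numbers on that, here's a back-of-the-envelope sketch of where figures like ~195 TFLOPs FP16 and ~390 FP8 come from; the per-CU throughputs and the ~2.97 GHz boost clock are illustrative assumptions, not official specs, and as noted above you can roughly halve the vector figure if dual-issue/Wave64 can't actually be used.

```python
# Back-of-the-envelope peak math-throughput estimate for an RDNA 4 part
# (numbers here are illustrative assumptions, not AMD specs).
BOOST_CLOCK_GHZ = 2.97   # assumed boost clock
COMPUTE_UNITS = 64       # RX 9070 XT-class part

# Assumed dense WMMA ops per CU per clock (each FMA counts as 2 ops)
OPS_PER_CU_PER_CLK = {
    "FP16": 1024,
    "FP8":  2048,
    "INT8": 2048,
    "INT4": 4096,
}

for dtype, ops in OPS_PER_CU_PER_CLK.items():
    dense = COMPUTE_UNITS * ops * BOOST_CLOCK_GHZ * 1e9 / 1e12
    print(f"{dtype}: ~{dense:.0f} T(FL)OPS dense, ~{dense * 2:.0f} with 2x structured sparsity")

# Plain shader-core FP32 for comparison: 4096 SPs x 2 (FMA) x 2 (dual-issue VOPD)
fp32 = 4096 * 2 * 2 * BOOST_CLOCK_GHZ * 1e9 / 1e12
print(f"FP32 vector (assuming perfect dual-issue): ~{fp32:.0f} TFLOPS")
```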
The V100 does have proper tensor cores, and those are different units from the regular scalar/vector ALUs, meaning you can dispatch independent instructions to the tensor cores and the math units in tandem. With either Vulkan or OpenCL, afaik (or even DirectML if you want worse-than-CPU perf). There's no ROCm support on Windows for RDNA 4 in either LM Studio or llama.cpp/ollama.
Linux support has been added recently. Are you talking about DirectML? If so, that's pretty much useless given the low perf. The sad part is that the performance for Strix Halo is really underwhelming due to the lackluster software.
AMD's INT4 numbers use 4:2 structured sparsity to hit that 1,500 TOPS figure. 390 TFLOPs FP8 dense is not bad at all; puts it at MI250X performance level... with a fraction of the annoying complexity those things have.
It is a physical core, but it is part of the regular compute units, not a separate matrix/tensor core; shared scheduler and everything. There is a physical aspect to it that goes beyond an instruction set.
AMD has some tuned versions of LM Studio; I don't know what they're using. I'm assuming it's DirectML, but it performs much better than DirectML is known to.
AMD's Windows support has been ROCm through WSL and very, very specific targeted application support.
rocm.docs.amd.com/projects/install-on-windows/en/latest/
I stay on Linux.
I am more hardware than software, though I have been through the ISAs; I am usually digging for specific things.
I am curious where your work is focused.
Support is allegedly there in the driver, even if the guide is outdated, listing ROCm 6.3.4 vs. the current 6.4.1.
Though as with anything, documentation is theory until someone shows it in practice.
So that's why they won't do FSR 4 on RDNA 3: it can run, but the cost would be too high.
RDNA 3 has matrix (AI) cores, but such operations are mostly done on the shader cores, with the AI cores only being there as an aid. They also lack specific instructions needed for FSR 4.
RDNA 4 has matrix (AI) cores that are now a sort of separate entity from the shader cores; they are also a lot beefier than the ones on RDNA 3 and have the instructions needed for FSR 4.
This is my understanding of it.
And yeah, you bring up a good point: the raw FLOP number is meaningless, otherwise the 9060 wouldn't make the cut. But I can reach Nvidia's theoretical numbers: doing a mamf-finder run I can reach 80~90% of Nvidia's stated performance for my 3090, and with mmpeak I can even surpass it a bit (likely due to my higher boost clocks).
I asked a friend of mine to do a mamf run on their 9070 XT; they only reached ~110 TFLOPs out of the ~195 theoretical number, so about 60% of what AMD claims, which is really on par with the results other folks have achieved with Instinct cards, whereas with Nvidia it's the norm to hit 80~90%, as I've experienced myself. My point is that you won't be getting anywhere close to those numbers due to the software stack; that's widely known. They kinda can: that's the whole "dual issue" thing when doing sheer FP math, the throughput is indeed doubled compared to Turing. But yeah, still kinda misleading, since int+float still runs at the older rate.
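For anyone curious what a run like that actually measures: mamf-finder essentially times large GEMMs and divides by the theoretical peak. A minimal sketch of the same idea (this is not the actual tool; the shape, dtype, and the 195 TFLOPs reference are just the numbers quoted in this thread):

```python
# Minimal "achievable matmul FLOPs" probe in the spirit of mamf-finder
# (a simplified sketch, not the actual benchmark).
import time
import torch

M = N = K = 8192                                  # illustrative GEMM shape
a = torch.randn(M, K, device="cuda", dtype=torch.float16)
b = torch.randn(K, N, device="cuda", dtype=torch.float16)

for _ in range(5):                                # warm-up
    a @ b
torch.cuda.synchronize()

iters = 50
t0 = time.perf_counter()
for _ in range(iters):
    a @ b
torch.cuda.synchronize()
elapsed = time.perf_counter() - t0

achieved_tflops = 2 * M * N * K * iters / elapsed / 1e12
theoretical_tflops = 195.0                        # e.g. the FP16 dense figure quoted above
print(f"{achieved_tflops:.1f} TFLOPs achieved, "
      f"{100 * achieved_tflops / theoretical_tflops:.0f}% of theoretical")
```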
But that's another point AMD is failing on: their Wave64 is outright disabled throughout their stack, and their compiler has a really hard time emitting dual-issue VOPDs. It is a physical unit, but not a very capable one.
It also has shared ports, which limits throughput. Part of the CU, yes, but it still allows the scheduler to issue instructions to the FP block independently, which improves things in the end.
The unit itself is more capable as well. We could nerd out about it, but what matters is the final throughput you manage to reach.
If AMD can achieve good numbers in practice with those built-in units, then great! But so far that hasn't been the case. No, that's the theoretical performance for Strix Halo.
With ROCm they achieved 5.1 TFLOPs, but with a custom Docker image they did manage 36.9 TFLOPs (64.4% efficiency, even a bit above the average when it comes to AMD products).
It does manage to make better use of memory bandwidth though, at 70~73% efficiency, which is not bad at all: llm-tracker.info/_TOORG/Strix-Halo
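That bandwidth-efficiency number can be eyeballed the same way with a big device-to-device copy; a minimal sketch, assuming Strix Halo's ~256 GB/s theoretical LPDDR5X figure (adjust for whatever part you're actually testing):

```python
# Rough device-memory bandwidth check via a large tensor copy (PyTorch).
# THEORETICAL_GBPS is an assumption (256-bit LPDDR5X-8000 ~= 256 GB/s).
import time
import torch

THEORETICAL_GBPS = 256.0
N = 1 << 28                       # 268M fp32 elements ~= 1 GiB per buffer

src = torch.randn(N, device="cuda", dtype=torch.float32)
dst = torch.empty_like(src)

for _ in range(3):                # warm-up
    dst.copy_(src)
torch.cuda.synchronize()

iters = 20
t0 = time.perf_counter()
for _ in range(iters):
    dst.copy_(src)
torch.cuda.synchronize()
elapsed = time.perf_counter() - t0

# Each copy reads and writes the buffer once: 2x the bytes per iteration
gbps = 2 * src.numel() * src.element_size() * iters / elapsed / 1e9
print(f"~{gbps:.0f} GB/s achieved, {100 * gbps / THEORETICAL_GBPS:.0f}% of theoretical")
```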
It should be Vulkan; DirectML would be shit, as you mentioned. Same tbh. I used to work with embedded systems in the auto industry but jumped into data because $ and work from home :laugh:
But I'm a bit all over the place professionally: my master's is in computer vision with MTL models (which should also be the focus of my PhD next year), and I work as a data/ML engineer for startups, so I go from building backends and large-scale distributed pipelines to deploying models for "cheap" within k8s. That'd be a good tl;dr all things considered, yes. For RDNA 3 there would also be the cost of "porting" the model to work within its capabilities. Minor nit: the AI cores ARE the shader cores; otherwise, correct.
I thought the 59 TFLOPs was the mamf result, not the theoretical number, my bad.
With hipBLASLt it hit 60%; I get the feeling there is an architectural bottleneck as much as a software one. rocBLAS is, yeah, bad, kinda known to be.
Would you mind sending me what you sent your friend? Curious to see if the official support changes anything on the 9070 XT, or if my MI100 hive does better with the IF linkage.
AMD has lots of open spots they are trying to fill... I keep getting hit up for stuff like... (This is a position within the AI GPU Software Group (AGS) responsible for AMD's ML SDK initiatives, with focus on development within the ROCm Profiling Tools) I keep replying, please find someone lol.
I have been using ROCm since the Vega 64, and it's been broken for a lot of that time. It makes massive strides forward, but not in a universal sense, rather in a focused one.
If you are doing particular things, there are competitive solutions to be had with Instincts; if you are not... SOL.
They are working on training rn, but it needs an architectural switch, and UALink really; the IF links aren't stout enough for traditional training.
I went from pre-sales engineering at a top-3 server vendor for HPC-AI gear to an edge-AI startup. The embedded world is fun, but I missed HPC, and my homelab was strongly HPC, which got me back into that world.
I think the issue with Strix Halo is the same as with all their other GPU products: a bad software stack. Sure. FWIW, they used the latest ROCm on Linux with the official changes, since the benchmark relies on PyTorch.
Here's the code we used:
github.com/shisa-ai/mamf-finder
More info on the benchmark itself and other results if you're interested:
github.com/stas00/ml-engineering/blob/master/compute/accelerator/README.md
github.com/stas00/ml-engineering/blob/master/compute/accelerator/benchmarks/README.md Too bad most of those are not remote, worldwide positions :p If you have the expertise and enough throughput to get your hands dirty, you can make great use of AMD products.
But if you are more focused on getting an overall project to work in a fast manner, yeah, you're SOL and better off with any other offering.
Fun fact: AMD will be moving most stuff to SPIR-V, so we won't have that mess of multi-gigabyte stacks with compiled code for each µarch. The downside is that's likely to break tons of things in the near future. Yeah, multi-node scaling is not great on their end, and I don't think UALink will be ready in a timely manner.
Even for inference the hardware is really underutilized ATM due to all those software limitations, which is sad given that the hardware itself is capable of much more.
They continue to improve but are clearly behind, yeah... PyTorch is very sensitive to tuning; it's likely there are performance gains to be had there. So true... though that last one was fully remote. I just have no desire to create profiling tools.
semianalysis.com/2025/04/23/amd-2-0-new-sense-of-urgency-mi450x-chance-to-beat-nvidia-nvidias-new-moat/
It is improving, and at a really nice pace, but it's still mostly focused on their Instinct lineup (so RDNA gets scraps), and Nvidia is not slowing down by any means, so the bar is ever increasing. Indeed, but it's up to AMD to provide good kernels within their HIP backend. They take too long to provide good kernels, and given all of Nvidia's moat, any 3rd-party kernel will be CUDA-first, especially given how easy it is for one to get a GeForce GPU, mess around with some stuff, and then trivially spin up an H100/H200/B200 instance in any random cloud to port that kernel over (think stuff like ThunderKittens).