MLIR: Crossing the CUDA Moat
- aubrey3218
- Jan 28
- 3 min read

Market Dominance
By the end of 2025, news outlets reported that Nvidia held 92% of the GPU market, a significant climb from 2014, when its share sat at 65%. This dominance wasn't built on hardware alone; it was secured by CUDA and a relentless pace of hardware iteration. CUDA allows Nvidia to roll out new architectures while maintaining seamless backward compatibility across generations. Since 2023, Nvidia has widened its lead by adding specialized support for reduced-precision floating-point formats such as FP16, FP8, and FP4, accelerating both training and inference.
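To make the precision story concrete, here is a minimal sketch of FP16 mixed-precision training in PyTorch. It assumes a CUDA-capable GPU and a recent torch install; FP8 and FP4 additionally require newer hardware and vendor libraries (such as Nvidia's Transformer Engine), so only FP16 autocast is shown.

```python
import torch

# Sketch: FP16 mixed-precision training with PyTorch autocast.
# Assumes a CUDA-capable GPU; numbers of layers/sizes are arbitrary.
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()  # rescales gradients to avoid FP16 underflow

x = torch.randn(64, 1024, device="cuda")
target = torch.randn(64, 1024, device="cuda")

optimizer.zero_grad()
with torch.autocast(device_type="cuda", dtype=torch.float16):
    # Forward pass runs eligible ops in FP16, keeping FP32 where needed.
    loss = torch.nn.functional.mse_loss(model(x), target)
scaler.scale(loss).backward()  # backward pass on the scaled loss
scaler.step(optimizer)         # unscales gradients, then updates weights
scaler.update()
```

The pattern is the same one the major frameworks use internally: run the bulk of the arithmetic in a narrower format, and compensate for the reduced dynamic range with loss scaling.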
The Language of Machine Learning
Python, created by Guido van Rossum and first released in 1991, has spent decades evolving into the "king of ML." What began as a favorite for researchers grew into a massive ecosystem of libraries like NumPy, pandas, and scikit-learn. Today, it is the undisputed interface for PyTorch and TensorFlow. Languages like Scala, Java, and F# have attempted to unseat it, but Python is still the king of ML. When ML engineers think about training or running models, they think Python.
The Python Bottleneck
Despite its popularity, Python has two significant weaknesses: multi-threading and execution speed. Designed for readability and extensibility, Python was built to call native C and C++ libraries easily. This is why, under the hood, PyTorch and TensorFlow are essentially C++ engines driving Nvidia's CUDA kernels.
The general rule is simple: if it needs to be fast, it can't be pure Python. If matrix mathematics were implemented in native Python rather than CUDA-accelerated C++, performance would likely drop by a factor of 100 or more. For a decade, the community has called for a redesign of the Global Interpreter Lock (GIL) and a formal language specification. However, achieving 100% compatibility while modernizing the core is difficult because the ecosystem leans so heavily on legacy C extensions that assume the GIL exists.
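To make that "100x" claim concrete, here is a rough benchmark sketch comparing a pure-Python triple-loop matrix multiply against NumPy's BLAS-backed equivalent. The exact numbers are machine-dependent and illustrative only, but on typical hardware the gap is two to three orders of magnitude even before a GPU enters the picture.

```python
import time
import numpy as np

def matmul_pure_python(a, b):
    """Naive triple-loop matrix multiply on nested lists."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = 0.0
            for p in range(k):
                s += a[i][p] * b[p][j]
            out[i][j] = s
    return out

N = 200  # kept small so the pure-Python version finishes quickly
A = np.random.rand(N, N)
B = np.random.rand(N, N)

t0 = time.perf_counter()
matmul_pure_python(A.tolist(), B.tolist())
py_time = time.perf_counter() - t0

t0 = time.perf_counter()
A @ B  # delegates to optimized, compiled BLAS code
np_time = time.perf_counter() - t0

print(f"pure Python: {py_time:.3f}s, NumPy: {np_time:.5f}s, "
      f"speedup: {py_time / np_time:.0f}x")
```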
When Python was conceived, GPUs and TPUs didn't exist. Distributing massive workloads across thousands of processors wasn't a consideration for Python’s virtual machine. Today, as LLM training runs for months on thousands of GPUs, Python becomes a bottleneck. Even if the GIL were removed, writing high-performance distributed code remains a "1%" skill.
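A small sketch shows why the GIL is the bottleneck the paragraph above describes: on the standard CPython build, two CPU-bound threads take roughly as long as running the same work sequentially, because only one thread executes Python bytecode at a time. (The experimental free-threaded build of Python 3.13 behaves differently; timings here are illustrative.)

```python
import time
import threading

def count(n=10_000_000):
    # Pure-Python, CPU-bound work: holds the GIL the whole time.
    while n > 0:
        n -= 1

# Sequential baseline: two calls back to back.
t0 = time.perf_counter()
count()
count()
seq = time.perf_counter() - t0

# "Parallel": two threads, but the GIL serializes the bytecode.
t0 = time.perf_counter()
threads = [threading.Thread(target=count) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
par = time.perf_counter() - t0

print(f"sequential: {seq:.2f}s, threaded: {par:.2f}s")
# On standard CPython the threaded version is no faster, and is often
# slightly slower due to lock contention.
```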
LLVM and MLIR
The solution began with LLVM (originally "Low Level Virtual Machine"), created by Chris Lattner and Vikram Adve. LLVM is a general-purpose compiler framework; it powers languages like Rust, Swift, and Julia. Its core strength lies in its Intermediate Representation (IR), a hardware-neutral form of the program that allows it to generate optimized code for a variety of hardware and operating systems.
During his time at Google, Lattner realized that machine learning workloads required a different kind of optimization. This led to the creation of MLIR (Multi-Level Intermediate Representation). The goal of MLIR is to simplify performance optimization across GPUs, TPUs, NPUs, and CPUs. It achieves this with layered "dialects": a program can be expressed as high-level tensor operations and then progressively lowered toward hardware-specific instructions, giving the compiler the information it needs to optimize execution for each target.
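One concrete way to see multi-level IR in action, assuming the jax library is installed: JAX traces a Python function and lowers it to StableHLO, an MLIR dialect, which is then compiled down to CPU, GPU, or TPU code. This sketch just inspects the hardware-neutral IR that sits between the Python source and the machine code.

```python
import jax
import jax.numpy as jnp

def matmul(a, b):
    return jnp.dot(a, b)

a = jnp.ones((128, 256), dtype=jnp.float32)
b = jnp.ones((256, 64), dtype=jnp.float32)

# .lower() traces the function and emits StableHLO, an MLIR dialect;
# printing it shows the IR before any backend-specific lowering.
lowered = jax.jit(matmul).lower(a, b)
print(lowered.as_text())
```

The printed module is the same kind of multi-level representation the article describes: high-level enough to describe tensor math, structured enough for the compiler to retarget it to different hardware.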
The impact of MLIR is best seen in speed-to-market. When AMD released the MI355 in October 2025, the team at Modular optimized its performance in just 14 days. To put that in perspective, consider AAA gaming: when a new GPU launches, developers often spend 8 to 12 months profiling code and patching engines to reach ideal performance. It took 33 months for Cyberpunk 2077 to achieve stable, high-performance ray tracing. MLIR compresses this cycle from months into days.
Bridging the Moat
As recently as 2023, over 90% of model training relied on Nvidia hardware, and support for AMD was often unstable or treated as an afterthought. The last two years, however, have seen a massive shift: support for AMD GPUs, both the consumer RDNA line and the datacenter Instinct line, is now genuinely viable.
Without MLIR, the Modular team couldn't have optimized the MI355 so rapidly. This shift is catching the industry's attention; in late 2025, OpenAI and AMD announced a multi-year partnership to deploy Instinct GPUs, signaling a real crack in the Nvidia monopoly.
The Path Forward
According to Chris Lattner and the team at Modular, MLIR and Mojo have reached maturity. While Nvidia's CUDA remains a massive, deeply entrenched library of tools, Mojo was designed to bring Pythonic ease to high-performance, hardware-agnostic ML. Programmers can keep writing Python, and when they deploy their code, it can run on Mojo for the best performance.
The open-source community still has years of work ahead to replicate the breadth of the CUDA ecosystem. But Mojo has laid a bridge across the moat, and as the community builds more libraries on this foundation, that bridge will only grow wider, finally offering the industry a path toward true hardware independence.
GiiLD can help your company build smaller, faster, and smarter with the power of model pruning. Reach out to us at notify@giild.com for more information.



