
OpenAI introduces Triton: an Nvidia CUDA alternative

OpenAI has introduced Triton 1.0, an open-source, Python-like programming language and compiler for writing GPU code without prior CUDA experience. The language currently supports Linux and Nvidia GPUs, with support for AMD GPUs and CPUs under development.

According to the company, Triton delivers significant performance gains over CUDA and is easier to code for neural network tasks such as matrix multiplication.

“Our researchers have already used it to produce kernels that are up to 2x more efficient than equivalent Torch implementations, and we’re excited to work with the community to make GPU programming more accessible to everyone,” read a blog post by Philippe Tillet on Wednesday.

An easier way to write GPU code

The purpose of Triton is to fully automate GPU code optimisations so that developers can focus on the high-level logic of their parallel code. The language also aims to be broadly applicable, and hence doesn't automatically schedule work across streaming multiprocessors (SMs), giving developers more control over that important algorithmic consideration.

Below is a comparison of which GPU optimisations are manual in CUDA and which Triton automates.

GPU Optimisation              CUDA      Triton
Memory coalescing             Manual    Automatic
Shared memory management      Manual    Automatic
Scheduling (within SMs)       Manual    Automatic
Scheduling (across SMs)       Manual    Manual

In terms of structure, the language is very similar to Numba: kernels are defined as decorated Python functions and launched concurrently with different program IDs.

However, Triton also exposes intra-instance parallelism via operations on small arrays, called blocks, whose dimensions are powers of two. Compared with CUDA's single-instruction, multiple-thread (SIMT) execution model, this abstracts away most, if not all, of the concurrency issues inherent in CUDA's approach, dramatically simplifying the development of complex GPU programs.
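To illustrate the model described above, the sketch below simulates it in plain NumPy on the CPU: a grid of kernel "instances" is launched, each identified by a program ID, and each instance computes one block of a vector addition using array-level (block) operations rather than per-thread loops. This is an illustration of the programming model only, not actual Triton code; the names `add_kernel`, `BLOCK`, and `grid` are chosen here for illustration.

```python
import numpy as np

BLOCK = 64  # block size: a power of two, as Triton blocks require

def add_kernel(x, y, out, pid):
    """One kernel 'instance': processes the block selected by its program ID."""
    offsets = pid * BLOCK + np.arange(BLOCK)  # this instance's block of indices
    mask = offsets < x.shape[0]               # guard lanes past the end of the data
    offsets = offsets[mask]
    out[offsets] = x[offsets] + y[offsets]    # a block-level (array) operation

n = 1000
rng = np.random.default_rng(0)
x, y = rng.random(n), rng.random(n)
out = np.empty_like(x)

# "Launch": one instance per block, each with a different program ID.
grid = (n + BLOCK - 1) // BLOCK
for pid in range(grid):
    add_kernel(x, y, out, pid)
```

In real Triton the instances run in parallel on the GPU and the per-block loads and stores are compiled to efficient memory accesses; the sequential Python loop here only stands in for the launch grid.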

OpenAI also claims that Triton achieves peak hardware performance for matrix multiplication in just 25 lines of Python code, something that would take far more time and effort in CUDA, with lower resulting performance.
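The structure behind such a kernel is block tiling: each kernel instance computes one output tile, accumulating partial products over slices of the shared dimension. The NumPy sketch below shows that tiling structure on the CPU; it is not the actual 25-line Triton kernel, and the tile sizes and the name `blocked_matmul` are illustrative choices.

```python
import numpy as np

BLOCK_M, BLOCK_N, BLOCK_K = 16, 16, 16  # tile sizes (powers of two, illustrative)

def blocked_matmul(a, b):
    """Matrix multiply computed one (BLOCK_M x BLOCK_N) output tile at a time,
    accumulating over BLOCK_K-wide slices of the shared dimension -- the tiling
    structure a Triton matmul kernel expresses, one tile per kernel instance."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    c = np.zeros((m, n), dtype=np.result_type(a, b))
    for i in range(0, m, BLOCK_M):          # one "program ID" per output tile
        for j in range(0, n, BLOCK_N):
            acc = np.zeros((min(BLOCK_M, m - i), min(BLOCK_N, n - j)))
            for p in range(0, k, BLOCK_K):  # reduction over the shared dimension
                acc += a[i:i+BLOCK_M, p:p+BLOCK_K] @ b[p:p+BLOCK_K, j:j+BLOCK_N]
            c[i:i+BLOCK_M, j:j+BLOCK_N] = acc
    return c
```

On a GPU, this tiling is what lets each instance keep its working set in fast shared memory; in Triton that placement is one of the optimisations handled automatically.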

You can check out Triton's GitHub repository here. The repository also links to the official documentation, which provides installation instructions and tutorials.
