Moore Threads Unveils TileLang-MUSA Open Source: Slashing GPU Programming Effort by 90%
Moore Threads has made a significant move to bolster the domestic computing ecosystem by announcing the open-sourcing of the TileLang-MUSA project. This release provides comprehensive support for TileLang, a high-level programming language designed to unlock the full performance potential of full-function GPUs, notably reducing the complexity and development barrier for programming these advanced accelerators.
TileLang has already proven itself in practice: during development of the DeepSeek-V3 large language model, it enabled rapid prototyping and performance verification of crucial operators. The TileLang-MUSA project has been validated across multiple generations of Moore Threads' full-function GPUs, confirming the viability of a “high-level language + domestic GPU” technical roadmap.
Redefining GPU Operator Programming: Near Zero-Threshold Migration
TileLang is recognized as a Domain-Specific Language (DSL) built upon the tensor tiling abstraction. It features a declarative syntax with a Python-like frontend, allowing developers to describe computational intent almost mathematically. Crucially, the compiler handles complex optimizations—such as loop unrolling, memory scheduling, and code generation—automatically, ensuring high underlying performance while significantly simplifying programming for GPU and heterogeneous computing platforms.
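The tensor tiling abstraction can be illustrated with a minimal pure-Python sketch: the developer describes the computation over fixed-size tiles, and the compiler decides iteration order, memory placement, and unrolling. The function and parameter names below are illustrative, not TileLang's actual API, and the plain loops stand in for the optimized code a real compiler would generate.

```python
def tiled_matmul(A, B, M, N, K, tile=2):
    """Compute C = A @ B by iterating over tile x tile blocks.

    In a real TileLang kernel, each (i0, j0) block would map to one
    thread block, and the inner loops would become MMA instructions.
    """
    C = [[0.0] * N for _ in range(M)]
    for i0 in range(0, M, tile):            # tile rows of C
        for j0 in range(0, N, tile):        # tile columns of C
            for k0 in range(0, K, tile):    # reduction dimension, one tile at a time
                for i in range(i0, min(i0 + tile, M)):
                    for j in range(j0, min(j0 + tile, N)):
                        for k in range(k0, min(k0 + tile, K)):
                            C[i][j] += A[i][k] * B[k][j]
    return C

# Example: a 2x2 matrix product computed tile by tile.
# tiled_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]], 2, 2, 2)
# → [[19.0, 22.0], [43.0, 50.0]]
```

The key point is that correctness does not depend on the tiling: the developer states *what* to compute, and tile sizes become a tuning knob rather than part of the algorithm.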
The TileLang-MUSA project establishes an essential intermediate abstraction layer between low-level assembly and high-level DSLs. This retains necessary hardware control while dramatically cutting down programming complexity. The project exhibits excellent hardware compatibility, verified on Moore Threads’ intelligent computing cards, including the MTT S5000 and MTT S4000, which support both training and inference.
The core functionality involves achieving precise mapping from high-level TileLang semantics to the underlying MUSA architecture. Key hardware features are leveraged automatically:
- Tensor Core Acceleration: The compiler intrinsically invokes MUSA’s MMA (Matrix Multiply-Accumulate) instructions, maximizing the hardware’s peak computational throughput.
- Tile-Level Pipelining: It autonomously manages multi-level data transfers, moving data from Global Memory through Shared Memory to Registers, utilizing MUSA’s asynchronous copy instructions to hide memory access latency.
- Warp-Level Parallel Optimization: Full support for Warp Specialization lets warps within a thread block take on distinct roles (such as data movement versus compute), ensuring efficient execution.
With unit test coverage above 80% for TileLang native operators on the MUSA architecture, the project is robust enough for large-scale use. After setup, developers keep the familiar `import tilelang` syntax and run TileLang code directly in the MUSA environment through a Cython-compiled backend.
Practical Results: Development Efficiency Soars While Code Volume Drops by 90%
In practical operator development, TileLang-MUSA delivers on both speed of writing and speed of execution. Testing with essential operators for large language models, specifically FlashAttention-3 and GEMM (General Matrix Multiplication), on the Moore Threads MTT S5000 demonstrated significant gains:
- Increased Development Efficiency: Compared to writing native MUSA C++ code, using TileLang-MUSA resulted in an approximate 90% reduction in code volume. The resulting code logic is far cleaner, substantially lowering development and maintenance overhead.
- Comparable Performance to Hand-Tuning: Thanks to advanced compiler optimizations, generated operator performance reaches up to 95% of hand-optimized versions for GEMM and 85% for FlashAttention-3 under typical configurations.
- Automated Tuning: The built-in auto-tuning mechanism lets developers rapidly search for the optimal tile size and pipeline depth tailored to the MUSA architecture, readily outperforming baseline implementations that have not been deeply optimized.
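At its core, an auto-tuner of the kind described above is a search over candidate configurations, each compiled and timed on the target GPU. The sketch below shows that loop in pure Python under simplified assumptions: `benchmark` is a hypothetical stand-in for the compile-and-measure step, and the candidate list is illustrative.

```python
def autotune(candidates, benchmark):
    """Return the (tile_size, num_stages) pair with the lowest measured time.

    `candidates` is an iterable of configurations; `benchmark` maps a
    configuration to a runtime in seconds. A real tuner would generate
    a kernel for each configuration and time it on the MUSA device.
    """
    best_cfg, best_time = None, float("inf")
    for cfg in candidates:
        t = benchmark(cfg)
        if t < best_time:                 # keep the fastest configuration seen
            best_cfg, best_time = cfg, t
    return best_cfg, best_time

# Example with a synthetic cost model that favors tile size 64, 3 stages:
configs = [(ts, ns) for ts in (32, 64, 128) for ns in (2, 3)]
best, _ = autotune(configs, lambda c: abs(c[0] - 64) + abs(c[1] - 3))
```

Because the search space (tile sizes times pipeline depths) is small and each measurement is independent, even this exhaustive grid search is practical; smarter tuners simply prune or prioritize the same space.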
The introduction of TileLang-MUSA provides TileLang users with a near-zero-cost pathway to migrate their operator logic onto Moore Threads GPUs. Furthermore, it opens a high-level development entry point for AI engineers unfamiliar with the intricacies of the MUSA instruction set. This capability is accelerating the deployment of cutting-edge AI applications, such as large language models, onto domestic computing platforms through the efficient implementation of key operators like FlashAttention.
The Road Ahead: Building a Unified MUSA-Based Deep Learning Platform
The open-sourcing of TileLang-MUSA marks a pivotal step in Moore Threads’ strategy to cultivate a robust domestic computing ecosystem. The company plans continuous platform and ecosystem development aimed at establishing a unified acceleration platform covering everything from single operators to complete large-scale models running on their hardware.
Future plans include:
- Deep integration with mainstream AI frameworks like SGLang to enable cross-operator scheduling and global optimization for complex model architectures such as Transformers and MoE.
- Enhancing the toolchain for debugging and performance analysis.
- Ongoing performance tuning to ensure generated code consistently achieves 90% or more of the performance of manually optimized code on the MUSA architecture.
This commitment promises the tools and support needed to build an open, user-friendly development ecosystem around domestic computing resources. For those looking to get the most out of high-performance hardware, high-level GPU programming languages such as TileLang and their associated optimization techniques are well worth exploring.
Created: 2026-02-10