Comment by loganboyd

1 day ago

I’m working on a tensor computing language/compiler called i with a simple explicit scheduling model (loop splitting, loop ordering, input “staging”). These mechanisms alone are enough to express complex algorithms like FlashAttention, generating target code with techniques like loop fusion, minimized intermediate allocations, and “online” reductions.
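For anyone unfamiliar with the term, here is a minimal plain-Python sketch of what an “online” reduction over a split loop looks like, using the softmax-weighted sum at the core of attention. It is not i code and not output from the compiler; the function names and the `block` parameter are made up for illustration. The two-pass version needs the full list of weights before it can normalize, while the online version walks the input in blocks and rescales its running totals whenever a larger maximum shows up.

```python
import math


def softmax_weighted_sum_two_pass(scores, values):
    """Reference: materialize all softmax weights, then reduce."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]  # intermediate allocation
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


def softmax_weighted_sum_online(scores, values, block=2):
    """Same result in one pass over blocks, with no list of weights.

    `block` is a hypothetical split factor standing in for loop splitting.
    """
    running_max = -math.inf
    denom = 0.0  # running sum of exp(score - running_max)
    acc = 0.0    # running sum of exp(score - running_max) * value

    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]

        # Rescale the running totals to the new maximum before adding more.
        new_max = max(running_max, max(s_blk))
        scale = math.exp(running_max - new_max)  # 0.0 on the first block
        denom *= scale
        acc *= scale

        for s, v in zip(s_blk, v_blk):
            w = math.exp(s - new_max)
            denom += w
            acc += w * v

        running_max = new_max

    return acc / denom


scores = [1.0, 3.0, -2.0, 0.5, 4.0]
values = [10.0, 20.0, 30.0, 40.0, 50.0]
two_pass = softmax_weighted_sum_two_pass(scores, values)
online = softmax_weighted_sum_online(scores, values, block=2)
```

Both functions should agree up to floating-point rounding for any block size; the online one just never stores the intermediate weights.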

Right now there is a runtime and a compiler targeting C, both written in dependency-free Rust, plus a minimal Python frontend. The project is very much at the proof-of-concept stage, so it is not yet fast. I’m working on a CUDA backend now.

The goal is to enable automatic discovery of FlashAttention-style optimizations, which is not feasible with current compilers.

Very open to feedback/discussion from anybody interested in or knowledgeable about tensor compilers!